English Wikipedia @ Freddythechick:AutoWikiBrowser/Regular expression
This is the Regular expressions subsection of the user manual for AutoWikiBrowser.
A regular expression or regex is a sequence of characters that defines a pattern to be searched for in a text. Each occurrence of the pattern may then be automatically replaced with another string, which may include parts of the identified pattern. AutoWikiBrowser uses the .NET flavor of regex.[1]
Syntax
Anchors
Used to anchor the search pattern to certain points in the searched text.
Syntax | Comments | |
---|---|---|
<syntaxhighlight lang="text" class="" style="" inline="1">^</syntaxhighlight> | Start of string | Before all other characters on page (or line if multiline option is active) (Note that "^" has a different meaning inside a token.) |
<syntaxhighlight lang="text" class="" style="" inline="1">\A</syntaxhighlight> | Start of string | Before all other characters on page |
<syntaxhighlight lang="text" class="" style="" inline="1">$</syntaxhighlight> | End of string | After all other characters on page (or line if multiline option is active) |
<syntaxhighlight lang="text" class="" style="" inline="1">\Z</syntaxhighlight> | End of string | After all other characters on page (or just before a final newline at the very end of the page) |
<syntaxhighlight lang="text" class="" style="" inline="1">\b</syntaxhighlight> | On a word boundary | Between a word character (letter, number or underscore) and a non-word character, or at the start or end of the text next to a word character |
<syntaxhighlight lang="text" class="" style="" inline="1">\B</syntaxhighlight> | Not on a word boundary | Anywhere that is not a word boundary, e.g. between two letters or between two spaces |
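The anchors above can be tried out with a minimal Python sketch. Python's `re` module is used here only as a stand-in for illustration; AWB uses the .NET engine, which differs in places (for instance, Python's `\Z` never matches before a trailing newline). The sample text and variable names are made up.

```python
import re

page = "first line\nsecond line\nlast line"

# Without the multiline option, ^ anchors to the start of the whole text.
starts_default = re.findall(r"^\w+", page)
# With the multiline option, ^ also matches at the start of every line.
starts_multiline = re.findall(r"^\w+", page, flags=re.MULTILINE)
# \A always means the very start of the text, even in multiline mode.
starts_absolute = re.findall(r"\A\w+", page, flags=re.MULTILINE)

# \b is a zero-width position between a word and a non-word character,
# so "cat" is found as a whole word but not inside "concatenate".
whole_words = re.findall(r"\bcat\b", "cat concatenate cat.")
# \B requires that there is no word boundary at that position.
inside_words = re.findall(r"\Bcat\B", "cat concatenate cat.")

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second', 'last']
print(starts_absolute)   # ['first']
print(whole_words)       # ['cat', 'cat']
print(inside_words)      # ['cat']
```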
Character classes
Expressions which match any character in a pre-defined set. This list is not exhaustive.
Character class | Will match | |
---|---|---|
<syntaxhighlight lang="text" class="" style="" inline="1">.</syntaxhighlight> | "wildcard" | Any character except newline (Newline is included if singleline option is active; see #Regex behavior options below) |
<syntaxhighlight lang="text" class="" style="" inline="1">\w</syntaxhighlight> | Any "word" character (letters, digits, underscore) | abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789_ |
<syntaxhighlight lang="text" class="" style="" inline="1">\W</syntaxhighlight> | Any character other than "word" characters | $?!#%*@&;:.,+-±=^"`\|/<>{}[]()~(newline)(tab)(space) |
<syntaxhighlight lang="text" class="" style="" inline="1">\s</syntaxhighlight> | Any whitespace character | (space) (tab) (literal new line) (return) |
<syntaxhighlight lang="text" class="" style="" inline="1">\S</syntaxhighlight> | Any character other than white space | abcxyz_ABCXYZ$?!#%*@&;:.,+-=^"/<{[(~0123789 (incomplete list) |
<syntaxhighlight lang="text" class="" style="" inline="1">\d</syntaxhighlight> | Any digit | 0123456789 |
<syntaxhighlight lang="text" class="" style="" inline="1">\D</syntaxhighlight> | Any character other than digits | abcxyz_ABCXYZ$?!#%*@&;:.,+-=^"/<{[(~(newline)(tab)(space) (incomplete list) |
<syntaxhighlight lang="text" class="" style="" inline="1">\n</syntaxhighlight> | Newline | (newline) |
<syntaxhighlight lang="text" class="" style="" inline="1">\p{L} </syntaxhighlight> | Any Unicode letter[2] | AaÃãÂâĂăÄäÅå (incomplete list) |
<syntaxhighlight lang="text" class="" style="" inline="1">\p{Ll} </syntaxhighlight> | Any lowercase Unicode letter | aãâăäå (incomplete list) |
<syntaxhighlight lang="text" class="" style="" inline="1">\p{Lu} </syntaxhighlight> | Any uppercase Unicode letter | AÃÂĂÄÅ (incomplete list) |
<syntaxhighlight lang="text" class="" style="" inline="1">\r</syntaxhighlight> | Carriage return | (carriage return) |
<syntaxhighlight lang="text" class="" style="" inline="1">\t</syntaxhighlight> | Tab | (tab) |
<syntaxhighlight lang="text" class="" style="" inline="1">\c</syntaxhighlight> | Control character | Ctrl-A through Ctrl-Z (0x01–0x1A) |
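<syntaxhighlight lang="text" class="" style="" inline="1">\x</syntaxhighlight> | Hexadecimal character code | Followed by exactly two hexadecimal digits (0123456789abcdefABCDEF), matches the character with that code; e.g. \x41 matches A |
<syntaxhighlight lang="text" class="" style="" inline="1">\0</syntaxhighlight> | Octal character code | Followed by octal digits (01234567), matches the character with that code; e.g. \040 matches a space |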
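As a rough illustration of the most common character classes, here is a small Python sketch. It is only an approximation of AWB's behavior: Python's built-in `re` module has no `\p{L}` style Unicode property classes, so those are not shown, and the sample string is invented.

```python
import re

# "\u00c5ngstr\u00f6m" is the word "Angstrom" written with two accented letters.
sample = "Room 42,\tfloor_3\n\u00c5ngstr\u00f6m"

digits = re.findall(r"\d+", sample)      # runs of digits
words = re.findall(r"\w+", sample)       # runs of word characters
non_words = re.findall(r"\W", sample)    # single non-word characters
whitespace = re.findall(r"\s", sample)   # space, tab and newline

# A dot does not match a newline unless the singleline option
# (called re.DOTALL in Python) is switched on.
dot_default = re.findall(r".+", sample)
dot_singleline = re.findall(r".+", sample, flags=re.DOTALL)

print(digits)               # ['42', '3']
print(len(dot_default))     # 2 (one match per line)
print(len(dot_singleline))  # 1 (the whole text)
```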
Tokens
Tokens match a single character from a specified set or range of characters.
Token | Meaning | Examples |
---|---|---|
[ ...] | Set – matches any single character in the brackets | <syntaxhighlight lang="text" class="" style="" inline="1">[def]</syntaxhighlight> matches d or e or f |
[^ ...] | Inverse – matches any single character except those in the brackets | <syntaxhighlight lang="text" class="" style="" inline="1">[^abc]</syntaxhighlight> – anything (including newline) except a or b or c |
[ ...- ...] | Range – matches any single character in the specified range (including the characters given as the endpoints of the range) | <syntaxhighlight lang="text" class="" style="" inline="1">[a-q]</syntaxhighlight> – any lowercase letter between a and q; <syntaxhighlight lang="text" class="" style="" inline="1">[A-Q]</syntaxhighlight> – any uppercase letter between A and Q |
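The three token forms can be checked with a short Python sketch (Python's `re` module treats these tokens the same way as the .NET flavor for simple cases like these; the sample strings are invented).

```python
import re

text = "abc def\nQRS 123"

in_set = re.findall(r"[def]", text)        # any one of d, e or f
in_range = re.findall(r"[a-q]", text)      # lowercase a through q
upper_range = re.findall(r"[A-Q]", text)   # uppercase A through Q
# An inverted set matches everything not listed, including the newline.
not_abc = re.findall(r"[^abc]", "ab\ncd")

print(in_set)       # ['d', 'e', 'f']
print(upper_range)  # ['Q']
print(not_abc)      # ['\n', 'd']
```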
Groups
Groups match a string of characters (including tokens) in sequence. By default, matches to groups are captured for later reference. Groups may be nested within other groups.
Syntax | Meaning | Examples |
---|---|---|
( ...) | Capture group – matches the string in parentheses (Output captured groups in the replacement string with <syntaxhighlight lang="text" class="" style="" inline="1">$1</syntaxhighlight>, <syntaxhighlight lang="text" class="" style="" inline="1">$2</syntaxhighlight>, etc.) | <syntaxhighlight lang="text" class="" style="" inline="1">(abc)</syntaxhighlight> matches abc |
(?<name> ...) | Named capture group (for use in back references or the replacement string) | <syntaxhighlight lang="text" class="" style="" inline="1">(?<year>\b\d{4}\b)</syntaxhighlight> matches the whole word 2016. Output the named group using <syntaxhighlight lang="text" class="" style="" inline="1">${year}</syntaxhighlight> |
(?: ...) | Non-capturing parentheses | <syntaxhighlight lang="text" inline>(?:abc)</syntaxhighlight> matches and consumes, but doesn't capture, abc |
<syntaxhighlight lang="text" class="" style="" inline="1">|</syntaxhighlight> | Alternation/disjunction (read as "or") | <syntaxhighlight lang="text" class="" style="" inline="1">(ab|cd|ef)</syntaxhighlight> matches ab or cd or ef; <syntaxhighlight lang="text" class="" style="" inline="1">(ab(cd|ef))</syntaxhighlight> matches abcd or abef |
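The sketch below shows capturing, named and non-capturing groups in Python. Note that the notation is not identical to AWB's: Python writes `\1` and `\g<name>` in the replacement and `(?P<name>...)` in the pattern, whereas the .NET flavor uses `$1`, `${name}` and `(?<name>...)`. The sample sentence is invented.

```python
import re

text = "Published in 2016 and revised in 2019."

# Numbered capture groups, reused in the replacement in swapped order.
swapped = re.sub(
    r"(\d{4}) and revised in (\d{4})",
    r"\2 and first issued in \1",
    text,
)

# Named capture group, referenced by name in the replacement.
tagged = re.sub(r"(?P<year>\b\d{4}\b)", r"[\g<year>]", text)

# A non-capturing group is consumed but not numbered, so the only
# captured group here is the alternation (cd|ef).
match = re.search(r"(?:ab)(cd|ef)", "xxabefxx")

print(swapped)         # Published in 2019 and first issued in 2016.
print(tagged)          # Published in [2016] and revised in [2019].
print(match.groups())  # ('ef',)
```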
Quantifiers
Quantifiers specify how many of the preceding token or group may be matched.
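Syntax | Meaning | Examples |
---|---|---|
<syntaxhighlight lang="text" class="" style="" inline="1">*</syntaxhighlight> | 0 or more | b* matches nothing, b, bb, bbb, etc. |
<syntaxhighlight lang="text" class="" style="" inline="1">+</syntaxhighlight> | 1 or more | b+ matches b, bb, bbb, etc. |
<syntaxhighlight lang="text" class="" style="" inline="1">?</syntaxhighlight> | 0 or 1 | b? matches nothing, or b |
<syntaxhighlight lang="text" class="" style="" inline="1">{3}</syntaxhighlight> | Exactly 3 | b{3} matches bbb |
<syntaxhighlight lang="text" class="" style="" inline="1">{3,}</syntaxhighlight> | 3 or more | b{3,} matches bbb, bbbb, etc. |
<syntaxhighlight lang="text" class="" style="" inline="1">{2,4}</syntaxhighlight> | At least 2 and no more than 4 | b{2,4} matches bb, bbb, or bbbb |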
By default, quantifiers are "greedy", meaning they will match as many characters as possible while still allowing the full expression to find a match. Adding a question mark ("?") after a quantifier will make it non-greedy, meaning it will match as few characters as possible while still allowing the full expression to find a match. See #Greed and quantifiers for examples.
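A minimal Python sketch of the difference between greedy and non-greedy matching follows. Greedy and non-greedy quantifiers work the same way in Python's `re` module as described here; the sample markup is invented.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* takes as much as it can, so one match spans both tags.
greedy = re.findall(r"<b>.*</b>", text)
# Non-greedy: .*? stops at the first place where the rest can match.
lazy = re.findall(r"<b>.*?</b>", text)
# Bounded quantifier: between 2 and 4 consecutive b characters.
bounded = re.findall(r"b{2,4}", "b bb bbb bbbbb")

print(greedy)   # ['<b>one</b> and <b>two</b>']
print(lazy)     # ['<b>one</b>', '<b>two</b>']
print(bounded)  # ['bb', 'bbb', 'bbbb']
```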
Metacharacters and the escape character
Metacharacters are characters with special meaning in regex; to match these characters literally, they must be "escaped" by being preceded with the escape character \.
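Escape character | Comments | |
---|---|---|
<syntaxhighlight lang="text" class="" style="" inline="1">\</syntaxhighlight> | Escape character | Allows metacharacters (listed below) to be matched literally |
Metacharacter | Metacharacter escaped | |
<syntaxhighlight lang="text" class="" style="" inline="1">^</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\^</syntaxhighlight> | Not in this list: <syntaxhighlight lang="text" inline>=}#!/%&_:;</syntaxhighlight> (incomplete list) |
<syntaxhighlight lang="text" class="" style="" inline="1">$</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\$</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">(</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\(</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">)</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\)</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1"><</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\<</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">.</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\.</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">*</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\*</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">+</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\+</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">?</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\?</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">[</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\[</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">]</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\]</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">{</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\{</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">\</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\\</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">|</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\|</syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">></syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\></syntaxhighlight> | |
<syntaxhighlight lang="text" class="" style="" inline="1">-</syntaxhighlight> | <syntaxhighlight lang="text" class="" style="" inline="1">\-</syntaxhighlight> | Hyphens must be escaped within tokens, where they indicate a range; outside of tokens, they do not need to be escaped. |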
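The effect of escaping can be seen in the following Python sketch. The escape rules shown (backslash before `.`, `$`, brackets and parentheses, and before a hyphen inside a token) behave the same way in Python's `re` module; the sample text is invented.

```python
import re

text = "Price: $5.00 (approx.) [sic]"

# Unescaped, the dot is a wildcard, so the pattern "5.00" also matches "5x00".
loose = bool(re.search(r"5.00", "5x00"))
# Escaped, the dollar sign and the dot are matched literally.
strict = re.findall(r"\$5\.00", text)
# Brackets and parentheses need escaping too.
bracketed = re.findall(r"\[sic\]|\(approx\.\)", text)
# Inside a token, an escaped hyphen is a literal hyphen, not a range.
hyphens = re.findall(r"[a\-c]", "a-b-c")

print(loose)      # True
print(strict)     # ['$5.00']
print(bracketed)  # ['(approx.)', '[sic]']
print(hyphens)    # ['a', '-', '-', 'c']
```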
Back references
Used to match a previously captured group again.
Syntax | Comments | |
---|---|---|
<syntaxhighlight lang="text" class="" style="" inline="1">\1</syntaxhighlight>, <syntaxhighlight lang="text" class="" style="" inline="1">\2</syntaxhighlight>, <syntaxhighlight lang="text" class="" style="" inline="1">\3</syntaxhighlight>, etc. | Match unnamed captured groups in order. | <syntaxhighlight lang="ragel" inline>(\n[^\n]+)\1</syntaxhighlight> matches identical adjacent lines; <syntaxhighlight lang="text" class="" style="" inline="1">$1</syntaxhighlight> will replace with a single copy. |
<syntaxhighlight lang="text" class="" style="" inline="1">\k<name></syntaxhighlight> | Match a named captured group again. | |

| |
---|---|
Find | …)) |
Replace With | <syntaxhighlight lang="text" class="" style="" inline="1">new value</syntaxhighlight> |
Example of text to search | {{infobox person|name=Steveo|occupation=dancer|nationality=The moon}} |
Result | {{infobox person|name=Steveo|occupation=new value|nationality=The moon}} |
Comments | |
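Back references can be illustrated with a short Python sketch of the "identical adjacent lines" pattern from the table above. Python writes `\1` in the replacement where AWB uses `$1`, and `(?P=name)` where the .NET flavor uses `\k<name>`; the sample page text is invented.

```python
import re

page = "intro\n* item one\n* item one\n* item two\noutro"

# (\n[^\n]+) captures a newline plus the line after it; \1 then requires
# the same text to follow immediately. Replacing the match with the
# captured group keeps a single copy of the repeated line.
deduplicated = re.sub(r"(\n[^\n]+)\1", r"\1", page)

# Named back reference: find a word that is immediately repeated.
doubled = re.findall(r"\b(?P<word>\w+) (?P=word)\b", "it is is what it is")

print(deduplicated.split("\n"))  # ['intro', '* item one', '* item two', 'outro']
print(doubled)                   # ['is']
```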
Commonly used expressions
<syntaxhighlight lang="ragel">
Match <ref>...</ref> tags and the content inside them
Regex: <ref[^>]*>([^<]|<[^/]|</[^r]|</r[^e]|</re[^f]|</ref[^>])+</ref>
</syntaxhighlight>
<syntaxhighlight lang="ragel">
Match <ref>...</ref> tags and the content inside them, using a (?! not match) notation
Regex: <ref[^>]*>([^<]|<(?!/ref>))+</ref>
</syntaxhighlight>
<syntaxhighlight lang="ragel">
Match template {{...}} possibly with templates inside it, but no templates inside those
Regex: {{([^{]|{[^{]|{{[^{}]+}})+}}
</syntaxhighlight>
<syntaxhighlight lang="ragel">
Match words and spaces
Regex: [\w\s]+
</syntaxhighlight>
<syntaxhighlight lang="ragel">
Match bracketed URLs
Regex: \[(https?://[^\]\[<>\s"]+) *((?<= )[^\n\]]*|)\]
</syntaxhighlight>
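Two of the expressions above happen to be valid in Python's `re` module as well, so they can be tried out in a small sketch. The wikitext sample and variable names are invented for illustration, and the .NET engine used by AWB may behave differently in edge cases.

```python
import re

wikitext = (
    'Fact one.<ref name="a">Smith 2001, p. 5</ref> '
    "Fact two.<ref>See <i>Jones</i></ref> "
    "[https://example.org/page Example site] and [https://example.org/bare]"
)

# <ref> element with its content, using the negative lookahead form.
ref_pattern = re.compile(r"<ref[^>]*>([^<]|<(?!/ref>))+</ref>")
refs = [m.group(0) for m in ref_pattern.finditer(wikitext)]

# Bracketed external link: group 1 is the URL, group 2 the optional label.
url_pattern = re.compile(r'\[(https?://[^\]\[<>\s"]+) *((?<= )[^\n\]]*|)\]')
links = url_pattern.findall(wikitext)

print(refs)
print(links)
```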
Tips and tricks
Regex behavior options
Regex offers several options to change the default behavior.[3] Five of these options can be controlled with inline expressions, as described below. Four of these options can also be applied to the entire search pattern with check boxes in the AWB "Find-and-replace" tools. By default, all options are off.
Option | Inline flag | Check box available | Effect |
---|---|---|---|
IgnoreCase | i | Yes | Specifies case-insensitive matching (upper and lowercase letters are treated the same). |
SingleLine | s | Yes | Treats the searched text as a single line, by allowing (.) to match newlines (\n), which it otherwise does not. |
MultiLine | m | Yes | Changes the meaning of the (^) and ($) anchors to match the beginning and end, respectively, of any line, rather than just the start and end of the whole string. |
ExplicitCapture | n | Yes | Specifies that only groups that are named or numbered (e.g. with the form (?<name>)) will be captured. |
IgnorePatternWhitespace | x | No | Causes whitespace characters (spaces, tabs, and newlines) in the pattern to be ignored, so that they can be used to keep the pattern visually organized.[a] |
- ^ To match whitespace characters while the IgnorePatternWhitespace option is enabled, they must be identified with character classes, i.e. \s (whitespace), \n (newline), or \t (tab). (To match only a space, but not a tab or newline, use the pattern \p{Zs}.)
Inline syntax
The options statement <syntaxhighlight lang="text" class="" style="" inline="1">(?flags-flags)</syntaxhighlight> turns the options given by "flags" on (or off, for any flags preceded by a minus sign) from the point where the statement appears to the end of the pattern, or to the point where a given option is cancelled by another options statement. For example:
<syntaxhighlight lang="ragel"> (?im-s) #Turn ON IgnoreCase (i) and MultiLine (m) options, and turn OFF SingleLine (s) option, from here to the end of the pattern or until cancelled </syntaxhighlight>
Alternatively, the syntax <syntaxhighlight lang="text" class="" style="" inline="1">(?flags-flags:pattern)</syntaxhighlight> applies the specified options only to the part of the pattern appearing inside the parentheses:
<syntaxhighlight lang="ragel"> (?x:pattern1)pattern2 #Apply the IgnorePatternWhitespace (x) option to pattern1, but not to pattern2 </syntaxhighlight>
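The inline options can be sketched in Python as follows. This is an approximation only: Python's `re` module supports the `i`, `m`, `s` and `x` flags but has no `n` (ExplicitCapture) flag, and recent Python versions require an unscoped options statement such as `(?im)` to appear at the very start of the pattern, whereas the .NET flavor allows it anywhere. The sample strings are invented.

```python
import re

page = "Title\nfoo bar\nFOO baz"

# (?im) at the start of the pattern switches on IgnoreCase and MultiLine,
# so ^foo matches at the start of any line, in either case.
whole_pattern = re.findall(r"(?im)^foo", page)

# (?i:...) limits IgnoreCase to the group; "bar"/"baz" stay case-sensitive.
scoped = re.findall(r"(?i:foo) ba[rz]", "FOO bar, foo BAZ, Foo baz")

# (?x) lets the pattern contain layout whitespace and # comments;
# a literal space must then be written explicitly, here as [ ].
verbose = re.findall(r"""(?x)
    (\d{4})   # year
    [ ]       # a single literal space
    (\w+)     # following word
""", "In 2016 elections and 2019 reforms")

print(whole_pattern)  # ['foo', 'FOO']
print(scoped)         # ['FOO bar', 'Foo baz']
print(verbose)        # [('2016', 'elections'), ('2019', 'reforms')]
```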
User-made shortcut editing macros
You can make your own shortcut editing macros. When you edit a page, you can enter your short-cut macro keys into the page anywhere you want AWB to act upon them.
For example, suppose you are examining a page in the AWB edit box and see numerous places that need the same kinds of edit: adding {{fact}}, inserting line breaks (<br />), commenting out entire lines (<!--comment-->), inserting state names, adding <ref>Insert footnote text here</ref>, inserting level 2, 3, or even 4 headings, and so on. All of this can be done by creating short-cut macro keys.
- The process
- Create a rule. See Find and replace, Advanced settings.
- Edit your page in the edit box. Insert your short-cut editing macro key(s) anywhere in the page you want AWB to make the change(s) for you.
- Re-parse the page. Right click on the edit box and select Re-parse from the context pop up menu. AWB will then re-examine your page with your macro short-cut key(s), find your short-cut key(s) and perform the action you specified in the rule.
A short-cut macro key can have any name, but it is best to make it unique so that it does not interfere with any other text that AWB may find and act on. For that reason, /// followed by a few easily remembered lowercase characters works well (lowercase avoids having to use the shift key). You can then enter the short-cut macro keys you create into the page manually or by using the "paste more" function of the edit box context menu. Three slashes are used so that AWB does not confuse a macro key with web addresses (URLs) in the page when re-parsing.
Examples:
Create a rule as a regular expression.
///col Comment out entire line
| |
---|---|
Short-cut key | ///col |
Name | Comment out entire line |
Find | ///col(.*) |
Replace With | <!--$1--> |
Example before re-parsing | ///colThe quick brown fox jumps over the lazy dog |
Result after re-parsing | <!--The quick brown fox jumps over the lazy dog--> |
Comments | |
///fac Insert {{citation needed}} with current date
| |
---|---|
Short-cut key | ///fac |
Name | Insert {{citation needed}} with current date |
Find | ///fac |
Replace With | {{citation needed|date={{subst:CURRENTMONTHNAME}} {{subst:CURRENTYEAR}}}} |
Example before re-parsing | The quick brown fox jumps over the lazy dog///fac |
Result after re-parsing | The quick brown fox jumps over the lazy dog[citation needed] |
Comments | |
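The two example rules can be simulated outside AWB with a short Python sketch. The function name `reparse` is hypothetical and this is not how AWB itself is implemented; it only mimics what the two find-and-replace rules do to the text. Python writes `\1` in the replacement where the AWB rule uses `$1`.

```python
import re

# Replacement text of the ///fac rule, kept as a plain string.
FACT_TAG = "{{citation needed|date={{subst:CURRENTMONTHNAME}} {{subst:CURRENTYEAR}}}}"

def reparse(text: str) -> str:
    """Apply the two illustrative short-cut rules to the page text."""
    # ///col comments out the rest of the line (the dot stops at newlines).
    text = re.sub(r"///col(.*)", r"<!--\1-->", text)
    # ///fac becomes the citation-needed tag; passing a function avoids
    # any special handling of characters in the replacement text.
    text = re.sub(r"///fac", lambda _match: FACT_TAG, text)
    return text

print(reparse("///colThe quick brown fox jumps over the lazy dog"))
print(reparse("The quick brown fox jumps over the lazy dog///fac"))
```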
Efficiency
Efficiency is how long the regex engine takes to find matches, which is a function of how many characters the engine has to read, including backtracking. Complex regular expressions can often be constructed in several different ways, all with the same outputs but with greatly varying efficiency. If AWB is taking a long time to generate results because of a regex rule:
- Try constructing the expression a different way. There are several online resources with guidance to creating efficient regex patterns.
- Using the "advanced settings" find-and-replace tool, enter expressions on the "If" tab to filter the pages that an expensive find-and-replace rule is applied to.
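The idea behind the second suggestion, filtering pages with a cheap test before running an expensive rule, can be sketched in Python. The names `EXPENSIVE` and `strip_refs` are hypothetical, and this is only an analogy for the "If" tab, not a description of how AWB implements it.

```python
import re

# A comparatively costly pattern: it examines every character between the tags.
EXPENSIVE = re.compile(r"<ref[^>]*>([^<]|<(?!/ref>))+</ref>")

def strip_refs(page_text: str) -> str:
    """Remove <ref>...</ref> elements, skipping pages that cannot match."""
    # Cheap substring test first, playing the role of an "If" condition:
    # pages without "<ref" never reach the regex engine.
    if "<ref" not in page_text:
        return page_text
    return EXPENSIVE.sub("", page_text)

print(strip_refs("No footnotes on this page."))
print(strip_refs("A claim.<ref>Source</ref> Another claim."))
```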
References
- ^ adegeo (18 June 2022). "Regular Expression Language - Quick Reference". learn.microsoft.com. Archived from the original on 2023-02-05. Retrieved 2023-02-05.
- ^ "Regex Tutorial – Unicode Characters and Properties". www.regular-expressions.info. Archived from the original on 19 December 2022. Retrieved 3 January 2023.
- ^ adegeo (29 June 2022). "Options for regular expression". learn.microsoft.com. Archived from the original on 2023-02-05. Retrieved 2023-02-05.
External links
Online regular expressions testing tools
- RegEx Storm (supporting .NET regex flavour)
- RegEx101 (supporting .NET regex flavour)
- RegEx Pal
- RegExr
- Rubular
Desktop regular expression testing tool
Documentation about regular expressions
- Regular Expressions in .NET Well House Consultants.
- Regular-Expressions.info
- Regular Expressions perldoc.perl.org.
- Regular Expression Syntax docs.python.org.
- Regular Expression Language – Quick Reference MSDN.
- .NET regular expressions MSDN.
- Regular Expressions – User Guide zytrax.com.