Regular Expressions: Difference between revisions
From charlesreid1
Revision as of 01:04, 29 April 2011
Intro
Regular expressions provide a way of searching for, and matching, text patterns. They allow very fine control over pattern matching, and are supported by many tools and programming languages, such as Awk, Sed, Vim, Grep, PHP, and Perl.
Regular expressions have several special groups of symbols:
- Literal characters - these mean exactly what they say
- Special characters - characters with special meaning in regular expressions (e.g. brackets and parentheses)
- Character sets - these symbols represent a set of characters
- Anchors - these match locations in a line, such as the beginning or end, or the boundary between words
- Repetition symbols - these allow you to locate patterns with repetition in them
- Grouping symbols - these provide a way to group terms
Literal Characters
Special Characters
Character Sets
In regular expressions, you can use expressions to match entire sets of characters, instead of individual characters. These are typically several (or a range of) characters surrounded by brackets.
Character sets can be inclusive (the set lists the characters that are allowed to match) or exclusive (the set lists the characters that are not allowed to match).
Inclusive Sets
Inclusive sets are straightforward: just put the characters you're searching for in between brackets. For example,
[aeiou]
will match any vowel. So searching for a pattern like:
gr[ae]y
will match gray or grey.
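As a minimal sketch of the two patterns above, here they are in Python's built-in re module. The choice of Python and the sample words are assumptions made for illustration; the patterns themselves behave the same way in most regex flavors.

```python
import re

# An inclusive set matches exactly one character out of those listed.
gray_pattern = re.compile(r"gr[ae]y")

# Sample words chosen for illustration only.
words = ["gray", "grey", "groy", "graey"]

# fullmatch() requires the whole word to fit the pattern.
matches = [w for w in words if gray_pattern.fullmatch(w)]
print(matches)  # ['gray', 'grey']

# [aeiou] picks out each lowercase vowel, one character at a time.
vowels = re.findall(r"[aeiou]", "regular")
print(vowels)  # ['e', 'u', 'a']
```

Note that "graey" is rejected: the set stands for a single character, not for a sequence of characters from the set.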
Exclusive Sets
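If the first character after the opening bracket is a caret (^), the set is negated: it matches any single character that is not listed. For example, [^aeiou] matches any character that is not a lowercase vowel.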
Character Set Special Characters
The only characters with special meaning in character sets are the backslash, the hyphen, the caret, and the close-bracket. These are usually escaped with a backslash.
Special characters don't always have to be escaped, though: placed in certain positions they are taken literally, and leaving them unescaped there makes the regular expression more readable.
Backslash: the backslash always needs to be escaped
Caret: the caret can go unescaped anywhere EXCEPT right after the opening bracket
Closing bracket: the closing bracket can go unescaped right after the opening bracket or right after the negating caret. For example:
[^]a]
matches any character that is not a closing bracket or an "a".
Hyphen: the hyphen can go unescaped right after the opening bracket or right before the closing bracket. For example:
[-abcdefg]
[abcdefg-]
These will both match any of the first 7 letters of the alphabet, or a hyphen.
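The placement rules above can be checked with a short sketch in Python's re module. The use of Python and the sample strings are assumptions for illustration; some flavors (JavaScript, for example) treat an unescaped closing bracket differently, so it is worth testing in the tool you actually use.

```python
import re

# A closing bracket right after the negating caret is taken literally,
# so this set means "anything except ']' or 'a'".
not_bracket_or_a = re.compile(r"[^]a]")
kept = not_bracket_or_a.findall("a]b-c")
print(kept)  # ['b', '-', 'c']

# A hyphen at the very start or very end of a set is a literal hyphen,
# not a range operator.
hyphen_first = re.compile(r"[-abcdefg]")
hyphen_last = re.compile(r"[abcdefg-]")
first_hits = hyphen_first.findall("a-z")
last_hits = hyphen_last.findall("a-z")
print(first_hits, last_hits)  # ['a', '-'] ['a', '-']

# The backslash always has to be escaped inside a set.
backslashes = re.findall(r"[\\]", r"C:\temp")
print(len(backslashes))  # 1
```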
Character Classes
Character classes can be used to match a pre-defined set of characters, e.g. all digits or all whitespace characters:
\d - matches [0-9]
\w - matches [A-Za-z0-9_]
\s - matches [ \t\r\n]
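A small sketch of the three shorthand classes, again assuming Python's re module and a made-up sample string. In Python, \d and \w also match non-ASCII digits and letters by default, so the re.ASCII flag is used here to keep them close to the bracketed sets listed above (Python's \s additionally covers form feed and vertical tab).

```python
import re

# Hypothetical sample text for illustration.
text = "id_7 ok!"

# re.ASCII restricts the shorthands to their ASCII definitions.
digits = re.findall(r"\d", text, flags=re.ASCII)
word_chars = re.findall(r"\w", text, flags=re.ASCII)
spaces = re.findall(r"\s", text, flags=re.ASCII)

print(digits)      # ['7']
print(word_chars)  # ['i', 'd', '_', '7', 'o', 'k']
print(spaces)      # [' ']
```

The exclamation mark is matched by none of the three classes, since it is neither a digit, a word character, nor whitespace.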
Repeating Sets
You can add a plus sign to the end of a set to indicate that it repeats one or more times:
[A-Z]+
will match any sequence of one or more capital letters.
If you want to match repetitions of the same character, rather than any sequence of characters drawn from the set, capture the first character in parentheses and repeat it with a backreference (\1 refers to whatever the first group matched):
([0-9])\1+
will match 666 but not 123.
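The difference between repeating a set and repeating a captured character can be sketched as follows. Python's re module and the sample strings are assumptions for illustration.

```python
import re

# [A-Z]+ matches runs of one or more capital letters; a single
# capital such as "I" counts as a run of length one.
caps = re.findall(r"[A-Z]+", "NASA and I use GREP")
print(caps)  # ['NASA', 'I', 'GREP']

# ([0-9]) captures one digit; \1+ then requires that same digit
# to appear at least once more.
repeated_digit = re.compile(r"([0-9])\1+")

hit = repeated_digit.search("666")
miss = repeated_digit.search("123")
print(hit.group(0))  # 666
print(miss)          # None
```

search() is used instead of findall() here because findall() returns only the captured group (a single digit) when the pattern contains a group, which would hide the full match.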
Anchors
Lines
Line anchors include beginning-of-line:
^
and end-of-line:
$
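A brief sketch of the two line anchors, assuming Python's re module and an invented two-line string. One Python-specific detail: by default ^ and $ anchor to the start and end of the whole string, and the re.MULTILINE flag makes them apply to every line. Line-oriented tools such as Grep and Sed work line by line already.

```python
import re

# Hypothetical two-line input for illustration.
text = "grep finds text\nsed edits lines"

# With re.MULTILINE, ^ matches at the start of each line...
first_words = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(first_words)  # ['grep', 'sed']

# ...and $ matches at the end of each line.
last_words = re.findall(r"\w+$", text, flags=re.MULTILINE)
print(last_words)  # ['text', 'lines']

# Without the flag, only the start of the whole string counts.
only_first = re.findall(r"^\w+", text)
print(only_first)  # ['grep']
```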