From charlesreid1

Intro

Regular expressions provide a way of searching for, and matching, text patterns. They allow very fine control over pattern matching, and are supported by many tools and programming languages, such as Awk, Sed, Vim, Grep, PHP, and Perl.

Regular expressions have several special groups of symbols:

  • Literal characters - these mean exactly what they say
  • Special characters - characters with special meaning in regular expressions (e.g. brackets and parentheses)
  • Character sets - these symbols represent a set of characters
  • Anchors - these match locations in a line, such as the beginning or end, or the boundary between words
  • Repetition symbols - these allow you to locate patterns with repetition in them
  • Grouping symbols - these provide a way to group terms


Literal Characters

Special Characters

Character Sets

In regular expressions, you can use expressions to match entire sets of characters, instead of individual characters. These are typically several (or a range of) characters surrounded by brackets.

Character sets can be inclusive (matching any character in the list) or exclusive (matching any character not in the list).

Inclusive Sets

Inclusive sets are straightforward: just put the characters you're searching for in between brackets. For example,

[aeiou]

will match any single lowercase vowel. So searching for a pattern like:

gr[ae]y

will match gray or grey.
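
As a quick check of the two patterns above, here is a minimal sketch using Python's re module. Python and the sample strings are assumptions made for illustration; the patterns themselves behave the same way in most regex flavors.

```python
import re

# Inclusive set: [ae] matches exactly one character, either "a" or "e".
spelling = re.compile(r"gr[ae]y")

# Hypothetical sample text, not from the original page.
text = "gray, grey, and groy"
matches = spelling.findall(text)
print(matches)  # ['gray', 'grey']

# [aeiou] matches one lowercase vowel at a time.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']
```

Note that "groy" is skipped, since "o" is not in the set.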

Exclusive Sets

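If the first character after the opening bracket is a caret, the set is negated: it matches any single character that is not in the list. For example, [^aeiou] matches any character that is not a lowercase vowel.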
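
An exclusive (negated) set puts a caret right after the opening bracket. A minimal sketch in Python's re module, with a made-up sample string, assuming the usual negation syntax:

```python
import re

# The caret directly after "[" negates the set:
# match any single character that is NOT a lowercase vowel.
non_vowels = re.findall(r"[^aeiou]", "regex")
print(non_vowels)  # ['r', 'g', 'x']
```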

Character Set Special Characters

The only characters with special meaning in character sets are the backslash, the hyphen, the caret, and the close-bracket. These are usually escaped with a backslash.

These characters don't always have to be escaped: depending on where they appear in the set, some can be left as-is, which makes the regular expression more readable.

Backslash: the backslash always needs to be escaped

Caret: the caret can go unescaped anywhere EXCEPT right after the opening bracket

Closing bracket: the closing bracket can go right after the opening bracket or right after the negating caret. For example:

[^]a]

matches any character that is not a closing bracket or an "a".

Hyphen: this can go right after the opening bracket or right before the closing bracket. For example:

[-abcdefg]
[abcdefg-]

These will both match any of the first 7 letters of the alphabet (a through g), or a hyphen.
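
The placement rules above can be checked with a short sketch. It uses Python's re module, which is an assumption: Python happens to accept these unescaped placements, but not every regex flavor does, so treat this as an illustration rather than a portability guarantee. The sample strings are made up.

```python
import re

# Hyphen first or last in the set is a literal hyphen, not a range.
leading = re.findall(r"[-abcdefg]", "a-z")
trailing = re.findall(r"[abcdefg-]", "a-z")
print(leading, trailing)  # ['a', '-'] ['a', '-']

# A closing bracket right after the negating caret is a literal "]":
# this set matches anything that is not "]" and not "a".
not_bracket_or_a = re.findall(r"[^]a]", "a]b")
print(not_bracket_or_a)  # ['b']

# A caret anywhere except right after "[" is a literal caret.
caret_literal = re.findall(r"[a^]", "a^b")
print(caret_literal)  # ['a', '^']
```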

Character Classes

Character classes can be used to match a pre-defined set of characters, e.g. all numbers or all non-whitespace characters:

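\d - matches a digit, [0-9]
\w - matches a word character, [A-Za-z0-9_]
\s - matches a whitespace character, such as [ \t\r\n]
\D, \W, \S - match the opposite of each class above (for example, \S matches any non-whitespace character)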
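
A minimal sketch of these classes in Python's re module (an assumption; not every tool supports the backslash classes). The re.ASCII flag is passed so that \d and \w are limited to the ASCII ranges listed above; without it, Python also matches Unicode digits and letters. The sample string is made up.

```python
import re

# Hypothetical sample: contains a space, a tab, and a newline.
text = "room 42,\tfloor_7\n"

digits = re.findall(r"\d", text, flags=re.ASCII)
words = re.findall(r"\w+", text, flags=re.ASCII)
spaces = re.findall(r"\s", text, flags=re.ASCII)

print(digits)  # ['4', '2', '7']
print(words)   # ['room', '42', 'floor_7']
print(spaces)  # [' ', '\t', '\n']
```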

Repeating Sets

You can add a plus sign to the end of a set to indicate that it repeats one or more times:

[A-Z]+

will match any sequence of one or more capital letters (a single capital letter also matches).

If you want to match repetitions of the same character, rather than any sequence of characters from the set, capture the first character in parentheses and refer back to it with \1:

([0-9])\1+

will match 666 but not 123.
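
Both kinds of repetition can be compared side by side. This is a small sketch using Python's re module (an assumption; the grouping and backreference syntax shown matches the unescaped-parenthesis style used above), with made-up sample strings.

```python
import re

# [A-Z]+ matches runs of one or more capital letters.
caps = re.findall(r"[A-Z]+", "NASA and Esa")
print(caps)  # ['NASA', 'E']

# ([0-9]) captures one digit; \1+ requires that same digit
# to appear again, one or more times.
repeated = re.compile(r"([0-9])\1+")
hit = repeated.search("666")
miss = repeated.search("123")
print(hit.group(0))  # 666
print(miss)          # None
```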

Anchors

Lines

Line anchors include beginning-of-line:

^

and end-of-line:

$
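
To illustrate the two line anchors, here is a minimal sketch that filters a list of lines the way grep would. It uses Python's re module (an assumption), and the sample lines are made up.

```python
import re

# Hypothetical sample lines.
lines = ["alpha beta", "beta alpha", "alpha"]

# ^ anchors the match to the beginning of the line.
starts_with = [ln for ln in lines if re.search(r"^alpha", ln)]

# $ anchors the match to the end of the line.
ends_with = [ln for ln in lines if re.search(r"alpha$", ln)]

# Using both anchors requires the whole line to match.
exactly = [ln for ln in lines if re.search(r"^alpha$", ln)]

print(starts_with)  # ['alpha beta', 'alpha']
print(ends_with)    # ['beta alpha', 'alpha']
print(exactly)      # ['alpha']
```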

Words

Repetition Symbols

Grouping Symbols

Examples

Surrounding Variable with Quotes

I needed to find all instances of the following four patterns in a file:

${X1},${X1},${X1}

${X2},${X2},${X2}

${X3},${X3},${X3}

${X4},${X4},${X4}

and replace them with the same thing, but just surrounded by single quotes:

'${X1},${X1},${X1}'

'${X2},${X2},${X2}'

'${X3},${X3},${X3}'

'${X4},${X4},${X4}'

I wanted to construct one search using Vim (or using Sed) that would do this. It is:

:%s/\(\${X[1-4]},\${X[1-4]},\${X[1-4]}\)/'\1'/g

Breaking this down:

:%s/\(...\)/'\1'/g

This takes whatever is matched between \( and \) and replaces it with the same thing (hence the \1), but surrounded by single quotes (hence '\1').

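Next, the pattern between \( and \) has to match all four instances above, so it uses [1-4] to match a single digit between 1 and 4. Each $ is escaped as \$ so that it is read as a literal dollar sign. Note that each [1-4] is matched independently, so the pattern would also match a mixed line such as ${X1},${X2},${X3}.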
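
For comparison, the same substitution can be sketched in Python's re module. This is an assumed translation of the Vim command, not something from the original workflow: Python uses unescaped parentheses for groups, and the braces are escaped here to keep them literal. The second pattern is a hypothetical stricter variant that uses a backreference so all three digits must be the same. The sample text is made up.

```python
import re

# Hypothetical sample input.
text = (
    "a = ${X1},${X1},${X1}\n"
    "b = ${X4},${X4},${X4}\n"
    "c = ${X5},${X5},${X5}\n"
    "d = ${X1},${X2},${X3}\n"
)

# Same logic as the Vim command: each [1-4] is matched independently.
loose = re.compile(r"(\$\{X[1-4]\},\$\{X[1-4]\},\$\{X[1-4]\})")
quoted = loose.sub(r"'\1'", text)

# Stricter variant: group 2 captures the digit, \2 requires it to repeat.
strict = re.compile(r"(\$\{X([1-4])\},\$\{X\2\},\$\{X\2\})")
quoted_strict = strict.sub(r"'\1'", text)

print(quoted)
print(quoted_strict)
```

With the loose pattern, lines a, b, and d are quoted; with the strict one, only a and b are. Line c is left alone in both cases because 5 is outside [1-4].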

Extracting Links from an HTML Page

I wanted to write a regular expression that would extract links from several HTML pages I had. To do this, I used sed. Given HTML source code:

<h1>Some html junk</h1>
<p>This is a bunch of junk.  The thing I really want is the <a href="http://www.yournamehere.com">link location</a>.

I wanted to pull out all the HTML links, like http://www.yournamehere.com.

To start with, I ran the sed script on the output from an echo command:

$ echo '<a href="as:df/qw.er/ty"><img src="">' | sed ...

This way, I could test the sed script, then run it on the full HTML source code when I was done.

The sed script I used was:

sed -n 's/.*href="\([^"]\{1,\}\).*/\1/p'

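First, the -n flag tells sed not to print each line automatically. Printing is instead requested explicitly by the p flag at the end of the substitution, so only lines where the pattern matched are printed.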

Next, the search pattern consists of an arbitrary number of characters .*, followed by href=", followed by the link text itself (what we're interested in getting).

The link text is put between \( and \) so that, as in the previous example, whatever matched between them can be referred to as \1.

The link text is matched using [^"], which matches any character that is not a double-quote, followed by \{1,\}, which means one or more. The problem with a simpler pattern like href="\(.*\)" is that .* is greedy: it matches as much as possible, so it runs on to the last double-quote on the line. If the <a> tag is followed by other tags, like <img src="..." />, the match would incorrectly include the link location, plus the closing of the <a> tag, plus the opening of the image tag, on into its src attribute. That's bad news.

Using [^"] is smarter because the match stops at the first double-quote after href=", so you only get the link text. This can create problems if, say, a double-quote is escaped in a link location, so that it appears as \" there, but this would be straightforward to correct.
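
The difference between the two patterns can be seen in a short sketch. It uses Python's re module rather than sed (an assumption made so the result is easy to inspect), with the test string from the echo command above and one made-up two-link line.

```python
import re

# Test input taken from the echo command above.
html = '<a href="as:df/qw.er/ty"><img src="">'

# [^"]+ stops at the first double-quote after href=".
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['as:df/qw.er/ty']

# Greedy .* runs on to the LAST double-quote on the line.
greedy = re.findall(r'href="(.*)"', html)
print(greedy)  # ['as:df/qw.er/ty"><img src="']

# Hypothetical line with two links: findall returns each one.
two = '<a href="a.html">A</a> <a href="b.html">B</a>'
both = re.findall(r'href="([^"]+)"', two)
print(both)  # ['a.html', 'b.html']
```

One difference from the sed script: because the sed pattern starts with a greedy .*, it would print only the last href on a line that contains several, whereas findall returns all of them.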
