vaughnfriesen.com

Introduction to regular expressions

If you're like me, regular expressions sound horribly complicated - just short of coding in assembly language while balancing a ball on your nose and drinking green tea (trying not to cringe at the taste, of course). Seriously, they aren't that bad.

What are regular expressions? You could call them a pattern-matching "language." Whether you want to do something as simple as searching a file for a specific phrase, or something more complex like validating a phone number, regular expressions can help. How do you get started? This article is mainly geared toward grep users, but there are other ways to use regular expressions. So grab your green tea and let's get started.

I'll use lots of examples so it should be easier to figure out. Here's the format:

'[Hh]ello'

example01.txt:
Hello, how are you hello today heLlo?
Hm?

The first line is the pattern - what you type in (or pass to grep). After that comes the text you are working with, which I'll save in a file and upload at the end. On the original page, bold red letters show what is being found. To try this out with grep, use this command:

grep 'pattern' example01.txt

The single quotes are just so that bash knows that what's inside is to be passed to grep without modification.
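If you'd like to try the first example without creating a file, here's a small sketch that pipes the example01.txt text straight into grep. The -o option (print only the matched text instead of the whole line) and the variable names are additions for illustration, not part of the command above.

```shell
# Text of example01.txt, kept in a variable so the sketch is self-contained.
text='Hello, how are you hello today heLlo?
Hm?'

# Plain grep prints every line that contains at least one match.
lines=$(printf '%s\n' "$text" | grep '[Hh]ello')

# -o prints only the matched text, one match per output line.
matches=$(printf '%s\n' "$text" | grep -o '[Hh]ello')

printf 'matching lines:\n%s\n' "$lines"
printf 'matched text:\n%s\n' "$matches"
```

Note that heLlo isn't found: the brackets only cover the first letter, so the rest of the word still has to be lowercase.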

Finally it's time to start. Maybe you know about the * and ? wildcards if you've been around computers much. Regular expressions, however, use a slightly different syntax. Here's a rundown of the basics:

a A letter or number in a regular expression matches exactly that character.
. Matches any single character.
.* Matches any number of characters (including 0). In general, an asterisk ( * ) after something means "zero or more of this character."
[] Matches any one of the characters between the brackets. For example, [Hh258] matches either H, h, 2, 5, or 8.
You can also give a range: [0-9] matches any digit, [a-z] matches any lowercase letter.
Another example: [0-9a-zAEIOU] matches any digit, lowercase letter, or uppercase vowel.
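Here's a quick sketch of those basics in action. The sample sentence is made up for this illustration, and -o (print only the matched text) is used so you can see exactly what each pattern picks up.

```shell
# Hypothetical sample line, invented for this sketch.
text='Room 258 is Open, hall 9 is shut'

# . matches any single character, so h.ll finds "hall".
dot=$(printf '%s\n' "$text" | grep -o 'h.ll')

# [Hh258] matches exactly one character from the set each time.
# tr joins the one-per-line matches so they are easier to read.
set_chars=$(printf '%s\n' "$text" | grep -o '[Hh258]' | tr -d '\n')

# [0-9]* alone can match nothing at all, so ask for one digit first.
digits=$(printf '%s\n' "$text" | grep -o '[0-9][0-9]*')

printf 'dot: %s\nset: %s\ndigits:\n%s\n' "$dot" "$set_chars" "$digits"
```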

'[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'

example02.txt:
My name is Joe and my phone number is 933-8439.
You can call me on my cell: 948-5830.

Obviously that searches for a phone number: three digits, a hyphen, then four more digits.
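To see just the numbers rather than the whole lines, something like the following should work. It pipes the example02.txt text into grep; the -o option and the variable names are illustrative choices, not part of the original example.

```shell
# Text of example02.txt.
text='My name is Joe and my phone number is 933-8439.
You can call me on my cell: 948-5830.'

# Three digits, a literal hyphen, then four digits.
numbers=$(printf '%s\n' "$text" | grep -o '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]')

printf '%s\n' "$numbers"
```

Keep in mind the pattern doesn't check what surrounds the number, so it would also match the middle of a longer run of digits.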

'a.aZ*j[0-9]*'

example03.txt:
aa abaZZZZZZj84873920 kljljdflasjdfjalsdf
as;lkjl;glsakdfjgldfg

Matches an 'a' with any character after it, then another 'a', any number of Zs (including none), and then a 'j', then any number of digits.
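A short sketch of that pattern against the example03.txt text. The second input, aXaj, is a made-up string added to show that Z* and [0-9]* are both happy to match nothing at all.

```shell
# Text of example03.txt.
text='aa abaZZZZZZj84873920 kljljdflasjdfjalsdf
as;lkjl;glsakdfjgldfg'

# Only one stretch of the text fits the whole pattern.
found=$(printf '%s\n' "$text" | grep -o 'a.aZ*j[0-9]*')

# Hypothetical input with no Zs and no digits: the pattern still matches.
minimal=$(printf 'aXaj\n' | grep -o 'a.aZ*j[0-9]*')

printf '%s\n%s\n' "$found" "$minimal"
```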

Suppose you want to use one of the special characters * . [ ] and others. What do you do? You need to escape them. Wha...? The backslash is called an escape character - if you put it in front of special characters it will take them literally instead of as a special character.

'\. \[.\*.\]'

example04.txt:
What do you mean? he said. [h*i]

This looks for a period, then a space, then a left bracket, any character, an asterisk ( * ), any character, and a right bracket. Note that the space doesn't need to be escaped. Of course that's a pointless text file. Oh well.
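To check the escaping for yourself, here's a sketch using the example04.txt text. The second command is an extra comparison (not from the example) showing what an unescaped period does instead.

```shell
# Text of example04.txt.
text='What do you mean? he said. [h*i]'

# Escaped period, bracket, and asterisk are matched literally.
escaped=$(printf '%s\n' "$text" | grep -o '\. \[.\*.\]')

# Unescaped, the period matches any character, so "d." finds both
# "do" and "d." in this line.
unescaped=$(printf '%s\n' "$text" | grep -o 'd.')

printf '%s\n%s\n' "$escaped" "$unescaped"
```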

There are a couple characters that have to do with the position in a line:

^ The start of a line.
$ The end of a line.

If these characters aren't at the beginning and end of the regular expression (respectively), grep treats them as literal characters.

'^J.*d$'

example05.txt:
Jack and
Jill went
up the
hill to
fetch a
pail of
water.
Poor Jack. d

Notice that the last line, which contains a J and ends with a d, doesn't match. That's because the ^ says that the J needs to be right at the beginning of the line.
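Here's a sketch that runs the pattern against the example05.txt text, and then again without the ^ for comparison. The second pattern and the variable names are additions for illustration.

```shell
# Text of example05.txt.
text='Jack and
Jill went
up the
hill to
fetch a
pail of
water.
Poor Jack. d'

# J must be the first character and d the last character of the line.
anchored=$(printf '%s\n' "$text" | grep '^J.*d$')

# Without ^, the J can be anywhere, so the last line matches too.
unanchored=$(printf '%s\n' "$text" | grep 'J.*d$')

printf 'with ^:\n%s\nwithout ^:\n%s\n' "$anchored" "$unanchored"
```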

Suppose you want exactly a certain number of some character. You can put the minimum and maximum number within escaped braces \{ \}, separated by a comma:

\{4,6\} Matches 4 to 6 of the preceding character in a row.
\{2,\} Matches at least 2 of the preceding character.
\{5\} Matches exactly 5 of the preceding character.

't\{5,10\}'

example06.txt:

hhhhhello ttttttthere tthis is a tttttttttttttttest.

Thinking back, our telephone number expression could have been written like this: [0-9]\{3\}-[0-9]\{4\}. Is that prettier? You can decide.

Remember to escape the braces. If you just use the expression A{5} on the text Hello AAAAA A{5}. grep will find the literal A{5} at the end rather than AAAAA, which may or may not be what you want.
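The braces are easier to trust once you've watched them count. In this sketch the runs of t's are made up (4, 5, and 12 in a row), and awk is used only to print the length of each match; the last two commands compare escaped and unescaped braces on the A{5} text.

```shell
# Hypothetical input: runs of 4, 5, and 12 t's.
text='tttt ttttt tttttttttttt'

# Print the length of each match of t\{5,10\}.
# The run of 4 is too short; the run of 12 yields one match of 10.
lengths=$(printf '%s\n' "$text" | grep -o 't\{5,10\}' | awk '{ print length($0) }')

# Escaped braces count repetitions; unescaped braces are literal text.
counted=$(printf 'Hello AAAAA A{5}.\n' | grep -o 'A\{5\}')
literal=$(printf 'Hello AAAAA A{5}.\n' | grep -o 'A{5}')

printf 'lengths:\n%s\ncounted: %s\nliteral: %s\n' "$lengths" "$counted" "$literal"
```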

Consider this example:

'do'

example07.txt:
I don't know what's for supper, but I do know it will be great.

The regular expression finds the do in don't, as well as the word do. What if you want to find only the word do? Well, you might decide to surround it with spaces: ' do '. At first it seems to work: in the sentence above, only the standalone do is found.

But there's a caveat:

' do '

example08.txt:

If you do and I do, then we both do.

What's wrong (besides the nonsensical sentence)? There are two do's that aren't found, because the spaces are being matched literally as spaces and those do's have punctuation after them. How can we get around that? Escaped angle brackets ( \< \> ) match the start and end of a word, so they will find a whole word - even with punctuation next to it.

'\<do\>'

example08.txt:

If you do and I do, then we both do.

Yippee!
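If you want to count the difference, here's a sketch that tallies the matches for each version of the pattern on the example07.txt and example08.txt text. Counting with grep -o piped into wc -l is just one convenient way to do it, and the variable names are illustrative.

```shell
text7="I don't know what's for supper, but I do know it will be great."
text8='If you do and I do, then we both do.'

# Plain 'do' also finds the do inside don't.
plain=$(printf '%s\n' "$text7" | grep -o 'do' | wc -l | tr -d ' ')

# Literal spaces miss the do's that are followed by punctuation.
spaced=$(printf '%s\n' "$text8" | grep -o ' do ' | wc -l | tr -d ' ')

# Word boundaries find every standalone do, and skip don't.
words8=$(printf '%s\n' "$text8" | grep -o '\<do\>' | wc -l | tr -d ' ')
words7=$(printf '%s\n' "$text7" | grep -o '\<do\>' | wc -l | tr -d ' ')

printf 'plain: %s  spaced: %s  words8: %s  words7: %s\n' \
  "$plain" "$spaced" "$words8" "$words7"
```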

Well, that concludes this article on regular expressions. It's getting pretty long, and you're probably going crazy from thinking and have a horrible taste in your mouth from all the green tea. So I'll quit. But here are a couple of (more useful) examples you can try, to see what they can do.

'[0-9]\{1,2\}/[0-9]\{1,2\}/[0-9]\{1,2\}'

example09.txt:

The date is 23/6/12.
4993 Range Rd. is where he lives.

Nothing new here, but the expression may be a bit more complicated than most we've been doing. Look at it carefully.
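Here's a sketch of the date pattern run against the example09.txt text. The second input is a made-up line, added to show that the pattern only checks the shape of the text and not whether the date makes sense.

```shell
# Text of example09.txt.
text='The date is 23/6/12.
4993 Range Rd. is where he lives.'

pattern='[0-9]\{1,2\}/[0-9]\{1,2\}/[0-9]\{1,2\}'

# One or two digits, a slash, one or two digits, a slash, one or two digits.
date_found=$(printf '%s\n' "$text" | grep -o "$pattern")

# Hypothetical input: not a real date, but it has the right shape.
loose=$(printf 'Build 99/99/99\n' | grep -o "$pattern")

printf '%s\n%s\n' "$date_found" "$loose"
```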

'^[0-9][0-9]* [a-zA-Z]* [a-zA-Z]*\.'

example09.txt:
The date is 23/6/12.
4993 Range Rd. is where he lives.

In the last example, we matched one or more digits using the construct [0-9][0-9]*. The first [0-9] makes sure there's at least one digit, and the [0-9]* after it allows zero or more additional digits. Another way of doing that would be [0-9]\{1,\}. In regular expressions, there is often more than one way to do things. You decide what's easiest.
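To convince yourself the two spellings behave the same, you could run both against the example09.txt text, as in this sketch. The variable names are illustrative.

```shell
# Text of example09.txt.
text='The date is 23/6/12.
4993 Range Rd. is where he lives.'

# "One digit, then zero or more digits".
star=$(printf '%s\n' "$text" | grep -o '^[0-9][0-9]* [a-zA-Z]* [a-zA-Z]*\.')

# "At least one digit", written with escaped braces.
brace=$(printf '%s\n' "$text" | grep -o '^[0-9]\{1,\} [a-zA-Z]* [a-zA-Z]*\.')

printf '%s\n%s\n' "$star" "$brace"
```

Both commands should print the same street address, since the first line of the text doesn't start with a digit.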

And, as promised, the text files I used: regexp.zip. If you use Windows and try to open it with Notepad, it might not display the line breaks properly. Try Wordpad, Notepad++, or a different text editor.