From charlesreid1

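Awk is a data-driven programming language for processing text: it reads its input one line (record) at a time, splits each line into fields, and runs your statements on the lines that match a condition.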

Basics

The basic syntax of an awk program is:

awk 'condition { statement }'
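As a small illustration of the condition/statement form (the three-line input is made up for this sketch), the following prints the second field of only those lines whose first field is "b":

```shell
# Made-up three-line input; the condition $1 == "b" selects one line,
# and the statement prints that line's second field.
result=$(printf 'a 1\nb 2\nc 3\n' | awk '$1 == "b" { print $2 }')
echo "$result"
```

If the condition is left out, the statement runs on every line; if the statement is left out, awk prints each line that matches the condition.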


Parsing

You can split each line into fields with awk's -F flag, followed by the delimiter. By default, awk splits on runs of whitespace (spaces and tabs). Each field can then be referred to using $1, $2, $3, etc.

$ echo "token1.token2.token3"
token1.token2.token3

$ echo "token1.token2.token3" | awk -F. '{print $1}'
token1

$ echo "token1.token2.token3" | awk -F. '{print $2}'
token2

This can be done with any character, not just punctuation:

$ echo "token1atoken2atoken3"
token1atoken2atoken3

$ echo "token1atoken2atoken3" | awk -Fa '{print $1}'
token1

$ echo "token1atoken2atoken3" | awk -Fa '{print $2}'
token2
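For comparison, here is a short sketch of the default behavior when no -F flag is given. The input line is made up for this example; it mixes three spaces and a tab between the tokens to show that each run of whitespace counts as a single separator:

```shell
# No -F flag: fields are split on runs of spaces and tabs,
# so the three spaces and the tab below each act as one separator.
second=$(printf 'token1   token2\ttoken3\n' | awk '{ print $2 }')
count=$(printf 'token1   token2\ttoken3\n' | awk '{ print NF }')
echo "$second $count"
```

Here NF is awk's built-in count of fields on the current line.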

Examples

Complex Parsing of Tokenized List

I had a list of movies, along with dates and times; it looked like this:

Rashomon    08/05/10    00:33:57
The Men Who Stare at Goats 08/04/10    01:28:21
Young Sherlock Holmes   08/03/10    01:20:26
The Girl with the Dragon Tattoo     08/02/10    02:19:57
Paycheck    08/01/10    01:53:20

I wanted to move the list into a spreadsheet program, so that I could easily deal with the data. So I needed three columns: one for the movie name, one for the date, and one for the time.

In awk terms, the list had multiple records (one per line), and each record had multiple fields. I could not predict the number of fields on each line, nor could I predict how many spaces separated the movie name from the date (in some cases it was 5, in other cases it was 1).

I wrote an awk program to print out the movie title, then a tab, then the date, then a tab, then the time, and put it all into a file called "NiceList.txt", using the following program:

awk '{ time=NF; date=time-1; numwords=date-1; for(i=1; i<=numwords; i++) printf("%s ",$i); printf("\t%s",$date); printf("\t%s",$time); print "" }' file > NiceList.txt

Let's walk through this.

time=NF;
date=time-1;
numwords=date-1;

These awk statements create three variables. The first is the field number of the time, which is NF, the number of fields in the line. In other words, the time is always the last field on the line.

The date is always the next-to-last field on the line.

And the number of words in the movie title is equal to the number of fields, minus 2.
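To see these three values for a concrete record, here is a quick sketch using one line from the list above. It prints the field count, the date, the time, and the number of title words:

```shell
line='The Men Who Stare at Goats 08/04/10    01:28:21'
# NF is the field count; $NF is the last field and $(NF-1) the one before it.
summary=$(echo "$line" | awk '{ print NF, $(NF-1), $NF, NF-2 }')
echo "$summary"
```

This line has 8 fields, so the time is field 8, the date is field 7, and the title takes up the first 6.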

for(i=1; i<=numwords; i++) printf("%s ",$i);

This loops over the words of the title and prints each one followed by a space, so the title ends with one trailing space.
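The loop prints a space after every word, including the last one. If that extra space matters, one possible variant (a sketch, not the program used in this section) puts the space before every word except the first:

```shell
line='Young Sherlock Holmes   08/03/10    01:20:26'
# Variant of the title loop: the separator goes between words, not after each one.
title=$(echo "$line" | awk '{ numwords=NF-2; for(i=1; i<=numwords; i++) printf("%s%s", (i>1 ? " " : ""), $i); print "" }')
echo "[$title]"
```

The square brackets in the echo are only there to make any stray leading or trailing space visible.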

printf("\t%s",$date); printf("\t%s",$time); print ""

This portion prints a tab, then the date, then a tab, then the time. It finishes by printing a newline. Awk will then repeat this process for the next record.

Then, I can either import or copy-and-paste "NiceList.txt" into any spreadsheet program or Google Docs, and it will make a 3-column spreadsheet; 1 column with the movie name, 1 column with the date, and 1 column with the time.
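Before importing, it can be worth confirming that every line really has three tab-separated columns. The following is a minimal sketch: it recreates two lines of the list in a scratch directory instead of touching the real "NiceList.txt" (the names list.txt and nice.txt are made up for this example), runs the same program, and counts the fields when splitting on tabs only:

```shell
# Work in a scratch directory so no real file is overwritten.
tmpdir=$(mktemp -d)
printf 'Rashomon    08/05/10    00:33:57\nPaycheck    08/01/10    01:53:20\n' > "$tmpdir/list.txt"

# The program from this section, unchanged.
awk '{ time=NF; date=time-1; numwords=date-1; for(i=1; i<=numwords; i++) printf("%s ",$i); printf("\t%s",$date); printf("\t%s",$time); print "" }' "$tmpdir/list.txt" > "$tmpdir/nice.txt"

# Splitting on tabs only, every line should report 3 fields.
columns=$(awk -F'\t' '{ print NF }' "$tmpdir/nice.txt" | sort -u)
echo "$columns"
rm -r "$tmpdir"
```

If the output is anything other than a single 3, some line did not split into exactly three columns.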