Awk
From charlesreid1
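Awk is a data-driven programming language for processing text: it reads input one record (by default, one line) at a time and runs actions on the records that match a pattern.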
Basics
The basic syntax of an awk program is:
awk 'condition { statement }'
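As a minimal sketch of this condition/statement form, the following assumes a hypothetical two-column input (a name and a score, invented for illustration) and prints the name only when the score exceeds 10:

```shell
# Hypothetical sample data: a name and a score per line.
scores='alice 12
bob 7
carol 30'

# Condition: second field greater than 10. Statement: print the first field.
high=$(printf '%s\n' "$scores" | awk '$2 > 10 { print $1 }')
echo "$high"
```

Lines for which the condition is false (here, the "bob" line) are skipped without running the statement.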
Parsing
You can split a line into tokens with awk by using the -F flag, followed by the delimiter (the default is any run of spaces or tabs). Each token can then be referred to using $1, $2, $3, etc.
$ echo "token1.token2.token3"
token1.token2.token3
$ echo "token1.token2.token3" | awk -F. '{print $1}'
token1
$ echo "token1.token2.token3" | awk -F. '{print $2}'
token2
This can be done with any character, not just punctuation:
$ echo "token1atoken2atoken3"
token1atoken2atoken3
$ echo "token1atoken2atoken3" | awk -Fa '{print $1}'
token1
$ echo "token1atoken2atoken3" | awk -Fa '{print $2}'
token2
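The separator passed to -F is not limited to one character: when it is longer than a single character, awk treats it as an extended regular expression. A small sketch, using a made-up input string in which the tokens are separated by either a comma or a semicolon:

```shell
# Made-up input: tokens separated by a mix of commas and semicolons.
mixed="token1,token2;token3"

# The bracket expression [,;] matches either separator.
third=$(echo "$mixed" | awk -F'[,;]' '{print $3}')
echo "$third"
```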
Examples
Complex Parsing of Tokenized List
I had a list of movies, along with dates and times; it looked like this:
Rashomon 08/05/10 00:33:57
The Men Who Stare at Goats 08/04/10 01:28:21
Young Sherlock Holmes 08/03/10 01:20:26
The Girl with the Dragon Tattoo 08/02/10 02:19:57
Paycheck 08/01/10 01:53:20
I wanted to move the list into a spreadsheet program, so that I could easily deal with the data. So I needed three columns: one for the movie name, one for the date, and one for the time.
In awk terms, the list had multiple records (one per line), and each record had multiple fields. I could not predict the number of fields in each record, since the titles vary in length, nor could I predict how many spaces separated the movie name from the date (in some cases it was 5, in other cases it was 1).
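To see why the field count cannot be assumed, a quick check (a sketch using two titles from the list above, with the spacing before the date chosen arbitrarily for illustration) prints NF for each record. Because the default separator is any run of whitespace, extra spaces do not create empty fields:

```shell
# Two records from the list; the run of spaces before the date is illustrative.
sample='Rashomon     08/05/10 00:33:57
The Men Who Stare at Goats 08/04/10 01:28:21'

# NF is the number of whitespace-separated fields in the current record.
counts=$(printf '%s\n' "$sample" | awk '{ print NF }')
echo "$counts"
```

The two records report different field counts, so the date and time have to be located by counting back from the end of the record.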
I wrote an awk program that prints the movie title, then a tab, then the date, then a tab, then the time, and redirected its output into a file called "NiceList.txt":
awk '{ time=NF; date=time-1; numwords=date-1;
for(i=1; i<=numwords; i++) printf("%s ",$i);
printf("\t%s",$date); printf("\t%s",$time); print "" }' file > NiceList.txt
Let's walk through this.
time=NF;
date=time-1;
numwords=date-1;
This creates 3 variables. The first holds the field number of the time, which is NF, the number of fields in the record. In other words, the time is always the last field on the line.
The date is always the next-to-last field on the line.
And the number of words in the movie title is equal to the number of fields, minus 2.
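The same positions can be reached without the intermediate variables, since awk accepts an expression after $. The following sketch (not the program above, just an equivalent way to address the last two fields) pulls the date and time from one line of the list:

```shell
line="Young Sherlock Holmes 08/03/10 01:20:26"

# $(NF-1) is the next-to-last field (date); $NF is the last field (time).
date_and_time=$(echo "$line" | awk '{ print $(NF-1), $NF }')
echo "$date_and_time"
```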
for(i=1; i<=numwords; i++) printf("%s ",$i);
This loops over the words of the movie title and prints each one followed by a space.
printf("\t%s",$date); printf("\t%s",$time); print ""
This portion prints a tab, then the date, then a tab, then the time, and finishes by printing a newline. Awk then repeats the whole process for the next record.
Then I can either import or copy and paste "NiceList.txt" into any spreadsheet program or Google Docs, and it will produce a 3-column spreadsheet: one column for the movie name, one for the date, and one for the time.
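One side effect of the loop above is that each title ends with a space before the tab, because every word is printed with a trailing space. If that matters for the spreadsheet import, a variant along these lines (a sketch, not the original program) puts a space only between words. The two sample lines are taken from the movie list:

```shell
# Two lines from the movie list, used here as sample input.
list='Rashomon 08/05/10 00:33:57
The Girl with the Dragon Tattoo 08/02/10 02:19:57'

# Join the title words with single spaces, then append tab-separated date and time.
nice=$(printf '%s\n' "$list" | awk '{
    title = $1
    for (i = 2; i <= NF - 2; i++) title = title " " $i
    printf("%s\t%s\t%s\n", title, $(NF-1), $NF)
}')
printf '%s\n' "$nice"
```

Building the title in a variable first keeps the separator logic in one place, and a one-word title such as "Rashomon" simply skips the loop.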