Revision as of 23:57, 6 March 2011

Awk is a powerful data-driven programming language for processing text one line at a time.

Basics

The basic syntax of an awk program is:

awk 'condition { statement }'
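As a small illustration of this form, the sketch below prints the first field of every line whose second field is greater than 1. The three input lines are made up for the example.

```shell
# Hypothetical two-column input. The condition $2 > 1 selects lines,
# and the statement { print $1 } runs only for the selected lines.
result=$(printf 'alpha 1\nbeta 2\ngamma 3\n' | awk '$2 > 1 { print $1 }')
echo "$result"
```

If the condition is left out, the statement runs for every line.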

Parsing

You can split a line into tokens with awk by using the -F flag, followed by the delimiter (by default, awk splits on runs of spaces and tabs). Each token can then be referred to as $1, $2, $3, etc.

$ echo "token1.token2.token3"
token1.token2.token3

$ echo "token1.token2.token3" | awk -F. '{print $1}'
token1

$ echo "token1.token2.token3" | awk -F. '{print $2}'
token2

This can be done with any character, not just punctuation:

$ echo "token1atoken2atoken3"
token1atoken2atoken3

$ echo "token1atoken2atoken3" | awk -Fa '{print $1}'
token1

$ echo "token1atoken2atoken3" | awk -Fa '{print $2}'
token2
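When -F is omitted, awk separates fields on any run of spaces and tabs, so repeated blanks do not produce empty fields. A quick check, using a made-up input line:

```shell
# No -F flag: four spaces and a tab each act as a single separator.
second=$(printf 'token1    token2\ttoken3\n' | awk '{ print $2 }')
third=$(printf 'token1    token2\ttoken3\n' | awk '{ print $3 }')
echo "$second $third"
```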

Examples

Complex Parsing of Tokenized List

When you run a file through awk, it will break each line up into tokens, called fields in awk; each line is called a record in awk.

As an example, I had a list of movies, along with dates and times; it looked like this:

Rashomon    08/05/10    00:33:57
The Men Who Stare at Goats 08/04/10    01:28:21
Young Sherlock Holmes   08/03/10    01:20:26
The Girl with the Dragon Tattoo     08/02/10    02:19:57
Paycheck    08/01/10    01:53:20

I wanted to move the list into a spreadsheet program, so that I could easily deal with the data. So I needed three columns: one for the movie name, one for the date, and one for the time.

In awk language, the list had multiple records, and each record had multiple fields. I could not predict the number of fields in each record (movie titles vary in length), nor could I predict how many spaces separated the movie name from the date (in some cases it was 5, in other cases it was 1).
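To see why a fixed field number such as $2 cannot pick out the date, the field count of each record can be printed with NF. A minimal sketch on two lines of the list:

```shell
# Two sample lines copied from the movie list.
movies=$(printf '%s\n' \
    'Rashomon    08/05/10    00:33:57' \
    'The Men Who Stare at Goats 08/04/10    01:28:21')

# NF is the number of whitespace-separated fields in the current record.
counts=$(printf '%s\n' "$movies" | awk '{ print NF }')
echo "$counts"
```

The first line has 3 fields and the second has 8, so the date and time have to be located relative to the end of the line.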

I wrote an awk program to print out the movie title, then a tab, then the date, then a tab, then the time, and put it all into a file called "NiceList.txt", using the following program:

awk '{ time=NF; date=time-1; numwords=date-1; for(i=1; i<=numwords; i++) printf("%s ",$i); printf("\t%s",$date); printf("\t%s",$time); print "" }' file > NiceList.txt

Let's walk through this.

time=NF;
date=time-1;
numwords=date-1;

These statements create three variables. The first, time, holds the field number of the time, which is NF, the number of fields in the record. In other words, the time is always the last field on the line.

The date is always the next-to-last field on the line.

And the number of words in the movie title is equal to the number of fields, minus 2.
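Because NF holds the field count, the same two fields can also be reached directly as $(NF-1) and $NF, without named variables. A minimal sketch on one line of the list:

```shell
# One sample line copied from the movie list.
line='The Men Who Stare at Goats 08/04/10    01:28:21'

# $(NF-1) is the next-to-last field (the date); $NF is the last field (the time).
last_two=$(printf '%s\n' "$line" | awk '{ print $(NF-1), $NF }')
echo "$last_two"
```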

for(i=1; i<=numwords; i++) printf("%s ",$i);

This loop runs once per word in the title and prints each word followed by a space, so the printed title ends with one trailing space before the tab.

printf("\t%s",$date); printf("\t%s",$time); print ""

This portion prints a tab, then the date, then a tab, then the time. It finishes with print "", which outputs a newline. Awk will then repeat this process for the next record.

Then I can either import or copy-and-paste "NiceList.txt" into any spreadsheet program or Google Docs, and it will make a 3-column spreadsheet: one column with the movie name, one with the date, and one with the time.
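The program above leaves a space between the last title word and the first tab. One possible variant, shown here as a sketch rather than the original program, builds the title in a variable first so that no trailing space is written. The variable name title is introduced only for this example.

```shell
# Two sample lines copied from the movie list.
input=$(printf '%s\n' \
    'Rashomon    08/05/10    00:33:57' \
    'Young Sherlock Holmes   08/03/10    01:20:26')

# Join the title words with single spaces, then print title, date, and time
# separated by tabs. Assumes every line has at least three fields.
nice=$(printf '%s\n' "$input" | awk '{
    title = $1
    for (i = 2; i <= NF - 2; i++) title = title " " $i
    printf("%s\t%s\t%s\n", title, $(NF-1), $NF)
}')
echo "$nice"
```

Redirecting the awk output to a file instead of a variable gives the same tab-separated layout as "NiceList.txt".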

Awk: Difference between revisions

From charlesreid1
