From charlesreid1


Revision as of 23:57, 6 March 2011

Awk is a powerful, data-driven programming language.

Basics

The basic syntax of an awk program is:

awk 'condition { statement }'
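
As a small illustration of that pattern, here is a sketch that prints only the lines whose first field is greater than 2. The sample input and the function name show_large are made up for this example.

```shell
# Hypothetical helper: print the lines whose first field is greater than 2.
show_large() {
    # condition: $1 > 2      statement: print the whole line ($0)
    awk '$1 > 2 { print $0 }'
}

printf '1 one\n3 three\n5 five\n' | show_large
# prints:
# 3 three
# 5 five
```

If the condition is left out, the statement runs for every line.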


Parsing

You can parse a list using awk by using the -F flag, followed by the delimiter. (Without -F, awk splits fields on runs of spaces and tabs.) Each token can then be referred to using $1, $2, $3, etc.

$ echo "token1.token2.token3"
token1.token2.token3

$ echo "token1.token2.token3" | awk -F. '{print $1}'
token1

$ echo "token1.token2.token3" | awk -F. '{print $2}'
token2

This can be done with any character, not just punctuation:

$ echo "token1atoken2atoken3"
token1atoken2atoken3

$ echo "token1atoken2atoken3" | awk -Fa '{print $1}'
token1

$ echo "token1atoken2atoken3" | awk -Fa '{print $2}'
token2
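
One difference between the default splitting and an explicit delimiter is worth seeing side by side. The sketch below counts fields both ways; the sample strings and the function names count_fields_default and count_fields_colon are made up for this example.

```shell
# Hypothetical helpers: report how many fields awk sees on each line.
count_fields_default() {
    # No -F: a run of spaces and tabs counts as a single separator.
    awk '{ print NF }'
}

count_fields_colon() {
    # -F: makes every ":" a separator, so "::" encloses an empty field.
    awk -F: '{ print NF }'
}

printf 'a   b\tc\n' | count_fields_default   # prints 3
printf 'a:::b\n'    | count_fields_colon     # prints 4
```

So repeated whitespace is collapsed by default, but repeated explicit delimiters produce empty fields.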

Examples

Complex Parsing of Tokenized List

When you run a file through awk, it reads the file one line at a time and breaks each line up into tokens. In awk terms, each line is a record and each token is a field.
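
To make records and fields concrete, here is a sketch using two built-in variables: NR (the number of the current record) and NF (the number of fields in it). The input and the function name describe_records are made up for this example.

```shell
# Hypothetical helper: describe each record that awk reads.
describe_records() {
    # NR = current record (line) number, NF = number of fields in it
    awk '{ print "record " NR " has " NF " fields; first is " $1 }'
}

printf 'alpha beta\ngamma delta epsilon\n' | describe_records
# prints:
# record 1 has 2 fields; first is alpha
# record 2 has 3 fields; first is gamma
```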

I had a list of movies, each with a date and a time; it looked like this:

Rashomon    08/05/10    00:33:57
The Men Who Stare at Goats 08/04/10    01:28:21
Young Sherlock Holmes   08/03/10    01:20:26
The Girl with the Dragon Tattoo     08/02/10    02:19:57
Paycheck    08/01/10    01:53:20

I wanted to move the list into a spreadsheet program, so that I could easily deal with the data. So I needed three columns: one for the movie name, one for the date, and one for the time.

In awk language, the list had multiple records, and each record had multiple fields. I could not predict the number of fields in each record, because the movie titles vary in length, nor could I predict how many spaces separated the movie name from the date (in some cases it was 5, in other cases it was 1).

I wrote an awk program to print out the movie title, then a tab, then the date, then a tab, then the time, and put it all into a file called "NiceList.txt", using the following program:

awk '{ time=NF; date=time-1; numwords=date-1; for(i=1; i<=numwords; i++) printf("%s ",$i); printf("\t%s",$date); printf("\t%s",$time); print "" }' file > NiceList.txt

Let's walk through this.

time=NF;
date=time-1;
numwords=date-1;

These statements create three variables. The first, time, is the "field number" of the time, which is NF, the number of fields in the record. In other words, the time is always the last field on the line.

The date is always the next-to-last field on the line.

And the number of words in the movie title is equal to the number of fields, minus 2.
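
To check the arithmetic, here is a sketch that prints the three values for one line of the list. The function name split_positions is made up for this example.

```shell
# Hypothetical helper: show the title word count, the date, and the time.
split_positions() {
    awk '{
        time = NF              # field number of the last field
        date = time - 1        # field number of the next-to-last field
        numwords = date - 1    # everything before the date is the title
        print numwords, $date, $time
    }'
}

printf 'The Men Who Stare at Goats 08/04/10    01:28:21\n' | split_positions
# prints: 6 08/04/10 01:28:21
```

That line has 8 fields, so time is 8, date is 7, and the title takes up the first 6.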

for(i=1; i<=numwords; i++) printf("%s ",$i);

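This loops over the fields that make up the title and prints each word followed by a space. Because the space comes after every word, including the last one, the title ends with a trailing space.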
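
If the extra space after the last word is a problem, one possible variant (not the program used on this page) is to print the space before each word except the first. The function name join_title is made up for this sketch.

```shell
# Hypothetical helper: print only the title, with no trailing space.
join_title() {
    awk '{
        numwords = NF - 2
        # print a space before every word except the first
        for (i = 1; i <= numwords; i++)
            printf("%s%s", (i > 1 ? " " : ""), $i)
        print ""
    }'
}

printf 'Young Sherlock Holmes  08/03/10    01:20:26\n' | join_title
# prints: Young Sherlock Holmes
```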

printf("\t%s",$date); printf("\t%s",$time); print ""

This portion prints a tab, then the date, then a tab, then the time. It finishes by printing a newline (print "" prints an empty string followed by the usual end-of-line). Awk then repeats the whole process for the next record.
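
For reference, here is the same program spread over several lines with comments. It is meant to behave like the one-liner; the wrapper function movies_to_tsv and the two-line sample input are only there to make the sketch self-contained.

```shell
# Hypothetical wrapper: title, tab, date, tab, time for each input line.
# Reads the files given as arguments, or standard input if there are none.
movies_to_tsv() {
    awk '{
        time = NF              # field number of the time (last field)
        date = time - 1        # field number of the date (next to last)
        numwords = date - 1    # number of fields in the title
        for (i = 1; i <= numwords; i++)
            printf("%s ", $i)  # each title word, followed by a space
        printf("\t%s", $date)
        printf("\t%s", $time)
        print ""               # end the output line
    }' "$@"
}

movies_to_tsv <<'EOF'
Rashomon    08/05/10    00:33:57
Paycheck    08/01/10    01:53:20
EOF
```

To write the result to a file, redirect the output as before, for example: movies_to_tsv file > NiceList.txt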

I can then either import or copy-and-paste "NiceList.txt" into any spreadsheet program or Google Docs, and it will make a 3-column spreadsheet: 1 column with the movie name, 1 column with the date, and 1 column with the time.
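
Before importing, it can be handy to confirm that every line really has three tab-separated columns. This is an optional sanity check, not part of the original workflow, and the function name check_columns is made up for this sketch.

```shell
# Hypothetical helper: succeed only if every line has exactly 3
# tab-separated fields. Reads the named files, or standard input.
check_columns() {
    awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad }' "$@"
}

printf 'Rashomon \t08/05/10\t00:33:57\n' | check_columns && echo "3 columns"
# prints: 3 columns
```

On the real file this would be run as: check_columns NiceList.txt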