Awk: Difference between revisions
From charlesreid1
Revision as of 00:06, 7 March 2011
Awk is a powerful data-driven programming language: a program is a set of conditions and actions that awk applies to each record of its input.
Overview of Awk
Basics
The basic syntax of an awk program is:
awk 'condition { statement }'
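As a minimal sketch of the condition/statement form, the following filters some made-up data (the names and scores are hypothetical, not from any real file). The condition selects records and the statement says what to do with them:

```shell
# Hypothetical sample data: a name and a score on each line.
scores='alice 7
bob 3
carol 9'

# The condition ($2 > 5) selects records; the statement prints the first field.
high=$(printf '%s\n' "$scores" | awk '$2 > 5 { print $1 }')
printf '%s\n' "$high"
```

If the condition is left out, the statement runs for every record; if the statement is left out, awk prints each matching record unchanged.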
Parsing
You can split each input line into fields with the -F option, followed by the delimiter; by default, fields are separated by runs of whitespace. Each field can then be referred to as $1, $2, $3, etc.
$ echo "token1.token2.token3"
token1.token2.token3
$ echo "token1.token2.token3" | awk -F. '{print $1}'
token1
$ echo "token1.token2.token3" | awk -F. '{print $2}'
token2
This can be done with any character, not just punctuation:
$ echo "token1atoken2atoken3"
token1atoken2atoken3
$ echo "token1atoken2atoken3" | awk -Fa '{print $1}'
token1
$ echo "token1atoken2atoken3" | awk -Fa '{print $2}'
token2
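The -F option sets the FS variable described under Built-In Variables. As a rough sketch, the same split can be written by assigning FS in a BEGIN action, which runs before any input is read (the variable names a and b are only for this illustration):

```shell
line='token1.token2.token3'

# Set the delimiter with -F on the command line...
a=$(printf '%s\n' "$line" | awk -F. '{ print $3 }')
# ...or assign FS before the first record is read.
b=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "." } { print $3 }')
printf '%s %s\n' "$a" "$b"
```

Both commands should print the third token.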
Built-In Variables
From the awk manpage:
ARGC
- The number of elements in the ARGV array.
ARGV
- An array of command line arguments, excluding options and the program argument, numbered from zero to ARGC-1.
- The arguments in ARGV can be modified or added to; ARGC can be altered. As each input file ends, awk shall treat the next non-null element of ARGV, up to the current value of ARGC-1, inclusive, as the name of the next input file. Thus, setting an element of ARGV to null means that it shall not be treated as an input file. The name '-' indicates the standard input. If an argument matches the format of an assignment operand, this argument shall be treated as an assignment rather than a file argument.
CONVFMT
- The printf format for converting numbers to strings (except for output statements, where OFMT is used); "%.6g" by default.
ENVIRON
- An array representing the value of the environment, as described in the exec functions defined in the System Interfaces volume of IEEE Std 1003.1-2001. The indices of the array shall be strings consisting of the names of the environment variables, and the value of each array element shall be a string consisting of the value of that variable. If appropriate, the environment variable shall be considered a numeric string (see Expressions in awk); the array element shall also have its numeric value.
- In all cases where the behavior of awk is affected by environment variables (including the environment of any commands that awk executes via the system function or via pipeline redirections with the print statement, the printf statement, or the getline function), the environment used shall be the environment at the time awk began executing; it is implementation-defined whether any modification of ENVIRON affects this environment.
FILENAME
- A pathname of the current input file. Inside a BEGIN action the value is undefined. Inside an END action the value shall be the name of the last input file processed.
FNR
- The ordinal number of the current record in the current file. Inside a BEGIN action the value shall be zero. Inside an END action the value shall be the number of the last record processed in the last file processed.
FS
- Input field separator regular expression; a <space> by default.
NF
- The number of fields in the current record. Inside a BEGIN action, the use of NF is undefined unless a getline function without a var argument is executed previously. Inside an END action, NF shall retain the value it had for the last record read, unless a subsequent, redirected, getline function without a var argument is performed prior to entering the END action.
NR
- The ordinal number of the current record from the start of input. Inside a BEGIN action the value shall be zero. Inside an END action the value shall be the number of the last record processed.
OFMT
- The printf format for converting numbers to strings in output statements (see Output Statements); "%.6g" by default. The result of the conversion is unspecified if the value of OFMT is not a floating-point format specification.
OFS
- The print statement output field separator; <space> by default.
ORS
- The print statement output record separator; a <newline> by default.
RLENGTH
- The length of the string matched by the match function.
RS
- The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
RSTART
- The starting position of the string matched by the match function, numbering from 1. This shall always be equivalent to the return value of the match function.
SUBSEP
- The subscript separator string for multi-dimensional arrays; the default value is implementation-defined.
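A short sketch may help tie a few of these variables together. The input and the ":" separator below are made up for illustration; the first command uses NR, NF, and OFS, and the second uses RSTART and RLENGTH after a call to match:

```shell
# Hypothetical two-line input with a varying number of fields.
input='a b c
d e'

# NR is the record number, NF the field count, and $NF the last field.
# The commas in print are replaced by OFS, set here to ":".
summary=$(printf '%s\n' "$input" | awk 'BEGIN { OFS = ":" } { print NR, NF, $NF }')
printf '%s\n' "$summary"

# match() sets RSTART and RLENGTH to the position and length of the match.
pos=$(awk 'BEGIN { if (match("foobar", /ob+/)) print RSTART, RLENGTH }')
printf '%s\n' "$pos"
```

The second command needs no input because everything happens inside the BEGIN action.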
Examples
Complex Parsing of Tokenized List
I had a list of movies, along with dates and times; it looked like this:
Rashomon 08/05/10 00:33:57
The Men Who Stare at Goats 08/04/10 01:28:21
Young Sherlock Holmes 08/03/10 01:20:26
The Girl with the Dragon Tattoo 08/02/10 02:19:57
Paycheck 08/01/10 01:53:20
I wanted to move the list into a spreadsheet program, so that I could easily deal with the data. So I needed three columns: one for the movie name, one for the date, and one for the time.
In awk terms, the list had multiple records, and each record had multiple fields. I could not predict the number of fields in each record, because titles have different numbers of words, nor could I predict how many spaces separated the movie name from the date (in some cases it was 5, in other cases it was 1).
I wrote the following awk program, which prints the movie title, then a tab, then the date, then a tab, then the time, and redirects the output into a file called "NiceList.txt":
awk '{ time=NF; date=time-1; numwords=date-1;
for(i=1; i<=numwords; i++) printf("%s ",$i);
printf("\t%s",$date); printf("\t%s",$time); print "" }' file > NiceList.txt
Let's walk through this.
time=NF;
date=time-1;
numwords=date-1;
These statements create three variables. The first is the field number of the time, which is NF, the number of fields in the record. In other words, the time is always the last field on the line.
The date is always the next-to-last field on the line.
And the number of words in the movie title is equal to the number of fields, minus 2.
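The same positions can also be reached without named variables, since awk accepts an expression after the $. A small sketch on one record from the list (the shell variable names are only for this illustration):

```shell
record='The Men Who Stare at Goats 08/04/10 01:28:21'

# $(NF-1) is the next-to-last field (the date), $NF is the last (the time),
# and NF-2 is the number of words in the title.
fields=$(printf '%s\n' "$record" | awk '{ print $(NF-1), $NF, NF-2 }')
printf '%s\n' "$fields"
```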
for(i=1; i<=numwords; i++) printf("%s ",$i);
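This loops over the words of the title and prints each one followed by a space. Note that the last word is also followed by a space, so the title column ends with a trailing space before the tab.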
printf("\t%s",$date); printf("\t%s",$time); print ""
This portion prints a tab, then the date, then a tab, then the time. It finishes by printing a newline. Awk then repeats this process for the next record.
Then, I can either import or copy-and-paste "NiceList.txt" into any spreadsheet program or Google Docs, and it will make a 3-column spreadsheet; 1 column with the movie name, 1 column with the date, and 1 column with the time.
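The program above leaves a space after the last word of each title. If that matters for the spreadsheet, one possible variant (a sketch, not the program I originally used) builds the title in a variable first and then prints all three columns at once. The two sample records and the variable names are only for illustration:

```shell
# Two records from the list above, with uneven spacing as in the original data.
list='Rashomon     08/05/10 00:33:57
Young Sherlock Holmes 08/03/10 01:20:26'

tsv=$(printf '%s\n' "$list" | awk '{
    title = $1
    # Join the remaining title words with single spaces (no trailing space).
    for (i = 2; i <= NF - 2; i++) title = title " " $i
    # Emit title, date, and time separated by tabs.
    printf("%s\t%s\t%s\n", title, $(NF-1), $NF)
}')
printf '%s\n' "$tsv"
```

Replacing the shell variable with a file name and a redirection, as in the original command, would write the same three columns to "NiceList.txt".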
Useful Resources
I highly recommend reading the awk man page, available here:
There is also a page with a very nice collection of awk one-liners available here:
The one-liners can help if you aren't quite sure how to translate the material in the awk man page into an awk program. You will also probably find a program that you can tweak a little to do what you want.