
Awk is a really powerful data-driven programming language.

Overview of Awk

Basics

An awk program is a series of condition { statement } rules (in awk terminology, pattern-action pairs). The basic command-line invocation is:

awk 'condition { statement }'
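
For example, here is a minimal one-liner (just an illustrative sketch), where the condition selects lines whose first field is greater than 100 and the statement prints the second field. If the condition is omitted, the statement runs on every line; if the statement is omitted, matching lines are printed in full:

$ printf "5 apple\n200 banana\n" | awk '$1 > 100 { print $2 }'
banana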


Parsing

You can control how awk splits each line into fields by using the -F flag, followed by the field delimiter (by default, whitespace). Each field can then be referred to using $1, $2, $3, etc.

$ echo "token1.token2.token3"
token1.token2.token3

$ echo "token1.token2.token3" | awk -F. '{print $1}'
token1

$ echo "token1.token2.token3" | awk -F. '{print $2}'
token2

This can be done with any character, not just punctuation:

$ echo "token1atoken2atoken3"
token1atoken2atoken3

$ echo "token1atoken2atoken3" | awk -Fa '{print $1}'
token1

$ echo "token1atoken2atoken3" | awk -Fa '{print $2}'
token2

and to parse via spaces:

$ echo "token1 token2 token3" | awk -F" " '{print $1}'
token1

$ echo "token1 token2 token3" | awk -F" " '{print $2}'
token2

NOTE: whitespace is the default field separator, so no -F flag is needed for space-separated input:

$ echo "token1 token2 token3" | awk '{print $1}'
token1
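
Since the value passed to -F is treated as a regular expression, you can also split on any of several characters at once. As a small aside (not in the original examples):

$ echo "token1.token2-token3" | awk -F'[.-]' '{print $2}'
token2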

Built-In Variables

From the awk manpage:

ARGC

  • The number of elements in the ARGV array.

ARGV

  • An array of command line arguments, excluding options and the program argument, numbered from zero to ARGC-1.
  • The arguments in ARGV can be modified or added to; ARGC can be altered. As each input file ends, awk shall treat the next non-null element of ARGV, up to the current value of ARGC-1, inclusive, as the name of the next input file. Thus, setting an element of ARGV to null means that it shall not be treated as an input file. The name '-' indicates the standard input. If an argument matches the format of an assignment operand, this argument shall be treated as an assignment rather than a file argument.

CONVFMT

  • The printf format for converting numbers to strings (except for output statements, where OFMT is used); "%.6g" by default.

ENVIRON

  • An array representing the value of the environment, as described in the exec functions defined in the System Interfaces volume of IEEE Std 1003.1-2001. The indices of the array shall be strings consisting of the names of the environment variables, and the value of each array element shall be a string consisting of the value of that variable. If appropriate, the environment variable shall be considered a numeric string (see Expressions in awk); the array element shall also have its numeric value.
  • In all cases where the behavior of awk is affected by environment variables (including the environment of any commands that awk executes via the system function or via pipeline redirections with the print statement, the printf statement, or the getline function), the environment used shall be the environment at the time awk began executing; it is implementation-defined whether any modification of ENVIRON affects this environment.

FILENAME

  • A pathname of the current input file. Inside a BEGIN action the value is undefined. Inside an END action the value shall be the name of the last input file processed.

FNR

  • The ordinal number of the current record in the current file. Inside a BEGIN action the value shall be zero. Inside an END action the value shall be the number of the last record processed in the last file processed.

FS

  • Input field separator regular expression; a <space> by default.

NF

  • The number of fields in the current record. Inside a BEGIN action, the use of NF is undefined unless a getline function without a var argument is executed previously. Inside an END action, NF shall retain the value it had for the last record read, unless a subsequent, redirected, getline function without a var argument is performed prior to entering the END action.

NR

  • The ordinal number of the current record from the start of input. Inside a BEGIN action the value shall be zero. Inside an END action the value shall be the number of the last record processed.

OFMT

  • The printf format for converting numbers to strings in output statements (see Output Statements); "%.6g" by default. The result of the conversion is unspecified if the value of OFMT is not a floating-point format specification.

OFS

  • The print statement output field separation; <space> by default.

ORS

  • The print statement output record separator; a <newline> by default.

RLENGTH

  • The length of the string matched by the match function.

RS

  • The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.

RSTART

  • The starting position of the string matched by the match function, numbering from 1. This shall always be equivalent to the return value of the match function.

SUBSEP

  • The subscript separator string for multi-dimensional arrays; the default value is implementation-defined.
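
As a quick illustration of a few of these variables (a minimal sketch, not taken from the manpage), the following prints each line's record number, field count, and last field, joined with commas via OFS:

$ printf "a b c\nd e\n" | awk 'BEGIN { OFS="," } { print NR, NF, $NF }'
1,3,c
2,2,e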

Using Bash Variables from Awk

You can use Bash variables from Awk, but not directly, because the single quotes around the awk program keep the shell from expanding them. For example, the following WILL NOT work:

export x="stuff"

awk '{print $x}'

Instead, you have two options: pass the variable to awk with the -v flag, or splice it directly into the awk program.

To pass the variable to awk, use the -v flag [1]:

root="/webroot"

echo | awk -v r="$root" '{ print "shell root value - " r }'

(The echo | is required because awk needs some input before it will run the main block.)
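
If you only need to print the value and have no other input, a BEGIN block runs before any input is read, so the echo can be dropped (a small variation on the example above):

root="/webroot"

awk -v r="$root" 'BEGIN { print "shell root value - " r }'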

Alternatively, splice the variable directly into the awk program [2]:

var="BASH"

echo "unix scripting" | awk '{gsub(/unix/,"'"${var}"'"); print}'

or, equivalently, using command substitution:

var="BASH"

echo "unix scripting" | awk '{gsub(/unix/,"'"$(echo ${var})"'"); print}'


Examples

Complex Parsing of Tokenized List

I had a list of movies, along with dates and times; it looked like this:

Rashomon    08/05/10    00:33:57
The Men Who Stare at Goats 08/04/10    01:28:21
Young Sherlock Holmes   08/03/10    01:20:26
The Girl with the Dragon Tattoo     08/02/10    02:19:57
Paycheck    08/01/10    01:53:20

I wanted to move the list into a spreadsheet program, so that I could easily deal with the data. So I needed three columns: one for the movie name, one for the date, and one for the time.

This means I needed an awk program that would print out the movie title, then a tab, then the date, then a tab, then the time, and put it all into a file called "NiceList.txt", which I could then import or copy-and-paste into a spreadsheet.

I was able to whip this up in about 5 minutes, with very limited awk knowledge - so if you need to do something complex with awk, all it takes is a few minutes and a few examples to work with (see the Useful Resources section below).

The program is:

awk '{ time=NF; date=time-1; numwords=date-1; 
for(i=1; i<=numwords; i++) printf("%s ",$i); 
printf("\t%s",$date); printf("\t%s",$time); print "" }' file > NiceList.txt

In awk terms, the list had multiple records (lines), and each record had multiple fields; I could not predict the number of fields on each line, nor could I predict how many spaces separated the movie name from the date and time (in some cases it was 5, in other cases it was 1).

Let's walk through this.

time=NF;
date=time-1;
numwords=date-1;

This awk statement creates 3 variables; the first is the "field number" of the time, which is NF - the number of fields in the line. In other words, the time is always the last field on the line.

The date is always the next-to-last field on the line.

And the number of words in the movie title is equal to the number of fields, minus 2.

for(i=1; i<=numwords; i++) printf("%s ",$i);

This loops through the number of words, and prints each word, separated by a space.

printf("\t%s",$date); printf("\t%s",$time); print ""

This portion prints a tab, then the date, then a tab, then the time. It finishes by printing a new line. Awk will then repeat this process for the next record.
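
To confirm the tabs made it in, you can peek at the result with cat -t, which renders tab characters as ^I (a quick check, not part of the original write-up; note the trailing space each title keeps from the loop):

$ head -2 NiceList.txt | cat -t
Rashomon ^I08/05/10^I00:33:57
The Men Who Stare at Goats ^I08/04/10^I01:28:21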

Then, I can either import or copy-and-paste "NiceList.txt" into any spreadsheet program or Google Docs, and it will make a 3-column spreadsheet; 1 column with the movie name, 1 column with the date, and 1 column with the time.


Renaming Files, If Names Not Duplicates

I wanted to transform some filenames in a directory using a sed script (see Sed#Renaming_files). However, the sed script had already been run, and some of the files had transformed filenames, while there were also new files that the sed script would need to rename.

Unfortunately, if sed did not change a filename, mv would be called with the same name twice, which resulted in error messages:

mv: `i059_j072_k072' and `i059_j072_k072' are the same file
mv i042_j072_k072 i042_j072_k072 
mv: `i042_j072_k072' and `i042_j072_k072' are the same file
mv i018_j072_k072 i018_j072_k072 
mv: `i018_j072_k072' and `i018_j072_k072' are the same file
mv i026_j072_k072 i026_j072_k072 
mv: `i026_j072_k072' and `i026_j072_k072' are the same file
mv i016_j072_k072 i016_j072_k072 
mv: `i016_j072_k072' and `i016_j072_k072' are the same file
mv i142_j072_k072 i142_j072_k072 
mv: `i142_j072_k072' and `i142_j072_k072' are the same file
mv i129_j072_k072 i129_j072_k072 
mv: `i129_j072_k072' and `i129_j072_k072' are the same file
mv i135_j072_k072 i135_j072_k072 
mv: `i135_j072_k072' and `i135_j072_k072' are the same file
mv i125_j072_k072 i125_j072_k072 
mv: `i125_j072_k072' and `i125_j072_k072' are the same file
mv i127_j072_k072 i127_j072_k072 
mv: `i127_j072_k072' and `i127_j072_k072' are the same file
mv i119_j072_k072 i119_j072_k072 
mv: `i119_j072_k072' and `i119_j072_k072' are the same file
mv i114_j072_k072 i114_j072_k072 
mv: `i114_j072_k072' and `i114_j072_k072' are the same file
mv i100_j072_k072 i100_j072_k072 
mv: `i100_j072_k072' and `i100_j072_k072' are the same file

I didn't want to see these errors, so I needed to add a check to make sure the two filenames were not duplicates before calling mv. The Awk one-liner I used was:

$ awk '{ if ( $1 == $2 ) print ""; else print $0; }' | xargs -n2 -t mv

Here's its use in an actual example:

$ echo "filename1 filename2" | awk '{ if ( $1 == $2 ) print ""; else print $0; }' | xargs -r -n2 -t mv
mv filename1 filename2

$ echo "filename2 filename2" | awk '{ if ( $1 == $2 ) print ""; else print $0; }' | xargs -r -n2 -t mv

(Note that the -r flag tells xargs not to run the command at all when it receives no input, so mv is never invoked with no arguments.)

So the first command, which has two unique arguments, executes the move command. The second command, which has duplicate arguments, does not execute the move command.

When the filenames being fed to awk come from sed, each name arrives on its own line. For this reason, they should first be piped through xargs -n2, so that output from sed that looks like this:

aaa
bbb
ccc
ddd

can be fed to awk like this:

aaa bbb
ccc ddd
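
For example, here is just the pairing step on its own (a minimal sketch):

$ printf "aaa\nbbb\nccc\nddd\n" | xargs -n2
aaa bbb
ccc ddd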

The final script looks like this:

#!/bin/sh

# For each file: sed prints the original name ('p'), then prints the name with
# the i/j/k indices zero-padded to three digits; xargs -n2 pairs old and new
# names, awk drops pairs where the name did not change, and mv renames the rest.
ls -1c i* | /bin/sed \
 -e 'p' \
 -e 's/i\([0-9]\{1\}\)_/i00\1_/' \
 -e 's/i\([0-9]\{2\}\)_/i0\1_/'  \
 -e 's/j\([0-9]\{1\}\)_/j00\1_/' \
 -e 's/j\([0-9]\{2\}\)_/j0\1_/'  \
 -e 's/k\([0-9]\{1\}\)$/k00\1/'  \
 -e 's/k\([0-9]\{2\}\)$/k0\1/'   \
 | xargs -n2 \
 | awk '{ if( $1 == $2 ) print ""; else print $0 }' \
 | xargs -r -n2 -t mv

Getting Last Column (NF and NR)

If you want to print out the last column in a list of columns, you can use awk to do that; you just need a way to access the last field on each line. Normally you access the fields using $0 for the whole line, $1 for the first field, $2 for the second field, etc.

To get the last field, you need the number of fields, which awk keeps in the NF variable; the last field is therefore $NF.
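
A quick one-liner illustrating the difference between NF (the count) and $NF (the last field):

$ echo "a b c" | awk '{print NF, $NF}'
3 c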

Here's an example: I am printing a list of docker containers, and want to run the docker rm command on each of the names in the last column:

$ docker ps -a
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS                      PORTS               NAMES
c1facefc83d0        busybox                         "touch /icanwrite/..."   9 minutes ago       Exited (0) 9 minutes ago                        eloquent_swartz
f863e355c442        busybox                         "touch /canttouchthis"   10 minutes ago      Exited (1) 10 minutes ago                       clever_kirch
e499e952b0da        busybox                         "touch /icanwrite/..."   10 minutes ago      Exited (0) 10 minutes ago                       friendly_leakey
1ac270b92af7        busybox                         "touch /icanwrite/..."   10 minutes ago      Exited (0) 10 minutes ago                       stupefied_bhabha
692e8d14352f        busybox                         "touch /icanwrite/..."   11 minutes ago      Exited (0) 10 minutes ago                       youthful_beaver
eaa73ee697a2        ubuntu                          "touch /icanwrite/..."   11 minutes ago      Exited (0) 11 minutes ago                       lucid_morse
3ea65b1b3e8d        waleedka/modern-deep-learning   "/bin/bash"              19 hours ago        Exited (130) 19 hours ago                       heuristic_banach
60bdbe518f79        waleedka/modern-deep-learning   "/bin/bash"              19 hours ago        Exited (0) 19 hours ago                         festive_visvesvaraya

This can be passed to awk:

$ docker ps -a | awk '{print $NF}'
NAMES
eloquent_swartz
clever_kirch
friendly_leakey
stupefied_bhabha
youthful_beaver
lucid_morse
heuristic_banach
festive_visvesvaraya

Now we need a way to tell awk to skip the first line. We can check which line we're on using the NR variable (it starts at 1, not 0), and we can add a condition before our program requiring the record number to be greater than 1:

$ docker ps -a | awk 'NR>1 {print $NF}'
eloquent_swartz
clever_kirch
friendly_leakey
stupefied_bhabha
youthful_beaver
lucid_morse
heuristic_banach
festive_visvesvaraya

This skips the first row. Now we can pass the output of our awk program to xargs to run a command on each of the input arguments:

$ docker ps -a | awk 'NR>1 {print $NF}' | xargs -t -n1 docker rm
docker rm eloquent_swartz
eloquent_swartz
docker rm clever_kirch
clever_kirch
docker rm friendly_leakey
friendly_leakey
docker rm stupefied_bhabha
stupefied_bhabha
docker rm youthful_beaver
youthful_beaver
docker rm lucid_morse
lucid_morse
docker rm heuristic_banach
heuristic_banach
docker rm festive_visvesvaraya
festive_visvesvaraya

Bingo! There's your awk one-liner.

Useful Resources

I highly recommend reading the awk man page (man awk).

There is also a page with a very nice collection of awk one-liners.

The one-liners can help if you aren't quite sure how to translate the stuff in the awk man page into an awk program. It is also probable that you'll find a program that you can tweak a little bit to do what you want.