Awk
From charlesreid1
Awk is a really powerful data-driven programming language.
Overview of Awk
Basics
The basic syntax of an awk program is:
awk 'condition { statement }'
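For example, the condition can be any expression, and the statement runs only on matching records (the input data here is invented for illustration):

```shell
# Print the first field of lines whose second field is greater than 50.
printf 'alice 42\nbob 57\ncarol 63\n' | awk '$2 > 50 { print $1 }'
# prints:
# bob
# carol
```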
Parsing
You can parse a delimited string using awk's -F flag, followed by the delimiter (by default, whitespace). Each token can then be referred to using $1, $2, $3, etc.
$ echo "token1.token2.token3"
token1.token2.token3
$ echo "token1.token2.token3" | awk -F. '{print $1}'
token1
$ echo "token1.token2.token3" | awk -F. '{print $2}'
token2
This can be done with any character, not just punctuation:
$ echo "token1atoken2atoken3"
token1atoken2atoken3
$ echo "token1atoken2atoken3" | awk -Fa '{print $1}'
token1
$ echo "token1atoken2atoken3" | awk -Fa '{print $2}'
token2
and to parse via spaces:
$ echo "token1 token2 token3" | awk -F" " '{print $1}'
token1
$ echo "token1 token2 token3" | awk -F" " '{print $2}'
token2
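When the value given to -F is longer than one character, POSIX awk treats it as an extended regular expression, so you can split on several delimiters at once (example input invented):

```shell
# A bracket expression as the field separator splits on either '.' or ','.
echo "a.b,c" | awk -F'[.,]' '{ print $2 }'
# prints: b
```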
NOTE: whitespace is the default field separator:
$ echo "token1 token2 token3" | awk '{print $1}'
token1
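With the default separator, runs of spaces or tabs count as a single delimiter, and leading whitespace is ignored:

```shell
# Despite the extra spaces, token2 is still field $2.
echo "  token1   token2" | awk '{ print $2 }'
# prints: token2
```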
Built-In Variables
From the awk manpage:
ARGC
- The number of elements in the ARGV array.
ARGV
- An array of command line arguments, excluding options and the program argument, numbered from zero to ARGC-1.
- The arguments in ARGV can be modified or added to; ARGC can be altered. As each input file ends, awk shall treat the next non-null element of ARGV, up to the current value of ARGC-1, inclusive, as the name of the next input file. Thus, setting an element of ARGV to null means that it shall not be treated as an input file. The name '-' indicates the standard input. If an argument matches the format of an assignment operand, this argument shall be treated as an assignment rather than a file argument.
CONVFMT
- The printf format for converting numbers to strings (except for output statements, where OFMT is used); "%.6g" by default.
ENVIRON
- An array representing the value of the environment, as described in the exec functions defined in the System Interfaces volume of IEEE Std 1003.1-2001. The indices of the array shall be strings consisting of the names of the environment variables, and the value of each array element shall be a string consisting of the value of that variable. If appropriate, the environment variable shall be considered a numeric string (see Expressions in awk); the array element shall also have its numeric value.
- In all cases where the behavior of awk is affected by environment variables (including the environment of any commands that awk executes via the system function or via pipeline redirections with the print statement, the printf statement, or the getline function), the environment used shall be the environment at the time awk began executing; it is implementation-defined whether any modification of ENVIRON affects this environment.
FILENAME
- A pathname of the current input file. Inside a BEGIN action the value is undefined. Inside an END action the value shall be the name of the last input file processed.
FNR
- The ordinal number of the current record in the current file. Inside a BEGIN action the value shall be zero. Inside an END action the value shall be the number of the last record processed in the last file processed.
FS
- Input field separator regular expression; a <space> by default.
NF
- The number of fields in the current record. Inside a BEGIN action, the use of NF is undefined unless a getline function without a var argument is executed previously. Inside an END action, NF shall retain the value it had for the last record read, unless a subsequent, redirected, getline function without a var argument is performed prior to entering the END action.
NR
- The ordinal number of the current record from the start of input. Inside a BEGIN action the value shall be zero. Inside an END action the value shall be the number of the last record processed.
OFMT
- The printf format for converting numbers to strings in output statements (see Output Statements); "%.6g" by default. The result of the conversion is unspecified if the value of OFMT is not a floating-point format specification.
OFS
- The print statement output field separation; <space> by default.
ORS
- The print statement output record separator; a <newline> by default.
RLENGTH
- The length of the string matched by the match function.
RS
- The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
RSTART
- The starting position of the string matched by the match function, numbering from 1. This shall always be equivalent to the return value of the match function.
SUBSEP
- The subscript separator string for multi-dimensional arrays; the default value is implementation-defined.
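A quick sketch showing a few of these variables together — NR (record number), NF (field count), and $NF (the last field); the input is invented:

```shell
printf 'a b\nc d e\n' | awk '{ print "record " NR " has " NF " fields; last = " $NF }'
# prints:
# record 1 has 2 fields; last = b
# record 2 has 3 fields; last = e
```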
Using Bash Variables from Awk
You can use Bash variables from Awk, but not directly. For example, the following WILL NOT work:
export x="stuff"
awk '{print $x}'
Instead, you have two options: pass the variable to Awk, or print it directly.
To pass the variable to awk, use the -v flag [1]:
root="/webroot"
echo | awk -v r=$root '{ print "shell root value - " r}'
(The echo | is required because awk needs some input before it will run the action.)
Alternatively, print it directly into the awk statement [2]:
var="BASH"
echo "unix scripting" | awk '{gsub(/unix/,"'"${var}"'"); print}'
or, more formally:
var="BASH"
echo "unix scripting" | awk '{gsub(/unix/,"'"$(echo ${var})"'"); print}'
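A third option is the ENVIRON built-in array (described in the variable list above): exported shell variables are visible inside awk without any quoting tricks:

```shell
# Any exported variable shows up as an entry in the ENVIRON array.
export root="/webroot"
echo | awk '{ print "shell root value - " ENVIRON["root"] }'
# prints: shell root value - /webroot
```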
Examples
Complex Parsing of Tokenized List
I had a list of movies, along with dates and times; it looked like this:
Rashomon                          08/05/10 00:33:57
The Men Who Stare at Goats        08/04/10 01:28:21
Young Sherlock Holmes             08/03/10 01:20:26
The Girl with the Dragon Tattoo   08/02/10 02:19:57
Paycheck                          08/01/10 01:53:20
I wanted to move the list into a spreadsheet program, so that I could easily deal with the data. So I needed three columns: one for the movie name, one for the date, and one for the time.
This means I needed an awk program that would print out the movie title, then a tab, then the date, then a tab, then the time, and put it all into a file called "NiceList.txt", which I could then import or copy-and-paste into a spreadsheet.
I was able to whip this up in about 5 minutes, with very limited awk knowledge - so if you need to do something complex with awk, all it takes is a few minutes and a few examples to work with (see the Resources section below).
The program is:
awk '{ time=NF; date=time-1; numwords=date-1;
for(i=1; i<=numwords; i++) printf("%s ",$i);
printf("\t%s",$date); printf("\t%s",$time); print "" }' file > NiceList.txt
In awk terms, the list had multiple records; each record had multiple fields; and I could not predict the number of fields in each record, nor could I predict how many spaces separated the movie name from the time (in some cases it was 5, in other cases it was 1).
Let's walk through this.
time=NF;
date=time-1;
numwords=date-1;
This awk statement creates 3 variables; the first is the "field number" of the time, which is NF - the number of fields in the line. In other words, the time is always the last field on the line.
The date is always the next-to-last field on the line.
And the number of words in the movie title is equal to the number of fields, minus 2.
for(i=1; i<=numwords; i++) printf("%s ",$i);
This loops through the number of words, and prints each word, separated by a space.
printf("\t%s",$date); printf("\t%s",$time); print ""
This portion prints a tab, then the date, then a tab, then the time. It finishes by printing a new line. Awk will then repeat this process for the next record.
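Putting the pieces together, here is the same program run on two of the sample lines:

```shell
printf 'Paycheck 08/01/10 01:53:20\nThe Men Who Stare at Goats 08/04/10 01:28:21\n' | \
awk '{ time=NF; date=time-1; numwords=date-1;
       for(i=1; i<=numwords; i++) printf("%s ",$i);
       printf("\t%s",$date); printf("\t%s",$time); print "" }'
# Each output line is: title, tab, date, tab, time.
```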
Then, I can either import or copy-and-paste "NiceList.txt" into any spreadsheet program or Google Docs, and it will make a 3-column spreadsheet; 1 column with the movie name, 1 column with the date, and 1 column with the time.
Renaming Files, If Names Not Duplicates
I wanted to transform some filenames in a directory using a sed script (see Sed#Renaming_files). However, the sed script had already been run, and some of the files had transformed filenames, while there were also new files that the sed script would need to rename.
Unfortunately, if sed did not change the filename and the unchanged name was passed to the mv command, it would result in error messages:
mv: `i059_j072_k072' and `i059_j072_k072' are the same file
mv i042_j072_k072 i042_j072_k072
mv: `i042_j072_k072' and `i042_j072_k072' are the same file
mv i018_j072_k072 i018_j072_k072
mv: `i018_j072_k072' and `i018_j072_k072' are the same file
mv i026_j072_k072 i026_j072_k072
mv: `i026_j072_k072' and `i026_j072_k072' are the same file
mv i016_j072_k072 i016_j072_k072
mv: `i016_j072_k072' and `i016_j072_k072' are the same file
mv i142_j072_k072 i142_j072_k072
mv: `i142_j072_k072' and `i142_j072_k072' are the same file
mv i129_j072_k072 i129_j072_k072
mv: `i129_j072_k072' and `i129_j072_k072' are the same file
mv i135_j072_k072 i135_j072_k072
mv: `i135_j072_k072' and `i135_j072_k072' are the same file
mv i125_j072_k072 i125_j072_k072
mv: `i125_j072_k072' and `i125_j072_k072' are the same file
mv i127_j072_k072 i127_j072_k072
mv: `i127_j072_k072' and `i127_j072_k072' are the same file
mv i119_j072_k072 i119_j072_k072
mv: `i119_j072_k072' and `i119_j072_k072' are the same file
mv i114_j072_k072 i114_j072_k072
mv: `i114_j072_k072' and `i114_j072_k072' are the same file
mv i100_j072_k072 i100_j072_k072
mv: `i100_j072_k072' and `i100_j072_k072' are the same file
I don't want to see this. So, I needed to add a check to make sure the two filenames were not duplicates. I used Awk to check if the filenames were duplicates. The Awk script I used was:
$ awk '{ if ( $1 == $2 ) print ""; else print $0; }' | xargs -n2 -t mv
Here's its use in an actual example:
$ echo "filename1 filename2" | awk '{ if ( $1 == $2 ) print ""; else print $0; }' | xargs -r -n2 -t mv
mv filename1 filename2
$ echo "filename2 filename2" | awk '{ if ( $1 == $2 ) print ""; else print $0; }' | xargs -r -n2 -t mv
(Note that the -r flag for xargs tells it not to run the command if there are no input arguments; that way it won't run the "mv" command with no arguments.)
So the first command, which has two unique arguments, executes the move command. The second command, which has duplicate arguments, does not execute the move command.
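As an aside, the same filter can be written more tersely: in awk, a pattern with no action prints the matching record, so awk '$1 != $2' drops the duplicate pairs (and avoids emitting blank lines for them):

```shell
# Only lines whose two fields differ are printed.
printf 'filename1 filename2\nfilename2 filename2\n' | awk '$1 != $2'
# prints: filename1 filename2
```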
In the case that the arguments being fed to awk are coming from sed, they are going to be split up into one line each. For this reason, they should first be piped through xargs -n2, so that output from sed that looks like this:
aaa
bbb
ccc
ddd
can be fed to awk like this:
aaa bbb
ccc ddd
The final script looks like this:
#!/bin/sh
ls -1c i* | /bin/sed \
    -e 'p' \
    -e 's/i\([0-9]\{1\}\)_/i00\1_/' \
    -e 's/i\([0-9]\{2\}\)_/i0\1_/' \
    -e 's/j\([0-9]\{1\}\)_/j00\1_/' \
    -e 's/j\([0-9]\{2\}\)_/j0\1_/' \
    -e 's/k\([0-9]\{1\}\)$/k00\1/' \
    -e 's/k\([0-9]\{2\}\)$/k0\1/' \
    | xargs -n2 \
    | awk '{ if( $1 == $2 ) print ""; else print $0 }' \
    | xargs -r -n2 -t mv
Getting Last Column (NF and NR)
If you want to get the last column in a list of columns printed out, you can use AWK to do that; you just need a way to access the last field on each line. Normally you access fields using $0 for the whole record, $1 for the first field, $2 for the second field, etc.
To get the last field, you need the number of fields. That count is stored in the built-in variable NF, so $NF refers to the last field.
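A minimal sketch (input invented):

```shell
# $NF is the last field; NF by itself is the field count.
echo "first middle last" | awk '{ print $NF }'
# prints: last
echo "first middle last" | awk '{ print NF }'
# prints: 3
```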
Here's an example: I am printing a list of docker containers, and want to run the docker rm command on each of the names in the last column:
$ docker ps -a
CONTAINER ID        IMAGE                           COMMAND                   CREATED             STATUS                       PORTS               NAMES
c1facefc83d0        busybox                         "touch /icanwrite/..."    9 minutes ago       Exited (0) 9 minutes ago                         eloquent_swartz
f863e355c442        busybox                         "touch /canttouchthis"    10 minutes ago      Exited (1) 10 minutes ago                        clever_kirch
e499e952b0da        busybox                         "touch /icanwrite/..."    10 minutes ago      Exited (0) 10 minutes ago                        friendly_leakey
1ac270b92af7        busybox                         "touch /icanwrite/..."    10 minutes ago      Exited (0) 10 minutes ago                        stupefied_bhabha
692e8d14352f        busybox                         "touch /icanwrite/..."    11 minutes ago      Exited (0) 10 minutes ago                        youthful_beaver
eaa73ee697a2        ubuntu                          "touch /icanwrite/..."    11 minutes ago      Exited (0) 11 minutes ago                        lucid_morse
3ea65b1b3e8d        waleedka/modern-deep-learning   "/bin/bash"               19 hours ago        Exited (130) 19 hours ago                        heuristic_banach
60bdbe518f79        waleedka/modern-deep-learning   "/bin/bash"               19 hours ago        Exited (0) 19 hours ago                          festive_visvesvaraya
This can be passed to awk:
$ docker ps -a | awk '{print $NF}'
NAMES
eloquent_swartz
clever_kirch
friendly_leakey
stupefied_bhabha
youthful_beaver
lucid_morse
heuristic_banach
festive_visvesvaraya
Now we need a way to tell awk to skip the first line. The current record number is available in the NR variable (it starts at 1, not 0), so we can add a condition in front of our action requiring the record number to be greater than 1:
$ docker ps -a | awk 'NR>1 {print $NF}'
eloquent_swartz
clever_kirch
friendly_leakey
stupefied_bhabha
youthful_beaver
lucid_morse
heuristic_banach
festive_visvesvaraya
This skips the first row. Now we can pass the output of our awk program to xargs to run a command on each of the input arguments:
$ docker ps -a | awk 'NR>1 {print $NF}' | xargs -t -n1 docker rm
docker rm eloquent_swartz
eloquent_swartz
docker rm clever_kirch
clever_kirch
docker rm friendly_leakey
friendly_leakey
docker rm stupefied_bhabha
stupefied_bhabha
docker rm youthful_beaver
youthful_beaver
docker rm lucid_morse
lucid_morse
docker rm heuristic_banach
heuristic_banach
docker rm festive_visvesvaraya
festive_visvesvaraya
Bingo! There's your awk one-liner.
Useful Resources
I highly recommend reading the awk man page.
There is also a page with a very nice collection of awk one-liners.
The one-liners can help if you aren't quite sure how to translate the stuff in the awk man page into an awk program. It is also probable that you'll find a program that you can tweak a little bit to do what you want.