awk - pattern scanning and processing language

awk is a programmable text manipulation system.
When you call awk you specify an awk program it is to execute and the files it is to process. The actions defined in the program are then performed on the basis of the specified files. awk does not alter its input files. The results of the actions it performs are by default written on standard output.

awk offers the following advantages over text manipulation programs such as egrep and sed:

  • awk operates on one record at a time. As with egrep and sed, an input record is defined as one line by default; but with awk you can change this setting and define some other unit of text as the record.

  • Each input record is split into fields which can be accessed individually.

  • A pattern (selection criterion) may be a condition defined by the logical combination of extended regular expressions and relational operators.

  • You can program any actions that you require. awk is a high-level C-like programming language.

A detailed description of awk is provided below in the following sections:

Syntax


Format 1: awk[ -F ERE]
   [ -v initialization]... prog[ initialization]...[ file]...
Format 2: awk[ -F ERE]
   [ -f progfile][ -v initialization]...[ initialization]...[ file]

-F ERE

Defines the field separator character for the input record (input field separator).

ERE

Extended regular expression that defines a character to be interpreted as the input field separator. Separators do not form part of the fields.


To be able to use t as the input field separator, you must specify it as follows on the awk command line or in the BEGIN section of the awk program:
awk -F"[t]"... or BEGIN {FS="t"...}


-F ERE not specified:

Blanks and tabs act as field separators.

-v initialization

Assignments in the form var=value.
The var variable which appears in the program is initialized to value.

var

Name of the variable to be initialized.

value

Initial value to be assigned to var. value can be defined in exactly the same way as an environment variable on shell level.

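An assignment made with -v initialization takes effect before the BEGIN actions are executed; an assignment made with the initialization operand (see below) takes effect only when awk reaches it in the argument list. In all other respects the two forms are equivalent.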

prog

awk program argument.

Possible forms for prog are:

'awk_program', i.e. an awk program written on the command line, or

-f progfile , i.e. the name of a file containing an awk program.

'awk_program'

An awk program written on the command line.

You should always enclose the awk program in single quotes in order to prevent the shell from interpreting metacharacters. If the program is more than one line long, you must escape the newline character with a backslash.

Example

Output all lines in the input file whose third field has the numeric value 0 (for example 0 or 0.0):

$ awk '$3 == 0' input

-f progfile

The awk program is located in the file named progfile.

initialization

Assignments in the form: var=value

The var variable (whether it appears in the awk program or not) is initialized to value. initialization and file may be specified in any order. The assignment is made at the time when the named file is opened.
Thus, an assignment before the first file argument will be executed after the BEGIN actions (if any), while an assignment after the last file argument will occur before the END actions (if any).

Exception
The $ variables (see Basic elements) cannot be initialized in this way.

var

Name of the variable to be initialized. The name must not begin with $.

value

Initial value to be assigned to var. value can be defined in exactly the same way as an environment variable on shell level.
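
The timing rules for the two kinds of assignment can be tried out with a minimal sketch like the following. The variable names early and late are made up for this illustration, and the dash operand stands for standard input:

```shell
# early is set with -v; late is set by operands placed before and after the file operand "-".
result=$(printf 'one\ntwo\n' |
  awk -v early=set '
    BEGIN { print "BEGIN:  early=" early " late=" late }
          { print "record: late=" late }
    END   { print "END:    late=" late }
  ' late=1 - late=2)
printf '%s\n' "$result"
```

In the BEGIN action only early has a value. The operand late=1 is processed just before standard input is read, and late=2 is processed after the last file operand, so it is visible in the END action.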

file

Name of the text file to be processed. You may list more than one file if you wish. Files are read in the order in which they are listed. If file is a dash (-), awk reads from standard input.

file not specified:
awk reads from standard input. awk reads input one record at a time, processes it, and after each line outputs the result for that record. Hitting CTRL+D or @@d terminates your input.


Typical awk applications

awk is a tool which makes text manipulation tasks easy to accomplish. Typical applications for awk include:

  • selectively extracting data from files

  • checking the contents of files

  • performing calculations on the data in a file

  • changing the format of input data.

Using four simple examples, this section demonstrates how awk can be used.

Example

A file called supplies contains a list of office supplies. It includes the name of each article, along with its quantity and unit price:

Pencil      100     0.60
Table         5   345.00
Lamp         20    79.80
Paper        75     1.00
Diskette   1000     2.40
Envelope   1500     0.20


Example 1

Select all articles with a quantity greater than 100:

$ awk '$2 > 100 {print}' supplies

Diskette   1000     2.40

Envelope   1500     0.20

With $2 you access the second field of a line, which in this case is the quantity of each article. If the quantity is greater than 100, the condition is fulfilled, and the print function is executed. Since no arguments were specified for print, the whole line is output.


Example 2

Calculate the total price for all articles with a quantity greater than 100 and print this total along with the article name:

$ awk '$2 > 100 {print $1 "\t" $2*$3}' supplies

Diskette        2400

Envelope        300

Three expressions are passed to the print function in this example; awk concatenates them into a single output string. The following is output:

$1       article name (first field)
\t       tab character
$2*$3    quantity (second field) times unit price (third field)


Example 3

Include a heading in the output:

$ awk 'BEGIN      {print "Article \tTotal"}

>        $2 > 100 {print $1 "\t" $2*$3}' supplies

Article         Total
Diskette        2400

Envelope        300

This example illustrates the use of the BEGIN pattern. awk executes the action after BEGIN only once, i.e. when the program is started. The heading is therefore printed only once at the beginning.


Example 4

Print a grand total of all amounts at the end. For this purpose we use a variable called sum, which is initialized to zero in the BEGIN pattern. The product of column 2 and column 3 is calculated for each line, and all the products are summed up:

$ awk 'BEGIN      {sum=0; print "Article \tTotal"}

>        $2 > 100 {print $1 "\t" $2*$3; sum += $2*$3}

>        END        {print "\nGrand total: " sum} ' supplies

Article         Total

Diskette        2400
Envelope        300

Grand total: 2700

This example demonstrates the use of the END pattern. awk executes the action after END only once, i.e. before termination of the program. The grand total of all subtotals is therefore printed just once at the end.


Structure of an awk program

An awk program can consist of a BEGIN section, a main section, and an END section, structured as shown below:

Syntax
BEGIN section:
[ BEGIN {action} ]

Main section:
[ [pattern] {action}
| pattern [{action}]
| function_definition
  ...                  ]

END section:
[ END {action} ]

pattern

The pattern indicates which data is to be selected from the input files (see “Patterns”).

action

The action indicates what to do with data that matches the pattern (see “Actions”).

function_definition

A function_definition enables you to define your own functions (see “Functions”).

At least one of the three elements (pattern, action or function definition) must be present.

In a pattern {action} pair, either the pattern or the action can be omitted. If the action is omitted, each line that matches the pattern is output; omitting the pattern causes the action to be performed on all lines.
The definition of a user-defined function may appear at any position in the main section.
Each of the following must be located at the start of a line (following any number of blanks or tabs):

  • the BEGIN section

  • the [pattern]{action} and pattern [{action}] pairs

  • the function definitions

  • the END section

Operation of the awk command

awk executes the awk program that is specified by the user, proceeding in the following sequence:

  1. Initial processing
    The first step performed by awk is to initialize any variables that may have been defined. If there is a BEGIN section including an action, awk then executes the action specified there. The action in the BEGIN section is executed just once, before the first line is processed.

  2. File processing
    Next awk processes the specified input files by reading the input records sequentially. For each input record, awk tries to match each pattern in the order that is specified in the awk program. If a pattern is matched, i.e. the selection criterion is fulfilled, the associated action is performed.
    If no pattern is specified for an action, awk performs the action for every record.
    If no action is specified for a pattern, the default action is to output (print) the record. Multiple input files are processed in the specified order.

  3. Final processing
    When all the specified files have been processed, awk performs the action in the END section, if one has been included. awk then exits.
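
The three processing steps can be observed with a small sketch such as the one below. The sample input and the variable name count are assumptions chosen for this illustration:

```shell
result=$(printf 'alpha\nbeta\ngamma\n' | awk '
  BEGIN    { print "start" }             # initial processing: runs once, before any input
  /^[ab]/                                # pattern without action: matching records are printed
           { count++ }                   # action without pattern: runs for every record
  END      { print "records: " count }   # final processing: runs once, after all input
')
printf '%s\n' "$result"
```

The output is start, then the two records that begin with a or b, then the total number of records read (3).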


The input file

An input file consists of records that are subdivided into fields.

Records

Records are separated by a record separator. The record separator does not form part of a record. By default, a record is one line, and the record separator is the newline character. However, you do have the option of changing this setup by assigning any single character to the special variable RS (Record Separator). If you specify a string of characters as a value for RS, only the first character will be taken into account. The ordinal number of the current record is available in the variable NR (Number of Record). If there is more than one input file, NR counts from the start of the first file to the end of the last one. The special variable $0 addresses the whole of the current record. Further information on variables is provided in the section “Basic elements of the awk language”.
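
As a minimal sketch of a non-default record separator, the following call assigns a semicolon to RS in the BEGIN section. The sample input is an assumption made for this illustration:

```shell
# Three records separated by semicolons instead of newlines.
result=$(printf 'a b;c d;e f' |
  awk 'BEGIN { RS = ";" } { print NR ": " $0 " (" NF " fields)" }')
printf '%s\n' "$result"
```

Each record is printed without the separator, preceded by its ordinal number from NR.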

Fields

Each record is split into fields separated by one or more field separators. The default field separator is white space (any sequence of tabs and blanks), but you do have the option of changing this by assigning any other character to the special variable FS (Field Separator). You can make this assignment either in the awk program or by using option -F on the command line. The value assigned to FS is interpreted as an extended regular expression (see section “Regular POSIX shell expressions”).


Example 1

To define the characters x and y as alternate field separators:


syntax on the awk command line: -F'[xy]'

syntax in the awk program: FS="[xy]"


Example 2

To define the field separator as one or more occurrences of the character x:


syntax on the awk command line: -F'x+'
syntax in the awk program: FS="x+"


The default setting (any sequence of blanks and tabs) can be expressed by the regular expression [ \t]+, where ' ' stands for a blank, and \t represents a tab.

Note that the newline character is always interpreted as a field separator, regardless of the value assigned to FS!

The number of fields in the current record is stored in the variable NF (Number of Fields). Individual fields of the current record are addressed by the predefined variables $1, $2, to $NF. Further information on variables is provided in the section “Basic elements of the awk language”.
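
The effect of the two separator definitions from examples 1 and 2 can be compared with a short sketch. The input string 1xx2 and the shell variable names are assumptions made for this illustration:

```shell
data='1xx2'
# [xy]: every single x or y separates fields, so the two x characters enclose an empty field.
single=$(printf '%s\n' "$data" | awk -F'[xy]' '{ print NF }')
# x+: a run of x characters counts as one separator.
plus=$(printf '%s\n' "$data" | awk -F'x+' '{ print NF }')
# The same separator assigned inside the program; $2 is the second field.
in_prog=$(printf '%s\n' "$data" | awk 'BEGIN { FS = "x+" } { print $2 }')
printf '%s %s %s\n' "$single" "$plus" "$in_prog"
```

With [xy] the record has three fields; with x+ it has two, and the second field is 2.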

Example

Default setup

Field 1  Field 2   ...         Field 5 ...
This     is        the first   record             <--- Record 1
and      this      is the      second record.     <--- Record 2


Customized setup: RS="%"; FS=":";

Field 1   Field 2       Field 3
%Name  : Address      : Phone number      <--- Record 1
%SNI AG :81730 Munich : 089-636-1         <--- Record 2

Rules for record and field separators

  • Default settings for record separators

    • The default record separator is the newline character.

    • If the null string is assigned to RS (RS=""), the file is treated as a single record. If several files are specified, each file will consist of a single record (which means that the ultimate value of NR will be equal to the number of files).

  • Default settings for field separators

    • If the record separator is newline, the field separator defaults to blanks and tabs.

    • If the record separator is not a newline, the newline character always counts as a field separator, regardless of which character has been explicitly defined as the field separator (see Fields, example 2).

    • If you explicitly assign a blank to FS, either with -F" " on the awk command line or by using the assignment FS=" ", then blanks and tab characters are treated as field separators.

    • On the other hand, if you explicitly assign the tab character to FS (FS="\t"), then only the tab character is treated as the field separator and not the blank.

  • Leading field separators and field separator strings

    • The following applies to blanks, tabs and newlines as field separators:

    • Leading field separators are ignored.

    • Multiple occurrences of a field separator are treated as a single field separator (see example 9).

    • For all other field separators, leading field separators are counted. In multiple occurrences of a field separator, each character is counted separately. Thus two consecutive field separators are deemed to have an empty field between them (see example 10).

  • Changing separators:

    If you need a number of different record separators in one file, you can change RS within the awk program. The new record separator comes into effect as soon as the assignment to RS has been implemented. Similarly, you can change FS within the awk program, should you require a number of different field separators in one file. The new field separator comes into effect as soon as the assignment to FS has been implemented.
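
The rules for leading and repeated field separators can be checked with a sketch like the following. The sample records and the shell variable names are assumptions made for this illustration:

```shell
# Default separator: leading blanks are ignored and runs of blanks count once.
blanks=$(printf '  a   b\n' | awk '{ print NF }')
# Colon as separator: every colon counts, so empty fields appear.
colons=$(printf '::a:::b\n' | awk -F: '{ print NF }')
# Tab as separator: the blank is then part of the field.
tabs=$(printf 'a b\tc d\n' | awk 'BEGIN { FS = "\t" } { print NF ": " $1 }')
printf '%s\n' "$blanks" "$colons" "$tabs"
```

The first record has 2 fields, the second has 6 (four of them empty), and the third has 2 fields, the first of which is "a b".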

Special variables for the input file

The following list shows all special awk variables pertaining to the input file and the corresponding values awk usually assigns to these variables.

FILENAME

Name of the current input file, - for standard input

FS

Input field separator (default: any sequence of blanks and tabs)

NF

Number of fields in the current record

NR

Ordinal number of the current record from start of input

FNR

Ordinal number of the current record in the current file

RS

Input record separator (default: newline)

$0

Current record

$1

First field of the current record

$2

Second field of the current record

...

$NF

Last field of the current record

You can change these variables within an awk program if you wish. This does not alter the input file. Further information on variables is provided in the section “Basic elements of the awk language”.



Basic elements of the awk language

This section describes the syntax of the basic elements of the awk language. You will need these elements in order to define pattern and action pairs.

Comments

You can include comments in an awk program, as in a shell script. A comment begins with the # character and continues till the end of the line.

Constants

There are two types of constant:

number

A number (numeric constant) is a signed or unsigned integer or floating point number. awk does not check its format. If your number contains invalid characters, awk attempts to filter out a valid part and ignores the rest.

integer

An integer is a sequence of digits from 0 to 9.

floating point number

A floating point number consists of a mantissa with or without an exponent.
The mantissa comprises an integer with or without a fractional part.
The fractional part is represented by a radix character and an integer.

string

A string (alphanumeric constant) is a sequence of characters, enclosed in double quotes "...". If the double quotes are omitted, awk will interpret the string as a variable name, a number, or an operator.

character

A single character is also enclosed in double quotes "..." in order to prevent awk interpreting the character as a variable name. A character may be a displayable character from the character set which is currently in use (see section “EDF04 character set”) or one of the following special characters as represented in C:


\"    for "
\\    for \
\a    for bell character
\n    for newline character
\t    for tab character
\v    for vertical tab
\b    for backspace
\r    for carriage return
\f    for page feed


Variables

awk allows you to use simple variables and arrays to store values.
The special variables are predefined; others can be defined by the user.

Name of a variable

The name of a user-defined variable can be any string made up of underscores (_), uppercase and lowercase letters and digits, beginning with a letter or an underscore.

Data type

Variables do not have a data type. You can thus assign either a number or a string to any variable. If the context is clearly numeric, variables are treated as numeric; otherwise, they default to alphanumeric.


Example:


x = "Miller";    # Variable x contains the string Miller
x = "3"+4 ;      # Variable x has a value of 7
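
The way the context decides between numeric and alphanumeric treatment can be sketched as follows. The variable names x, y and z are arbitrary:

```shell
result=$(awk 'BEGIN {
  x = "3" + 4      # numeric context: the string 3 is converted to a number
  y = "3" "4"      # concatenation: the result is the string 34
  z = y + 1        # numeric context again: 34 + 1
  print x, y, z
}')
printf '%s\n' "$result"
```

This prints 7 34 35.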


Declaration

awk variables do not need to be explicitly declared. User-defined variables are automatically declared the first time they are used.

Initialization

Special variables are initialized to predefined values by awk. Depending on the context, user-defined variables are initialized by awk to the null string or to 0 by default. If you wish, you can specify other initial values when you call awk.

Exceptions:

When i>NF, $i will not always be the null string.

$ variables cannot be initialized on the command line.
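
A minimal sketch of default and explicit initialization is shown below. The variable name u is made up for this illustration; in the first call it is never assigned, in the second call it is set with -v:

```shell
unset_case=$(awk 'BEGIN {
  is_zero  = (u == 0)      # an unassigned variable compares equal to 0 ...
  is_empty = (u == "")     # ... and also to the null string
  print is_zero, is_empty, length(u), u + 1
}')
preset_case=$(awk -v u=5 'BEGIN { print u + 1 }')
printf '%s\n' "$unset_case" "$preset_case"
```

The first call prints 1 1 0 1, the second prints 6.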

Special variables

awk recognizes the special variables shown in the list below. The values awk usually assigns to these variables are indicated in the list. New values may be assigned to the variables by the user.

ARGC

Number of elements in the array ARGV

ARGV

Array holding the command line arguments (excluding options and the prog argument), numbered from 0 to ARGC-1

ENVIRON

Array holding the values of environment variables, where the indexes are the names of the variables

FILENAME

Name of the current input file, - for standard input

FS

Input field separator (default: any sequence of blanks and tabs)

NF

Number of fields in the current record

NR

Ordinal number of the current record from start of input

FNR

Ordinal number of the current record in the current file

OFS

Output field separator (default: one blank)

ORS

Output record separator (default: newline)

OFMT

Output format for floating point numbers (see printf - Formatted output)
(default: %.6g, i.e. up to 6 significant digits)

RS

Input record separator (default: newline)

RLENGTH

Length of the string matched by the match function

RSTART

Starting position of the string matched by the match function. Numbering begins with 1. This value always corresponds to the value returned by the match function.

SUBSEP

Subscript string separator for multi-dimensional arrays. The default setting is \034.

$0

Current record

$n

Field n of the current record

$NF

Last field of the current record
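
As an illustration of RSTART and RLENGTH, the following sketch calls the predefined functions match and substr on a sample string. The string and the variable names s and pos are assumptions made for this example:

```shell
result=$(awk 'BEGIN {
  s = "order 4711 shipped"
  pos = match(s, /[0-9]+/)      # sets RSTART and RLENGTH as a side effect
  print pos, RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
}')
printf '%s\n' "$result"
```

The digits start at position 7 and are 4 characters long, so the call prints 7 7 4 4711.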


What is the effect of changing special variables?


Example 1

The assignment

$1 = "new";

assigns the string new to $1. This changes only the copy of the current record that awk holds in memory ($0 is rebuilt from the fields, joined by OFS); the input file itself is not altered.

This also applies to the following awk settings relating to the input file:

  1. The current input file does not change when you assign a new name to FILENAME.

  2. When you assign a value to a variable $i where i>NF, NF is assigned the value i.

  3. If you assign a new value to NR, you only alter the number assigned to the current line; you do not move to a different line.


Example 2

The contents of $0 remain the same even if NR is modified:

{print NR, $0; NR=NR+34; print NR, $0}

A typical output would then be:

10 This is the tenth line

44 This is the tenth line


When you assign a new value to a variable, its old value is deleted. Thus, if you change NF, for example, the information on the number of fields in the current record is lost.


Peculiarity of $ variables:
You can specify the number of a $ variable as a constant or as an expression which evaluates to the number.


Example 3

You can use $(NF-1) to access the second-last field.
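
The behaviour of the $ variables can be followed step by step in a sketch like the one below. The input record and the choice of a hyphen for OFS are assumptions that make the rebuilt record easy to recognize; the input itself is never modified, only the record held by awk:

```shell
result=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } {
  print $(NF-1)        # second-last field
  $1 = "new"           # $0 is rebuilt with OFS between the fields
  print $0
  $5 = "e"             # assigning beyond NF raises NF to 5; $4 is empty
  print NF ": " $0
}')
printf '%s\n' "$result"
```

The three print statements output b, new-b-c and 5: new-b-c--e.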

Array

An array is a set of constants or variables.

An array element is addressed as follows:


array_name[index]


array_name

Name of a variable.

index

A simple variable.
The index may be numeric or alphanumeric. The index you specify can therefore be a number, a string, or an expression that evaluates to an index value.

awk provides two special types of arrays:

  • Dynamic arrays
    Arrays, like simple variables, do not need to be declared. Above all, there is no need to define dimensions. New array elements are created automatically as and when required.

  • Associative arrays
    Individual array elements can be accessed via an alphanumeric index.
    A special control-flow statement is provided in order to process all elements of an associative array:

    for (index in array) statement

    index assumes the index values present to this point in random order, and the specified statement is executed once for each array element (see control-flow statement for).


Example

A file called expenses contains various expenses incurred. For each item of expenditure the file shows the date, month, amount, and a brief description, with a colon to separate them. For example:

01:January:   40.78:Supplies
05:January: 6789.00:Laser printer
23:March:    240.32:Lamps
11:January:  478.00:Chairs
01:February:  45.00:Journals

Using an associative array you can easily calculate total expenditure for each month from the data in this file. The program in the example uses an array called mexpenses and the names of the months as an alphanumeric index. For each line, the expenses in the third field ($3) are summed up to produce total expenditure for each month appearing in the second field ($2).


$ awk 'BEGIN {FS=":"}

>      {mexpenses[$2] += $3;}

>      END {for (i in mexpenses) print "Total spent in",\
>           i, mexpenses[i]    } ' expenses

Total spent in January 7307.78

Total spent in February 45

Total spent in March 240.32


Expressions

An expression can be any of the following:

constant
variable
function_call
un_op expression
expression bin_op expression
(expression)
expression ? expression : expression

constant

Numeric or alphanumeric constant (see “Basic elements of the awk language”).

variable

Variable (see “Basic elements of the awk language”).

function_call

Invocation of a predefined function (see “Functions”).

expression

Expression.

un_op

Unary operator (see “awk operators”).

bin_op

Binary operator (see “awk operators”).

Expressions are evaluated and return a value. They may appear both in patterns and in actions.

awk operators

awk recognizes all C operators plus the operators for pattern matching and string concatenation.
The following list shows all awk operators in ascending order of precedence. Operators in the same line have the same precedence.


=                    assignment operator
+= -= *= /= %= ^=    compound assignment operators as in C
||                   logical OR
&&                   logical AND
~ !~                 pattern matching operators
> >= < <= != ==      relational operators
operand list         concatenation
+ -                  plus, minus
* / %                multiply, divide, remainder
!                    logical NOT
^ **                 exponent
++ --                increment, decrement

Evaluation of expressions

Since no data type is prescribed for the operands, you can freely mix numeric and alphanumeric constants. awk determines from the context whether a numeric or alphanumeric operation is required.
Please note that, as in C, there are no special truth values. Like C, awk treats a value of 0 as false and a non-zero value as true. This means that any non-zero value as an argument of a logical operation is held to be true. If the result of a logical operation is true, it is represented as 1.


Example:

(2&&2)+3=4
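
The truth values can be checked with a short sketch. The variable names a to d are arbitrary:

```shell
result=$(awk 'BEGIN {
  a = (2 && 2)         # both operands are non-zero: true, represented as 1
  b = (2 && 2) + 3     # the truth value 1 is used as a number
  c = !5               # negation of a non-zero value: 0
  d = (0 || "")        # 0 and the null string are both false
  print a, b, c, d
}')
printf '%s\n' "$result"
```

This prints 1 4 0 0.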

Patterns

Patterns (selection criteria) are specified by the user as a means of indicating which data is to be selected from the input files. A pattern can have any of the following forms:

/regexp/
relexp 
matchexp
pattern_range
compound_pattern

/regexp/

Regular expression

awk supports extended regular expressions (see section “Regular POSIX shell expressions”). A regular expression is enclosed in slashes /.../.

Example:

A regular expression matching one or more occurrences of a, b or c:

/[abc]+/

relexp

relexp is an expression (see “Expressions”) featuring relational operators. The operators and their meanings are:


a > b     a greater than b?
a >= b    a greater than or equal to b?
a < b     a less than b?
a <= b    a less than or equal to b?
a == b    a equal to b?
a != b    a not equal to b?


Operands a and b are any expressions. If both operands are numeric, the comparison is numeric; if not, it is alphanumeric.
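
The difference between a numeric and an alphanumeric comparison can be seen in the following sketch. It assumes an input record with the fields 10 and 9; appending the null string to a field forces it to be treated as a string:

```shell
result=$(printf '10 9\n' | awk '{
  numeric = ($1 < $2)                 # both fields look like numbers: 10 < 9 is false
  textual = (($1 "") < ($2 ""))       # strings are compared character by character
  print numeric, textual
}')
printf '%s\n' "$result"
```

This prints 0 1: as numbers, 10 is not less than 9, but as strings, "10" sorts before "9".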


 matchexp

matchexp is an expression (see “Expressions”) featuring pattern matching operators. It involves the comparison of a regular expression (pattern) with a string. The pattern matching operators and their meanings are:


str ~ p     string str must match pattern p
str !~ p    string str must not match pattern p


Using matchexp as a pattern allows you to select individual fields.


Example:

Select all records with a first field starting with A or a:

$1 ~ /^[Aa]/

The regular expression ^[Aa] represents strings that begin with A or a. The first field of the record ($1) must match (~) the regular expression, i.e. begin with A or a.


pattern_range

A pattern range takes the form:

/ regexp /, / regexp /

Specifying a range causes the associated action to be executed for all records that lie within the range. The limits of the range (start and end) are defined by two regular expressions. The range begins with the first record containing a string that matches the first regular expression and ends with the first record containing a string that matches the second regular expression.


Example:

Select the range from the first line beginning with C to the first line beginning with K and output the first field of every line in the selected range:

/^C/, /^K/ {print $1}
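
A minimal runnable version of this range, using sample input that is assumed for the illustration:

```shell
result=$(printf 'Apple 1\nCherry 2\nDate 3\nKiwi 4\nLemon 5\n' |
  awk '/^C/, /^K/ { print $1 }')
printf '%s\n' "$result"
```

The range starts at the Cherry line and ends with the Kiwi line, so Cherry, Date and Kiwi are printed.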


compound_pattern

Logical operators (see Expressions) can be used to negate patterns and to combine several of them to form a single pattern. The logical operators and their meanings are:


!pat            Negation of pattern pat
pat1 || pat2    pat1 or pat2.
                The criterion is satisfied if pat1 or pat2 matches.
pat1 && pat2    pat1 and pat2.
                The criterion is satisfied if both pat1 and pat2 match.
(pat)           Parentheses


A compound condition is evaluated from left to right.



Example

Match all records that have an even number of fields and a letter between M (inclusive) and Q (exclusive) in the first field.


NF%2==0 && $1 >= "M" && $1 < "Q"


You can generally combine patterns in several ways in order to make the same selection. Thus, if the currently valid collating sequence defines the range [M-Q] as the uppercase letters M, N, O, P and Q, the above selection could also be made with pattern matching operators:


NF%2==0 && $1 ~ /^[MNOP]/


Since the first awk condition depends on the collating sequence of the currently valid character set, it may not return the same result in every case. The second awk line, by contrast, will always select only those records in which the first field begins with the letter M, N, O or P.
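
The second form of the compound pattern can be tried out as follows. The sample records are assumptions made for this illustration:

```shell
result=$(printf 'Miller x\nQuinn y\nNorth a b\nOtto z\n' |
  awk 'NF%2==0 && $1 ~ /^[MNOP]/')
printf '%s\n' "$result"
```

Only the Miller and Otto lines are printed: Quinn begins with Q, and the North line has an odd number of fields.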


Actions

Actions indicate what to do when a pattern is matched. An action will typically involve processing one of the selected files. An action has to begin in the same line as the associated pattern. If this is not possible, the newline character must be escaped with a backslash. Blanks and tabs between the action and the pattern are ignored. An action comprises one or more statements and must be enclosed in braces {...} as shown below:


{statement [;statement]...}


A statement can be any of the following:


expression
control_statement


expression

An expression is evaluated but is not put to any further use unless expression is in the form of an assignment, an increment or a decrement (see section “Expressions”).

control_statement

A control_statement allows you to control the flow of an awk program (see section “Control-flow statements”).


A single statement may be spread over several lines, in which case each line except the last must end with a backslash. The backslash escapes (cancels the effect of) the newline character.


Multiple statements

You can group together a number of statements within one pair of braces {}. Statements are delimited by means of:

  • a semicolon ;
  • a right brace }
  • a newline character.

Control-flow statements

Control-flow statements allow you to control the flow of an awk program. awk recognizes the following control-flow statements:


break              terminate a loop
continue           skip remainder of loop
exit               terminate the awk program
for                loop counter and looping an array
if                 conditional statement
next               skip to the next input record
while              execute iteratively
do                 execute iteratively
delete array[i]    delete element i of the named array
return x           return from a function with a value
return             return from a function without a value


The control-flow statements are described below in alphabetical order.

break - Terminate a loop

break can be used in the body of a for, while, or do loop. break causes an immediate exit from the enclosing loop.


Syntax


break


Example

While records continue to start with a dot, keep reading in the next record. Terminate the loop if the second field of the retrieved record is greater than 1000.

{ while($1 ~ /^\./)
    {
       if((getline) <= 0) break;    # stop at end of input or on a read error
       if($2 > 1000) break;
    }
}

continue - Skip remainder of loop

continue can be used in the body of a for, while or do loop. The continue statement causes the current iteration to be terminated and the next one to begin.


Syntax


continue


Example

Print even fields only:

{
   for(i=1; i<=NF; i++)
      {
        if(i%2) continue;    # odd field number: go on to the next iteration
        print $i
      }
}

do - Execute iteratively

The statement in a do loop (or do-while loop) is executed iteratively while a specified condition continues to be satisfied. In contrast to the while loop, the statement in a do loop is always executed at least once.


Syntax


do statement while (expression)


statement

Statement that is executed in each iteration of the loop. If several statements are to be executed, they have to be grouped together in braces ({ }) and separated by semicolons or linefeed characters.

expression

Expression (see “Expressions”) that specifies the condition.


Example

Print out the individual fields of a record:

{ i=0; do {print $(++i)} while (i < NF) }

exit - Terminate the awk program

exit terminates the awk program.

If an END section is present, awk executes the action specified in it; if not, the program is terminated immediately.


Syntax


exit


Example

If the commercial at symbol @ appears in the input, print the result and terminate processing:

...
/@/ {exit}
...
END {print ergebnis}    # ergebnis holds the result computed so far

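
A complete, runnable variant of this scheme might look like the sketch below. The input data and the variable name total are assumptions made for this illustration:

```shell
result=$(printf '10\n20\n@\n30\n' | awk '
  /@/ { exit }             # stop reading input; the END action is still executed
      { total += $1 }
  END { print total }
')
printf '%s\n' "$result"
```

Only the two records read before the @ line are summed, so the END action prints 30.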
for - Loop counter

The statement in a for loop is executed iteratively while a condition continues to be satisfied.


Syntax


for(expr1; expr2; expr3) statement


expr1

Expression (see “Expressions”).
expr1 is evaluated once at the start of the for statement. expr1 is often used to initialize incrementing variables.
Example: i=1

expr2

Expression (see “Expressions”).
expr2 is evaluated before each iteration. The specified statement is executed only if expr2 is non-zero (true); otherwise, the loop is terminated.
Example: i<10

expr3

Expression (see “Expressions”).
expr3 is evaluated after each iteration. When incrementing variables are used, expr3 increments the variable.
Example: i++

statement

Statement that is executed in each iteration of the loop. If several statements are to be executed, they have to be grouped together in braces {}.


Example

Print out the fields of the current record in reverse order.

{for(i=NF; i>0; i--) print $i}
for - Looping an array

This variant of the for statement is a special awk facility for the handling of arrays.


Syntax


for(index in array) statement


index

Variable (see Basic elements) that assumes each subscript (index) of array in turn, in an unspecified order. The subscripts can be numeric or alphanumeric.

array

Array to be processed.

statement

Statement to be executed for each array element. If several statements are to be executed, they have to be grouped together in braces { }.


Example

The array named month contains the number of days in each month. Each array element is subscripted with the name of the month, e.g.
month["January"]=31.
The following awk program prints the name of each month together with the number of days in it.

$ awk ' BEGIN { month["January"]=31;      \
>               month["February"]=28;     \
>               month["March"]=31;        \
>               month["April"]=30;        \
>               month["May"]=31;          \
>               month["June"]=30;         \
>               month["July"]=31;         \
>               month["August"]=31  }     \
>       END { for(i in month) print i,"has",month[i],"days" } '
May has 31 days
August has 31 days
July has 31 days
April has 30 days
June has 30 days
January has 31 days
March has 31 days
February has 28 days
if - Conditional statement

The statement in an if construct is executed if the specified condition is satisfied.


Syntax


if(expr) statement1 [else statement2]


expr

Expression (see “Expressions”) that defines the condition to be satisfied. If expr is non-zero (true), statement1 is executed.

statement1

Statement to be executed if expr is true. If several statements are to be executed, they have to be grouped together in braces { }.

statement2

Statement to be executed if expr is false. If several statements are to be executed, they have to be grouped together in braces { }.


Example

If field 1 is greater than field 2, fields 2 and 3 are printed; if not, fields 4 and 5 are printed:

{ if($1 > $2) print $2, $3; else print $4, $5 }

next - Skip to the next input record

The next statement causes awk to suspend processing of the current record; statements that follow next are not applied to the current record. awk then reads the next input record. NR, NF, FNR, $0, and $1 to $NF are reset.

Difference between next and the getline function:

getline sets the current record to the next one. Statements that follow getline are executed using the next record’s values for the $ variables and for NR, NF, and FNR.


Syntax


next


Example

Records that begin with a dot are ignored:

{ if ($1 ~/^\./) next }
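
The difference between next and getline can be made visible with a small sketch. The three input lines and the marker word skip are made up for illustration; the shell variable names are likewise only assumptions of this example.

```shell
# next: the remaining rules are skipped for the record containing "skip".
with_next=$(printf 'a\nskip\nb\n' |
    awk '/skip/ { next } { print "main rule: " $0 }')
echo "$with_next"

# getline: the statements after getline run on the record that follows "skip".
with_getline=$(printf 'a\nskip\nb\n' |
    awk '/skip/ { getline; print "after getline: " $0; next }
         { print "main rule: " $0 }')
echo "$with_getline"
```

In the first call the record skip simply disappears from the output. In the second call the statement after getline already sees the record b.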
while - Execute iteratively

The statement in a while loop is executed iteratively while a specified condition continues to be satisfied.


Syntax


while(expr) statement


expr

Expression (see “Expressions”) that specifies the condition.

statement

Statement that is executed in each iteration of the loop. If several statements are to be executed, they have to be grouped together in braces { }.


Example

Print all input fields, writing each field in a separate output line:

{ i = 1;
  while (i <= NF) {
      print $i
      i++
  }
}

Functions

awk provides a wide range of built-in functions and also offers you the option of defining functions of your own:


Syntax


function name(arg,...) {statements}


The {statements} may be preceded by a newline character. There may also be blank lines within the braces {...}. A function definition has the same precedence as pattern {action} pairs in the main section of an awk program.

Within an action section, function calls can be entered anywhere in an expression; a call may also precede the function definition in the program text. There must be no space between the function name and the left parenthesis when a function is called.
Nested and recursive function calls are legal.

Though most functions do not require you to enclose arguments in parentheses, it is a good practice to use them as a means of increasing program transparency. When you pass an array as an argument, a pointer to the array is passed (call by reference), which means that you can change the elements of the array from the function. In the case of scalar variables, the value of the variable is copied and passed (call by value), which means that changes made in the function do not affect the caller's variable. The scope of function arguments is restricted to the local function, whereas the scope of all other variables is always global. If you need a local variable in a function, define it at the end of the argument list in the function definition. Any variable in the argument list for which no current argument exists is a local variable that is initially uninitialized, i.e. it evaluates to 0 in a numeric context and to the empty string in a string context.

As in C, some functions return a result (e.g. exp), while others are procedural in character (e.g. output functions).

The return statement can be used with or without a return value or may be omitted entirely. In the latter case, the return value would be undefined if it were to be accessed.


Example

In the example below, the function named search looks for the string who in the array allnames and returns the index or -1. The third argument, incr, is used as a local variable.

   ...
{ print $1, search($1, allnames) }
   ...
function search(who, allnames, incr)
{
   for (incr=0; allnames[incr]; incr++)
      if (index(allnames[incr], who) == 1
          && length(allnames[incr]) == length(who))
             return incr
   return -1
}
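
The parameter passing rules can be checked with a minimal sketch. The function name modify and the variables a, n, val and tmp are hypothetical and serve only to illustrate call by reference for arrays, call by value for scalars, and a local variable declared as an extra parameter.

```shell
func_out=$(awk '
function modify(arr, val, tmp) {   # tmp is an extra parameter used as a local variable
    arr["k"] = "changed"           # array element: the change is visible to the caller
    val = 99                       # scalar: only the local copy is changed
    tmp = "local"                  # does not create or touch a global tmp
}
BEGIN {
    a["k"] = "original"; n = 1
    modify(a, n)
    print a["k"], n, (tmp == "" ? "tmp unset" : "tmp set")
}')
echo "$func_out"
```

The array element is changed by the call, while n keeps its value and no global variable tmp comes into existence.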


Built-in functions


Input function

getline    Read input record


Output functions

print([arg,...])            Standard output function
printf(format[,arg,...])    Formatted output


Arithmetic functions

atan2(y,x)    Arc tangent of y/x
cos(x)        Cosine
exp(x)        Exponential function
int(x)        Truncate to integer
log(x)        Natural logarithm
rand()        Return a random number
sin(x)        Sine
sqrt(x)       Square root
srand([x])    Set the seed (initial value) for rand()


String functions

gsub(re,repl[,instr])        Global substitution function
index(str1,str2)             Return first occurrence of substring
length([str])                Return length of string
match(str,re)                Check whether string str matches regular expression
split(str,array[,sep])       Subdivide string
sprintf(format,e1,e2,...)    Return formatted output as string
sub(re,repl[,instr])         Substitution function
substr(str,m[,n])            Define substring
tolower(s)                   Convert to lowercase
toupper(s)                   Convert to uppercase


General functions

close(expr)     Close file or pipe
system(expr)    Call shell command


The following section describes each of these functions in alphabetical order together with the associated arguments. The argument you specify can either be a constant or an expression (see “Expressions”). awk first evaluates the expression arguments and then applies the function to the computed results.

atan2 - Arc tangent

atan2 calculates the arc tangent of the quotient of two numbers. atan2(y,x) returns the arc tangent of y/x.


Syntax


atan2(y,x)


y,x

Numbers that produce the quotient for which the arc tangent is to be calculated.
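
As a quick illustration: atan2 returns its result in radians and takes the signs of both arguments into account, so atan2(0,-1) yields pi. The sketch below sets LC_ALL=C only so that the radix character is a dot regardless of the current locale; the shell variable name pi is an assumption of this example.

```shell
# atan2(0, -1) is the angle of the point (-1, 0), i.e. pi radians.
pi=$(LC_ALL=C awk 'BEGIN { printf "%.5f\n", atan2(0, -1) }')
echo "$pi"
```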

close - Close file or pipe

close closes the specified file or pipe.


Syntax


close(expr)


expr

Name of the file or pipe to be closed, see redirection under “printf - Formatted output”.
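
A typical use of close is to finish writing a file so that the same awk program can read it back. The following is only a sketch: the temporary file is created with mktemp, which is assumed to be available on the system, and the variable names f, line and close_out are hypothetical.

```shell
tmpfile=$(mktemp)
close_out=$(awk -v f="$tmpfile" 'BEGIN {
    print "first line" > f
    close(f)                          # finish writing before reading the file back
    while ((getline line < f) > 0)    # read the file record by record
        print "read back: " line
    close(f)
}')
rm -f "$tmpfile"
echo "$close_out"
```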

cos - Cosine

cos calculates the cosine of a number.


Syntax


cos(x)


x

Number for which the cosine is to be calculated.

exp - Exponential function

exp calculates e to the power of x.


Syntax


exp( x )


x

Number for which e^x (e to the power of x) is to be computed.

getline - Read a record

awk retrieves a record as directed (see also the control-flow statement next).

getline has several different formats, with the following return values:


 1    successful execution
 0    end-of-file
-1    error


Syntax


getline


awk reads the next input record from the input file into $0. NR, NF, FNR, $0, and $1 to $NF are reset.


Example

If a record contains %%%, the next record is read. In other words, input records containing %%% are ignored.

/%%%/ {getline}


Syntax


getline < file


awk reads a record from the named file into $0. NF, $0, and $1 to $NF are reset.

file

Name of the file from which a record is to be read.


Syntax


getline var


awk fetches the next input record from the input file and puts it into the variable var. NR and FNR are reset.

var

Variable into which the next record is to be read.


Syntax


getline var < file


awk fetches a record from the named file and puts it into the variable var. NR, NF, FNR, $0, and $1 to $NF remain unchanged.

var

Variable into which the record is to be read.

file

Name of the file from which the record is to be read.


Syntax


command | getline [var]


The output of the named command is redirected to getline. Each getline call in this format causes awk to read the next line from the output of command and write it into $0 or the variable var.

If var is specified, NR, NF, FNR, $0, and $1 to $NF remain unchanged; if not, NF, $0, and $1 to $NF are reset.

This construct is equivalent to calling the C function popen() with mode r.

var

Variable into which the record is to be written.

var not specified: The record is written into $0.

command

Name of the command whose output is to be read.
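
The following sketch reads the complete output of a command line by line. The command string and the variable names cmd, line and n are made up for illustration. The getline call is parenthesized so that the comparison with 0 is unambiguous, and the pipe is closed once the end of the output has been reached.

```shell
cmd_out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)   # 1 = record read, 0 = end of output, -1 = error
        n++
    close(cmd)
    print n " lines, last: " line
}')
echo "$cmd_out"
```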


gsub - Global substitution function

gsub globally substitutes the string repl for all strings in $0 or instr that match the extended regular expression re.
gsub returns the number of substitutions.


Syntax


gsub( re,repl[,instr] )


re

Extended regular expression that specifies the pattern to be matched.

repl

String to be substituted for the strings that match re.

instr

String in which the substitution is to be made.

instr not specified: Substitution is done in $0.
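
A short sketch of gsub applied to a variable instead of $0; the string and the variable names s and n are chosen only for illustration.

```shell
gsub_out=$(awk 'BEGIN {
    s = "ToTo-LoTo"
    n = gsub(/To/, "x", s)    # replaces every match in s and returns the count
    print n, s
}')
echo "$gsub_out"
```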


index - Search for substrings

index searches for a substring within a string. If the substring is present, index returns the starting character position (numbered from 1 onward) of its first occurrence in the string; if not, it returns a value of 0.


Syntax


index( str1,str2 )


str1

String in which index looks for the substring.

str2

Substring that index looks for.


Example

Comparing the string "ToTo-LoTo" with "To"

index("ToTo-LoTo","To") returns 1.


int - Truncate to integer

int truncates the argument toward zero, i.e. it returns the integer part of the argument and discards the fractional part.


Syntax


int( x )


x

Number that is to be truncated to its integer part.
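
A quick check of how int treats a positive and a negative argument; the values are arbitrary and the shell variable name is an assumption of this sketch.

```shell
# The fractional part is discarded in both cases.
int_out=$(awk 'BEGIN { print int(3.7), int(-3.7) }')
echo "$int_out"
```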


length - Return length

length returns the length of a string.


Syntax


length[( str )]


str

length returns the length of string str.

str not specified:
length returns the length of the current input record $0.


log - Logarithm

log calculates the natural (base e) logarithm of a number.


Syntax


log( x )


x

Number whose natural log is to be computed.


match - Match regular expressions

match checks whether a string in str matches the extended regular expression in re. If a matching string is found, match returns the character position in str (numbered from 1 onward) at which the string begins; if not, it returns 0.

The variable RSTART is set to the return value of match; RLENGTH is set to the length of the matching string (or -1 if no matching string is found).


Syntax


match( str,re )


str

String in which the pattern is to be matched.

re

Extended regular expression.
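
The return value and the variables RSTART and RLENGTH can be inspected as in the following sketch, once for a successful match and once for a failed one. The strings and patterns are made up for illustration.

```shell
match_out=$(awk 'BEGIN {
    pos = match("ToTo-LoTo", /Lo+/)     # "Lo" begins at position 6
    print pos, RSTART, RLENGTH
    pos = match("ToTo-LoTo", /xyz/)     # no match
    print pos, RSTART, RLENGTH
}')
echo "$match_out"
```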


print - Standard output function

print is the standard output function. print outputs either the current record or the specified arguments and terminates its output with the output record separator ORS. For further details refer to "Output format".


Syntax


print( arg1[[,]arg2 ...] )[redirection]


No argument specified:

print writes the current input record on standard output.

arg1 arg2

Arguments that are to be printed. print evaluates the expression arguments and concatenates the results in the order in which the arguments are specified.

arg1,arg2

Arguments that are to be printed. print outputs the evaluated expression arguments in the specified order, separated by the output field separator OFS if they are separated by commas in the print statement.

redirection

Output can be redirected to a file or piped to a program. You can use up to 10 output files.

redirection can be in the form of:

>file

The output is written to the named file. The former contents of file are deleted the first time print is called. All subsequent print or printf outputs to file in the same awk program are appended to the end of file. Unless explicitly closed, file remains open until the end of the awk program.

>>file

The output is appended to the previous contents of file. Unless explicitly closed, file remains open until the end of the awk program.

|prog

The output is piped to the program named prog.

You are only permitted to open one pipe to prog within an awk program, but you can pipe any number of print or printf outputs to it.

This construct is equivalent to calling the C function popen() [4] with mode w.

Unless explicitly closed, the pipe remains open until the end of the awk program.

The file or program name can be specified directly (enclosed in "...") or via a variable that evaluates to the file name.



Caution!
If you redirect output to the input file, the input file will be destroyed without any warning!


Output format

print outputs integers in decimal and prints strings at full length. Apart from that, the output format is contingent on the following predefined variables:

OFS - output field separator

OFS is one space by default. If you wish, you can assign any one character to OFS to change the output field separator.

ORS - output record separator

ORS is the newline character by default. If you wish, you can assign any one character to ORS to change the output record separator.

OFMT - floating point output format

OFMT defines the output format for floating point values and is set to "%.6g" by default. This means that a floating point number is printed with a maximum of 6 significant digits. If you wish, you can assign a different printf format for floating point numbers to OFMT (see “printf - Formatted output” below).


Example

Print the first and second fields, separated by a blank:

{print $1,$2}


Concatenate the first and second fields without an output field separator:

{print $1$2}

or

{print $1 $2}
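
The effect of OFS, ORS and OFMT on print can be sketched as follows. The separator characters and the format are arbitrary choices for this illustration, and LC_ALL=C is set only to keep the radix character a dot.

```shell
fmt_out=$(LC_ALL=C awk 'BEGIN {
    OFS = "-"; ORS = "|"; OFMT = "%.2f"
    print "a", "b", 3.14159    # commas: arguments are joined with OFS, the number uses OFMT
    print "c" 2                # no comma: plain concatenation, OFS is not inserted
}')
echo "$fmt_out"
```

Each print call ends its output with ORS, so both records appear on one line, each followed by a vertical bar.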


printf - Formatted output

printf is the output function for formatted output. The output format can be specified as in the standard printf() function in C.


Syntax


printf( format[,arg,...] )[redirection]


format

String defining the output format. The output format comprises plain characters and format elements (conversion specifications). Printable characters are output unaltered. The special characters listed in the "Basic elements" section are converted immediately. For example, \n sets the position to the start of the next line.

All format elements begin with the percent sign. The most common format elements are presented in the following list:


%c    single character
%d    decimal integer
%e    floating point number in exponential notation, e.g. 5.234e+2
%f    floating point number, e.g. 52.34
%g    %e or %f, whichever is shorter
%o    octal integer (base 8)
%s    character string
%u    unsigned decimal integer
%x    hexadecimal integer (base 16)


arg

Arguments that are to be printed.
printf evaluates the expression arguments, allocates them in the given order to the specifications in format, and outputs them in the appropriate format.

If the format element is incompatible with the argument, e.g. a numeric format specification for an alphanumeric argument, a 0 is printed.

If there are more arguments than format elements, the excess arguments are ignored, i.e. not printed.

If there are more format elements than arguments, an error message is issued.

redirection

Redirection is as for print.

redirection not specified:
printf prints on standard output.


Example

Field 1 is printed as a decimal number with at least 2 positions, followed by ** as a separator, followed by field 2 as a string of at least 5 characters, followed by newline:

{ printf("%2d**%5s\n", $1,$2) }


rand - Return a random number

rand returns a random number r, where 0 <= r < 1.


Syntax


rand()


Also refer to srand.


sin - Sine

sin returns the sine of a number.


Syntax


sin(x)


x

Number whose sine is to be computed.


split - Subdivide strings

split divides a string into substrings and stores each substring as an element in an array. The elements are subscripted in ascending order, starting with 1.
split returns the number of array elements.


Syntax


split(str,array[,sep])


str

String that is to be split.

array

Name of the resulting array.

sep

Extended regular expression specifying the characters that act as a separator between
the substrings in str.

sep not specified:
FS is used as the separator.


Example

The input

{
    s=split("january:february:march", months, ":");
    for(i=1; i<=s; i++) print months[i];
}

produces the output

january
february
march


sprintf - Return formatted output as a string

sprintf formats in exactly the same way as printf, but there is no direct output. sprintf instead returns the formatted output as a string, which could then be assigned to a variable or used for a similar purpose.


Syntax


sprintf(format,arg,...)


format

String defining the output format (see “printf - Formatted output”).

arg

Arguments that are to be output (see “printf - Formatted output”).


Example

The following awk program fragment produces the same output as the example given under printf.

{ x = sprintf("%2d**%5s", $1,$2); print x }


sqrt - Calculate the square root

sqrt calculates the square root of a number.


Syntax


sqrt(x)


x

Number whose square root is to be computed.


srand - Set the seed for the rand function

srand sets the seed (starting point) for the rand function to the number x, or to the current time if no argument is specified.


Syntax


srand([x])


x

Number that is to serve as the seed for rand.
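
Seeding with a fixed number makes the sequence returned by rand reproducible, which the following sketch checks. The seed 42 and the variable names are arbitrary.

```shell
srand_out=$(awk 'BEGIN {
    srand(42); a = rand()
    srand(42); b = rand()              # same seed, so the same first value
    same = (a == b)
    in_range = (a >= 0 && a < 1)
    print same, in_range
}')
echo "$srand_out"
```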


sub - Substitution function

sub substitutes the string repl for the first instance of a string in $0 or instr that matches the extended regular expression re.
sub returns the number of substitutions.


Syntax


sub(re,repl[,instr])


re

Extended regular expression that specifies the pattern to be matched.

repl

String to be substituted for the strings that match re.

instr

String in which the substitution is to be made.

instr not specified:
The substitution is done in $0.
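
A short sketch of sub applied to a variable; only the first match is replaced. The string and the variable names s and n are chosen only for illustration.

```shell
sub_out=$(awk 'BEGIN {
    s = "ToTo-LoTo"
    n = sub(/To/, "x", s)     # replaces the first match only
    print n, s
}')
echo "$sub_out"
```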


substr - Define a substring

substr extracts a substring from a string.


Syntax


substr(str,m[,n])


str

String from which the substring is to be extracted.

m

Position in str at which the substring begins. Character positions are numbered consecutively from left to right, starting with one.

n

Maximum length of the substring.

n not specified:
The substring extends to the end of str.


Example

The input

{
x = substr("060789",3,2); print "Month = "x
}

produces the output:

Month = 07


system - Call shell command

system executes the specified shell command and returns its exit status.


Syntax


system(command)


command

Name of the shell command to be executed.
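
The exit status returned by system can be evaluated in the awk program, as in the following sketch. It assumes that the standard utilities true and false are available; the variable names ok and bad are hypothetical.

```shell
sys_out=$(awk 'BEGIN {
    ok  = system("true")      # command succeeds: exit status 0
    bad = system("false")     # command fails: non-zero exit status
    print ok, bad
}')
echo "$sys_out"
```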


Error

If an awk program contains errors, awk issues corresponding error messages and exits immediately. The error messages indicate the cause of the error, if detectable by awk, and the awk program line in which awk thinks the error is to be found.

Typical error messages are:

awk: syntax error at source line xxx
Line xxx of the awk program contains a syntax error.


awk: illegal statement source line number xxx
Line xxx of the awk program contains an illegal statement.


Locale

The following environment variables affect the execution of awk:

LANG

Provide a default value for the internationalization variables that are unset or null.
If LANG is unset or null, the corresponding value from the implementation-specific default locale will be used. If any of the internationalization variables contains an invalid setting, the utility will behave as if none of the variables had been defined.

LC_ALL

If set to a non-empty string value, override the values of all the other internationalization variables.

LC_COLLATE

Determine the locale for the behavior of ranges, equivalence classes and multicharacter collating elements within regular expressions and in comparisons of string values.

LC_CTYPE

Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single- as opposed to multi-byte characters in arguments) and input files, the behavior of character classes within regular expressions, the identification of characters as letters, and the mapping of upper- and lower-case characters for the toupper and tolower functions.

LC_MESSAGES

Determine the locale that should be used to affect the format and contents of diagnostic messages written to standard error.

LC_NUMERIC

Determine the representation of the radix character, the exponentiation symbol and the digit grouping character.

NLSPATH

Determine the location of message catalogs for the processing of LC_MESSAGES.


awk Examples

Example 1

Output all input lines in which field 3 is greater than field 5:

$ awk '$3 > $5' file

Since no action has been specified, awk prints the selected lines by default.


Example 2

Print every 10th line of a file:

$ awk '(NR % 10) == 0' file


Example 3

Print the second to last and the last field in each line, separated by a colon:

$ awk 'BEGIN {OFS=":"} \
> {print $(NF-1), $NF}' file

If a line consists of a single field, the entire line is output twice, separated by a colon (first $0, then $1).


Example 4

Add up the values of the first field of every line and print the total and average at the end:

$ awk '{s += $1} \
>      END {print "Total: ", s, "Average: ", s/NR}'\
>      file


Example 5

Find a preprocessor if directive, i.e. a range of lines in which the first line begins with #if and the last line with #endif:

$ awk '/^#if/, /^#endif/' file


Example 6

Print all lines in which the first field differs from that of the previous line:

$ awk '$1 != prev { print; prev = $1 }' file


Example 7

file contains a list of data about young people, with the second field containing one of the entries school, university, apprenticeship or elsewhere. For statistical purposes, you want to count how many are at school and university:

$ awk '$2 ~ /school/ {incr["school"]++}
>     $2 ~ /university/ {incr["university"]++}
>     END {print "school:" incr["school"]; \
>          print "university:" incr["university"]} ' file


Example 8

The file contents lists the table of contents of a text. The table of contents is organized in decimal classification and has the format:

1. Foreword
2. Introduction
3. The Game of Chess
3.1. History
3.2. Rules
3.2.1 Setting Up the Figures
.
.
.
4. The Game of Checkers/Draughts
4.1. History
.
.
.
8. Index


The following awk program can be used to give the list a more orderly format:

$ awk '{$1=$1"     ";        \
>      $1=substr($1,1,5);    \
>      print $0} ' contents >> con.form


The output lines are prepared in the following stages: First, five blanks are added to the end of the first field ($1=$1"     "). Then the first field is truncated to five characters. Because an assignment to $1 causes awk to rebuild $0 with the output field separator OFS (one blank by default) between the fields, the first field of each line occupies columns 1 to 5, followed by one blank, and field 2 always starts at column 7. The output in the file con.form will be as follows:

1.    Foreword
2.    Introduction
3.    The Game of Chess
3.1.  History
3.2.  Rules
3.2.1 Setting Up the Figures
.
.
.
4.    The Game of Checkers/Draughts
4.1.  History
.
.
.
8.    Index


Example 9

The following awk program in the file prog prints the number of fields and the actual fields of each record. The record separator has been redefined as the dollar sign. The field separators are thus blanks, tabs, and the newline character:

BEGIN { RS="$"; printf "Record\tNum" }
{ 
  printf ("\n%4d\t%3d\t", NR, NF);
  for(i=1;i<=NF; i++) printf "%s:", $i 
}
END {print"\n"}


The file text contains the following text:

first record$  second   record     $
$
fourth     and  last
record$


The call:

$ awk -f prog text

returns:

Record  Num
     1    2      first:record:
     2    2      second:record:
     3    0
     4    4      fourth:and:last:record:
     5    0


Example 10

You now change the file text to:

&&
first&&record$second record$$fourth
and&
last
record&

and call awk again, this time using the -F option to change the field separator to &.

$ awk -F"&" -f prog text


The output returned is:

Record  Num
     1    6    :::first::record:
     2    1    second record:
     3    0
     4    8    fourth:and::last::record:::


This example illustrates how fields are separated when a non-standard separator is used. The first line (&&) of the text file is a part of the first record and now yields 3 fields, for example, because each individual separator in a string of separators (&&) is counted, and the newline implicitly acts as a separator as well (2 & + 1 newline = 3).


See also

egrep, fgrep, grep, lex, sed