awk is a programmable text manipulation system.
When you call awk, you specify the awk program it is to execute and the files it is to process. The actions defined in the program are then applied to the contents of the specified files. awk does not alter its input files. By default, the results of the actions it performs are written to standard output.
awk offers the following advantages over text manipulation programs such as egrep and sed:
awk operates on one record at a time. As with egrep and sed, an input record is defined as one line by default; but with awk you can change this setting and define some other unit of text as the record.
Each input record is split into fields which can be accessed individually.
A pattern (selection criterion) may be a condition defined by the logical combination of extended regular expressions and relational operators.
You can program any actions that you require. awk is a high-level C-like programming language.
A detailed description of awk is provided in the sections below.
Syntax
Format 1: |
awk[ -F ERE] |
Format 2: |
awk[ -F ERE] |
-F ERE
Defines the field separator character for the input record (input field separator).
ERE
Extended regular expression that defines a character to be interpreted as the input field separator. Separators do not form part of the fields.
To be able to use t as the input field separator, you must specify it as follows on the awk command line or in the BEGIN section of the awk program:
-F ERE not specified:
Blanks and tabs act as field separators.
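The command that belongs with the remark about t as a field separator is not preserved in this text. One portable way to get a literal t as the separator is to write it as a bracket expression, because several awk implementations read a bare -Ft as a request for the tab character. The following is a sketch under that assumption, not the manual's own wording, and the sample input is made up for illustration.

```shell
# Assumed sample input: three fields separated by the letter t.
# The bracket expression [t] matches a literal "t".
result=$(printf 'atbtc\n' | awk -F'[t]' '{ print $2 }')
printf '%s\n' "$result"

# The same separator, set in the BEGIN section instead of on the command line.
result_begin=$(printf 'atbtc\n' | awk 'BEGIN { FS = "[t]" } { print $2 }')
printf '%s\n' "$result_begin"
```

Both calls print b, the second of the three fields a, b and c.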
-v initialization
Assignments in the form var=value.
var
Name of the variable to be initialized.
value
Initial value to be assigned to var. value can be defined in exactly the same way as an environment variable on shell level. There is no difference between the assignment of a value with -v initialization and with initialization (see below).
prog
awk program argument. Possible forms for prog are: 'awk_program', i.e. an awk program written on the command line, or -f progfile, i.e. the name of a file containing an awk program.
'awk_program'
An awk program written on the command line. You should always enclose the awk program in single quotes in order to prevent the shell from interpreting metacharacters. If the program is more than one line long, you must escape the newline character with a backslash.
Example
Output all lines in the input file whose third field consists of the character '0':
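The command for this example is missing from the text at this point. A minimal sketch of such a call follows. It assumes made-up input supplied on standard input instead of a named file.

```shell
# Hypothetical input: only the first line has a third field equal to "0".
# Comparing with the string constant "0" makes awk compare as strings,
# so a field such as "00" is not selected.
result=$(printf 'a b 0\nc d 1\ne f 00\n' | awk '$3 == "0"')
printf '%s\n' "$result"
```

Because no action is given, awk prints each matching line unchanged.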
-f progfile
The awk program is located in the file named progfile.
initialization
Assignments in the form: var=value
The var variable (whether it appears in the awk program or not) is initialized to value. initialization and file may be specified in any order. The assignment is made at the time when the named file is opened.
Exception
var
Name of the variable to be initialized. The name must not begin with $.
value
Initial value to be assigned to var. value can be defined in exactly the same way as an environment variable on shell level.
file
Name of the text file to be processed. You may list more than one file if you wish. Files are read in the order in which they are listed. If file is a dash (-), awk reads from standard input.
file not specified:
Typical awk applications
awk is a tool which makes text manipulation tasks easy to accomplish. Typical applications for awk include:
selectively extracting data from files
checking the contents of files
performing calculations on the data in a file
changing the format of input data.
Using four simple examples, this section demonstrates how awk can be used.
Example
A file called supplies contains a list of office supplies. It includes the name of each article, along with its quantity and unit price:
Pencil 100 0.60
Table 5 345.00
Lamp 20 79.80
Paper 75 1.00
Diskette 1000 2.40
Envelope 1500 0.20
Example 1
Select all articles with a quantity greater than 100:
With $2 you access the second field of a line, which in this case is the quantity of each article. If the quantity is greater than 100, the condition is fulfilled, and the print function is executed. Since no arguments were specified for print, the whole line is output.
Example 2
Calculate the total price for all articles with a quantity greater than 100 and print this total along with the article name:
Three arguments are entered for the print function in this example. The following is output:
$1 | article name (first field) |
\t | tab character |
$2*$3 | quantity (second field) times unit price (third field) |
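The commands for Examples 1 and 2 are not preserved in this text. The sketch below reconstructs them from the descriptions, so the exact wording of the original commands is an assumption. The temporary directory is used only to keep the sketch self-contained.

```shell
# Recreate the supplies file from the example in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > supplies <<'EOF'
Pencil 100 0.60
Table 5 345.00
Lamp 20 79.80
Paper 75 1.00
Diskette 1000 2.40
Envelope 1500 0.20
EOF

# Example 1: select on the second field; print without arguments
# outputs the whole line.
example1=$(awk '$2 > 100 { print }' supplies)
printf '%s\n' "$example1"

# Example 2: three arguments to print: article name, tab character,
# quantity times unit price.
example2=$(awk '$2 > 100 { print $1, "\t", $2*$3 }' supplies)
printf '%s\n' "$example2"
```

Only Diskette and Envelope have a quantity greater than 100. In Example 2 the three arguments are joined by the output field separator, so each tab is surrounded by single blanks.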
Example 3
Include a heading in the output:
This example illustrates the use of the BEGIN pattern. awk executes the action after BEGIN only once, i.e. when the program is started. The heading is therefore printed only once at the beginning.
Example 4
Print a grand total of all amounts at the end. For this purpose we use a variable called sum, which is initialized to zero in the BEGIN pattern. The product of column 2 and column 3 is calculated for each line, and all the products are summed up:
This example demonstrates the use of the END pattern. awk executes the action after END only once, i.e. before termination of the program. The grand total of all subtotals is therefore printed just once at the end.
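The commands for Examples 3 and 4 are likewise missing. A sketch based on their descriptions follows. The heading text and the label Total: are assumptions, and only the variable name sum comes from the text.

```shell
# Recreate the supplies file from the example in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > supplies <<'EOF'
Pencil 100 0.60
Table 5 345.00
Lamp 20 79.80
Paper 75 1.00
Diskette 1000 2.40
Envelope 1500 0.20
EOF

# Example 3: the BEGIN action runs once, before the first record is read.
example3=$(awk 'BEGIN { print "Article\tAmount" }
$2 > 100 { print $1, "\t", $2*$3 }' supplies)
printf '%s\n' "$example3"

# Example 4: sum is initialized in BEGIN, accumulated for every line,
# and printed once in the END action.
example4=$(awk 'BEGIN { sum = 0 }
{ sum += $2*$3 }
END { print "Total:", sum }' supplies)
printf '%s\n' "$example4"
```

For the six lines of supplies, the END action prints Total: 6156.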
Structure of an awk program
An awk program can consist of a BEGIN section, a main section, and an END section, structured as shown below:
Syntax
– BEGIN section –
[ BEGIN {action} ]
– Main section –
[[pattern] {action}
| pattern [{action}]
| function_definition . . . ]
– END section –
[ END {action} ]
pattern
The pattern indicates which data is to be selected from the input files (see “Patterns”).
action
The action indicates what to do with data that matches the pattern (see “Actions”).
function_definition
A function_definition enables you to define your own functions (see “Functions”).
At least one of the three sections (pattern, action or function-definition) must be present. In a pattern {action} pair, either the pattern or the action can be omitted. If the action is omitted, each line that matches the pattern is output; omitting the pattern causes the action to be performed on all lines.
Operation of the awk command
awk executes the awk program that is specified by the user, proceeding in the following sequence:
Initial processing
The first step performed by awk is to initialize any variables that may have been defined. If there is a BEGIN section including an action, awk then executes the action specified there. The action in the BEGIN section is executed just once, before the first line is processed.
File processing
Next awk processes the specified input files by reading the input records sequentially. For each input record, awk tries to match each pattern in the order that is specified in the awk program. If a pattern is matched, i.e. the selection criterion is fulfilled, the associated action is performed.
If no pattern is specified for an action, awk performs the action for every record.
If no action is specified for a pattern, the default action is to output (print) the record.
Multiple input files are processed in the specified order.
Final processing
When all the specified files have been processed, awk performs the action in the END section, if one has been included. awk then exits.
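The three processing phases can be observed with a small program. This is an illustrative sketch. The input lines and the variable name count are made up.

```shell
# BEGIN runs once before any input, the main rules run for every record,
# and END runs once after the last record.
result=$(printf 'one\ntwo\n' | awk '
BEGIN { print "start" }
/two/ { print "matched: " $0 }   # pattern with action
      { count++ }                # action without pattern: every record
END   { print "records: " count }
')
printf '%s\n' "$result"
```

The output shows the BEGIN text first, then the line produced for the matching record, and finally the record count from the END section.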
The input file
An input file consists of records that are subdivided into fields.
Records
Records are separated by a record separator. The record separator does not form part of a record. By default, a record is one line, and the record separator is the newline character. However, you do have the option of changing this setup by assigning any single character to the special variable RS (Record Separator). If you specify a string of characters as a value for RS, only the first character will be taken into account. The ordinal number of the current record is available in the variable NR (Number of Record). If there is more than one input file, NR counts from the start of the first file to the end of the last one. The special variable $0 addresses the whole of the current record. Further information on variables is provided in the section “Basic elements of the awk language”.
Fields
Each record is split into fields separated by one or more field separators. The default field separator is white space (any sequence of tabs and blanks), but you do have the option of changing this by assigning any other character to the special variable FS (Field Separator). You can make this assignment either in the awk program or by using option -F on the command line. The value assigned to FS is interpreted as an extended regular expression (see section “Regular POSIX shell expressions”).
Example 1
To define the characters x and y as alternate field separators:
syntax on the awk command line: -F"[xy]"
syntax in the awk program: FS="[xy]"
Example 2
To define the field separator as one or more occurrences of the character x:
syntax on the awk command line: -F"x+"
syntax in the awk program: FS="x+"
The default setting (any sequence of blanks and tabs) can be expressed by the regular expression [ \t]+, where ' ' stands for a blank, and \t represents a tab.
Note that the newline character is always interpreted as a field separator, regardless of the value assigned to FS!
The number of fields in the current record is stored in the variable NF (Number of Fields). Individual fields of the current record are addressed by the predefined variables $1, $2, to $NF. Further information on variables is provided in the section “Basic elements of the awk language”.
Example
Default setup
Field 1  Field 2  ...        Field 5  ...
This     is       the first  record             <--- Record 1
and      this     is the     second   record.   <--- Record 2
Customized setup: RS="%"; FS=":";
Field 1    Field 2         Field 3
%Name     : Address       : Phone number   <--- Record 1
%SNI AG   :81730 Munich   : 089-636-1      <--- Record 2
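A customized setup of this kind can be tried out as follows. The sketch is a simplified variant of the example: the input is a single line without newline characters, and % is placed between the records rather than in front of them, so that differences in newline handling between awk implementations do not affect the result.

```shell
# RS="%" ends a record at each percent sign, FS=":" splits fields at colons.
# printf needs %% to emit a single literal percent sign.
result=$(printf 'Name:Address:Phone number%%SNI AG:81730 Munich:089-636-1' |
  awk 'BEGIN { RS = "%"; FS = ":" }
       { print NR ": " $2 " (" NF " fields)" }')
printf '%s\n' "$result"
```

Each of the two records is split into three fields, and the second field may itself contain blanks because blanks are no longer field separators.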
Rules for record and field separators
Default settings for record separators
The default record separator is the newline character.
If the null string is assigned to RS (RS=""), the file is treated as a single record. If several files are specified, each file will consist of a single record (which means that the ultimate value of NR will be equal to the number of files).
Default settings for field separators
If the record separator is newline, the field separator defaults to blanks and tabs.
If the record separator is not a newline, the newline character always counts as a field separator, regardless of which character has been explicitly defined as the field separator (see Fields, example 2).
If you explicitly assign a blank to FS, either with -F" " on the awk command line or by using the assignment FS=" ", then blanks and tab characters are treated as field separators.
On the other hand, if you explicitly assign the tab character to FS (FS="\t"), then only the tab character is treated as the field separator and not the blank.
Leading field separators and field separator strings
The following applies to blanks, tabs and newlines as field separators:
Leading field separators are ignored.
Multiple occurrences of a field separator are treated as a single field separator (see example 9).
For all other field separators, leading field separators are counted. In multiple occurrences of a field separator, each character is counted separately. Thus two consecutive field separators are deemed to have an empty field between them (see example 10).
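The difference between the two kinds of separator can be checked by counting fields. The input lines below are made up for illustration.

```shell
# Default separators: leading blanks are ignored and runs of blanks
# count as one separator, so this line has two fields.
blank_fields=$(printf '  a  b\n' | awk '{ print NF }')
printf '%s\n' "$blank_fields"

# Colon as separator: the leading colon and the doubled colon each
# delimit an empty field, so this line has four fields.
colon_fields=$(printf ':a::b\n' | awk -F: '{ print NF }')
printf '%s\n' "$colon_fields"
```

The first call prints 2, the second prints 4.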
Changing separators:
If you need a number of different record separators in one file, you can change RS within the awk program. The new record separator comes into effect as soon as the assignment to RS has been implemented. Similarly, you can change FS within the awk program, should you require a number of different field separators in one file. The new field separator comes into effect as soon as the assignment to FS has been implemented.
Special variables for the input file
The following list shows all special awk variables pertaining to the input file and the corresponding values awk usually assigns to these variables.
FILENAME
Name of the current input file, - for standard input
FS
Input field separator (default: any sequence of blanks and tabs)
NF
Number of fields in the current record
NR
Ordinal number of the current record from start of input
FNR
Ordinal number of the current record in the current file
RS
Input record separator (default: newline)
$0
Current record
$1
First field of the current record
$2
Second field of the current record
...
$NF
Last field of the current record
You can change these variables within an awk program if you wish. This does not alter the input file. Further information on variables is provided in the section “Basic elements of the awk language”.
Basic elements of the awk language
This section describes the syntax of the basic elements of the awk language. You will need these elements in order to define pattern and action pairs.
Comments
You can include comments in an awk program, as in a shell script. A comment begins with the # character and continues till the end of the line.
Constants
There are two types of constant:
number
A number (numeric constant) is a signed or unsigned integer or floating point number. awk does not check its format. If your number contains invalid characters, awk attempts to filter out a valid part and ignores the rest.
integer
An integer is a sequence of digits from 0 to 9.
floating point number
A floating point number consists of a mantissa with or without an exponent.
The mantissa comprises an integer with or without a fractional part.
The fractional part is represented by a radix character and an integer.
string
A string (alphanumeric constant) is a sequence of characters, enclosed in double quotes "...". If the double quotes are omitted, awk will interpret the string as a variable name, a number, or an operator.
character
A single character is also enclosed in double quotes "..." in order to prevent awk interpreting the character as a variable name. A character may be a displayable character from the character set which is currently in use (see section “EDF04 character set”) or one of the following special characters as represented in C:
\" | for " |
\\ | for \ |
\a | for bell character |
\n | for newline character |
\t | for tab character |
\v | for vertical tab |
\b | for backspace |
\r | for carriage return |
\f | for page feed |
Variables
awk allows you to use simple variables and arrays to store values.
The special variables are predefined; others can be defined by the user.
Name of a variable
The name of a user-defined variable can be any string made up of underscores (_), uppercase and lowercase letters and digits, beginning with a letter or an underscore.
Data type
Variables do not have a data type. You can thus assign either a number or a string to any variable. If the context is clearly numeric, variables are treated as numeric; otherwise, they default to alphanumeric.
Example:
x = "Miller"; | # Variable x contains the string Miller |
x = "3"+4 ; | # Variable x has a value of 7 |
Declaration
awk variables do not need to be explicitly declared. User-defined variables are automatically declared the first time they are used.
Initialization
Special variables are initialized to predefined values by awk. Depending on the context, user-defined variables are initialized by awk to the null string or to 0 by default. If you wish, you can specify other initial values when you call awk.
Exceptions:
When i>NF, $i will not always be the null string.
$ variables cannot be initialized on the command line.
Special variables
awk recognizes the special variables shown in the list below. The values awk usually assigns to these variables are indicated in the list. New values may be assigned to the variables by the user.
ARGC
Number of elements in the array ARGV
ARGV
Array holding the command line arguments (excluding options and the prog argument), numbered from 0 to ARGC-1
ENVIRON
Array holding the values of environment variables, where the indexes are the names of the variables
FILENAME
Name of the current input file, - for standard input
FS
Input field separator (default: any sequence of blanks and tabs)
NF
Number of fields in the current record
NR
Ordinal number of the current record from start of input
FNR
Ordinal number of the current record in the current file
OFS
Output field separator (default: one blank)
ORS
Output record separator (default: newline)
OFMT
Output format for floating point numbers (see printf - Formatted output)
(default: %.6g, i.e. up to 6 significant digits)
RS
Input record separator (default: newline)
RLENGTH
Length of the string matched by the match function
RSTART
Starting position of the string matched by the match function. Numbering begins with 1. This value always corresponds to the value returned by the match function.
SUBSEP
Subscript string separator for multi-dimensional arrays. The default setting is \034.
$0
Current record
$n
Field n of the current record
$NF
Last field of the current record
What is the effect of changing special variables?
Example 1
The assignment
$1 = "new";
assigns the string new to $1; but this does not actually alter the first field of the current input record.
This also applies to the following awk settings relating to the input file:
The current input file does not change when you assign a new name to FILENAME.
When you assign a value to a variable $i where i>NF, NF is assigned the value i.
If you assign a new value to NR, you only alter the number assigned to the current line; you do not move to a different line.
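The effect of assigning to a field beyond NF can be seen in a short sketch. The input line is made up. The record held by awk is rebuilt with the output field separator, and the input itself is not changed.

```shell
# Assigning to $5 in a three-field record raises NF to 5; the new
# field 4 is empty, so two blanks appear in front of "e".
result=$(printf 'a b c\n' | awk '{ $5 = "e"; print NF; print $0 }')
printf '%s\n' "$result"
```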
Example 2
The contents of $0 remain the same even if NR is modified:
{print NR, $0; NR=NR+34; print NR, $0}
A typical output would then be:
10 This is the tenth line
44 This is the tenth line
When you assign a new value to a variable, its old value is deleted. Thus, if you change NF, for example, the information on the number of fields in the current record is lost.
Peculiarity of $ variables:
You can specify the number of a $ variable as a constant or as an expression which evaluates to the number.
Example 3
You can use $(NF-1) to access the second-last field.
Array
An array is a set of constants or variables.
An array element is addressed as follows:
array_name[index]
array_name
Name of a variable.
index
A simple variable.
The index may be numeric or alphanumeric. The index you specify can therefore be a number, a string, or an expression that evaluates to an index value.
awk provides two special types of arrays:
Dynamic arrays
Arrays, like simple variables, do not need to be declared. Above all, there is no need to define dimensions. New array elements are created automatically as and when required.
Associative arrays
Individual array elements can be accessed via an alphanumeric index.
A special control-flow statement is provided in order to process all elements of an associative array:
for (index in array) statement
index assumes the index values present to this point in random order, and the specified statement is executed once for each array element (see control-flow statement for).
Example
A file called expenses contains various expenses incurred. For each item of expenditure the file shows the date, month, amount, and a brief description, with a colon to separate them. For example:
01:January: 40.78:Supplies
05:January: 6789.00:Laser printer
23:March: 240.32:Lamps
11:January: 478.00:Chairs
01:February: 45.00:Journals
Using an associative array you can easily calculate total expenditure for each month from the data in this file. The program in the example uses an array called mexpenses and the names of the months as an alphanumeric index. For each line, the expenses in the third field ($3) are summed up to produce total expenditure for each month appearing in the second field ($2).
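The program for this example is not preserved in the text. A sketch that follows the description is shown below. The array name mexpenses and the fields $2 and $3 come from the text, while the output format, the loop variable m and the final sort are assumptions. The sort is there only because the order of a for (index in array) loop is not predictable.

```shell
# Recreate the expenses file from the example in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > expenses <<'EOF'
01:January: 40.78:Supplies
05:January: 6789.00:Laser printer
23:March: 240.32:Lamps
11:January: 478.00:Chairs
01:February: 45.00:Journals
EOF

# Sum the amount in field 3 under the month name in field 2,
# then print one line per month once all input has been read.
result=$(awk -F: '
{ mexpenses[$2] += $3 }
END { for (m in mexpenses) printf "%s %.2f\n", m, mexpenses[m] }
' expenses | sort)
printf '%s\n' "$result"
```

The three January entries are added up into a single total, while February and March each keep their single amount.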
Expressions
An expression can be any of the following:
constant
Numeric or alphanumeric constant (see “Basic elements of the awk language”).
variable
Variable (see “Basic elements of the awk language”).
function_call
Invocation of a predefined function (see “Functions”).
expression
Expression.
un_op
Unary operator (see “awk operators”).
bin_op
Binary operator (see “awk operators”).
Expressions are evaluated and return a value. They may appear both in patterns and in actions.
awk operators
awk recognizes all C operators plus the operators for pattern matching and string concatenation.
The following list shows all awk operators in ascending order of precedence. Operators in the same line have the same precedence.
= | assignment operator |
+= -= *= /= %= ^= | compound assignment operators as in C |
|| | logical OR |
&& | logical AND |
~ !~ | pattern matching operators |
> >= < <= != == | relational operators |
operand list | concatenation |
+ - | plus, minus |
* / % | multiply, divide, remainder |
! | logical NOT |
^ ** | exponent |
++ -- | increment, decrement |
Evaluation of expressions
Since no data type is prescribed for the operands, you can freely mix numeric and alphanumeric constants. awk determines from the context whether a numeric or alphanumeric operation is required.
Please note that, as in C, there are no special truth values. Like C, awk treats a value of 0 as false and a non-zero value as true. This means that any non-zero value as an argument of a logical operation is held to be true. If the result of a logical operation is true, it is represented as 1.
Example:
(2&&2)+3=4
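The representation of truth values can be checked directly in a BEGIN action, which needs no input file. The variable names below are illustrative only.

```shell
# A logical operation yields 1 for true and 0 for false, and the
# result can be used like any other number.
result=$(awk 'BEGIN {
    x = (2 && 2) + 3     # true is 1, so x is 4
    y = (0 || 0) + 3     # false is 0, so y is 3
    print x, y
}')
printf '%s\n' "$result"
```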
Patterns
Patterns (selection criteria) are specified by the user as a means of indicating which data is to be selected from the input files. A pattern can have any of the following forms:
/regexp/
relexp
matchexp
pattern_range
compound_pattern
/regexp/
Regular expression
awk supports extended regular expressions (see section “Regular POSIX shell expressions”). A regular expression is enclosed in slashes /.../.
Example:
A regular expression matching one or more occurrences of a, b or c:
/[abc]+/
relexp
relexp is an expression (see “Expressions”) featuring relational operators. The operators and their meanings are:
a > b | a greater than b? |
a >= b | a greater than or equal to b? |
a < b | a less than b? |
a <= b | a less than or equal to b? |
a == b | a equal to b? |
a != b | a not equal to b? |
Operands a and b are any expressions. If both operands are numeric, the comparison is numeric; if not, it is alphanumeric.
matchexp
matchexp is an expression (see “Expressions”) featuring pattern matching operators. It involves the comparison of a regular expression (pattern) with a string. The pattern matching operators and their meanings are:
str ~ p | string str must match pattern p |
str !~ p | string str must not match pattern p |
Using matchexp as a pattern allows you to select individual fields.
Example:
Select all records with a first field starting with A or a:
$1 ~ /^[Aa]/
The regular expression ^[Aa] represents strings that begin with A or a. The first field of the record ($1) must match (~) the regular expression, i.e. begin with A or a.
pattern_range
A pattern range takes the form:
/regexp/, /regexp/
Specifying a range causes the associated action to be executed for all records that lie within the range. The limits of the range (start and end) are defined by two regular expressions. The range begins with the first record containing a string that matches the first regular expression and ends with the first record containing a string that matches the second regular expression.
Example:
Select the range from the first line beginning with C to the first line beginning with K and output the first field of every line in the selected range:
/^C/, /^K/ {print $1}
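Applied to a small made-up input, the range pattern above behaves as sketched here.

```shell
# The range opens at the first line starting with C and closes at the
# first following line starting with K; both limits are included.
result=$(printf 'Apple 1\nCherry 2\nDate 3\nKiwi 4\nLemon 5\n' |
  awk '/^C/, /^K/ { print $1 }')
printf '%s\n' "$result"
```

Lines before the start of the range and after its end are not selected.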
compound_pattern
Logical operators (see Expressions) can be used to negate patterns and to combine several of them to form a single pattern. The logical operators and their meanings are:
!pat | Negation of pattern pat |
pat1 || pat2 | pat1 or pat2. The criterion is satisfied if pat1 or pat2 matches. |
pat1 && pat2 | pat1 and pat2. The criterion is satisfied if both pat1 and pat2 match. |
(pat) | Parentheses |
A compound condition is evaluated from left to right.
Example
Match all records that have an even number of fields and a letter between M (inclusive) and Q (exclusive) in the first field.
NF%2==0 && $1 >= "M" && $1 < "Q"
You can generally combine patterns in several ways in order to make the same selection. Thus, if the currently valid collating sequence defines the range [M-Q] as the uppercase letters M, N, O, P and Q, the above selection could also be made with pattern matching operators:
NF%2==0 && $1 ~ /^[MNOP]/
Since the first awk condition depends on the collating sequence of the currently valid character set, it may not return the same result in every case. The second awk line, by contrast, will always select only those records in which the first field begins with the letter M, N, O or P.
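The second form of the pattern can be tried on a few made-up records. No action is given, so matching records are printed unchanged.

```shell
# Select records with an even number of fields whose first field
# starts with M, N, O or P.
result=$(printf 'Mike a\nNora a b\nQuinn a\nOtto x y z\n' |
  awk 'NF % 2 == 0 && $1 ~ /^[MNOP]/')
printf '%s\n' "$result"
```

The Nora line is rejected because it has three fields, the Quinn line because its first field starts with Q.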
Actions
Actions indicate what to do when a pattern is matched. An action will typically involve processing one of the selected files. An action has to begin in the same line as the associated pattern. If this is not possible, the newline character must be escaped with a backslash. Blanks and tabs between the action and the pattern are ignored. An action comprises one or more statements and must be enclosed in braces {...} as shown below:
{statement [;statement]...}
A statement can be any of the following:
expression
control_statement
expression
An expression is evaluated but is not put to any further use unless expression is in the form of an assignment, an increment or a decrement (see section “Expressions”).
control_statement
A control_statement allows you to control the flow of an awk program (see section “Control-flow statements”).
A single statement may be spread over several lines, in which case each line except the last must end with a backslash. The backslash escapes (cancels the effect of) the newline character.
Multiple statements
You can group together a number of statements within one pair of braces {}. Statements are delimited by means of:
- a semicolon ;
- a right brace }
- a newline character.
Control-flow statements
Control-flow statements allow you to control the flow of an awk program. awk recognizes the following control-flow statements:
break | terminate a loop |
continue | skip remainder of loop |
exit | terminate the awk program |
for | loop counter and looping an array |
if | conditional statement |
next | skip to the next input record |
while | execute iteratively |
do | execute iteratively |
delete array[i] | delete element i of the named array |
return x | return from a function with a value |
return | return from a function without a value |
The control-flow statements are described below in alphabetical order.
break - Terminate a loop
break can be used in the body of a for, while, or do loop. break causes an immediate exit from the enclosing loop.
Syntax
break
Example
While records continue to start with a dot, keep reading in the next record. Terminate the loop if the second field of the retrieved record is greater than 1000.
{ while ($1 ~ /^\./) { if ((getline) <= 0) break; if ($2 > 1000) break } }
continue - Skip remainder of loop
continue can be used in the body of a for, while or do loop. The continue statement causes the current iteration to be terminated and the next one to begin.
Syntax
continue
Example
Print even fields only:
{ i=0; while(++i <= NF) { if(i%2) continue; else print $i } }
do - Execute iteratively
The statement in a do loop (or a do while loop) is executed iteratively while a specified condition continues to be satisfied. In contrast to the while loop, the statement in a do loop is always executed at least once.
Syntax
do statement while (expression)
statement
Statement that is executed in each iteration of the loop. If several statements are to be executed, they have to be grouped together in braces ({ }) and separated by semicolons or linefeed characters.
expression
Expression (see “Expressions”) that specifies the condition.
Example
Print out the individual fields of a record:
{ i=0; do {print $(++i)} while (i < NF) }
exit - Terminate the awk program
exit terminates the awk program.
If an END section is present, awk executes the action specified in it; if not, the program is terminated immediately.
Syntax
exit
Example
If the commercial at symbol @ appears in the input, print the result and terminate processing:
... /@/ {exit} ... END {print ergebnis}
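A complete, runnable variant of this outline is sketched below. The input and the variable name total are made up. The point is that the END action still runs after exit, while the record following the @ line is never read.

```shell
# Add up field 2 until a record containing @ is found, then stop.
result=$(printf 'a 1\nb 2\n@\nc 3\n' |
  awk '/@/ { exit }
       { total += $2 }
       END { print total }')
printf '%s\n' "$result"
```

The result is 3, because the value 3 in the last line is not added.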
for - Loop counter
The statement in a for loop is executed iteratively while a condition continues to be satisfied.
Syntax
for(expr1; expr2; expr3) statement |
expr1
Expression (see “Expressions”).
expr1 is evaluated once at the start of the for statement. expr1 is often used to initialize incrementing variables.
Example: i=1
expr2
Expression (see “Expressions”).
expr2 is evaluated before each iteration. The specified statement is executed only if expr2 is non-zero (true); otherwise, the loop is terminated.
Example: i<10
expr3
Expression (see “Expressions”).
expr3 is evaluated after each iteration. When incrementing variables are used, expr3 increments the variable.
Example: i++
statement
Statement that is executed in each iteration of the loop. If several statements are to be executed, they have to be grouped together in braces {}.
Example
Print out the fields of the current record in reverse order.
{for(i=NF; i>0; i--) print $i}
for - Looping an array
This variant of the for statement is a special awk facility for the handling of arrays.
Syntax
for(index in array) statement |
index
Variable (see Basic elements) that assumes the index of each element of array in turn, in random order. The index can be numeric or alphanumeric.
array
Array to be processed.
statement
Statement to be executed for each array element. If several statements are to be executed, they have to be grouped together in braces { }.
Example
The array named month contains the number of days in each month. Each array element is subscripted with the name of the month, e.g. month["January"]=31.
The following awk program prints the name of each month together with the number of days in it.
$ awk ' BEGIN { month["January"]=31; \
> month["February"]=28; \
> month["March"]=31; \
> month["April"]=30; \
> month["May"]=31; \
> month["June"]=30; \
> month["July"]=31; \
> month["August"]=31 } \
> END { for(i in month) print i,"has",month[i],"days" } '
May has 31 days
August has 31 days
July has 31 days
April has 30 days
June has 30 days
January has 31 days
March has 31 days
February has 28 days
if - Conditional statement
The statement in an if construct is executed if the specified condition is satisfied.
Syntax
if(expr) statement1 [else statement2] |
expr
Expression (see “Expressions”) that defines the condition to be satisfied. If expr is non-zero (true), statement1 is executed.
statement1
Statement to be executed if expr is true. If several statements are to be executed, they have to be grouped together in braces { }.
statement2
Statement to be executed if expr is false. If several statements are to be executed, they have to be grouped together in braces { }.
Example
If field 1 is greater than field 2, fields 2 and 3 are printed; if not, fields 4 and 5 are printed:
{ if($1 > $2) print $2, $3; else print $4, $5 }
next - Skip to the next input record
The next statement causes awk to suspend processing of the current record; statements that follow next are not applied to the current record. awk then reads the next input record. NR, NF, FNR, $0, and $1 to $NF are reset.
Difference between next and the getline function:
getline sets the current record to the next one. Statements that follow getline are executed using the next record’s values for the $ variables and for NR, NF, and FNR.
Syntax
next
Example
Records that begin with a dot are ignored:
{ if ($1 ~/^\./) next }
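The difference between next and getline described above can be made visible with two otherwise identical programs. The input is made up. It has two consecutive lines starting with a dot, followed by an ordinary line.

```shell
# With next, processing restarts at the first rule for each new record,
# so both dot lines are skipped.
with_next=$(printf '.x\n.y\nA\n' |
  awk '$1 ~ /^\./ { next } { print NR ": " $0 }')
printf '%s\n' "$with_next"

# With getline, the following record is read in place and the remaining
# rules run on it without re-testing the first rule, so ".y" is printed.
with_getline=$(printf '.x\n.y\nA\n' |
  awk '$1 ~ /^\./ { getline } { print NR ": " $0 }')
printf '%s\n' "$with_getline"
```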
while - Execute iteratively
The statement in a while loop is executed iteratively while a specified condition continues to be satisfied.
Syntax
while(expr) statement |
expr
Expression (see “Expressions”) that specifies the condition.
statement
Statement that is executed in each iteration of the loop. If several statements are to be executed, they have to be grouped together in braces { }.
Example
Print all input fields, writing each field in a separate output line:
{ i = 1; while (i <= NF) { print $i; i++ } }
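As a further sketch, a while loop can accumulate a value over all fields of a record. The input line and the variable names sum and result are illustrative assumptions.

```shell
result=$(echo '3 4 5' | awk '{
    i = 1; sum = 0
    while (i <= NF) {      # the loop body is grouped in braces
        sum += $i
        i++
    }
    print "fields:", NF, "sum:", sum
}')
echo "$result"
```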
Functions
awk provides a wide range of built-in functions and also offers you the option of defining functions of your own:
Syntax
function name(arg,...) {statements}
The {statements} may be preceded by a newline character. There may also be blank lines within the braces {...}. A function definition has the same precedence as pattern {action} pairs in the main section of an awk program.
Within an action section, function calls can be entered anywhere in an expression; a call may also appear in the program text before the function definition, as in the example below. There must be no space between the function name and the left parenthesis when a user-defined function is called.
Nested and recursive function calls are legal.
Even where a function does not require its arguments to be enclosed in parentheses, it is good practice to use them to make the program easier to read.
When you pass an array as an argument, a pointer to the array is passed (call by reference), which means that you can change the elements of the array from the function. In the case of scalar variables, the value of the variable is copied and passed (call by value), which means that you cannot change the value of the variable from the function.
The scope of function arguments is restricted to the local function, whereas the scope of all other variables is always global. If you need a local variable in a function, define it at the end of the argument list in the function definition. Any variable in the argument list for which no current argument exists is a local variable that starts out uninitialized (0 as a number, the empty string as a string).
As in C, some functions return a result (e.g. exp), while others are procedural in character (e.g. output functions).
The return statement can be used with or without a return value or may be omitted entirely. In the latter case, the return value would be undefined if it were to be accessed.
Example
In the example below, the function named search looks for the string who in the array allnames and returns the index or -1. The third argument, incr, is used as a local variable.
... { print $1, search($1, allnames) } ...
function search(who, allnames, incr) {
    for (incr=0; allnames[incr]; incr++)
        if (index(allnames[incr], who) == 1 && length(allnames[incr]) == length(who))
            return incr
    return -1
}
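The passing rules described above (arrays by reference, scalars by value, extra parameters as local variables) can be checked with a short sketch. The function name bump and all variable names are hypothetical.

```shell
result=$(awk '
function bump(arr, n,    tmp) {   # tmp: extra parameter used as a local variable
    tmp = n + 1
    arr["count"] = tmp            # visible to the caller (call by reference)
    n = 99                        # not visible to the caller (call by value)
}
BEGIN {
    x = 1
    a["count"] = 0
    bump(a, x)
    state = (tmp == "") ? "tmp is unset here" : "tmp leaked"
    print x, a["count"], state
}')
echo "$result"
```

The scalar x keeps its value, the array element is changed, and tmp does not exist outside the function.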
Built-in functions
Input function | |
getline | Read input record |
Output functions | |
print([arg,...]) | Standard output function |
printf(format[,arg,...]) | Formatted output |
Arithmetic functions | |
atan2(y,x) | Arc tangent of y/x |
cos(x) | Cosine |
exp(x) | Exponential function |
int(x) | Truncate to integer |
log(x) | Natural logarithm |
rand() | Return a random number |
sin(x) | Sine |
sqrt(x) | Square root |
srand([x]) | Set the seed (initial value) for rand() |
String functions | |
gsub(re,repl[,instr]) | Global substitution function |
index(str1,str2) | Return first occurrence of substring |
length[(str)] | Return length of string |
match(str,re) | Check whether string str matches regular expression |
split(str,array[,sep]) | Subdivide string |
sprintf(format,e1,e2,...) | Return formatted output as string |
sub(re,repl[,instr]) | Substitution function |
substr(str,m[,n]) | Define substring |
tolower(s) | Convert to lowercase |
toupper(s) | Convert to uppercase |
General functions | |
close(expr) | Close file or pipe |
system(expr) | Call shell command |
The following section describes each of these functions in alphabetical order together with the associated arguments. The argument you specify can either be a constant or an expression (see “Expressions”). awk first evaluates the expression arguments and then applies the function to the computed results.
atan2 - Arc tangent
atan2 calculates the arc tangent of the quotient of two numbers. atan2(y,x) returns the arc tangent of y/x.
Syntax
atan2(y,x)
y,x
Numbers that produce the quotient for which the arc tangent is to be calculated.
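A brief sketch of atan2 in use: the result is an angle in radians, so the value of pi can be derived from it. The variable names pi and result are illustrative.

```shell
result=$(awk 'BEGIN {
    pi = atan2(0, -1)             # the angle of the point (-1, 0) is pi
    printf "%.5f %.5f\n", pi, atan2(1, 1) * 4
}')
echo "$result"
```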
close - Close file or pipe
close closes the specified file or pipe.
Syntax
close(expr)
expr
Name of the file or pipe to be closed, see redirection under “print - Standard output function”.
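A typical reason to call close is to read back a file that the same awk program has just written. The sketch below assumes that the mktemp utility is available for creating a scratch file; the variable names are hypothetical.

```shell
tmp=$(mktemp)                     # scratch file (assumes mktemp exists)
result=$(awk -v f="$tmp" 'BEGIN {
    print "first"  > f
    print "second" > f
    close(f)                      # finish writing before reading the file back
    while ((getline line < f) > 0)
        n++
    close(f)
    print n, "lines written"
}')
rm -f "$tmp"
echo "$result"
```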
cos - Cosine
cos calculates the cosine of a number.
Syntax
cos(x)
x
Number for which the cosine is to be calculated.
exp - Exponential function
exp calculates e to the power of x.
Syntax
exp(x)
x
Number for which e to the power of x is to be computed.
getline - Read a record
awk retrieves a record as directed (see also the control-flow statement next).
getline has several different formats, with the following return values:
1 | successful execution |
0 | end-of-file |
-1 | error |
Syntax
getline
awk reads the next input record from the input file into $0. NR, NF, FNR, $0, and $1 to $NF are reset.
Example
If a record contains %%%, the next record is read. In other words, input records containing %%% are ignored.
/%%%/ {getline}
Syntax
getline < file
awk reads a record from the named file into $0. NF, $0, and $1 to $NF are reset.
file
Name of the file from which a record is to be read.
Syntax
getline var
awk fetches the next input record from the input file and puts it into the variable var. NR and FNR are reset.
var
Variable into which the next record is to be read.
Syntax
getline var < file
awk fetches a record from the named file and puts it into the variable var. NR, NF, FNR, $0, and $1 to $NF remain unchanged.
var
Variable into which the record is to be read.
file
Name of the file from which the record is to be read.
Syntax
command | getline [var]
The output of the named command is redirected to getline. Each getline call in this format causes awk to read the next line from the output of command and write it into $0 or the variable var.
If var is specified, NR, NF, FNR, $0, and $1 to $NF remain unchanged; if not, NF, $0, and $1 to $NF are reset.
This construct is equivalent to calling the C function popen() with mode r.
var
Variable into which the record is to be written.
var not specified: The record is written into $0.
command
Name of the command whose output is to be read.
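The return values listed above make it possible to read the complete output of a command in a loop. This is a minimal sketch; the command string and the variable names cmd, line and out are assumptions for illustration.

```shell
result=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    # getline returns 1 for each line read, 0 at end of output, -1 on error
    while ((cmd | getline line) > 0)
        out = out line ","
    close(cmd)
    print out
}')
echo "$result"
```

Comparing the return value with > 0 avoids an endless loop if the command cannot be read (return value -1).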
gsub - Global substitution function
gsub globally substitutes the string repl for all strings in $0 or instr that match the extended regular expression re.
gsub returns the number of substitutions.
Syntax
gsub(re,repl[,instr])
re
Extended regular expression that specifies the pattern to be matched.
repl
String to be substituted for the strings that match re.
instr
String in which the substitution is to be made.
instr not specified: Substitution is done in $0.
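A short sketch of gsub applied to $0. It relies on the standard awk convention that an ampersand in the replacement string stands for the matched text; the input string is made up.

```shell
result=$(echo 'ToTo-LoTo' | awk '{
    n = gsub(/To/, "[&]")         # & in the replacement is the matched text
    print n, $0
}')
echo "$result"
```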
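index - Search for substrings
index returns the position in str1 (numbered from 1 onward) at which the substring str2 first occurs, or 0 if str2 does not occur in str1.
Syntax
index(str1,str2)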
str1
String in which index looks for the substring.
str2
Substring that index looks for.
Example
Comparing the string "ToTo-LoTo" with "To"
index("ToTo-LoTo","To") returns 1.
int - Truncate to integer
int truncates its argument toward zero, i.e. it discards the fractional part.
Syntax
int(x)
x
Number that is to be truncated to its integer part.
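The effect of truncation on positive and negative values, and on a string with a leading number, can be sketched as follows (the values are arbitrary):

```shell
result=$(awk 'BEGIN { print int(3.7), int(-3.7), int("12abc") }')
echo "$result"
```

Note that -3.7 becomes -3, not -4, because the fractional part is simply dropped.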
length - Return length
length returns the length of a string.
Syntax
length[(str)]
str
length returns the length of string str.
str not specified:
length returns the length of the current input record $0.
log - Logarithm
log calculates the natural (base e) logarithm of a number.
Syntax
log(x)
x
Number whose natural log is to be computed.
match - Match regular expressions
match checks whether a string in str matches the extended regular expression in re. If a matching string is found, match returns the character position in str (numbered from 1 onward) at which the string begins; if not, it returns 0.
The variable RSTART is set to the return value of match; RLENGTH is set to the length of the matching string (or -1 if no matching string is found).
Syntax
match(str,re)
str
String in which the pattern is to be matched.
re
Extended regular expression.
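RSTART and RLENGTH are commonly combined with substr to extract the matched text. The following is a sketch with a made-up string s:

```shell
result=$(awk 'BEGIN {
    s = "order 4711 shipped"
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    # no match: return value 0, RSTART 0, RLENGTH -1
    print match(s, /xyz/), RSTART, RLENGTH
}')
echo "$result"
```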
print - Standard output function
print is the standard output function. print outputs either the current record or the specified arguments and terminates its output with the output record separator ORS. For further details refer to "Output format".
Syntax
print(arg1[[,]arg2 ...])[redirection]
No argument specified:
print writes the current input record on standard output.
arg1 arg2
Arguments that are to be printed. print evaluates the expression arguments and concatenates the results in the order in which the arguments are specified.
arg1,arg2
Arguments that are to be printed. print outputs the evaluated expression arguments in the specified order, separated by the output field separator OFS if they are separated by commas in the print statement.
redirection
Output can be redirected to a file or piped to a program. You can use up to 10 output files.
redirection can be in the form of:
>file
The output is written to the named file. The former contents of file are deleted the first time print is called. All subsequent print or printf outputs to file in the same awk program are appended to the end of file. Unless explicitly closed, file remains open until the end of the awk program.
>>file
The output is appended to the previous contents of file. Unless explicitly closed, file remains open until the end of the awk program.
|prog
The output is piped to the program named prog.
You are only permitted to open one pipe to prog within an awk program, but you can pipe any number of print or printf outputs to it.
This construct is equivalent to calling the C function popen() [4] with mode w.
Unless explicitly closed, the pipe remains open until the end of the awk program.
The file or program name can be specified directly (enclosed in "...") or via a variable that evaluates to the file name.
Caution!
If you redirect output to the input file, the input file will be destroyed without any warning!
Output format
print outputs integers in decimal and prints strings at full length. Apart from that, the output format is contingent on the following predefined variables:
OFS - output field separator
OFS is one space by default. If you wish, you can assign any one character to OFS to change the output field separator.
ORS - output record separator
ORS is the newline character by default. If you wish, you can assign any one character to ORS to change the output record separator.
OFMT - floating point output format
OFMT defines the output format for floating point values and is set to "%.6g" by default. This means that a floating point number is printed with a maximum of 6 significant digits. If you wish, you can assign a different printf format for floating point numbers to OFMT (see “printf - Formatted output” below).
Example
Print the first and second fields, separated by a blank:
{print $1,$2}
Concatenate the first and second fields without an output field separator:
{print $1$2}
or
{print $1 $2}
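The combined effect of OFS, ORS and OFMT can be seen in the following sketch. The separator characters and the input line are arbitrary choices for illustration.

```shell
result=$(echo 'a b' | awk '
BEGIN { OFS = "-"; ORS = ";"; OFMT = "%.2f" }
{
    print $1, $2      # comma: fields joined with OFS
    print $1 $2       # no comma: plain concatenation
    print 3.14159     # non-integer number: formatted with OFMT
}')
echo "$result"
```

Because ORS is a semicolon here, all three print statements end up on one output line.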
printf - Formatted output
printf is the output function for formatted output. The output format can be specified as in the standard printf() function in C.
Syntax
printf(format[,arg,...])[redirection]
format
String defining the output format. The output format comprises plain characters and format elements (conversion specifications). Printable characters are output unaltered. The special characters listed in the "Basic elements" section are converted immediately. For example, \n sets the position to the start of the next line.
All format elements begin with the percent sign. The most common format elements are presented in the following list:
%c | single character |
%d | decimal integer |
%e | floating point number in exponential notation, e.g. 5.234e+2 |
%f | floating point number, e.g. 52.34 |
%g | %e or %f, whichever is shorter |
%o | octal integer (base 8) |
%s | character string |
%u | unsigned decimal integer |
%x | hexadecimal integer (base 16) |
arg
Arguments that are to be printed.
printf evaluates the expression arguments, allocates them in the given order to the specifications in format, and outputs them in the appropriate format.
If the format element is incompatible with the argument, e.g. a numeric format specification for an alphanumeric argument, a 0 is printed.
If there are more arguments than format elements, the excess arguments are ignored, i.e. not printed.
If there are more format elements than arguments, an error message is issued.
redirection
Redirection is as for print.
redirection not specified:
printf prints on standard output.
Example
Field 1 is printed as a decimal number with at least 2 positions, followed by ** as a separator, followed by field 2 as a string of at least 5 characters, followed by newline:
{ printf("%2d**%5s\n", $1,$2) }
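As in C, a format element may also carry a flag, a field width and a precision. The sketch below assumes the usual C printf conventions (- for left alignment, .2 for two decimal places); the values are arbitrary.

```shell
result=$(awk 'BEGIN {
    # %-6s: left-aligned string, %5.2f: width 5 with 2 decimals, %x: hexadecimal
    printf("%-6s|%5.2f|%x\n", "ab", 3.14159, 255)
}')
echo "$result"
```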
rand - Return a random number
rand returns a random number r, where 0 <= r < 1.
Syntax
rand()
Also refer to srand.
sin - Sine
sin returns the sine of a number.
Syntax
sin(x)
x
Number whose sine is to be computed.
split - Subdivide strings
split divides a string into substrings and stores each substring as an element in an array. The elements are subscripted in ascending order, starting with 1.
split returns the number of array elements.
Syntax
split(str,array[,sep])
str
String that is to be split.
array
Name of the resulting array.
sep
Extended regular expression specifying the characters that act as a separator between the substrings in str.
sep not specified:
FS is used as the separator.
Example
The input
{ s=split("january:february:march", months, ":"); for(i=1; i<=s; i++) print months[i]; }
produces the output
january
february
march
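Since sep is an extended regular expression, several different separator characters can be covered in one call. A minimal sketch with a made-up timestamp string:

```shell
result=$(awk 'BEGIN {
    # split on a hyphen, a blank or a colon
    n = split("2024-01-15 10:30", p, /[- :]/)
    print n, p[1], p[4]
}')
echo "$result"
```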
sprintf - Return formatted output as a string
sprintf formats in exactly the same way as printf, but there is no direct output. sprintf instead returns the formatted output as a string, which could then be assigned to a variable or used for a similar purpose.
Syntax
sprintf(format,arg,...)
format
String defining the output format (see “printf - Formatted output”).
arg
Arguments that are to be output (see “printf - Formatted output”).
Example
The following awk program fragment produces the same output as the example given under printf. The format string contains no \n here because print appends the output record separator itself.
{ x = sprintf("%2d**%5s", $1,$2); print x }
sqrt - Calculate the square root
sqrt calculates the square root of a number.
Syntax
sqrt(x)
x
Number whose square root is to be computed.
srand - Set the seed for the rand function
srand sets the seed (starting point) for the rand function to the number x, or to the current time if no argument is specified.
Syntax
srand([x])
x
Number that is to serve as the seed for rand.
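The sketch below illustrates two common uses: repeating a sequence by reusing a seed, and scaling rand() to an integer range. The seed 42 and the range 1 to 6 are arbitrary assumptions.

```shell
result=$(awk 'BEGIN {
    srand(42); a = rand()
    srand(42); b = rand()         # same seed, same first number
    same = (a == b)
    r = int(rand() * 6) + 1       # integer between 1 and 6
    inrange = (r >= 1 && r <= 6)
    print same, inrange
}')
echo "$result"
```

The actual random numbers differ between awk implementations, which is why only the two checks are printed.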
sub - Substitution function
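sub substitutes the string repl for the first string in $0 or instr that matches the extended regular expression re.
sub returns the number of substitutions (0 or 1).
Syntax
sub(re,repl[,instr])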
re
Extended regular expression that specifies the pattern to be matched.
repl
String to be substituted for the strings that match re.
instr
String in which the substitution is to be made.
instr not specified:
The substitution is done in $0.
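The difference between sub and gsub is easiest to see side by side. In this sketch the strings s and t are hypothetical and the substitution is made in a named variable rather than in $0.

```shell
result=$(awk 'BEGIN {
    s = "a-b-c"; t = s
    n1 = sub(/-/, "+", s)         # first match only
    n2 = gsub(/-/, "+", t)        # every match
    print n1, s, n2, t
}')
echo "$result"
```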
substr - Define a substring
substr extracts a substring from a string.
Syntax
substr(str,m[,n])
str
String from which the substring is to be extracted.
m
Position in str at which the substring begins. Character positions are numbered consecutively from left to right, starting with one.
n
Maximum length of the substring.
n not specified:
The substring extends to the end of str.
Example
The input
{ x = substr("060789",3,2); print "Month = "x }
produces the output:
Month = 07
system - Call shell command
system executes the specified shell command and returns its exit status.
Syntax
system(command)
command
Name of the shell command to be executed.
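The exit status returned by system can be used to react to a failed command. A minimal sketch; the commands true and exit 3 are stand-ins chosen for illustration.

```shell
result=$(awk 'BEGIN {
    ok  = system("true")          # command succeeds: status 0
    bad = system("exit 3")        # the shell exits with status 3
    print ok, bad
}')
echo "$result"
```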
Error
If an awk program contains errors, awk issues corresponding error messages and exits immediately. The error messages indicate the cause of the error, if detectable by awk, and the awk program line in which awk thinks the error is to be found.
Typical error messages are:
awk: syntax error at source line xxx
Line xxx of the awk program contains a syntax error.
awk: illegal statement source line number xxx
Line xxx of the awk program contains an illegal statement.
Locale
The following environment variables affect the execution of awk:
LANG
Provide a default value for the internationalization variables that are unset or null.
If LANG is unset or null, the corresponding value from the implementation-specific default locale will be used. If any of the internationalization variables contains an invalid setting, the utility will behave as if none of the variables had been defined.
LC_ALL
If set to a non-empty string value, override the values of all the other internationalization variables.
LC_COLLATE
Determine the locale for the behavior of ranges, equivalence classes and multicharacter collating elements within regular expressions and in comparisons of string values.
LC_CTYPE
Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single- as opposed to multi-byte characters in arguments) and input files, the behavior of character classes within regular expressions, the identification of characters as letters, and the mapping of upper- and lower-case characters for the toupper and tolower functions.
LC_MESSAGES
Determine the locale that should be used to affect the format and contents of diagnostic messages written to standard error.
LC_NUMERIC
Determine the representation of the radix character, the exponentiation symbol and the digit grouping character.
NLSPATH
Determine the location of message catalogs for the processing of LC_MESSAGES.
awk Examples
Example 1
Output all input lines in which field 3 is greater than field 5:
$ awk '$3 > $5' file
Since no action has been specified, awk prints the selected lines by default.
Example 2
Print every 10th line of a file:
$ awk '(NR % 10) == 0' file
Example 3
Print the second to last and the last field in each line, separated by a colon:
$ awk 'BEGIN {OFS=":"} \
> {print $(NF-1), $NF}' file
If a line consists of a single field, the entire line is output twice, separated by a colon (first $0, then $1).
Example 4
Add up the values of the first field of every line and print the total and average at the end:
$ awk '{s += $1} \
> END {print "Total: ", s, "Average: ", s/NR}'\
> file
Example 5
Find a preprocessor if directive, i.e. a range of lines in which the first line begins with #if and the last line with #endif:
$ awk '/^#if/, /^#endif/' file
Example 6
Print all lines in which the first field differs from that of the previous line:
$ awk '$1 != prev { print; prev = $1 }' file
Example 7
file contains a list of data about young people, with the second field containing one of the entries school, university, apprenticeship or elsewhere. For statistical purposes, you want to count how many are at school and university:
$ awk '$2 ~ /school/ {incr["school"]++}
> $2 ~ /university/ {incr["university"]++}
> END {print "school:" incr["school"]; \
> print "university:" incr["university"]} ' file
Example 8
The file contents lists the table of contents of a text. The table of contents is organized in decimal classification and has the format:
1. Foreword
2. Introduction
3. The Game of Chess
3.1. History
3.2. Rules
3.2.1 Setting Up the Figures
. . .
4. The Game of Checkers/Draughts
4.1. History
. . .
8. Index
The following awk program can be used to give the list a more orderly format:
$ awk '{$1=$1"      "; \
> $1=substr($1,1,6); \
> print $0} ' contents >> con.form
The output lines are prepared in the following stages: First, six blanks are added to the end of the first field ($1=$1"      "). Then the first field is truncated to six characters. Thus the first field of each line is 6 characters long and, because print $0 inserts the output field separator (one blank) after it, field 2 always starts at column 8. The output in the file con.form will be as follows:
1.     Foreword
2.     Introduction
3.     The Game of Chess
3.1.   History
3.2.   Rules
3.2.1  Setting Up the Figures
. . .
4.     The Game of Checkers/Draughts
4.1.   History
. . .
8.     Index
Example 9
The following awk program in the file prog prints the number of fields and the actual fields of each record. The record separator has been redefined as the dollar sign. The field separators are thus blanks, tabs, and the newline character:
BEGIN { RS="$"; printf "Record\tNum" }
{ printf ("\n%4d\t%3d\t", NR, NF);
  for(i=1;i<=NF; i++) printf "%s:", $i }
END {print"\n"}
The file text contains the following text:
first record$ second record $ $ fourth and last record$
The call:
$ awk -f prog text
returns:
Record  Num
   1      2     first:record:
   2      2     second:record:
   3      0
   4      4     fourth:and:last:record:
   5      0
Example 10
You now change the file text to:
&& first&&record$second record$$fourth and& last record&
and call awk again, this time using the -F option to change the field separator to &.
$ awk -F"&" -f prog text
The output returned is:
Record  Num
   1      6     :::first::record:
   2      1     second record:
   3      0
   4      8     fourth:and::last::record:::
This example illustrates how fields are separated when a non-standard separator is used. The first line (&&) of the text file is a part of the first record and now yields 3 fields, for example, because each individual separator in a string of separators (&&) is counted, and the newline implicitly acts as a separator as well (2 & + 1 newline = 3).
See also
egrep, fgrep, grep, lex, sed