Awk & Regular Expressions
CSCI-620
Dr. Bill Mihajlovic

awk Text Editor
awk is named after its developers: Aho, Weinberger, and Kernighan.
awk is a UNIX utility.
The awk command uses an awk program to scan text files or standard input in order to display specific data, change the data format, and add text to existing data.

Awk
awk is a pattern scanning and processing language.
awk searches one or more input files for lines that match specified patterns and then performs the associated actions, such as writing the line to standard output or incrementing a counter each time it finds a match.
awk is a programming language that permits easy manipulation of structured data and the generation of formatted reports.

Awk Syntax
The awk utility may receive its instructions as a string of text on the command line or as text read from an awk program file:
# awk pattern {action} infile
The pattern selects lines from the input file, and awk performs the action on every line that the pattern selects.
Braces must enclose the action so that awk can differentiate it from the pattern.

Awk
Two rules apply when either the pattern or the action is omitted:
# awk {action} infile
# awk pattern infile
If a program line does not contain a pattern, awk selects all lines in the input file.
If a program line does not contain an action, awk copies the selected lines to standard output (usually the display, unless you have redirected the output to another program or to a file). The action must be left out entirely: an empty pair of braces { } is an action that does nothing.

awk as a Programming Language
The capabilities of awk extend the idea of text editing into computation, making it possible to perform a variety of data processing tasks, including analysis, extraction, and reporting of data. These are the most common uses of awk.
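
The omission rules can be tried on a small input. The following is a minimal sketch, assuming a POSIX shell and awk; the three-line sample text and the variable names are made up for illustration and do not come from the slides.

```shell
#!/bin/sh
# Hypothetical three-line input, made up for this sketch.
input='red apple
green pear
red cherry'

# No pattern: the action runs on every input line.
every_line=$(printf '%s\n' "$input" | awk '{ print "saw a line" }')

# No action: awk copies each line selected by the pattern to standard output.
red_lines=$(printf '%s\n' "$input" | awk '/red/')

# Empty braces are an action that does nothing, so nothing is printed.
nothing=$(printf '%s\n' "$input" | awk '/red/ { }')

printf '%s\n' "$every_line" "$red_lines"
```

The first command prints "saw a line" three times, the second prints the two lines containing "red", and the third prints nothing.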

Regular Expression (RE)
Searching text for exactly matching or closely matching patterns is a common problem.
Regular expressions make finding character patterns much easier.
Regular expression metacharacters allow a character position to take on a range of values.

Regular Expression (RE)
An RE is a character pattern that can match numerous similar strings, because it can contain metacharacters that expand the scope of the search beyond a literal string.
Metacharacters are special characters that represent more than their literal meanings.
Quoting is the means of turning off the special meaning of metacharacters.

Editor REs
Editor regular expressions are used in editors and shell commands to find character patterns within files.
Typical editors and other programs that use metacharacters are: vi, ex, edit, view, ed, red, sed, grep, egrep, and expr.

Patterns
An awk pattern is used to conditionally pass control to an action: an action executes only if its pattern matches.
You can use a regular expression, enclosed within slashes, as a pattern.
The ~ operator tests whether a field or variable matches a regular expression.
The !~ operator tests for no match.
You can also use arithmetic and character relational expressions as patterns.
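
The three kinds of regular expression test described above can be compared side by side. This is only a sketch, assuming a POSIX shell and awk; the name and department data are hypothetical and were invented for the example.

```shell
#!/bin/sh
# Hypothetical data: a name and a department on each line (not from the slides).
data='alice sales
bob support
anna sales
carl admin'

# A regular expression between slashes is tested against the whole line.
has_sales=$(printf '%s\n' "$data" | awk '/sales/')

# The ~ operator tests a single field against a regular expression.
starts_with_a=$(printf '%s\n' "$data" | awk '$1 ~ /^a/ { print $1 }')

# The !~ operator selects lines whose field does NOT match.
not_sales=$(printf '%s\n' "$data" | awk '$2 !~ /sales/ { print $1 }')

printf '%s\n' "$has_sales" "$starts_with_a" "$not_sales"
```

Here `$1` and `$2` refer to the first and second whitespace-separated fields of each line.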

Patterns & Operators
You can process arithmetic and character relational expressions with the following relational operators.
Operator   Meaning
<          less than
<=         less than or equal to
==         equal to
!=         not equal to
>=         greater than or equal to
>          greater than

awk Operators
You can combine any of the patterns using the Boolean operators || (OR) and && (AND).
The comma is the range operator. If you separate two patterns with a comma on a single awk program line, awk selects a range of lines beginning with the first line that matches the first pattern. The last line awk selects is the next subsequent line that matches the second pattern. After awk finds the second pattern, it starts the process over by looking for the first pattern again.
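
The range operator and the Boolean operators can be illustrated as follows. This is a sketch under stated assumptions: a POSIX shell and awk, with made-up input in which the words START and STOP mark the two ends of each range.

```shell
#!/bin/sh
# Hypothetical input, made up for this sketch.
log='intro
START
one
STOP
between
START
two
STOP
outro'

# Range operator: select from each START line through the next STOP line,
# then start looking for START again.
ranges=$(printf '%s\n' "$log" | awk '/START/,/STOP/')

# Relational and Boolean operators applied to the first field ($1).
numbers='5
12
20
31'
middle=$(printf '%s\n' "$numbers" | awk '$1 >= 10 && $1 <= 20')
outside=$(printf '%s\n' "$numbers" | awk '$1 < 10 || $1 > 20')

printf '%s\n' "$ranges" "$middle" "$outside"
```

The lines "intro", "between", and "outro" fall outside both ranges, so the range command does not print them.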

Use of Pattern-Matching Metacharacters
Matching filenames within directory files:
  File Name Generation (FNG)
Matching strings within text:
  Editor regular expressions
  Full regular expressions
  Awk regular expressions

? Metacharacter
The ? matches any single character except a leading dot in a filename. A leading dot must be matched explicitly.
# echo ?
F a f b
#

* Metacharacter
The * matches any number of characters, including none, except a leading dot in a filename.
# echo *
F a f b F11 alexf a1
# ls -x a?*
alexf a1
#

Character Class Expression
A character class is a group of characters, any one of which is to be matched.
# ls -x f[abc,+123]
fb f+ f1 f2
# ls -x f[!a-zA-Z0-9]
f+ f-
#
Ranges can be included in character classes by listing two characters, separated by a dash, that define the bounds of the range.
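
The file name generation rules can be checked in a scratch directory. The sketch below assumes a POSIX shell with mktemp available; the file names are hypothetical and were chosen so that no two differ only by case, since some file systems (including those typically used under Cygwin) ignore case.

```shell
#!/bin/sh
# Use the C locale so that ranges such as a-z cover only ASCII letters.
LC_ALL=C
export LC_ALL

# Hypothetical file names created in a scratch directory.
dir=$(mktemp -d)
(cd "$dir" && touch a b a1 alexf fb f+ f- f1 .hidden)

single=$(cd "$dir" && echo ?)                 # names of exactly one character
a_plus=$(cd "$dir" && echo a?*)               # "a" followed by one or more characters
everything=$(cd "$dir" && echo *)             # all names without a leading dot
listed=$(cd "$dir" && echo f[b+1])            # "f" followed by b, + or 1
non_alnum=$(cd "$dir" && echo f[!a-zA-Z0-9])  # "f" followed by a non-alphanumeric
hidden=$(cd "$dir" && echo .h*)               # a leading dot matched explicitly

printf '%s\n' "$single" "$a_plus" "$everything" "$listed" "$non_alnum" "$hidden"
rm -r "$dir"
```

Note that `*` and `?` never pick up `.hidden`; only the pattern `.h*`, which names the dot explicitly, matches it.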

Basic awk Command Format
The basic format of this command consists of the awk command, the instructions enclosed in quotes and curly braces, and the name of the input file. If an input file is not specified, standard input is used (for example, the keyboard).
The following is a basic awk command. The output of the ls -l command is piped to awk. For each line awk receives, the print action is executed, which prints the line to the screen.
$ ls -l | awk '{print $0}'

awk Arguments
When awk reads in a line, it automatically breaks the line into fields. Each field is assigned a variable name. Spaces or tabs are the default delimiters between fields.
The variable names assigned to fields are a dollar sign ($) followed by the number of the field, counting from left to right. The variable name $1 represents the contents of Field 1, the variable name $2 represents the contents of Field 2, and so on. The entire line is represented by the variable name $0.
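
Field splitting can be observed on a single line. This is a minimal sketch, assuming a POSIX shell and awk; the input line is hypothetical and deliberately mixes several spaces with a tab.

```shell
#!/bin/sh
# Hypothetical input line: fields separated by several spaces and a tab.
line=$(printf 'alpha   beta\tgamma')

first=$(printf '%s\n' "$line" | awk '{ print $1 }')   # Field 1
third=$(printf '%s\n' "$line" | awk '{ print $3 }')   # Field 3
whole=$(printf '%s\n' "$line" | awk '{ print $0 }')   # the entire line, unchanged

printf '%s\n' "$first" "$third" "$whole"
```

A run of spaces counts as one delimiter, so the line has three fields, while `$0` keeps the original spacing.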

awk Displays Specific Data
To instruct awk to display specific data (for example, the file owner, file size, and file name), the field variable names are used with the action. Because the fields are not separated by commas here, awk prints them with no space between them.
# ls -l | awk '{print $3 $5 $9}'
user154120dante
user1368dante_1
user1176dat
user1512dir1
user1512dir2
user1512dir3
user1512dir4
user1235file1
user1105file2
user1218file3
user1137file4
#

awk Example Selecting Data
Consider this example. The fstats file contains the data for players:
PPG - points per game
RPG - rebounds per game
APG - assists per game
$ cat fstats
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
$ awk '$2 > 20 { print $1 }' fstats
Smith
Jones
Davis
Johnson
$
This command says: "Read each line from fstats and if the second field is more than 20, print the first field."
This awk command prints the names of the players who have more than 20 PPG.
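
The run-together output of the ls -l example comes from listing the fields without commas. The sketch below contrasts the two forms of print; it assumes a POSIX shell and awk, and it uses one hypothetical line in the style of ls -l output, because real listings vary from system to system.

```shell
#!/bin/sh
# Hypothetical line in the style of "ls -l" output; real listings vary by system.
listing='-rw-r--r-- 1 user1 staff 368 Jan 5 10:00 dante_1'

# Without commas, print concatenates the fields with nothing between them.
joined=$(printf '%s\n' "$listing" | awk '{ print $3 $5 $9 }')

# With commas, print separates the fields with a space (the default output separator).
spaced=$(printf '%s\n' "$listing" | awk '{ print $3, $5, $9 }')

printf '%s\n' "$joined" "$spaced"
```

The first form prints `user1368dante_1` and the second prints `user1 368 dante_1`.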

awk Example Selecting Data
The fstats file contains the data for players:
PPG - points per game is $2
RPG - rebounds per game is $3
APG - assists per game is $4
$ cat fstats
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
$ awk '($2 > 20 && $3 < 7) { print $1 }' fstats
Smith
Jones
$
This command says: "Read each line from fstats; if the second field is more than 20 and the third field is less than 7, print the first field."

awk Example Selecting Data
$ cat fstats
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
$ awk '$1 ~ /^J/ { print $1 }' fstats
Jones
Johnson
$ awk '($1 ~ /^J/ && $4 > 5) { print $1 }' fstats
Jones
$
The first command prints the names of players that begin with 'J'.
The second command prints the names of players that start with 'J' and have more than 5 APG.

awk Operators BEGIN & END
Two unique patterns, BEGIN and END, allow you to execute commands before awk starts its processing and after it finishes.
The awk utility executes the actions associated with the BEGIN pattern before, and with the END pattern after, it processes all the input files.

awk Example BEGIN
You can do things before or after all lines have been read with BEGIN and END!
This command prints a header and all of the data.
$ cat fstats
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
$ awk 'BEGIN { print "Name PPG RPG APG" } { print }' fstats
Name PPG RPG APG
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
$

awk Example END
You can do things before or after all lines have been read with BEGIN and END!
This command prints all of the data followed by a closing line.
$ cat fstats
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
$ awk '{ print } END { print "That is all, folks!" }' fstats
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
That is all, folks!
$
This command says: "Read each line from fstats, print the whole line, and after the last line, print 'That is all, folks!'."

awk Example Math - awk even does math!
$ cat fstats
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
$ awk '$2 > 20 { total = total + 1 } \
END { print total, "players had more than 20 PPG" }' fstats
4 players had more than 20 PPG
$ awk '$2 > maxppg { maxppg = $2; player = $1 } END { print player, "had the most PPG, with", maxppg }' fstats
Smith had the most PPG, with 26.4
$
The first command counts the number of players with more than 20 PPG.
The second command says: "Read each line from fstats; if the second field is more than maxppg, make maxppg the second field and make player the first field. After all lines have been read, print the line 'player had the most PPG, with maxppg'."

awk Example Math - awk even does math!
To calculate the average RPG for all players, read each line from fstats and add up the RPG values.
$ cat fstats
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
$ awk '$2 > 20 { total = total + 1 } \
END { print total, "players had more than 20 PPG" }' fstats
4 players had more than 20 PPG
$ awk '{ totrpg = totrpg + $3; count = count + 1 } END { print "Average RPG is", totrpg/count }' fstats
Average RPG is 7.34
$
The first command counts the players with more than 20 PPG.
The second command says: "Read each line from fstats; add the third field to totrpg and add 1 to count. After all lines have been read, print the line 'Average RPG is totrpg / count'."

Running an awk program from a file
Finally, you can store awk programs in files, so you do not have to re-enter long awk commands. For instance, to run the previous command from a file, create a file (let's call it ex1.awk) containing the program:
$ cat ex1.awk
{ totrpg = totrpg + $3; count = count + 1 }
END { print "Average RPG is", totrpg/count }
$ awk -f ex1.awk fstats
Average RPG is 7.34
$
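
A program file can hold several rules at once. The following sketch combines a BEGIN rule, a pattern rule, and an END rule in one file and runs it with awk -f. It assumes a POSIX shell and awk with mktemp available; the file name report.awk is hypothetical, and the data is the fstats file from the slides.

```shell
#!/bin/sh
# Scratch directory for the data and program files.
dir=$(mktemp -d)

# The fstats data from the slides: name, PPG, RPG, APG.
cat > "$dir/fstats" <<'EOF'
Smith 26.4 5.5 7.2
Jones 23.7 5.2 6.0
Davis 21.8 9.4 3.7
Johnson 20.8 10.5 3.0
Williams 18.8 6.1 9.9
EOF

# report.awk is a hypothetical file name chosen for this sketch.
cat > "$dir/report.awk" <<'EOF'
# Runs once, before any input is read.
BEGIN   { print "Name PPG" }
# Runs for each line whose second field is more than 20.
$2 > 20 { print $1, $2; count = count + 1 }
# Runs once, after the last line has been read.
END     { print count, "players had more than 20 PPG" }
EOF

report=$(awk -f "$dir/report.awk" "$dir/fstats")
printf '%s\n' "$report"
rm -r "$dir"
```

The report lists a header, the four players above 20 PPG with their scores, and a final count line. Keeping each rule on its own line in the file avoids the quoting and line continuation needed on the command line.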

Homework
Edit and save the file fstats. Add two more lines with your name or the name of a friend.
Repeat all examples shown in the slide presentation.
Enclose your screenshots.

Cygwin Shell and Awk Utility
You may use the Cygwin Linux-like shell to demonstrate the awk examples.
Place the file fstats in the Cygwin home directory.

Cygwin Shell and Awk Utility
Check the version of your awk utility.

Where is your Root Directory?
Be careful with directories. Use your home directory.

Place the fstats File in your Home Directory

Awk in Janoshell Differs a Bit
Different versions of the awk utility may have different option flags and may behave differently. However, all versions should handle regular expressions in the same way.

Place the Input File fstats in the Administrator Directory

Do All Slide Examples
Do all awk examples from the slide presentation in both shells, Cygwin and the Janotech shell. Observe all differences.

The End