Essentials for Scientific Computing: Stream editing with sed and awk

Ershaad Ahamed
TUE-CMS, JNCASR
May 2012

1 Stream Editing

sed and awk are stream processing commands: programs that accept input text, transform it, and write the result to the output. They can therefore be part of a shell pipeline in much the same way as uniq, nl and sort, which you have seen earlier, and which also accept input, perform some transformation and write the result to output.

The difference is that commands like uniq and sort perform a predefined transformation of the input, whereas sed and awk are programmable. They have their own languages for specifying the rules and transformations to be applied to the input. This makes them powerful and flexible tools that can perform complex transformations as part of a shell pipeline.

2 Regexes and Metacharacters

As we progress through the sections below, we will be using patterns in which certain characters have special meanings. Although some of these characters might be familiar from our earlier discussion of glob expressions, their meanings are not the same and should not be confused with glob expression syntax. These special characters are referred to as metacharacters, and they are used to build patterns called Regular Expressions, or Regex for short. While glob expressions are used to create patterns that match pathnames, regular expressions are much more extensive and can be used to match and manipulate textual data in general.

The most commonly used regular expression metacharacters are *, ., +, ^, $, and parentheses (), among others. You will see that we precede many of the metacharacters with a backslash (\). This is called escaping, and we do it to inform the interpreter that the character should be interpreted as a special symbol and not literally.
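To make the difference from glob syntax concrete, here is a minimal sketch of how a few of these metacharacters behave inside sed. It assumes GNU sed, since \+ is a GNU extension to basic regular expressions. The input strings and the variable names (star, plus, any_char, literal_dot) are made up for illustration.

```shell
# In a regex, * means "zero or more of the previous item" (not "any string" as in a glob).
star=$(printf 'ac\nabc\nabbc\n' | sed -e 's/ab*c/X/')

# \+ means "one or more of the previous item" (GNU sed; note the backslash).
plus=$(printf 'ac\nabc\nabbc\n' | sed -e 's/ab\+c/X/')

# An unescaped . matches any single character ...
any_char=$(printf 'a.c abc\n' | sed -e 's/a.c/X/g')

# ... while \. matches only a literal dot.
literal_dot=$(printf 'a.c abc\n' | sed -e 's/a\.c/X/g')

printf '%s\n' "$star" "$plus" "$any_char" "$literal_dot"
```

Note that the backslash works in both directions here: it gives + its special meaning, but takes the special meaning away from the dot.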

3 sed

3.1 The s Command

One of the most common uses of sed is to replace one string with another. Consider the following text file.

    Teh war of the worlds, teh day of teh year
    Statehouse has teh in it
    This is the third line

We want to replace all occurrences of the typo teh with the. To do that we use the following sed command.

    cat text.txt | sed -e 's/teh/the/' | nl

In the command line above, cat reads the file text.txt containing our text and writes it to stdout. Since we use a pipe to connect it to sed, the data written to stdout is redirected to the stdin of the sed command. The -e option tells sed that the argument following it should be interpreted as sed commands. In this example the sed commands, or script, is s/teh/the/. Here s is the sed substitution command. Text matching the pattern between the first and second / is replaced with the string between the second and third /. Here the pattern to replace is the literal string teh. As a convenience, we also pipe the output of sed through nl so that we get line numbers.

The sed command operates by reading in each line of the input, applying the commands specified (here, the s command) and then printing out the modified line. This is done for each line of the input, until the input ends. The output of this example will be:

         1  Teh war of the worlds, the day of teh year
         2  Statheouse has teh in it
         3  This is the third line

We have a few observations to make here.

1. The word Teh on line 1 was not substituted. This is because Teh (with an uppercase T) does not match the pattern teh that we specified for the s command.
2. Only the first occurrence of teh on line 1 and line 2 was replaced. This is the default behaviour of the s command.
3. The teh inside the word Statehouse on line 2 is also substituted with the.

Let us try to fix the problem in item 1. The s command of sed accepts certain flags after the final /. These flags modify the functioning of the s command. One of these flags is i, which makes the pattern matching case insensitive.

    cat text.txt | sed -e 's/teh/the/i' | nl

The output is now:

         1  the war of the worlds, teh day of teh year
         2  Statheouse has teh in it
         3  This is the third line

The Teh has been replaced, but since the replacement string is the (with a lowercase t), the replacement has the wrong case. There are a few ways to work around this. One way is to capture the match. In the example above, our sed command can match Teh, teh, TEH or any other combination of upper and lower case, since we have specified a case insensitive match. When sed finds a match, we can store the actual string matched, since it can be any of these variants. We do this by enclosing the part of the pattern we are interested in within the capturing parentheses \( and \). Our pattern will now look like this.

    \(t\)eh

This means that if the t in our pattern matches a t in the actual input, t is captured. If a T is matched instead, T is captured. What we need to do now is place the captured t or T in our replacement string. We can refer to captured text inside the replacement string by using \1, \2, etc., which refer to the text captured by the first, second, etc. pair of capturing parentheses. In our example, \1 will contain either t or T after a match. So our new command will look like this.

    cat text.txt | sed -e 's/\(t\)eh/\1he/i' | nl

The output is now:

         1  The war of the worlds, teh day of teh year
         2  Statheouse has teh in it
         3  This is the third line

Moving on to observation 2. This default behaviour of the s command can be modified by passing the g flag, which tells sed to replace all occurrences of the match on each line. Our script becomes:

    cat text.txt | sed -e 's/\(t\)eh/\1he/ig' | nl

The output is:

         1  The war of the worlds, the day of the year
         2  Statheouse has the in it
         3  This is the third line

Moving on to item 3. We need to tell sed that it should not replace teh if it is a substring, that is, if it is part of a larger word. We do this by placing the word boundary pattern \b on either side of the word we would like to match (here teh). \b matches at a word boundary, that is, the position between a non-word character and a word character or vice versa, including the start or end of a line next to a word character. Word characters are letters, digits and the underscore character. Now our script is:

    cat text.txt | sed -e 's/\b\(t\)eh\b/\1he/ig' | nl

The output is:

         1  The war of the worlds, the day of the year
         2  Statehouse has the in it
         3  This is the third line

Which looks good.
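The steps of this section can be gathered into one self-contained sketch. It assumes GNU sed, because \b and the i flag are GNU extensions, and it assumes the three-line sample input implied by the outputs above. The temporary file and the variable names (tmp, fixed) are illustrative only.

```shell
# Recreate the assumed three-line sample input in a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  'Teh war of the worlds, teh day of teh year' \
  'Statehouse has teh in it' \
  'This is the third line' > "$tmp"

# \b ... \b  match teh only as a whole word
# \(t\)      capture the t or T that was actually matched
# \1he       reuse the captured letter, so the case is preserved
# i, g       ignore case; replace every match on each line
fixed=$(sed -e 's/\b\(t\)eh\b/\1he/ig' "$tmp")

printf '%s\n' "$fixed"
rm -f "$tmp"
```

Here sed reads the file named on its command line, so the leading cat is not strictly needed. The pipeline form used in the text is equivalent.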

4 Some Examples

4.1 Repeated words

Here's an example of a text file having repeated words.

    The war of the the worlds, the day of the year
    This this is the third third line

Let's start by writing a pattern to match any complete word. You can use a pattern like the one below.

    \b\w\+\b

Remember that \b is for a word boundary. \w is a pattern that matches any word character (letters, digits and underscore). The \+ pattern means one or more repetitions of the previous pattern, the previous pattern here being \w. That is followed by a closing \b. The complete expression therefore matches a word.

Now we need to build on this pattern so that it matches the same word repeated (with a space separating the two). Remember that when we need to refer to a previous match, we first capture it and then use backreferences, which are \1, \2, etc.

    \(\b\w\+\b\) \1

Notice the space between the word-match pattern and the backreference. Using the pattern in a sed script, we have:

    cat text_repeat.txt | sed -e 's/\(\b\w\+\b\) \1/\1/g'

The pattern matches a repeated word, but the capturing parentheses capture only the first of the repeated words. Therefore in the replacement string we use the backreference \1. The output is:

    The war of the worlds, the day of the year
    This this is the third line

Notice that, in the last line, the repeated word This this was not matched because of the difference in case. A quick fix for this is to use the i flag.

    cat text_repeat.txt | sed -e 's/\(\b\w\+\b\) \1/\1/gi'

That fixes it. Note that because nothing follows \1 in the pattern, it would also match a word followed by a longer word that begins with it (for example, the theory). Appending \b after \1 guards against this.

4.2 Removing Empty Lines

Consider a file containing the lines below, with empty lines scattered among them.

    C 3.102166 11.5549 0.0000
    C 4.343029 10.8749 0.0000
    C 4.343243 9.41218 0.0000
    C 3.102143 8.71322 0.0000
    B 3.100137 7.30638 0.0000
    N 4.341568 6.57610 0.0000
    B 4.345228 5.13343 0.0000
    N 3.103911 4.39795 0.0000
    B 3.100340 2.95305 0.0000
    N 4.341533 2.21948 0.0000
    C 0.620442 8.71323 0.0000
    B 0.618437 7.30639 0.0000
    N 1.859867 6.57611 0.0000
    B 1.863528 5.13344 0.0000
    N 0.622211 4.39797 0.0000
    B 0.618640 2.95306 0.0000
    N 1.859832 2.21949 0.0000
    B 1.863132 0.75964 0.0000
    N 0.622276 0.00000 0.0000

We need to remove the empty lines from the file. It may seem easy to do quickly in an editor, but what if the file had 25000 lines? You saw the s command of sed in the previous examples. Now we will use the d command.

Before that, a word on addresses in sed. We can precede a sed command with an address. The address restricts the command that follows to be executed only for those lines that satisfy the address. The simplest possible address is a line number. Consider this version of our earlier script for fixing the teh typo.

    cat text.txt | sed -e '2s/\(t\)eh/\1he/ig'

The only difference is the 2 preceding the s command. This tells sed to execute the s command only for the second line of the input. Thus our output will be:

    Teh war of the worlds, teh day of teh year
    Statheouse has the in it
    This is the third line

Suppose we wanted all lines except the second to be processed. The script below would do what is expected.

    cat text.txt | sed -e '2!s/\(t\)eh/\1he/ig'

Addresses can also be of the form N,M, which means the range from line N to line M, inclusive.
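As a rough sketch of how addresses combine with commands, the example below runs three scripts on a small made-up input (not the coordinate file above). The last script uses a regular expression as the address, a form not covered in the text so far. It is shown here only as one plausible way to pair an address with the d command so that empty lines are deleted. The variable names are illustrative.

```shell
# Made-up six-line sample: lines 2 and 5 are empty.
sample=$(printf 'alpha\n\nbeta\ngamma\n\ndelta')

# N,M address: apply the s command only to lines 3 through 4.
ranged=$(printf '%s\n' "$sample" | sed -e '3,4s/a/A/g')

# N! address: apply the d (delete) command to every line except line 1.
only_first=$(printf '%s\n' "$sample" | sed -e '1!d')

# Regex address (assumption, see above): ^$ matches a line with nothing
# between its start and its end, and d deletes each such line.
no_empty=$(printf '%s\n' "$sample" | sed -e '/^$/d')

printf '%s\n' "$no_empty"
```

A line that contains only spaces or tabs is not empty in this sense and would survive the last script.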