Essentials for Scientific Computing: Stream editing with sed and awk

Ershaad Ahamed
TUE-CMS, JNCASR
May 2012

1 Stream Editing

sed and awk are stream processing commands: programs that accept input text, transform it, and write the result to the output. They can therefore be part of a shell pipeline in much the same way as uniq, nl and sort, which you have seen earlier and which also accept input, perform some transformation and write the result to output. The difference is that, while commands like uniq and sort perform a predefined transformation of the input, sed and awk are programmable. They have their own languages for specifying the rules and transformations to be applied to the input. This makes them powerful and flexible tools that can perform complex transformations as part of a shell pipeline.

2 Regexes and Metacharacters

As we progress through the sections below, we will be using patterns in which certain characters have special meanings. Although some of these characters might be familiar from our earlier discussion of glob expressions, their meanings are not the same and should not be confused with glob expression syntax. These special characters are referred to as metacharacters, and they are used to build patterns called Regular Expressions, or Regex for short. While glob expressions are used to create patterns that match pathnames, regular expressions are much more extensive and can be used to match and manipulate textual data in general.

The most commonly used regular expression metacharacters are *, ., +, ^, $, and parentheses (), among others. You will see that we precede many of the metacharacters with a \. This is called escaping, and we do it to inform the interpreter that the character should be interpreted as a special symbol and not literally. In sed's default syntax this applies to characters such as +, ( and ); others, such as ., *, ^ and $, are special without a backslash.
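To make the difference from glob syntax concrete, here is a minimal sketch using the sed substitution command on a few made-up words. The input words and the replacement string MATCH are illustrative assumptions, not part of the original notes.

```shell
#!/bin/sh
# Regex '*' means "zero or more of the preceding item" (here, the letter a),
# unlike the glob '*', which by itself stands for any run of characters.
star_out=$(printf 'ct\ncat\ncaat\n' | sed -e 's/ca*t/MATCH/')

# Regex '.' matches exactly one character of any kind.
dot_out=$(printf 'cat\ncot\nct\n' | sed -e 's/c.t/MATCH/')

# '^' and '$' anchor a match to the start and the end of the line.
anchor_out=$(printf 'cat\nconcat\n' | sed -e 's/^cat$/MATCH/')

printf '%s\n' "$star_out" "$dot_out" "$anchor_out"
```

In the first case all three words are replaced, since ca*t also accepts zero occurrences of a. In the second, ct is left alone because . must consume one character. In the third, concat is left alone because of the anchors.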

3 sed

3.1 The s Command

One of the most common uses of sed is to replace one string with another. Consider the following text file.

    Teh war of the worlds, teh day of teh year
    Statehouse has teh in it
    This is the third line

We want to replace all occurrences of the typo teh with the. To do that we use the following sed command.

    cat text.txt | sed -e 's/teh/the/' | nl

In the command line above, cat reads the file text.txt containing our text and writes it to stdout. Since we are using a pipe to connect it to sed, the data written to stdout becomes the stdin of the sed command. The -e option tells sed that the argument following it should be interpreted as sed commands. In this example the sed commands, or script, is s/teh/the/. Here s is the sed substitution command. Text matching the pattern between the first and second / is replaced with the string between the second and third /. Here the pattern to replace is the literal string teh. As a convenience, we also pipe the output of sed through nl so that we get line numbers.

The sed command operates by reading in each line of the input, applying the commands specified (here, the s command) and then printing out the modified line. This is done for each line of the input, until the input ends. The output of this example will be.

    1  Teh war of the worlds, the day of teh year
    2  Statheouse has teh in it
    3  This is the third line

We have a few observations to make here.

1. The word Teh on line 1 was not substituted. This is because Teh (with an uppercase T) does not match the pattern teh that we specified for the s command.
2. Only the first occurrence of teh on line 1 and on line 2 was replaced. This is the default behaviour of the s command.
3. The teh inside the word Statehouse on line 2 was also substituted with the.

Let us try to fix the problem in item 1. The s command of sed accepts certain flags after the final /. These flags modify the functioning of the s command. One of these flags is i, which makes the pattern matching case insensitive.

    cat text.txt | sed -e 's/teh/the/i' | nl

The output is now.

    1  the war of the worlds, teh day of teh year
    2  Statheouse has teh in it
    3  This is the third line

The Teh has been replaced, but since the replacement string is the (with a lowercase t), the replacement has the wrong case. There are a few ways in which we can work around this. One way is to capture the match. In the example above, our sed command can match Teh, teh, TEH or any other combination of upper and lower case, since we have specified a case insensitive match. When sed finds a match, we can store the actual string matched, since it can be any of the variants above. We do this by enclosing the part of the pattern we are interested in capturing in capturing parentheses \( and \). Our pattern will now look like.

    \(t\)eh

This means that if the t in our pattern matches a t in the actual input, t is captured. If instead a T is matched, T is captured. Now we need to place the captured t or T in our replacement string. We can refer to text that was captured using capturing parentheses inside the replacement string by using \1, \2, etc., which refer to the first, second, etc. pair of capturing parentheses. In our example above, \1 will contain either t or T after a match. So our new command will look like.

    cat text.txt | sed -e 's/\(t\)eh/\1he/i' | nl

Output is now.

    1  The war of the worlds, teh day of teh year
    2  Statheouse has teh in it
    3  This is the third line

Moving on to observation 2. This default behaviour of the s command can be modified by passing the g flag, which tells sed to replace all occurrences of the match on each line. Our script becomes.

    cat text.txt | sed -e 's/\(t\)eh/\1he/ig' | nl

Output is.

    1  The war of the worlds, the day of the year
    2  Statheouse has the in it
    3  This is the third line

Moving on to item 3. We need to tell sed that it should not replace teh if it is a substring, that is, if it is part of a larger word. We do this by placing the word boundary pattern \b on either side of the word we would like to match (here teh). \b represents a word boundary, that is, the position between a word character and a non-word character (or the start or end of the line). Word characters are letters, digits and the underscore character. Now our script is.

    cat text.txt | sed -e 's/\b\(t\)eh\b/\1he/ig' | nl

The output is.

    1  The war of the worlds, the day of the year
    2  Statehouse has the in it
    3  This is the third line

Which looks good.
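The variants of the s command developed in this section can be compared side by side. The sketch below assumes GNU sed, since the i flag and \b are GNU extensions. It also assumes an input file whose second line, Statehouse has teh in it, is inferred from the outputs quoted in the text. For brevity it passes the file name to sed directly instead of piping through cat, and it skips nl.

```shell
#!/bin/sh
# Sample input; the second line is inferred from the outputs quoted in the text.
tmp=$(mktemp)
printf '%s\n' \
  'Teh war of the worlds, teh day of teh year' \
  'Statehouse has teh in it' \
  'This is the third line' > "$tmp"

# Step 1: plain substitution; only the first lowercase match on each line.
step1=$(sed -e 's/teh/the/' "$tmp")
# Step 2: add i (ignore case) and capture the t so that its case is preserved.
step2=$(sed -e 's/\(t\)eh/\1he/i' "$tmp")
# Step 3: add g to replace every match on a line.
step3=$(sed -e 's/\(t\)eh/\1he/ig' "$tmp")
# Step 4: add \b on both sides so that only the whole word teh matches.
step4=$(sed -e 's/\b\(t\)eh\b/\1he/ig' "$tmp")

# Print the four results, separated by blank lines.
printf '%s\n\n' "$step1" "$step2" "$step3" "$step4"
rm -f "$tmp"
```

Only the last step leaves Statehouse untouched while still correcting every standalone teh.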

4 Some Examples

4.1 Repeated words

Here is an example of a text file having repeated words.

    The war of the the worlds, the day of the year
    This this is the third third line

Let us start by writing a pattern to match any complete word. You can use a pattern like the one below.

    \b\w\+\b

Remember that \b is for a word boundary. \w is a pattern that matches any word character (letters, digits and underscore). The \+ pattern means to match one or more repetitions of the previous pattern, the previous pattern here being \w. That is followed by a closing \b. The complete expression therefore matches a word.

Now we need to build on this pattern so that it can match the same word repeated again (with a space separating them). Remember that when we need to refer to a previous match, we need to first capture it and then we can use backreferences, which are \1, \2, etc.

    \(\b\w\+\b\) \1

Notice the space between the word-match pattern and the backreference. Using the pattern in a sed script, we have.

    cat text_repeat.txt | sed -e 's/\(\b\w\+\b\) \1/\1/g'

The pattern matches a repeated word, but the capturing parentheses capture only the first of the repeated words. Therefore in the replacement string we use the backreference \1. Output is.

    The war of the worlds, the day of the year
    This this is the third line

Notice that, in the last line, the repeated word This this was not matched because of the difference in case. A quick fix for this is to use the i flag.

    cat text_repeat.txt | sed -e 's/\(\b\w\+\b\) \1/\1/gi'

That fixes it.

4.2 Removing Empty Lines

Consider a file with the text below, which also contains empty lines.

    C 3.102166 11.5549 0.0000
    C 4.343029 10.8749 0.0000
    C 4.343243 9.41218 0.0000
    C 3.102143 8.71322 0.0000
    B 3.100137 7.30638 0.0000
    N 4.341568 6.57610 0.0000
    B 4.345228 5.13343 0.0000
    N 3.103911 4.39795 0.0000
    B 3.100340 2.95305 0.0000
    N 4.341533 2.21948 0.0000
    C 0.620442 8.71323 0.0000
    B 0.618437 7.30639 0.0000
    N 1.859867 6.57611 0.0000
    B 1.863528 5.13344 0.0000
    N 0.622211 4.39797 0.0000
    B 0.618640 2.95306 0.0000
    N 1.859832 2.21949 0.0000
    B 1.863132 0.75964 0.0000
    N 0.622276 0.00000 0.0000

We need to remove the empty lines from the file. It may seem easy to do quickly in an editor, but what if the file had 25000 lines? You saw the s command for sed in the previous examples. Now, we will use the d command. Before that, a word on addresses in sed.

We can precede a sed command with an address. An address restricts the command that follows it so that it is executed only for those lines that match the address. The simplest possible address is a line number. Consider this version of our earlier script for fixing the teh typo.

    cat text.txt | sed -e '2s/\(t\)eh/\1he/ig'

The only difference is the 2 preceding the s command. This tells sed to execute the s command only for the second line in the input. Thus our output will be.

    Teh war of the worlds, teh day of teh year
    Statheouse has the in it
    This is the third line

Suppose we wanted all lines except the second to be processed. The script below would do what is expected. The ! after the address negates it.

    cat text.txt | sed -e '2!s/\(t\)eh/\1he/ig'

Addresses can be of the form N,M, which means the range from line N to line M, inclusive.
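As a minimal sketch of how addresses and the d command might combine, the example below deletes a line range and then deletes only the empty lines. Note the assumptions: the five-line input is hypothetical (three records from the listing above with empty lines placed between them), and the regex address /^$/ is a standard sed feature that has not been introduced in the text so far. This is one possible approach, not necessarily the one the notes go on to use.

```shell
#!/bin/sh
# Hypothetical input: three records with empty lines at positions 2 and 4.
input=$(printf 'C 3.102166 11.5549 0.0000\n\nB 3.100137 7.30638 0.0000\n\nN 4.341568 6.57610 0.0000\n')

# Line-range address N,M: delete lines 2 through 4, inclusive.
range_out=$(printf '%s\n' "$input" | sed -e '2,4d')

# Regex address: /^$/ matches a line with nothing between its start and end,
# so d is applied only to the empty lines.
noblank_out=$(printf '%s\n' "$input" | sed -e '/^$/d')

printf '%s\n---\n%s\n' "$range_out" "$noblank_out"
```

The range form needs the line numbers to be known in advance, while the regex form works wherever the empty lines happen to fall. Lines that contain only spaces or tabs are not empty in this sense and would not be removed by /^$/.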