Lecture 5. Essential skills for bioinformatics: Unix/Linux


UNIX DATA TOOLS

Text processing with awk We have illustrated two ways awk can come in handy: filtering data using rules that can combine regular expressions and arithmetic, and reformatting the columns of data using arithmetic. We will learn more advanced use cases by introducing two special patterns: BEGIN and END. The BEGIN pattern specifies what to do before the first record is read in; it is useful for initializing and setting up variables. The END pattern specifies what to do after the last record's processing is complete; it is useful for printing data summaries at the end of file processing.

Text processing with awk Suppose we want to calculate the mean feature length in Homo_sapiens.GRCh38.87.bed: NR is the current record number, so on the last record NR is set to the total number of records processed.
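A minimal sketch of such a command, assuming the start and end coordinates sit in columns 2 and 3 of the BED file:
awk 'BEGIN{ s = 0 } { s += ($3 - $2) } END{ print "mean: " s/NR }' Homo_sapiens.GRCh38.87.bed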

Text processing with awk We can use NR to extract ranges of lines:
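For example, to print only lines 3 through 5 (the filename is illustrative):
awk 'NR >= 3 && NR <= 5' example.bed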

Text processing with awk awk makes it easy to convert between bioinformatics files like BED and GTF. Our previous solution using grep and cut has an error:

Text processing with awk We can generate a three-column BED file from the GTF file as follows. Note that we subtract 1 from the start position to convert to BED format. This is because BED uses zero-indexing while GTF uses 1-indexing.
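A sketch of the conversion, assuming the GTF filename used earlier, with chromosome, start, and end in columns 1, 4, and 5 and header lines beginning with #:
awk 'BEGIN{ OFS="\t" } !/^#/ { print $1, $4-1, $5 }' Homo_sapiens.GRCh38.87.gtf | head -n 3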

Text processing with awk Using sort | uniq -c, we counted the number of features belonging to a particular gene. awk also has a very useful data structure known as an associative array, which behaves like Python's dictionaries or hashes in other languages.

Text processing with awk We can create an associative array by simply assigning a value to a key.
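A sketch that counts how many times each feature type appears for a particular gene (the gene name Lypla1 is illustrative; the feature type is assumed to be in column 3 of the GTF):
awk '/Lypla1/ { feature[$3] += 1 } END{ for (k in feature) print k "\t" feature[k] }' Homo_sapiens.GRCh38.87.gtf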

Text processing with awk This example illustrates that awk is a programming language: within our action blocks, we can use standard programming statements like if, for, and while. However, when awk programs become complex or start to span multiple lines, you should switch to Python.

Stream editing with sed We learned how Unix pipes are fast because they operate on streams of data rather than data written to disk. Additionally, pipes don't require that we load an entire file into memory at once; instead, we can operate on one line at a time. Often we need to make trivial edits to a stream, usually to prepare it for the next step in a Unix pipeline. The stream editor, sed, allows us to do exactly that.

Stream editing with sed sed reads data from a file or standard input and can edit a line at a time. Let's look at a very simple example: converting a file containing a single column of chromosomes in the format chrom1 to the format chr1.
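A sketch of the substitution (the filename chroms.txt is illustrative):
sed 's/chrom/chr/' chroms.txt | head -n 3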

Stream editing with sed We can edit it without opening the entire file in memory. Our edited output stream is then easy to redirect to a new file. In the previous example, we used sed's substitute command, by far the most popular use of sed. sed's substitute takes the first occurrence of the pattern between the first two slashes and replaces it with the string between the second and third slashes. In other words, the syntax of sed's substitute is s/pattern/replacement/.

Stream editing with sed By default, sed only replaces the first occurrence of a match on each line. We can replace all occurrences of strings that match our pattern by setting the global flag g after the last slash: s/pattern/replacement/g. If we need matching to be case-insensitive, we can enable this with the flag I. By default, sed's substitutions use POSIX BRE. As with grep, we can use the -E option to enable POSIX ERE.

Stream editing with sed Most important is the ability to capture chunks of text that match a pattern and use these chunks in the replacement. Suppose we want to capture the chromosome name and the start and end positions from a string containing a genomic region in the format chr1:28427874-28425431, and output these as three columns.

Stream editing with sed ^(chr[^:]+): This matches the text that begins at the start of the line (^ enforces this) and captures everything between ( and ). The capturing pattern begins with chr and matches one or more characters that are not :, our delimiter. We match up to the first : using a character class defined as everything that's not :, i.e. [^:]+.

Stream editing with sed ([0-9]+): Match and capture one or more digits. Finally, our replacement is these three captured groups, interspersed with tabs, \t. Regular expressions are tricky and take time and practice to master.
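Putting the pieces together, a sketch of the full substitution (assuming GNU sed, which interprets \t in the replacement; the example region is the one above):
echo "chr1:28427874-28425431" | sed -E 's/^(chr[^:]+):([0-9]+)-([0-9]+)/\1\t\2\t\3/'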

Stream editing with sed Explicitly capturing each component of our genomic region is one way to tackle this, and it nicely demonstrates sed's ability to capture patterns. However, there are numerous ways to use sed or other Unix tools to parse strings like this.

Stream editing with sed sed 's/[:-]/\t/g': We just replace both delimiters with a tab. Note that we've enabled the global flag, which is necessary for this approach to work. sed 's/:/\t/' | sed 's/-/\t/': For complex substitutions, it can be much easier to use two or more calls to sed rather than trying to do everything with one regular expression. tr ':-' '\t': tr translates all occurrences of the characters in its first argument to the corresponding characters in its second.

Stream editing with sed By default, sed prints every line, making replacements to matching lines. Suppose we want to capture all transcript names from the last column of a GTF file. Some lines of the last column of the GTF file don't contain transcript_id, so sed prints the entire line rather than the captured group.

Stream editing with sed One way to solve this would be to use grep transcript_id before sed. A cleaner way is to disable sed from outputting all lines with -n. Then, by appending p after the last slash, sed will print only the lines it's made a replacement on.

Stream editing with sed This example uses an important regular expression idiom: capturing text between delimiters. 1. First, match zero or more of any character (.*) before the string transcript_id. 2. Then, match and capture one or more characters that are not a quote ([^"]+). This is an important idiom. The brackets make up a character class. Character classes specify which characters the expression is allowed to match. Here, we use a caret (^) inside the brackets to match anything except what's inside these brackets. The end result is that we match and capture one or more non-quote characters.
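A sketch of this approach (assuming GNU sed; the GTF filename is carried over from the earlier awk examples):
sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p' Homo_sapiens.GRCh38.87.gtf | head -n 3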

Stream editing with sed The following approach ((.*) rather than ([^"]+)) will not work, because (.*) is greedy: it matches as much text as possible, so it runs past the closing quote of the transcript name.

Stream editing with sed It is also possible to select and print certain ranges of lines with sed. In this case, we are not doing pattern matching, so we don't need slashes. To print lines 20 through 50 of a file, we use line-range addressing. sed has features that allow you to make almost any type of edit to a stream of text, but for complex stream-processing tasks it can be easier to write a Python script than a long and complicated sed command.
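For example (a sketch; the filename is illustrative), -n suppresses the default printing of every line and p prints only the addressed range:
sed -n '20,50p' example.gtf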

SHELL SCRIPTING

Overview Bash, the shell we have used interactively in this course, is a full-fledged scripting language. Unlike Python, Bash is not a general-purpose language. Bash is explicitly designed to make running and interfacing with command-line programs as simple as possible. For this reason, Bash often takes the role of the glue language of bioinformatics, as it's used to glue many commands together into a cohesive workflow.

Overview Note that Python is a more suitable language for commonly reused or advanced pipelines. Python is a more modern, fully featured scripting language than Bash. Compared to Python, Bash lacks several nice features useful for data-processing scripts: better numeric type support, useful data structures, better string processing, refined option parsing, availability of a large number of libraries, and powerful functions that help with structuring your programs. However, there's more overhead when calling command-line programs from a Python script compared to Bash. Bash is often the best and quickest glue solution.

Writing and running bash scripts Most Bash scripts in bioinformatics are simply commands organized into a re-runnable script, with some features to check that files exist and to ensure that any error causes the script to abort. We will learn the basics of writing and executing Bash scripts, paying particular attention to how to create robust Bash scripts.

A robust Bash header By convention, Bash scripts have the extension .sh. You can create them in your favorite text editor (e.g. emacs, nano, or vi). Anytime you write a Bash script, you should use the following Bash script header, which sets some Bash options that lead to more robust scripts.
#!/bin/bash
set -e
set -u
set -o pipefail

A robust Bash header #!/bin/bash: This is called the shebang, and it indicates the path to the interpreter used to execute this script. set -e: By default, a shell script containing a command that fails will not cause the entire shell script to exit; the shell script will just continue on to the next line. We always want errors to be loud and noticeable. This option prevents that behavior by terminating the script if any command exits with a nonzero exit status.

A robust Bash header Note that this option ignores nonzero statuses in if conditionals. Also, it ignores all exit statuses in Unix pipes except the last one. set -u: This option fixes another default behavior of Bash scripts: any command containing a reference to an unset variable name will still run. It prevents this type of error by aborting the script if a variable's value is unset.

A robust Bash header set -o pipefail: set -e will cause a script to abort if a nonzero exit status is encountered, with some exceptions. One such exception is when a program run in a Unix pipe exits unsuccessfully. Including set -o pipefail will prevent this undesirable behavior: any program that returns a nonzero exit status in the pipe will cause the entire pipe to return a nonzero status. With set -e enabled, this will lead the script to abort.

Running bash scripts Running Bash scripts can be done in one of two ways: 1. bash script.sh 2. ./script.sh While we can run any script with the first approach, calling the script as an executable requires that it has executable permissions. We can set these using: chmod u+x script.sh This adds executable permissions for the user who owns the file. Then, the script can be run with ./script.sh.

Variables Pipelines have numerous settings that should be stored in variables. Storing these settings in variables defined at the top of the file makes adjusting settings and rerunning your pipelines much easier. Rather than having to change numerous hardcoded values in your scripts, using variables to store settings means you only have to change one value. Bash also reads command-line arguments into variables.

Variables Bash's variables don't have data types. It's helpful to think of Bash's variables as strings. We can create a variable and assign it a value with:
results_dir="results/"
Note that spaces matter when setting Bash variables. Do not use spaces around the equals sign.

Variables To access a variable's value, we use a dollar sign in front of the variable's name. Suppose we want to create a directory for a sample's alignment data, called <sample>_aln/, where <sample> is replaced by the sample's name:
sample="CNTRL01A"
mkdir ${sample}_aln/

Command-line arguments The variable $0 stores the name of the script, and command-line arguments are assigned to $1, $2, $3, etc. Bash assigns the number of command-line arguments to $#.
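A minimal sketch (the script name args.sh is hypothetical):
#!/bin/bash
echo "script name: $0"
echo "first argument: $1"
echo "number of arguments: $#"
Running bash args.sh a.fastq b.fastq would print args.sh, a.fastq, and 2.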

Command-line arguments If you find your script requires numerous or complicated options, it might be easier to use Python instead of Bash. Python's argparse module is much easier to use. Variables created in your Bash script will only be available for the duration of the Bash process running that script.

if statement Bash supports the standard if conditional statement. The basic syntax is:
if [commands]
then
    [if-statements]
else
    [else-statements]
fi

if statement A command's exit status provides the true and false. Remember that 0 represents true/success and anything else is false/failure. if [commands]: [commands] could be any command, set of commands, pipeline, or test condition. If the exit status of these commands is 0, execution continues to the block after then; otherwise, execution continues to the block after else.

if statement [if-statements] is a placeholder for all statements executed if [commands] evaluates to true (0). [else-statements] is a placeholder for all statements executed if [commands] evaluates to false. The else block is optional.

if statement Bash is primarily designed to stitch together other commands. This is an advantage Bash has over Python when writing pipelines. Bash allows your scripts to directly work with command-line programs without requiring any overhead to call programs. Although it can be unpleasant to write complicated programs in Bash, writing simple programs is exceedingly easy because Unix tools and Bash harmonize well.

if statement Suppose we wanted to run a set of commands only if a file contains a certain string. Because grep returns 0 only if it matches a pattern in a file and 1 otherwise, we can use it directly as the condition of an if statement. We redirect grep's output to /dev/null to tidy the output of the script, so that grep's matches do not go to the script's standard out.
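A minimal sketch (the pattern and filename are hypothetical):
if grep "chr1" some_file.bed > /dev/null
then
    echo "chr1 found in some_file.bed"
fi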

test Like other programs, test exits with either 0 or 1. However, test's exit status indicates the return value of the test specified through its arguments, rather than exit success or error. test supports numerous standard comparison operators.

test String/integer expression and description:
-z str          String str is null
str1 = str2     str1 and str2 are identical
str1 != str2    str1 and str2 are different
int1 -eq int2   Integers int1 and int2 are equal
int1 -ne int2   int1 and int2 are not equal
int1 -lt int2   int1 is less than int2
int1 -gt int2   int1 is greater than int2
int1 -le int2   int1 is less than or equal to int2
int1 -ge int2   int1 is greater than or equal to int2

test In practice, the most common conditions you'll be checking are whether files or directories exist and whether you can write to them. test supports numerous file- and directory-related test operations.

test File/directory expression and description:
-d dir    dir is a directory
-f file   file is a file
-e file   file exists
-r file   file is readable
-w file   file is writable
-x file   file is executable

test Combining test with if statements is simple:
if test -f some_file.txt
then
    [...]
fi
Bash provides a simpler syntactic alternative:
if [ -f some_file.txt ]
then
    [...]
fi
Note the spaces around and within the brackets: these are required.

test When using this syntax, we can chain test expressions with -a as logical AND, -o as logical OR, and ! as negation. Our familiar && and || operators won't work in test, because these are shell operators.
if [ $# -ne 1 -o ! -r "$1" ]
then
    echo "usage: script.sh file_in.txt"
fi

for loop In bioinformatics, most of our data is split across multiple files. At the heart of any processing pipeline is some way to apply the same workflow to each of these files, taking care to keep track of sample names. Looping over files with Bash's for loop is the simplest way to accomplish this. There are three essential parts to creating a pipeline to process a set of files: 1. Selecting which files to apply the commands to 2. Looping over the data and applying the commands 3. Keeping track of the names of any output files created

for loop Suppose we have a file called samples.txt that tells you basic information about your raw data: sample name, read pair, and where the file is.
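A hypothetical samples.txt might look like this (tab-delimited; the sample names and paths are purely illustrative):
zmaysA	R1	seqs/zmaysA_R1.fastq
zmaysA	R2	seqs/zmaysA_R2.fastq
zmaysB	R1	seqs/zmaysB_R1.fastq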

for loop Suppose we want to loop over every file, gather quality statistics on each and every file, and save this information to an output file. First, we load our filenames into a Bash array, which we can then loop over. Bash arrays can be created manually using:
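For example (the filenames are hypothetical):
sample_files=(zmaysA_R1.fastq zmaysA_R2.fastq zmaysB_R1.fastq)
echo ${sample_files[0]}    # first element
echo ${sample_files[@]}    # all elements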

for loop But creating Bash arrays by hand is tedious and error prone. The beauty of Bash is that we can use a command substitution to construct Bash arrays. We can strip the path and extension from each filename using basename.
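A sketch of both ideas, assuming the file paths sit in the third column of samples.txt:
sample_files=($(cut -f 3 samples.txt))      # command substitution fills the array
basename -s ".fastq" seqs/zmaysA_R1.fastq   # prints zmaysA_R1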

for loop
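A minimal sketch of such a loop, assuming the hypothetical samples.txt layout above; the statistics program is a placeholder:
#!/bin/bash
set -e
set -u
set -o pipefail
# load the file paths from the third column of samples.txt
sample_files=($(cut -f 3 samples.txt))
for fastq_file in ${sample_files[@]}
do
    # strip the directory and .fastq extension to recover the sample name
    sample_name=$(basename -s ".fastq" $fastq_file)
    # gather quality statistics; some_stats_program is hypothetical
    some_stats_program $fastq_file > ${sample_name}-stats.txt
done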

Learning Unix
https://www.codecademy.com/learn/learn-the-command-line
http://swcarpentry.github.io/shell-novice/
http://korflab.ucdavis.edu/bootcamp.html
http://korflab.ucdavis.edu/unix_and_perl/current.html
https://www.learnenough.com/command-line-tutorial
http://cli.learncodethehardway.org/book/
https://learnxinyminutes.com/docs/bash/
http://explainshell.com/