Lecture 5 Essential skills for bioinformatics: Unix/Linux
UNIX DATA TOOLS
Text processing with awk We have illustrated two ways awk can come in handy: filtering data using rules that can combine regular expressions and arithmetic, and reformatting the columns of data using arithmetic. We will now learn more advanced use cases by introducing two special patterns: BEGIN and END. The BEGIN pattern specifies what to do before the first record is read in; it is useful for initializing and setting up variables. The END pattern specifies what to do after the last record's processing is complete; it is useful for printing data summaries at the end of file processing.
Text processing with awk Suppose we want to calculate the mean feature length in Homo_sapiens.GRCh38.87.bed: NR is the current record number, so on the last record NR is set to the total number of records processed.
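A minimal sketch of this calculation; the tiny three-line BED file here is made up, standing in for Homo_sapiens.GRCh38.87.bed, and feature length is taken as end minus start:

```shell
# Hypothetical three-line BED file standing in for Homo_sapiens.GRCh38.87.bed
printf "chr1\t10\t20\nchr1\t30\t60\nchr2\t0\t50\n" > test.bed

# Sum each feature's length (end - start); at END, NR holds the total
# number of records processed, so s/NR is the mean feature length.
awk 'BEGIN{ s = 0 } { s += $3 - $2 } END{ print "mean: " s/NR }' test.bed
# mean: 30
```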
Text processing with awk We can use NR to extract ranges of lines:
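For example, to print only lines 3 through 5 of a ten-line input (seq stands in for a real data file):

```shell
# When the NR condition is true, awk's default action (print) runs
seq 1 10 | awk 'NR >= 3 && NR <= 5'
# 3
# 4
# 5
```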
Text processing with awk awk makes it easy to convert between bioinformatics file formats like BED and GTF. Our previous solution using grep and cut has an error: it ignores the difference between the GTF and BED coordinate systems.
Text processing with awk We can generate a three-column BED file from the GTF file as follows. Note that we subtract 1 from the start position to convert to BED format. This is because BED uses zero-indexing while GTF uses 1-indexing.
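A sketch of the conversion, using a single made-up GTF record (tab-delimited, with start in column 4 and end in column 5):

```shell
# One hypothetical GTF line
printf '1\tensembl\tgene\t100\t200\t.\t+\t.\tgene_id "g1";\n' > test.gtf

# Skip comment lines (those starting with #), then print chrom, start-1, end:
# GTF is 1-based, BED is 0-based, so the start position is decremented.
awk '!/^#/ { print $1 "\t" $4-1 "\t" $5 }' test.gtf
```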
Text processing with awk Using sort | uniq -c, we counted the number of features belonging to a particular gene. awk also has a very useful data structure known as an associative array, which behaves like Python's dictionaries or hashes in other languages.
Text processing with awk We can create an associative array by simply assigning a value to a key.
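A sketch of counting features per gene with an associative array; the BED-like input and the assumption that the gene name sits in column 4 are both made up for illustration:

```shell
# counts[$4] is created on first assignment; keys are gene names
printf "chr1\t0\t10\tgeneA\nchr1\t10\t20\tgeneA\nchr2\t5\t15\tgeneB\n" \
  | awk '{ counts[$4] += 1 } END{ for (gene in counts) print gene "\t" counts[gene] }' \
  | sort
# geneA	2
# geneB	1
```

Note the trailing sort: awk's for (key in array) loop visits keys in no guaranteed order.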
Text processing with awk This example illustrates that awk is a programming language: within our action blocks, we can use standard programming statements like if, for, and while. However, when awk programs become complex or start to span multiple lines, you should switch to Python.
Stream editing with sed We learned how Unix pipes are fast because they operate on streams of data rather than on data written to disk. Additionally, pipes don't require that we load an entire file into memory at once; instead, we can operate on one line at a time. Often we need to make trivial edits to a stream, usually to prepare it for the next step in a Unix pipeline. The stream editor, sed, allows us to do exactly that.
Stream editing with sed sed reads data from a file or standard input and can edit one line at a time. Let's look at a very simple example: converting a file containing a single column of chromosome names in the format chrom1 to the format chr1.
Stream editing with sed We can edit the stream without loading the entire file into memory, and our edited output stream is easy to redirect to a new file. In the previous example, we used sed's substitute command, by far the most popular use of sed. The substitute command takes the first occurrence of the pattern between the first two slashes and replaces it with the string between the second and third slashes. In other words, the syntax of sed's substitute is s/pattern/replacement/.
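The chrom1-to-chr1 conversion might look like this (chrom.txt is a made-up filename):

```shell
# A small file of chromosome names in the chrom1 style
printf "chrom1\nchrom2\nchrom3\n" > chrom.txt

# Replace the first occurrence of "chrom" on each line with "chr"
sed 's/chrom/chr/' chrom.txt
# chr1
# chr2
# chr3
```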
Stream editing with sed By default, sed only replaces the first occurrence of a match on each line. We can replace all occurrences of strings that match our pattern by setting the global flag g after the last slash: s/pattern/replacement/g. If we need matching to be case-insensitive, we can enable this with the flag I. By default, sed's substitutions use POSIX BRE. As with grep, we can use the -E option to enable POSIX ERE.
Stream editing with sed Most important is the ability to capture chunks of text that match a pattern and use those chunks in the replacement. Suppose we want to capture the chromosome name and the start and end positions from a string containing a genomic region in the format chr1:28427874-28425431, and output these as three columns.
Stream editing with sed ^(chr[^:]+): This matches text beginning at the start of the line (^ enforces this), and captures everything between ( and ). The captured pattern begins with chr and then matches one or more characters that are not :, our delimiter. We match up to the first : through a character class defined as everything that's not a colon: [^:]+
Stream editing with sed ([0-9]+): Match and capture one or more digits. Finally, our replacement is these three captured groups, interspersed with tabs, \t. Regular expressions are tricky and take time and practice to master.
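Putting the pieces together (note that \t as a tab in the replacement is a GNU sed extension; on other seds you may need a literal tab):

```shell
# -E enables POSIX ERE so groups are written ( ) rather than \( \);
# \1, \2, \3 in the replacement refer to the captured groups.
echo "chr1:28427874-28425431" \
  | sed -E 's/^(chr[^:]+):([0-9]+)-([0-9]+)/\1\t\2\t\3/'
```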
Stream editing with sed Explicitly capturing each component of our genomic region is one way to tackle this, and it nicely demonstrates sed's ability to capture patterns. However, there are numerous other ways to use sed or other Unix tools to parse strings like this.
Stream editing with sed sed 's/[:-]/\t/g': We just replace both delimiters with a tab. Note that we've enabled the global flag, which is necessary for this approach to work. sed 's/:/\t/' | sed 's/-/\t/': For complex substitutions, it can be much easier to use two or more calls to sed rather than trying to do everything with one regular expression. tr ':-' '\t': tr translates all occurrences of the characters in its first argument to the corresponding characters in its second.
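The three alternatives side by side, using the region string from the previous slides (again, \t in a sed replacement assumes GNU sed); all three produce the same tab-separated output:

```shell
region="chr1:28427874-28425431"

# 1. One global substitution: replace every ':' or '-' with a tab
echo "$region" | sed 's/[:-]/\t/g'

# 2. Two simpler sed calls chained in a pipe
echo "$region" | sed 's/:/\t/' | sed 's/-/\t/'

# 3. tr translates each ':' and '-' to a tab
echo "$region" | tr ':-' '\t'
```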
Stream editing with sed By default, sed prints every line, making replacements on matching lines. Suppose we want to capture all transcript names from the last column of a GTF file. Some lines of the GTF file don't contain transcript_id in their last column, so sed prints those entire lines rather than the captured group.
Stream editing with sed One way to solve this would be to use grep transcript_id before sed. A cleaner way is to disable sed from outputting all lines with -n. Then, by appending p after the last slash, sed will print only the lines on which it has made a replacement.
Stream editing with sed This example uses an important regular expression idiom: capturing text between delimiters. 1. First, match zero or more of any character (.*) before the string transcript_id. 2. Then, match and capture one or more characters that are not a quote ([^"]+). This is an important idiom. The brackets make up a character class; character classes specify which characters the expression is allowed to match. Here, we use a caret (^) inside the brackets to match anything except what follows it inside the brackets. The end result is that we match and capture one or more non-quote characters.
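A sketch of the whole extraction; the two GTF lines are made up, and only the second carries a transcript_id:

```shell
# Hypothetical GTF records
printf '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG1";\n' > test.gtf
printf '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG1"; transcript_id "ENST1";\n' >> test.gtf

# -n suppresses sed's default printing; the trailing p prints only the
# lines on which a substitution was actually made.
sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p' test.gtf
# ENST1
```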
Stream editing with sed The following approach, using (.*) rather than ([^"]+), will not work: regular expression matching is greedy, so .* matches as much as possible, running past the closing quote to the last quote on the line.
Stream editing with sed It is also possible to select and print certain ranges of lines with sed. In this case, we are not doing pattern matching, so we don't need slashes. To print lines 20 through 50 of a file, we use sed -n '20,50p'. sed has features that allow you to make nearly any type of edit to a stream of text, but for complex stream-processing tasks it can be easier to write a Python script than a long and complicated sed command.
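A quick sketch, with seq standing in for a real 100-line file:

```shell
# -n suppresses default printing; '20,50p' prints only lines 20 through 50
seq 1 100 | sed -n '20,50p' | wc -l   # 31 lines
```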
SHELL SCRIPTING
Overview Bash, the shell we have used interactively in this course, is a full-fledged scripting language. Unlike Python, Bash is not a general-purpose language. Bash is explicitly designed to make running and interfacing command-line programs as simple as possible. For these reasons, Bash often takes the role of the glue language of bioinformatics, as it's used to glue many commands together into a cohesive workflow.
Overview Note that Python is a more suitable language for commonly reused or advanced pipelines. Python is a more modern, fully featured scripting language than Bash. Compared to Python, Bash lacks several nice features useful for data-processing scripts: better numeric type support, useful data structures, better string processing, refined option parsing, availability of a large number of libraries, and powerful functions that help with structuring your programs. However, there's more overhead when calling command-line programs from a Python script compared to Bash, so Bash is often the best and quickest glue solution.
Writing and running bash scripts Most Bash scripts in bioinformatics are simply commands organized into a re-runnable script, with some features to check that files exist and to ensure any error causes the script to abort. We will learn the basics of writing and executing Bash scripts, paying particular attention to how to create robust Bash scripts.
A robust Bash header By convention, Bash scripts have the extension .sh. You can create them in your favorite text editor (e.g. emacs, nano, or vi). Anytime you write a Bash script, you should use the following header, which sets some Bash options that lead to more robust scripts: #!/bin/bash followed by set -e, set -u, and set -o pipefail.
A robust Bash header #!/bin/bash This is called the shebang, and it indicates the path to the interpreter used to execute this script. set -e By default, a shell script containing a command that fails will not cause the entire script to exit: the script will just continue on to the next line. We always want errors to be loud and noticeable. set -e prevents silent failure by terminating the script if any command exits with a nonzero exit status.
A robust Bash header Note that this option ignores nonzero statuses in if conditionals. Also, it ignores all exit statuses in Unix pipes except the last one. set -u This option fixes another default behavior of Bash scripts: a command containing a reference to an unset variable name will otherwise still run. set -u prevents this type of error by aborting the script if a variable's value is unset.
A robust Bash header set -o pipefail set -e will cause a script to abort if a nonzero exit status is encountered, with some exceptions. One such exception is when a program in a Unix pipe exits unsuccessfully: the pipe's exit status is that of its last program. Including set -o pipefail prevents this undesirable behavior: any program that returns a nonzero exit status in the pipe will cause the entire pipe to return a nonzero status. With set -e enabled, this will lead the script to abort.
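A small demonstration of the full header in action; the script name and the missing input file are made up. The first command in the pipe fails, pipefail propagates its status, and set -e aborts the script before the final echo:

```shell
cat > robust.sh <<'EOF'
#!/bin/bash
set -e
set -u
set -o pipefail

# cat fails here; with pipefail + set -e the script aborts immediately
cat no_such_file.txt | head -n 1
echo "this line is never reached"
EOF

bash robust.sh
echo "exit status: $?"   # nonzero, and the final echo inside the script never ran
```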
Running bash scripts Running Bash scripts can be done one of two ways: 1. bash script.sh 2. ./script.sh While we can always run a script with bash, calling the script as an executable requires that it has executable permissions. We can set these using: chmod u+x script.sh This adds executable permissions for the user who owns the file. Then, the script can be run with ./script.sh.
Variables Pipelines have numerous settings that should be stored in variables. Storing these settings in a variable defined at the top of the file makes adjusting settings and rerunning your pipelines much easier. Rather than having to change numerous hardcoded values in your scripts, using variables to store settings means you only have to change one value. Bash also reads command-line arguments into variables.
Variables Bash's variables don't have data types; it's helpful to think of Bash's variables as strings. We can create a variable and assign it a value with: results_dir="results/" Note that spaces matter when setting Bash variables. Do not use spaces around the equal sign.
Variables To access a variable's value, we use a dollar sign in front of the variable's name. Suppose we want to create a directory for a sample's alignment data, called <sample>_aln/, where <sample> is replaced by the sample's name: sample="CNTRL01A" mkdir ${sample}_aln/
Command-line arguments The variable $0 stores the name of the script, and command-line arguments are assigned to $1, $2, $3, etc. Bash assigns the number of command-line arguments to $#.
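A sketch of these variables in action; the script name args.sh and the argument filenames are hypothetical:

```shell
cat > args.sh <<'EOF'
#!/bin/bash
echo "script name: $0"
echo "first argument: $1"
echo "number of arguments: $#"
EOF

bash args.sh sample.fastq ref.fa
# script name: args.sh
# first argument: sample.fastq
# number of arguments: 2
```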
Command-line arguments If you find your script requires numerous or complicated options, it might be easier to use Python instead of Bash. Python s argparse module is much easier to use. Variables created in your Bash script will only be available for the duration of the Bash process running that script.
if statement Bash supports the standard if conditional statement. The basic syntax is:
if [commands]
then
    [if-statements]
else
    [else-statements]
fi
if statement A command's exit status provides the true and false. Remember that 0 represents true/success and anything else is false/failure. [commands] could be any command, set of commands, pipeline, or test condition. If the exit status of these commands is 0, execution continues to the block after then; otherwise execution continues to the block after else.
if statement [if-statements] is a placeholder for all statements executed if [commands] evaluates to true (0). [else-statements] is a placeholder for all statements executed if [commands] evaluates to false. The else block is optional.
if statement Bash is primarily designed to stitch together other commands. This is an advantage Bash has over Python when writing pipelines. Bash allows your scripts to directly work with command-line programs without requiring any overhead to call programs. Although it can be unpleasant to write complicated programs in Bash, writing simple programs is exceedingly easy because Unix tools and Bash harmonize well.
if statement Suppose we want to run a set of commands only if a file contains a certain string. grep returns 0 only if it matches a pattern in a file, and 1 otherwise, so its exit status can drive the conditional directly. The redirection tidies the output of the script so that grep's output goes to /dev/null rather than to the script's standard out.
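A minimal sketch, assuming a made-up organisms.txt file:

```shell
echo "Homo sapiens" > organisms.txt

# grep exits 0 on a match; its own output is discarded so that only
# our messages reach standard out.
if grep "Homo sapiens" organisms.txt > /dev/null
then
    echo "found"
else
    echo "not found"
fi
# found
```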
test Like other programs, test exits with either 0 or 1. However, test's exit status indicates the return value of the test specified through its arguments rather than exit success or error. test supports numerous standard comparison operators.
test String/integer comparisons:
-z str : string str is null (empty)
str1 = str2 : str1 and str2 are identical
str1 != str2 : str1 and str2 are different
int1 -eq int2 : integers int1 and int2 are equal
int1 -ne int2 : int1 and int2 are not equal
int1 -lt int2 : int1 is less than int2
int1 -gt int2 : int1 is greater than int2
int1 -le int2 : int1 is less than or equal to int2
int1 -ge int2 : int1 is greater than or equal to int2
test In practice, the most common conditions you'll be checking are whether files or directories exist and whether you can write to them. test supports numerous file- and directory-related test operations.
test File/directory expressions:
-d dir : dir is a directory
-f file : file is a file
-e file : file exists
-r file : file is readable
-w file : file is writable
-x file : file is executable
test Combining test with if statements is simple:
if test -f some_file.txt
then
    [...]
fi
Bash provides a simpler syntactic alternative:
if [ -f some_file.txt ]
then
    [...]
fi
Note the spaces around and within the brackets: these are required.
test When using this syntax, we can chain test expressions with -a as logical AND, -o as logical OR, and ! as negation. Our familiar && and || operators won't work inside test, because these are shell operators. For example:
if [ "$#" -ne 1 -o ! -r "$1" ]
then
    echo "usage: script.sh file_in.txt"
    exit 1
fi
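A runnable version of this check as a hypothetical check.sh, showing both branches (the argument-count test and the readability test):

```shell
cat > check.sh <<'EOF'
#!/bin/bash
# Abort unless given exactly one readable file
if [ "$#" -ne 1 -o ! -r "$1" ]
then
    echo "usage: script.sh file_in.txt"
    exit 1
fi
echo "processing $1"
EOF

echo "data" > file_in.txt
bash check.sh file_in.txt    # processing file_in.txt
bash check.sh || true        # prints the usage message and exits 1
```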
for loop In bioinformatics, most of our data is split across multiple files. At the heart of any processing pipeline is some way to apply the same workflow to each of these files, taking care to keep track of sample names. Looping over files with Bash s for loop is the simplest way to accomplish this. There are three essential parts to creating a pipeline to process a set of files: 1. Selecting which files to apply the commands to 2. Looping over the data and applying the commands 3. Keeping track of the names of any output files created
for loop Suppose we have a file called samples.txt that tells you basic information about your raw data: sample name, read pair, and where the file is.
for loop Suppose we want to loop over every file, gather quality statistics on each and every file, and save this information to an output file. First, we load our filenames into a Bash array, which we can then loop over. Bash arrays can be created manually using:
for loop But creating Bash arrays by hand is tedious and error prone. The beauty of Bash is that we can use a command substitution to construct Bash arrays. We can strip the path and extension from each filename using basename.
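A sketch of both ideas; the zmays FASTQ filenames are made up, and the command-substitution approach assumes no spaces in filenames:

```shell
# Create a few hypothetical FASTQ files to work with
mkdir -p seqs
touch seqs/zmaysA_R1.fastq seqs/zmaysB_R1.fastq

# Command substitution builds the array from ls output
sample_files=($(ls seqs/*_R1.fastq))
echo "${sample_files[0]}"                 # seqs/zmaysA_R1.fastq

# basename strips the leading path; a second argument also strips that suffix
basename seqs/zmaysA_R1.fastq             # zmaysA_R1.fastq
basename seqs/zmaysA_R1.fastq .fastq      # zmaysA_R1
```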
for loop
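Putting the three parts together, the full loop might look like this; the sample names and the _R1.fastq naming scheme are assumptions for illustration:

```shell
# Hypothetical input files
mkdir -p seqs
touch seqs/zmaysA_R1.fastq seqs/zmaysB_R1.fastq

# 1. Select the files
sample_files=($(ls seqs/*_R1.fastq))

# 2. Loop over them, 3. keeping track of each sample name
for file in "${sample_files[@]}"
do
    # Strip path and suffix to recover the sample name
    sample=$(basename "$file" _R1.fastq)
    echo "processing sample: $sample"
done
# processing sample: zmaysA
# processing sample: zmaysB
```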
Learning Unix https://www.codecademy.com/learn/learn-the-command-line http://swcarpentry.github.io/shell-novice/ http://korflab.ucdavis.edu/bootcamp.html http://korflab.ucdavis.edu/unix_and_perl/current.html https://www.learnenough.com/command-line-tutorial http://cli.learncodethehardway.org/book/ https://learnxinyminutes.com/docs/bash/ http://explainshell.com/