Donna Simpson
6 years ago
Command line tools - bash, awk and sed

We can only explore a small fraction of the capabilities of the bash shell and command-line utilities in Linux during this course. An entire course could be taught about bash alone, but that is not the central topic of interest for this class. There are, however, a few more commands and techniques that are useful enough to merit mention in this session. Previous exercises have used both awk and sed, but this class will devote more attention to general principles of these tools, as well as some additional bash functions. We will also discuss "regular expressions" and file "globs" as part of learning more about command-line tools, because these are generally applicable to many functions.

The document FileGlobbing.pdf (linked from the web page) contains the information from the Ubuntu manpage on file globbing. Read this document, then interpret the following globs. What would each of the following patterns match? Test them using the ls command on the /data/atrnaseq/ directory:

    ls /data/atrnaseq/*.gz
    ls /data/atrnaseq/[ct][1-3].*
    ls /data/atrnaseq/[!a-m]*.fastq*
    ls /data/atrnaseq/[a-m]*.fastq*

Regular expressions are similar to file globs in the sense that both are ways of writing general "patterns" that can match many different targets. File globs apply specifically to file and directory names, while regular expressions apply to any text. Unfortunately there are multiple "flavors" of regular expressions: basic, extended, Perl-format, and so forth. Different tools use slightly different syntax and different flavors, so getting familiar enough with regular expressions to make them work for you is not a trivial task. The RegularExpressions.pdf document (linked from the web page) has an overview of regular expressions, although this is by no means a complete explanation of the topic.
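Before moving on to regular expressions, the glob patterns listed above can be tried safely on a scratch directory. The following is a minimal sketch, not part of the course data: the file names are invented for illustration, and LC_ALL=C is set as an assumption so that the range [a-m] means exactly the lowercase letters a to m and the matches are listed in a predictable order.

```shell
# Work in a throwaway directory so nothing in /data/atrnaseq/ is touched.
export LC_ALL=C                      # [a-m] then means exactly the letters a..m
scratch=$(mktemp -d)

# Hypothetical file names, invented for this demonstration.
touch "$scratch/c1.fq.gz" "$scratch/c2.fq.gz" "$scratch/t1.fq.gz" \
      "$scratch/control.fastq.gz" "$scratch/zebra.fastq.gz" \
      "$scratch/notes.fastq.txt"

# echo prints whatever names the shell substitutes for each glob.
gz_matches=$(cd "$scratch" && echo *.gz)
ct_matches=$(cd "$scratch" && echo [ct][1-3].*)
not_am_matches=$(cd "$scratch" && echo [!a-m]*.fastq*)
am_matches=$(cd "$scratch" && echo [a-m]*.fastq*)

printf '%-18s %s\n' '*.gz'            "$gz_matches"
printf '%-18s %s\n' '[ct][1-3].*'     "$ct_matches"
printf '%-18s %s\n' '[!a-m]*.fastq*'  "$not_am_matches"
printf '%-18s %s\n' '[a-m]*.fastq*'   "$am_matches"

rm -rf "$scratch"
```

Comparing the last two lines of output shows how the leading ! inverts a bracket expression: the two patterns split the .fastq files between them according to the first letter of the name.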
Read RegularExpressions.pdf, then interpret the following regular expressions, and note how parentheses are used to indicate backreferences that would be available for use in a substitution command. NOTE: grouping using () and alternative search terms using | are capabilities of "extended regular expressions" rather than "basic regular expressions", so test these regular expressions using the egrep command on a listing of the files in the /media/lubuntu/data/data directory:

    ls /data/atrnaseq/ | egrep "(fastq|fasta|fq|fa)\.gz"
    ls /data/atrnaseq/ | egrep "(cn|ts)[1-3]ln[^3a-za-z]\."

Note that grep finds matches to the search pattern regardless of the presence of other characters around the pattern - this is another difference between regular expressions and file globs. A file glob finds only filenames that are completely matched by the pattern, while a regular expression finds any text that contains the pattern.

Another version of the grep program is zgrep, which can search within gzipped files without saving a decompressed version of the file. This is useful if it is necessary to find subsets of data within a large file of sequence data stored in fastq.gz compressed format. One challenge with using zgrep on fastq.gz files is that it cannot readily distinguish between the four types of lines found in FASTQ-format files. We will come back to this point later, when we discuss awk in more depth.

With a basic understanding of file globs and regular expressions, we can look at how these are useful. We will start with commands for locating things. Linux system architecture is complex enough that it is not always obvious what software is installed on a given system, or, if it is not installed, whether it is available in pre-packaged form from the distribution repository. The Boost library required for compiling TopHat is an example - how can we find out if our Linux system already has a copy of the Boost library installed? If it does not, how do we know if the library is available in an Ubuntu package from the repository, or if we should download the source code, compile it, and install the library?

The software for managing packages in Ubuntu and other variants of the Debian Linux distribution is called dpkg, short for 'Debian package'. The command to search for packages with names containing the word 'boost' is

    dpkg-query -l '*boost*'

(the quotes keep the shell from expanding the asterisks itself before dpkg-query sees the pattern). Try this and see what happens. The information returned provides the complete names of packages that match the search pattern, along with the status of those packages on the local system (in the first three columns of each line). If the line begins with 'ii', the package is installed and functional on the local system; if the line begins with 'un', the package is not installed. Try the search again using different patterns. For example, leave out one or both asterisks:

    dpkg-query -l 'boost*'

The characters 'boost' are surrounded by other characters in the names of the software packages libboost-system and libboost-thread, so both asterisks are required for the search to be successful - this behaves like a file glob, rather than like a regular expression search. The dpkg-query command only searches the names of packages, which may or may not include the pattern of interest.
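Because dpkg-query patterns look exactly like file globs, the shell will try to expand them first if they are left unquoted. The sketch below illustrates that effect with echo standing in for dpkg-query, so it runs on any system; the file name libboost-notes.txt is a hypothetical example of a file that happens to sit in the working directory.

```shell
scratch=$(mktemp -d)
# Hypothetical file whose name happens to match the pattern *boost*.
touch "$scratch/libboost-notes.txt"

# Unquoted: the shell replaces the pattern with matching file names
# before the command (here echo, standing in for dpkg-query) ever runs.
unquoted=$(cd "$scratch" && echo *boost*)

# Quoted: the pattern reaches the command unchanged.
quoted=$(cd "$scratch" && echo '*boost*')

echo "unquoted argument received: $unquoted"
echo "quoted argument received:   $quoted"

rm -rf "$scratch"
```

In a directory with no matching files the unquoted form usually passes through unchanged, which is why the problem can go unnoticed until the command is run somewhere else.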
To search for packages that contain the pattern "boost" in either the package name or description, use the command apt-cache search:

    apt-cache search boost

The apt cache is a list of names and descriptions of all available packages, installed or not, so this search returns a long list of packages that include the pattern 'boost' somewhere in the name or description. Note that leading and trailing asterisks are not needed, unlike dpkg-query - this behaves like a regular expression search rather than a file glob.

Another command for finding software on the local system is 'locate'. It searches a local database of file names rather than the disk itself, so it is wise to make sure the database is up to date before doing a search. Execute the following two commands to update the database and then search it:

    sudo updatedb
    locate boost

Note that the locate command, similar to apt-cache search, does not require leading or trailing asterisk characters to find the pattern 'boost', even when it is surrounded by other characters. This highlights a reality of Linux systems - different parts of the same system don't always follow the same rules, so it is often important to try different versions of a command, or search online for tips using search keywords that include the specific command you are trying to use and the specific distribution you are working with.

An even more general tool for searching on the local system is the find command, which will return a longer list of files and directories containing the pattern 'boost' than the locate command. Try:

    sudo find /[!a]* -iname '*boost*'

This should be run with sudo privileges if you are searching (as in this example) from the root directory / rather than in your home directory. An ordinary user does not have read privileges on many of the directories searched by the find command, so running the command without sudo privileges results in a lot of Permission denied error messages.
The [!a]* glob specifies that the /afs directory tree should not be searched, because this will also result in many Permission denied error messages. The -iname option tells find to search for the pattern '*boost*' in filenames, in a case-insensitive manner. Other options allow the find command to search based on file type, file size, creation date, last modification date, or almost any other characteristic. Try the search using boost without the flanking asterisks - is the find function more similar to file globs or to regular expressions?

Loops and Conditionals

DNA sequence analysis often requires carrying out the same steps of data processing and analysis separately for multiple sequence files. For example, even our small example RNA-seq experiment with only two conditions and three replicates of each condition requires processing six sequence files separately up to the point where the data from the six files are combined into a single object containing read counts for each transcript from all six samples.

There are two general strategies for simplifying repetitive sets of tasks that need to be completed on multiple files, and then multiple methods to use either of the two strategies. The two strategies are serial processing and parallel processing - one either carries out the tasks on one file at a time, and repeats the same tasks on one file after another until the whole set is finished, or one carries out the tasks in parallel on all files at the same time. The choice of strategy is based on the amount of system resources required per task, the number of files to be processed, and the total amount of resources available. A workstation with 24 processors and 128 GB of RAM could carry out parallel processing of six RNA-seq data files from the example dataset at the same time, but the BIT laptops with 4 processors and 8 GB of RAM probably could not. Command-line options for parallel processing include the commands xargs with tee, or the parallel command; see the manpages for more information on xargs and tee, and the parallel web page for more information on parallel.

Loops

One way to carry out serial processing is to create a loop of commands. This is a sequence of commands provided to the interpreter at the same time, that (1) give a list of items to deal with, and (2) specify what is to be done with them.
The "items to deal with" can be as simple as counting through numbers in a sequence or processing each file in turn from a list of filenames, while the specification of "what is to be done with them" can be as simple or complex as desired. Loops are a general concept that can be applied in many ways using different command interpreters; we will focus on examples of loops in the bash shell and in the awk text processing language.

Bash loops can have any of three general forms: for loops, while loops, or until loops (see the Bash Beginners Guide). The for loop takes the form 'for X in Y; do Z; done'. X is a variable, Y is a list of values that will each be assigned to the variable X, and Z is an action or set of actions to be taken. A simple example which echoes the numbers 1 to 10 to the screen is:

    for x in {1..10}; do echo $x; done

This example demonstrates that numerical sequences can be specified using curly brackets and two dots separating the beginning and ending values. A list of text items, such as file names, can be provided as a space-separated list with no brackets or quotes (provided the file names do not contain spaces):

    for x in c1 c2 c3 t1 t2 t3; do echo $x; done

In order to help the bash shell recognize the boundaries of the string used as the variable name, we can put curly brackets around the string, so that adjacent characters are clearly marked as not part of the name. For example, if we wanted to use the same strategy as before to list the names of the sequence files, but used the variable only for the parts that are different, we could write:

    for x in c1 c2 c3 t1 t2 t3; do echo ${x}.fq.gz; done

Instead of just echoing the filename to the screen, we could just as easily use this loop to carry out serial alignment of all 6 RNA-seq data files to the chromosome 5 reference sequence to produce BAM output files:

    for x in c1 c2 c3 t1 t2 t3; do bwa mem -t 3 Atchr5 ${x}.fq.gz | samtools view -SbuF4 - | samtools sort -o ${x}.bam -; done

Note that the value of the variable $x is used twice, once to name the input file to be aligned, and a second time to name the BAM file produced as output. This points out the advantages of a file-naming system that keeps all the relevant information together in one part of the filename, so the name can easily be subdivided into different parts, one specific for the sample identity and another specific for the file format or stage of processing.

There are also advantages to naming files so that the sample identity segment always consists of the same number of characters. For example, in analysis of genotyping-by-sequencing data using the STACKS pipeline, sequence reads for each sample are expected to be stored in separate files. This can mean dozens or hundreds of different sequence files, so typing out a list of all the filenames as demonstrated above would be very tedious.
If the files are named so that the sample identity always occupies the first 6 characters (eg p01a01 to p10h12 for ten 96-well plates of samples, each named for the plate and well position where it is stored), then a loop that processes all 960 files can be written in three lines:

    for file in *.fastq.gz; do name=${file:0:6};
    bwa mem -t 3 refindex/refname ${name}.fastq.gz | samtools view -SbuF4 - | samtools sort -o ${name}.bam -;
    done

It is critical, in using loops of this sort, to make sure that the path information is correct relative to the working directory from which the command is executed. This example would be executed from within a directory containing the sequence files, and also containing a sub-directory called refindex in which the genome reference index called refname is stored.

Two more points are worth mentioning here. First, the list of file names that are assigned to the $file variable is produced using a file glob - this example is simple, but it can be as complex as necessary to achieve the desired result. Second, the first 6 characters of each filename are extracted using the syntax ${variable:position:length}, where "position" is 0-indexed. The first character of the filename in the $file variable is position 0, and a sub-string of length 6 beginning at that first character is assigned to the variable $name.

We can store the BAM output alignment files in a different directory than the input sequence files, if desired, by providing more path information in addition to filenames. For example,

    for file in seqdata/*.fastq.gz; do name=${file:8:6};
    bwa mem -t 3 refindex/refname seqdata/${name}.fastq.gz | samtools view -SbuF4 - | samtools sort -o bamfiles/${name}.bam -;
    done
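The substring syntax is easy to check without running any alignments. The following bash sketch uses two file names taken from the naming scheme described above (p01a01 and p10h12) as plain strings, and echoes the pipeline instead of executing it, so neither bwa nor samtools needs to be installed. The basename call is shown as an alternative, added here as an assumption about what may be convenient, for cases where the length of the directory prefix is not fixed.

```shell
# Substring expansion: ${variable:position:length}, position counted from 0.
file="p01a01.fastq.gz"
name=${file:0:6}                      # first 6 characters of the file name
echo "$file -> $name"

# With a directory prefix, the sample name starts after "seqdata/",
# which occupies positions 0 to 7.
path="seqdata/p01a01.fastq.gz"
name_from_path=${path:8:6}
echo "$path -> $name_from_path"

# Alternative that does not depend on the length of the prefix:
# strip the directory and the .fastq.gz suffix with basename.
base=$(basename "$path" .fastq.gz)
echo "$path -> $base (via basename)"

# Dry run: print the commands the loop would execute for each file.
for f in p01a01.fastq.gz p10h12.fastq.gz; do
    n=${f:0:6}
    echo "bwa mem -t 3 refindex/refname $f | samtools view -SbuF4 - | samtools sort -o $n.bam -"
done
```

Printing the commands first is a cheap way to confirm that the path and substring arithmetic are right before committing to a long series of alignments.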

The loop that reads from seqdata/ and writes to bamfiles/ would be executed from a directory containing three sub-directories: the seqdata/ directory with the fastq.gz sequence files for all samples, the refindex/ directory containing the reference genome index, and the bamfiles/ directory in which the BAM output is to be stored. Note that the substring assigned to the $name variable now starts at position 8 of the filename stored in the $file variable, because the path string "seqdata/" occupies positions 0 to 7.

Try the command below to see how files are listed from the /data/atrnaseq directory:

    for file in /data/atrnaseq/*gz; do echo $file; done

Does this command list only the RNAseq data files? If not, how should it be modified so that it lists only the 6 files of RNAseq data?

We have discussed the text processing language awk and its variant bioawk in the context of manipulating and summarizing FASTQ sequence and SAM alignment files, but it is a powerful tool for other kinds of text processing as well. For example, consider the problem of summarizing a file containing several FASTA-format sequences, each with a single header line and multiple lines of sequence, like the file test.fa available on the website. If we want to know how long each of the DNA sequences is, we can use an awk loop to condense all the sequence information onto a single line, then determine the length of that line.

    awk 'BEGIN {RS = ">"; FS = "\n"} {printf "%s\n", ">"$1} {for(i=2;i<=NF;i+=1) printf "%s", $i} {printf "%s","\n"}' infile.fa

This complex command begins with awk 'BEGIN {RS = ">"; FS = "\n"}: this tells the shell we are using the awk language, and sets the "record separator" character to ">" and the "field separator" character to newline. Recall that a "record" in awk refers to a set of data relevant to a particular observation, and "fields" are subsets of the data for that observation.
By changing the record and field separator values from the defaults (newline and whitespace, respectively) we tell awk that we want it to treat each FASTA-format sequence as a record, and the individual lines as fields within that record. The availability of the bioawk variant eliminates the need for this kind of code for use with FASTA or FASTQ files, but the occasion may arise to use this strategy in other text-processing applications, so it is useful to know that it is possible.

The printf command provides complete control over the format of printed output, at the cost of complex syntax. More information about printf is available online; a brief summary follows. The "%s" specifies we are printing a 'string', or text; we can specify other types of output with other characters.

    %d or %i - integer digits
    %e - number in scientific notation, eg 'printf "%4.3e\n", 1950' prints 1.950e+03, with 3 digits to the right of the decimal point
    %c - a character specified by its ASCII character code, eg 'printf "%c", 65' produces the character 'A'
    %f - a decimal number or 'float', with the number of decimal places specified: 'printf "%4.3f", 1950' prints '1950.000'

Note that printf does not add a newline, so multiple printf statements will all be output to the same line unless newline characters '\n' are specified as part of the strings to be printed.

The next part of the awk command, {printf "%s\n", ">"$1}, prints the name of the sequence prefixed by a ">" symbol (which was removed because it is specified as the record separator), followed by a newline. The loop {for(i=2;i<=NF;i+=1) printf "%s", $i} prints each line of sequence in turn to output, with no intervening newline characters, for as many lines of sequence as there are in each record. The final command, {printf "%s","\n"}', prints a newline character at the end of the sequence so the stage is set for the beginning of the next header line. Because a FASTA file begins with ">", awk sees an empty record before the first sequence, so the output starts with a bare ">" line followed by a blank line; placing the condition NR>1 in front of each action suppresses it. The input file can be sent to awk in a pipe, or awk can read directly from a file; default output goes to STDOUT (the screen), but a file destination can be specified in the print command if desired.

Another example of formatting sequences: the command

    awk '{printf "%s%06d\n%s\n", ">",FNR,$0 }' infile.txt > outfile.fa

takes as input a text file of short DNA sequences (such as GBS tags) and converts it to FASTA format, using as names the number of each sequence in the list. The 'printf "%s%06d\n%s\n"' part formats the numbers as 6 digits, with leading zeros to keep the sequence names the same length, eg >000001 to >100000 for a set of 100,000 GBS tags.

Conditionals

The bash shell has multiple types of conditional statements, but those are more complex than we can dig into today. We will focus on conditional statements in awk, because those are relatively simple and very useful for dealing with FASTQ formatted sequence files, where each record occupies four lines in the file. The mathematical operation "modulo" is valuable in this context - the modulo operation returns the remainder left over after dividing one number by another. Line number modulo 4 gives a unique identity to each of the four different types of lines in a FASTQ-format sequence file. In awk terminology, the operation is written NR%4, and the conditional test is specified with two equal signs to distinguish it from an assignment operation, which is written with one equal sign.

    awk 'NR%4==1 {print $0}'    prints the first header line with the name of the sequence
    awk 'NR%4==2 {print $0}'    prints the DNA sequence line
    awk 'NR%4==3 {print $0}'    prints the header line for the quality scores (often just '+')
    awk 'NR%4==0 {print $0}'    prints the quality score line

The awk utility does not have the capacity to read gzipped files directly, but it can accept input redirected from another shell command. For example, the command

    zcat /data/atrnaseq/c1.fq.gz | awk 'NR%4==1 {print $0}' | head

prints the first 10 sequence names from the c1.fq.gz file.
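The modulo tests can be tried on a tiny compressed file before pointing them at real data. The sketch below builds a two-record FASTQ file from invented reads in a scratch directory; the read names, sequences and quality strings are made up for illustration. It uses gzip -dc, which behaves like zcat on Linux and is an assumption made here only for portability to systems where zcat expects a different file suffix.

```shell
scratch=$(mktemp -d)

# A two-record FASTQ file with invented reads, compressed with gzip.
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' \
              '@read2' 'GGCC' '+' 'JJJJ' | gzip > "$scratch/demo.fq.gz"

# Line number modulo 4 selects one of the four line types in each record.
names=$(gzip -dc "$scratch/demo.fq.gz" | awk 'NR%4==1 {print $0}')
seqs=$(gzip -dc "$scratch/demo.fq.gz"  | awk 'NR%4==2 {print $0}')
quals=$(gzip -dc "$scratch/demo.fq.gz" | awk 'NR%4==0 {print $0}')

# Counting the sequence lines gives the number of reads in the file.
nreads=$(gzip -dc "$scratch/demo.fq.gz" | awk 'NR%4==2 {n++} END {print n}')

echo "names:";     echo "$names"
echo "sequences:"; echo "$seqs"
echo "qualities:"; echo "$quals"
echo "number of reads: $nreads"

rm -rf "$scratch"
```

Selecting by line position rather than by content is what makes this reliable: a quality line may itself begin with '@', so a pattern search for header lines can pick up quality lines by mistake.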
The zcat pipeline can also be written as

    awk 'NR%4==1 {print $0}' <(zcat /data/atrnaseq/c1.fq.gz) | head

This is an example of redirection, known in bash as process substitution - the output of the zcat command is redirected from the screen to serve as input to the awk command. The second version allows the <(zcat filename) to be embedded at any point in a pipe, so it can be used to merge data from different files or different processing streams together if desired, while the first version uses zcat filename to start the pipeline, so it can't be inserted at any other position in the sequence of commands.
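As one illustration of merging processing streams, the bash sketch below feeds two process substitutions to the paste command, so that each read name ends up on the same line as its sequence. The FASTQ content is invented for the demonstration, paste is chosen here only as a convenient example of a command that takes two inputs, and gzip -dc is used in place of zcat as a portability assumption.

```shell
scratch=$(mktemp -d)

# A two-record FASTQ file with invented reads, compressed with gzip.
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' \
              '@read2' 'GGCC' '+' 'JJJJ' | gzip > "$scratch/demo.fq.gz"

# Each <( ... ) runs its own pipeline and looks like a file to paste,
# which joins the two streams line by line with a tab in between.
paired=$(paste <(gzip -dc "$scratch/demo.fq.gz" | awk 'NR%4==1 {print $0}') \
               <(gzip -dc "$scratch/demo.fq.gz" | awk 'NR%4==2 {print $0}'))

echo "$paired"

rm -rf "$scratch"
```

An ordinary pipe could supply only one of the two inputs, which is the practical difference between the two forms described above.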
