Essential Skills for Bioinformatics: Unix/Linux

WORKING WITH COMPRESSED DATA

Overview Data compression, the process of condensing data so that it takes up less space (on disk, in memory, or across network transfers), is an indispensable technology in modern bioinformatics. For example, consider the sequences from a recent Illumina HiSeq run: example.fastq is 63,203,414,514 bytes (59 GB), while example.fastq.gz is 21,408,674,240 bytes (20 GB). The compression ratio (uncompressed size/compressed size) of this data is about 2.95, which translates to a space saving of about 66%.

Overview Data can remain compressed on disk throughout processing and analyses. Most well-written bioinformatics tools can work natively with compressed data as input, without requiring us to decompress it to disk first. Using pipes and redirection, we can stream compressed data and write compressed files directly to disk. Common Unix tools like cat and grep have variants that work with compressed data. While working with large datasets in bioinformatics can be challenging, the compression tools in Unix and in software libraries make our lives much easier.

gzip The two most common compression systems used on Unix are gzip and bzip2. gzip is faster than bzip2, while bzip2 achieves a higher compression ratio (the previous FASTQ file is only about 16 GB when compressed with bzip2). Generally, gzip is used in bioinformatics to compress most sizable files, while bzip2 is more common for long-term data archiving.

gzip gzip can compress results from standard input. This is useful, as we can compress results directly from another bioinformatics program's standard output.
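A minimal sketch of this pattern (the trimming program and filenames here are hypothetical placeholders):

    trimmer in.fastq | gzip > in_trimmed.fastq.gz   # compress output as it streams; nothing uncompressed touches the disk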

gzip gzip can also compress files on disk in place, replacing the original uncompressed version with the compressed file (appending the extension .gz to the original filename).
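For example, with the tb1.fasta file used on these slides:

    gzip tb1.fasta   # replaces tb1.fasta with tb1.fasta.gz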

gunzip We can decompress files in place with the command gunzip. Note that this replaces the tb1.fasta.gz file with the decompressed version.
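For example:

    gunzip tb1.fasta.gz   # replaces tb1.fasta.gz with the uncompressed tb1.fasta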

gzip -c Both gzip and gunzip can also output their results to standard out. This can be enabled using the -c option:
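A sketch with the same example file, leaving the original untouched:

    gzip -c tb1.fasta > tb1.fasta.gz        # compress to standard out, redirect to a file
    gunzip -c tb1.fasta.gz > tb1-copy.fasta # decompress to standard out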

gzip with multiple files
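Unlike zip, gzip does not bundle multiple files into one archive: given several filenames, it compresses each file separately. A minimal sketch (the FASTQ filenames are hypothetical):

    gzip in1.fastq in2.fastq                       # produces in1.fastq.gz and in2.fastq.gz
    cat in1.fastq in2.fastq | gzip > in.fastq.gz   # concatenate first for a single compressed file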

Working with gzipped files The greatest advantage of gzip (and bzip2) is that many Unix and bioinformatics tools can work directly with compressed files. For example, we can search compressed files using grep's analog for gzipped files, zgrep. Likewise, cat has zcat. If a program cannot handle compressed input, you can use zcat and pipe its output directly to the standard input of that program.
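A sketch using the example.fastq.gz file from the overview (the search pattern is illustrative):

    zgrep -c "AGATAGAT" example.fastq.gz   # search the compressed file directly
    zcat example.fastq.gz | head -n 4      # stream decompressed data into any program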

Creating a tar.gz archive
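A minimal sketch, assuming a data directory named seqdata/ to archive:

    tar -czvf seqdata.tar.gz seqdata/   # -c create, -z gzip-compress, -v verbose, -f output filename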

Extracting a tar.gz file
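A matching sketch for extraction:

    tar -xzvf seqdata.tar.gz   # -x extract; restores the files under seqdata/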

CASE STUDY: REPRODUCIBLY DOWNLOADING DATA

GRCm38 mouse reference genome We usually download genomic resources like sequence and annotation files from remote servers over the Internet, and these resources may change in the future. Furthermore, new versions of sequence and annotation data may be released, so it is imperative that we document everything about how the data was acquired for full reproducibility. The human, mouse, zebrafish, and chicken genome releases are coordinated through the Genome Reference Consortium (https://www.ncbi.nlm.nih.gov/grc).

GRCm38 mouse reference genome The GRC prefix in GRCm38 refers to the Genome Reference Consortium. We can download GRCm38 from Ensembl using wget.
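A sketch of the download; the exact FASTA filename on the Ensembl FTP server may differ between releases:

    wget ftp://ftp.ensembl.org/pub/release-87/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz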

Compare checksum values The checksum values for this release are published at ftp://ftp.ensembl.org/pub/release-87/fasta/mus_musculus/dna/checksums
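A sketch, assuming the filename from the download step; Ensembl's checksums file is typically generated with the Unix sum command, while sha1sum produces the SHA-1 values mentioned below:

    sum Mus_musculus.GRCm38.dna.toplevel.fa.gz      # compare against the remote checksums file
    sha1sum Mus_musculus.GRCm38.dna.toplevel.fa.gz  # SHA-1 digest to copy into the README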

Extract the FASTA headers
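For example, keeping the genome compressed and paging through the headers:

    zgrep "^>" Mus_musculus.GRCm38.dna.toplevel.fa.gz | less   # FASTA headers start with >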

Document README Document how and when we downloaded this file in a README file, and copy the SHA-1 checksum values into the README.

UNIX DATA TOOLS

Overview Understanding how to use Unix data tools in bioinformatics is not only about learning what each tool does; it is about mastering the practice of connecting tools together, creating programs from Unix pipelines. By connecting data tools with pipes, we can construct programs that parse, manipulate, and summarize data. Unix pipelines can be developed in shell scripts or as one-liners (tiny programs built by connecting Unix tools with pipes directly on the shell).

Overview Building more complex programs from small, modular tools capitalizes on the design and philosophy of Unix. The pipeline approach to building programs is a well-established tradition in Unix and bioinformatics because it is a fast way to solve problems, incredibly powerful, and adaptable to a variety of problems.

When to use the Unix pipeline approach The Unix one-liner approach is not appropriate for all problems. Many bioinformatics tasks are better accomplished through a custom, well-documented script. Knowing when to use a fast and simple engineering solution like a Unix pipeline and when to resort to writing a well-documented Python or R script takes experience.

When to use the Unix pipeline approach Unix pipelines: a fast, low-level data manipulation toolkit to explore data, transform data between formats, and inspect data for potential problems. Useful when we want to get a quick answer and keep moving forward with our project. It is essential that everything that produces a result is documented; storing pipelines in shell scripts is a good approach. Custom scripts in Python or R: useful for larger, more complex tasks, as these allow for flexibility in checking input data, structuring programs, using data structures, and documenting code.

Inspecting and manipulating text data Many formats in bioinformatics are simple tabular plain-text files delimited by a character. The most common tabular plain-text file format used in bioinformatics is tab-delimited, because most Unix tools treat tabs as delimiters by default. Tab-delimited file formats are also simple to parse with scripting languages like Python and Perl, and easy to load into R.

Tabular plain-text data formats The basic format: each row (known as a record) is kept on its own line, and each column (known as a field) is separated by some delimiter. There are three common formats: tab-delimited, comma-separated, and variable space-delimited.

Tab-delimited The most commonly used format in bioinformatics (e.g. BED, GTF/GFF, SAM, VCF). Columns of a tab-delimited file are separated by a single tab character (the escape code \t). A common convention (not a standard) is to include metadata on the first few lines of a tab-delimited file; these metadata lines begin with #. Tabs in data are not allowed.
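As a minimal illustration (names and coordinates invented), a three-column tab-delimited file with a metadata line might look like:

    # genome build: GRCm38
    chr1    201     250
    chr1    300     380

Each column above is separated by a single tab character.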

Comma-separated values (CSV) CSV is similar to tab-delimited, except the delimiter is a comma character. While not a common occurrence in bioinformatics, data stored in CSV format may itself contain commas. Some variants simply do not allow this, while others use quotes around entries that could contain commas.

Variable space-delimited In general, tab-delimited formats and CSV are better choices than variable space-delimited formats because it is quite common to encounter data containing spaces.

How lines are separated Linux and OS X use a single linefeed character (the escape code \n) to separate lines. Windows uses a DOS-style line separator of a carriage return and a linefeed character (\r\n). To convert DOS to Unix text format, use dos2unix. To convert Unix to DOS text format, use unix2dos.
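A quick way to check which line endings a file uses is the file command; a sketch (the filename is illustrative):

    file data.txt      # reports e.g. "ASCII text, with CRLF line terminators" for DOS-style files
    dos2unix data.txt  # converts \r\n line endings to \n in place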

Inspecting data with head and tail Many files in bioinformatics are much too long to inspect with cat. Running cat on a file a million lines long would quickly fill your shell. A better option is to take a look at the top of a file with head.
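For example, assuming a tab-delimited file named example.bed:

    head example.bed   # prints the first 10 lines by default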

Inspecting data with head and tail We can control how many lines we see with the -n option.
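For example:

    head -n 3 example.bed   # print only the first 3 lines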

Inspecting data with head and tail tail is designed to look at the end of a file. tail works just like head.
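For example:

    tail -n 3 example.bed   # print the last 3 lines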

Inspecting data with head and tail We can also use tail to remove the header of a file. If -n is given a number x preceded by a + sign (e.g. +x), tail will start from the xth line.
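For example, to drop a single header line (the filename is illustrative):

    tail -n +2 example.gtf > example_noheader.gtf   # output starts at line 2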

Inspecting data with head and tail head is useful for taking a peek at data resulting from a Unix pipeline. Suppose we will use grep's results as the standard input for the next program in our pipeline, but first we want to check grep's standard out to see if everything looks correct. When head exits, your shell catches this and stops the entire pipe. When building complex pipelines that process large amounts of data, this behavior is important.
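A sketch of this pattern (the pattern and filename are illustrative):

    grep "exon" example.gtf | head -n 5   # head exits after 5 lines and the shell stops grep too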

less less is a useful program for inspecting files and the output of pipes. It is a terminal pager: a program that allows us to view large amounts of text in our terminal one screen at a time. less has more features than, and is generally preferred over, the older terminal pager called more.

less
Shortcut      Action
Space bar     Next page
b             Previous page
g             First line
G             Last line
j             Down one line at a time
k             Up one line at a time
/<pattern>    Search down for string <pattern>
?<pattern>    Search up for string <pattern>

less less is useful in debugging our command-line pipelines: just pipe the output of the command you want to debug to less. When you run the pipe, less will capture the output of the last command and pause so you can inspect it. less is crucial when iteratively building up a pipeline.
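A sketch of the iterative pattern, reusing the compressed FASTQ example (the search pattern is illustrative):

    zcat example.fastq.gz | less                     # inspect the first stage
    zcat example.fastq.gz | grep -B1 "AGAT" | less   # verify the next stage before extending the pipeline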

less A useful behavior of pipes is that the execution of a program with output piped to less is paused when less has a full screen of data. When you pipe a program's output to less and inspect it, less stops reading input from the pipe; the pipe blocks, and we can spend as much time as needed inspecting the output.