Essential Skills for Bioinformatics: Unix/Linux

WORKING WITH COMPRESSED DATA

Overview Data compression, the process of condensing data so that it takes up less space (on disk, in memory, or across network transfers), is an indispensable technology in modern bioinformatics. For example, consider the sequences from a recent Illumina HiSeq run: example.fastq is 63,203,414,514 bytes (59 GB), while example.fastq.gz is 21,408,674,240 bytes (20 GB). The compression ratio (uncompressed size/compressed size) of this data is about 2.95, which translates to a space saving of about 66%.

Overview Data can remain compressed on disk throughout processing and analyses. Most well-written bioinformatics tools can work natively with compressed data as input, without requiring us to decompress it to disk first. Using pipes and redirection, we can stream compressed data and write compressed files directly to disk. Common Unix tools like cat and grep have variants that work with compressed data. While working with large datasets in bioinformatics can be challenging, the compression tools in Unix and in software libraries make our lives much easier.

gzip The two most common compression systems used on Unix are gzip and bzip2. gzip is faster than bzip2, while bzip2 achieves a higher compression ratio (the previous FASTQ file is only about 16 GB when compressed with bzip2). Generally, gzip is used in bioinformatics to compress most sizable files, while bzip2 is more common for long-term data archiving.

gzip gzip can compress results from standard input. This is useful, as we can compress results directly from another bioinformatics program's standard output.
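A minimal sketch of this pattern (the trimming program and filenames here are hypothetical placeholders):

    trimmer in.fastq | gzip > in_trimmed.fastq.gz   # compress output as it streams; nothing uncompressed touches the disk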

gzip gzip can also compress files on disk in place, replacing the original uncompressed version with the compressed file (appending the extension .gz to the original filename).
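For example, with the tb1.fasta file used on these slides:

    gzip tb1.fasta   # replaces tb1.fasta with tb1.fasta.gz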

gunzip We can decompress files in place with the command gunzip. Note that this replaces the tb1.fasta.gz file with the decompressed version.
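For example:

    gunzip tb1.fasta.gz   # replaces tb1.fasta.gz with the uncompressed tb1.fasta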

gzip -c Both gzip and gunzip can also output their results to standard out. This can be enabled using the -c option:
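A sketch with the same example file, leaving the original untouched:

    gzip -c tb1.fasta > tb1.fasta.gz        # compress to standard out, redirect to a file
    gunzip -c tb1.fasta.gz > tb1-copy.fasta # decompress to standard out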

gzip with multiple files
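Unlike zip, gzip does not bundle multiple files into one archive: given several filenames, it compresses each file separately. A minimal sketch (the FASTQ filenames are hypothetical):

    gzip in1.fastq in2.fastq                       # produces in1.fastq.gz and in2.fastq.gz
    cat in1.fastq in2.fastq | gzip > in.fastq.gz   # concatenate first for a single compressed file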

Working with gzipped files The greatest advantage of gzip (and bzip2) is that many Unix and bioinformatics tools can work directly with compressed files. For example, we can search compressed files using grep's analog for gzipped files, zgrep. Likewise, cat has zcat. If a program cannot handle compressed input, you can use zcat and pipe its output directly to the standard input of that program.
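A sketch using the example.fastq.gz file from the overview (the search pattern is illustrative):

    zgrep -c "AGATAGAT" example.fastq.gz   # search the compressed file directly
    zcat example.fastq.gz | head -n 4      # stream decompressed data into any program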

Creating a tar.gz archive
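A minimal sketch, assuming a data directory named seqdata/ to archive:

    tar -czvf seqdata.tar.gz seqdata/   # -c create, -z gzip-compress, -v verbose, -f output filename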

Extracting a tar.gz file
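A matching sketch for extraction:

    tar -xzvf seqdata.tar.gz   # -x extract; restores the files under seqdata/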

CASE STUDY: REPRODUCIBLY DOWNLOADING DATA

GRCm38 mouse reference genome We usually download genomic resources like sequence and annotation files from remote servers over the Internet, and these resources may change in the future. Furthermore, new versions of sequence and annotation data may be released, so it is imperative that we document everything about how the data was acquired for full reproducibility. The human, mouse, zebrafish, and chicken genome releases are coordinated through the Genome Reference Consortium (https://www.ncbi.nlm.nih.gov/grc).

GRCm38 mouse reference genome The GRC prefix in GRCm38 refers to the Genome Reference Consortium. We can download GRCm38 from Ensembl using wget.
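A sketch of the download; the exact FASTA filename on the Ensembl FTP server may differ between releases:

    wget ftp://ftp.ensembl.org/pub/release-87/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz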

Compare checksum values The checksum values for this release are published at ftp://ftp.ensembl.org/pub/release-87/fasta/mus_musculus/dna/checksums
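A sketch, assuming the filename from the download step; Ensembl's checksums file is typically generated with the Unix sum command, while sha1sum produces the SHA-1 values mentioned below:

    sum Mus_musculus.GRCm38.dna.toplevel.fa.gz      # compare against the remote checksums file
    sha1sum Mus_musculus.GRCm38.dna.toplevel.fa.gz  # SHA-1 digest to copy into the README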

Extract the FASTA headers
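For example, keeping the genome compressed and paging through the headers:

    zgrep "^>" Mus_musculus.GRCm38.dna.toplevel.fa.gz | less   # FASTA headers start with >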

Document README Document how and when we downloaded this file in a README file, and copy the SHA-1 checksum values into the README.

UNIX DATA TOOLS

Overview Understanding how to use Unix data tools in bioinformatics is not only about learning what each tool does; it is about mastering the practice of connecting tools together, creating programs from Unix pipelines. By connecting data tools with pipes, we can construct programs that parse, manipulate, and summarize data. Unix pipelines can be developed in shell scripts or as one-liners (tiny programs built by connecting Unix tools with pipes directly on the shell).

Overview Building more complex programs from small, modular tools capitalizes on the design and philosophy of Unix. The pipeline approach to building programs is a well-established tradition in Unix and bioinformatics because it is a fast way to solve problems, incredibly powerful, and adaptable to a variety of problems.

When to use the Unix pipeline approach The Unix one-liner approach is not appropriate for all problems. Many bioinformatics tasks are better accomplished through a custom, well-documented script. Knowing when to use a fast and simple engineering solution like a Unix pipeline and when to resort to writing a well-documented Python or R script takes experience.

When to use the Unix pipeline approach Unix pipelines: a fast, low-level data manipulation toolkit to explore data, transform data between formats, and inspect data for potential problems. Useful when we want to get a quick answer and keep moving forward with our project. It is essential that everything that produces a result is documented; storing pipelines in shell scripts is a good approach. Custom scripts in Python or R: useful for larger, more complex tasks, as these allow for flexibility in checking input data, structuring programs, using data structures, and documenting code.

Inspecting and manipulating text data Many formats in bioinformatics are simple tabular plain-text files delimited by a character. The most common tabular plain-text file format used in bioinformatics is tab-delimited, because most Unix tools treat tabs as delimiters by default. Tab-delimited file formats are also simple to parse with scripting languages like Python and Perl, and easy to load into R.

Tabular plain-text data formats The basic format: each row (known as a record) is kept on its own line, and each column (known as a field) is separated by some delimiter. There are three common formats: tab-delimited, comma-separated, and variable space-delimited.

Tab-delimited The most commonly used format in bioinformatics (e.g. BED, GTF/GFF, SAM, VCF). Columns of a tab-delimited file are separated by a single tab character (the escape code \t). A common convention (not a standard) is to include metadata on the first few lines of a tab-delimited file; these metadata lines begin with #. Tabs in data are not allowed.
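As a minimal illustration (names and coordinates invented), a three-column tab-delimited file with a metadata line might look like:

    # genome build: GRCm38
    chr1    201     250
    chr1    300     380

Each column above is separated by a single tab character.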

Comma-separated values (CSV) CSV is similar to tab-delimited, except the delimiter is a comma character. While not a common occurrence in bioinformatics, data stored in CSV format may itself contain commas. Some variants simply do not allow this, while others use quotes around entries that could contain commas.

Variable space-delimited In general, tab-delimited formats and CSV are better choices than variable space-delimited formats because it is quite common to encounter data containing spaces.

How lines are separated Linux and OS X use a single linefeed character (the escape code \n) to separate lines. Windows uses a DOS-style line separator of a carriage return and a linefeed character (\r\n). To convert DOS to Unix text format, use dos2unix. To convert Unix to DOS text format, use unix2dos.
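A quick way to check which line endings a file uses is the file command; a sketch (the filename is illustrative):

    file data.txt      # reports e.g. "ASCII text, with CRLF line terminators" for DOS-style files
    dos2unix data.txt  # converts \r\n line endings to \n in place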

Inspecting data with head and tail Many files in bioinformatics are much too long to inspect with cat. Running cat on a file a million lines long would quickly fill your shell. A better option is to take a look at the top of a file with head.
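For example, assuming a tab-delimited file named example.bed:

    head example.bed   # prints the first 10 lines by default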

Inspecting data with head and tail We can control how many lines we see with the -n option.
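For example:

    head -n 3 example.bed   # print only the first 3 lines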

Inspecting data with head and tail tail is designed to look at the end of a file. tail works just like head.
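For example:

    tail -n 3 example.bed   # print the last 3 lines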

Inspecting data with head and tail We can also use tail to remove the header of a file. If -n is given a number x preceded by a + sign (e.g. +x), tail will start from the xth line.
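For example, to drop a single header line (the filename is illustrative):

    tail -n +2 example.gtf > example_noheader.gtf   # output starts at line 2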

Inspecting data with head and tail head is useful for taking a peek at data resulting from a Unix pipeline. Suppose we will use grep's results as the standard input for the next program in our pipeline, but first we want to check grep's standard out to see if everything looks correct. When head exits, your shell catches this and stops the entire pipe. When building complex pipelines that process large amounts of data, this behavior is important.
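A sketch of this pattern (the pattern and filename are illustrative):

    grep "exon" example.gtf | head -n 5   # head exits after 5 lines and the shell stops grep too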

less less is a useful program for inspecting files and the output of pipes. It is a terminal pager: a program that allows us to view large amounts of text in our terminal one screen at a time. less has more features than, and is generally preferred over, the older terminal pager called more.

less
Shortcut      Action
Space bar     Next page
b             Previous page
g             First line
G             Last line
j             Down one line at a time
k             Up one line at a time
/<pattern>    Search down for string <pattern>
?<pattern>    Search up for string <pattern>

less less is useful in debugging our command-line pipelines: just pipe the output of the command you want to debug to less. When you run the pipe, less will capture the output of the last command and pause so you can inspect it. less is crucial when iteratively building up a pipeline.
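A sketch of the iterative pattern, reusing the compressed FASTQ example (the search pattern is illustrative):

    zcat example.fastq.gz | less                     # inspect the first stage
    zcat example.fastq.gz | grep -B1 "AGAT" | less   # verify the next stage before extending the pipeline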

less A useful behavior of pipes is that the execution of a program with output piped to less is paused when less has a full screen of data. When you pipe a program's output to less and inspect it, less stops reading input from the pipe; the pipe blocks, and we can spend as much time as needed inspecting the output.