Practical Linux examples: Exercises

1. Login (ssh) to the machine that you are assigned for this workshop (assigned machines: https://cbsu.tc.cornell.edu/ww/machines.aspx?i=87 ). Prepare a working directory and copy the data files into it. (Replace XXXXX in the commands with your BioHPC user ID.)

mkdir /workdir/XXXXX
cd /workdir/XXXXX
cp /shared_data/linux_workshop2/* ./

2. You are given a gzipped GFF3 file. Inspect the contents of the file.

Linux functions: gunzip -c, head, tail, cut, less

Inspect the first and last 100 lines of the file, using the "head" and "tail" functions to retrieve the first and last lines of the file. Retrieve lines 1001 to 2000 from the file and write these lines into a new file "mynew.gtf". Inspect columns 2 to 5 and 8 of lines 3901 to 4000, using the "cut" function to specify the columns. Compare "cut" with the "less -S" function. If you use "less", remember to exit "less" by pressing "q".

gunzip -c human.gff3.gz | head -n 100
gunzip -c human.gff3.gz | tail -n 100
gunzip -c human.gff3.gz | head -n 2000 | tail -n 1000 > mynew.gtf
gunzip -c human.gff3.gz | head -n 4000 | tail -n 100 | cut -f 2-5,8
gunzip -c human.gff3.gz | head -n 4000 | tail -n 100 | less -S
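If you prefer a single command for extracting a line range, "sed -n" prints only the requested lines; the "mynew.gtf" step above could equivalently be written as:

# print only lines 1001 to 2000 of the decompressed stream
gunzip -c human.gff3.gz | sed -n '1001,2000p' > mynew.gtf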

3. Count the number of genes listed in the file.

Linux functions: awk, uniq

Count the total number of lines in the file using the "wc -l" function. Then count the number of genes listed in the file: first use "awk" to retrieve the lines whose 3rd column equals "gene", then count these lines with "wc -l". Finally, count the number of lines in each of the feature categories (gene, exon, rRNA, miRNA, et al.) listed in this GFF3 file; the 3rd column of a GFF3 file defines the feature category. First "cut" out the 3rd column of the file, then use "sort | uniq -c" to count the occurrences of each category. Note: "uniq" requires a sorted input stream, so always run "sort" before "uniq -c".

gunzip -c human.gff3.gz | wc -l
gunzip -c human.gff3.gz | awk '{if ($3=="gene") print}' | wc -l
gunzip -c human.gff3.gz | cut -f 3 | sort | uniq -c

4. Convert the GFF3 file to a BED file, using only the lines whose third column equals "gene". Then add "chr" to the chromosome name, so that chromosome 1 becomes "chr1" instead of "1".

Linux functions: awk, sed

A BED file requires a minimum of 3 columns (chromosome, start position and end position). It is important to note that start and end positions are defined differently in BED and GFF3. In a GFF3 file, the start and end positions are both 1-based (the first nucleotide of a sequence is at position 1). In a BED file, the start position is 0-based (the first nucleotide of a sequence is at position 0) and the end position is 1-based. When converting GFF3 to BED, you therefore need to subtract 1 from the start position. In the following command, the "\" characters are used to split a long command into multiple lines. The expression 'BEGIN {OFS = "\t"}' specifies that the output stream uses tabs as delimiters. The "sed" function is probably the fastest way to modify text in a large file or stream. In this command, the output from awk is piped into "sed", which adds "chr" to the chromosome name. The "^" character in the "sed" expression anchors the change to the beginning of each line.

# The following three lines are one single command
gunzip -c human.gff3.gz | \
awk 'BEGIN {OFS = "\t"}; {if ($3=="gene") print $1,$4-1,$5}' | \
sed "s/^/chr/" > mygenes.bed
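To make the conversion concrete with a made-up record: a GFF3 "gene" line on chromosome 1 with start 1001 (column 4) and end 2000 (column 5) becomes the BED line

chr1	1000	2000

after the awk and sed steps above. The interval length is unchanged: 2000 - 1001 + 1 = 1000 bp in GFF3 coordinates, and 2000 - 1000 = 1000 bp in BED coordinates.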

5. Get the size distribution of genes.

Linux functions: awk, sort, uniq

Calculate the size of each gene. Gene size is calculated by subtracting the "gene start position" (column 4) from the "gene end position" (column 5) and adding 1, because GFF3 files use a 1-based coordinate system for both start and end positions. (If a BED file is used to calculate gene size, you do not need to add 1, because the start position is 0-based in a BED file.) To get the size distribution, you need to add two things to the previous command: 1) the expression int(($5-$4+1)/1000) converts the sequence size from base pairs to kilobase pairs and truncates the result to an integer; 2) the "LC_ALL=C sort -n | uniq -c" combination produces the histogram. Note the two new things added to the "sort" function: "LC_ALL=C" forces "sort" to use the default "C" (US English) locale to interpret characters, which is important to make sure that "sort" behaves predictably; "-n" tells "sort" to sort numerically. As the output of this command is very long, you can write it to a file "gene_dist.txt".

gunzip -c human.gff3.gz | awk '{if ($3=="gene") print $5-$4+1}'

# The following three lines are one single command
gunzip -c human.gff3.gz | \
awk '{if ($3=="gene") print int(($5-$4+1)/1000)}' | \
LC_ALL=C sort -n | uniq -c > gene_dist.txt
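As a quick worked example of the binning (using made-up coordinates): a gene with start 1001 and end 4500 has size 4500 - 1001 + 1 = 3500 bp, and int(3500/1000) = 3, so it is counted in the 3 kb bin of the histogram.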

6. Count the number of genes and pseudogenes in sliding windows across the whole chromosome.

Linux functions: paste; BEDTools: makewindows, coverage

BED, GFF3/GTF, SAM/BAM and VCF files are all tab-delimited text files used for describing features of chromosomal intervals. Software like BEDTools, BEDOPS, VCFTools, SAMtools, BAMtools, et al. is often used in combination with basic Linux functions to process these files. In this exercise, you will use BEDTools, a very efficient tool for analyzing chromosomal interval files, e.g. counting, within each interval of file A, the occurrences of features in file B.

First, generate a text file with sliding windows across the chromosome. The input file for the "makewindows" function is a text file with the length of each chromosome (hg19.txt). The "-w" and "-s" options specify the window and step size for the sliding windows; in this example, the sliding window size is 1 Mb.

In the next two steps, you will count the number of genes and pseudogenes in each sliding window. To do this, use "awk" to select the lines with "gene" or "pseudogene" in column 3, then use "bedtools coverage" to count the number of genes in each sliding window. The sorting step after "bedtools coverage" is necessary because bedtools tends to output unsorted results. In this case, you sort the file by two columns: column 1 (chromosome name) and column 2 (position). Note that you need to use "version"-style sorting for column 1 (-k1,1V) and numerical sorting for column 2 (-k2,2n), because chromosome names are like version-number strings, e.g. chr1, chr2, ..., chr10. With "V" sorting, "chr1" is placed before "chr10"; with general text sorting, "chr1" would be placed after "chr10". The two numbers in "-k1,1V" indicate the start and end columns for sorting. The "paste" function is used to concatenate the columns, followed by "cut" to output the selected columns. Note: we use "paste" here because we know that the two files have the same number of corresponding rows; if you are not sure, you need to use the "join" function.

bedtools makewindows -g hg19.txt -w 1000000 -s 1000000 > win1mb.bed

# The following four lines are one single command
gunzip -c human.gff3.gz | \
awk 'BEGIN {OFS = "\t"}; {if ($3=="gene") print $1,$4-1,$5}' | \
bedtools coverage -a win1mb.bed -b stdin -counts | \
LC_ALL=C sort -k1,1V -k2,2n > gene.cover.bed

# The following five lines are one single command
gunzip -c human.gff3.gz | \
awk 'BEGIN {OFS = "\t"}; \
{if (($3=="processed_pseudogene") || ($3=="pseudogene")) print $1,$4-1,$5}' | \
bedtools coverage -a win1mb.bed -b stdin -counts | \
LC_ALL=C sort -k1,1V -k2,2n > pseudogene.cover.bed

paste gene.cover.bed pseudogene.cover.bed | \
cut -f 1,2,3,4,8 > genecounts.txt
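For reference, the genome file passed to "bedtools makewindows -g" is simply a two-column, tab-delimited list of chromosome names and lengths, one chromosome per line. A sketch of what the first lines of hg19.txt could look like (the names must match those used in the GFF3 file; the lengths shown are the standard hg19 chromosome sizes):

1	249250621
2	243199373
3	198022430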

7. Use the fastx_clipper command on a FASTQ file to trim sequencing adapters, then get the size distribution of the trimmed sequencing reads.

Linux functions: grep, wc -l, awk; FASTX: fastx_clipper

First, estimate the percentage of sequencing reads that contain the adapter sequence "AGATCGGAAGAGC". As the file could be very big, you can estimate the percentage based on the first 10,000 sequencing reads. (Note: sometimes the first 10,000 reads could all be low-quality reads, in which case this estimate would not be accurate. You might want to try a few blocks of reads at different places in the file by using "head -n XXXXX | tail -n XXXXX".) The "grep" function in this command selects the lines that contain a specific string, and the "wc -l" function that follows counts the number of such lines. By doing this, you will find that ~48% of the reads in this file contain the adapter sequence.

Now remove the adapter sequences. For this, you will use a specialized tool, fastx_clipper, which allows mismatches, and write the output into a new FASTQ file "clean.fastq". You can pipe the "gunzip -c" output directly into the input stream of fastx_clipper.

Now get the read length distribution. In a FASTQ file, each sequence record has 4 lines, and the second line of the record is the actual DNA sequence. The "awk" function has a variable "NR" that records the line number of each row. The expression NR%4 gives the remainder of NR divided by 4. The statement "if (NR%4 == 2) print length($0)" means "print the length of the second line in every 4-line sequence record". The output of awk can then be piped into "LC_ALL=C sort -n | uniq -c" to get the read size distribution.

gunzip -c SRR836349.fastq.gz | head -n 40000 | grep AGATCGGAAGAGC | wc -l
gunzip -c SRR836349.fastq.gz | fastx_clipper -a AGATCGGAAGAGC -Q33 > clean.fastq
awk '{if (NR%4 == 2) print length($0)}' clean.fastq | LC_ALL=C sort -n | uniq -c
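To illustrate why NR%4 == 2 picks out the sequence, here is a made-up 4-line FASTQ record; line 1 is the read name, line 2 the DNA sequence, line 3 the "+" separator, and line 4 the per-base quality string:

@read_001
ACGTACGTACGTAGATCGGAAGAGC
+
IIIIIIIIIIIIIIIIIIIIIIIII

In a file of such records, the sequence lines fall at line numbers 2, 6, 10, ..., which are exactly the lines where NR%4 == 2.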

8. Running multiple independent tasks in parallel on a multi-CPU machine.

As a simple example of a multiple-independent-task problem, we will compress several (here: 5) large files, reads_1.fastq, reads_2.fastq, ..., reads_5.fastq, using the gzip compression tool. The compression should be run in parallel using several (e.g., 3) CPU cores.

Instructions: Your scratch directory /workdir/XXXXX should already contain the five *.fastq files mentioned above; verify this using the ls command. If the files are not present, copy them as follows (replacing XXXXX with your user ID):

cd /workdir/XXXXX
cp /shared_data/linux_workshop2/reads_*.fastq .

Create a text file called my_tasks containing the list of tasks to be executed, i.e., the five gzip commands, one per line. The file should look like this:

gzip reads_1.fastq
gzip reads_2.fastq
gzip reads_3.fastq
gzip reads_4.fastq
gzip reads_5.fastq

This file can be created, for example, using a text editor such as nano or vim. Such hand-editing will work fine for a small number of task lines, but not when this number is, say, 1,000 or more. In the latter case, it is more practical to use the loop capability of the bash shell by typing:

for i in {1..5}
do
echo "gzip reads_${i}.fastq"
done > my_tasks

The shell commands between do and done (here: just one echo command) are repeated as specified in the for statement, i.e., 5 times, each time with the variable i incremented by 1, starting at 1 and ending at 5. (We could also write "for i in 1 2 3 4 5", which would be equivalent to the present statement, or "for i in 1 3 5", which would make i assume only the values 1, 3, and 5.) The current value of i is accessed inside the loop using the syntax ${i}. The command echo just prints its argument (the text between the quotes) to the screen, so between do and done 5 lines are generated. Thanks to the redirection operator >, all these lines are saved to the file my_tasks instead of being printed to the screen.

NOTE that you can always use the for loop to execute multiple similar shell commands, indexed by a variable like i. NOTE also that the index i may be called differently (x, Rob, tuna, or anything else), and that the values it loops over do not have to be integers; strings are just fine if they are more convenient for indexing the commands within the loop, as shown in the sketch below.

After creating the my_tasks file, especially if you did it using the shell loop described above, verify that it makes sense by looking at the first few lines, e.g.:

head -2 my_tasks
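As a minimal sketch of a loop indexed by strings instead of integers (the file names are the ones from this exercise; the output file name my_odd_tasks is made up for this illustration):

# loop over a list of words rather than a numeric range
for f in reads_1.fastq reads_3.fastq reads_5.fastq
do
echo "gzip ${f}"
done > my_odd_tasks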

Now that the task file my_tasks is created, it is time to run these tasks in parallel! To try this, run the following command:

perl_fork_univ.pl my_tasks 3 >& my_tasks.log &

Immediately after submitting the above command (thanks to the & at the end, it runs in the background and frees the terminal for further action), run "ls -alrt" a few times and verify that the compressed files reads_1.fastq.gz, ..., reads_5.fastq.gz are being created and are growing in size. First you should see three such files, then the remaining two. While the operation is running, the original (uncompressed) versions of the files co-exist with the partial compressed ones; once the compression of a file is completed, its original version is deleted. Run the command "top -u yourid" (replacing yourid with your user ID); you should see your three gzip processes at the top of the list, consuming a sizeable fraction of CPU time (to exit top, press q).

So, what is happening here? The script perl_fork_univ.pl (located in /programs/bin/perlscripts) reads the list of tasks in my_tasks and launches the first three simultaneously, on separate CPU cores. Once any of these initial three tasks completes, freeing up its CPU core, the next task is launched in its place (with the two others still running). The process continues until all tasks are finished, while the number of CPU cores being utilized never exceeds 3. This way, the load is balanced across the three CPU cores. A report from the run is saved in the log file my_tasks.log; it contains some useful timing information you may want to analyze.

A few remarks:

The script perl_fork_univ.pl is a simple way to parallelize any number of independent tasks on a multi-core machine. Just construct the task list file and run it through the script. The tasks do not need to be similar (as is the case in our simple example), but they do need to be independent of one another.

How many CPU cores should you use (here: 3)? It depends on many factors, such as:

- total number of CPU cores on the machine: launching more tasks than available CPU cores is counter-productive;
- memory taken by each task: the combined memory required by all simultaneously running tasks should not exceed about 90% of the total memory available on the machine; the memory taken by each of your running processes can be monitored with top;
- disk I/O bandwidth: if all simultaneously running tasks read and write a lot to the same disk, they will compete for disk bandwidth, so running too many of them may slow things down instead of speeding them up (this is, in fact, the case in our example, since gzip-ing is a disk-intensive operation);
- other jobs running on the machine: if present, they also take CPU cores and memory! For example, during this exercise there are about 7 other people working on your machine, each trying to run on 3 CPUs.

Typically, determining the right number of CPUs to run on requires some experimentation.
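As a minimal sketch of how to check the first two factors before choosing a core count (nproc and free are standard Linux utilities):

nproc     # number of CPU cores available on the machine
free -h   # total and used memory, in human-readable units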