Command line tools - bash, awk and sed

We can only explore a small fraction of the capabilities of the bash shell and command-line utilities in Linux during this course. An entire course could be taught about bash alone, but that is not the central topic of interest for this class. There are, however, a few more commands and techniques that are useful enough to merit mention in this session. Previous exercises have used both awk and sed, but this class will devote more attention to general principles of these tools, as well as some additional bash functions. We will also discuss "regular expressions" and file "globs" as part of learning more about command-line tools, because these are generally applicable to many functions.

The document FileGlobbing.pdf (linked from the web page) contains the information from the Ubuntu manpage on file globbing. Read this document, then interpret the following globs. What would each of the following patterns match? Test them using the ls command on the /data/atrnaseq/ directory:

    ls /data/atrnaseq/*.gz
    ls /data/atrnaseq/[ct][1-3].*
    ls /data/atrnaseq/[!a-m]*.fastq*
    ls /data/atrnaseq/[a-m]*.fastq*

Regular expressions are similar to file globs in the sense that both are ways of writing general "patterns" that can match many different targets. File globs apply specifically to file and directory names, while regular expressions apply to any text. Unfortunately there are multiple "flavors" of regular expressions: basic, extended, Perl-format, and so forth. Different tools use slightly different syntax and different flavors, so getting familiar enough with regular expressions to make them work for you is not a trivial task. The RegularExpressions.pdf document (linked from the web page) has an overview of regular expressions, although this is by no means a complete explanation of the topic.
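The glob patterns above can also be tried without the course data. The sketch below builds a throwaway directory; the file names in it are hypothetical stand-ins chosen for illustration, not the real contents of /data/atrnaseq/.

```shell
# Create a temporary directory holding a few hypothetical file names.
demo=$(mktemp -d)
touch "$demo"/c1.fq.gz "$demo"/t3.fq.gz "$demo"/c4.fq.gz \
      "$demo"/reads.fastq.gz "$demo"/notes.txt

# [ct][1-3].*  : one character c or t, one digit from 1 to 3, a dot, then anything.
glob_hits=$(cd "$demo" && echo [ct][1-3].*)

# [!a-m]*.fastq* : first character NOT in the range a-m, with ".fastq" later in the name.
not_am_hits=$(cd "$demo" && echo [!a-m]*.fastq*)

echo "$glob_hits"
echo "$not_am_hits"
rm -rf "$demo"
```

Here c4.fq.gz is skipped by the first pattern because 4 lies outside the range 1-3, and the second pattern picks up only the name that begins with a letter after m.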
Read the document, then interpret the following regular expressions, and note how parentheses are used to indicate backreferences that would be available for use in a substitution command. NOTE: grouping using () and alternative search terms using | are capabilities of "extended regular expressions" rather than "basic regular expressions", so test these regular expressions using the egrep command on a listing of the files in the /media/lubuntu/data/data directory:

    ls /data/atrnaseq/ | egrep "(fastq|fasta|fq|fa)\.gz"
    ls /data/atrnaseq/ | egrep "(cn|ts)[1-3]ln[^3a-zA-Z]\."

Note that grep finds matches to the search pattern regardless of the presence of other characters around the pattern - this is another difference between regular expressions and file globs. A file glob finds only filenames that are completely matched by the pattern, while regular expressions find text that contains the pattern.

Another version of the grep program is zgrep, which can search within gzipped files without saving a decompressed version of the file. This is useful if it is necessary to find subsets of data within a large file of sequence data stored in fastq.gz compressed format. One challenge with using zgrep on fastq.gz files is that it cannot readily distinguish between the four types of lines found in FASTQ-format files. We will come back to this point later, when we discuss awk in more depth.

With a basic understanding of file globs and regular expressions, we can look at how these are useful. We will start with commands for locating things. Linux system architecture is complex enough that it is not always obvious what software is installed on a given system, or if it is not installed, if it is available in pre-packaged form from the distribution repository. The Boost library required for compiling TopHat is an example - how can we find out if our Linux system already has a copy of the Boost library installed? If it does not, how do we know if the library is available in an Ubuntu package from the repository, or if we should download the source code, compile it, and install the library?

The software for managing packages in Ubuntu and other variants of the Debian Linux distribution is called dpkg, short for 'Debian package'. The command to search for packages with names containing the word 'boost' is

    dpkg-query -l '*boost*'

Try this and see what happens. The information returned provides the complete names of packages that match the search pattern, along with the status of those packages on the local system (in the first three columns of each line). If the line begins with 'ii', the package is installed and functional on the local system; if the line begins with 'un', the package is not installed. Try the search again using different patterns. For example, leave out one or both asterisks:

    dpkg-query -l 'boost*'

The characters 'boost' are surrounded by other characters in the names of the software packages libboost-system and libboost-thread, so both asterisks are required for the search to be successful - this behaves like a file glob, rather than like a regular expression search. The dpkg-query command only searches the names of packages, which may or may not include the pattern of interest.
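The difference between whole-name glob matching and substring regular-expression matching can be seen without dpkg installed. The following is a minimal sketch; the package names are hypothetical examples, and match_glob is a helper written only for this illustration (it is not part of dpkg).

```shell
# Hypothetical package names; a real list would come from dpkg-query.
packages="libboost-system1.58.0 libboost-thread-dev boost-build zlib1g"

# Print every name in $packages that the glob in $1 matches completely.
match_glob() {
  for p in $packages; do
    case "$p" in
      $1) printf '%s\n' "$p" ;;
    esac
  done
}

both=$(match_glob '*boost*')      # glob with both asterisks: three names match
trailing=$(match_glob 'boost*')   # name must START with boost: only boost-build
regex=$(printf '%s\n' $packages | grep -E 'boost')  # regex: substring match, no asterisks

echo "$both"
echo "$trailing"
```

The glob must account for the entire name, so the leading asterisk is needed to reach the libboost packages, while the regular expression finds 'boost' anywhere in the line.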
To search for packages that contain the pattern "boost" in either the package name or description, use the command apt-cache search:

    apt-cache search boost

The apt-cache is a list of names and descriptions of all available packages, installed or not, so this search returns a long list of packages that include the pattern 'boost' somewhere in the name or description. Note that leading and trailing asterisks are not needed, unlike dpkg-query - this behaves like a regular expression search rather than a file glob.

Another command for finding software on the local system is 'locate' - this uses a local database of file names, and so it is wise to make sure the database is up-to-date before doing a search. Execute the following two commands to update the database and then search it:

    sudo updatedb
    locate boost

Note that the locate command, similar to apt-cache search, does not require leading or trailing asterisk characters to find the pattern 'boost', even when it is surrounded by other characters. This highlights a reality of Linux systems - different parts of the same system don't always follow the same rules, so it is often important to try different versions of a command, or search online for tips using search keywords that include the specific command you are trying to use and the specific distribution you are working with.

An even more general tool for searching on the local system is the find command, which will return a longer list of files and directories containing the pattern 'boost' than the locate command. Try:

    sudo find /[!a]* -iname '*boost*'

This should be run with sudo privileges if you are searching (as in this example) from the root directory / rather than in your home directory. An ordinary user does not have read privileges on many of the directories searched by the find command, so running the command without sudo privileges results in a lot of Permission denied error messages.
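A find search can be rehearsed safely on a small directory tree before it is run on the whole system. This is a sketch under stated assumptions: the directory layout and file names are invented for the example, and the -iname option is assumed to be available (it is in GNU and BSD find).

```shell
# Build a small hypothetical directory tree to search.
demo=$(mktemp -d)
mkdir -p "$demo/lib" "$demo/include/Boost"
touch "$demo/lib/libboost_system.so" "$demo/lib/libz.so" \
      "$demo/include/Boost/version.hpp"

# Quote the pattern so the shell hands it to find unexpanded.
# 2>/dev/null discards error messages such as "Permission denied".
found=$(find "$demo" -iname '*boost*' 2>/dev/null | sort)

echo "$found"
rm -rf "$demo"
```

Discarding error messages with 2>/dev/null is an alternative to sudo when only the readable part of the file system matters. The match on the Boost directory shows the case-insensitive behavior of -iname.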
The /[!a]* glob expands to every top-level directory whose name does not begin with 'a', which keeps the /afs directory tree out of the search, because searching it would also result in many Permission denied error messages. The -iname option tells find to search for the pattern '*boost*' in filenames, in a case-insensitive manner. Other options allow the find command to search based on file type, file size, creation date, last modification date, or almost any other characteristic. Try the search using boost without the flanking asterisks - is the find function more similar to file globs or to regular expressions?

Loops and Conditionals

DNA sequence analysis often requires carrying out the same steps of data processing and analysis separately for multiple sequence files. For example, even our small example RNA-seq experiment with only two conditions and three replicates of each condition requires processing six sequence files separately up to the point where the data from the six files are combined into a single object containing read counts for each transcript from all six samples.

There are two general strategies for simplifying repetitive sets of tasks that need to be completed on multiple files, and then multiple methods to use either of the two strategies. The two strategies are serial processing and parallel processing - one either carries out the tasks on one file at a time, and repeats the same tasks on one file after another until the whole set is finished, or one carries out the tasks in parallel on all files at the same time. The choice of strategy is based on the amount of system resources required per task, the number of files to be processed, and the total amount of resources available. A workstation with 24 processors and 128 GB of RAM could carry out parallel processing of six RNA-seq data files from the example dataset at the same time, but the BIT laptops with 4 processors and 8 GB of RAM probably could not. Command-line options for parallel processing include the commands xargs with tee, or the parallel command; see the manpages for more information on xargs and tee, and the webpage for more information on parallel.

Loops

One way to carry out serial processing is to create a loop of commands. This is a sequence of commands provided to the interpreter at the same time, that (1) give a list of items to deal with, and (2) specify what is to be done with them.
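The contrast between serial and parallel processing can be sketched with xargs alone. In this example echo stands in for a real processing step, the sample names follow the six-file RNA-seq example, and the -P option (number of simultaneous jobs) is assumed to be available, as it is in GNU and BSD xargs.

```shell
samples="c1 c2 c3 t1 t2 t3"

# Serial: one job at a time (-n 1 passes one sample name to each job).
serial=$(printf '%s\n' $samples | xargs -n 1 echo processing)

# Parallel: up to 3 jobs at once (-P 3). The order in which jobs finish
# is not guaranteed, so the output is sorted before it is stored.
parallel_out=$(printf '%s\n' $samples | xargs -n 1 -P 3 echo processing | sort)

echo "$serial"
echo "$parallel_out"
```

The number given to -P would be chosen from the processors and memory available, as discussed above.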
The "items to deal with" can be as simple as counting through numbers in a sequence or processing each file in turn from a list of filenames, while the specification of "what is to be done with them" can be as simple or complex as desired. Loops are a general concept that can be applied in many ways using different command interpreters; we will focus on examples of loops in the bash shell and in the awk text processing language.

Bash loops can have any of three general forms: for loops, while loops, or until loops (see the Bash Beginners Guide). The for loop takes the form 'for X in Y; do Z; done'. X is a variable, Y is a list of values that will each be assigned to the variable X, and Z is an action or set of actions to be taken. A simple example which echoes the numbers 1 to 10 to the screen is:

    for x in {1..10}; do echo $x; done

This example demonstrates that numerical sequences can be specified using curly brackets and two dots separating the beginning and ending values. A list of text items, such as file names, can be provided as a space-separated list with no brackets or quotes (provided the file names do not contain spaces):

    for x in c1 c2 c3 t1 t2 t3; do echo $x; done

In order to help the bash shell recognize the boundaries of the string used as the variable name, we can put curly brackets around the string, so that adjacent characters are clearly marked as not part of the name. For example, if we wanted to use the same strategy as before to list the names of the sequence files, but used the variable only for the parts that are different, we could write:

    for x in c1 c2 c3 t1 t2 t3; do echo ${x}.fq.gz; done

Instead of just echoing the filename to the screen, we could just as easily use this loop to carry out serial alignment of all 6 RNA-seq data files to the chromosome 5 reference sequence to produce BAM output files:

    for x in c1 c2 c3 t1 t2 t3; do bwa mem -t 3 Atchr5 ${x}.fq.gz | samtools view -SbuF4 - | samtools sort -o ${x} -; done

Note that the value of the variable $x is used twice, once to name the input file to be aligned, and a second time to name the BAM file produced as output. This points out the advantages of a file-naming system that keeps all the relevant information together in one part of the filename, so the name can easily be subdivided into different parts, one specific for the sample identity and another specific for the file format or stage of processing.

There are also advantages to naming files so that the sample identity segment always consists of the same number of characters. For example, in analysis of genotyping-by-sequencing data using the STACKS pipeline, sequence reads for each sample are expected to be stored in separate files. This can mean dozens or hundreds of different sequence files, so typing out a list of all the filenames as demonstrated above would be very tedious.
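The role of the curly brackets around a variable name can be checked with a short sketch. The _sorted.bam suffix is a hypothetical naming choice made for this example, and the brace expansion {c,t}{1..3} is a bash feature offered here as one possible way to avoid typing six sample names.

```shell
x=c1
unset x_sorted   # make sure no variable with this name exists

# With brackets, the variable is x and "_sorted.bam" is ordinary text.
with_braces="${x}_sorted.bam"

# Without brackets, bash reads the name as x_sorted, which is unset and
# expands to nothing, so only ".bam" is left.
without_braces="$x_sorted.bam"

# Brace expansion builds the list of sample names from its parts.
names=$(echo {c,t}{1..3})

echo "$with_braces"
echo "$without_braces"
echo "$names"
```

In ${x}.fq.gz the brackets are optional, because a dot cannot be part of a variable name, but they become necessary as soon as the following character is a letter, digit, or underscore.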
If the files are named so that the sample identity always occupies the first 6 characters (eg p01a01 to p10h12 for ten 96-well plates of samples, each named for the plate and well position where it is stored), then a loop that processes all 960 files can be written in three lines:

    for file in *.fastq.gz; do name=${file:0:6};
    bwa mem -t 3 refindex/refname ${name}.fastq.gz | samtools view -SbuF4 - | samtools sort -o ${name} -;
    done

It is critical, in using loops of this sort, to make sure that the path information is correct relative to the working directory from which the command is executed. This example would be executed from within a directory containing the sequence files, and also containing a sub-directory called refindex in which the genome reference index called refname is stored.

Two more points are worth mentioning here. First, the list of file names that are assigned to the $file variable is produced using a file glob - this example is simple, but it can be as complex as necessary to achieve the desired result. Second, the first 6 characters of each filename are extracted using the syntax ${variable:position:length}, where "position" is 0-indexed. The first character of the filename in the $file variable is position 0, and a sub-string of length 6 beginning at that first character is assigned to the variable $name.

We can store the BAM output alignment files in a different directory than the input sequence files, if desired, by providing more path information in addition to filenames. For example,

    for file in seqdata/*.fastq.gz; do name=${file:8:6};
    bwa mem -t 3 refindex/refname seqdata/${name}.fastq.gz | samtools view -SbuF4 - | samtools sort - bamfiles/${name};
    done
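Before running a loop like these on real data, it can help to check the substring extraction on its own. The sketch below uses two hypothetical file names from the plate-and-well naming scheme; the basename alternative is an addition for comparison, not part of the loops above.

```shell
# A file name without any path: the sample name starts at position 0.
file=p01a01.fastq.gz
name=${file:0:6}

# With the path "seqdata/" in front (8 characters, positions 0 to 7),
# the sample name starts at position 8.
path=seqdata/p10h12.fastq.gz
name_from_path=${path:8:6}

# Alternative that needs no character counting: strip the directory
# and the suffix with basename.
base=$(basename "$path" .fastq.gz)

echo "$name $name_from_path $base"
```

The basename form keeps working if the directory name changes length, at the cost of starting one extra process per file.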

This bash loop would be executed from a directory containing three sub-directories: the seqdata/ directory with the fastq.gz sequence files for all samples, the refindex/ directory containing the reference genome index, and the bamfiles/ directory in which the BAM output is to be stored. Note that the substring assigned to the $name variable now starts at position 8 of the filename stored in the $file variable, because the path string "seqdata/" occupies positions 0 to 7.

Try the command below to see how files are listed from the /data/atrnaseq directory:

    for file in /data/atrnaseq/*gz; do echo $file; done

Does this command list only the RNAseq data files? If not, how should it be modified so that it lists only the 6 files of RNAseq data?

We have discussed the text processing language awk and its variant bioawk in the context of manipulating and summarizing FASTQ sequence and SAM alignment files, but it is a powerful tool for other kinds of text processing as well. For example, consider the problem of summarizing a file containing several FASTA-format sequences, each with a single header line and multiple lines of sequence, like the file test.fa available on the website. If we want to know how long each of the DNA sequences is, we can use an awk loop to condense all the sequence information onto a single line, then determine the length of that line.

    awk 'BEGIN {RS = ">"; FS = "\n"} {printf "%s\n", ">"$1} {for(i=2;i<=NF;i+=1) printf "%s", $i} {printf "%s","\n"}' infile.fa

This complex command begins with awk 'BEGIN {RS = ">"; FS = "\n"}: this tells the shell we are using the awk language, and sets the "record separator" character to ">" and the "field separator" character to newline. Recall that a "record" in awk refers to a set of data relevant to a particular observation, and "fields" are subsets of the data for that observation.
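To see how these two separators divide a FASTA file, here is a minimal sketch that reports the length of each sequence directly instead of printing the joined sequence. The two-sequence input is invented for the example, and the test on $1 is an addition: the text before the first ">" forms an empty first record, which the sketch skips.

```shell
# Write a small hypothetical FASTA file with multi-line sequences.
fasta=$(mktemp)
printf '>seq1 demo\nACGT\nAC\n>seq2\nGGG\n' > "$fasta"

# With RS=">" and FS="\n", $1 is the header line and $2..$NF are the
# sequence lines of one record. Records with an empty header are skipped.
lengths=$(awk 'BEGIN {RS = ">"; FS = "\n"}
  $1 != "" { len = 0
             for (i = 2; i <= NF; i += 1) len += length($i)
             print $1, len }' "$fasta")

echo "$lengths"
rm -f "$fasta"
```

Summing length($i) over the sequence fields gives the same number as joining the lines and measuring the result, without building the joined string.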
By changing the record and field separator values from the defaults (newline and whitespace, respectively) we tell awk that we want it to treat each FASTA-format sequence as a record, and the individual lines as fields within that record. The availability of the bioawk variant eliminates the need for this kind of code for use with FASTA or FASTQ files, but the occasion may arise to use this strategy in other text-processing applications, so it is useful to know that it is possible.

The printf command provides complete control over the format of printed output, at the cost of complex syntax. See the printf documentation for more information; a brief summary follows. The "%s" specifies we are printing a 'string', or text; we can specify other types of output with other characters.

    %d or %i - integer digits
    %e - number in scientific notation, eg 'printf "%4.3e\n", 1950' prints 1.950e+03 with 3 digits to the right of the decimal
    %c - a character specified by the ASCII character code, eg 'printf "%c", 65' produces the character 'A'
    %f - a decimal number or 'float', with the minimum width and the number of digits after the decimal point specified: 'printf "%4.3f", 1950' prints '1950.000'

Note that printf does not add a newline, so multiple printf statements will all be output to the same line unless newline characters '\n' are specified as part of the strings to be printed.

The next part of the awk command, {printf "%s\n", ">"$1}, prints the name of the sequence prefixed by a ">" symbol (which was removed because it is specified as the record separator), followed by a newline. The loop {for(i=2;i<=NF;i+=1) printf "%s", $i} prints each line of sequence in turn to output, with no intervening newline characters, for as many lines of sequence as there are in each record. The final command, {printf "%s","\n"}', prints a newline character at the end of the sequence so the stage is set for the beginning of the next header line. The input file can be sent to awk in a pipe, or awk can read directly from a file; default output goes to STDOUT (the screen), but a file destination can be specified in the print command if desired.

Another example of formatting sequences: the command

    awk '{printf "%s%06d\n%s\n", ">",FNR,$0 }' infile.txt > outfile.fa

takes as input a text file of short DNA sequences (such as GBS tags) and converts it to FASTA format, using as names the number of each sequence in the list. The 'printf "%s%06d\n%s\n"' part formats the numbers as 6 digits, with leading zeros to keep the sequence names the same length, eg >000001 to >100000 for a set of 100,000 GBS tags.

Conditionals

The bash shell has multiple types of conditional statements, but those are more complex than we can dig into today. We will focus on conditional statements in awk, because those are relatively simple and very useful for dealing with FASTQ formatted sequence files, where each record occupies four lines in the file. The mathematical operation "modulo" is valuable in this context - the modulo operation returns the remainder left over after dividing one number by another. Line number modulo 4 gives a unique identity to each of the four different types of lines in a FASTQ-format sequence file. In awk terminology, the operation is written NR%4, and the conditional test is specified with two equal signs to distinguish it from an assignment operation, which is written with one equal sign.

    awk 'NR%4==1 {print $0}' prints the first header line with the name of the sequence
    awk 'NR%4==2 {print $0}' prints the DNA sequence line
    awk 'NR%4==3 {print $0}' prints the header line for the quality scores (often just '+')
    awk 'NR%4==0 {print $0}' prints the quality score line

The awk utility does not have the capacity to read gzipped files directly, but it can accept input redirected from another shell command. For example, the command

    zcat /data/atrnaseq/c1.fq.gz | awk 'NR%4==1 {print $0}' | head

prints the first 10 lines of sequence names from the c1.fq.gz file.
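The modulo test can be tried on a tiny gzipped FASTQ file. This is a sketch with two invented reads; gzip -dc is used in place of zcat because it behaves the same way and is available on more systems.

```shell
# Write two hypothetical FASTQ records (four lines each) to a gzipped file.
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nFFFF\n' | gzip -c > "$fq"

# NR%4==1 selects the sequence-name line of every record.
names=$(gzip -dc "$fq" | awk 'NR%4==1 {print $0}')

# Two conditions in one awk program convert FASTQ to FASTA: replace the
# leading "@" of each name line with ">" and keep the sequence line.
fasta=$(gzip -dc "$fq" | awk 'NR%4==1 {print ">" substr($0, 2)}
                             NR%4==2 {print $0}')

echo "$names"
echo "$fasta"
rm -f "$fq"
```

Selecting lines by position is more reliable here than searching for lines that begin with '@', because '@' is also a valid quality-score character and can appear at the start of a quality line.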
This can also be written as

    awk 'NR%4==1 {print $0}' <(zcat /data/atrnaseq/c1.fq.gz) | head

This is an example of redirection, called process substitution in bash - the output of the zcat command is redirected from the screen to serve as input to the awk command. The second version allows the <(zcat filename) to be embedded at any point in a pipe, so it can be used to merge data from different files or different processing streams together if desired, while the first version uses zcat filename to start the pipeline, so it can't be inserted at any other position in the sequence of commands.
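As one illustration of merging two processing streams, the sketch below uses paste with two <( ) substitutions to put each read name next to the length of its sequence. The reads are invented for the example, gzip -dc stands in for zcat, and the choice of paste is an assumption about what a useful merge might look like.

```shell
# Two hypothetical FASTQ records in a gzipped file.
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nFFFFF\n' | gzip -c > "$fq"

# Each <( ) runs its own pipeline; paste joins the two outputs line by
# line, separated by a tab.
table=$(paste <(gzip -dc "$fq" | awk 'NR%4==1 {print $0}') \
              <(gzip -dc "$fq" | awk 'NR%4==2 {print length($0)}'))

echo "$table"
rm -f "$fq"
```

A plain pipe could not do this, because a command has only one standard input, while process substitution can supply as many input streams as the command accepts file arguments.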


Lecture 3 Tonight we dine in shell. Hands-On Unix System Administration DeCal Lecture 3 Tonight we dine in shell Hands-On Unix System Administration DeCal 2012-09-17 Review $1, $2,...; $@, $*, $#, $0, $? environment variables env, export $HOME, $PATH $PS1=n\[\e[0;31m\]\u\[\e[m\]@\[\e[1;34m\]\w

More information

Linux Essentials Objectives Topics:

Linux Essentials Objectives Topics: Linux Essentials Linux Essentials is a professional development certificate program that covers basic knowledge for those working and studying Open Source and various distributions of Linux. Exam Objectives

More information

Introduction To Linux. Rob Thomas - ACRC

Introduction To Linux. Rob Thomas - ACRC Introduction To Linux Rob Thomas - ACRC What Is Linux A free Operating System based on UNIX (TM) An operating system originating at Bell Labs. circa 1969 in the USA More of this later... Why Linux? Free

More information

LINUX FUNDAMENTALS. Supported Distributions: Red Hat Enterprise Linux 6 SUSE Linux Enterprise 11 Ubuntu LTS. Recommended Class Length: 5 days

LINUX FUNDAMENTALS. Supported Distributions: Red Hat Enterprise Linux 6 SUSE Linux Enterprise 11 Ubuntu LTS. Recommended Class Length: 5 days LINUX FUNDAMENTALS The course is a challenging course that focuses on the fundamental tools and concepts of Linux and Unix. Students gain proficiency using the command line. Beginners develop a solid foundation

More information

Lecture 18 Regular Expressions

Lecture 18 Regular Expressions Lecture 18 Regular Expressions In this lecture Background Text processing languages Pattern searches with grep Formal Languages and regular expressions Finite State Machines Regular Expression Grammer

More information

BASH SHELL SCRIPT 1- Introduction to Shell

BASH SHELL SCRIPT 1- Introduction to Shell BASH SHELL SCRIPT 1- Introduction to Shell What is shell Installation of shell Shell features Bash Keywords Built-in Commands Linux Commands Specialized Navigation and History Commands Shell Aliases Bash

More information
