Bioinformatics? Reads, assembly, annotation, comparative genomics and a bit of phylogeny
stefano.gaiarsa@unimi.it

Linux and the command line
PART 1: Survival kit for the bash environment

Purpose of the lesson
Familiarise with the command line interface (CLI)
Why?
- Most bioinformatics software is CLI-based
- Much bioinformatic data consists of huge text files
- Much bioinformatic work is repetitive
- (Bio)informatics is all about optimization

First of all: Linux doesn't bite! Take a look around the OS!

What is a command line interface?

Text-based interface between user and computer
Usually implemented with a shell
Shell: a computer program that takes commands (text input) and converts them into the appropriate operating system functions (other programs)
One of the most used shells is the Bourne Again Shell (BASH)
That's the one we will use!

Bash is a container with lots of different commands (tools)

Each command is very well suited for one simple task
We can combine commands to do less trivial tasks

What is better? It depends
I'm trying to perform a simple task once, and I'm not too worried about optimization: then I do something visual
I'm processing a huge amount of data/files, and/or I'm performing a series of different tasks, and/or I need optimization and reproducibility: then I use the CLI

How does it look?
Press Ctrl+Alt+T to open a terminal; the prompt looks like this:
Username@Machine:Working_directory$

If bash is a language, statements are our sentences
We have verbs (commands)
We have objects (inputs)
And adjectives, adverbs... (modifiers, variables, etc.)
We must have an idea of how to compose a statement to perform our task (experience, Google)

Each part of a statement after the command is called an argument
command argument1 argument2 argument3

How can a command understand arguments?
1) Position
2) Prefixes
Examples:
A command y understands the first argument as the input and the second as the output:
command input output
A command z understands the argument after the prefix -i as the input and the argument after the prefix -o as the output:
command -i input -o output
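The two styles can be seen with real commands. This is a minimal sketch: the file names are made up, and the scratch directory created with mktemp is only there so that nothing real is touched. cp reads its arguments by position, while head reads the number after the prefix -n.

```shell
# Work in a throwaway directory (assumption: mktemp is available)
demo=$(mktemp -d)
cd "$demo"
printf 'line1\nline2\nline3\n' > input.txt   # hypothetical input file

# Positional arguments: cp takes the source first and the destination second
cp input.txt output.txt

# Prefixed argument: the value after -n tells head how many lines to print
first_two=$(head -n 2 input.txt)
echo "$first_two"
```

Swapping the two arguments of cp changes the meaning of the statement; moving -n 2 after the file name usually does not.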

How can we know how to use a command? Documentation
command -help
command -h
command --help
man command
Try with: man ls

Our first command is ls, and it's used to list the files and folders in the working directory
Exercise: list the files, ordering them by last modification date
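Before trying the exercise, it may help to see how a prefix changes what ls prints. A small sketch with made-up file names in a scratch directory; the options -a and -l are standard, but they are not the answer to the exercise.

```shell
# Scratch directory with one normal file and one hidden file (names are hypothetical)
demo=$(mktemp -d)
cd "$demo"
touch visible.txt .hidden.txt

ls        # normally prints only visible.txt
ls -a     # also shows names starting with a dot, plus . and ..
ls -l     # long format: permissions, owner, size, date, name

# Keep the two listings in variables to compare them
plain=$(ls)
all=$(ls -a)
```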

Some tips and recommendations
1) Remember, we can't use the mouse to move the cursor
2) Tab autocompletes
3) Ctrl+C stops the running process
4) The up/down arrows scroll through past statements; you can modify them and execute them again

Some tips and recommendations: TAB
Press TAB once to autocomplete (if there is more than one possible command/file to autocomplete, TAB adds just the letters common to all possibilities)
Press TAB twice for the list of possible completions

Filesystem
Filesystem: how files are organized on our hard disk
In Linux (the OS we are using), it can be seen as a tree or a graph
Each file can be seen as a node of the graph
It has a parent node and, if it is a directory, can have one or more child nodes
We have a starting file (directory), the root directory

The path
Each file (directories are files too!) is defined by its position in the filesystem, called its path
The path is the address of the file, needed when we want to reach it
Bash will not guess or correct an address, so we must be exact when writing the path of a file

Absolute and relative path
Absolute path: complete address of the file, from the start (root directory) to the file itself
Relative path: relative address of the file, from where we are to the file
Think of phone numbers: I want to call someone within the University of Pavia, whose internal extension is 9898
I'm in the University: I dial 9898 - relative path
I'm outside the University: I dial 0382 98 9898 - relative path
I'm on Earth (root): I dial +39 0382 98 9898 - absolute path

[Figure: example directory tree containing folders 1 to 6]
Paths are directory names separated by forward slashes (Linux) or backslashes (Windows)
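The same file can be reached with either kind of path. A minimal sketch, assuming a made-up layout (a folder named root containing a folder 4 with a file toy.tsv) built inside a scratch directory; the real course folders may be arranged differently.

```shell
# Build a small hypothetical tree: <scratch>/root/4/toy.tsv
demo=$(mktemp -d)
mkdir -p "$demo/root/4"
touch "$demo/root/4/toy.tsv"
cd "$demo/root"

# Relative path: starts from the working directory (here .../root)
relative=$(ls 4/toy.tsv)

# Absolute path: starts from / and works from any working directory
absolute=$(ls "$demo/root/4/toy.tsv")

echo "$relative"
echo "$absolute"
```

The relative form stops working as soon as the working directory changes; the absolute form does not.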

Working directory
The working directory can be seen as where we are
You can type pwd to know the absolute path of the working directory

Working directory (dual cam)
[Figure: two panels, Bash and GUI]

Change working directory
It can be useful to change the working directory:
cd - change directory
With no argument, it sets my home directory as the working directory
./ means the current working directory
../ means the parent directory of the current working directory
~/ means the home directory
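A short sketch of moving around with cd and checking the result with pwd. The folder names root and 5 are placeholders created in a scratch directory, not the real course folders.

```shell
# Hypothetical tree: <scratch>/root/5
demo=$(mktemp -d)
mkdir -p "$demo/root/5"

cd "$demo/root/5"     # absolute path: works from anywhere
pwd                   # prints the absolute path of the working directory

cd ../                # go up to the parent directory (root)
parent=$(pwd)

cd ./5                # ./ is the current directory, so this is the same as: cd 5
child=$(pwd)

cd ~/                 # jump to the home directory
pwd
```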

When I use the GUI, I combine the two commands that we know:
cd - double click on a directory
ls - view inside the directory
BUT, if I know the path, I can get to any place in the filesystem with just one line, without clicking at every folder level
E.g.: cd Desktop/root (two jumps)

Standard (glob) wildcards
Can be used with bash commands to work with multiple files
? - any single character
* - any number of characters (even zero)
[1-9] - range
{1,2,3} - or
[!5] - not

Standard wildcards
Can be used with bash commands to work with multiple files
Examples:
List all files starting with gene contained in the working directory:
ls gene*
List all files starting with a digit from 2 to 5 and ending with .tsv contained in the working directory:
ls [2-5]*.tsv
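The wildcards can be tried safely on empty files. A sketch with made-up file names in a scratch directory; the comments list the matches expected for exactly these files.

```shell
# Empty test files with hypothetical names
demo=$(mktemp -d)
cd "$demo"
touch gene1.fasta gene2.fasta geneA.fasta 2_samples.tsv 5_samples.tsv 7_samples.tsv

ls gene*             # all three gene files
ls gene?.fasta       # ? stands for exactly one character: all three again
ls gene[1-9].fasta   # gene1.fasta gene2.fasta
ls gene{1,A}.fasta   # gene1.fasta geneA.fasta (bash only)
ls [!5]*.tsv         # 2_samples.tsv 7_samples.tsv

# The shell expands the pattern before the command runs,
# so echo can be used to preview what a pattern matches
range_match=$(echo gene[1-9].fasta)
not_five=$(echo [!5]*.tsv)
echo "$range_match"
echo "$not_five"
```

Previewing a pattern with echo is a cheap habit before passing it to a command that changes or deletes files.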

Exercises
a) Start from root. Go to folder 5. List the files contained
b) List the files contained in folder 4 without leaving folder 5
c) Go to folder 4. List all files ending with .fasta
[Figure: example directory tree containing folders 1 to 6]

Moving, renaming, copying
In bash there is only one command for moving and renaming files:
mv source directory
mv source newname
Copying is similar:
cp source directory
cp source newname
If source is a directory, I will also want to copy the files contained in it:
cp -r source newname/directory
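A minimal sketch of the four cases, using made-up file and folder names in a scratch directory.

```shell
# Hypothetical files and folders to play with
demo=$(mktemp -d)
cd "$demo"
mkdir results backup_src
touch reads.fasta backup_src/a.txt

mv reads.fasta sample1.fasta          # second argument is a new name: rename
mv sample1.fasta results              # second argument is an existing directory: move
cp results/sample1.fasta copy.fasta   # copy a file under a new name
cp -r backup_src backup_copy          # copy a directory together with its contents
```

Whether mv renames or moves depends only on whether the last argument is an existing directory.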

Deleting files
WARNING: when you delete a file from the command line it is deleted for good; you can't find it in the trash bin
rm - remove
rm source
If source is a directory I need to add -r:
rm -r source
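A sketch of both forms, run only on throwaway files inside a scratch directory (all names are made up).

```shell
# Throwaway file and folder
demo=$(mktemp -d)
cd "$demo"
mkdir old_results
touch tmp.txt old_results/partial.tsv

rm tmp.txt            # remove a single file
rm -r old_results     # remove a directory and everything inside it
ls                    # prints nothing: both are gone and cannot be restored
```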

Output redirection: > and >>
By adding > filename after a command, we redirect its stdout to a new file named filename (if filename already exists, it is overwritten)
By adding >> filename after a command, we redirect its stdout, appending it to a file named filename
Let's try with ls
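A sketch of the difference between the two operators, using ls on a made-up folder named data inside a scratch directory.

```shell
# Hypothetical folder with two files
demo=$(mktemp -d)
cd "$demo"
mkdir data
touch data/a.txt data/b.txt

ls data > listing.txt      # creates listing.txt with two lines
ls data >> listing.txt     # appends: the file now has four lines
after_append=$(cat listing.txt)

ls data > listing.txt      # overwrites: back to two lines
after_overwrite=$(cat listing.txt)
```

Nothing is printed on the screen by the ls commands here, because their output went into the file instead.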

Piping commands
We can also redirect the output of a command to the input of another command:
command1 | command2 | command3
| is called the pipe sign
By piping we can combine multiple commands and create complex statements
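A small sketch of two pipelines. It uses wc -l (count lines) and grep (keep lines containing a pattern), two standard commands that are not introduced in these slides; the file names are made up.

```shell
# Scratch directory with four hypothetical files
demo=$(mktemp -d)
cd "$demo"
touch gene1.fasta gene2.fasta gene3.fasta notes.txt

# Two commands: ls writes one name per line, wc -l counts the lines
n_files=$(ls | wc -l)

# Three commands: keep only the names containing "fasta", then count them
n_fasta=$(ls | grep fasta | wc -l)

echo "$n_files files, $n_fasta of them FASTA"
```

No temporary file is needed: each command reads directly what the previous one wrote.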

Text file manipulation - visualizing
Sometimes I need to explore a file without opening it in a text editor (I don't need to see the whole file, or the file is too big)
Strategies:
Reading it one screen at a time: less filename
Reading the first 10 lines: head filename
Reading the last 10 lines: tail filename
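head and tail can be tried on a file whose content is easy to check. In this sketch the file is generated with seq, a common utility assumed to be installed; numbers.txt is a made-up name.

```shell
# A 30-line test file: one number per line, from 1 to 30
demo=$(mktemp -d)
cd "$demo"
seq 1 30 > numbers.txt

head numbers.txt        # lines 1 to 10
tail numbers.txt        # lines 21 to 30
head -n 3 numbers.txt   # only the first 3 lines
tail -n 2 numbers.txt   # only the last 2 lines

first_three=$(head -n 3 numbers.txt)
last_two=$(tail -n 2 numbers.txt)
```

less is interactive (press q to quit), so it is left out of this sketch.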

Exercise:
Use a combination of head and tail to print the 27th line of file toy.tsv (in folder 4)
Hint: with the optional argument -n number (e.g. -n 5), head and tail will show that number of lines instead of 10
Use a pipe to combine the two commands
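The idea behind the hint can be sketched on a different file and a different line number, so the exercise itself stays open. values.txt is a made-up file generated with seq (assumed to be installed).

```shell
# A 12-line test file containing the numbers 101 to 112
demo=$(mktemp -d)
cd "$demo"
seq 101 112 > values.txt

# head keeps lines 1 to 4; tail then keeps only the last of those, i.e. line 4
fourth=$(head -n 4 values.txt | tail -n 1)
echo "$fourth"   # prints 104
```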

File formats, exploiting structure
Big files can be intimidating, but we can exploit the way they are organized (formatted) to quickly edit them or extract useful information
If the information is not organized, we are out of luck
If we don't know how the format works, we need to read its documentation
Try to look inside file toy.csv in folder 4 and see if you can recognize any pattern

Text file manipulation - select columns
cut
Your turn!! Use the man page to understand how it works. Look for:
- delimiter
- fields (a.k.a. columns)
Try to extract columns 1, 3 and 5 from a comma-separated values (.csv) file (toy.csv in folder 4)
Hint: by default cut uses tabs as delimiters
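After reading the man page, the two options can be checked on a tiny file. A sketch on a made-up three-line CSV (not toy.csv) and on different columns from the ones in the exercise.

```shell
# A tiny hypothetical CSV with five columns
demo=$(mktemp -d)
cd "$demo"
printf 'id,gene,start,end,strand\n1,abc,10,50,+\n2,xyz,60,90,-\n' > example.csv

# -d sets the delimiter (a comma instead of the default tab)
# -f lists the fields (columns) to keep
selected=$(cut -d ',' -f 2,4 example.csv)
echo "$selected"
```

For this file the result is three lines: gene,end then abc,50 then xyz,90.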

Text file manipulation - join files (1)
cat - concatenate
cat can be used to pass a text file to stdout:
cat filename
Its main purpose is to join two files:
cat file1 file2
What if we want to join n files? Hint: wildcards
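A sketch of the three uses of cat on made-up one-line files. The output file is given a name that the wildcard does not match, so that cat never reads the file it is writing.

```shell
# Three tiny hypothetical files
demo=$(mktemp -d)
cd "$demo"
printf 'a\n' > part1.txt
printf 'b\n' > part2.txt
printf 'c\n' > part3.txt

cat part1.txt                        # print one file to stdout
cat part1.txt part2.txt > two.txt    # join two files into a new one
cat part*.txt > merged.txt           # join every file matching the wildcard

merged=$(cat merged.txt)
echo "$merged"
```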

Text file manipulation - exercise
The file is, as always, toy.csv in folder 4
Execute the following operations and write the final result to a new file (you choose the name, no spaces), then move it to folder 2 (or create it directly in folder 2):
Get lines 3 and 55
Get columns 1, 3 and 5
Hint: create temporary files for the partial results (or not!) and delete them when you have finished (not the file containing the final result!)

Scripts
We can write our own programs and scripts
Scripts are lists of commands (in a given language) that are read and executed by an interpreter
Some examples:
python script.py argument2 argument3
Rscript script.r argument2 argument3
perl script.pl argument2 argument3
sh script.sh argument2 argument3
In these cases, the command is the name of the interpreter, while the script is the first argument
You can write scripts in bash too!
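A minimal sketch of a shell script and how it receives its arguments. The script name greet.sh and its content are made up for illustration; inside a script, $1 stands for the first argument written after the script name.

```shell
# Create a one-line script in a scratch directory
demo=$(mktemp -d)
cd "$demo"
printf 'echo "Hello, $1"\n' > greet.sh

# sh is the interpreter (the command), greet.sh is its first argument,
# and "world" is passed on to the script, where it becomes $1
greeting=$(sh greet.sh world)
echo "$greeting"   # prints: Hello, world
```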

SEE YOU NEXT TIME!