VERY SHORT INTRODUCTION TO UNIX

Tore Samuelsson, Nov 2009.

An operating system (OS) is the interface between the hardware and the user. It manages and coordinates activities on the computer and shares the machine's resources among the applications that run on it. Examples are Linux (a UNIX-like system), Windows XP and Mac OS X.

Connecting to a UNIX computer

Use an ssh client (like PuTTY) to connect to a remote computer with a UNIX operating system. You may also encounter UNIX on a local computer. If you have a Mac (OS X) you can open a terminal window to get access to UNIX commands (Applications, then Utilities). If you want to use UNIX on a Windows computer, Cygwin (www.cygwin.com) is recommended.

File system

Examples of directories:

    /               root of file system
    /bin            executable binary files
    /dev            special files used to represent real physical devices
    /etc            commands and files used for system administration
    /home           contains a home directory for each user of the system
    /lib            libraries used by various programs and languages
    /tmp            a "scratch" area where any user can store files temporarily
    /usr            system files and directories that you share with other users
    /home/bioinf1   home directory of user bioinf1

Moving around in the file system

(Commands that you type are shown below on their own indented lines.)

    pwd             show the present working directory
    cd pathname     change directory: move to a specific directory,
                    for instance cd /home/bioinf1
    cd ..           go up one level in the directory tree
    cd              (without argument) go to your home directory

What files do I have in my directory?

    ls
    ls -al          (more extensive output)

    -rw-r--r-- 1 tore  users 383756 2007-11-25 16:59 PF02854.txt
    -rw-r--r-- 1 tore  users 383269 2007-11-25 16:54 PF02854_full.txt
    drwxr-xr-x 9 tore  users    656 2003-04-04 20:12 PhyloGrapher/
    -rw-r--r-- 1 tore  users   4898 2006-09-12 09:12 README.txt
    -rw-r--r-- 1 tore  users    801 2008-01-22 10:29 README_juan.txt
    -rw-r--r-- 1 tore  users    191 2008-01-15 12:05 RESULTS.txt
    -rwxr-xr-x 1 tores users 120635 2004-08-03 01:47 dnapars*

Lines starting with 'd' refer to directories. A file with 'x' in its permissions is executable.

Manipulating files and directories

Copying files

    cp source destination

Moving and renaming files

    mv source destination
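The commands above can be tried safely in a scratch directory. The following is a minimal sketch, not part of the original handout: the use of mktemp -d to create the directory, the echo command and the file names notes.txt, copy.txt and renamed.txt are all assumptions made for illustration.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
# mktemp -d is an assumption here; any empty directory would do.
workdir=$(mktemp -d)
cd "$workdir"
pwd                               # shows the directory we are now in

echo "example text" > notes.txt   # create a small file to work with
cp notes.txt copy.txt             # copy: notes.txt and copy.txt both exist
mv copy.txt renamed.txt           # rename: copy.txt becomes renamed.txt
ls -al                            # long listing with notes.txt and renamed.txt

cd ..                             # go up one level in the directory tree
pwd
```

After the mv step only notes.txt and renamed.txt remain, since mv renames a file rather than duplicating it.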

Creating and removing directories

    mkdir dirname
    rmdir dirname

Removing files

    rm filename

Viewing and editing files

    cat filename

will show the contents of filename on the screen. cat may also be used to merge files:

    cat file1 file2 file3 > newfile

(the symbol > means that we redirect the output of the cat program to a file instead of to the standard output, which is the screen), or to append file(s) to an existing file:

    cat file4 >> newfile

Viewing a text file on the screen one page at a time

    more filename

or even better

    less filename

because less also allows you to move backwards in the file. Critical keys for less:

    space   move down one page
    enter   move down one line at a time
    u       move up (back)
    q       quit

Viewing or extracting the first and last lines of a file

    head filename           (the first 10 lines by default)
    head -1000 filename     (the first 1000 lines will be extracted)
    tail filename
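A short sketch of merging, appending and then inspecting a file with head and tail. The scratch directory, the printf commands used to create the input files and the file contents are assumptions for illustration. The form head -n 2 is the more portable spelling of head -2.

```shell
# Scratch directory; mktemp -d is an assumption, any empty directory works.
workdir=$(mktemp -d)
cd "$workdir"

# Three small input files (hypothetical contents).
printf 'line1\nline2\n' > file1
printf 'line3\n' > file2
printf 'line4\n' > file4

cat file1 file2 > newfile     # merge: newfile now holds line1, line2, line3
cat file4 >> newfile          # append: line4 is added at the end

head -n 2 newfile             # prints the first two lines
tail -n 1 newfile             # prints the last line

first=$(head -n 1 newfile)    # keep the first line in a shell variable
last=$(tail -n 1 newfile)     # keep the last line in a shell variable
echo "first=$first last=$last"
```

Using > a second time instead of >> would have overwritten newfile rather than extending it.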

Graphical editors

Examples are nedit and emacs.

Extracting file components with cut

Let's consider the content of a file dat.txt where the columns are separated with a tab:

    1    12    1300    1306
    2    11    1500    1458
    3    17    1620    1700

We want to extract columns 1 and 3:

    cut -f1,3 dat.txt

produces:

    1    1300
    2    1500
    3    1620

We may use any separator. For example, if dat.txt contains

    A;2;4500
    B;5;4505
    F;4;4510

then

    cut -f1,2 -d ';' dat.txt

produces

    A;2
    B;5
    F;4

Sorting files with sort

Consider the file dat2.txt with:

    A    12    1300    1306
    C    11    1500    1458
    B    17    1620    1700

    sort dat2.txt

produces

    A    12    1300    1306
    B    17    1620    1700
    C    11    1500    1458

because the file is sorted alphabetically. But we may also sort numerically, and we may sort according to a specific column:

    sort -n -k2 dat2.txt

produces

    C    11    1500    1458
    A    12    1300    1306
    B    17    1620    1700

where sorting was according to column 2. (Strictly, -k2 sorts on the text from column 2 to the end of the line; use -k2,2 to restrict the key to column 2 alone.) If we want the ordering in the reverse direction:

    sort -n -k2 -r dat2.txt

Unique lines

    uniq sortedfile

produces the unique lines in a file. For this to work well the lines in the file need to be sorted, because uniq only removes repeated lines that are adjacent.

    uniq -c sortedfile

produces the same output but also lists the number of times each line occurs.

Comparing files with diff and comm

    diff sortedfile1 sortedfile2

will report differences between the two (sorted) files.

    comm -12 sortedfile1 sortedfile2

will show lines that are shared between the two files.

Counting words with wc

    wc filename       (counts lines, words and number of bytes)
    wc -l filename    (counts lines only)
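The following sketch combines sort, cut, uniq and comm on small files. It recreates dat2.txt from the text; the scratch directory and the two word lists (list1.txt, list2.txt) are hypothetical and only there to make the example self-contained.

```shell
# Scratch directory; mktemp -d is an assumption, any empty directory works.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate dat2.txt from the text with tab-separated columns.
printf 'A\t12\t1300\t1306\nC\t11\t1500\t1458\nB\t17\t1620\t1700\n' > dat2.txt

# Sort numerically on column 2, then keep only column 1 of the sorted file.
sort -n -k2 dat2.txt > by_col2.txt
cut -f1 by_col2.txt > order.txt
cat order.txt                     # the labels in column-2 order: C, A, B

# Two small, already sorted word lists (hypothetical contents).
printf 'apple\nbanana\nbanana\ncherry\n' > list1.txt
printf 'banana\ncherry\ndate\n' > list2.txt

uniq -c list1.txt                 # banana is reported with a count of 2
uniq list1.txt > unique1.txt      # apple, banana, cherry
comm -12 unique1.txt list2.txt > shared.txt
cat shared.txt                    # banana and cherry occur in both lists
wc -l shared.txt                  # 2 lines
```

Intermediate results are written to files with > here so that each command can be inspected on its own.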

Redirection and pipes

For the > and >> redirection symbols see above under "Viewing and editing files".

The output of one program may be directed as input to another with the pipe symbol '|', like

    sort file | uniq | wc

The output of sort will be sent to uniq and the output of uniq will be sent to wc. The result is a count of the unique lines in the file (use wc -l to print only the number of lines).

Finding text strings with grep

    grep ">" sequence.fasta | wc

grep will produce all the lines with '>' in the file sequence.fasta, and that output will be directed to wc. The final output is therefore the number of lines with '>'.

    grep -n -v -i -l "AACGTA" seqfile

The options, which may be used separately or combined, are:

    -n   report line number
    -v   report lines where AACGTA does not match
    -i   ignore case, i.e. for instance "aacgta" may also match
    -l   show only the file name, not the matching text

Finding files with find

To locate files with extension '.fa':

    find . -name "*.fa"

The dot right after the find command refers to the current directory, but by default find will also search all subdirectories of that directory.

You may also want to locate files with a certain content. Here is what you could do:

    find . -exec grep HIV {} \;

This command will show all lines in all files that contain the string "HIV". The -exec parameter means that the program following '-exec' will be executed on each file found by find; {} stands for the name of that file and \; marks the end of the command. You may instead want to list the files that contain the string "HIV":

    find . -exec grep -l HIV {} \;
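A small, self-contained sketch of pipes, grep and find. The scratch directory, the file names and the sequence contents are made up for illustration. The -type f test is an addition not used in the text; it restricts find to regular files so that grep is not run on directories.

```shell
# Scratch directory; mktemp -d is an assumption, any empty directory works.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical test data: two fasta files (one in a subdirectory) and a list.
printf '>seq1 HIV-1 isolate\nAACGTA\n>seq2 other virus\nGGCCTT\n' > sequence.fa
mkdir sub
printf '>seq3 HIV-2 isolate\nTTGACA\n' > sub/more.fa
printf 'b\na\nb\nc\n' > letters.txt

# Pipe: sort, drop repeated lines, count what is left (a, b and c).
unique_lines=$(sort letters.txt | uniq | wc -l)

# Count the header lines (those containing '>') in sequence.fa.
headers=$(grep ">" sequence.fa | wc -l)

# List all files ending in .fa, including those in subdirectories.
find . -name "*.fa"

# List the files that contain the string HIV; sort makes the order stable.
hiv_files=$(find . -type f -exec grep -l HIV {} \; | sort)

echo "unique lines: $unique_lines"
echo "header lines: $headers"
echo "$hiv_files"
```

Both fasta files contain "HIV" and letters.txt does not, so only the two .fa paths are listed at the end.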

Useful features of unix shells when typing commands

    Tab key   command line completion
    Arrows    recall previous commands
    Ctrl-E    move to end of line
    Ctrl-A    move to beginning of line

Running a program in the background

A program is run "in the background" by putting the symbol & after the program command:

    program &

Obtaining data from the net with wget

wget is a useful unix utility to retrieve a specific URL. Here is how to retrieve the Genbank record of Bacillus anthracis CDC 684 from the NCBI FTP site:

    wget ftp://ftp.ncbi.nih.gov/genomes/bacteria/bacillus_anthracis_cdc_684/nc_012581.gbk

In some cases data has been compressed and archived and may have the extension "tar.gz", like this file containing a distribution of the linux blast program:

    wget ftp://ftp.ncbi.nih.gov/blast/executables/release/2.0.10/blast-2.0.10-ia32-linux.tar.gz

Then you need to uncompress:

    gunzip blast-2.0.10-ia32-linux.tar.gz

The resulting file is blast-2.0.10-ia32-linux.tar. Then unpack the contents of the tar archive:

    tar -xvf blast-2.0.10-ia32-linux.tar
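The uncompress and unpack steps can be practised without downloading anything. This sketch first builds a small tar.gz archive locally and then takes it apart again. The commands tar -cf, gzip and rm -r are not covered in the text and are used here only to create the test archive; the directory and file names are hypothetical.

```shell
# Scratch directory; mktemp -d is an assumption, any empty directory works.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small archive to practise on.
mkdir demo
echo "hello" > demo/readme.txt
tar -cf demo.tar demo         # -c creates an archive from the demo directory
gzip demo.tar                 # compresses it; the file is now demo.tar.gz
rm -r demo                    # remove the original directory and its contents

# The two steps described in the text.
gunzip demo.tar.gz            # uncompress; the file is now demo.tar
tar -xvf demo.tar             # unpack; demo/readme.txt is recreated
cat demo/readme.txt
```

gunzip replaces demo.tar.gz with demo.tar, while tar -xvf leaves demo.tar in place next to the unpacked directory.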

A selection of bioinformatics software run in the unix environment

1. sixpack

The EMBOSS program sixpack will produce translation products of an input DNA sequence.

    sixpack seqfile

The output will be two files, with extensions fasta and sixpack, respectively.

2. clustalw

ClustalW is a program for multiple sequence alignments. The input is typically a collection of sequences in fasta format.

    clustalw sequencefile

Two different output files are produced. One has the extension 'aln' and is the actual alignment. The other file, with extension 'dnd', holds information on the tree that was used in the construction of the alignment. The tree may be viewed with programs such as njplot.

3. blastall

BLAST is a frequently used program to search databases for sequence similarity to a query sequence. A typical command line for a BLAST search using the NCBI version is:

    blastall -i input -d database -o output_file -p type_of_blast_search

    -i   input file in FASTA format
    -d   database
    -o   name of output file
    -p   type of blast search; the most common options are:
           blastn    search a nucleotide database with a nucleotide query
           blastp    search a protein database with a protein query
           tblastn   search a nucleotide database with a protein query
           blastx    search a protein database with a nucleotide query

Other useful parameters are:

    -v     number of hits to show in summary table
    -b     number of hits to show as alignments
    -F F   turn filtering off. By default low complexity sequences will be filtered out in the search.

4. blastpgp

blastpgp is the name of the executable for psi-blast in the unix environment.

    blastpgp -i input -d database -j number

where -j is the parameter specifying how many rounds (iterations) will be carried out.

VERY SHORT INTRODUCTION TO PERL

First example program:

    #!/usr/bin/perl
    $seq = 'gcgagggtcacgagcgagtcggtgtcaagt';
    $target = 'gtca';
    # slide a window of 4 characters along the sequence
    for ($a=0; $a < length($seq); $a++) {
        $extract = substr($seq,$a,4);
        if ($extract eq $target) {
            print "Found match of $target at $a\n";
        }
    }

Executing perl programs

    perl program.pl

or

    ./program.pl

The second alternative assumes that there is a line

    #!/usr/bin/perl

at the head of the program file and that the file has been made executable. This is done with:

    chmod +x program.pl

(The leading ./ tells the shell to look for the program in the current directory; plain program.pl only works if that directory is in your search path.)

Variables

Scalars

    $dna = 'gctatatat';
    $pi = 3.14;

Extracting a portion of a string:

    $part = substr($dna,2,3);   # take 3 characters starting from
                                # position 2 (positions count from 0)
    print $part;

(will print "tat")

Arrays

    @dna = ('A','G','C','T');
    @numbers = (3,6,12,13,16);

Elements in arrays are referred to with numbers 0,1,2 etc.

    print "Third element in array dna is $dna[2]";

(will print "Third element in array dna is C")

Hashes

    %id2description = (
        AC988823 => "protein kinase",
        BC887682 => "dehydrogenase",
        NX772123 => "hexokinase"
    );

Items to the left in the above hash are referred to as keys, items to the right as values. To obtain the value for the key AC988823:

    print "$id2description{AC988823}";

(protein kinase will be printed). Keys are case sensitive.

Special variables

$_ is the default variable in perl, a kind of shorthand. Examples:

    while (<IN>) {...    is the same as    while ($_ = <IN>) {...
    /^Subject:/          is the same as    $_ =~ /^Subject:/
    tr/a/a/              is the same as    $_ =~ tr/a/a/
    print                is the same as    print $_

Conditional statements

if/else:

    if ($a < 10) {print "less than 10";}
    else {print "10 or larger";}

Loops

    for ($a=0; $a < 4; $a++) {
        print "$a ";
    }

(will print "0 1 2 3 ")

    @dna = ('A','G','C','T');
    foreach $letter (@dna) {
        print "$letter ";
    }

(will print "A G C T ")

Operators

    +  -  *  /    addition, subtraction, multiplication, division
    ||  &&        logical operators OR, AND
    .             concatenation of two strings
    ==  !=        numeric equality, inequality
    eq  ne        string equality, inequality
    <  >          numeric less than, greater than
    <=  >=        numeric less (greater) than or equal to

Pattern matching

    $seq = 'GGACGGACTG';
    if ($seq =~ /ACG/) {print "match";}
    # is there a match of ACG to the string in $seq?

=~ is the "binding operator".
/ / is the regular expression delimiter.

Examples of regular expressions:

    /A.G/        . is any character
    /^GG/        matches GG at beginning of string
    /G$/         matches G at end of string
    /G{1,2}/     matches a series of Gs occurring in the range 1-2, i.e. 'G' or 'GG'
    /G+/         matches one or more Gs
    /G*/         matches zero, one or more Gs
    /[^AGCT]/    matches any character which is not A, G, C or T

Capturing matched patterns

If you place parentheses around a pattern the matched string will be stored in memory. The contents may be recalled using $1 ($2 for a second set of parentheses etc).

    $seq = 'GGACGGACTG';
    $seq =~ /(G.A)/;
    print "The matched string is $1";

(will print "The matched string is GGA"; note that $1 holds the matched text, not its position)

Substitution and transliteration operators

    $dna = 'GCAATGG';
    print "The DNA sequence is $dna\n";
    $rna = $dna;
    $rna =~ s/T/U/g;
    print "and the RNA sequence is $rna\n";

s is the substitution operator.
g is the global modifier: replace all Ts with Us. Matching is case sensitive, so the pattern must use the same case as the sequence.

Produce the reverse complement of a DNA sequence:

    $dna = 'GCAATGG';
    $rev = reverse($dna);      # reverse the string
    $rev =~ tr/ATCG/TAGC/;     # replace each base with its complement
    print "$rev\n";

(will print "CCATTGC")

Counting characters in a string

    $dna = 'GCAATGG';
    $count_c = 0; $count_g = 0; $count_a = 0; $count_t = 0;
    @dna = split('',$dna);     # split the string into single characters
    foreach $base (@dna) {
        if ($base eq 'A') {$count_a++;}
        if ($base eq 'T') {$count_t++;}
        if ($base eq 'C') {$count_c++;}
        if ($base eq 'G') {$count_g++;}
    }
    print "bases a, t, c, g are $count_a $count_t $count_c $count_g\n";

The comparisons are case sensitive, so the letters tested must have the same case as the sequence.

As an alternative to the @dna method:

    for ($a=0; $a < length($dna); $a++) {
        $base = substr($dna,$a,1);
        if ($base eq 'A') {$count_a++;}
        if ($base eq 'T') {$count_t++;}
        if ($base eq 'C') {$count_c++;}
        if ($base eq 'G') {$count_g++;}
    }

Executing a program from within a perl script

    $id = 'brca1_human';
    system ("fastacmd -s $id -d /dbs/nr > protein.fa");

The fastacmd command will retrieve a sequence from the protein sequence database nr and the resulting sequence will be stored in the file 'protein.fa'.

Reading from a file

<FILEHANDLE> in scalar context reads a single line from the file opened by FILEHANDLE:

    #!/usr/bin/perl
    $seq = '';
    open IN, 'seq.fa';
    while (<IN>) {          # we are reading one line at a time
        unless (/>/) {      # we will disregard all lines that contain '>'
            chomp;          # remove the end of line character
            # add the line to the sequence already stored in $seq
            $seq = $seq . $_;
        }
    }
    close IN;

In array context, <FILEHANDLE> reads the whole file:

    open IN, 'seq.fa';
    @dna = <IN>;
    close IN;
    print "@dna";

Writing to a file

    open OUT, ">outputfile";
    print OUT "GGCTACTGAC \n";
    close OUT;

Or the same thing could be accomplished like this. If the script "line.pl" contains this single line:

    print "GGCTACTGAC";

the following command:

    perl line.pl > outputfile

will save the text "GGCTACTGAC" in a file called outputfile.

Using arguments with a perl script

It is often convenient to supply information to a perl script on the command line. Let's say we have a perl script "file.pl" with the following content:

    open IN, $ARGV[0] or die "There is no file with that name\n";
    print "This is the content of file $ARGV[0]:\n";
    while (<IN>) {
        print;
    }
    close IN;

We assume here that some file is to be read by the perl script, and that the name of that file is an argument on the command line, like:

    perl file.pl notes.txt

The array @ARGV contains all the words listed on the command line after the name of the perl script. The variable $ARGV[0] holds the first element of this array, i.e. in this case "notes.txt".

Using perl in bioinformatics: Reformatting files with perl

A common task in bioinformatics is to reformat data in a file. Consider for instance this example, where the species name on the header line of a fasta-formatted Genbank record will end up as the identifier.

    #!/usr/bin/perl
    # try it with the input file brca1
    $in = $ARGV[0];
    $c = 0;
    open IN, $in;
    while (<IN>) {
        if (/>.*\[(.*)\]/) {    # text within brackets [] is captured
            $org = $1;
            $c++;
            print ">$c";
            print "_";
            $org =~ s/ /_/g;    # replace spaces in the species name
            print "$org\n";
        }
        else {print;}           # all other lines are printed unchanged
    }
    close IN;

A line

    >gi|2695691|gb|AAC36493.1| BRCA1 [Rattus norvegicus]

will change into

    >1_Rattus_norvegicus

A number is added to the species name in order to avoid multiple identical identifiers.

Using perl in bioinformatics: Parsing the output of BLAST with a perl script

    #!/usr/bin/perl
    use Bio::SearchIO;
    $in = Bio::SearchIO->new(-format => 'blast',
                             -file   => 'human.blastx');
    while ( $result = $in->next_result ) {
        ## $result is a Bio::Search::Result::ResultI compliant object
        while ( $hit = $result->next_hit ) {
            ## $hit is a Bio::Search::Hit::HitI compliant object
            while ( $hsp = $hit->next_hsp ) {
                ## $hsp is a Bio::Search::HSP::HSPI compliant object
                if ( $hsp->length('total') > 30 ) {
                    if ( $hsp->percent_identity >= 75 ) {
                        print "Query=", $result->query_name,
                              " Hit=", $hit->name,
                              " Length=", $hsp->length('total'),
                              " Percent_id=", $hsp->percent_identity, "\n";
                    }
                }
            }
        }
    }

A BLAST report is in the file 'human.blastx'. "Result" is the result of a blast search with a specific query sequence. "Hit" refers to a database sequence. "Hsp" refers, as expected, to an HSP (high scoring pair). The script prints one line for every HSP that is longer than 30 positions and has at least 75% identity.