VERY SHORT INTRODUCTION TO UNIX


Tore Samuelsson, Nov 2009

An operating system (OS) is the interface between the hardware and the user. It manages and coordinates activities on the computer and shares its resources among the applications that run on it. Examples are Linux (a UNIX-like system), Windows XP and Mac OS X.

Connecting to a UNIX computer

Use an ssh client (such as PuTTY) to connect to a remote computer running a UNIX operating system. You may also encounter UNIX on a local computer. On a Mac (OS X) you can open a terminal window to get access to UNIX commands (Applications - Utilities). If you want to use UNIX on a Windows computer, Cygwin (www.cygwin.com) is recommended.

File system

Examples of directories:

    /       root of the file system
    /bin    executable binary files
    /dev    special files used to represent real physical devices
    /etc    commands and files used for system administration

    /home   contains a home directory for each user of the system
    /lib    libraries used by various programs and languages
    /tmp    a "scratch" area where any user can store files temporarily
    /usr    system files and directories that you share with other users

/home/bioinf1 is the home directory of the user bioinf1.

Moving around in the file system

(Commands that you type were shown with a light gray background in the original handout.)

    pwd             show the present working directory
    cd pathname     change directory to pathname, for instance cd /home/bioinf1
    cd ..           go up one level in the directory tree
    cd              (without argument) go to your home directory

What files do I have in my directory?

    ls
    ls -al          (more extensive output)

    -rw-r--r-- 1 tore users 383756 2007-11-25 16:59 PF02854.txt
    -rw-r--r-- 1 tore users 383269 2007-11-25 16:54 PF02854_full.txt
    drwxr-xr-x 9 tore users 656 2003-04-04 20:12 PhyloGrapher/
    -rw-r--r-- 1 tore users 4898 2006-09-12 09:12 README.txt
    -rw-r--r-- 1 tore users 801 2008-01-22 10:29 README_juan.txt
    -rw-r--r-- 1 tore users 191 2008-01-15 12:05 RESULTS.txt
    -rwxr-xr-x 1 tores users 120635 2004-08-03 01:47 dnapars*

Lines starting with 'd' refer to directories. A file marked with 'x' is executable.

Manipulating files and directories

Copying files

    cp source destination

Moving and renaming files

    mv source destination
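The navigation and file commands above can be tried safely in a throwaway directory. The following is a minimal sketch, not part of the original handout: the scratch directory made by mktemp and the names notes.txt, copy.txt, renamed.txt and project are made up for illustration, and mkdir is the directory-creation command described in the next section.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"
pwd                            # shows the scratch directory

# Create a small file and a subdirectory to look at (hypothetical names).
echo "hello" > notes.txt
mkdir project

ls -al                         # long listing; the 'project' line starts with 'd'

cp notes.txt copy.txt          # copy: notes.txt and copy.txt both exist now
mv copy.txt renamed.txt        # rename: copy.txt is gone, renamed.txt exists
mv renamed.txt project/        # move the file into the subdirectory

cd project
ls                             # lists renamed.txt
cd ..                          # back up one level
listing=$(ls project)
```

Note that mv does both jobs: with a new file name as destination it renames, with a directory as destination it moves.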

Creating and removing directories

    mkdir dirname
    rmdir dirname       (the directory must be empty)

Removing files

    rm filename

Viewing and editing files

    cat filename

shows the contents of filename on the screen. cat may also be used to merge files:

    cat file1 file2 file3 > newfile

(the symbol > means that we redirect the output of the cat program to a file instead of to standard output, which is the screen), or to append file(s) to an existing file:

    cat file4 >> newfile

Viewing a text file on the screen one page at a time

    more filename

or even better

    less filename

because less also allows you to move backwards in the file. Critical keys for less:

    space   move down one page
    enter   move down one line at a time
    q       quit
    u       go up (back)

Viewing or extracting the first and last lines of a file

    head filename
    head -n 1000 filename   (the first 1000 lines will be extracted)
    tail filename
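The difference between > and >> is easiest to see by running both. This is a small sketch under made-up assumptions: the files file1 to file4 and their one-word contents are invented for the example and live in a scratch directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Three one-line files to merge (hypothetical contents).
echo "first"  > file1
echo "second" > file2
echo "third"  > file3

# '>' creates (or overwrites) newfile with the merged content.
cat file1 file2 file3 > newfile

# '>>' appends to the existing file instead of overwriting it.
echo "fourth" > file4
cat file4 >> newfile

cat newfile                     # first, second, third, fourth (one per line)

top=$(head -n 2 newfile)        # the first two lines
bottom=$(tail -n 1 newfile)     # the last line
echo "$top"
echo "$bottom"
```

Had the second cat used > instead of >>, newfile would contain only the line "fourth".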

Graphical editors

Examples are nedit and emacs.

Extracting file components with cut

Consider the content of a file dat.txt where the columns are separated with a tab:

    1   12  1300    1306
    2   11  1500    1458
    3   17  1620    1700

We want to extract columns 1 and 3:

    cut -f1,3 dat.txt

produces:

    1   1300
    2   1500
    3   1620

We may use any separator. For instance, if dat.txt contains

    A;2;4500
    B;5;4505
    F;4;4510

then

    cut -f1,2 -d ';' dat.txt

produces

    A;2
    B;5
    F;4

Sorting files with sort

Consider the file dat2.txt with:

    A   12  1300    1306
    C   11  1500    1458
    B   17  1620    1700

    sort dat2.txt

produces

    A   12  1300    1306
    B   17  1620    1700
    C   11  1500    1458

because the file is sorted alphabetically. But we may also sort numerically, and we may sort according to a specific column:

    sort -n -k2 dat2.txt

produces

    C   11  1500    1458
    A   12  1300    1306
    B   17  1620    1700

where sorting was according to column 2. If we want the ordering in the reverse direction:

    sort -n -k2 -r dat2.txt

Unique lines

    uniq sortedfile

produces the unique lines in a file (uniq only collapses adjacent duplicate lines, so the file needs to be sorted first).

    uniq -c sortedfile

produces the same output but also lists the number of times each line occurs.

Comparing files with diff and comm

    diff sortedfile1 sortedfile2

will report differences between the two (sorted) files.

    comm -12 sortedfile1 sortedfile2

will show the lines that are shared between the two files.

Counting words with wc

    wc filename         (counts lines, words and bytes)
    wc -l filename      (counts lines only)
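The tools from the last few sections can be tried together on small files. The sketch below recreates dat2.txt from the text; species.txt, other.txt and their contents are made up for the example. Setting LC_ALL=C is an assumption added here so that sort and comm agree on one ordering regardless of the system language settings.

```shell
export LC_ALL=C                 # predictable byte-wise ordering for sort and comm
workdir=$(mktemp -d)
cd "$workdir"

# Recreate dat2.txt from the text; printf writes real tab characters.
printf 'A\t12\t1300\t1306\nC\t11\t1500\t1458\nB\t17\t1620\t1700\n' > dat2.txt

# Column 1 of the file, with rows ordered numerically by column 2.
by_col2=$(sort -n -k2 dat2.txt | cut -f1)
echo "$by_col2"                 # C, A, B (one per line)

# uniq only merges adjacent duplicates, so sort first.
printf 'human\nmouse\nhuman\nrat\nhuman\n' > species.txt
sort species.txt | uniq > unique.txt
n_unique=$(wc -l < unique.txt | tr -d ' ')
sort species.txt | uniq -c      # counts: 3 human, 1 mouse, 1 rat

# comm needs both inputs sorted; -12 keeps only the lines common to both.
printf 'human\nrat\nzebrafish\n' > other.txt    # already in sorted order
shared=$(comm -12 unique.txt other.txt)
echo "$shared"                  # human, rat
```

The `|` symbol used here passes the output of one command to the next; it is the pipe that the handout covers under "Redirection and pipes".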

Redirection and pipes

For the > and >> redirection symbols see above under "Viewing and editing files". The output of one program may be directed as input to another with the pipe symbol '|', like

    sort file | uniq | wc

The output of sort will be sent to uniq and the output of uniq will be sent to wc. The result is the number of unique lines in the file (the first of the numbers that wc reports).

Finding text strings with grep

    grep ">" sequence.fasta | wc

grep will produce all the lines with '>' in the file sequence.fasta, and the output will be directed to wc. The final output is therefore the number of lines with '>'.

    grep -n -v -i -l "AACGTA" seqfile

    -n  report line number
    -v  report lines where AACGTA does not match
    -i  ignore case, i.e. for instance "aacgta" may also match
    -l  show only the file name, not the matching text

Finding files with find

To locate files with extension '.fa':

    find . -name "*.fa"

The dot right after the find command refers to the current directory, but by default find will also search all subdirectories of that directory. You may also want to locate files with a certain content. Here is what you could do:

    find . -exec grep HIV {} \;

This command will show all lines in all files that contain the string "HIV". The -exec parameter means that the program following '-exec' will be executed on each file found by find. You may instead want to list the files that contain the string "HIV":

    find . -exec grep -l HIV {} \;
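The grep and find commands above can be checked on a few small files. This is a sketch with invented data: sequence.fa, sub/more.fa, readme.txt and the sequences in them are made up for the example. It also adds -type f to the last find command, which is not in the handout; it restricts the search to regular files so that grep is not run on directories.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny FASTA file with two records, plus one more in a subdirectory.
printf '>seq1 HIV protease\nAACGTAGG\n>seq2 unrelated\nTTGACC\n' > sequence.fa
mkdir sub
printf '>seq3 another record\nGGCC\n' > sub/more.fa
echo "not a sequence file" > readme.txt

# Count header lines: grep selects them, wc -l counts them.
n_records=$(grep ">" sequence.fa | wc -l | tr -d ' ')
echo "$n_records"

# -i ignores case, so the lowercase pattern still matches AACGTAGG.
hits=$(grep -i "aacgta" sequence.fa)
echo "$hits"

# find descends into sub/ as well; sort makes the order predictable.
fa_files=$(find . -name "*.fa" | sort)
echo "$fa_files"

# List only the names of the files that contain the string HIV.
hiv_files=$(find . -type f -exec grep -l HIV {} \;)
echo "$hiv_files"
```

Quoting ">" matters in the first grep: without the quotes the shell would treat > as a redirection and overwrite sequence.fa.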

Useful features of unix shells when typing commands

    Tab key     command line completion
    Arrows      recall previous commands
    Ctrl-E      move to end of line
    Ctrl-A      move to beginning of line

Running a program in the background

A program is run "in the background" by putting the symbol & after the program command:

    program &

Obtaining data from the net with wget

wget is a useful unix utility to retrieve a specific URL. Here is how to retrieve from the NCBI FTP site the GenBank record of Bacillus anthracis CDC 684:

    wget ftp://ftp.ncbi.nih.gov/genomes/bacteria/bacillus_anthracis_cdc_684/nc_012581.gbk

In some cases data has been compressed and archived and may have the extension "tar.gz", like this file containing a distribution of the linux blast program:

    wget ftp://ftp.ncbi.nih.gov/blast/executables/release/2.0.10/blast-2.0.10-ia32-linux.tar.gz

Then you need to uncompress it:

    gunzip blast-2.0.10-ia32-linux.tar.gz

The resulting file is blast-2.0.10-ia32-linux.tar. Then unpack the contents of the tar archive:

    tar -xvf blast-2.0.10-ia32-linux.tar
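The two-step unpacking can be rehearsed without downloading anything. The sketch below first builds a small .tar.gz file to stand in for a downloaded one and then unpacks it as described above; the names blastdemo and README and the file content are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a small archive to stand in for a downloaded file.
mkdir blastdemo
echo "pretend this is a program" > blastdemo/README
tar -cf blastdemo.tar blastdemo     # -c creates an archive
gzip blastdemo.tar                  # replaces it with blastdemo.tar.gz
rm -r blastdemo                     # remove the original directory

# Step 1: uncompress. gunzip replaces the .gz file with the .tar file.
gunzip blastdemo.tar.gz

# Step 2: unpack. -x extracts, -v lists each file, -f names the archive.
tar -xvf blastdemo.tar

content=$(cat blastdemo/README)
echo "$content"
```

GNU tar and BSD tar can also do both steps at once with tar -xzf on the .tar.gz file, which leaves the compressed archive in place.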

A selection of bioinformatics software run in the unix environment

1. sixpack

The EMBOSS program sixpack will produce translation products of an input DNA sequence.

    sixpack seqfile

The output will be two files, with extensions 'fasta' and 'sixpack', respectively.

2. clustalw

ClustalW is a program for multiple sequence alignments. The input is typically a collection of sequences in FASTA format.

    clustalw sequencefile

Two different output files are produced. One has the extension 'aln' and is the actual alignment. The other file, with extension 'dnd', contains information on the tree that was used in the construction of the alignment. The tree may be viewed with programs such as njplot.

3. blastall

BLAST is a frequently used program to search databases for sequence similarity to a query sequence. A typical command line for a BLAST search using the NCBI version is:

    blastall -i input -d database -o output_file -p type_of_blast_search

    -i  input file in FASTA format
    -d  database
    -o  name of output file
    -p  type of blast search; the most common options are:
            blastn    search a nucleotide database with a nucleotide query
            blastp    search a protein database with a protein query
            tblastn   search a nucleotide database with a protein query
            blastx    search a protein database with a nucleotide query

Other useful parameters are:

    -v    number of hits to show in the summary table
    -b    number of hits to show as alignments
    -F F  turn filtering off; by default low-complexity sequences are filtered out in the search

4. blastpgp

blastpgp is the name of the executable for PSI-BLAST in the unix environment.

    blastpgp -i input -d database -j number

where -j is the parameter specifying how many rounds (iterations) will be carried out.

VERY SHORT INTRODUCTION TO PERL

First example program:

    #!/usr/bin/perl
    $seq = 'gcgagggtcacgagcgagtcggtgtcaagt';
    $target = 'gtca';
    for ($a = 0; $a < length($seq); $a++) {
        $extract = substr($seq, $a, 4);
        if ($extract eq $target) {
            print "Found match of $target at $a\n";
        }
    }

Executing perl programs

    perl program.pl

or

    program.pl

(or ./program.pl if the current directory is not in your PATH). The second alternative assumes that there is a line

    #!/usr/bin/perl

at the head of the program file and that the file has been made executable. This is done with:

    chmod +x program.pl
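The first line of the file (the "#!" line) and chmod +x work the same way for any interpreter. The sketch below demonstrates the two ways of running a script with a two-line /bin/sh script instead of a perl program, purely so that it runs even where perl is not installed; hello.sh and its content are made up for the example. For a perl script only the first line and the file name would differ.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write a two-line script whose first line names its interpreter.
printf '#!/bin/sh\necho "hello from a script"\n' > hello.sh

# Alternative 1: hand the file to the interpreter explicitly.
via_interpreter=$(sh hello.sh)

# Alternative 2: make the file executable and run it directly.
# ./ is needed because the current directory is usually not in PATH.
chmod +x hello.sh
direct=$(./hello.sh)

echo "$via_interpreter"
echo "$direct"
```

Both alternatives print the same text; the second one only works after chmod +x has been run.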

Variables

Scalars

    $dna = 'gctatatat';
    $pi = 3.14;

Extracting a portion of a string:

    $part = substr($dna, 2, 3);    # take 3 characters starting from position 2
    print $part;                   # will print "tat"

Arrays

    @dna = ('A', 'G', 'C', 'T');
    @numbers = (3, 6, 12, 13, 16);

Elements in arrays are referred to with numbers 0, 1, 2 etc.

    print "Third element in array dna is $dna[2]";    # the element printed is "C"

Hashes

    %id2description = (
        AC988823 => "protein kinase",
        BC887682 => "dehydrogenase",
        NX772123 => "hexokinase",
    );

The values to the left in the above hash are referred to as keys, the values to the right as values. Obtain the value for the key AC988823:

    print "$id2description{AC988823}";    # "protein kinase" will be printed

Special variables

$_ is the default variable in perl, a kind of shorthand. Examples:

    while (<IN>) {...    is the same as    while ($_ = <IN>) {...
    /^Subject:/          is the same as    $_ =~ /^Subject:/
    tr/a/a/              is the same as    $_ =~ tr/a/a/
    print                is the same as    print $_

Conditional statements

if/else:

    if ($a < 10) {
        print "less than 10";
    }
    else {
        print "10 or larger";
    }

Loops

    for ($a = 0; $a < 4; $a++) {
        print "$a ";
    }
    # will print "0 1 2 3 "

    @dna = ('A', 'G', 'C', 'T');
    foreach $letter (@dna) {
        print "$letter ";
    }
    # will print "A G C T "

Operators

    + - * /     addition, subtraction, multiplication, division
    || &&       logical operators OR, AND
    .           concatenation of two strings
    == !=       numeric equality, inequality
    eq ne       string equality, inequality
    < >         numeric less than, greater than
    <= >=       numeric less (greater) than or equal to

Pattern matching

    $seq = 'GGACGGACTG';
    if ($seq =~ /ACG/) {    # is there a match of ACG to the string in $seq?
        print "match";
    }

=~ is the "binding operator"; / / is the regular expression delimiter.

Examples of regular expressions:

    /A.G/       (. is any character)
    /^GG/       (matches GG at beginning of string)
    /G$/        (matches G at end of string)
    /G{1,2}/    matches a series of Gs occurring in the range 1-2, i.e. 'G' or 'GG'
    /G+/        matches one or more Gs
    /G*/        matches zero, one or more Gs
    /[^AGCT]/   matches any character which is not A, G, C or T

Capturing matched patterns

If you place parentheses around a pattern the matched string will be stored in memory. The contents may be recalled using $1 ($2 for a second set of parentheses etc.).

    $seq = 'GGACGGACTG';
    $seq =~ /(G.A)/;
    print "The matched string is $1";    # $1 holds the matched text, here GGA

Substitution and transliteration operators

    $dna = 'GCAATGG';
    print "The DNA sequence is $dna\n";
    $rna = $dna;
    $rna =~ s/T/U/g;
    print "and the RNA sequence is $rna\n";

s is the substitution operator; g is the global modifier (replace all Ts with Us). The pattern is case sensitive, so it must be written in upper case to match the upper-case sequence.

Produce the reverse complement of a DNA sequence:

    $dna = 'GCAATGG';
    $rev = reverse($dna);
    $rev =~ tr/ATCG/TAGC/;
    print "$rev\n";

Counting characters in a string

    $dna = 'GCAATGG';
    $count_c = 0;
    $count_g = 0;
    $count_a = 0;
    $count_t = 0;
    @dna = split('', $dna);
    foreach $base (@dna) {
        if ($base eq 'A') {$count_a++;}
        if ($base eq 'T') {$count_t++;}
        if ($base eq 'C') {$count_c++;}
        if ($base eq 'G') {$count_g++;}
    }
    print "bases A, T, C, G are $count_a $count_t $count_c $count_g\n";

As an alternative to the @dna method:

    for ($a = 0; $a < length($dna); $a++) {
        $base = substr($dna, $a, 1);
        if ($base eq 'A') {$count_a++;}
        if ($base eq 'T') {$count_t++;}
        if ($base eq 'C') {$count_c++;}
        if ($base eq 'G') {$count_g++;}
    }

Executing a program from within a perl script

    $id = 'brca1_human';
    system("fastacmd -s $id -d /dbs/nr > protein.fa");

The fastacmd command will retrieve a sequence from the protein sequence database nr and the resulting sequence will be stored in the file 'protein.fa'.

Reading from a file

<FILEHANDLE> in scalar context reads a single line from the file opened by FILEHANDLE:

    #!/usr/bin/perl
    $seq = '';
    open IN, 'seq.fa';
    while (<IN>) {              # we are reading one line at a time
        unless (/>/) {          # we will disregard all lines that contain '>'
            chomp;              # remove the end of line character
            # add the line to the sequence already stored in $seq
            $seq = $seq . $_;
        }
    }
    close IN;

In array context, <FILEHANDLE> reads the whole file:

    open IN, 'seq.fa';
    @dna = <IN>;
    close IN;
    print "@dna";

Writing to a file

    open OUT, ">outputfile";
    print OUT "GGCTACTGAC \n";
    close OUT;

Or the same thing could be accomplished like this. If the script "line.pl" contains this single line:

    print "GGCTACTGAC";

the following command:

    perl line.pl > outputfile

will now save the text "GGCTACTGAC" in a file called outputfile.

Using arguments with a perl script

It is often convenient to supply information to a perl script on the command line. Let's say we have a perl script "file.pl" with the following content:

    open IN, $ARGV[0] or die "There is no file with that name\n";
    print "This is the content of file $ARGV[0]:\n";
    while (<IN>) {
        print;
    }
    close IN;

We assume here that some file is to be read by the perl script and that the name of that file is an argument on the command line, like:

    perl file.pl notes.txt

The array @ARGV contains all the words listed on the command line after the name of the perl script. The variable $ARGV[0] holds the first element of this array, i.e. in this case "notes.txt".

Using perl in bioinformatics: Reformatting files with perl

A common task in bioinformatics is to reformat data in a file. Consider for instance this example, where the species name on the header line of a FASTA-formatted GenBank record will end up as the identifier.

    #!/usr/bin/perl
    # try it with the input file brca1
    $in = $ARGV[0];
    $c = 0;
    open IN, $in;
    while (<IN>) {
        if (/>.*\[(.*)\]/) {    # text within brackets [] is captured
            $org = $1;
            $c++;
            print ">$c";
            print "_";
            $org =~ s/ /_/g;    # replace spaces in the species name with underscores
            print "$org\n";
        }
        else {print;}           # sequence lines are printed unchanged
    }
    close IN;

A line

    >gi|2695691|gb|AAC36493.1| BRCA1 [Rattus norvegicus]

will change into

    >1_Rattus_norvegicus

A number is added to the species name in order to avoid multiple identical identifiers.

Using perl in bioinformatics: Parsing the output of BLAST with a perl script

    #!/usr/bin/perl
    use Bio::SearchIO;
    $in = new Bio::SearchIO(-format => 'blast', -file => 'human.blastx');
    while ($result = $in->next_result) {
        ## $result is a Bio::Search::Result::ResultI compliant object
        while ($hit = $result->next_hit) {
            ## $hit is a Bio::Search::Hit::HitI compliant object
            while ($hsp = $hit->next_hsp) {
                ## $hsp is a Bio::Search::HSP::HSPI compliant object
                if ($hsp->length('total') > 30) {
                    if ($hsp->percent_identity >= 75) {
                        print "Query=", $result->query_name,
                              " Hit=", $hit->name,
                              " Length=", $hsp->length('total'),
                              " Percent_id=", $hsp->percent_identity, "\n";
                    }
                }
            }
        }
    }

A BLAST report is in the file 'human.blastx'. "Result" is the result of a blast search with a specific query sequence. "Hit" refers to a database sequence. "Hsp" refers, as expected, to an HSP (high scoring pair).