Introducing the LINUX Operating System
BecA-ILRI INTRODUCTION TO BIOINFORMATICS
Mark Wamalwa, BecA-ILRI Hub, Nairobi, Kenya
http://hub.africabiosciences.org/
http://www.ilri.org/
m.wamalwa@cgiar.org

What is UNIX?
A family of operating systems: IRIX, SOLARIS, AIX, LINUX, Digital UNIX, HP-UX.
- Multitasking: runs more than one program at the same time. A busy system can be running several hundred or even thousands of programs at the same time.
- Multiuser: many different people can use the system at the same time.
- Networked: it is designed to be linked to other computers and to allow people to work over a network. The network IS the computer....

What is LINUX?
- A freely available clone of the UNIX operating system for personal computers, started by Linus Torvalds
- Linux and Unix are time-sharing operating systems: they allow multiple users to use the system simultaneously
- Unix was developed in 1969 at Bell Labs
- Linux is similar to Unix in some aspects

What does UNIX do?
Components shown: the computer (disk storage, memory, network adapter, modem, screen, keyboard), the UNIX kernel, the shell, X and X programs, and the console (unix> help users).

Kernel: controls access to the hardware.
- Allows any number of users to use any number of programs at a time.
- Prevents programs from actively interfering with each other.
- Provides many different ways for programmers to talk to the electronics.
- Controls data storage and protection.

Shell (or command line): the user interaction program.
- Allows the user to interact directly with the computer by typing commands.
- The shell interprets these and instructs the kernel accordingly.
- Very powerful but can be intimidating.

X: the pointy, clicky Window System.
- Graphical interface (point, click, drag, drop etc.).
- Is a separate program, typically run from the shell.
- Easier to use than the shell but less powerful.

Users:
- Many different users can use the system at the same time, from any number of remote machines.
- Network enabled: an easy way to access the system from remote machines, using any number of methods.
- Each user can use many programs at once.

Logging in
- Log in from anywhere you have permission.
- Have graphical output sent anywhere you have permission.
- You must have a username (login id) to use a unix/linux system. This identifies you to the system so it can manage your work properly.
- Every user is a member of one or more groups of users. This helps the system manage different types of user properly.

Logging in
Connect to the linux machine using:
- PuTTY
- WinSCP: open source SFTP (SSH File Transfer Protocol) and SCP (Secure CoPy) client for Windows using SSH (Secure SHell)
- Xterm
- Telnet
- Secure Shell
- Kermit
- Other terminal emulators

    Connecting to http://hpc.ilri.cgiar.org
    Connected.
    Welcome to Genotyping by Sequencing (GBS) workshop
    Login: username
    Password:
    The system will be unavailable during Ramadhan.
    You have new mail.
    username@hpc~>

- unix/linux is case sensitive: username is not the same as Username or USERNAME.
- The system doesn't show your password on the screen as you type.
- You may get some messages from the system administrator after you log in.

Accessing HPC from Windows systems
Two stage process:
- Connecting to the system via secure shell (ssh) login
- Getting a graphical connection that supports X-Windows
ssh connection: needs third party software. Local suggestion: use PuTTY.
- The process is slightly more awkward than ideal because the local PuTTY is configured for the Sun UNIX environment.
- Better: download putty.exe from http://www.chiark.greenend.org.uk/~sgtatham/putty/ (it just runs from your desktop).
- Alternative: cygwin, a Linux-like environment for Windows (www.cygwin.com).

Using Local PuTTY - 1 Better choice This is necessary for all PuTTY installs.
Using Local PuTTY - 2 linux
Using PuTTY-3
PuTTY Terminal Screen
The shell or command line
1. The Prompt
There are several different shells, but they behave more or less the same.
    username@hpc/home~>
- username: your username
- hpc: the machine you are logged in to
- /home: your present location
The shell is interactive, and the prompt can be customised to look how you wish.

The shell or command line
2. Commands
    username@hpc~> ls -ald *.txt
- The shell breaks the command up into individual words. The boundary between words is a space.
- The first word is a command. The subsequent words form a list of arguments to the command.
- For the shell to treat a phrase that includes spaces as a single word, put it in quotes: 'my word' or "my word".
- Arguments beginning with - are options. Options control how the program runs. '-a -l -d' is equivalent to '-ald'.
- * is a special character. It means any group of characters (including none). The shell finds all the filenames that match anything.txt and adds them to the list of arguments.

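The word-splitting, quoting and wildcard rules above can be tried out safely in a scratch directory. This is only a sketch: the file names are made up, and count_args is a hypothetical helper (not a standard command) that reports how many arguments the shell handed to it.

```shell
# Work in a throwaway directory so the wildcard only sees the demo files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch notes.txt results.txt data.csv "my word.txt"

# Hypothetical helper: prints the number of arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args my word)     # two words, so two arguments
quoted=$(count_args "my word")     # quoted phrase, so one argument
globbed=$(count_args *.txt)        # the shell expands *.txt before the command runs

echo "unquoted=$unquoted quoted=$quoted globbed=$globbed"

# Bundled and separate options behave the same way.
ls -ald *.txt
ls -a -l -d *.txt
```

The command never sees the * itself; it only receives the list of matching file names, which is why the quoted file name with a space still counts as a single argument.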
More Special Characters
- *    Any group of characters including none.
- ?    Any single character.
- " '  String or word delineation.
- &    Cause the process to run in the background.
- |    Pipe. Pass the output of the command on the left as the input to the command on the right.
- >    Redirect the output, eg. to a file.
- <    Redirect the input, eg. from a file instead of the keyboard.
- ``   Backticks (not '). Take the output of the command and use it as an argument.
- $    Dollar. Treat the next word as a variable and write out its value.
- \    Backslash. Change the meaning of the next character.
- ;    Semicolon. Separate commands typed in together.
Some special characters can lose their special meaning if they are inside quotes.

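A short sketch of each special character in use, assuming a POSIX-style shell. All file and variable names here are invented for the demonstration.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch seq1 seq2 seq10

printf 'b\na\nc\n' > letters.txt          # >  sends output to a file
first=$(sort < letters.txt | head -n 1)   # <  reads input from a file; | feeds sort into head
single=$(echo seq?)                       # ?  matches exactly one character
any=$(echo seq*)                          # *  matches any group of characters

name=world
greeting="hello $name"                    # $  is expanded inside double quotes
literal='hello $name'                     # single quotes switch the $ off
escaped=$(echo \$name)                    # \  removes the special meaning of $

upper=`echo seq1 | tr a-z A-Z`            # backticks substitute the command's output
both=$(echo one; echo two)                # ;  separates two commands on one line

sort letters.txt > sorted.txt &           # &  runs the command in the background
wait                                      # wait for the background job to finish
sorted_first=$(head -n 1 sorted.txt)

echo "$first | $single | $greeting | $literal | $escaped | $upper"
```

Note how the same text, hello $name, gives different results depending on the kind of quote around it.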
Organisation
"Everything is a file"
- An ordinary file contains data. This data could be an image, a document, a set of instructions (a program) or any fixed information.
- A directory contains other files. This is a folder on Windows. A directory can contain other directories (sub-directories).
- A link is a file that is a shortcut to another file. Files can have more than one name, and be in different directories at the same time.
- There are many other types of file.

Organisation of the file system
    /
        bin
        usr
        home
            username
                prot
                letter
                project
                    seq1  seq2  seq3  seq4
        etc
- The top of the file system is the directory '/', commonly known as the root directory.
- There are several subdirectories under the root directory.
- 'username' is an example user's home directory, with a subdirectory and several files.
- Any file in the file system can be uniquely identified by describing the path to it from the root directory, for example /home/username/prot

Organisation of the file system
Any process is located somewhere in the filesystem. The command 'pwd' (print working dir) will tell you where.
    username@hpc ~> pwd
    /home/username
'~' is a linux shortcut for 'your home directory'.

Looking at the file system
'ls' lists the files in a directory or directories.
- Without an argument, ls lists all the files that don't start with '.' in the current directory.
- There are many options to ls that allow you to select and control the information it presents.
    username@hpc ~> ls
    letter  project  prot
    username@hpc ~> ls project prot
    prot

    project:
    seq1  seq2  seq3  seq4

Moving around the file system
You can move to a different directory with the command 'cd directory'.
- 'directory' is the directory to which you want to move.
- The name can be written as the full path (from root) or as the relative path (from your current directory).
- '..' means the parent directory.
- '.' means the current directory.
    username@hpc ~> cd /home/username/project
    username@hpc ~/project> pwd
    /home/username/project
    username@hpc ~/project> cd ..
Repeat using the relative path:
    username@hpc ~> cd project
    username@hpc ~/project> pwd
    /home/username/project

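The difference between full and relative paths can be checked with a small sketch. It builds a made-up home/username/project tree inside a temporary directory, so the paths are illustrative rather than the real ones on the HPC.

```shell
# Build a miniature copy of the example tree under a temporary root.
root=$(mktemp -d)
mkdir -p "$root/home/username/project"

cd "$root/home/username/project" || exit 1   # full path, starting from the top
at_project=$(pwd)

cd .. || exit 1                              # '..' moves up to the parent directory
at_home=$(pwd)

cd project || exit 1                         # relative path from the current directory
back=$(pwd)

cd . || exit 1                               # '.' is the current directory, so nothing changes
same=$(pwd)

echo "$at_project"
echo "$at_home"
```

Both routes end in the same place; the relative form is just shorter to type when you are already nearby.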
Changing the file system
You can create a new subdirectory in the current directory with the command 'mkdir directory'.
    username@hpc ~> mkdir model
The new directory 'model' now sits next to prot, letter and project in the home directory.

Changing the file system
- You can delete an empty subdirectory with the command 'rmdir directory'.
- You can delete a file with the command 'rm file'.
- You can delete a subdirectory and its contents with the command 'rm -rf directory'.
    username@hpc ~> rmdir model
    username@hpc ~> rm prot
    username@hpc ~> rm -rf directory

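Because 'rm -rf' deletes without asking, it is worth practising in a scratch directory first. A minimal sketch, with made-up names that mirror the slide:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

mkdir model                  # an empty subdirectory
touch prot                   # an ordinary file
mkdir -p project/old         # a subdirectory that has contents
touch project/old/seq1

rmdir model                  # works because model is empty
rm prot                      # removes a single file

# rmdir refuses to remove a directory that still has contents.
if rmdir project 2>/dev/null; then refused=no; else refused=yes; fi

rm -rf project               # removes project and everything inside it
remaining=$(ls)
echo "rmdir refused: $refused"
```

There is no recycle bin on the command line, so check the directory name twice before pressing ENTER on an 'rm -rf'.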
More about files: filenames
- Filenames can contain any normal text character including spaces and special characters.
- Filenames can be almost any length.
- It is best to stick to a-z, A-Z, _, -, and numbers.
- It is best to keep them short as it saves typing.
- If a filename contains a special character or a space you may need to put quotes around the whole path.
- Special characters in filenames can cause problems with some programs.

More about files: reading files
- You can print the contents of one or more files to the screen with the command 'cat file1 file2 ...'. cat prints the whole file at once, so a file longer than just a few lines will run off the top of your screen.
- You can view the contents of one or more files a page at a time on the screen with the command 'more file1 file2 ...'. more will let you search through a file, go backwards and forwards, and has many other functions.
- You can print the first few lines of a file with the command 'head file1 file2 ...'.
- The last few lines can be viewed with 'tail'.

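A small sketch of cat, head and tail on made-up files. 'more' is left out because it is interactive and only makes sense typed at a terminal.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Two small example files (contents are invented).
printf 'header\n' > title.txt
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > numbers.txt

cat title.txt numbers.txt                  # prints both files, one after the other

first=$(head -n 1 numbers.txt)             # -n chooses how many lines to show
last=$(tail -n 1 numbers.txt)
second_of_both=$(cat title.txt numbers.txt | head -n 2 | tail -n 1)

head numbers.txt                           # without -n, head shows the first 10 lines
echo "first=$first last=$last"
```

Combining head and tail with a pipe, as in the second_of_both line, is a handy way to pick out one line from the middle of a file.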
More about files: editing files
You can change the content of text files and create new files with a text editor. Text editors edit text. They do not try to format the text like word processors.
- PICO: a novice friendly basic text editor used as standard on many systems. Start with the command 'pico filename'.
- EMACS: a powerful editing environment which can be programmed. It has many modes for auto layout of program code. Start with the command 'emacs filename'.
- VI: a powerful editor which can be somewhat confusing for newcomers. It is designed for rapid editing of text files and programming. Start with the command 'vi filename'.
- Others: kedit, gedit, kwrite etc.

More about files: copying files
You can copy a file with the command 'cp oldfilename newfilename'.
    username@hpc ~> ls
    letter  project
    username@hpc ~> cp letter draft
    username@hpc ~> ls
    draft  letter  project
- If newfilename is a directory, then the file will be copied to 'newfilename/oldfilename'.
- Warning: if a file called newfilename already exists then it will be overwritten.
The command 'mv oldfilename newfilename' can be used to rename a file.
    username@hpc ~> mv oldfilename newfilename

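The copy, overwrite and rename behaviour described above can be seen in a short sketch. File names and contents are invented for illustration.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
echo "Dear colleague" > letter
mkdir project

cp letter draft            # draft is a new copy; letter is unchanged
cp letter project          # project is a directory, so this creates project/letter

echo "first version" > notes
echo "second version" > notes_new
cp notes_new notes         # notes already existed: its old content is replaced without warning
overwritten=$(cat notes)

mv draft final             # rename: draft disappears and final takes its place
ls
```

Both cp and mv accept the option -i, which asks for confirmation before overwriting an existing file.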
More about files: permissions
Every file is protected. Permissions determine who can read, write, or execute a given file.
- Owner: the user who owns the file.
- Group: other users in the same group as the user who owns the file.
- World: all the other users in the system.
Files can have read (r), write (w) or execute (x) permission for each of the three types of user.

More about files: permissions
You can view the permissions for a file by listing it in long format with the command 'ls -l filename' (the option is the letter l).
    username@hpc ~> ls -l letter
    -rwxr--r-- 1 username users 6048 Aug 17 16:07 letter
Reading the output from left to right:
- '-'         the file type: - ordinary file, d directory, l link (shortcut)
- 'rwx'       permissions for the owner
- 'r--'       permissions for the owner's group
- 'r--'       permissions for everyone else
- 'username'  the user who owns the file
- 'users'     the group
- '6048'      the file's size
- 'Aug 17 16:07'  the date the file was last modified
- 'letter'    the file's name

More about files: permissions
You can change the permissions for a file with the command 'chmod change filename', where 'change' is the modification you want to make to the file's permissions.
    username@hpc ~> ls -l letter
    -rwxr--r-- 1 username users 6048 Aug 17 16:07 letter
    username@hpc ~> chmod o-r letter
    username@hpc ~> ls -l letter
    -rwxr----- 1 username users 6048 Aug 17 16:07 letter
For whom you are changing permissions:
- u  user
- g  group
- o  other
- a  all
How you are changing permissions:
- -  remove these permissions
- +  add these permissions
- =  set permissions to this
Permissions being changed:
- r  read
- w  write
- x  execute (run) permission

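The who / how / what parts of a chmod change can be combined freely. A minimal sketch on a made-up file, keeping only the first ten characters of the 'ls -l' output (the file type and the nine permission letters):

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch letter

chmod u=rwx,g=r,o=r letter             # set the starting point used on the slide
before=$(ls -l letter | cut -c1-10)

chmod o-r letter                       # other, remove, read
after=$(ls -l letter | cut -c1-10)

chmod g+w letter                       # group, add, write
added=$(ls -l letter | cut -c1-10)

chmod a=r letter                       # all, set to exactly read
all_read=$(ls -l letter | cut -c1-10)

echo "$before -> $after -> $added -> $all_read"
```

Several changes can be given at once by separating them with commas, as in the first chmod line.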
Introduction to Awk Awk is a convenient and expressive programming language that can be applied to a wide variety of computing and data manipulation tasks.
Awk
- Works well on record-type data
- Reads input file(s) a line at a time
- Parses each line into fields
- Performs user-defined tests against each line and performs actions on matches

Other Common Uses
- Input validation
    - Does every record have the same number of fields?
    - Do values make sense (negative time, hourly wage > $100, etc.)?
- Filtering out certain fields
- Searches
    - Who got a zero on lab 3?
    - Who got the highest grade?
- Many others (it's late)

Invocation
- Can write little one-liners on the command line (very handy). Print the 3rd field of every line:
    $ awk '{ print $3 }' input.txt
- Execute an awk script file with the -f option:
    $ awk -f script.awk input.txt
- Or, use this sha-bang as the first line, and give your script execute permissions (adjust the path if awk lives elsewhere on your system, for example /usr/bin/awk):
    #!/bin/awk -f

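The three ways of running awk should give the same result. The sketch below tries all three on a made-up input file; the script names are invented, and the location of awk is looked up with 'command -v' because it is an assumption that it lives in /bin on every system.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'a b c\nd e f\n' > input.txt

# 1. One-liner on the command line.
one_liner=$(awk '{ print $3 }' input.txt)

# 2. The same program stored in a file and run with -f.
printf '{ print $3 }\n' > script.awk
from_file=$(awk -f script.awk input.txt)

# 3. An executable script whose first line names the awk interpreter.
awk_path=$(command -v awk)
printf '#!%s -f\n{ print $3 }\n' "$awk_path" > third.awk
chmod u+x third.awk
executable=$(./third.awk input.txt)

echo "$one_liner"
```

The third form is convenient for scripts you use often, since they can then be run like any other command.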
Form of an AWK program
AWK programs are entries of the form:
    pattern { action }
- pattern: some test, looking for a pattern (regular expression) or a C-like condition
    - if null, the action is applied to every line
- action: a statement or set of statements
    - if not provided, the default action is to print the entire line, much like grep

Awk Features
- Patterns can be regular expressions or C-like conditions.
- Each line of the input is matched against the patterns, one after the next. If a match occurs the corresponding action is performed.
- Input lines are parsed and split into fields, which are accessed by $1, ..., $NF, where NF is a variable set to the number of fields. The variable $0 contains the entire line, and by default lines are split by white space (blanks, tabs).

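The four combinations of pattern and action can be compared side by side. A minimal sketch on a made-up three-column file (name, sequence, length):

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'seq1 ACGT 4\nseq2 ACGTACGT 8\nseq3 AC 2\n' > seqs.txt

# Regular expression pattern with an action.
regex_match=$(awk '/^seq2/ { print $2 }' seqs.txt)

# C-like condition with an action.
condition=$(awk '$3 > 3 { print $1 }' seqs.txt)

# Pattern only: the default action prints the whole line, like grep.
no_action=$(awk '$3 < 3' seqs.txt)

# Action only: it is applied to every line. NF is the field count, $NF the last field.
no_pattern=$(awk '{ print NF, $NF }' seqs.txt)

echo "$condition"
```

Note that $NF (with the dollar) is the value of the last field, while NF on its own is the number of fields.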
Variables
- Not declared, nor typed
- No character type: only strings and floats (with support for ints)
- $n refers to the nth field (where n is some integer value)
    # prints each field on the line (NF must be upper case: awk is case sensitive)
    for (i = 1; i <= NF; ++i)
        print $i

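A brief sketch of untyped variables in practice. The END pattern used here (an action run once after the last line) is standard awk but is not covered on these slides; the input values are made up.

```shell
# Print each field of a line on its own line.
fields=$(echo 'alpha beta gamma' | awk '{ for (i = 1; i <= NF; ++i) print $i }')

# sum is never declared: it starts out empty and is treated as 0 when first added to.
total=$(printf '2\n3.5\n' | awk '{ sum += $1 } END { print sum }')

# The same variable can hold a number and then a string.
label=$(echo 7 | awk '{ x = $1 + 1; x = x " items"; print x }')

echo "$fields"
echo "$total"
echo "$label"
```

Putting two values next to each other, as in x " items", joins them into one string; awk has no separate operator for this.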
Some Built-in Variables
- FS: the input field separator
- OFS: the output field separator
- NF: number of fields; changes with each record
- NR: the number of records read so far, that is, the current record number
- $0: the entire input line

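The built-in variables make simple input validation very short. This sketch uses a made-up comma-separated file (name, hourly rate, hours) with two deliberate problems; the -F option and the BEGIN and END patterns are standard awk features not shown on the slides.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'Beth,4.00,0\nDan,3.75\nKathy,4.00,-10\nMark,5.00,20\n' > records.csv

# -F, sets FS to a comma. Report records that do not have exactly 3 fields.
bad_shape=$(awk -F, 'NF != 3 { print NR ": " $0 }' records.csv)

# Values that do not make sense: negative hours in the third field.
negative=$(awk -F, 'NF == 3 && $3 < 0 { print $1 }' records.csv)

# FS and OFS can also be set in a BEGIN block: read commas, write tabs.
retabbed=$(awk 'BEGIN { FS = ","; OFS = "\t" } NF == 3 { print $1, $3 }' records.csv | head -n 1)

# After the last record, NR holds the total number of records read.
total=$(awk 'END { print NR }' records.csv)

echo "$bad_shape"
echo "$negative"
```

OFS is only used when the fields in a print statement are separated by commas, as in print $1, $3.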
Getting help
- You can get help on a command by using the command 'man command'. This will bring up the manual page and show it to you screen by screen.
- If you do not know what a command is called, use the option '-k' to get a list of commands that may be relevant: 'man -k word'. This will find all manual pages containing word in the short description of the command.
- Try using the options '-h', '-help', or '--help' if you can't find the man page.

Exercise: Filter SNPs
Go to http://hpc.ilri.cgiar.org/beca/gbs/ and run these commands in your home directory:
a) mkdir snp_data
b) cd snp_data
c) wget http://hpc.ilri.cgiar.org/beca/gbs/africa55k_10pops.bim
d) wget http://hpc.ilri.cgiar.org/beca/gbs/emp.data
e) ls -alh
f) grep '^23\|^25\|^26' Africa55K_10Pops.bim > AfricaAll_Pops_non_autosomal.rsids
g) awk '{if ($1 > 22) print $2}' Africa55K_10Pops.bim > Africa55K_10Pops.xchrsnps
File names are case sensitive: check the output of step e) and use the exact name of the downloaded .bim file in steps f) and g).

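Before running the filters on the real data, the two commands can be tried on a tiny made-up file. This is a sketch only: the rows in toy.bim are invented, laid out in the usual six-column .bim order (chromosome, SNP id, genetic distance, position, allele 1, allele 2), and the grep is written with -E and a trailing [[:space:]] so that only the whole chromosome code is matched.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Invented example rows; the real file has many more.
printf '1\trs0001\t0\t1000\tA\tG\n'   > toy.bim
printf '22\trs0002\t0\t2000\tC\tT\n' >> toy.bim
printf '23\trs0003\t0\t3000\tA\tC\n' >> toy.bim
printf '25\trs0004\t0\t4000\tG\tT\n' >> toy.bim
printf '26\trs0005\t0\t5000\tA\tT\n' >> toy.bim

# Keep whole lines whose chromosome code is 23, 25 or 26.
grep -E '^(23|25|26)[[:space:]]' toy.bim > toy_non_autosomal.rsids

# Keep only the SNP id (field 2) where the chromosome code is above 22.
awk '$1 > 22 { print $2 }' toy.bim > toy.xchrsnps

grep_lines=$(grep -c '' toy_non_autosomal.rsids)   # number of lines kept by grep
awk_ids=$(tr '\n' ' ' < toy.xchrsnps)              # ids kept by awk, on one line
echo "$grep_lines lines; ids: $awk_ids"
```

The grep version keeps every column of the matching lines, while the awk version writes out just the SNP ids, so the two output files are not interchangeable.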
Example
    $ cat emp.data
    Beth 4.00 0
    Dan 3.75 0
    Kathy 4.00 10
    Mark 5.00 20
    Mary 5.50 22
    Susie 4.25 18
Print those employees who actually worked (third field greater than zero), with the product of the second and third fields:
    $ awk '$3>0 {print $1, $2*$3}' emp.data
    Kathy 40
    Mark 100
    Mary 121
    Susie 76.5

Acknowledgement
- SANBI (David Martin)
- BSK
Adapted from SANBI & Bioinformatics Society of Kenya (BSK)

Useful literature
- 'Learning the UNIX operating system', O'Reilly press.
- 'UNIX Quickguide hpc
Questions?