Programming introduction part I: Perl, Unix/Linux and using the BlueHive cluster
Bio472 - Spring 2014
Amanda Larracuente

Text editor
Features to look for:
  - Syntax coloring
  - Recognizes several languages
  - Line numbers
  - Free!
Options:
  - Mac/Windows: GNU emacs http://www.gnu.org/software/emacs/
  - Mac: Xcode
  - OR: nano (type text, ctrl+o to save)

Log into BlueHive
    ssh username@bluehive.rochester.edu
OR
    ssh bluehive.crc.rochester.edu -l username
Go to the class directory: /scratch/bio472_2014/
Download problem set #1 from: http://blogs.rochester.edu/selfishdna/ (go to Courses)

Basic Unix/Linux commands
Reference sheet is under the Courses tab at: http://blogs.rochester.edu/selfishdna/
    cd dir                                 change directory to dir
    cd ..                                  go up one directory
    ls                                     list contents of the directory
    ls *.txt                               list all files ending in .txt
    ls -s                                  show file sizes
    pwd                                    show path to current directory
    du                                     show directory space usage
    wc -l file                             print the number of lines in file
    cat file.txt                           print the contents of file.txt
    cat file1.txt file2.txt > file3.txt    concatenate file1.txt and file2.txt into file3.txt
    grep pattern file                      find all lines in file that match pattern
    grep ">" test.fa | wc -l               count the number of FASTA sequences in test.fa

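The commands above can be tried safely on a throwaway file. The following is a minimal sketch, assuming a bash-like shell; the temporary directory, the file names and the three sequences are made up for illustration and are not part of the course materials.

```shell
# Work in a throwaway directory so nothing real is touched (hypothetical location)
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny FASTA file with three made-up sequences
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > test.fa

pwd                 # show where we are
ls -s test.fa       # show the file size
wc -l test.fa       # total number of lines in the file

# Count FASTA sequences: every sequence has one header line starting with ">"
n_pipe=$(grep ">" test.fa | wc -l | tr -d ' ')   # grep piped into wc -l
n_seqs=$(grep -c ">" test.fa)                    # grep can also count matches itself
echo "sequences: $n_pipe (grep | wc -l), $n_seqs (grep -c)"

# Concatenate the file with itself into a new file and count again
cat test.fa test.fa > double.fa
n_double=$(grep -c ">" double.fa)
echo "sequences after cat: $n_double"
```

Note that the ">" must be quoted in the grep command; an unquoted > would be treated by the shell as output redirection and overwrite test.fa.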
Practical Extraction and Report Language (Perl)
  - Free high-level programming language
  - Do you have Perl v5.0 or later on your system? Open a terminal and type:
    perl -v

Using BlueHive
Go here: https://www.circ.rochester.edu/wiki/index.php/getting_started
For graphical applications (e.g. R):
  Mac:
    Open XQuartz, Applications -> Terminal
    Log in to BlueHive with -Y:
        ssh -Y user@bluehive.crc.rochester.edu
        module load R-3.0.2
        R
  Windows:
    Get MobaXterm: http://mobaxterm.mobatek.net/
    Use ssh to log into BlueHive
For text applications:
  Use ssh to log into BlueHive and submit a PBS script with qsub
  OR work interactively, e.g.:
        qsub -I -q interactive -l nodes=1:ppn=1 -l walltime=1:00:00

Hello World!
Open a text editor, type the following lines and save as a file called Hello_world.pl:
    #!/usr/bin/perl -w
    print "Hello, world!\n";
Run the program in your terminal by typing:
    perl Hello_world.pl

Scalars
  - Strings of characters: "hello"
  - Numbers (integers, floating points): 10 or 10.3458 or 10e7
Can be acted on with operators and will return a scalar:
    +     addition               (2+3=5)
    -     subtraction            (5.1-2.4=2.7)
    *     multiplication         (3*12=36)
    /     division               (14/2=7)
    %     modulus (remainder)    (10%3=1)
    **    exponentiation         (2**3=8)
Store in a scalar variable. Declare a scalar variable with my:
    my $scalar;

Special characters and comparison operators
Special characters:
    \n    newline
    \t    tab
    \s    whitespace (in regular expressions)
Comparison operators:
    Comparison                  Numeric    String
    Equal                       ==         eq
    Not equal                   !=         ne
    Less than                   <          lt
    Greater than                >          gt
    Less than or equal to       <=         le
    Greater than or equal to    >=         ge

Loops
Perl counts from zero!
for loop:
    for (my $i=0; $i<10; $i++)
    {
        print $i, "\t";
    }
while loop:
    my $i=0;
    while ($i<10)
    {
        print $i, "\n";
        $i++;
    }
if / elsif / else:
    my $i=0;
    if ($i <= 10 && $i > 6)
    {
        print "High\n";
    }
    elsif ($i <= 6 && $i > 3)
    {
        print "Mid\n";
    }
    else
    {
        print "Low\n";
    }

Arrays
Variable that contains a list
Create an array called people with elements 0-3 and values Fisher, Wright, Haldane, Mayr.
    my @people;
    $people[0]="Fisher";
    $people[1]="Wright";
    $people[2]="Haldane";
    $people[3]="Mayr";

    #Get the size of the array
    my $size=$#people+1;
    print "size:",$size,"\n";

    #Print the names stored in the array
    for (my $i=0; $i<$size; $i++)
    {
        print $people[$i],"\n";
    }

Hashes
Hold values indexed by strings
Look up values with keys (the index)
Create a hash called names, with the keys Fisher, Wright, Haldane and Mayr and the values Ronald, Sewall, J.B.S. and Ernst.
    my %names;
    $names{"Fisher"}="Ronald";
    $names{"Wright"}="Sewall";
    $names{"Haldane"}="J.B.S.";
    $names{"Mayr"}="Ernst";

    #Print the names stored in the hash
    for my $key (keys %names)
    {
        print $key,",",$names{$key},"\n";
    }

Split
  - Create a string called line with the following text: "There is grandeur in this view of life..."
  - Split the line on spaces and store in an array
  - Print the elements of the array
    my $line="There is grandeur in this view of life...";
    my @array=split(/\s/,$line);
    for (my $j=0; $j<$#array+1; $j++)
    {
        print $array[$j],"\n";
    }

Substring
Grab a subset of characters in a string:
    substr(string,position,length)
Example: Extract the word grandeur from the following string: "There is grandeur in this view of life..."
    my $line="There is grandeur in this view of life...";
    my $subset=substr($line,9,8);
    print $subset,"\n";

Regular expressions
Substitutions: =~s/target/replacement/
Matches: =~m/string/
    Char    Meaning
    ^       Beginning of string
    $       End of string
    .       Any character (except newline)
    *       Match 0+ times
    +       Match 1+ times
    ?       Match 0 or 1 times, or shortest match
    |       Alternative
    \       Special character
    \w      Matches an alphanumeric character
    \d      Matches a digit
    \s      Matches a whitespace

Some examples
    if ($line=~m/^>/)       #if the string starts with >
    if ($line=~m/[ATCG]/)   #if the string contains A or T or C or G
    if ($line=~m/\w/)       #if the string contains a word character
    if ($line=~m/^\w\d+/)   #if the string starts with a word character followed by one or more digits

Input/Output and Filehandles
Filehandle: an I/O connection between you and Perl
Special filehandle names:
  - STDIN
  - STDOUT
  - STDERR
  - DATA
  - ARGV
  - ARGVOUT
Write a program that:
  1. Takes the name of a file in on the command line
  2. Opens the file and iterates through each line
  3. For each line, creates a new string that replaces most with MOST

Example 1: Input and Output
    #!/usr/bin/perl -w
    ###############################################################################
    #
    # Amanda Larracuente
    # Program written to play with I/O and filehandles
    #
    # example usage: IO_lesson.pl file.txt file.out
    #
    ###############################################################################
    my $file=$ARGV[0];      #name of input file
    my $outfile=$ARGV[1];   #name of output file

    #open the input/output files or die and report the error
    open(FILE, "$file") || die ("Can't open $file!\n");
    open(OUT, ">>$outfile") || die ("Can't open $outfile!\n");

    foreach my $line (<FILE>)   #for each line in the input file
    {
        chomp($line);            #remove terminal \n
        $line=~s/most/MOST/g;    #replace most with MOST
        print $line,"\n";        #print the new line to the screen
        print OUT $line,"\n";    #write the new line to an output file
    }

    #close input/output files
    close(FILE);
    close(OUT);

Example 2: While
    #!/usr/bin/perl -w
    ###############################################################################
    #
    # Amanda Larracuente
    # Program written to play with I/O and filehandles
    #
    # example usage: IO_lesson.pl file.txt
    #
    ###############################################################################
    my $file=$ARGV[0];   #name of input file

    #open the input file or die and report the error
    open(FILE, "$file") || die ("Can't open $file!\n");

    while (<FILE>)   #for each line in the input file
    {
        chomp($_);              #remove terminal \n
        my $new=$_;
        $new=~s/most/MOST/g;    #replace most with MOST globally
        print $new,"\n";        #print the new line to the screen
    }
    close(FILE);   #close input file

Example 3: Grab_reads_from_sam.pl
  - Reconstruct a fastq file from an alignment file (SAM file)
  - Type: more TestGene.sam (we'll learn more about SAM files in the next lecture)
  - This is a tab-delimited file containing alignment information
  - Each line includes the read sequence and quality in the 10th and 11th columns, respectively
  - Split each line on the \t and store the elements in an array
  - Print out the columns that you need

    #!/usr/bin/perl
    use warnings;
    use strict;
    ###############################################################################
    #
    # Amanda Larracuente 11/21/13
    # Program written to recreate fastq from sam file
    #
    # example usage: perl Grab_reads_from_sam.pl MappedReads.sam
    #
    ###############################################################################
    my $samfile=$ARGV[0];   #name of SAM file to fetch reads from
    open(FILE, "$samfile") || die ("Can't open $samfile!\n");
    print "File ",$samfile," opened...\n";

    my $outfile=$samfile;
    $outfile=~s/\.sam$/.READS.fq/;   #make output file name by replacing .sam with .READS.fq
    open(OUT, ">>$outfile") || die ("Can't open $outfile!\n");

    foreach my $line (<FILE>)   #for each line in sam file
    {
        chomp($line);   #get rid of "\n" at the end of each line
        if ($line=~m/^\@/) {next;}   #if the line starts with @, skip because this is a header line in the sam file
        else   #this must be one of the lines containing alignment information
        {
            my @linearray=split(/\t/,$line);   #split the line on tabs and store in an array
            my $read_name=$linearray[0];       #so now the first element of the array corresponds to the read name
            my $seq=$linearray[9];             #get the read sequence
            my $qual=$linearray[10];           #get the base qualities
            #Make a new fastq file containing the reads
            print OUT "@",$read_name,"\n",$seq,"\n+\n",$qual,"\n";
        }
    }
    close(OUT);
    close(FILE);

Some useful and quick commands
What do these do? Try it with TestGene.READS.fq!
    cat test.fastq | perl -e '$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;} elsif($i==1){print;$i=-3}$i++;}' > test.fasta
    cat test.fastq | perl -e '$i=0;while(<>){if(/^\+/&&$i==2){print;}elsif ($i==3){print;$i=-1}$i++;}' > test.qual

Perl resources
  - O'Reilly Perl books
  - PerlMonks website: http://www.perlmonks.org/

Shell scripting
  - Interface between the user and the Linux/Unix system
  - BASH
  - Use PBS scripts to submit computationally intensive jobs to BlueHive

BlueHive: a typical PBS script
To run bowtie2:
    #!/bin/bash
    #PBS -q standard
    #PBS -l nodes=1:ppn=4
    #PBS -l walltime=4:00:00
    #PBS -l pvmem=4000mb
    #PBS -j oe
    #PBS -N bowtie2.align
    #PBS -o bowtie2.align.srr189053.log

    cd $PBS_O_WORKDIR
    source /usr/local/modules/init/bash
    module load bowtie-2.0.6
    mkdir bowtie_srr189053

    bowtie2 --phred64 --sensitive -p 4 -x RanGAP_generegion -q -1 SRR189053_1_val_1.fq -2 SRR189053_2_val_2.fq -U SRR189053_unpaired.fq -S bowtie_srr189053/srr189053.sam

BlueHive: a typical PBS script
To run bowtie2, line by line:
Use the bash shell:
    #!/bin/bash
Request 4 processors on a single standard node:
    #PBS -q standard
    #PBS -l nodes=1:ppn=4
Request 4GB of RAM and 4 hours of wall time:
    #PBS -l walltime=4:00:00
    #PBS -l pvmem=4000mb
qstat will show your job as bowtie2.align, and this log file is created when the job completes, with run details:
    #PBS -j oe
    #PBS -N bowtie2.align
    #PBS -o bowtie2.align.srr189053.log
Change to the working directory (where you ran qsub):
    cd $PBS_O_WORKDIR
Load bowtie2 and dependencies:
    source /usr/local/modules/init/bash
    module load bowtie-2.0.6
Make a directory to store your output:
    mkdir bowtie_srr189053
Your bowtie command:
    bowtie2 --phred64 --sensitive -p 4 -x RanGAP_generegion -q -1 SRR189053_1_val_1.fq -2 SRR189053_2_val_2.fq -U SRR189053_unpaired.fq -S bowtie_srr189053/srr189053.sam

BlueHive: the queue
The more resources you request, the longer you will wait in the queue
To submit a job, type:
    qsub jobname.pbs
To check on jobs, type:
    qstat -u user_name
To kill a job, type:
    qdel job_id

BlueHive dos and don'ts
DO:
  - Use PBS scripts and qsub to run all jobs
  - Store all of your output in /scratch/username
  - Remember that /scratch is not backed up, so move files that you need
DON'T:
  - Run a script on the command line (this uses the head node) unless in interactive mode
  - Store intermediate output in ~/username (limited space)

System commands in Perl
    system("command");
e.g.
    system("rm file");
    system("mv file dir");

File scripting in Perl
    my $filevar = <<ENDFILE;
    File contents
    ENDFILE
Example: /scratch/bio472_2014/example_scripts/example_scripter.pl

sed
Compact, but powerful!
    sed 's/string1/string2/g'    Replace string1 with string2
    sed 's/[ \t]*$//'            Eliminate whitespace at end of line
    sed -n '10p'                 Print 10th line

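The three sed commands can be chained with pipes to see what each one does. This is a minimal sketch, assuming a bash-like shell and GNU sed (as found on most Linux systems); the temporary file and its made-up contents are for illustration only.

```shell
# Make a made-up 12-line file; every line ends with three trailing spaces
tmp=$(mktemp)
printf 'line %s   \n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$tmp"

# Replace "line" with "row", strip trailing whitespace, then print only line 10
result=$(sed 's/line/row/g' "$tmp" | sed 's/[ \t]*$//' | sed -n '10p')
echo "[$result]"   # brackets make any leftover whitespace visible

# sed does not change the input file unless asked to; the original is untouched
untouched=$(sed -n '10p' "$tmp")
echo "[$untouched]"

rm "$tmp"
```

Each sed call reads from the previous one through the pipe, so the original file is only read once and never modified.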
Awk
Compact, but powerful!
Print every other line in file:
    awk '!(NR % 2)' testmatrix.txt
Print average of 2nd column:
    cat testmatrix.txt | awk 'BEGIN {sum=0} {sum+=$2} END {print "Average qual: "sum/NR}'
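Both awk commands can be checked on a small file where the answer is easy to work out by hand. This is a minimal sketch, assuming a bash-like shell; the four-row, tab-delimited matrix below is made up and stands in for testmatrix.txt.

```shell
# Made-up tab-delimited matrix: a name in column 1, a number in column 2
tmp=$(mktemp)
printf 'r1\t10\nr2\t20\nr3\t30\nr4\t40\n' > "$tmp"

# NR is the current line number; !(NR % 2) is true on even-numbered lines,
# so this prints lines 2 and 4
even=$(awk '!(NR % 2)' "$tmp")
echo "$even"

# Add up column 2 on every line; in the END block NR holds the total line count
avg=$(awk '{sum+=$2} END {print sum/NR}' "$tmp")
echo "Average: $avg"

rm "$tmp"
```

awk variable names are case-sensitive, so the built-in line counter must be written NR; a lowercase nr would be an unset variable and cause a division by zero.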