Perl for Biologists. Session 13 May 27, Parallelization with Perl. Jaroslaw Pillardy

Similar documents
Perl for Biologists. Practical example. Session 14 June 3, Robert Bukowski. Session 14: Practical example Perl for Biologists 1.

Perl for Biologists. Session 6 April 16, Files, directories and I/O operations. Jaroslaw Pillardy

Perl for Biologists. Object Oriented Programming and BioPERL. Session 10 May 14, Jaroslaw Pillardy

Getting to know you. Anatomy of a Process. Processes. Of Programs and Processes

Perl for Biologists. Session 9 April 29, Subroutines and functions. Jaroslaw Pillardy

Process Management! Goals of this Lecture!

Lecture Topics. Announcements. Today: Threads (Stallings, chapter , 4.6) Next: Concurrency (Stallings, chapter , 5.

518 Lecture Notes Week 3

Process Management 1

UNIT I Linux Utilities

4.8 Summary. Practice Exercises

* What are the different states for a task in an OS?

SE350: Operating Systems

Maize genome sequence in FASTA format. Gene annotation file in gff format

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras

Processes and Threads. Processes and Threads. Processes (2) Processes (1)

Robert Bukowski Jaroslaw Pillardy 6/27/2011

Reading Assignment 4. n Chapter 4 Threads, due 2/7. 1/31/13 CSE325 - Processes 1

The Unix Shell & Shell Scripts

Bioinformatics. Computational Methods II: Sequence Analysis with Perl. George Bell WIBR Biocomputing Group

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2018 Lecture 20

Project 2: Shell with History1

Operating System Labs. Yuanbin Wu

Perl for Biologists. Regular Expressions. Session 7. Jon Zhang. April 23, Session 7: Regular Expressions CBSU Perl for Biologists 1.

Killing Zombies, Working, Sleeping, and Spawning Children

Operating Systems. Lecture 05

What is a Process? Processes and Process Management Details for running a program

Introduction to OS Processes in Unix, Linux, and Windows MOS 2.1 Mahmoud El-Gayyar

CSC 539: Operating Systems Structure and Design. Spring 2006

CS 326: Operating Systems. Process Execution. Lecture 5

Processes. Dr. Yingwu Zhu

NAME SYNOPSIS DESCRIPTION. Behavior of other Perl features in forked pseudo-processes. Perl version documentation - perlfork

OS Structure, Processes & Process Management. Don Porter Portions courtesy Emmett Witchel

Interrupts, Fork, I/O Basics

Unix, Perl and BioPerl

Processes. What s s a process? process? A dynamically executing instance of a program. David Morgan

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

Operating System Design

Process Creation in UNIX

Creating a Shell or Command Interperter Program CSCI411 Lab

Perl for Biologists. Arrays and lists. Session 4 April 2, Jaroslaw Pillardy. Session 4: Arrays and lists Perl for Biologists 1.

Roadmap. Tevfik Ko!ar. CSC Operating Systems Fall Lecture - III Processes. Louisiana State University. Processes. September 1 st, 2009

CSE 153 Design of Operating Systems Fall 2018

o Reality The CPU switches between each process rapidly (multiprogramming) Only one program is active at a given time

Perl Programming Fundamentals for the Computational Biologist

Processes in linux. What s s a process? process? A dynamically executing instance of a program. David Morgan. David Morgan

CSC374, Spring 2010 Lab Assignment 1: Writing Your Own Unix Shell

Processes. CS3026 Operating Systems Lecture 05

SMD149 - Operating Systems

Process Management! Goals of this Lecture!

Operating Systems, laboratory exercises. List 2.

CSCI Computer Systems Fundamentals Shell Lab: Writing Your Own Unix Shell

Script Programming Systems Skills in C and Unix

PROCESS CONTROL BLOCK TWO-STATE MODEL (CONT D)

Review of Fundamentals

Chapter 4: Threads. Operating System Concepts. Silberschatz, Galvin and Gagne

CS 4410, Fall 2017 Project 1: My First Shell Assigned: August 27, 2017 Due: Monday, September 11:59PM

UNIX Processes. by Armin R. Mikler. 1: Introduction

628 Lecture Notes Week 4

! The Process Control Block (PCB) " is included in the context,

Operating Systems Lab

Lecture 4: Memory Management & The Programming Interface

Exercise 9: simple bash script

Windows architecture. user. mode. Env. subsystems. Executive. Device drivers Kernel. kernel. mode HAL. Hardware. Process B. Process C.

Announcements Processes: Part II. Operating Systems. Autumn CS4023

Systems Programming/ C and UNIX

CSE506: Operating Systems CSE 506: Operating Systems

Processes and Threads

CSE 410: Computer Systems Spring Processes. John Zahorjan Allen Center 534

Essentials for Scientific Computing: Bash Shell Scripting Day 3

CS 450 Operating System Week 4 Lecture Notes

Systems Skills in C and Unix

CS61 Scribe Notes Date: Topic: Fork, Advanced Virtual Memory. Scribes: Mitchel Cole Emily Lawton Jefferson Lee Wentao Xu

PROCESS MANAGEMENT. Operating Systems 2015 Spring by Euiseong Seo

Sequence Analysis with Perl. Unix, Perl and BioPerl. Why Perl? Objectives. A first Perl program. Perl Input/Output. II: Sequence Analysis with Perl

Operating systems and concurrency - B03

4. Concurrency via Threads

Multithreaded Programming

Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2

Start of Lecture on January 22, 2014

Cross-platform daemonization tools.

COE518 Lecture Notes Week 2 (Sept. 12, 2011)

NAME SYNOPSIS DESCRIPTION. Behavior of other Perl features in forked pseudo-processes. Perl version documentation - perlfork

Assignment 1. Teaching Assistant: Michalis Pachilakis (

The Shell, System Calls, Processes, and Basic Inter-Process Communication

Process management. What s in a process? What is a process? The OS s process namespace. A process s address space (idealized)

Midterm Exam CPS 210: Operating Systems Spring 2013

CSE 451: Operating Systems Winter Module 4 Processes. Mark Zbikowski Allen Center 476

Chapter 4: Threads. Operating System Concepts 9 th Edit9on

Review of Fundamentals. Todd Kelley CST8207 Todd Kelley 1

Unix Processes. What is a Process?

Preview. The Thread Model Motivation of Threads Benefits of Threads Implementation of Thread

Processes. Johan Montelius KTH

Mills HPC Tutorial Series. Linux Basics II

The Process Abstraction. CMPU 334 Operating Systems Jason Waterman

Chap 4, 5: Process. Dongkun Shin, SKKU

CS 213, Fall 2002 Lab Assignment L5: Writing Your Own Unix Shell Assigned: Oct. 24, Due: Thu., Oct. 31, 11:59PM

Introduction to Unix The Windows User perspective. Wes Frisby Kyle Horne Todd Johansen

A process. the stack

THE PROCESS ABSTRACTION. CS124 Operating Systems Winter , Lecture 7

Transcription:

Perl for Biologists Session 13 May 27, 2015 Parallelization with Perl Jaroslaw Pillardy Perl for Biologists 1.2 1

Session 12 Exercises Review 1. Write a script that checks for new versions of a set of software titles listed in a file. Choose 5 programs from BioHPC Lab software list (http://cbsu.tc.cornell.edu//lab/labsoftware.aspx ) and implement them. Check for new versions on the original software websites, NOT BioHPC website. HINT: You can use script4.pl as the starting point HINT: You will need to put URL and tag delimiters info for each program in the input file. Software chosen: BWA Bioconductor LUCY2 Picard STRUCTURE Perl for Biologists 1.2 2

Session 12 Exercises Review Configuration file (exercise1.config) is tab separated text file and hold necessary information for searching URLs for each program name<tab>url<tab>tag1<tab>offset1<tab>tag2<tab>offset2<tab>last name URL tag1 offset1 tag2 offset2 last the name of the program URL of the page with version info string to find the beginning of version region how many characters to skip from the beginning of tag1 string ending the version region how many characters to skip (left or right) from tag2 do we search for first (last=0) or last (last=1) occurrence Perl for Biologists 1.2 3

#!/usr/local/bin/perl use LWP; exercise1.pl (1) #what is our current version? my %versions; open in, "exercise1.data"; while(my $vtmp = <in>) chomp $vtmp; my ($progname, $pver) = split /\t/, $vtmp; $versions$progname = $pver; close(in); my $message = ""; my $subject = ""; my $ua = LWP::UserAgent->new; $ua->agent("myapp/0.1 "); open in, "exercise1.config"; while(my $entry = <in>) if($entry =~ /^#/)next; print "checking $name\n"; chomp $entry; my ($name, $url, $tag1, $off1, $tag2, $off2, $last) = split /\t/, $entry; my $req = HTTP::Request->new(GET => $url); $req->header(accept => "text/html, */*;q=0.1"); my $res = $ua->request($req); Perl for Biologists 1.2 4

exercise1.pl (2) if ($res->is_success) my $n = index($res->content,$tag1); if($last == 1)$n=rindex($res->content, $tag1); if($n == -1) $message.= "ERROR: Cannot find version string for $name\n"; print "ERROR: Cannot find version string for $name\n"; else my $current = substr($res->content, $n+$off1); $current = substr($current, 0, index($current, $tag2) - $off2); if($current ne $versions$name) $message.= "New $name version found: $current\n"; print "New $name version found: $current\n"; $versions$name = $current; else $message.= "ERROR: Cannot open $name web page $url\n"; print "ERROR: Cannot open $name web page $url\n"; print "DONE\n"; close(in); Perl for Biologists 1.2 5

exercise1.pl (3) if($message ne "") open out, ">exercise1.data"; foreach my $entry (keys %versions) print out "$entry\t$versions$entry\n"; close(out); $subject = "Automated version checker mail"; use Mail::Sendmail; %mail = (To => 'jarekpp@yahoo.com', From => 'jp86@cornell.edu', subject => $subject, Message => $message, smtp=>'appsmtp.mail.cornell.edu'); sendmail(%mail); Perl for Biologists 1.2 6

Running programs: processes Every running program is executed as a process Process is an object of UNIX (Linux) kernel (core of the system) It is identified by process id (PID, an integer) It has an allocated region in memory containing: code (binary instructions) execution queue(s) set of pointers showing which instruction(s) to execute data (variables) copy of environment (variables, arguments etc.) communication handles (STDIN, STDOUT, file handles etc.) Perl for Biologists 1.2 7

Running programs: processes OS kernel assigns processes to available cores dynamically, i.e. one process may share a core with another process by executing instructions in turns. The more cores available the more processes run concurrently. Unused processes can be swept out from memory to disk (swap space), and loaded when needed. One can run many more processes than available cores, but if they all are CPU-intensive loss of performance will occur due to a cost of switching between processes. Perl for Biologists 1.2 8

Running programs: processes Process is always created by system, but the creation may be requested by another process (e.g. system() function). The are three ways for creating processes: Invoking a new program on the same machine e.g. system() Invoking a new program on different (remote) machine e.g. ssh Creating a clone of current process (always on the same machine) fork() function Perl for Biologists 1.2 9

Starting processes with system() system() SYSTEM loads new file and starts process system() will wait for Process B to finish before proceeding Process B Perl for Biologists 1.2 10

Starting processes with ssh SSH connects to machine2 SYSTEM on machine2 loads file and starts Process B SSH will wait for Process B to finish before proceeding Process B machine 2 Perl for Biologists 1.2 11

Starting processes with fork() Process B fork() SYSTEM clones Process A in memory fork() same PID new PID Process B fork() NO waiting fork() Perl for Biologists 1.2 12

Running programs: processes Processes may talk to each other using Pipes output from one may be input to another Files one program may write and other read the same file Signals- special, pre-defined, messages passed from one to another process via OS (SIGINT = CTRL-C, SIGTERM, SIGKILL etc.). Special libraries- there are libraries specializing in interprocess communication (Message Passing Interface, MPI) Perl for Biologists 1.2 13

Threads There is a way to execute more than one set of commands inside the same process by using more than one execution queue. Each separate execution queue is a thread, they all have access to the same memory, environment etc. inside the process. Threads Perl for Biologists 1.2 14

Threads single thread process multithread process Copied from https://computing.llnl.gov/tutorials/pthreads/ Perl for Biologists 1.2 15

Multi-processing (distributed): Parallel execution Multiple separate processes on the same or different machines coordinating their execution via passing data or messages. They cannot read each others memory. start Process 1 Process 2 Execution queue Instruction1 Instruction2 Instruction3 Instruction4 Execution queue Instruction1 Instruction2 Instruction3 Instruction4 Process 3 Execution queue Instruction1 Instruction2 Instruction3 Instruction4 Perl for Biologists 1.2 16

Multi-threading (shared memory): Parallel execution There is only one process with multiple execution queues inside. Threads communicate via messages or memory access each thread has access to the same memory area. start Execution queue 1 Execution queue 2 Execution queue 3 Instruction1 Instruction1 Instruction1 Instruction2 Instruction2 Instruction2 Instruction3 Instruction3 Instruction3 Instruction4 Instruction4 Instruction4 Perl for Biologists 1.2 17

Parallel execution Multi-processing (distributed) with multi-threading : Multiple separate processes on the same or different machines coordinating their execution via passing data or messages, each multithreaded. start Execution queue 1 Instruction1 Instruction2 Instruction3 Instruction4 Process 1 Process 2 Execution queue 2 Instruction1 Instruction2 Instruction3 Instruction4 Execution queue 1 Instruction1 Instruction2 Instruction3 Instruction4 Execution queue 2 Instruction1 Instruction2 Instruction3 Instruction4 Execution queue 1 Instruction1 Instruction2 Instruction3 Instruction4 Process 3 Execution queue 2 Instruction1 Instruction2 Instruction3 Instruction4 Perl for Biologists 1.2 18

Parallel execution Deciding between multithreading and multiprocessing usually comes down to I/O versus CPU requirements. CPU intensive Very good for same machine multiprocessing. I/O intensive Good for same machine multithreading, or multiprocessing on SEPARATE machines (each of them does its own I/O). I/O intensive and CPU intensive Best for multiprocessing on SEPARATE machines, each process multithreaded. Perl for Biologists 1.2 19

fork() This function clones the current process the new process is identical to the previous one, except it has a new PID. On the original process (master) the function returns the new PID of the new process (child). On the new process (child) the function returns 0. On error the function returns -1. Perl for Biologists 1.2 20

fork() my $pid = fork(); if($pid < 0) #error print "ERROR: Cannot fork $!\n"; #further error handling code exit; elsif($pid == 0) #child code child_exec(); exit; else #master code master_exec(); exit; Perl for Biologists 1.2 21

multi-process parallelization example: same machine we have a set of tasks that do not communicate with each other during execution our program should execute them on one machine in parallel, with maximum number of processes limited to a pre-defined number for this example will use a toy child execution: each child should sleep a random number of seconds between 5 and 20. Perl for Biologists 1.2 22

multi-process parallelization example: algorithm Master process will be devoted to process control Master will created child processes, up to maximum allowed limit Child processes will execute the work part Master will monitor child processes, when a child process finishes, master will create another child process if there are unprocessed tasks left Perl for Biologists 1.2 23

waitpid() This function is used to monitor condition of child process my $result = waitpid($pid, WNOHANG); WNOHANG flag tells waitpid to return the status of the child process with given pid$pid WITHOUT waiting for this process to complete. If the return value is less or equal 0 the child process still exists and is running. Otherwise the return value is the completion code of the process same as system(). Perl for Biologists 1.2 24

script1.pl (1) #!/usr/local/bin/perl use POSIX ":sys_wait_h"; need this module for fork() my $maxtask = 8; my $ntasks = 15; #initial fork child processes my @procs; my $task = 0; print "STARTING: maxtask=$maxtask ntasks=$ntasks\n"; for(my $i=0; $i<$maxtask; $i++) $task++; print "starting child $i task $task "; $procs[$i] = start_task($i, $task, @procs); print " pid $procs[$i]\n"; #waiting for child processes to finish and execute remaining tasks in their place Perl for Biologists 1.2 25

script1.pl (2) while(1) sleep(1); #there is no need to check every millisecond - it would use too much CPU my $n=0; for(my $i=0; $i<=$#procs; $i++) if($procs[$i]!= 0) my $kid = waitpid($procs[$i], WNOHANG); if($kid <= 0) $n++; #process exists else print "Child ". ($i+1). " finished (pid=". $procs[$i]. ")\n"; $procs[$i] = 0 ; if($task < $ntasks) $task++; $procs[$i] = start_task($i, $task, @procs); print " child ". ($i+1). " restarted for task $task with pid $procs[$i] \n"; $n++; if($n==0)last; print "ALL DONE\n"; waitpid() checks the status of a child process, WNOHANG flag means no waiting just immediate status return Perl for Biologists 1.2 26

script1.pl (3) sub start_task my ($i, $task, @procs) = @_; my $pid = fork(); if($pid < 0) #error print "\n\nerror: Cannot fork child $i\n"; for(my $j=0; $j<=$#procs; $j++) system("kill -9 ". $procs[$j]); exit; if($pid == 0) #child code child_exec($i+1, $task); exit; #master - continue, $pid contains child pid return $pid; Perl for Biologists 1.2 27

script1.pl (4) sub child_exec my ($num, $tasknum) = @_; my $seed = time * $num * $num; srand($seed); $nsec = int(rand(20)) + 5; #print "CHILD $num: task number $tasknum, wait time $nsec sec\n"; sleep($nsec); Perl for Biologists 1.2 28

multi-process parallelization example: same machine PAML simulation we have a set of genes and we want to run PAML program on each of them our program should execute them on one machine in parallel, with maximum number of processes limited to a pre-defined number The program needs to prepare a directory to run each gene, then execute PAML in parallel Perl for Biologists 1.2 29

multi-process parallelization example: same machine PAML simulation The program should be run from local directory on the workstation, i.e. /workdir Here is the script used to set up the local data #!/bin/bash mkdir /workdir/jarekp cd /workdir/jarekp tar -xzf /home/jarekp/perl_13/paml4.7.tgz cp -R /home/jarekp/perl_13/seqdata. cp /home/jarekp/perl_13/mytree.dnd. cp /home/jarekp/perl_13/template.control. Perl for Biologists 1.2 30

#!/usr/local/bin/perl use POSIX ":sys_wait_h"; script2.pl (1) my $maxtask = 8; my $ntasks = 0; my @tasklist; #PAML parameters my $pamlcmd = "/workdir/jarekp/paml4.7/bin/codeml"; my $template_control_file = "template.control"; my $tree = "mytree.dnd"; my $datadir = "seqdata"; my $outdir = "results"; if(! -e $template_control_file) print "Template control file is NOT available\n"; exit; if(! -e $tree) print "Tree file is NOT available\n"; exit; if(! -d $datadir) print "The data directory is NOT available\n"; exit; Perl for Biologists 1.2 31

if(! -d $outdir) mkdir $outdir; script2.pl (2) print "Preparing PAML and input files\n"; #put template control files in a variable. #a new control file will be created for each gene open (IN, $template_control_file) or die "The template control file $template_control_file cannot be opened"; my $template_control_content = ""; while (<IN>) $template_control_content.= $_; close IN; #get the list of files in seqdata opendir(my $dh, $datadir) die "can't opendir $datadir: $!"; my @alnfiles; foreach my $entry (readdir($dh)) if($entry =~ /\.phy$/) push(@alnfiles, $entry); closedir $dh; Perl for Biologists 1.2 32

#prepare directories - one for each gene #store their names in @tasklist foreach my $alnfile (@alnfiles) my $seqid = $alnfile; $seqid =~ s/\..+//; script2.pl (3) my $outseqdir = $outdir. "/". $seqid; if (! -d $outseqdir) mkdir $outseqdir; system ("cp $tree $outseqdir/" ); system ("cp $datadir/$alnfile $outseqdir/" ); my $mycontrolfile = $template_control_content; $mycontrolfile=~s/xxxxxx/$alnfile/; open CC, ">$outseqdir/my.control"; print CC $mycontrolfile; close CC; push(@tasklist, $outseqdir); my $ntasks = $#tasklist + 1; print "We have $ntasks genes to run\n"; Perl for Biologists 1.2 33

#initial fork child processes script2.pl (4) my @procs; my $task = 0; print "STARTING: maxtask=$maxtask ntasks=$ntasks\n"; for(my $i=0; $i<$maxtask; $i++) $task++; print "starting child ". ($i+1). " task $task\n"; $procs[$i] = start_task($tasklist[$task-1], $pamlcmd, @procs); print " child ". ($i+1). " task $task started with pid $procs[$i]\n"; #waiting for child processes to finish and execute remaining tasks in their place while(1) sleep(1); #there is no need to check every millisecond-it would use too much CPU my $n=0; for(my $i=0; $i<=$#procs; $i++) if($procs[$i]!= 0) my $kid = waitpid($procs[$i], WNOHANG); if($kid <= 0) #process exists $n++; Perl for Biologists 1.2 34

else print "Child ". ($i+1). " finished (pid=". $procs[$i]. ")\n"; $procs[$i] = 0 ; if($task < $ntasks) $task++; $procs[$i] = start_task($tasklist[$task-1], $pamlcmd, @procs); script2.pl (5) print " child ". ($i+1). " restarted for task $task with pid $procs[$i]\n"; $n++; if($n==0)last; print "ALL DONE\n"; sub child_exec my ($outseqdir, $pamlcmd) = @_; chdir($outseqdir); system("$pamlcmd my.control >& log"); Perl for Biologists 1.2 35

sub start_task my ($taskstr, $pamlcmd, @procs) = @_; script2.pl (6) my $pid = fork(); if($pid < 0) #error print "\n\nerror: Cannot fork child $i\n"; for(my $j=0; $j<=$#procs; $j++) system("kill -9 ". $procs[$j]); exit; if($pid == 0) #child code child_exec($taskstr, $pamlcmd); exit; #master - continue, $pid contains child pid return $pid; Perl for Biologists 1.2 36

multi-process parallelization example: different machines BLAST search We have a set of sequences in one fastafile (sequences.fasta) Our program should BLAST them against RefSeqmammalian RNA on a pre-defined set of machines in parallel, each BLAST should use multiple cores on each machine (BLAST supports multithreading) The fastafile should be split into blocks, each containing 5 sequences Each block should be searched as one unit on each machine Perl for Biologists 1.2 37

>ssh-keygen-t rsa ssh without password in BioHPC Lab (between workstations) (use empty password when prompted) >cat.ssh/id_rsa.pub >>.ssh/authorized_keys >chmod 640.ssh/authorized_keys >chmod 700.ssh Before running parallel script make sure you can sshfrom your master machine to all machines listed in the script Perl for Biologists 1.2 38

#!/usr/local/bin/perl use POSIX ":sys_wait_h"; script3.pl (1) my $maxtask = 4; my @machines = qw(localhost cbsum1c2b015 cbsum1c2b014 cbsum1c2b012); my $ntasks = 0; my @tasklist; my $exe = "/home/jarekp/perl_13/script3exe.pl"; my $datadir = "/home/jarekp/perl_13/data3"; my $outdir = "/home/jarekp/perl_13/out3"; my $fastafile = "/home/jarekp/perl_13/sequences.fa"; my $blocksize = 5; Perl for Biologists 1.2 39

print "Splitting large fasta file\n"; script3.pl (2) open (IN, $fastafile) or die "Fasta file $fastafile cannot be opened"; my $scount = 0; my $block = 0; my $line = 0; while (my $txt=<in>) $line++; if($txt =~ /^>/)$scount++; if(($scount-1)%$blocksize == 0 && ($line-1)%2 == 0) if($block>0)close(block); $block++; open BLOCK, ">$datadir/block$block"; push(@tasklist, "block$block"); print BLOCK $txt; close IN; close BLOCK; my $ntasks = $#tasklist + 1; print "We have $ntasks blocks to blast\n"; Perl for Biologists 1.2 40

script3.pl (3) #initial fork child processes my @procs; my $task = 0; print "STARTING: maxtask=$maxtask ntasks=$ntasks\n"; for(my $i=0; $i<$maxtask; $i++) $task++; print "starting child ". ($i+1). " task $task\n"; $procs[$i] = start_task($tasklist[$task-1], $machines[$i], $exe, @procs); print " child ". ($i+1). " task $task started with pid $procs[$i]\n"; #waiting for child processes to finish and execute remaining tasks in their place while(1) sleep(1); #there is no need to check every millisecond-it would use too much CPU my $n=0; for(my $i=0; $i<=$#procs; $i++) if($procs[$i]!= 0) my $kid = waitpid($procs[$i], WNOHANG); if($kid <= 0) #process exists $n++; Perl for Biologists 1.2 41

script3.pl (4) else print "Child ". ($i+1). " finished (pid=". $procs[$i]. ")\n"; $procs[$i] = 0 ; if($task < $ntasks) $task++; $procs[$i] = start_task($tasklist[$task-1], $machines[$i], $exe, @procs); print " child ". ($i+1). " restarted for task $task with pid $procs[$i]\n"; $n++; if($n==0)last; print "ALL DONE\n"; Perl for Biologists 1.2 42

sub start_task my ($datafile, $machine, $cmd, @procs) = @_; script3.pl (5) my $pid = fork(); if($pid < 0) #error print "\n\nerror: Cannot fork child $i\n"; for(my $j=0; $j<=$#procs; $j++) system("kill -9 ". $procs[$j]); exit; if($pid == 0) #child code child_exec($datafile, $machine, $cmd); exit; #master - continue, $pid contains child pid return $pid; Perl for Biologists 1.2 43

script3.pl (6) sub child_exec my ($datafile, $machine, $cmd) = @_; if($machine eq "localhost") system("$cmd $datafile"); else system("ssh $machine $cmd $datafile"); Perl for Biologists 1.2 44

#!/usr/local/bin/perl script3exe.pl my $seqfile = $ARGV[0]; my $tempdir = "/workdir/jarekp"; my $local_dbdir = "/workdir/jarekp/db"; my $dbname = "RefSeq_mammalian.rna"; my $orig_dbdir = "/shared_data/genome_db/blast_ncbi"; my $datadir = "/home/jarekp/perl_13/data3"; my $outdir = "/home/jarekp/perl_13/out3"; if(! -e $tempdir)mkdir $tempdir; if(! -e $local_dbdir)mkdir $local_dbdir; if(! -e "$local_dbdir/$dbname.nal") system("cp -f $orig_dbdir/$dbname.* $local_dbdir"); my $temprundir = $tempdir. "/". $seqfile; mkdir $temprundir; chdir($temprundir); system("cp $datadir/$seqfile."); system("blastall -a 8 -d $local_dbdir/$dbname -i $seqfile -o $seqfile.out -p blastn >\& $seqfile.log"); system("cp $seqfile.out $outdir"); system("cp $seqfile.log $outdir"); chdir(".."); system("rm -rf $seqfile"); Perl for Biologists 1.2 45

Exercises 1. Modify script1.pl so it executes commands (tasks) from a file (one task per line), each in separate directory. Print out total execution time of all tasks and average execution time per task (in seconds). Perl for Biologists 1.2 46