Applications of Grid Computing in Genetics and Proteomics

Jorge Andrade 1, Malin Andersen 1,2, Lisa Berglund 1, and Jacob Odeberg 1,2

1 Department of Biotechnology, Royal Institute of Technology (KTH), AlbaNova University Center, SE-106 91 Stockholm, Sweden
{jorge, jacob, malina}@biotech.kth.se, lisaber@kth.se
http://www.biotech.kth.se
2 Department of Medicine, Atherosclerosis Research Unit, King Gustaf V Research Institute, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden

Abstract. The potential for Grid technologies in applied bioinformatics is largely unexplored. We have developed a model for solving computationally demanding bioinformatics tasks in distributed Grid environments, designed to ease usability for scientists unfamiliar with Grid computing. With a script-based implementation that uses a strategy of temporary installations of databases and existing executables on remote nodes at submission, we propose a generic solution that does not rely on predefined Grid runtime environments and that can easily be adapted to other bioinformatics tasks suitable for parallelization. This implementation has been successfully applied to whole-proteome sequence similarity analyses and to genome-wide genotype simulations, where computation time was reduced from years to weeks. We conclude that computational Grid technology is a useful resource for solving high-compute tasks in genetics and proteomics using existing algorithms.

1 Introduction

Bioinformatics is a relatively new field of biological research involving the integration of computers, software tools, and databases in an effort to address biological questions. Areas include human genome research, simulations of biological and biochemical processes, and proteomics (for example, protein folding simulations).
With the increasing amount and complexity of data in genomics and genetics generated by today's high-throughput screening technologies, and the development of advanced algorithms for mining complex data, computational power now sometimes defines the practical limit. High-performance computing or alternative solutions are required to undertake the intensive data processing and analysis. Grid computing [1] offers a model for solving massive computational problems by subdividing the computation into a set of small jobs, executed in parallel on geographically distributed resources. However, the current job management process in Grid environments is relatively complex and non-automated. Biologists who want to take advantage of

B. Kågström et al. (Eds.): PARA 2006, LNCS 4699, pp. 791–798, 2007. © Springer-Verlag Berlin Heidelberg 2007
Grid resources face a process of having to manually submit their jobs, periodically check the resource broker for the status of the jobs (Submitted, Ready, Scheduled, Running, or Finished), and finally retrieve the results with a raw file transfer from the remote storage area or remote worker to the local file system of their user interface. Different solutions for increasing the usability, scalability and stability of computational Grids have recently been proposed [2], [3]. The presented implementation represents a model by which access to and utilization of Grid resources is greatly facilitated, allowing biologists and other non-Grid-experts to exploit the power of the Grid without necessarily having knowledge of Grid-related details and procedures. The utility of this implementation is demonstrated by application to two computationally expensive bioinformatics tasks: whole-proteome sequence similarity analysis and genotype simulations for genome-wide linkage analysis.

2 Methods

In order to make interaction with the complex computational environments on Grids more straightforward for biologically oriented scientists, the following tasks were automated:

Proxy setup handles the user authentication as a member of a Virtual Organization (VO) and grants the user access to the Grid resources. By default, the proxy is valid for twelve hours. After the proxy expires, the task of creating a new proxy is automatically scheduled in the local Grid client.

Job submission involves the remote distribution of the split input data files or databases, as well as the executable binary files, to the Grid workers. For each Grid job submitted, a Grid job specification is created using the Resource Specification Language (RSL).

Processing. After job submission, a local temporary installation of datasets and executables on the allocated remote nodes is performed.
After that, parallel execution is started on the remote nodes, and the status of the current jobs is constantly monitored. Job re-submission in case of job failure or excessive delay in Grid queue systems is also handled.

Job collection. When specific Grid jobs are finished, partial results are downloaded from the remote Grid workers to the local computer. This module is also able to handle parallel retrieval of several finished jobs.

Figure 1 shows a graphical description of the Grid framework configuration used for this implementation.

3 Implementation

A Perl-script-based Grid broker that ensures unique user authentication was implemented, allowing the user to remotely deploy and execute pre-existing algorithms or software across available Grid resources at submission time. The presented solution is adjusted to NorduGrid ARC [4], but can easily be adapted to any Globus-based Grid middleware.
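For orientation, a job specification of the kind described above could look roughly like the following fragment in ARC's extended RSL (xRSL) dialect. This is an illustrative sketch only, not taken from the paper: the file names, wrapper script, job name and time limit are hypothetical.

```
&(executable="run_blastp.sh")
 (arguments="chunk_01.fasta")
 (inputFiles=("chunk_01.fasta" "")("ensembl_db.tar.gz" ""))
 (outputFiles=("result_01.out" ""))
 (jobName="blastp-part-01")
 (cpuTime="120")
 (stdout="stdout.txt")
 (stderr="stderr.txt")
```

Each data fraction would get its own such specification, pointing the same wrapper executable at a different input chunk, which is what allows the broker to submit the fractions as independent parallel jobs.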
Fig. 1. Grid computing framework for applications in bioinformatics

This implementation can be adapted to any task suitable for parallelization for which a Linux executable already exists. The implementation consists of two Perl scripts:

gridjobsetup.pl manages two main tasks. Firstly, the large, computationally expensive task is partitioned into a user-selected number of smaller, equally sized atomistic jobs, each corresponding to a fraction of the total data. Secondly, for each data fraction, a Grid job specification is created using the Resource Specification Language (RSL).

gridbroker.pl is the Grid broker. Its function is to manage the submission, monitoring and collection of the Grid jobs. Following node allocation and job submission, gridbroker.pl performs temporary installations of the deployed executable on the Grid nodes/remote workers, and parallel execution of the Grid jobs is started. gridbroker.pl constantly monitors the parallel execution of the distributed tasks; in the case of job failure, or if a job or set of jobs is excessively delayed in the work-queue scheduler, gridbroker.pl manages the resubmission of this job or set of jobs to different available Grid workers. When jobs reach the status of finished, a forked download of the specific job results to the user's local file system is performed. The partial Grid job results are finally concatenated to generate the output file.

A fraction of the Perl implementation of the broker is shown below. The code shows a loop that manages the submission of a user-defined number of Grid jobs; a vector of Grid job identifiers is created
in memory and in a file. This vector is then used to manage the monitoring and downloading of the jobs. A log file that registers submission start and finish times is also created.

Fraction of the algorithm that manages the submission of Grid jobs

Input: xRSL specification(s) of a number of Grid jobs; for each Grid job, a set of specific input parameters.
Action: Submit the given number of Grid jobs.
Output: Vector of job-ids and file with timings.

1. Process the xRSL specification
2. Create a time-log file and register the start of submission
3. Create and open a job-id file
4. For each job:
   (a) Select the cluster(s) to which the job will be submitted
   (b) Submit the job
   (c) Collect the retrieved job-id
   (d) Push the collected job-id onto a vector
   (e) Append the collected job-id to the job-id file
5. Register the end of submission in the time-log file
6. Close the time-log file
7. Close the job-id file

Fraction of the algorithm that manages the monitoring and downloading of finished Grid jobs

(The following algorithm shows the constant monitoring of job status using the previously created vector of job identifiers; in case of job failure, re-submission of jobs is performed, and jobs that have successfully reached the status of finished are downloaded.)

Input: Job-id vector and job-id file.
Action: Monitoring and collection of Grid jobs, and resubmission on job failure.
Output: Collection of finished Grid jobs and time-log file.

1. While number of downloaded jobs <= number of total Grid jobs submitted:
2.   For each job:
   (a) Monitor the status of vector entry job-id[i]
   (b) If the status of job-id[i] is "FAILURE" then:
      i. Re-submit job-id[i] to an available Grid cluster
      ii. Delete the old job-id and push the newly retrieved job-id
      iii. Delete the old job-id and append the new job-id to the job-id file
      iv. Register the re-submission time in the log file
   (c) If the status of job-id[i] is "FINISHED" then:
      i. Collect job-id[i] and register the time
      ii. Remove job-id[i] from the vector of job-ids
      iii. Remove job-id[i] from the job-id file
      iv. Increase the counter of downloaded jobs
3. Register the end of job collection and close the log file

4 Results

We have aimed to develop a generic Grid implementation for solving bioinformatics tasks suitable for parallelization where neither pre-selection of available Grid nodes nor pre-installation of software or databases is necessary. Existing Linux-based executables can be used when scaling up tasks prohibitively time-consuming to perform on single workstations, as our solution does not require re-codification or programming modifications. The implementation is also applicable in situations where the source code is not available. To streamline the process, we chose the strategy of making temporary installations of the executable and databases locally at each remote node at submission, followed by un-installation after download and collection of the results. By avoiding the need for predefined run-time environments, this implementation limits the interaction with Grid administrators for installation of applications/software and updates, thereby accommodating dynamic Grid environments in which the available nodes change between submissions. This strategy is, however, not applicable in cases where a database management system (DBMS) is required; typical DBMSs such as Oracle, Microsoft SQL Server or MySQL will necessarily require specific run-time environments. Our implementation was evaluated in two highly compute-intensive real applications in proteomics and genetics. The first application deals with whole-proteome protein similarity analysis using a sliding window algorithm [5].
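To illustrate why the sliding-window approach multiplies the number of searches, the following Python sketch (not part of the authors' Perl implementation) enumerates the overlapping fragments that would each become a separate blastp query, using the 51-amino-acid window size from the paper:

```python
def sliding_windows(sequence, window=51):
    """Yield every overlapping fragment of the given window size.

    A protein of length L yields L - window + 1 fragments,
    each of which becomes one blastp query.
    """
    for start in range(len(sequence) - window + 1):
        yield sequence[start:start + window]

# A 1000-residue protein gives 1000 - 51 + 1 = 950 windows,
# matching the figure quoted in the text.
protein = "A" * 1000  # placeholder sequence
print(len(list(sliding_windows(protein))))  # 950
```

The near-1000-fold increase in the number of queries per protein is what makes the whole-proteome analysis a natural candidate for the job-partitioning scheme described in Section 3.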
In contrast to ordinary blastp queries aligning full-length query protein sequences, the sliding-window approach results in a significantly higher number of blastp searches. Using a sliding window size of 51 amino acids, the number of blastp searches for a 1000-amino-acid protein increases from 1 to 950. For the entire human Ensembl database [http://www.ensembl.org] of close to 34,000 human proteins, this corresponds to about 15,000,000 blastp searches. The time needed to run this number of blastp searches on a single computer was about eight weeks. As the Ensembl database is constantly evolving and being updated, with protein sequences added, changed or deleted, frequent reprocessing of the database becomes necessary in the HPA program [http://www.proteinatlas.org] in order to work with the most accurate data at any one time. Once a new version of the database is released, the sequence similarity data on which the epitope design is based also needs to be updated. The computational requirements for this task exceeded in-house resources if the processed results of a database update were to be
delivered before they were obsolete. With a Grid implementation where local installations of both the blastp executable and the entire Ensembl database were performed on each node (a total package size of 16 MB) [5], runtime was reduced from about eight weeks on a single up-to-date computer to less than 24 hours using 300 Grid nodes in Swegrid [http://www.swegrid.se]. The absolute speed-up for this application was calculated as:

Sp = T1s / Tp   (1)

where T1s is the sequential run-time and Tp is the execution time on p Grid nodes. Using the complete human Ensembl database as input, a speed-up of 56-fold was achieved; this was calculated by dividing T1s = 1344 hours by Tp = 24 hours (the Grid run-time with the same data as input on 300 Grid processors in Swegrid). The expected linear speed-up (300-fold on 300 nodes) was not achieved, mainly due to Grid latency. By making a local installation of the database at each submission, the speed of running queries against a local database was obtained, together with the benefit of always running against the most recent update. The alternative strategy of storing the database on one single Grid storage resource accessed by all the other nodes proved to create an I/O overload on the Grid storage server, resulting in a significant increase in the total runtime. The second application was facilitating computer simulations of genotypes using HMM-based software [6], in order to evaluate the significance of genome-wide linkage data. This was applied in a study aimed at identifying novel genes involved in the pathogenesis of Alzheimer's disease (AD) by performing a nonparametric multipoint linkage analysis on AD families from the relatively genetically homogeneous Swedish population. On a genome-wide scale, this task is extremely computationally intensive.
In the absence of sufficient computational resources, the number of simulations would therefore have to be limited, which could lead to the estimation of insufficient global significance levels and to false positive linkage claims. We developed Grid-Allegro [7], which was used in the hypothesis testing to evaluate the statistical significance of the linkage data under the null hypothesis of no linkage, using a set of 109 AD families. The serial execution time required to perform the minimum required 22,000 genotype simulation analyses was reduced from the projected time of more than 3 years on a single up-to-date CPU to less than 3 days when the computation was distributed over 600 Grid workers in Swegrid [7].

5 Discussion

There are several computationally demanding algorithms and tasks in bioinformatics that may cause a computational overload when scaled up. To the researcher without access to expensive in-house resources such as dedicated clusters or computer farms, Grids represent a cost-effective and powerful resource. However, a current obstacle, especially for the biologically oriented researcher, is managing the middleware, which is still raw and hardly accessible. For the
Applications of Grid Computing in Genetics and Proteomics 797 non-computer scientist, more user-friendly alternative solutions are necessary. One alternative is to develop web-based user front-end services of underlying Grid implementations, which are accessed by third party users. This is the most accessible alternative of exploiting Grid resources, as it is associated with minimal complexity where no necessary previous knowledge of distributed computing is required by the user. Grid resource brokers and job submission services based on Grid and Web services have been previously proposed [8]. However, for our specific purposes, we decided to use a generic, script-based strategy for implementing Grid-aware applications of bioinformatics task that are suitable for parallelisation. Our major concerns were related with security, stability and usability. Although Grid security is based in public key infrastructure (PKI) and this architecture offers strong security levels for the Grid end-user, current PKI implementations suffer from serious usability issues, especially when applied to web-based Grid-services. [9] Strong efforts are required in searching for new mechanisms for increasing the usability of Grid security. [10] Web-based implementations also confine the input submission format to those defined or envisioned by the provider/developer, which may reduce the flexibility for the third party user. Furthermore, Web-based Grid implementations may require re-codification of previously existing single CPU-oriented algorithm implementations. The developer assumes the administrator responsibility for maintaining the availability and updating of the resource. When web-based services are developed and provided through large initiatives [11], this indeed represents a transparent and user-friendly solution. 
However, new applications depend on continued development and implementation by these providers, and are hence not always available to meet the specific needs of individual third-party projects. The alternative generic strategy, although requiring basic computer knowledge from the user, greatly increases flexibility by enabling the implementation to be applied to similar distributable, computation-demanding tasks. In conclusion, our implementation facilitates the biologically oriented scientist's remote deployment and execution of pre-existing codifications of bioinformatics algorithms across multiple Grid resources. By applying this implementation to solve two data- and CPU-intensive tasks, we have demonstrated the potential utility of Grid technology for addressing highly computationally demanding bioinformatics tasks.

References

1. Foster, I., Kesselman, C., Tuecke, S.: The anatomy of the Grid: Enabling scalable virtual organizations. International Journal of High Performance Computing Applications 15(3), 200–222 (2001)
2. Ellert, M., Konstantinov, A., Kónya, B., Lindemann, J., Livenson, I., Nielsen, J., Smirnova, O., Wäänänen, A.: Advanced Resource Connector middleware for lightweight computational Grids. Future Generation Computer Systems. The International Journal of Grid Computing: Theory, Methods and Applications 23, 219–240 (2007)
3. Elmroth, E., Tordsson, J.: Grid resource brokering algorithms enabling advance reservations and resource selection based on performance predictions. Future Generation Computer Systems. The International Journal of Grid Computing: Theory, Methods and Applications (2007)
4. Ellert, M., et al.: The NorduGrid project: using Globus toolkit for building Grid infrastructure. Nuclear Instruments & Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 502(2-3), 407–410 (2003)
5. Andrade, J., et al.: Using Grid technology for computationally intensive applied bioinformatics analyses. In Silico Biology 6 (2006)
6. Gudbjartsson, D.F., et al.: Allegro, a new computer program for multipoint linkage analysis. Nature Genetics 25(1), 12–13 (2000)
7. Andrade, J., et al.: The use of Grid computing to drive data-intensive genetic research. European Journal of Human Genetics (March 21, 2007)
8. Elmroth, E., Tordsson, J.: An interoperable, standards-based Grid resource broker and job submission service. In: First International Conference on e-Science and Grid Computing. IEEE Computer Society Press, Los Alamitos (2005)
9. Gui, X.L., et al.: A grid security infrastructure based on behaviors and trusts. In: Grid and Cooperative Computing, GCC 2004 Workshops, Proceedings, vol. 3252, pp. 482–489 (2004)
10. Beckles, B., Welch, V., Basney, J.: Mechanisms for increasing the usability of grid security. International Journal of Human-Computer Studies 63(1-2), 74–101 (2005)
11. Blanchet, C., et al.: GPS@ Bioinformatics Portal: from Network to EGEE Grid, pp. 187–193. IOS Press, Amsterdam (2006)