Applications of Grid Computing in Genetics and Proteomics


Jorge Andrade 1, Malin Andersen 1,2, Lisa Berglund 1, and Jacob Odeberg 1,2

1 Department of Biotechnology, Royal Institute of Technology (KTH), AlbaNova University Center, SE-106 91 Stockholm, Sweden
{jorge, jacob, malina}@biotech.kth.se, lisaber@kth.se
http://www.biotech.kth.se
2 Department of Medicine, Atherosclerosis Research Unit, King Gustaf V Research Institute, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden

Abstract. The potential for Grid technologies in applied bioinformatics is largely unexplored. We have developed a model for solving computationally demanding bioinformatics tasks in distributed Grid environments, designed to be easy to use for scientists unfamiliar with Grid computing. With a script-based implementation that temporarily installs databases and existing executables on remote nodes at submission, we propose a generic solution that does not rely on predefined Grid runtime environments and that can easily be adapted to other bioinformatics tasks suitable for parallelization. This implementation has been successfully applied to whole-proteome sequence similarity analyses and to genome-wide genotype simulations, where computation time was reduced from years to weeks. We conclude that computational Grid technology is a useful resource for solving compute-intensive tasks in genetics and proteomics using existing algorithms.

1 Introduction

Bioinformatics is a relatively new field of biological research involving the integration of computers, software tools, and databases in an effort to address biological questions. Areas include human genome research, simulations of biological and biochemical processes, and proteomics (for example, protein folding simulations).

With the increasing amount and complexity of data in genomics and genetics generated by today's high-throughput screening technologies, and with the development of advanced algorithms for mining complex data, computational power now sometimes defines the practical limit. High-performance computing or alternative solutions are required to undertake the intensive data processing and analysis. Grid computing [1] offers a model for solving massive computational problems by subdividing the computation into a set of small jobs, executed in parallel on geographically distributed resources. However, the current job management process in Grid environments is relatively complex and non-automated. Biologists who want to take advantage of Grid resources must manually submit their jobs, periodically check the resource broker for the status of the jobs (Submitted, Ready, Scheduled, Running, or Finished), and finally retrieve the results with a raw file transfer from the remote storage area or remote worker to the local file system of their user interface.

Different solutions for increasing the usability, scalability and stability of computational Grids have recently been proposed [2], [3]. The implementation presented here is a model that greatly facilitates access to and utilization of Grid resources, allowing biologists and other non-Grid experts to exploit the power of the Grid without necessarily knowing Grid-related details and procedures. The utility of this implementation is demonstrated by application to two computationally expensive bioinformatics tasks: whole-proteome sequence similarity analysis and genotype simulations for genome-wide linkage analysis.

B. Kågström et al. (Eds.): PARA 2006, LNCS 4699, pp. 791–798, 2007. © Springer-Verlag Berlin Heidelberg 2007

2 Methods

To make interaction with the complex computational environments of Grids more straightforward for biologically oriented scientists, the following tasks were automated:

Proxy setup handles the authentication of the user as a member of a Virtual Organization (VO) and grants the user access to the Grid resources. By default, the proxy is valid for twelve hours. When the proxy expires, the creation of a new proxy is automatically scheduled in the local Grid client.

Job submission involves the remote distribution of the split input data files or databases, as well as the executable binary files, to the Grid workers. For each Grid job submitted, a Grid job specification is created using the Resource Specification Language (RSL).

Processing. After job submission, a temporary local installation of datasets and executables is performed on the allocated remote nodes.
After that, parallel execution is started on the remote nodes, and the status of each job is monitored continuously. Job re-submission in case of job failure or excessive delay in Grid queue systems is also handled.

Job collection. When specific Grid jobs are finished, partial results are downloaded from the remote Grid workers to the local computer. This module can also retrieve several finished jobs in parallel.

Figure 1 shows a graphical description of the Grid framework configuration used for this implementation.

3 Implementation

A Perl script-based Grid broker that ensures unique user authentication was implemented, allowing the user to remotely deploy and execute pre-existing algorithms or software across available Grid resources at submission time. The presented solution is tailored to NorduGrid ARC [4], but can easily be adapted to any Globus-based Grid middleware.
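
The automated tasks listed in Section 2 (splitting the input, submitting jobs, monitoring them, re-submitting failures and collecting partial results) can be illustrated with a small simulation. The Python sketch below is not the authors' Perl broker and makes no calls to NorduGrid ARC: `FakeGrid`, `partition` and `run_on_grid` are hypothetical names introduced only for this illustration, job failures are drawn at random, and the "computation" is a placeholder that upper-cases each record.

```python
import random


def partition(records, n_jobs):
    """Split records into n_jobs contiguous chunks of nearly equal size."""
    base, extra = divmod(len(records), n_jobs)
    chunks, start = [], 0
    for i in range(n_jobs):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


class FakeGrid:
    """Hypothetical in-memory stand-in for a Grid client (assumption)."""

    def __init__(self, failure_rate=0.3, seed=0):
        self._rng = random.Random(seed)
        self._failure_rate = failure_rate
        self._jobs = {}       # job id -> input chunk
        self._next_id = 0

    def submit(self, chunk):
        self._next_id += 1
        job_id = f"job-{self._next_id}"
        self._jobs[job_id] = chunk
        return job_id

    def status(self, job_id):
        # For brevity, every poll reports either FAILED or FINISHED.
        failed = self._rng.random() < self._failure_rate
        return "FAILED" if failed else "FINISHED"

    def download(self, job_id):
        # Placeholder result: the upper-cased input records.
        return [record.upper() for record in self._jobs.pop(job_id)]


def run_on_grid(records, n_jobs, grid):
    """Submit all chunks, poll until every job is collected, merge results."""
    chunks = partition(records, n_jobs)
    pending = {grid.submit(chunk): i for i, chunk in enumerate(chunks)}
    results = [None] * n_jobs
    resubmissions = 0
    while pending:
        for job_id in list(pending):      # snapshot: pending changes below
            state = grid.status(job_id)
            if state == "FAILED":
                idx = pending.pop(job_id)             # drop the old job id
                pending[grid.submit(chunks[idx])] = idx   # store the new one
                resubmissions += 1
            elif state == "FINISHED":
                idx = pending.pop(job_id)
                results[idx] = grid.download(job_id)
    # Concatenate partial results in the original chunk order.
    merged = [record for part in results for record in part]
    return merged, resubmissions


if __name__ == "__main__":
    records = [f"seq{i}" for i in range(10)]
    merged, retries = run_on_grid(records, n_jobs=3, grid=FakeGrid(seed=1))
    print(len(merged), "records collected after", retries, "re-submissions")
```

Keying the pending jobs by job identifier and storing the chunk index as the value means a re-submitted job simply replaces its old identifier, and the merged output keeps the input order regardless of the order in which jobs finish.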

Fig. 1. Grid computing framework for application in bioinformatics

This implementation can be adapted to any task suitable for parallelization for which a Linux executable exists. It consists of two Perl scripts:

gridjobsetup.pl. Manages two main tasks. First, the large, computationally expensive task is partitioned into a user-selected number of smaller, equally sized atomic jobs, each corresponding to a fraction of the total data. Second, for each data fraction, a Grid job specification is created using the Resource Specification Language (RSL).

gridbroker.pl. This is the Grid broker. Its function is to manage the submission, monitoring and collection of the Grid jobs. Following node allocation and job submission, gridbroker.pl performs temporary installations of the deployed executable on the Grid nodes/remote workers, and parallel execution of the Grid jobs is started. gridbroker.pl constantly monitors the parallel execution of the distributed tasks. In case of job failure, or if a job or set of jobs is excessively delayed in the work-queue scheduler, gridbroker.pl resubmits the job or set of jobs to different available Grid workers. When jobs reach the status "finished", the job results are downloaded to the user's local file system in forked processes. The partial Grid job results are finally concatenated to generate the output file.

Part of the broker logic is outlined below. The first algorithm is a loop that manages the submission of a user-defined number of Grid jobs; a vector of Grid job identifiers is created in memory and in an archive file. This vector is then used to manage the monitoring and downloading of the jobs. A log file that registers submission start and finish times is also created.

Fraction of the Algorithm that Manages the Submission of Grid Jobs

Input: XRSL specification(s) of a number of Grid jobs; for each Grid job, a set of specific input parameters.
Action: Submit the given number of Grid jobs.
Output: Vector of job ids and a file with timings.

1. Process the XRSL specification
2. Create a time-log-file and register the start of submission
3. Create and open a job-id-file
4. For each job:
   (a) Select the cluster(s) to which the job will be submitted
   (b) Submit the job
   (c) Collect the retrieved job-id
   (d) Push the collected job-id onto a vector
   (e) Append the collected job-id to the job-id-file
5. Register the end of submission in the time-log-file
6. Close the time-log-file
7. Close the job-id-file

Fraction of the Algorithm that Manages the Monitoring and Downloading of Finished Grid Jobs

The following algorithm constantly monitors the status of the jobs using the previously created vector of job identifiers. Failed jobs are re-submitted, and jobs that have successfully reached the status "finished" are downloaded.

Input: job-id vector and job-id-file.
Action: Monitoring and collection of Grid jobs, with re-submission on job failure.
Output: Collection of finished Grid jobs and time-log-file.

1. While number of downloaded jobs < number of total Grid jobs submitted:
2.   For each job:
     (a) Monitor the status of job-id[i] in the vector
     (b) If the status of job-id[i] is "FAILURE" then:
         i. Re-submit job-id[i] to an available Grid cluster
         ii. Delete the old job-id from the vector and push the newly retrieved job-id
         iii. Delete the old job-id from the job-id-file and append the new job-id
         iv. Register the re-submission time in the log file
     (c) If the status of job-id[i] is "FINISHED" then:
         i. Collect job-id[i] and register the time
         ii. Remove job-id[i] from the vector of job ids
         iii. Remove job-id[i] from the job-id-file
         iv. Increase the counter of downloaded jobs
3. Register the end of job collection and close the log file

4 Results

We have aimed to develop a generic Grid implementation for solving bioinformatics tasks suitable for parallelization, in which neither pre-selection of available Grid nodes nor pre-installation of software or databases is necessary. Existing Linux-based executables can be used when scaling up tasks that are prohibitively time-consuming on single workstations, as our solution requires no re-coding or programming modifications. The implementation is also applicable in situations where the source code is not available. To streamline the process, we chose the strategy of making temporary installations of the executable and databases locally at each remote node at submission, followed by un-installation after download and collection of the results. By avoiding the need for predefined run-time environments, this implementation limits the interaction with Grid administrators for installation of applications/software and updates, thereby accommodating dynamic Grid environments in which the available nodes change between submissions. This strategy is, however, not applicable when, for instance, a database management system (DBMS) is required; typical DBMSs such as Oracle, Microsoft SQL Server or MySQL necessarily require a specific run-time environment.

Our implementation was evaluated in two highly compute-intensive real applications in proteomics and genetics. The first application deals with whole-proteome protein similarity analysis using a sliding window algorithm [5].
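
To make the sliding-window idea concrete, the following Python sketch enumerates the fragments of a protein sequence and counts them. It is an illustration under stated assumptions rather than the algorithm of [5]: the function names are hypothetical, and the step of one residue is an assumption, chosen because it reproduces the count of 950 searches reported in this paper for a 1000-residue protein with a window of 51 amino acids.

```python
def sliding_windows(sequence, window=51, step=1):
    """Yield (start, fragment) for every full-length window in sequence."""
    for start in range(0, len(sequence) - window + 1, step):
        yield start, sequence[start:start + window]


def window_count(length, window=51, step=1):
    """Number of full-length windows in a sequence of the given length."""
    if length < window:
        return 0
    return (length - window) // step + 1


if __name__ == "__main__":
    protein = "A" * 1000  # placeholder sequence of 1000 residues
    fragments = list(sliding_windows(protein))
    print(len(fragments), window_count(1000))
```

Each fragment would be submitted as a separate blastp query, which is why the number of searches grows with protein length rather than with the number of proteins alone.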
In contrast to ordinary blastp queries, which align full-length query protein sequences, the sliding window approach results in a significantly higher number of BLAST searches. Using a sliding window size of 51 amino acids, the number of blastp searches for a 1000 amino acid protein increases from 1 to 950. For the entire human Ensembl database [http://www.ensembl.org] of close to 34,000 human proteins, this corresponds to about 15,000,000 blastp searches. The time needed to run this number of blastp searches on a single computer was about eight weeks. As the Ensembl database is constantly evolving, with protein sequences being added, changed or deleted, frequent reprocessing of the database becomes necessary in the HPA program [http://www.proteinatlas.org] in order to work with the most accurate data at any one time. Once a new version of the database is released, the sequence similarity data on which the epitope design is based also needs to be updated. The computational requirements for this task exceeded in-house resources if the processed results of a database update were to be delivered before the update itself became obsolete. With a Grid implementation in which local installations of both the blastp executable and the entire Ensembl database were performed on each node (a total package of 16 MB) [5], runtime was reduced from about eight weeks on one single up-to-date computer to less than 24 hours using 300 Grid nodes in Swegrid [http://www.swegrid.se]. The absolute speed-up for this application was calculated as:

$$S_p = \frac{T_1^s}{T_p} \qquad (1)$$

where $T_1^s$ is the sequential run-time and $T_p$ is the execution time on $p$ Grid nodes. Using the complete human Ensembl database as input, a speed-up of 56-fold was achieved, calculated by dividing $T_1^s = 1344$ hours by $T_p = 24$ hours (the Grid run-time with the same input data on 300 Grid processors in Swegrid). The expected linear speed-up (300-fold on 300 nodes) was not achieved, mainly due to Grid latency. Installing the database locally at each submission combines the speed of querying a local database with the benefit of always running against the most recent update. The alternative strategy of storing the database in one single Grid storage resource accessed by all the other nodes created an I/O overload in the Grid storage server, resulting in a significant increase in the total runtime.

The second application facilitated computer simulations of genotypes using HMM-based software [6], in order to evaluate the significance of genome-wide linkage data. This was applied in a study aimed at identifying novel genes involved in the pathogenesis of Alzheimer's disease (AD) by performing a non-parametric multipoint linkage analysis on AD families from the relatively genetically homogeneous Swedish population. On a genome-wide scale, this task is extremely computationally intensive.
In the absence of sufficient computational resources, the number of simulations would therefore have to be limited, which could lead to poorly estimated global significance levels and false positive linkage claims. We developed Grid-Allegro [7], which was used in the hypothesis testing to evaluate the statistical significance of the linkage data under the null hypothesis of no linkage, using a set of 109 AD families. The time required to perform the minimum of 22,000 genotype simulation analyses was reduced from a projected serial execution time of more than 3 years on a single up-to-date CPU to less than 3 days when the computation was distributed over 600 Grid workers in Swegrid [7].

5 Discussion

Several computationally demanding algorithms and tasks in bioinformatics may cause a computational overload when scaled up. To the researcher without in-house access to expensive resources such as dedicated clusters or computer farms, Grids represent a cost-effective and powerful resource. However, a current obstacle, especially for the biologically oriented researcher, is managing middleware that is still raw and hardly accessible. For the non-computer scientist, more user-friendly alternative solutions are necessary. One alternative is to develop web-based front-end services on top of underlying Grid implementations, which are accessed by third party users. This is the most accessible way of exploiting Grid resources, as it involves minimal complexity and requires no previous knowledge of distributed computing from the user. Grid resource brokers and job submission services based on Grid and Web services have been proposed previously [8]. However, for our specific purposes, we decided to use a generic, script-based strategy for implementing Grid-aware applications of bioinformatics tasks that are suitable for parallelisation. Our major concerns were related to security, stability and usability. Although Grid security is based on public key infrastructure (PKI), and this architecture offers strong security levels for the Grid end-user, current PKI implementations suffer from serious usability issues, especially when applied to web-based Grid services [9]. Strong efforts are required to find new mechanisms for increasing the usability of Grid security [10]. Web-based implementations also confine the input submission format to those defined or envisioned by the provider/developer, which may reduce the flexibility for the third party user. Furthermore, web-based Grid implementations may require re-coding of previously existing single-CPU-oriented algorithm implementations. The developer assumes the administrator's responsibility for maintaining the availability of the resource and keeping it updated. When web-based services are developed and provided through large initiatives [11], this indeed represents a transparent and user-friendly solution.
However, new applications depend on continued development and implementation by these providers, and are hence not always available to meet the specific needs of individual third party projects. The alternative generic strategy, although requiring basic computer knowledge from the user, greatly increases flexibility by enabling the implementation to be applied to similar distributable, computationally demanding tasks.

In conclusion, our implementation facilitates the biologically oriented scientist's remote deployment and execution of pre-existing implementations of bioinformatics algorithms across multiple Grid resources. By applying this implementation to two data- and CPU-intensive tasks, we have demonstrated the potential utility of Grid technology for addressing highly computationally demanding bioinformatics tasks.

References

1. Foster, I., Kesselman, C., Tuecke, S.: The anatomy of the grid: Enabling scalable virtual organizations. International Journal of High Performance Computing Applications 15(3), 200–222 (2001)
2. Ellert, M., Konstantinov, B., Kónya, J., Lindemann, J., Livenson, I., Nielsen, J., Smirnova, O., Wäänänen, A.: Advanced Resource Connector middleware for lightweight computational Grids. Future Generation Computer Systems. The International Journal of Grid Computing: Theory, Methods and Applications 23, 219–240 (2007)
3. Elmroth, E., Tordsson, J.: Grid Resource Brokering Algorithms Enabling Advance Reservations and Resource Selection Based on Performance Predictions. Future Generation Computer Systems. The International Journal of Grid Computing: Theory, Methods and Applications (2007)
4. Ellert, M., et al.: The NorduGrid project: using Globus toolkit for building GRID infrastructure. Nuclear Instruments & Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 502(2-3), 407–410 (2003)
5. Andrade, J., et al.: Using Grid technology for computationally intensive applied bioinformatics analyses. In Silico Biology 6 (2006)
6. Gudbjartsson, D.F., et al.: Allegro, a new computer program for multipoint linkage analysis. Nat Genet 25(1), 12–13 (2000)
7. Andrade, J., et al.: The use of Grid computing to drive data-intensive genetic research. European Journal of Human Genetics (March 21, 2007)
8. Elmroth, E., Tordsson, J.: An interoperable, standards-based Grid resource broker and job submission service. In: First International Conference on e-Science and Grid Computing, IEEE Computer Society Press, Los Alamitos (2005)
9. Gui, X.L., et al.: A grid security infrastructure based on behaviors and trusts. In: Grid and Cooperative Computing GCC 2004 Workshops, Proceedings, vol. 3252, pp. 482–489 (2004)
10. Beckles, B., Welch, V., Basney, J.: Mechanisms for increasing the usability of grid security. International Journal of Human-Computer Studies 63(1-2), 74–101 (2005)
11. Blanchet, C., et al.: GPS@ Bioinformatics Portal: from Network to EGEE Grid, vol. 2006, pp. 187–193. IOS Press, Amsterdam (2006)