Dedicated to my Family


List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

Project - 1: Papers I and II describe work on application execution environments. In Paper I, we present tools for general-purpose solutions using portal technology, while Paper II addresses access to grid resources within an application-specific problem solving environment.

I. Erik Elmroth, Sverker Holmgren, Jonas Lindemann, Salman Toor, and Per-Olov Östberg. Empowering a Flexible Application Portal with a SOA-based Grid Job Management Framework. In Proc. 9th Workshop on State-of-the-art in Scientific and Parallel Computing (PARA 2008), Springer Lecture Notes in Computer Science (LNCS) 6126 6127.

II. Mahen Jayawardena, Carl Nettelblad, Salman Toor, Per-Olov Östberg, Erik Elmroth, and Sverker Holmgren. A Grid Enabled Problem Solving Environment for QTL Analysis in R. In Proc. 2nd International Conference on Bioinformatics and Computational Biology (BiCoB 2010), 2010. ISBN 978-1-880843-76-5.

Contributions: In this project I participated in the architecture design, the integration component implementation and the design of the QTL-specific interface in LAP. I have also participated in system deployment, running experiments and writing the articles.

Project - 2: Papers III, IV and V describe file-oriented distributed storage solutions. Paper III focuses on the architectural design of the Chelonia system, whereas Papers IV and V address stability, performance and identified issues.

III. Jon Kerr Nilsen, Salman Toor, Zsombor Nagy, and Bjarte Mohn. Chelonia A Self-healing Storage Cloud. In M. Bubak, M. Turala, and K. Wiatr, editors, CGW 09 Proceedings, Krakow, 2 2010. ACC CYFRONET AGH. ISBN 978-83-61433-01-9.

IV. Jon Kerr Nilsen, Salman Toor, Zsombor Nagy, and Alex Read. Chelonia: A self-healing, replicated storage system. Journal of Physics: Conference Series, 331(6):062019, 2011.

V. Jon Kerr Nilsen, Salman Toor, Zsombor Nagy, Bjarte Mohn, and Alex Read. Performance and Stability of the Chelonia Storage System. Accepted at the International Symposium on Grids and Clouds (ISGC) 2012.

Contributions: I did part of the system design and implementation. I also designed, implemented and executed the test scenarios presented in all the articles, and was heavily involved in the technical discussions and paper writing.

Project - 3: Papers VI and VII discuss a database-driven approach to managing data and the analysis requirements of scientific applications. Paper VI focuses on data management, whereas Paper VII presents a solution for data analysis.

VI. Salman Toor, Manivasakan Sabesan, Sverker Holmgren, and Tore Risch. A Scalable Architecture for e-Science Data Management. In Proc. 7th IEEE International Conference on e-Science. ISBN 978-1-4577-2163-2.

VII. Salman Toor, Andrej Andrejev, Andreas Hellander, Sverker Holmgren, and Tore Risch. Scientific Analysis by Queries in Extended SPARQL Over a Distributed e-Science Data Store. Submitted to The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2012).

Contributions: I did the architecture design, the interface implementation and the static partitioning for complex datatypes in Chelonia. I also participated in designing use-cases to demonstrate the system and in article writing.

Project - 4: Paper VIII also addresses a distributed storage solution. In this paper we explore a cloud-based storage solution for scientific applications.

VIII. Salman Toor, Rainer Töebbicke, Maitane Zotes Resines, and Sverker Holmgren. Investigating an Open Source Cloud Infrastructure for CERN-Specific Data Analysis. Accepted at the 7th IEEE International Conference on Networking, Architecture, and Storage (NAS 2012).

Contributions: I participated in enabling access from the ROOT framework to SWIFT and in the prototype system deployment. I worked on the design, implementation and execution of the test-cases presented, contributed to the technical discussion, and participated in paper writing.

Reproduced with the permission of the publishers, presented here in another format than in the original publication.

Contents

Part I: Introduction
1 Introduction
  1.1 Overview of Distributed Computing
    1.1.1 Communication Protocols
    1.1.2 Architectural Designs
    1.1.3 Frameworks for Distributed Computing
  1.2 Models for Scalable Distributed Computing Infrastructures
    1.2.1 Grid Computing
    1.2.2 Cloud Computing
    1.2.3 Grids vs Clouds
    1.2.4 Other Relevant Models
  1.3 Technologies for Large Scale Distributed Computing Infrastructures
Part II: Application Execution Environments
2 Application Environments for Grids
  2.1 Grid Portals
  2.2 Application Workflows
  2.3 The Job Management Component
  2.4 Thesis Contribution
    2.4.1 System Architecture
Part III: Distributed Storage Solution
3 Distributed Storage Systems
  3.1 Characteristics of Distributed Storage
  3.2 Challenges of Distributed Storage
  3.3 Thesis Contribution
    3.3.1 Chelonia Storage System
    3.3.2 Database Enabled Chelonia
    3.3.3 Cloud based Storage Solution
Part IV: Resource Allocation in Distributed Computing Infrastructures
4 Resource Allocation in Distributed Computing Infrastructures
  4.1 Models for Resource Allocation
  4.2 Thesis Contribution
Part V: Article Summary
5 Summary of Papers in the Thesis
  5.1 Paper-I
  5.2 Paper-II
  5.3 Paper-III
  5.4 Paper-IV
  5.5 Paper-V
  5.6 Paper-VI
  5.7 Paper-VII
  5.8 Paper-VIII
6 Svensk sammanfattning
7 Acknowledgments
References

List of Other Publications

These publications have been written during my PhD studies but are not part of the thesis. However, some of the material in publications I and II below is included in other papers in the thesis. Also, some of the conclusions in publication III are presented in Section 4.2 in the thesis summary.

I. Mahen Jayawardena, Salman Toor, and Sverker Holmgren. A grid portal for genetic analysis of complex traits. In Proc. 32nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Volume I, Rijeka, Croatia, 2009, pp. 281-284.

II. Mahen Jayawardena, Salman Toor, and Sverker Holmgren. Computational and visualization tools for genetic analysis of complex traits. Technical Report no. 2010-001, Department of Information Technology, Uppsala University.

III. Salman Toor, Bjarte Mohn, David Cameron, and Sverker Holmgren. Case-Study for Different Models of Resource Brokering in Grid Systems. Technical Report no. 2010-009, Department of Information Technology, Uppsala University.

List of Presentations

The material presented in this thesis has been presented at the following conferences/workshops:

Usage of LUNARC Portal in Bioinformatics. Presented at The NorduGrid Conference in Copenhagen, 2007.
Empowering a Flexible Application Portal with a SOA-based Grid Job Management Framework. Presented at the Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA 2008) in Trondheim, 2008.
Two presentations at the NorduGrid Workshop in Bern, 2008: Introduction to LUNARC Application Portal and Empowering LUNARC portal using GJMF.
Architecture of the Gateway Component of Chelonia Storage. Presented at the NorduGrid Conference in Budapest, 2008.
Efficient and Reliable Brokering System for ARC middleware. Poster presentation at The International Summer School on Grid Computing in Sophia Antipolis, Nice, 2009.
Joint demo presentation together with Jon Kerr Nilsen and Zsombor Nagy on the Chelonia Storage System at the EGEE User Forum in Barcelona, 2009. Video link: http://www.youtube.com/watch?v=neuwzghhghc
Architecture of Chelonia Storage. Presented at the Cracow Grid Workshop, 2009.
A Grid-Enabled Problem Solving Environment for QTL Analysis in R. Poster presentation at The EGEE User Forum in Uppsala, 2010.
ARC User Interfaces using LUNARC Portal. Presented at the NorduGrid Conference, Sundvolden, 2011.
Extension of Chelonia Storage System to handle databases. Presented at The Summer School at the International Center for Theoretical Physics (ICTP) in Trieste, 2011.
A Scalable Architecture for e-Science Data Management. Presented at the 7th IEEE e-Science Conference in Stockholm, 2011.
Performance and Stability evaluation of Chelonia Storage. Presented at The International Symposium on Grids and Clouds (ISGC), Taipei, March 2012.
Status report of Open Source Cloud Storage Infrastructure for CERN-Specific Data Analysis. Presented (via Skype) at the 3rd Workshop of COST Open Network for High-Performance Computing on Complex Environments, Genova, Italy, April 2012.
Investigating an Open Source Cloud Storage Infrastructure for CERN-Specific Data Analysis. Will be presented at the 7th IEEE International Conference on Networking, Architecture, and Storage (NAS 2012), Xiamen, China, June 2012.

Part I: Introduction

1. Introduction

Computational science plays a vital role in the rapid progress of commercial and scientific environments. Together with the tools, methods and techniques used in computational science, advancements in computational models have the potential of providing fundamental breakthroughs in this progress. Depending on the needs of the applications, different parallel and distributed computing models have been developed over time. To fulfill the ever-growing computational and storage needs of the applications, even more efficient, reliable and secure computing environments will be needed in the future. Applications from disciplines like engineering, astronomy, medicine, and biology require sustainable computational models which can fulfill their requirements for long periods of time. For example, applications using stochastic models [87] require several thousands of independent executions for a single experiment, implying that significant computational power is needed. Other examples are found in the field of bioinformatics, where multidimensional optimization problems must be solved to determine e.g. interacting genes [77]. In terms of data-intensive applications, the LHC experiments [22] running at CERN [12] require storage solutions managing petabytes of data. Similarly, the storage requirements for genome sequencing [45] are even beyond the petascale. In [74], a number of different data-intensive applications are presented which require unconventional solutions to meet their demands.

Distributed Computing Infrastructures (DCI) enable geographically distributed resources under autonomous administrative domains to be seamlessly, securely and reliably utilized by applications from various disciplines. During the last decades, a number of different projects have aimed at designing systems which enable efficient usage of geographically distributed resources to fulfill computational and storage requirements. Several models have been used to describe different distributed computing infrastructures, e.g. Utility Computing, Meta Computing, Scalable Computing, Internet Computing, Peer-to-Peer Computing, and Grid Computing. Service-oriented architecture enables Cloud Computing to focus on providing non-trivial quality of service for both computational and storage requirements. In principle, grid computing was the first concept that enabled use of large-scale distributed computing infrastructures. The idea of building a computational grid evolved from the concept of electric grids [65]. Under the headline of grid computing, issues of efficient, reliable and seamless access to geographically distributed resources have been extensively studied, and a number of production-level grids are today essential tools in different scientific disciplines.

After grid computing, computational and storage clouds have emerged to provide alternative options for flexible access to computing infrastructures. Cloud computing can be considered a successor of grid computing, adding some more advanced concepts essential to address a wider span of user communities. The work presented in this thesis is based on the grid and cloud computing paradigms. Three areas are studied: application environments, development and evaluation of storage solutions for grids and clouds, and efficient resource allocation in grids. Below, a brief introduction to the challenges studied in each field is given:

Application environments: When enabling distributed computing infrastructures, it has been realized that two major issues should be addressed. First, the monolithic design of applications needs to be modified. Second, more user-friendly and flexible application environments are required to execute and manage complex applications. A number of solutions have been proposed based on high-level client APIs, web application portals and workflow management systems. We have developed a general-purpose and an application-specific problem-solving environment based on the R framework [34], GJMF [58] and the LUNARC portal [86] for managing applications in DCI.

Storage solutions: The significance of storage systems in distributed computing is indisputable. The task of building a large-scale storage system using geographically distributed storage resources is non-trivial, and achieving production-level quality requires functionality such as security, scalability, a transparent view over the geographically distributed resources, simple and easy data access, and a certain level of self-healing capability where components can join and leave the system without affecting the system's availability. Designing solutions that address all of these features is still a challenge. We have developed and analyzed the Chelonia storage system [8]. Chelonia provides reliable, secure, efficient and self-healing file storage over geographically distributed storage nodes. Recently we have extended the capabilities of Chelonia by enabling databases at the storage nodes. The databases are specialized for scientific applications. By using a generalized database schema, Chelonia can handle simple (integer, real and string) and complex (arrays, matrices and tensors) datatypes using databases. We have also investigated the performance and scalability of an OpenStack storage [30] solution for CERN-specific data analysis.

Resource allocation: For grid systems, efficient selection of the execution or storage target within the set of available resources is one of the key challenges. The heterogeneous nature of the grid environment makes the task of resource discovery and selection cumbersome. A comprehensive view of the available resources requires up-to-date information, but the task of collecting this information is expensive and consumes network bandwidth. We have proposed a strategy of classifying the attributes of resources which helps in efficient resource discovery.
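The benefit of distinguishing between classes of resource attributes can be illustrated with a small, purely hypothetical sketch (this is not the classification strategy proposed in the thesis, and the site names and attributes are invented): static attributes such as the operating system rarely change and can be cached, while dynamic attributes such as the number of free cores must be collected fresh, so filtering on cached static attributes first reduces how much up-to-date information has to be gathered.

```python
# Hypothetical resource descriptions; "os" and "cores" are static
# (cacheable) attributes, "free_cores" is a dynamic attribute that
# would normally require querying the remote information system.
resources = [
    {"name": "siteA", "os": "linux",   "cores": 64,  "free_cores": 10},
    {"name": "siteB", "os": "linux",   "cores": 16,  "free_cores": 16},
    {"name": "siteC", "os": "windows", "cores": 128, "free_cores": 0},
]

def select(resources, os, min_free):
    # Cheap filter on cached static attributes first ...
    candidates = [r for r in resources if r["os"] == os]
    # ... then consult dynamic state only for the remaining candidates.
    return [r["name"] for r in candidates if r["free_cores"] >= min_free]

print(select(resources, "linux", 8))   # ['siteA', 'siteB']
```

In a real grid, the second step would involve contacting an information index rather than reading a local dictionary; the point is only that the static filter shrinks the set of resources whose dynamic state must be kept current.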

1.1 Overview of Distributed Computing

Network-enabled computational nodes are the fundamental building block of distributed computing. The concept allows researchers to use computational power far beyond what is available at a centralized facility. The goal of distributed computing is to build powerful and scalable solutions that enhance computational and storage capabilities. The two most commonly used network models for enabling distributed computing are the request/response model and the message queue approach. Message queues provide an asynchronous communication mode in which messages can be sent at any time, whereas a request/response system can be either synchronous or asynchronous. Client/server and peer-to-peer communication are both examples of request/response models. These models themselves introduce native ways of utilizing remote resources, and on top of these basic models, other solutions have been developed.

1.1.1 Communication Protocols

In the beginning, Remote Procedure Calls (RPC) [100] played a vital role in enabling distributed computing. RPC works at the transport and application layers of the Open Systems Interconnection (OSI) model of network communication. RPC provides interprocess communication where the processes can be on the local host or on a remote host. Communication using RPC is point-to-point. RPC hides the underlying communication details and provides high-level interfaces to access remote resources: it allows a normal procedure/function call to be executed in another process on a remote host. It works in client/server mode and requires synchronous communication. Other variants of interprocess communication include message queuing and IBM's Advanced Program-to-Program Communication (APPC). Simple Object Access Protocol (SOAP) [36] is an Extensible Markup Language (XML) [14] based protocol for applications to share structured information on the network.
It provides an envelope format for exchanging information using a communication protocol like the Hypertext Transfer Protocol (HTTP) or the Simple Mail Transfer Protocol (SMTP). Since HTTP is the standard communication protocol of the Internet, SOAP and HTTP together provide a standardized and widely used solution for communicating over wide area networks. One of the gains of using SOAP over HTTP is that it easily passes through network firewalls: HTTP is generally allowed on the network, and communicating over a specific port makes it possible to identify the incoming requests. Since SOAP messages are XML based and XML is accepted on almost all platforms, this approach allows communication in heterogeneous environments. On the other hand, because the verbose XML format is used, SOAP-based communication is slow.
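The marshalling step that both RPC systems and SOAP revolve around can be illustrated with Python's standard-library support for XML-RPC, a lightweight XML-based RPC protocol that predates SOAP. This is only an illustrative sketch; the method name "add" and its arguments are hypothetical, and classic RPC systems used binary encodings rather than XML.

```python
import xmlrpc.client

# Marshal a call to a hypothetical remote procedure "add" with
# arguments (2, 3) into its XML wire format. Over the network this
# payload would travel in the body of an HTTP POST request.
request = xmlrpc.client.dumps((2, 3), methodname="add")

# On the server side the payload is unmarshalled back into native
# values before the local procedure is dispatched.
params, method = xmlrpc.client.loads(request)
print(method, params)   # add (2, 3)
```

The caller never sees this envelope; the RPC machinery hides the marshalling behind an ordinary-looking function call, which is exactly the transparency described above.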

REST stands for Representational State Transfer [60]. The REST approach aims at avoiding SOAP and RPC and simply relies on HTTP requests. REST-based communication is stateless, i.e. each request is self-contained and carries all the information required by the server to fulfill it. REST provides simple yet fundamental functionality over HTTP: one can send requests like Create, Read, Update and Delete to the remote application. REST is thus an alternative that avoids the limitations of SOAP and RPC. It is lightweight and communicates over standard HTTP, which makes it a convenient and platform-independent communication option.

1.1.2 Architectural Designs

Based on the communication protocols presented in Section 1.1.1, two basic architectures for distributed computing can be identified: component-based architectures and service-oriented architectures. A number of different variants and hybrid architectures are also available.

Component-based architectures rely on point-to-point communication between the nodes. A component is a software object interacting with another software object located on a remote host. In the simplest case, each component exposes certain interfaces which are used to interact with and access the functionality provided by the component. Since the communication is point-to-point, systems based on this architecture often use RPCs as the communication medium.

Figure 1.1. Point-to-point communication in a component-based architecture using remote procedure calls (RPC).

The solution then inherits the advantages and disadvantages of the underlying interprocess communication method. For example, the communication will be fast, but it will be difficult to use a heterogeneous network environment.

The design of the Service Oriented Architecture (SOA) is natural for loosely coupled distributed applications in a heterogeneous environment. During recent years, SOA has been a common choice when designing solutions for distributed computing.
A definition given in [29] describes SOA as a "paradigm for organizing and utilizing distributed capabilities that may be under the control of different ownership domains". In general, SOA allows for a relationship between needs and capabilities. This relation can be one-to-one, where one need is fulfilled by one capability, or it can be many-to-many. The visibility of the capabilities offered by entities is described in the service description, which also contains the information necessary for the interaction. The service description further states what result will be delivered and under what conditions the service can be invoked.

Figure 1.2. Service Oriented Architecture (SOA).

Similar to the component-based architecture, a SOA solution also inherits the advantages and disadvantages of the underlying communication mechanism. SOAP over HTTP(S) has become a default choice for solutions based on SOA. This allows requests and responses to be comprehensive because of the extensibility of XML. Communication over HTTP(S) allows the services to be deployed over the Internet. This increases the visibility of the services, and the services can also be reused by different applications. On the down-side, due to the extensibility of XML, the message parsing mechanism is time consuming.

1.1.3 Frameworks for Distributed Computing

The Distributed Computing Environment (DCE) [9] is based on a component-based architecture. It provides consistent communication across remote execution environments. The framework is used to build client-server applications and provides features like DCE-specific remote procedure calls (DCE/RPC), authentication and security, a naming service and access to a distributed file system.

Java Remote Method Invocation (RMI) [56] allows Java objects to be shared between Java Virtual Machines (JVM) running on multiple nodes. Since the JVM provides a platform-independent environment for Java applications, this allows Java objects to be shared across different platforms. The Java RMI framework is based on the component-based architecture. Using Java RMI, the communication is restricted to a pure Java environment; thus it does not provide support for cross-language interoperability. One of the gains of using Java RMI is its object-oriented approach, which facilitates building applications.

Microsoft's Distributed Component Object Model (DCOM) [96] is another framework based on the component-based object architecture. DCOM is an extension of the Component Object Model (COM). COM allows building a client-server communication model on the same host, and DCOM extends it to multiple hosts within the same network. DCOM uses standard RPCs developed for the distributed computing environment. DCOM provides security features and it also introduces platform independence for DCOM-aware applications.

The Common Object Request Broker Architecture (CORBA) [105] provides a framework for platform-independent distribution of objects on a heterogeneous network. This is another effort based on a component-based object-sharing architecture. CORBA is one of the most successful frameworks for building distributed solutions. The Object Request Broker (ORB) is the core of CORBA. This component allows a connected node to initiate a request without knowing the actual location of, or the programming interface at, the node which can fulfill that request. The framework also provides runtime interface identification and invocation using the Interface Repository (IR) and the Dynamic Invocation Interface (DII).

Web services [29] form one implementation of a service-oriented architecture and are the most widely used technology in distributed computing solutions today. A web service framework enables a distributed application to offer functionality by publishing its functions through interfaces while hiding the implementation details. Clients communicate using standard protocols without actually knowing the platform or the implementation details. The success of web service technology is due to the acceptance of standards.
Usually the communication process is based on three components: XML for data exchange between the client application and the service, SOAP, and HTTP(S). Also, WSDL (Web Service Description Language) [41], an XML-based language used to describe the attributes, interfaces and other properties of a web service, is sometimes used.

1.2 Models for Scalable Distributed Computing Infrastructures

Distributed computing infrastructures can broadly be categorized as either small to medium scale, formed by closely interconnected computing resources within a single organization, or large scale computational environments based on distributed resources shared between organizations. Managing geographically distributed, heterogeneous infrastructures requires more advanced solutions to provide reliable systems. Some of the key issues that should be considered are:

- The proposed model should be scalable and adaptable to new requirements.
- Due to the heterogeneous nature of the environment, the availability of the resources is not guaranteed. The solutions should be flexible enough to accommodate changes.
- The resources are managed by different administrative domains. When designing a federated infrastructure, the domain autonomy should be kept intact.
- It is important to ensure the correct use of the resources in the system, so mechanisms for authorization and authentication are needed.
- An efficient resource discovery mechanism is required to keep the information updated.
- Maximum usage of the system requires an efficient and reliable resource allocation mechanism.
- An abstraction layer should hide the underlying complexity from the users.

Grid computing and cloud computing are the two most successful models that implement a distributed computing infrastructure.

1.2.1 Grid Computing

Grid technology provides means to facilitate work in collaborative environments formed across the boundaries of institutions and research organizations. In [62], grid technology is stated to promise to transform the practice of science and engineering by enabling large-scale resource sharing and coordinated problem solving within far-flung communities. Over the last decade, a number of research and development projects have put a lot of effort into making grid technology stable enough to provide a production infrastructure for both computation and data. Grid technology allows different kinds of resources to be seamlessly available over geographical and technological boundaries. A resource can be anything from a single workstation, a rack-mounted cluster, a supercomputer, or a complex RAID storage, to e.g. a scientific instrument that produces data. These resources are normally independent and managed by different administrative domains.
This brings in many challenges regarding how to enable different virtual organizations [64] to access resources in different domains. A basic question is how to select which resource to use to run an application or store data. Since each set of resources is subject to different access policies, how can one enable a standard access mechanism? How can the environment be made secure enough to maintain the integrity of the system? How can one build a reliable monitoring and accounting system with low overhead? What protocols should be used to communicate with users, between computing resources, and between storage centers? Each of these questions emerges as a

sub-field of grid computing research, in which different research groups have come up with various types of solutions.

The uptake of grid technology within the scientific community can be measured by the number of middleware initiatives and the number of projects utilizing grid resources through these middlewares. For example, by the end of the EGEE project, the glite middleware [10] was deployed on more than 260 sites all over the world, which together provided 150,000 processing cores, 28 petabytes of disk space and 41 petabytes of long-term tape storage. More than 15 different scientific domains benefited from this infrastructure. The Advanced Resource Connector (ARC) middleware [57] by NorduGrid [27] has 66 sites with more than 54,000 CPUs in use [7]. Many other middlewares, such as Condor-G, Globus [17] and Unicore [40] for computing grids, and dCache, CASTOR, DPM and SRB for storage grids, are also heavily used in different scientific experiments. Apart from these production middlewares for computational and storage grids, there are a number of research projects which have developed application specific and general purpose environments based on these middlewares.

1.2.2 Cloud Computing

Clouds address large-scale storage and computing needs by providing a certain level of abstraction. The technology has gained much attention over the last few years, and companies like Amazon, Yahoo and Google have presented commercial solutions. There are a number of definitions [39, 103] explaining the concept of a cloud; one example is found in [106], stating that "A Computing Cloud is a set of network enabled services, providing scalable, QoS guaranteed, normally personalized, inexpensive computing platform on demand, which could be accessed in a simple and pervasive way". The basic idea of cloud technology is to provide a given level of quality of service while keeping the infrastructural details hidden from the end users. The customer pays for and gets the services on demand.
In [103], the set-up of a cloud service is based on two actors: Service Providers (SPs), which provide a set of different services (e.g. Platform as a Service (PaaS) or Software as a Service (SaaS)) and ensure that the customers can access them, and Infrastructure Providers (IPs), which are responsible for the hardware infrastructure. Actors with specialized roles introduce flexibility in the system; for example, one SP can utilize the infrastructure of multiple IPs, and a single IP can provide infrastructure for one or several SPs. Having actors responsible for providing services that fulfill a certain Service Level Agreement (SLA), together with an economic model, encourages companies to adopt cloud technology and sell computing and storage services like other utilities such as electricity or gas.

1.2.3 Grids vs Clouds

Currently, a discussion aimed at pinpointing the differences between clouds and grids is ongoing. In [66], a detailed comparison of these technologies is presented, and it is clarified that there are differences in the security, computing and programming models. Another key difference is the elasticity provided by cloud solutions. The concept of elasticity allows applications to grow and shrink according to their requirements. This is very important for clouds, as the idea of pay-as-you-go cannot work if clouds do not provide dynamic management of resource utilization. Also, the grid concept focuses on loosely coupled federated infrastructures in which there is no guarantee that resources are available at all times. In contrast, current cloud solutions are based on closely connected dedicated resources where the infrastructure providers guarantee the availability. However, there are also similarities in vision, sometimes in the architecture, and in the tools that are used to build the systems.

1.2.4 Other Relevant Models

Apart from grids and clouds, some further models are available for managing distributed infrastructures. Utility Computing is one of them. This concept is similar to the cloud in that an economic model is attached to the computing model and the cost depends on the usage of the resources. Another effort in this direction is the Desktop Grid. The Desktop Grid model inherits features from grid computing but focuses on low cost, reliable and maintainable solutions. Autonomic Computing focuses on self-managing processes in distributed environments; the idea is to build self-sufficient components that can manage themselves under unpredictable conditions. Another model that has gained significant attention is Pervasive Computing, which is based on the idea that devices should be completely connected and fully available.
1.3 Technologies for Large Scale Distributed Computing Infrastructures

There are a number of reliable solutions available which are based on the concepts of grid and cloud computing. In grids, the term grid middleware is used to describe a software stack designed to enable seamless, reliable, efficient and secure access to geographically distributed resources, whereas in clouds everything is known as a service. A number of different middleware initiatives have been started over the years, and the following description only gives a brief overview of a few production level middlewares for computational and storage grids.

Globus Toolkit: Globus is a pioneering project that provides tools to build grid middlewares. The toolkit [63] provided by Globus contains several components which can broadly be categorized into five classes: Execution Management [13], which executes, monitors and schedules grid jobs; Information Services [23], which discover and monitor resources in the grid; Security [35], which provides the Grid Security Infrastructure (GSI); Data Management [18], which allows for handling of large data sets; and finally the Common Runtime, a set of tools and libraries used to build the services.

Other middleware initiatives provide a more full-blown solution for distributed computational and storage resources and are directly used in different application areas:

Advanced Resource Connector (ARC): The Advanced Resource Connector (ARC) grid middleware is developed by the NorduGrid consortium [26] and the EU KnowARC project [21]. The ARC middleware is SOA-based, and its services run in a customized service container called the Hosting Environment Daemon (HED) [49]. HED comprises pluggable components which provide different functionalities. For example, Data Management Components are used to transfer data using various protocols, Message Chain Components are responsible for the communication between clients and services, ARC Client Components are plugins used by the clients to connect to different grid flavors, and Policy Decision Components are responsible for the security model within the system. There are a number of services available for fulfilling the fundamental requirements of a grid system. For example, grid job execution and management is handled by the A-REX service [80], policy decisions are taken by the Charon service, the ISIS service [98] is responsible for information indexing, and batch job submission is handled by the Sched service. The work presented in this thesis is based on the ARC middleware.
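To give a feel for what a user-prepared job description looks like in ARC, the sketch below shows a minimal job description in ARC's xRSL language. The executable, file names, job name and time limit are hypothetical values chosen for illustration:

```
&(executable="runapp.sh")
 (arguments="input.dat")
 (inputFiles=("input.dat" ""))
 (outputFiles=("result.dat" ""))
 (stdout="stdout.txt")
 (stderr="stderr.txt")
 (jobName="example-job")
 (cpuTime="60")
```

A description like this is passed to the ARC client tools, which match it against available resources and stage the listed input and output files.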
In [28], further details on each of the components and services in ARC are presented.

glite: The glite middleware [85] was the interface to the resources in the EGEE [70] infrastructure. glite is also SOA-based. Two core components of the glite middleware stack are the glite UI, a specialized user interface to access the available resources, and the Virtual Organization Management Service (VOMS), which manages information and access rights of the users within a VO. Resource level security is managed by the Local Centre Authorization Service (LCAS) and the Local Credential Mapping Service (LCMAPS). The Berkeley Database Information Index (BDII) is used for publishing the information. The Workload Management System (WMS) [90] is a key component of the system and distributes and manages user tasks across the available resources. The lcg-CE and CREAM-CE (Computing Resource Execution And Management Computing Element) are services providing the computing

element, and lcg-WN is the corresponding service for a worker node. For data management [102], the LFC (LCG File Catalog) and the FTS (File Transfer Service) are used. R-GMA [52] and the File Transfer Monitor (FTM) [15] are used for monitoring and accounting.

UNICORE: UNICORE [99] is a middleware based on a three-layered architecture. The top layer deals with the client tools; the middle service layer consists of core middleware services such as authentication, job management and execution, where application workflows are managed by the Workflow Engine and the Service Orchestrator; and the bottom layer is the systems layer, which connects UNICORE to the autonomous resource management systems. External storage is accessed through the GridFTP protocol.

dCache: dCache [97] is a distributed storage solution which combines geographically distributed storage nodes. It also provides access to tertiary storage systems. The major features of dCache include hot-spot detection, data flow control and support for different data access protocols. dCache is based on a service-oriented architecture which combines heterogeneous storage elements to collect several hundreds of terabytes in a single namespace. The Nordic Data Grid Facility (NDGF) [25] is the largest example of a dCache deployment. There, the core components, such as the metadata catalogue, the indexing service and the protocol doors, are run in a centralized manner, while the storage pools are distributed.

OGSA-DAI: The Open Grid Services Architecture Data Access and Integration (OGSA-DAI) [78] is a storage middleware solution that allows uniform access to data resources using a SOA approach.
OGSA-DAI consists of three core services: the Data Access and Integration Service Group Registry (DAISGR), which allows other services in the system to publish metadata and capabilities; the Grid Data Service Factory (GDSF), which has a direct connection to the data resource, contains additional metadata about the resource, and creates Grid Data Services (GDSs); and the Grid Data Service (GDS), which is used by the clients to access the data. A set of Java-based APIs allows clients to communicate with the system.

European Middleware Initiative (EMI): EMI [11] is a complete software stack based on four major European middlewares: ARC, UNICORE, glite, and dCache. The aim is to provide a coherent middleware by adhering to software standards for interoperability between the core services of the partner middlewares. Recently, EMI-1, codename Kebnekaise, has been released. It consists of a comprehensive set of tools and services for distributed computing infrastructures, which includes EMI-Compute for enabling computational resources, EMI-Data for distributed data management, EMI-Infrastructure, which provides a set of services required for information handling and management of a DCI, and EMI-Security for secure communication.

Meta-middlewares: The problem of having to learn and use multiple middlewares has been addressed by adding another layer on top of the existing middlewares. This meta-layer interacts with the underlying middlewares and can also add new functionality. The Grid Job Management Framework (GJMF) [58] used in this thesis is an example of a middleware independent resource allocation framework.

In contrast to the grid middleware initiatives described above, some well-known cloud based solutions are described below.

Amazon Cloud Services: Amazon [71, 6] provides commercial solutions for computing and storage capabilities through the Elastic Compute Cloud (EC2) [1] and Simple Storage Service (S3) [4] web services. The Amazon cloud provides a seamless view of the computing and storage services with a pay-as-you-go model. Here, the S3 service is based on the concept of buckets: containers that store objects and can be configured to reside in a specific region. S3 provides APIs using REST [79] and SOAP for the most common operations, such as creating and deleting buckets, writing and reading objects, and listing keys. EC2 allows access to computational resources using web service interfaces. Apart from these two services, Amazon also provides SimpleDB [5], offering core database functions like indexing and querying in the cloud, while RDS [3] addresses users that need a relational database system and the Elastic MapReduce [2] service allows users to process massive amounts of data.

Azure Cloud Services: Azure is a commercial cloud solution developed by Microsoft. Using Azure Compute, one can build applications using any language, tool or framework. Azure Storage is similar to other storage cloud solutions in the sense that users can create containers, where each container stores items of different types. Azure also provides features like data access using RESTful interfaces, automatic content caching near the users and a secure access mechanism for data in the cloud [7].
Other services provided by Azure include SQL Azure for databases. Applications can enable reporting by using Azure Business Analytics, and the Service Bus allows applications to create reliable messaging mechanisms for loosely coupled applications.

Openstack Cloud: The Openstack effort is a global, collaborative enterprise for specifying interfaces and building open source components for cloud technology. The effort spans a wide field, covering computing, storage and image services. Openstack Compute is an open source solution based on a large network of virtual machines providing a scalable computing platform. SWIFT is the Openstack storage solution; it is a BLOB-based solution for managing petabytes of data. The Openstack Image Service provides discovery, registration and delivery services for virtual machine images.
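The bucket/object model shared by S3, Azure Storage and SWIFT can be illustrated with a toy in-memory sketch. This is not the real AWS API; the class and function names below are invented, and only the operations named in the text (create bucket, write/read/delete objects, list keys) are modeled:

```python
class Bucket:
    """Toy in-memory model of an S3-style bucket holding named objects."""

    def __init__(self, name, region="eu-north-1"):
        self.name = name
        self.region = region      # a bucket can be pinned to a specific region
        self._objects = {}        # key -> object data

    def write_object(self, key, data):
        self._objects[key] = data

    def read_object(self, key):
        return self._objects[key]

    def delete_object(self, key):
        del self._objects[key]

    def list_keys(self, prefix=""):
        """List keys, optionally restricted to a common prefix."""
        return sorted(k for k in self._objects if k.startswith(prefix))

_buckets = {}

def create_bucket(name, region="eu-north-1"):
    _buckets[name] = Bucket(name, region)
    return _buckets[name]

# hypothetical usage: store two result files from an experiment
b = create_bucket("experiment-data")
b.write_object("runs/run1.dat", b"first run")
b.write_object("runs/run2.dat", b"second run")
```

In the real services these operations are invoked over REST or SOAP, and keys with a shared prefix give the illusion of a directory hierarchy inside the flat bucket namespace.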

Part II: Application Execution Environments

2. Application Environments for Grids

Grid systems provide a means for building large-scale computational and storage environments meeting the growing needs of scientific communities. There are challenges in building and managing efficient and reliable grid software components, but another area that also requires serious attention is how to enable applications to use the grid environment. Often, scientific applications are built using a monolithic approach, which makes it difficult to exploit a distributed computing framework. Even for a very simple application, the user needs certain expertise to run a job on a grid system: the client tool has to be installed and configured, a job description file has to be prepared, credentials have to be handled, commands to submit and monitor the job have to be issued, and finally the output files might have to be downloaded. Complex scientific applications use external libraries, input data sets, external storage space and certain toolkits, which adds complexity when running the application in a grid environment. Large efforts are needed to handle all these issues, and this greatly affects the overall progress of the real scientific activity. To get maximum benefit from a grid computing infrastructure, the user community needs to be provided with flexible, transparent and user friendly general purpose and application specific environments. Such environments can, for example, also handle several different middlewares in a transparent way.

2.1 Grid Portals

Grid application portals represent one way to address the requirements mentioned above. The goal is to access distributed computational power through a web interface and make application management as simple as using the web for sharing information. A number of different projects have developed production level application portals.
For example, GridSphere [19], the LUNARC portal [86], GENIUS [44] and P-Grid [94] together with GEMLCA [55] provide middleware independent grid portals.

2.2 Application Workflows

Scientific applications are often quite complex, and a computerized experiment is built up from the execution of multiple dependent or independent components. Systems for single or bulk job submission and management cannot handle

such applications. Enabling complex applications to utilize grid resources requires a comprehensive execution model. In a grid environment such models are known as application workflows [108]. In [67] a formal definition of a grid workflow is given as "The automation of the processes, which involves the orchestration of a set of grid services, agents and actors that must be combined together to solve a problem or to define a new service". Apart from various independent web-based or desktop applications for handling workflows, different middlewares provide separate components for managing workflows. These components allow a workflow to be submitted as one single, complete task. Condor's DAGMan (Directed Acyclic Graph Manager) [53] and Unicore's workflow engine [46] are examples of such components. Other extensive efforts include Triana [38], an open source problem solving environment, Pegasus [33], and Taverna [95] for bioinformatics applications.

2.3 The Job Management Component

The job management component is an important basic building block of an application environment. The task of this component is to handle job submission, management, resubmission of failed jobs and possibly also migration of jobs from one resource to another. Often the job management component is designed as a set of services with well-defined tasks, and the functionality is exposed through client tools or a set of APIs. This component works together with the client-side interface to provide a flexible, robust and reliable management layer. The job management component is also responsible for providing seamless access to multiple middlewares. One example is the GEMLCA integration with the P-Grid portal, in which the layered architecture of GEMLCA provides a grid-middleware independent way to execute legacy applications.
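The directed-acyclic-graph execution model used by workflow engines such as DAGMan can be sketched in a few lines. The following is a minimal illustration, not any engine's actual code: it orders the tasks of a hypothetical four-stage experiment so that every task runs only after its dependencies (Kahn's topological-sort algorithm).

```python
from collections import deque

def workflow_order(deps):
    """deps maps each task to the set of tasks it depends on.
    Returns an execution order respecting all dependencies."""
    indegree = {t: len(d) for t, d in deps.items()}
    dependents = {t: [] for t in deps}
    for task, ds in deps.items():
        for d in ds:
            dependents[d].append(task)
    # start with the tasks that have no unfinished dependencies
    ready = deque(sorted(t for t, n in indegree.items() if n == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for nxt in dependents[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return order

# hypothetical experiment: fetch -> (simulate, calibrate) -> analyse
deps = {"fetch": set(),
        "simulate": {"fetch"},
        "calibrate": {"fetch"},
        "analyse": {"simulate", "calibrate"}}
```

A real workflow engine adds job submission, failure handling and data staging on top of this ordering step, and can dispatch independent tasks (here, simulate and calibrate) to different resources in parallel.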
In other examples, the GridWay [75] metascheduler provides reliable and autonomous execution of grid jobs, and GridLab [101] produced a set of application-oriented grid services which are accessed using the Grid Application Toolkit (GAT). Using these tools, application developers can build and run applications on the grid without knowing too many details.

2.4 Thesis Contribution

In articles I and II we have developed frameworks for managing applications in grid systems. The solutions address both general purpose and application specific problem-solving environments. Our solution is based on the Lunarc Application Portal (LAP) [86], the R framework [34] and the Grid Job Management Framework (GJMF) [58]. LAP provides a user friendly web interface for executing the applications. The default version of LAP relies on the Advanced Resource Connector (ARC) middleware for job management. GJMF

was designed to provide a middleware independent job management framework. Using a multi-layered architecture, GJMF subdivides the tasks and provides reliable and fault-tolerant submission and management of grid jobs. The R framework is heavily used by biologists; it provides a wide variety of statistical and graphical techniques, and is highly extensible.

2.4.1 System Architecture

The architecture developed in this thesis enables the use of distributed computing infrastructures by introducing an abstraction layer that hides the underlying details and provides a simple and easy to use interface. In a distributed environment, computational and storage resources are exposed following various standards. Due to the lack of interoperability, application users are bound to a limited set of resources. The proposed solution also addresses this issue through GJMF's transparent access to resources running under different middlewares.

Figure 2.1. System architecture for enabling a flexible execution environment.

Based on the features and the functionalities provided by LAP and GJMF, we have developed an architecture which joins the best of these two systems. The architecture is based on three layers, and the components in the layers have well-defined tasks. LAP works as the Presentation Layer and provides the application management, whereas GJMF works at the Logic Layer and ensures reliable and middleware independent job submission and management functionality. Figure 2.1 illustrates the flexibility of the architecture. The architecture is highly modular and provides component level fault tolerance, i.e. one or several LAPs can use one or several GJMF deployments. Article I describes the work in detail. Based on these principles, we have enabled the R software framework at the presentation layer.
This approach enables scientists to utilize their local resources for simple tasks expressed in R and submit computationally expensive tasks to grids, all while working in a familiar environment. Article II presents an application specific problem-solving environment based on R and GJMF.

Part III: Distributed Storage Solution

3. Distributed Storage Systems

Large-scale storage systems have become an essential computing infrastructure component for both research and commercial environments. Distributed storage systems already hold petabytes of data, and the size is constantly increasing. The challenges of handling huge data volumes include requirements on consistency, reliability, long term archiving and high availability. In distributed collaborative environments, such as particle physics [32], earth sciences [61] and biomedicine [73], the requirements on a distributed storage system are even more pronounced: in order to efficiently utilize the computational power, high availability of the required data is essential. In commercial environments, companies like Amazon, Yahoo and Google are working on solutions to provide "unlimited storage anytime, anywhere". Centralized storage solutions cannot handle the upcoming data challenges in a scalable manner; instead, distributed storage systems (DSS) are needed to address these challenges. Network Attached Storage (NAS) and Storage Area Networks (SAN) provide limited solutions, but for large scale storage requirements the concept of geographically distributed resources in Data Grids [51] is a viable solution. The concept of data grids is to create large, virtual storage pools by connecting a set of geographically distributed storage resources.

During the last two decades, the challenge of designing DSS for huge data sets has been addressed in a number of projects. Solutions such as Google BigTable [50] have been developed, where a distributed storage system is used for managing petabytes of data over thousands of machines. BigTable is based on the Google file system [72] and is used by highly data intensive applications like Google Earth, Google Analytics and Google Personalized Search. Amazon Dynamo [54] is a storage system used by the world's biggest web store, Amazon.com.
Hadoop [20] is another effort aimed at designing a reliable and scalable distributed storage system. In the research community there are several projects where different solutions have been developed. For example, CASTOR and DPM [102] from CERN and dCache [69] from Fermilab and the DESY laboratory are in use to handle the petabytes of data generated by the Large Hadron Collider (LHC) experiments. Here, the data centres are located all over the world and the DSS is used to store the data on geographically distributed storage nodes. dCache is also capable of handling tertiary storage for long term data archiving. Tahoe [37] is an open source filesystem which utilizes several nodes in a resilient architecture. XTreemFS [76] addresses the same problem of distributed storage over a heterogeneous environment using an object-based filesystem.

irods [107] presents a layer on top of third party storage solutions and gives high-level seamless access to different storage systems. The projects listed above show the variety of large scale distributed storage systems available for both commercial and research communities. Despite all these big projects, new efforts are needed to address the limitations of current DSS.

3.1 Characteristics of Distributed Storage

Different studies have been conducted to identify the key features or characteristics of large-scale storage systems. In [104], a comprehensive summary of the requirements and the key characteristics of such systems is given:

- Reliability: The system should be capable of reliably storing and sharing the data generated by various applications.
- Scalability: The system should have a scalable architecture in which thousands of geographically distributed storage pools can dynamically join and leave the system.
- Security: The security model is an essential part of a DSS. It is important that users can share data in an easy-to-use but secure environment. Security is required at different levels in the system, e.g. between different components, while transferring data, when accessing metadata, and to determine ownership of files and collections.
- Fault Tolerance: While handling large amounts of data in a geographically distributed environment, it is expected that the system experiences hardware or component failures. The system should have the capability to recover transparently from a certain level of failures.
- High Availability: To run the system in a production environment it is important that the system is highly available.
- Accessibility: To make the system practically usable it is very important that the interfaces are simple enough to hide the overall complexity from the end user.
- Interoperability: Applications have diverse, emerging requirements that lead to various scalable solutions being built.
It is important to follow standards that allow interoperability between such solutions.

3.2 Challenges of Distributed Storage

Designing large-scale distributed storage systems is a non-trivial task. All the characteristics of DSS listed above have been extensively studied over the past years. In [51], core components for distributed data management have been identified, and several projects have been initiated that help to increase the overall progress. Below, the most commonly identified technical challenges in building a reliable, efficient, scalable, highly available and self-healing distributed storage system are listed:

- Data Abstraction or Virtualization: The system should provide a high level abstraction when utilizing storage resources across independent administrative domains.
- Data Transfer: Data intensive applications and replication mechanisms require protocols for efficient and reliable data transfer.
- Metadata Management: Decoupling and managing information about the available data is a serious challenge in the design of a DSS. For large scale systems, the metadata store is often the scalability bottleneck and a single point of failure.
- Authentication and Authorization: Resources running in independent administrative domains must have a security layer which allows single sign-on access to the resources. In grid systems, security is often handled by X.509 certificates signed by a certificate authority. Also, the concept of a virtual organization has evolved to make it possible to apply policies or rules to a defined group of individuals or projects in the same field.
- Replica Management: High availability and reliability of the data are often ensured by creating multiple copies of the data. A number of strategies have been proposed and studied for offering efficient and reliable replica management in a DSS.
- Resource Discovery and Selection: The heterogeneous nature of most DSS results in a need for a mechanism that gives information about the availability of data and its replicas in the system. This information helps to select the source which can most efficiently deliver the data to the destination.

3.3 Thesis Contribution

In this thesis, the development of the Chelonia storage system is presented, together with work on the open source cloud solution Openstack SWIFT for CERN-specific data analysis. The following sections give an overview of these projects.
3.3.1 Chelonia Storage System

The Chelonia storage system was developed using the next generation components of the ARC middleware. Chelonia is a file-oriented distributed storage system based on geographically distributed storage nodes. The system is designed to fulfill requirements ranging from creating a store for e.g. managing holiday pictures, to facilitating scientific communities requiring a grid-aware distributed storage system that can be used by grid jobs. The Chelonia system addresses many of the challenges mentioned in section 3.2. Below a

brief overview of the system is given, whereas papers III, IV and V and [92, 91, 93] provide complete details about Chelonia's architecture, performance and stability evaluation, as well as experiences of deploying Chelonia in real environments.

Architecture and System Components

Chelonia follows a service-oriented architecture. It is based on four core services, each with a well-defined role. Figure 3.1 shows an overview of the Chelonia architecture. The communication in the system uses SOAP over HTTP(S).

Figure 3.1. Architecture of the Chelonia Storage System

The Chelonia services are described below:

A-Hash (A-H): The A-Hash is a metadata store that consistently stores information as property-value pairs. Chelonia supports two types of A-Hash, centralized and replicated. Being such a central part of the storage system, the A-Hash needs to be consistent and fault-tolerant. The replicated A-Hash is based on the Oracle Berkeley DB [31] (BDB), an open source database library with a replication API. The replication is based on a single-master, multiple-clients scheme where all clients can read from the database but only the master can write to it. In the event of the master going offline, the clients send a request for election, and a new master is elected amongst the clients.

Librarian (L): The Librarian works as a metadata catalog while remaining a stateless service; it stores all persistent information in the A-Hash. This makes it possible to deploy any number of independent Librarian services to provide high availability and load balancing. The Librarian only needs to know about one of the A-Hashes at start-up to be able to get the list of all available A-Hashes. During run-time the Librarian holds a local copy of the A-Hash list and refreshes it both regularly and in the case of a failing connection.
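The single-master read/write rule and re-election behaviour of the replicated A-Hash can be sketched with a toy model. The class and method names below are invented for illustration; the real A-Hash delegates replication and election to the Berkeley DB replication API rather than implementing them itself.

```python
class AHashReplica:
    """One replica of a toy property-value store."""

    def __init__(self, name):
        self.name = name
        self.is_master = False
        self.data = {}            # property-value pairs


class AHashGroup:
    """Toy sketch of single-master replication: all replicas serve
    reads, only the master accepts writes, and a new master is
    promoted if the current one goes offline."""

    def __init__(self, names):
        self.replicas = [AHashReplica(n) for n in names]
        self.replicas[0].is_master = True

    @property
    def master(self):
        return next(r for r in self.replicas if r.is_master)

    def write(self, key, value):
        # the master accepts the write and propagates it to every replica
        for r in self.replicas:
            r.data[key] = value

    def read(self, key, replica_index):
        # any replica, master or client, can serve reads
        return self.replicas[replica_index].data[key]

    def elect(self):
        # master offline: drop it and promote the first remaining client
        self.replicas = [r for r in self.replicas if not r.is_master]
        self.replicas[0].is_master = True


group = AHashGroup(["a1", "a2", "a3"])
group.write("/grid/file1", {"size": 1024})
```

The point of the sketch is the division of labour: reads scale out across all replicas, while funnelling writes through one master keeps the stored metadata consistent, at the cost of needing an election whenever the master disappears.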