Grid Technologies for Cancer Research in the ACGT Project Juliusz Pukacki (pukacki@man.poznan.pl) OGF25 - Grid technologies in e-health Catania, 2-6 March 2009
Outline ACGT project ACGT architecture Layers Components Gridge Toolkit (http://www.gridge.org) GRMS DMS GAS Scenarios
ACGT - overview ACGT is an EU co-funded project that develops open-source, semantic and grid-based technologies in support of post-genomic clinical trials in cancer research Broad range of activities Data collection: clinical trials Data storing and providing Computational jobs: oncosimulator application Services invocation: workflows Ontologies for cancer research area Strong security policies on data VO management
ACGT Architecture Logical separation of the Grid fabric Grid layers provide a standard and secure way of accessing the hardware resources of the Grid environment On top of this Grid platform, many different environments can be built for different fields, not only biomedicine.
ACGT high level Architecture overview User Access and High Level Interoperability Layer (workflow management, dedicated clients) Security ACGT Business Process Services (Ontology Service, Knowledge Discovery, Mediators...) Advanced Grid Middleware (Gridge, + ACGT specific services) Common Grid Infrastructure (Globus Toolkit,...) Hardware Layer (computational resources,network, databases)
Layers description (1) Hardware Layer Basic hardware infrastructure Computational resources Network Databases Common Grid Infrastructure Provides remote access to individual resources from the Hardware Layer Globus Toolkit: GRAM, GridFTP, MDS, ... Monitoring sensors
Layers description (2) Advanced Grid Middleware Collective services that operate on a set of lower-level services to provide more advanced functionality Gridge Toolkit: GRMS, GAS, Monitoring Tools, DMS, Mobile support services OGSA-DAI Other ACGT-specific services
Layers description (3) ACGT Business Process Services High level services providing interoperability in ACGT environment and integration of different data and resources Ontology Services Knowledge Discovery Services VO Management Service Mediator Services Analytical Services Workflow enactor
Layers description (4) User Access and High Level Interoperability Layer Applications providing access to the ACGT environment for end users (standalone applications or portals) Workflow management applications Applications dedicated to specific ACGT scenarios Visualization Tools
Components Portal Metadata Registration Workflow Editor GridR Session Mediator Service Metadata Repository Workflow Enactor OBTIMA Workflow Editor Globus GRAM DMS GridFTP GAS Authorization Plugins Wrapper Service Databases (DICOM,BASE CTMS) Logging Service MDS GridR Service VO Management DMS Portlet GRMS Ontology Service Oncosimulator Service VO Management Service Analytical Services
Services Classification Position in ACGT architecture Role in ACGT environment Infrastructure Service Analytical Tool Service Owner/maintainer of the service ACGT Service Third Party Service
Data management architecture
Grid Infrastructure Common Grid Infrastructure Globus Toolkit GT4 Advanced Grid Services PSNC Gridge Toolkit (the most important components) Resource Management Authorization and VO management Services Data Management
GRMS - Resource Management System Open source meta-scheduling system Based on dynamic resource selection Deals with a dynamic Grid environment Cooperates with other services: local resource management services, information services, data management services, security services Implemented as a GSI-enabled Web Service (Tomcat/Axis)
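The idea of dynamic resource selection can be illustrated with a minimal Python sketch. It is not GRMS code: the `Resource` fields, the `select_resource` function and the ranking rule (shortest queue first, then most free CPUs) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical snapshot of a resource, as an information service might report it."""
    name: str
    free_cpus: int
    queue_length: int


def select_resource(resources, cpus_needed):
    """Return the resource with enough free CPUs and the shortest queue, or None."""
    candidates = [r for r in resources if r.free_cpus >= cpus_needed]
    if not candidates:
        return None
    # Assumed ranking: shortest queue wins, ties go to the resource with more free CPUs.
    return min(candidates, key=lambda r: (r.queue_length, -r.free_cpus))


resources = [
    Resource("cluster-a", free_cpus=4, queue_length=10),
    Resource("cluster-b", free_cpus=16, queue_length=2),
    Resource("cluster-c", free_cpus=2, queue_length=0),
]
best = select_resource(resources, cpus_needed=4)
print(best.name if best else "no suitable resource")
```

Because the selection runs on a fresh snapshot each time, the same job description can land on different resources as the environment changes.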
GRMS - Architecture
Authorization Service - GAS Decision point for all components of a system Stores security policies for all components The complex GAS data structure can be used to model many abstract and real-world objects and the security policies for such objects GAS is independent of the specific technologies used at lower layers, and it should be fully usable in any environment Implemented as a GSI-enabled Web Service
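A central decision point can be sketched in a few lines of Python. This is an illustrative model only, not the GAS data structure or interface: the `PolicyStore` class, its method names and the (subject, object, operation) triples are assumptions.

```python
class PolicyStore:
    """Hypothetical decision point: stores allow-rules and answers yes/no queries."""

    def __init__(self):
        self._rules = set()

    def allow(self, subject, obj, operation):
        """Record that `subject` may perform `operation` on `obj`."""
        self._rules.add((subject, obj, operation))

    def is_authorized(self, subject, obj, operation):
        """Default-deny check: only explicitly stored rules grant access."""
        return (subject, obj, operation) in self._rules


policies = PolicyStore()
policies.allow("alice", "/trials/input.tar", "read")

print(policies.is_authorized("alice", "/trials/input.tar", "read"))
print(policies.is_authorized("alice", "/trials/input.tar", "write"))
```

Keeping every rule in one store means each component only asks a question and never holds its own copy of the policy.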
GAS Architecture
Data Management - GDMS Distributed system of services delivering mechanisms for seamless management of large amounts of data Virtual file system keeping the data organized in a tree structure: metadirectories - for building the hierarchy; metafiles - logical representation of data DMS components: Data Broker - access interface; Metadata Repository - information about data; Data Container - data storage
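The virtual file system tree can be modelled roughly as follows. The class names mirror the slide's terms, but the fields, the `resolve` helper and the example location URL are hypothetical and do not reflect the actual DMS schema.

```python
from dataclasses import dataclass, field


@dataclass
class MetaFile:
    """Logical representation of data; `locations` lists where copies are stored."""
    name: str
    locations: list = field(default_factory=list)


@dataclass
class MetaDirectory:
    """Node used to build the hierarchy; children are keyed by name."""
    name: str
    children: dict = field(default_factory=dict)

    def add(self, node):
        self.children[node.name] = node
        return node


def resolve(root, path):
    """Walk a slash-separated logical path from the root and return the node found."""
    node = root
    for part in (p for p in path.split("/") if p):
        if not isinstance(node, MetaDirectory) or part not in node.children:
            raise KeyError(path)
        node = node.children[part]
    return node


root = MetaDirectory("")
trials = root.add(MetaDirectory("trials"))
# The storage URL below is a made-up placeholder.
archive = trials.add(MetaFile("input.tar", ["gsiftp://container.example.org/data/0001"]))

print(resolve(root, "/trials/input.tar").locations)
```

The logical path stays stable for users while the physical locations behind a metafile can change.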
GDMS Architecture
Grid Portals Complete solution for grid portal construction Hides the complexity of the underlying Grid Capable of handling Grid environments ranging from local computing portals to inter-campus wide systems Easy access to user grid job configurations and history Workflow support Core and application grid portlets Open source Gridsphere portal framework
Portals
Grid Monitoring
Privacy challenge for research Research Some sensitive data is required Age, sex, genetic information,... PETs often cannot be used Data from different sources Linking of data required Ease of use? Privacy Personal Sensitive information Confidentiality is key Personal information is prone to abuse e.g. Impact on loan applications, insurance,... Privacy violation is irreversible Possible gain from research VS possible damage to individual
Sensitive data on the grid Some grids know no borders... but personal data does There are legal issues when processing (transporting) personal data cross-border Sensitive data on a grid Remote resources are storing and/or processing your sensitive data Resources should be trusted (trustworthy) How can you know? secure does not imply trustworthy Trustworthiness should be certified (self-certification is not acceptable) Trustworthiness and policies should be assured (enforced) Reliable mechanisms needed
Security in ACGT Technical security measures Central PKI, authorization service (GAS), logging/auditing,... Encryption of communication, GSI... Label of an ACGT-enabled service Extra requirements regarding security (e.g. firewall, local configuration,...) Data Protection Framework On top of the general ACGT security infrastructure Aims to give all ACGT actors a means to comply easily with European data protection legislation when working on the ACGT platform
Data Protection Framework
Data Protection Framework Achieves its goals through... De-identification of (personal) data By a Trusted Third Party Safeguarding a context of anonymity: data is anonymous within the context the person is working in ACGT data protection board Through access control on the de-identified content thus falls under the less strict legal requirements for anonymous data Description of the necessary legal agreements Contracts Consent forms (de-identification is a form of processing and thus requires consent) Extensive audit trail of everything that happens with the data
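One common way to de-identify records while keeping them linkable is keyed pseudonymization. The sketch below illustrates that general technique only; it is not the ACGT procedure. The field names, the `de_identify` function and the choice of HMAC-SHA256 are assumptions, with the secret key imagined to be held by the trusted third party alone.

```python
import hashlib
import hmac

# Assumed set of direct identifiers to drop; a real framework would define its own list.
DIRECT_IDENTIFIERS = {"name", "address"}


def de_identify(record, secret_key):
    """Replace the patient identifier with a keyed pseudonym and drop direct identifiers."""
    pseudonym = hmac.new(
        secret_key, record["patient_id"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    cleaned["pseudonym"] = pseudonym
    return cleaned


record = {"patient_id": "P-001", "name": "Jane Doe", "age": 54, "sex": "F"}
print(sorted(de_identify(record, b"key-held-by-trusted-third-party")))
```

The same patient identifier always maps to the same pseudonym under one key, so data from different sources can still be linked without exposing the identity.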
How to use the Grid - scenarios Oncosimulator application scenario Execution of a batch job in the Grid environment GridR environment Integration of a generic solution with Grid technology, transparent to the end user Services invocation workflow scenario Invocation of Grid services as part of a service enactment workflow
Oncosimulator scenario - steps User prepares input data and uploads it to DMS User specifies an oncosimulator application (10-task workflow) Job is submitted to GRMS GRMS prepares and executes the job User can track the job execution - status changes During runtime, the data produced are sent to a specified location and visualized
Oncosimulator scenario - overview (diagram) Layers: User Layer, Advanced Grid Services, Common Grid Services, Hardware Layer Components shown: GridFTP, MDS4, GRAM Steps: 1. put data to DMS 2. submit computation 3. authorization 4. get data location 5. transfer data 6. find available resources 7. submit job to resource 8. run job
Oncosimulator - experiment workflow (diagram) Tasks: msg, untar_vis, visualiz, make_sim, untar_sim, simulation, mkdir, copier, result, stopper Status labels shown: RUNNING, FINISHED, FAILED, CANCEL
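Tracking status changes in the oncosimulator scenario can be sketched as a small Python loop. The status names come from the workflow diagram; everything else is an assumption: `run_workflow` is a hypothetical helper, the tasks run strictly in sequence (the real workflow is a graph), and the task bodies are empty placeholders.

```python
from enum import Enum


class Status(Enum):
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"


def run_workflow(tasks, on_status):
    """Run (name, action) pairs in order, reporting every status change.

    Stops at the first failing task and returns False; returns True otherwise.
    """
    for name, action in tasks:
        on_status(name, Status.RUNNING)
        try:
            action()
        except Exception:
            on_status(name, Status.FAILED)
            return False
        on_status(name, Status.FINISHED)
    return True


# Placeholder actions; the task order here is assumed for illustration.
tasks = [(name, lambda: None) for name in ("mkdir", "copier", "untar_sim", "simulation")]
run_workflow(tasks, lambda name, status: print(f"{name}: {status.value}"))
```

The callback stands in for whatever notification channel delivers status changes to the user.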
GridR environment - overview Based on the well-known open-source package R, which provides a broad range of state-of-the-art statistical and graphical techniques and advanced data mining methods R environment used as an interface for remote grid computation
GridR scenario steps (1) Designing the algorithm in the local R environment Initializing the Grid methods Wrapping the R code to be executed in the Grid into a single function in the local R environment Calling a special function for Grid submission with the previously prepared function as an argument
GridR scenario steps (2) Waiting for results (locked variable) Processing of the result
GridR Service Using R in the Workflow Environment Remote execution of R code in the Grid Wrapper to make R look like a generic ACGT service Interface to the distributed Data Management System and Resource Management System (grid jobs) Interface to the Meta Data Repository (pre-registered scripts) Access to the R environment from within grid-based workflows Security: based on credential delegation
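The submit-then-wait pattern of the GridR scenario can be illustrated in Python, used here instead of R purely to show the shape of the interaction. `grid_apply` is a hypothetical stand-in for the Grid submission function: it runs the wrapped function in a local worker thread, and the returned future plays the role of the locked variable.

```python
from concurrent.futures import ThreadPoolExecutor

# Local thread pool standing in for remote Grid execution (an assumption of this sketch).
_executor = ThreadPoolExecutor(max_workers=2)


def grid_apply(func, *args):
    """Submit `func` for asynchronous execution and return a future for its result."""
    return _executor.submit(func, *args)


def analysis(values):
    """The algorithm wrapped into a single function, as in the scenario steps."""
    return sum(values) / len(values)


pending = grid_apply(analysis, [2.0, 4.0, 6.0])
mean = pending.result()  # blocks until the result is available
print(mean)

_executor.shutdown()
```

The caller keeps working with ordinary local objects; only reading the result forces a wait.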
ACGT workflow environment Two separate components: Workflow editor Workflow enactor Standard workflow description format BPEL Workflow construction based on metadata describing services Common technology for services implementation - Web Services with GSI
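The role of a workflow enactor, invoking services in an order that respects their dependencies, can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is not BPEL and not the ACGT enactor: the `enact` function, the step names and the toy services are hypothetical.

```python
from graphlib import TopologicalSorter


def enact(workflow, services):
    """Invoke each step after its dependencies, passing their results as inputs.

    `workflow` maps a step name to the set of steps it depends on.
    """
    results = {}
    for step in TopologicalSorter(workflow).static_order():
        inputs = [results[dep] for dep in sorted(workflow.get(step, ()))]
        results[step] = services[step](*inputs)
    return results


# Hypothetical four-step chain; each callable stands in for a service invocation.
workflow = {"normalize": {"load"}, "analyze": {"normalize"}, "report": {"analyze"}}
services = {
    "load": lambda: [3, 1, 2],
    "normalize": lambda xs: sorted(xs),
    "analyze": lambda xs: sum(xs),
    "report": lambda total: f"total={total}",
}

results = enact(workflow, services)
print(results["report"])
```

Separating the description (`workflow`) from the execution (`enact`) mirrors the split between the editor and the enactor.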
Thank you...