AMGA metadata catalogue system

Similar documents
The AMGA Metadata Service

Comparative evaluation of software tools accessing relational databases from a (real) grid environments

AMGA Metadata catalogue service for managing mass data of high-energy physics

Grid services. Enabling Grids for E-sciencE. Dusan Vudragovic Scientific Computing Laboratory Institute of Physics Belgrade, Serbia

AMGA tutorial. Enabling Grids for E-sciencE

glite Grid Services Overview

Ganga The Job Submission Tool. WeiLong Ueng

LCG-2 and glite Architecture and components

Bob Jones. EGEE and glite are registered trademarks. egee EGEE-III INFSO-RI

Understanding StoRM: from introduction to internals

The ARDA project: Grid analysis prototypes of the LHC experiments

Failover procedure for Grid core services

g-eclipse A Framework for Accessing Grid Infrastructures Nicholas Loulloudes Trainer, University of Cyprus (loulloudes.n_at_cs.ucy.ac.

ARDA status. Massimo Lamanna / CERN. LHCC referees meeting, 28 June

CouchDB-based system for data management in a Grid environment Implementation and Experience

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

Metadaten Workshop 26./27. März 2007 Göttingen. Chimera. a new grid enabled name-space service. Martin Radicke. Tigran Mkrtchyan

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

EGEE and Interoperation

Prototype DIRAC portal for EISCAT data Short instruction

Provide Real-Time Data To Financial Applications

Andrea Sciabà CERN, Switzerland

STATUS UPDATE ON THE INTEGRATION OF SEE-GRID INTO G- SDAM AND FURTHER IMPLEMENTATION SPECIFIC TOPICS

R-GMA (Relational Grid Monitoring Architecture) for monitoring applications

DIRAC File Replica and Metadata Catalog

The CORAL Project. Dirk Düllmann for the CORAL team Open Grid Forum, Database Workshop Barcelona, 4 June 2008

Microsoft vision for a new era

Grid and Cloud Activities in KISTI

EUROPEAN MIDDLEWARE INITIATIVE

The Grid: Processing the Data from the World s Largest Scientific Machine

High Availability irods System (HAIRS)

Monitoring tools in EGEE

Bookkeeping and submission tools prototype. L. Tomassetti on behalf of distributed computing group

WHEN the Large Hadron Collider (LHC) begins operation

FREE SCIENTIFIC COMPUTING

Enabling Grids for E-sciencE. A centralized administration of the Grid infrastructure using Cfengine. Tomáš Kouba Varna, NEC2009.

A Simplified Access to Grid Resources for Virtual Research Communities

Deploying virtualisation in a production grid

Overview of HEP software & LCG from the openlab perspective

Welcome to the New Era of Cloud Computing

Red Hat AMQ 7.2 Introducing Red Hat AMQ 7

Improving Grid User's Privacy with glite Pseudonymity Service

AMGA metadata catalogue and high level API

SZDG, ecom4com technology, EDGeS-EDGI in large P. Kacsuk MTA SZTAKI

Web Services in Cincom VisualWorks. WHITE PAPER Cincom In-depth Analysis and Review

Towards Transparent Integration of Heterogeneous Cloud Storage Platforms

Distributed Data Management with Storage Resource Broker in the UK

PoS(EGICF12-EMITC2)106

Internet2 Meeting September 2005

The LCG 3D Project. Maria Girone, CERN. The 23rd Open Grid Forum - OGF23 4th June 2008, Barcelona. CERN IT Department CH-1211 Genève 23 Switzerland

Beob Kyun KIM, Christophe BONNAUD {kyun, NSDC / KISTI

The PanDA System in the ATLAS Experiment

MySQL Cluster Web Scalability, % Availability. Andrew

Dr. Giuliano Taffoni INAF - OATS

COMMUNICATION PROTOCOLS

Administering a SQL Database Infrastructure (M20764)

Outline. ASP 2012 Grid School

The EU DataGrid Testbed

Introduction to Grid Infrastructures

Policy Manager for IBM WebSphere DataPower 7.2: Configuration Guide

HammerCloud: A Stress Testing System for Distributed Analysis

A Simple Mass Storage System for the SRB Data Grid

DIRAC data management: consistency, integrity and coherence of data

Edition 0.1. real scenarios for managing EAP instances. Last Updated:

Installation of CMSSW in the Grid DESY Computing Seminar May 17th, 2010 Wolf Behrenhoff, Christoph Wissing

Product Name DCS v MozyPro v2.0 Summary Multi-platform server-client online (Internet / LAN) backup software with web management console

DIRAC pilot framework and the DIRAC Workload Management System

DataSunrise Database Security Suite Release Notes

LHCb Distributed Conditions Database

Operating the Distributed NDGF Tier-1

Evolution of the ATLAS PanDA Workload Management System for Exascale Computational Science

Utilizing Databases in Grid Engine 6.0

20533B: Implementing Microsoft Azure Infrastructure Solutions

On the employment of LCG GRID middleware

Technical White Paper

Exam : Implementing Microsoft Azure Infrastructure Solutions

Security in distributed metadata catalogues

Astrophysics and the Grid: Experience with EGEE

Scientific data management

DIGIPASS Authentication for O2 Succendo

NeuroLOG WP1 Sharing Data & Metadata

Java Development and Grid Computing with the Globus Toolkit Version 3

The LHC Computing Grid

Service Availability Monitor tests for ATLAS

CMS Tier-2 Program for user Analysis Computing on the Open Science Grid Frank Würthwein UCSD Goals & Status

Extending the SDSS Batch Query System to the National Virtual Observatory Grid

Data Access and Data Management

Sriram Krishnan

Service Virtualization

EGEE (JRA4) Loukik Kudarimoti DANTE. RIPE 51, Amsterdam, October 12 th, 2005 Enabling Grids for E-sciencE.

Shen PingCAP 2017

Grid Middleware and Globus Toolkit Architecture

where the Web was born Experience of Adding New Architectures to the LCG Production Environment

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy Classification from Slide 8

Developing Microsoft Azure Solutions

Introduction. Distributed Systems IT332

Server Installation Guide

Workload Management. Stefano Lacaprara. CMS Physics Week, FNAL, 12/16 April Department of Physics INFN and University of Padova

ELFms industrialisation plans

StratusLab Cloud Distribution Installation. Charles Loomis (CNRS/LAL) 3 July 2014

Transcription:

AMGA metadata catalogue system Hurng-Chun Lee ACGrid School, Hanoi, Vietnam www.eu-egee.org EGEE and glite are registered trademarks

Outline AMGA overview AMGA Background and Motivation for AMGA Interface, Architecture and Implementation Metadata Replication on AMGA AMGA use cases AMGA hands-on exercise 2

Metadata catalogue logical filename Physical filename Metadata catalogue File catalogue Metadata (data about data) Data content Metadata catalogue can be used as a File catalogue 3

Metadata on the Grid Find the file XXX.YY File Catalogue User Find the data produced by my job x and with energy level < -10 ev SE 1 SE 3 Metadata Catalogue SE 2 SE n The Grid Storage Environment Data are stored as files on many distributed storage elements 4

More than a file catalogue Metadata is data about data On the Grid: information about data content Describe files Locate files based on their contents But also makes DB access a simple task on the Grid Many Grid applications need structured data Many applications require only simple schemas Can be modeled as metadata Main advantage: better integration with the Grid environment Metadata Service is a Grid component Grid security Hide DB heterogeneity 5

ARDA/gLite Metadata Interface Existing Metadata Services in High-Energy Physics AMI (ATLAS), RefDB (CMS), Alien Metadata Catalogue (ALICE) Similar goals, similar concepts Domain specific Technical limitations: large answers, scalability, performance ARDA proposal: AMGA based on the HEP requirements generic for other applications design jointly with the glite/egee team Adopted as the official EGEE Metadata Interface endorsed by PTF (Project Technical Forum of EGEE) 6

The AMGA project ARDA developed a Project Task Force in order to develop: AMGA ARDA Metadata Grid Application Began as prototype to evaluate the Metadata Interface Evaluated by community since the beginning: LHCb and Ganga were early testers (more on this later) Matured quickly thanks to users feedback Now is part of the glite middleware Official Metadata Service for EGEE First release with glite 1.5 Also available as standalone component It is expanding to other user communities: HEP, Biomed, UNOSAT 7

AMGA metadata concept Some Concepts: Metadata - List of attributes associated with entries Attribute key/value pair with type information Type The type (int, float, string, ) Name/Key The name of the attribute Value - Value of an entry's attribute Schema A set of attributes Collection A set of entries associated with a schema Think of schemas as tables, attributes as columns, entries as rows 8

AMGA features Dynamic Schemas Schemas can be modified at runtime by client Create, delete schemas Add, remove attributes Metadata organized as an hierarchy Collections can contain sub-collections Analogy to file system: Collection Directory Entry File Flexible Queries SQL-like query language Joins between schemas QUERY EXAMPLE: selectattr /glibrary:filename \ /glibrary:author \ /glibrary:file=/glaudio:file \ and \ like(/glibrary:filename, %.mp3") 9

AMGA security Unix style permissions ACLs per-collection or per-entry Secure connections SSL Client Authentication based on Username/password General X509 certificates Grid-proxy certificates Access control via a Virtual Organization Management System (VOMS) 10

AMGA implementation C++ multiprocess server Runs on any Linux flavour Backends Oracle, MySQL, PostgreSQL, SQLite Two frontends TCP Streaming SOAP High performance Client API for: C++, Java, Python, Perl, Ruby Interoperability Also implemented as standalone Python library Data stored on filesystem 11

Architecture Designed for scalability Asynchronous operation Reading from DB and sending data to client Response sent to client in chunks No limit on the maximum response size Example: TCP Streaming Text based protocol (like SMTP, POP3, ) Response streamed to client Client: Server: 0 entry value1 value2 <EOT> listattr entry 12

Metadata replication Motivation Scalability Support hundreds/thousands of concurrent users Geographical distribution Hide network latency Reliability No single point of failure DB Independent replication Heterogeneous DB systems Disconnected computing Off-line access (laptops) Architecture Asynchronous replication Master-slave Writes only allowed on the master Replication at the application level Replicate Metadata commands, not SQL DB independence Partial replication supports replication of only sub-trees of the metadata hierarchy 13

Metadata replication (cont.) Full replication Partial replication Federation Proxy 14

Early adoption of AMGA LHCb-bookkeeping (keep additional information from executed jobs) Migrated bookkeeping metadata to ARDA prototype 20M entries, 15 GB Large amount of static metadata Feedback valuable in improving interface and fixing bugs AMGA showing good scalability Ganga Job management system Developed jointly by Atlas and LHCb Uses AMGA for storing information about job status Small amount of highly dynamic metadata 15

Accessing AMGA TCP Streaming Front-end mdcli & mdclient and C++ API (md_cli.h, MD_Client.h) Java Client API and command line mdjavaclient.sh & mdjavacli.sh (also under Windows) Python Client API SOAP Frontend (WSDL) C++ gsoap AXIS (Java) ZSI (Python) 16

Conclusion AMGA Metadata Service of glite Part of glite (but still not certificed in glite 3.0. it will be done with 3.1 release) Useful for simplified DB access Integrated on the Grid environment (Security) Replication/Federation features Tests show good performance/scalability Already deployed by several Grid Applications LHCb, ATLAS, Biomed, AMGA Web Site http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/ 17

Medical data management Scenario: Store and access medical images exploiting metadata on the Grid Requirements: Strong security requirements Patient data is sensitive Data must be encrypted Metadata access must be restricted to authorized users AMGA used as metadata server Demonstrates authentication and encrypted access Used as a simplified DB More details: http://www.i3s.unice.fr/~johan/mdm/mdm-051013.pdf 18

WISDOM drug analysis Scenario: essential docking information (e.g. binding energy, conformation) provider for on-line data analysis on the simulation outputs Requirement: high speed and large-scale I/O supporting large number of entries ( > 10 million database records ) replication and backup AMGA used as a file catalogue as well as part of the post-analysis infrastructure first level data analysis on the metadata grid physical file name is one of the metadata attributes users are pointed to download the full output according to the metadata query 19

Grid-enable drug analysis system Best demo prize of EGEE 07 simulation preparation job submission Grid Computing Units (Worker nodes, agents) online data analysis 6500 Histogram on 295535 screens: DC2_T07IAN 6000 5500 5000 4500 4000 3500 3000 2500 2000 1500 1000 500 0-1 6-1 5-1 4-1 3-1 2-1 1-1 0-9 - 8-7 - 6-5 - 4-3 binding energy 20

gmod: grid Movie on Demand gmod provides a Video-On-Demand service User chooses among a list of video and the chosen one is streamed in real time to the video client of the user s workstation For each movie a lot of details (Title, Runtime, Country, Release Date, Genre, Director, Case, Plot Outline) are stored and users can search a particular movie querying on one or more attributes Two kind of users can interact with gmod: TrailersManagers that can administer the db of movies (uploading new ones and attaching metadata to them); GILDA VO users (guest) can browse, search and choose a movie to be streamed. 21

gmod workflow VOMS get Role Genius Portal Metadata Catalogue AMGA Storage Elements LFC Catalogue User Workload Management System W N WN CE WN 22