Comparative evaluation of software tools accessing relational databases from a (real) grid environment


Giacinto Donvito, Guido Cuscela, Massimiliano Missiato, Vincenzo Spinoso, Giorgio Maggi (INFN-Bari)

Outline
- The tools under evaluation
- The test plan
- Results of the simple query test
- Grid test: description
- Grid test: results
- Grid test: considerations
- About standards

The tools under evaluation

G-DSE (INAF + INFN)
- Developed by INAF and INFN: Edgardo Ambrosi (amborsi@cnaf.infn.it), Giuliano Taffoni (taffoni@oats.inaf.it), Andrea Barisani (lcars@infis.units.it)
- Project site: http://wwwas.oats.inaf.it/grid/g-dse

OGSA-DAI
- Developed as part of the Open Middleware Infrastructure Institute UK (OMII-UK) project
- Project site: http://www.ogsadai.org.uk/index.php

AMGA
- Developed as part of glite by: Birger Koblitz (initial design, project responsible), Tony Calanducci (RPM building and testing, user support), Salvatore Scifo (Java API maintainer), Claudio Cherubino (PHP client)
- Project site: http://amga.web.cern.ch/amga/

Enabling Grids for E-sciencE

The AMGA Metadata Catalog

Metadata catalog of the glite middleware (EGEE). Main usages:
- File metadata service, providing means of describing and discovering data files required by users and their jobs
- Grid-enabled database, for applications that need to structure their data, providing a database-like service supporting Grid security
- Access to existing RDBMS from a grid environment

Several groups of users among the EGEE community: High Energy Physics, Biomed.

Main features:
- Dynamic schemas
- Hierarchical organization
- Security:
  - Authentication: user/pass, X509 certs and proxies, GSI
  - Authorization: Unix-like, VOMS roles/groups, ACLs per collection or per entry
- SQL-like query language, e.g.:
  selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album like(/DLAudio:FILE "%.mp3")
- Different join types (inner, outer, left, right)
- Support for views, indexes, constraints
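As an illustration of the SQL-like query language (this is not AMGA code), the selectattr example above corresponds roughly to the following SQL, run here against a toy SQLite table; the dlaudio table and its columns are our assumption for the sketch:

```python
import sqlite3

# Toy stand-in for the /DLAudio collection; the schema is hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dlaudio (file TEXT, author TEXT, album TEXT)")
con.executemany(
    "INSERT INTO dlaudio VALUES (?, ?, ?)",
    [("track01.mp3", "A. Artist", "First"),
     ("notes.txt", "B. Writer", "None"),
     ("track02.mp3", "A. Artist", "First")],
)

# Rough SQL equivalent of:
#   selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album
#              like(/DLAudio:FILE "%.mp3")
rows = con.execute(
    "SELECT file, author, album FROM dlaudio "
    "WHERE file LIKE '%.mp3' ORDER BY file"
).fetchall()
print(rows)
```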

AMGA Implementation
- C++ multiprocess server
  - Backends: Oracle, MySQL 4/5, PostgreSQL, SQLite
  - Front ends:
    - TCP text streaming (high performance); client APIs for C++, Java, Python, Perl, PHP
    - SOAP (web services), for interoperability and scalability; WS-DAIR interface (new): http://indico.cern.ch/contributiondisplay.py?contribid=46&sessionid=28&confid=22351
- Standalone Python library implementation (data stored on the file system)
- The AMGA server runs on SLC3/4 (32 or 64 bit), Fedora Core, Gentoo, Debian

AMGA provides a replication/federation mechanism. Motivation:
- Scalability: support hundreds/thousands of concurrent users; geographical distribution hides network latency
- Reliability: no single point of failure
- DB-independent replication: heterogeneous DB systems
- Disconnected computing: off-line access (laptops)

Architecture:
- Asynchronous replication
- Master-slave: writes are only allowed on the master
- Application-level replication: replicates metadata commands
- Partial replication: supports replication of only sub-trees of the metadata hierarchy

The new G-DSE

An updated version of the G-DSE was issued in December 2007; it is now installed at INAF-Trieste and at INFN-Bari and under evaluation in the framework of the Grid & DB test campaign. The new G-DSE includes a number of new features:
- Fully integrated with the LFC (Logical File Catalogue)
- Support for encryption

The new G-DSE: features and improvements
- Support for GSI and SSL security
- Client/server structure:
  - A new server is forked by the listener when a new client asks for the service
  - The server exits when the client disconnects
- Interactive: it is possible to create an interactive session that can be used for several queries
- The client/server architecture overcomes the overhead problem:
  - Previous version: each job simply submitted a single query, so it was prone to the Grid overhead
  - Current version: the job initiates an interactive session that allows the user to submit multiple queries
  - Backward compatibility: the user can still work in batch mode, with one query per job
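The gain from an interactive session can be sketched outside the Grid: reusing one connection for many queries amortizes the per-connection setup cost that the old one-query-per-job model paid every time. A minimal illustration with in-memory SQLite standing in for the remote database (the function names and timings are illustrative, not G-DSE's API):

```python
import sqlite3
import time

DB = ":memory:"  # in-memory SQLite stands in for the remote database

def batch_mode(n):
    """Old model: one connection (one 'job') per query."""
    out = []
    for _ in range(n):
        con = sqlite3.connect(DB)       # setup cost paid every query
        out.append(con.execute("SELECT 1").fetchone()[0])
        con.close()
    return out

def interactive_mode(n):
    """New model: one session reused for all the queries."""
    con = sqlite3.connect(DB)           # setup cost paid once
    out = [con.execute("SELECT 1").fetchone()[0] for _ in range(n)]
    con.close()
    return out

for mode in (batch_mode, interactive_mode):
    t0 = time.perf_counter()
    mode(1000)
    print(f"{mode.__name__}: {time.perf_counter() - t0:.4f} s")
```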

OGSA-DAI

An extensible framework, accessed via web services, that executes data-centric workflows involving heterogeneous data resources for the purposes of data access, integration, transformation and delivery within a Grid. It is intended as a toolkit for building higher-level, application-specific data services. Grid is about sharing resources; OGSA-DAI is about sharing structured data resources.

OGSA-DAI generic web services
- Data are manipulated using OGSA-DAI's generic web services
- Clients see the data in its raw format, e.g. tables, columns and rows for relational data; collections, elements, etc. for XML data
- Clients can obtain the schema of the data
- Clients send queries in the appropriate query language, e.g. SQL, XPath

The test plan
- Sequential tests:
  - Extraction of 10, 100, 1000, 10000, 100000 simple tuples
  - Extraction of result sets of increasing size: O(kB), O(MB), O(100 MB)
  - Submission of complex queries (joins, multiple queries, etc.)
  - Submission of INSERT, UPDATE and DELETE queries
  - Performance evaluation for a single action
  - Evaluation of the differences between LAN and WAN queries
- Concurrent tests:
  - With O(10) concurrent clients, extract zero, 10, 100, 1000 tuples
  - Repeat the extractions with O(100) concurrent clients
- Use a common working environment for the three tools
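The sequential part of the plan can be scripted generically. A minimal sketch, with a local SQLite table standing in for the remote back-ends (the molecule table name echoes the queries used later; the sizes mirror the plan):

```python
import sqlite3
import time

# Populate a stand-in table with 100000 rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE molecule (id INTEGER PRIMARY KEY, seq TEXT)")
con.executemany("INSERT INTO molecule VALUES (?, ?)",
                ((i, "ACGT" * 10) for i in range(1, 100001)))

# Time the extraction of increasing numbers of tuples.
timings = {}
for n in (10, 100, 1000, 10000, 100000):
    t0 = time.perf_counter()
    rows = con.execute("SELECT * FROM molecule WHERE id <= ?", (n,)).fetchall()
    timings[n] = time.perf_counter() - t0
    assert len(rows) == n

for n, t in timings.items():
    print(f"{n:>6} tuples: {t:.4f} s")
```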

Security

Tool     | GSI | VOMS | Authentication | Transport Layer Security | Data Encryption
OGSA-DAI | Yes | No   | Yes            | Yes                      | Yes
G-DSE    | Yes | Yes  | Yes            | No                       | Yes
AMGA     | Yes | Yes  | Yes            | Yes                      | Yes

Simple query: test results (2)

Number of tuples | OGSA-DAI CSV CLI (s) | OGSA-DAI CSV API (s) | OGSA-DAI CLI (s) | GDSE (s) | AMGA (s)
1      | 5,07  | 0,23  | 4,7    | 0,180  | 0,024
5      | 5,13  | 0,26  | 4,77   | 0,198  | 0,03
10     | 5,31  | 0,27  | 4,65   | 0,215  | 0,03
50     | 5,32  | 0,304 | 5,25   | 0,196  | 0,044
100    | 5,43  | 0,324 | 6,15   | 0,198  | 0,06
500    | 5,61  | 0,45  | 6,37   | 0,283  | 0,224
1000   | 6,63  | 0,65  | 7      | 0,343  | 0,416
5000   | 7,74  | 2,41  | 14,8   | 0,853  | 2,106
10000  | 9,96  | 4,61  | 24,46  | 1,485  | 4,208
50000  | 19,31 | 15,21 | 95,63  | 6,521  | 21,454
100000 | 37,42 | 34,21 | 188,86 | 12,639 | 41,336

The new release of GDSE reduces not only the overhead for the smallest queries, but also the time spent retrieving the largest ones. The previous table is available at: http://indico.cern.ch/contributiondisplay.py?contribid=71&sessionid=43&confid=18714
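The GDSE column above suggests a roughly affine cost model, time ≈ fixed overhead + per-tuple cost. A least-squares fit over that column makes the two components explicit (values taken from the table, with comma decimals written as dots; the affine model itself is our simplification):

```python
# Least-squares fit of t = a + b*n to the GDSE column of the table above.
sizes = [1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000]
gdse  = [0.180, 0.198, 0.215, 0.196, 0.198, 0.283, 0.343,
         0.853, 1.485, 6.521, 12.639]

n = len(sizes)
sx, sy = sum(sizes), sum(gdse)
sxx = sum(x * x for x in sizes)
sxy = sum(x * y for x, y in zip(sizes, gdse))

b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # per-tuple cost (s/tuple)
a = (sy - b * sx) / n                          # fixed overhead (s)
print(f"fixed overhead ~{a * 1000:.0f} ms, ~{b * 1e6:.0f} us per tuple")
```

The fit is dominated by the largest result sets, so the per-tuple slope mostly reflects bulk transfer speed, while the intercept approximates the per-query setup cost that the new GDSE release reduced.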

Simple query: test results (2)

[Chart: query time (log scale) vs. number of tuples, from 1 to 100000, for OGSA-DAI CSV CLI, OGSA-DAI CSV API, OGSA-DAI CLI, GDSE and AMGA]

Concurrent test from Grid: description
- In order to test the tools in an environment as close as possible to a real case, queries were issued from WNs over the grid
- The number of jobs/clients running at a given time was measured, and all the successful queries executed against each server were counted
- The full EGEE infrastructure was used to execute the test, always using the same proxy certificate for all the jobs
- The test measures the number of served queries, stability and reliability
- The test also weighs the ability of each specific tool to overcome the network latency while delivering the output

Concurrent test from Grid: description
- A slightly modified version of JST (Job Submission Tool, a tool developed inside the LIBI project) was used to monitor, in real time, the status of each job over the grid
- JST, in fact, allows the monitoring of any information coming out of a job (for example, the successful completion of a query)
- Two different types of query were used:
  - "select * from molecule where id<2", to select only a single tuple (< 1 kByte)
  - "select * from molecule where id<1001", to select 1000 tuples (~500 kByte)
- The plots present the number of successful queries in a 5-minute interval at the server side (averaged over several runs)
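The per-interval counting behind the plots can be sketched as: take the completion timestamps recorded in the monitoring DB and bin them into 5-minute windows. A minimal sketch (the timestamps are fabricated for the example; the real data come from JST's monitoring DB):

```python
from collections import Counter

BIN = 300  # seconds: 5-minute windows, as in the plots

def bin_successes(timestamps):
    """Count successful queries per 5-minute interval."""
    return Counter(int(t) // BIN for t in timestamps)

# Fabricated completion times, in seconds since the start of the run.
times = [12, 45, 299, 301, 650, 899, 905]
print(sorted(bin_successes(times).items()))
# → [(0, 3), (1, 1), (2, 2), (3, 1)]
```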

Concurrent test from Grid: how we run it

[Diagram: the UI submits jobs through the WMS to many WNs; each WN sends its queries to the DB access middleware (the software under test), which talks to the back-end DB; monitoring information flows from the WNs to the monitoring DB]

Concurrent test from Grid: how it works

When the job lands on a WN, it:
- asks the central JST DB for the tool (AMGA, G-DSE, OGSA-DAI) to test
- prepares the environment to run the right client
- makes the query: for each tool, it uses a CLI with one single query
- checks the output of the query
- if the query was correctly executed, it updates the central JST DB and...
- ...repeats all operations 5000 more times
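The per-job loop above can be sketched as follows; fetch_tool_to_test, run_query and report_success are stubs of our invention standing in for the real JST/central-DB interactions:

```python
import random

def fetch_tool_to_test():
    """Stub: ask the central JST DB which tool this iteration should test."""
    return random.choice(["AMGA", "G-DSE", "OGSA-DAI"])

def run_query(tool):
    """Stub: invoke the tool's CLI with one single query; here, always succeed."""
    return True

def report_success(tool, counts):
    """Stub: record a successful query in the central JST DB."""
    counts[tool] = counts.get(tool, 0) + 1

def worker_node_job(iterations=5000):
    counts = {}
    for _ in range(iterations):
        tool = fetch_tool_to_test()   # the tool under test can change in real time
        if run_query(tool):           # check the query output...
            report_success(tool, counts)  # ...and update the central JST DB
    return counts

print(worker_node_job(10))
```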

Concurrent test from Grid: which clients we used

AMGA: a loop of:
./mdcli "selectattr /molecule:id /molecule:id_name /molecule:id_molecule_type /molecule:id_data_class /molecule:id_sequence_type /molecule:seq /molecule:seq_check /molecule:mw /molecule:length1 /molecule:createddate /molecule:lastupdateddate /molecule:lastannoteddate /molecule:descr /molecule:id_database_division /molecule:id_note /molecule:id_db '/molecule:id<1000 and /molecule:id>0'" > out_test
(with a given configuration file)

GDSE: a loop of the new client:
gdse-connect -h server.name -p ${PORT} "select * from molecule where id<2;"

OGSA-DAI: we developed an ad-hoc client using the SQLQueryCSVAggregate class, with a single query built in.

Concurrent test from Grid: possible issues
- Is the monitoring DB fast enough?
  - We successfully tested the monitoring DB up to ~550 inserts per second and 200 concurrent clients; this is far beyond the highest rate observed with the DB access tools under test
- Is there any network latency (or other kind of bias) on the client side?
  - The jobs are distributed with FuzzyRank enabled, so several different farms are used at the same time; this should avoid any systematic difference between the tools due to the clients' network (or other client-side problems)
- Is the bandwidth on the server a bottleneck?
  - We monitored the server constantly and this never occurred; the result set is quite small (up to 500 kBytes)
- Does the test result depend on the geographical job distribution?
  - We can change the tool under test in real time, in order to be sure that no bias is introduced by a particular job distribution

Concurrent test from Grid: possible issues
- Is the back-end DB a bottleneck?
  - We designed the test to issue a lot of small SELECT queries, which benefit from large memory buffers on the DB side
  - Anyway, this happened only in one case, and we easily spotted it: it was mostly due to AMGA's usage of the DB, and we fixed the problem
- Is there any other activity on the server side?
  - We worked in a completely controlled environment: each server is used only for this test and no other service runs on the same machine
  - We monitored the servers using top and GANGLIA for the whole duration of each test
- How accurate is the measurement of the number of successful queries?
  - We compared the monitoring procedure against the AMGA logs: we found a 0.021% difference between the number of queries logged into the monitoring DB and the number logged into the AMGA logs

Concurrent test from Grid: results

[Chart: number of 1-tuple queries served every 5 minutes vs. number of clients (0, 5, 15, 30, 50, 75, 100, 150), AMGA vs. OGSA-DAI]

It seems that AMGA can still scale with an increasing number of concurrent clients on the given hardware in this scenario. Good reliability and stability of the numbers achieved, which extends also beyond the 150 clients shown here.

Concurrent test from Grid: results

[Chart: number of 1000-tuple queries served every 5 minutes vs. number of clients (0, 5, 15, 30, 50, 100, 150), AMGA vs. OGSA-DAI]

Good reliability and stability of the numbers achieved, which extends also beyond the 150 clients shown here.

Concurrent test from Grid: comments

AMGA shows a very high number of queries served per time unit:
- It puts much more load on the back-end DB
- SQL support is increasing, but it is still not SQL compliant (maybe in the future: AMGA 2.0?)
- In this test we used the TCP connection to the server (not the full Web Service interface), with SSL session recycling on the server side
- Very interesting replication and federation functionality
- An AMGA web interface is available to browse/edit/create entries, make queries, define schemas and define permissions

GDSE was not used in this test:
- It is still in beta release: the developers need some time to work on stability, but:
  - it is really fast for the largest queries
  - good interoperability with the glite infrastructure and services
  - client/server architecture

Concurrent test from Grid: comments

OGSA-DAI shows a good number of queries served per time unit:
- The result was retrieved using the CSV format
- Some stability issues were encountered when more than 80 concurrent clients were running against the same server, but...
- ...the developers investigated the problem and gave us suggestions on how to tune the software to achieve better performance
- Now good stability also under high load (proven up to 190 concurrent clients)
- Full web-services solution for the query submission
- Advanced workflow features provided

Toward a standard and common interface?

As is already the case for flat-file access (many communities are successfully using SRM), there is the need for a standard interface for querying databases. There are already standards agreed within the OGF:
- WS-DAI: a collection of generic data interfaces developed by the Database Access and Integration Services (DAIS) Working Group
- WS-DAIR: a specification for a collection of data access interfaces for relational data resources, which extends the interfaces defined in the Web Services Data Access and Integration document [WS-DAI]
- WS-DAIX: a specification for a collection of data access interfaces for XML data resources, which extends the interfaces defined in the Web Services Data Access and Integration document [WS-DAI]

Status of the WS-DAI interfaces

We are starting a comparison with the available WS-DAI interfaces. What is the status?
- OGSA-DAI: WS-DAIX is already released, and we are starting our tests; WS-DAIR will be released three months from now
- AMGA: see, in this workshop, "A WS-DAIR interface for the AMGA metadata catalogue"; we asked to test this interface and will do the tests as soon as the developers give us access
- No news from the other tools

Acknowledgments

TAFFONI, Giuliano (INAF); VUERLI, Claudio (INAF); BARISANI, Andrea (INAF); PASIAN, Fabio (INAF); MANNA, Valeria (INAF); GISEL, Andreas (CNR-ITB); CALANDUCCI, Antonio (INFN); AIFTIMIEI, Cristina (INFN); JAIN, Atul (INFN + Politecnico di Bari); PIERRO, Antonio (INFN)

Work supported in part by the BioinfoGRID and LIBI projects