Comparative evaluation of software tools accessing relational databases from a (real) grid environment
Giacinto Donvito, Guido Cuscela, Massimiliano Missiato, Vicenzo Spinoso, Giorgio Maggi
INFN-Bari
Outline
  The tools under evaluation
  The test plan
  Results of simple query test
  Grid test: description
  Grid test: results
  Grid test: Considerations
  About standards
The tools under evaluation
  G-DSE (INAF + INFN)
    Developed by INAF and INFN
      Edgardo Ambrosi (amborsi@cnaf.infn.it)
      Giuliano Taffoni (taffoni@oats.inaf.it)
      Andrea Barisani (lcars@infis.units.it)
    Project site: http://wwwas.oats.inaf.it/grid/g-dse
  OGSA-DAI
    Developed as part of the Open Middleware Infrastructure Institute UK (OMII-UK) project.
    Project site: http://www.ogsadai.org.uk/index.php
  AMGA
    Developed as part of glite by:
      Birger Koblitz: Initial design, project responsible.
      Tony Calanducci: RPM building and testing. User support.
      Salvatore Scifo: Java API maintainer
      Claudio Cherubino: PHP client
    Project site: http://amga.web.cern.ch/amga/
Enabling Grids for E-sciencE
The AMGA Metadata Catalog
  Metadata Catalog of the glite Middleware (EGEE)
  Main usages:
    File metadata service, providing means of describing and discovering data files required by users and their jobs
    Grid Enabled Database, for applications that need to structure their data, providing a database-like service supporting Grid Security
    Access to existing RDBMS from a grid environment
  Several groups of users among the EGEE community:
    High Energy Physics
    Biomed
  Main features:
    Dynamic schemas
    Hierarchical organization
    Security:
      Authentication: user/pass, X509 Certs and proxy, GSI
      Authorization: Unix-like, VOMS Role/Groups, ACLs per collection or per entry
    SQL-like query language:
      selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album like(/dlaudio:file, "%.mp3")
      Different join types (inner, outer, left, right)
      Support for views, indexes, constraints
INFSO-RI-508833
AMGA Implementation
  C++ multiprocess server
  Backends: Oracle, MySQL 4/5, PostgreSQL, SQLite
  Front Ends
    TCP text streaming
      High performance
      Client API for C++, Java, Python, Perl, PHP
    SOAP (web services)
      Interoperability
      Scalability
    WS-DAIR interface - NEW - http://indico.cern.ch/contributiondisplay.py?contribid=46&sessionid=28&confid=22351
  Standalone Python Library implementation
    Data stored on file system
  AMGA server runs on SLC3/4 (32 or 64bit), Fedora Core, Gentoo, Debian
AMGA provides replication/federation mechanisms
  Motivation
    Scalability: support hundreds/thousands of concurrent users
    Geographical distribution: hide network latency
    Reliability: no single point of failure
    DB independent replication: heterogeneous DB systems
    Disconnected computing: off-line access (laptops)
  Architecture
    Asynchronous replication
    Master-slave: writes only allowed on the master
    Application level replication: replicate metadata commands
    Partial replication: supports replication of only sub-trees of the metadata hierarchy
EGEE-II INFSO-RI-031688
The new G-DSE
  An updated version of the G-DSE was issued in December 2007. It is now installed at INAF-Trieste and at INFN-Bari and is under evaluation in the framework of the Grid & DB test campaign
  The new G-DSE includes a number of new features:
    Fully integrated with LFC (Logical File Catalogue)
    Support for Encryption
The new G-DSE/Features and improvements
  Support for GSI and SSL security
  Client/server structure
    A new server is forked by the listener when a new client asks for the service
    The server exits when the client disconnects
  Interactive
    It is possible to create an interactive session that can be used for several queries
  The client/server architecture overcomes the overhead problem
    Previous version: each job simply submits a single query, so it is prone to the Grid overhead
    Current version: the job initiates an interactive session that allows the user to submit multiple queries
    Backward compatibility: the user can still work in batch mode or with a single query per job
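The overhead argument on this slide can be illustrated with a back-of-the-envelope model. This is a minimal sketch, not a measurement: the function names, the per-job Grid overhead and the per-query time are illustrative assumptions, not figures from the tests.

```python
def batch_total_time(n_queries, grid_overhead_s, query_time_s):
    """Single query per job: every query pays the Grid overhead again."""
    return n_queries * (grid_overhead_s + query_time_s)


def interactive_total_time(n_queries, grid_overhead_s, query_time_s):
    """Interactive session: one job pays the Grid overhead once, then runs all queries."""
    return grid_overhead_s + n_queries * query_time_s


if __name__ == "__main__":
    # Hypothetical values: 60 s of Grid overhead per job, 0.2 s per query.
    for n in (1, 10, 100):
        print(n, batch_total_time(n, 60.0, 0.2), interactive_total_time(n, 60.0, 0.2))
```

Under these assumed values the two modes cost the same for a single query, and the gap grows linearly with the number of queries.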
OGSA-DAI
  An extensible framework, accessed via web services, that executes data-centric workflows involving heterogeneous data resources for the purposes of data access, integration, transformation and delivery within a Grid
  It is intended as a toolkit for building higher-level application-specific data services
  Grid is about sharing resources
  OGSA-DAI is about sharing structured data resources
OGSA-DAI generic web services
  Manipulate data using OGSA-DAI's generic web services
  Clients see the data in its raw format, e.g.
    Tables, columns, rows for relational data
    Collections, elements etc. for XML data
  Clients can obtain the schema of the data
  Clients send queries in the appropriate query language, e.g. SQL, XPath
The test plan
  Sequential tests
    Extraction of 10, 100, 1000, 10000, 100000 simple tuples
    Extraction of result sets of increasing size: O(KB), O(MB), O(100MB)
    Submission of complex queries (join, multiple queries, etc)
    Submission of INSERT, UPDATE, and DELETE queries
    Performance evaluation for a single action
    Evaluate the differences between LAN and WAN queries
  Concurrent tests
    With O(10) concurrent clients, extract zero, 10, 100, 1000 tuples
    Repeat the extractions with O(100) concurrent clients
  Use a common working environment for the three tools
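The sequential part of the plan can be sketched as a small timing harness. This is only an illustration under stated assumptions: an in-memory SQLite table stands in for the back-end DB, the table name `molecule` is borrowed from the queries used later in the grid test, and the helpers `build_test_db`, `time_extraction` and `run_sequential_plan` are hypothetical. The real tests went through the G-DSE, OGSA-DAI and AMGA clients, not a direct DB connection.

```python
import sqlite3
import time


def build_test_db(n_rows):
    """Create an in-memory table with n_rows simple tuples (stand-in back-end)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE molecule (id INTEGER PRIMARY KEY, descr TEXT)")
    conn.executemany(
        "INSERT INTO molecule (id, descr) VALUES (?, ?)",
        ((i, f"row-{i}") for i in range(1, n_rows + 1)),
    )
    conn.commit()
    return conn


def time_extraction(conn, n_tuples):
    """Return (rows_fetched, elapsed_seconds) for extracting the first n_tuples rows."""
    start = time.perf_counter()
    rows = conn.execute(
        "SELECT * FROM molecule WHERE id <= ?", (n_tuples,)
    ).fetchall()
    return len(rows), time.perf_counter() - start


def run_sequential_plan(conn, sizes=(10, 100, 1000, 10000, 100000)):
    """Time one extraction per result-set size, in increasing order."""
    return {n: time_extraction(conn, n) for n in sizes}
```

For example, `run_sequential_plan(build_test_db(100000))` returns one `(rows, seconds)` pair per result-set size.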
Security Tools GSI VOMS Authentication X OGSA-DAI Yes No Yes Transport Layer Security Yes Data Encryption Yes G-DSE Yes Yes Yes No X Yes AMGA Yes Yes Yes Yes
Simple query: test results (2)

Number of Tuples   OGSA-DAI CSV CLI (s)   OGSA-DAI CSV API (s)   OGSA-DAI CLI (s)   GDSE (s)   AMGA (s)
1                  5.07                   0.23                   4.7                0.180      0.024
5                  5.13                   0.26                   4.77               0.198      0.03
10                 5.31                   0.27                   4.65               0.215      0.03
50                 5.32                   0.304                  5.25               0.196      0.044
100                5.43                   0.324                  6.15               0.198      0.06
500                5.61                   0.45                   6.37               0.283      0.224
1000               6.63                   0.65                   7                  0.343      0.416
5000               7.74                   2.41                   14.8               0.853      2.106
10000              9.96                   4.61                   24.46              1.485      4.208
50000              19.31                  15.21                  95.63              6.521      21.454
100000             37.42                  34.21                  188.86             12.639     41.336

The new release of GDSE reduces not only the overhead for the smallest queries, but also the time spent retrieving the largest ones.
The previous table is available at: http://indico.cern.ch/contributiondisplay.py?contribid=71&sessionid=43&confid=18714
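The figures in the results table can be read more easily with a little arithmetic. The sketch below copies the timings from the table (decimal separators written as points) and uses two hypothetical helpers: `fastest_tool` picks the lowest time for a given row, and `marginal_cost_per_tuple` gives a rough per-tuple cost from the first and last rows only. It is a reading aid, not part of the original analysis.

```python
# Timings in seconds, copied from the simple-query results table.
TOOLS = ("OGSA-DAI CSV CLI", "OGSA-DAI CSV API", "OGSA-DAI CLI", "GDSE", "AMGA")
TIMINGS = {
    1: (5.07, 0.23, 4.7, 0.180, 0.024),
    5: (5.13, 0.26, 4.77, 0.198, 0.03),
    10: (5.31, 0.27, 4.65, 0.215, 0.03),
    50: (5.32, 0.304, 5.25, 0.196, 0.044),
    100: (5.43, 0.324, 6.15, 0.198, 0.06),
    500: (5.61, 0.45, 6.37, 0.283, 0.224),
    1000: (6.63, 0.65, 7.0, 0.343, 0.416),
    5000: (7.74, 2.41, 14.8, 0.853, 2.106),
    10000: (9.96, 4.61, 24.46, 1.485, 4.208),
    50000: (19.31, 15.21, 95.63, 6.521, 21.454),
    100000: (37.42, 34.21, 188.86, 12.639, 41.336),
}


def fastest_tool(n_tuples):
    """Name of the tool with the lowest time for a row of the table."""
    row = TIMINGS[n_tuples]
    return TOOLS[row.index(min(row))]


def marginal_cost_per_tuple(tool):
    """Rough per-tuple cost in seconds: (t(100000) - t(1)) / (100000 - 1)."""
    i = TOOLS.index(tool)
    return (TIMINGS[100000][i] - TIMINGS[1][i]) / (100000 - 1)


if __name__ == "__main__":
    for n in TIMINGS:
        print(n, fastest_tool(n))
```

With these numbers AMGA is the fastest tool up to 500 tuples and GDSE from 1000 tuples upwards, which matches the remark about GDSE and large result sets.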
Simple query: test results (2)
[Plot: query time versus number of tuples (1 to 100000 tuples). Series: OGSA-DAI CSV CLI (s), OGSA-DAI CSV API (s), OGSA-DAI CLI (s), GDSE (s), AMGA (s)]
Concurrent test from Grid: description
  To test the tools in an environment as close as possible to a real case, queries were issued from WNs over the grid
    The number of jobs/clients running at a given time was measured
    All the successful queries executed against each server were also counted
  The full EGEE infrastructure was used to execute the test
    The same proxy certificate was always used for all the jobs
  The test looks at the number of served queries, the stability and the reliability
  The test also allows us to weigh the ability of each tool to overcome network latency while delivering the output
Concurrent test from Grid: description
  A slightly modified version of JST (Job Submission Tool: a tool developed inside the LIBI project) was used to monitor, in real time, the status of each job over the grid
    JST allows the monitoring of any information coming out of a job (for example: successful completion of a query)
  Two different types of query were used:
    select * from molecule where id<2
      to select only a single tuple (< 1 Kbyte)
    select * from molecule where id<1001
      to select 1000 tuples (~500 Kbyte)
  The plots present the number of successful queries in a 5 minute interval at server side (averaged over several runs)
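The quantity shown in the plots, successful queries per 5 minute interval averaged over several runs, can be computed along these lines. This is a minimal sketch: the input format (a list of completion timestamps in seconds per run) and both function names are assumptions for illustration, not the JST implementation.

```python
from collections import Counter


def queries_per_interval(timestamps, interval_s=300):
    """Count successful queries in consecutive intervals of interval_s seconds."""
    counts = Counter(int(t // interval_s) for t in timestamps)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    # Intervals with no successful query are reported as 0, not skipped.
    return [counts.get(b, 0) for b in range(first, last + 1)]


def average_over_runs(runs):
    """Element-wise mean over several runs, truncated to the shortest run."""
    if not runs:
        return []
    length = min(len(r) for r in runs)
    return [sum(r[i] for r in runs) / len(runs) for i in range(length)]
```

For instance, timestamps `[0, 10, 299, 300, 900]` fall into four consecutive 5 minute intervals with counts 3, 1, 0 and 1.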
Concurrent test from Grid: how we run it
[Diagram. Components: UI, WMS, WNs, Monitoring DB, Software under test, Back-end DB. Labels: Job Submission, Monitoring information, Query to DB Access Middleware]
Concurrent test from Grid: how it works
  When the job lands on a WN, it:
    asks the central JST DB for the tool (AMGA, G-DSE, OGSA-DAI) to test
    prepares the environment to run the right client
    makes the query: for each tool, it uses a CLI with one single query
    checks the output of the query
    if the query was correctly executed, it updates the central JST DB and...
    repeats all operations 5000 more times
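The steps above can be sketched as a loop. This is a hedged illustration, not the JST code: `ask_tool`, `build_command` and `report_success` are hypothetical callables standing in for the central JST DB lookup, the environment preparation and the JST DB update, and the output check (a minimum number of non-empty lines) is an assumption.

```python
import subprocess


def run_cli_query(command):
    """Run one CLI query; return (exit_ok, stdout)."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout


def output_looks_valid(stdout, expected_rows):
    """Assumed check: the output holds at least expected_rows non-empty lines."""
    non_empty = [line for line in stdout.splitlines() if line.strip()]
    return len(non_empty) >= expected_rows


def job_loop(ask_tool, build_command, report_success,
             run=run_cli_query, iterations=5000, expected_rows=1):
    """Repeat: ask which tool to test, run one query, check it, report success."""
    successes = 0
    for _ in range(iterations):
        tool = ask_tool()              # hypothetical: query the central JST DB
        command = build_command(tool)  # hypothetical: prepare the right client
        ok, out = run(command)
        if ok and output_looks_valid(out, expected_rows):
            report_success(tool)       # hypothetical: update the central JST DB
            successes += 1
    return successes
```

Asking for the tool on every iteration is what lets the tool under test be changed while jobs are already running.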
Concurrent test from Grid: which clients we used
  AMGA: a loop of:
    ./mdcli "selectattr /molecule:id /molecule:id_name /molecule:id_molecule_type /molecule:id_data_class /molecule:id_sequence_type /molecule:seq /molecule:seq_check /molecule:mw /molecule:length1 /molecule:createddate /molecule:lastupdateddate /molecule:lastannoteddate /molecule:descr /molecule:id_database_division /molecule:id_note /molecule:id_db '/molecule:id<1000 and /molecule:id>0'" > out_test
    (with a given configuration file)
  GDSE: a loop of the new client:
    gdse-connect -h server.name -p ${PORT} "select * from molecule where id<2;"
  OGSA-DAI: we developed an ad-hoc client using:
    SQLQueryCSVAggregate class, with a single query built-in.
Concurrent test from Grid: possible issues
  Is the monitoring DB fast enough?
    We successfully tested the monitoring DB up to ~550 inserts per second and 200 concurrent clients
    This is far from the highest rate observed with the DB access tools under test
  Is there any network latency (or other kind of bias) on the client side?
    The jobs are distributed with FuzzyRank enabled, so several different farms are used at the same time
    This should avoid any systematic differences between the tools due to the clients' network (or other client-side problems)
  Is the bandwidth on the server a bottleneck?
    We monitored the server constantly and this never occurred
    The result set is quite small: up to 500 Kbytes
  Does the test result depend on the geographical job distribution?
    We can change the tool under test in real time, to be sure that no bias is introduced by a particular job distribution
Concurrent test from Grid: possible issues
  Is the back-end DB a bottleneck?
    We designed the test to run a lot of small select queries: this benefits from large memory buffers on the DB side
    Anyway, this happened in only one case, mostly due to AMGA's usage of the DB: we easily spotted it and fixed the problem
  Is there any other activity on the server side?
    We worked in a fully controlled environment
    Each server is used only for this test and no other service is running on the same machine
    We monitored the server using top and GANGLIA for the whole duration of each test
  How accurate is the measurement of the number of successful queries?
    We compared the monitoring procedure against the AMGA logs: we found a 0.021% difference between the number of queries logged in the monitoring DB and the number logged in the AMGA logs
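The accuracy check against the AMGA logs amounts to a relative difference between two counters. The sketch below shows the arithmetic only; the function name and the two counts in the example are hypothetical values chosen to reproduce a 0.021% difference, not the counts recorded in the test.

```python
def relative_difference_percent(monitoring_count, server_log_count):
    """Difference between the two counters, as a percentage of the server log count."""
    if server_log_count == 0:
        raise ValueError("server log count must be non-zero")
    return abs(monitoring_count - server_log_count) / server_log_count * 100


if __name__ == "__main__":
    # Hypothetical counts: 21 queries out of 100000 missing from the monitoring DB.
    print(relative_difference_percent(99979, 100000))
```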
Concurrent test from Grid: results
[Plot, 1 Tuple: # of queries every 5 minutes versus # of clients (0 to 150). Series: AMGA, OGSA-DAI]
  It seems that AMGA can still scale as the number of concurrent clients increases, with the given hardware in this scenario
  Good reliability and stability of the numbers achieved, which extends also beyond the 150 clients shown here
Concurrent test from Grid: results
[Plot, 1000 Tuples: values versus # of clients (0 to 150). Series: AMGA, OGSA-DAI]
  Good reliability and stability of the numbers achieved, which extends also beyond the 150 clients shown here
Concurrent test from Grid: comments
  AMGA shows a very high number of queries served per time unit:
    it puts much more load on the back-end DB
    SQL support is increasing, but it is still not SQL compliant (maybe in the future: AMGA 2.0?)
    In this test we used:
      the TCP connection to the server (not the full Web Service interface)
      SSL session recycling on the server side
    Very interesting replication and federation functionality
    An AMGA Web Interface is available to browse/edit/create entries, make queries, define schemas, define permissions
  GDSE was not used in this test:
    Still in beta release: the developers need some time to improve its stability, but:
      really fast for the largest queries
      good interoperability with glite infrastructure and services
      Client/server architecture
Concurrent test from Grid: comments
  OGSA-DAI shows a good number of queries served per time unit:
    The result was retrieved using CSV format
    Some stability issues were encountered when more than 80 concurrent clients were running against the same server, but...
    The developers investigated the problem and gave us suggestions on how to tune the software to achieve better performance
    Now good stability also under high load (proven up to 190 concurrent clients)
    Full Web-services solution for the query submission
    Advanced workflow features provided
Toward a standard and common interface?
  As is already the case for flat-file access (many communities are successfully using SRM), there is a need for a standard interface for querying databases.
  There are already standards agreed within the OGF:
    WS-DAI: a collection of generic data interfaces developed by the Database Access and Integration Services (DAIS) Working Group.
    WS-DAIR: a specification for a collection of data access interfaces for relational data resources, which extends interfaces defined in the Web Services Data Access and Integration document [WS-DAI].
    WS-DAIX: a specification for a collection of data access interfaces for XML data resources, which extends interfaces defined in the Web Services Data Access and Integration document [WS-DAI].
Status of the WS-DAI interfaces
  We are starting a comparison of the available WS-DAI interfaces
  What is the status?
    OGSA-DAI:
      WS-DAIX already released... we are starting our tests
      WS-DAIR will be released 3 months from now
    AMGA:
      see in this workshop -> A WS-DAIR interface for the AMGA metadata catalogue
      We asked to test this interface and we will run the tests as soon as the developers give it to us
    No news from other tools
Acknowledgments
  TAFFONI, Giuliano (INAF)
  VUERLI, Claudio (INAF)
  BARISANI, Andrea (INAF)
  PASIAN, Fabio (INAF)
  MANNA, Valeria (INAF)
  GISEL, Andreas (CNR-ITB)
  CALANDUCCI, Antonio (INFN)
  AIFTIMIEI, Cristina (INFN)
  ATUL, Jain (INFN+Politecnico Bari)
  PIERRO, Antonio (INFN)
Work supported in part by BioinfoGRID and LIBI projects