Experiences Running OGSA-DQP Queries Against a Heterogeneous Distributed Scientific Database

Similar documents
Frequently Asked Questions

Extending the SDSS Batch Query System to the National Virtual Observatory Grid

DRYAD: DISTRIBUTED DATA- PARALLEL PROGRAMS FROM SEQUENTIAL BUILDING BLOCKS

Edinburgh Research Explorer

SDSS Dataset and SkyServer Workloads

Research and Design Application Platform of Service Grid Based on WSRF

1. Attempt any two of the following: 10 a. State and justify the characteristics of a Data Warehouse with suitable examples.

Parallel In-situ Data Processing Techniques

Background. $VENDOR wasn t sure either, but they were pretty sure it wasn t their code.

GC-APWG Global Changing Asian-Pacific Wide Grid. Dr. Guoqing Li, CEODE/CAS 25,May,2008 UN GAID E-SDDC, Shanghai

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster

Point-in-Polygon linking with OGSA-DAI

<Insert Picture Here> DBA s New Best Friend: Advanced SQL Tuning Features of Oracle Database 11g

Database Architectures

VIRTUAL OBSERVATORY TECHNOLOGIES

Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets

Sample Query Generation: Joins and Projection. Jason Ganzhorn

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or

Application Design and Development: October 30

NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE. Nicolas Buchschacher - University of Geneva - ADASS 2018

UNIT V *********************************************************************************************

Enterprise Reporting -- APEX

WhatsApp Group Data Analysis with R

Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection

Scalable Hybrid Search on Distributed Databases

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Oracle - Oracle Database: Program with PL/SQL Ed 2

Enhanced Performance of Database by Automated Self-Tuned Systems

Oracle 1Z Oracle Database 11g Release 2- SQL Tuning. Download Full Version :

CS122 Lecture 3 Winter Term,

Evolution of Database Systems

Evaluation Guide for ASP.NET Web CMS and Experience Platforms

Theme 7 Group 2 Data mining technologies Catalogues crossmatching on distributed database and application on MWA absorption source finding

Indexing Strategies of MapReduce for Information Retrieval in Big Data

NoSQL database and its business applications

Pan-STARRS Introduction to the Published Science Products Subsystem (PSPS)

Actual4Test. Actual4test - actual test exam dumps-pass for IT exams

ROCI 2: A Programming Platform for Distributed Robots based on Microsoft s.net Framework

What is Standard APEX? TOOLBOX FLAT DESIGN CARTOON PEOPLE

Call: JSP Spring Hibernate Webservice Course Content:35-40hours Course Outline

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES

Surfing the SAS cache

Sharing SDSS Data with the World

Virtual Memory. Chapter 8

Pagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft s ConvergDB

Representing LEAD Experiments in a FEDORA digital repository

Introduction to Grid Computing

THE VEGA PERSONAL GRID: A LIGHTWEIGHT GRID ARCHITECTURE

An SQL-based approach to physics analysis

Increasing Database Performance through Optimizing Structure Query Language Join Statement

Transaction Analysis using Big-Data Analytics

CC-IN2P3 / NCSA Meeting May 27-28th,2015

Oracle Big Data SQL High Performance Data Virtualization Explained

Vijetha Shivarudraiah Sai Phalgun Tatavarthy. CSc 8711 Georgia State University

Embedded Technosolutions

Oracle Database 12c: Program with PL/SQL Duration: 5 Days Method: Instructor-Led

Designing dashboards for performance. Reference deck

Oracle BI 11g R1: Build Repositories

TAP services integration at IA2 data center

Transformer Looping Functions for Pivoting the data :

Tosska SQL Tuning Expert Pro for Oracle

A Fast and High Throughput SQL Query System for Big Data

Course Description. Audience. Prerequisites. At Course Completion. : Course 40074A : Microsoft SQL Server 2014 for Oracle DBAs

PHP & MySQL For Dummies, 4th Edition PDF

Performance Bottleneck Analysis of Web Applications with eassist

Network Programmability with Cisco Application Centric Infrastructure

MANAGING DATA(BASES) USING SQL (NON-PROCEDURAL SQL, X401.9)

Chapter 3. Database Architecture and the Web

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes?

Vertical Partitioning for Query Processing over Raw Data

What We Have Already Learned. DBMS Deployment: Local. Where We Are Headed Next. DBMS Deployment: 3 Tiers. DBMS Deployment: Client/Server

Sql Server 2008 Move Objects To New Schema

Databases in the Cloud

OTN Case Study: Amadeus Using Data Guard for Disaster Recovery & Rolling Upgrades

Online Optimization of VM Deployment in IaaS Cloud

The Virtual Observatory and the IVOA

Learn Well Technocraft

Database Management Systems

- How Base One s Scroll Cache simplifies database browsing & improves performance

Course 40045A: Microsoft SQL Server for Oracle DBAs

Improved Information Retrieval Performance on SQL Database Using Data Adapter

Course Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course:

Review on Managing RDF Graph Using MapReduce

Introduction to Relational Databases

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

Using Relational Databases for Digital Research

Learning Objectives : This chapter provides an introduction to performance tuning scenarios and its tools.

Data integration using service composition in data service middleware

Sharepoint 2010 Content Database Has Schema Version Not Supported Farm

About Gluent. we liberate enterprise data. We are long term Oracle Database & Data Warehousing guys long history of performance & scaling

Oracle Compare Two Database Tables Sql Query Join

This module presents the star schema, an alternative to 3NF schemas intended for analytical databases.

Building data infrastructure for geo-computation by spatial information grid

Oracle Database: Program with PL/SQL

Oracle Database: Program with PL/SQL Ed 2

Number Systems MA1S1. Tristan McLoughlin. November 27, 2013

Peer-to-Peer Systems. Chapter General Characteristics

Introduction to GT3. Introduction to GT3. What is a Grid? A Story of Evolution. The Globus Project

AutoPart : automating schema design for large scientific databases using data partitioning

Transcription:

2009 15th International Conference on Parallel and Distributed Systems Experiences Running OGSA-DQP Queries Against a Heterogeneous Distributed Scientific Database Helen X Xiang Computer Science, University of Hertfordshire, UK h.xiang@herts.ac.uk Abstract This paper explores the use of emerging Grid technologies for the manipulation of a scientific dataset with complex schema. We focus on the potential of a wellknown Grid middleware OGSA-DQP for querying such datasets. In particular, we investigate running OGSA-DQP queries against SQL Server and Oracle database to access data in a complex schema of a real scientific database the Sloan Digital Sky Survey (SDSS) database. We used two common databases which support complex schemas and very large datasets: SQL Server and Oracle. The paper briefly provides the background information on the SDSS database. It then examines the running of a few OGSA- DQP queries and the query plans. Various join queries run successfully through OGSA-DQP after we optimized the implementation of some of the OGSA-DQP operations. 1. Introduction Started in 2000, the Sloan Digital Sky Survey (SDSS) project has gathered and produced more than 60 terabytes of data (inducing images and catalogues data). It has built a detailed digital map of the visible stars and galaxies of the night sky [7]. The SDSS data produced by is summarised in a multi-terabyte relational database containing photometric objects and spectroscopic information this is available online via the SkyServer (http://skyserver.sdss.org). Our recent papers [2, 3, 4, 5] described how we created an experimental distributed SDSS DR5 database with Grid middleware which was based on Open Grid Services Architecture Distributed Query Processing (OGSA-DQP) [1]. Please refer to [6] for details of the OGSA-DQP operators. This experiment exposes some limitations of the technologies. In particular, this paper will focus on running OGSA-DQP queries against the a heterogeneous distributed SDSS database. We describe the modifications and improvements we made for running the OGSA-DQP queries against a complex scientific database schema. follow-up paper of [6]. 2 A Distributed SDSS Database This is a For the propose of understanding the OGSA-DQP queries in this paper, this section describes how we distributed a reduced SDSS database called MyBestDR5 among three different hosts machines located in different university buildings. Oracle is running on two of the hosts, while Microsoft SQL Server is running on the other. First, the original SDSS MyBestDR5 database s PhotoObjAll schema was split into three parts using objid. We then deployed the migrated SDSS Oracle schema on the two Oracle databases (on hosts 1 and 2) and extracted the relevant data files from the orignal MyBestDR5 database in SQL Server, with objid partitioning conditions. We next injected the appropriate data files to databases on hosts 1 and 2 used the SQL*Loader,. The third partition (on host 3) was formed by a cut down version of MyBestDR5 database after the appropriate records were removed. We now have a heterogeneous distributed SDSS MyBestDR5 database over hosts 1, 2, and 3. We then exposed the distributed SDSS MyBestDR5 database using the OGSA-DAI middleware and configure it with the OGSA-DQP toolkit before running the distributed queries. We installed an OGSA-DAI instance on hosts 1, 2 and 3 and deployed OGSA-DAI data service them with the following full URLs: http://host1:8080/wsrf/services/ ogsadai1/dataservicebuck http://host2:8080/wsrf/services/ ogsadai2/dataserviceicg http://host3:8080/wsrf/services/ogsadai3/ DataServiceAce The SDSS MyBestDR5 database is now distributed among three hosts with heterogeneous DBMSs: two Or- 1521-9097/09 $26.00 2009 IEEE DOI 10.1109/ICPADS.2009.142 706

acle and one SQL Server with the distributed SDSS data resources exposed via the OGSA-DAI middleware. Finally, we installed the OGSA-DQP 3.2 (Tech Preview) toolkit on the distributed sites for querying the distributed SDSS MyBestDR5 database. We also deployed an OGSA- DQP evaluator service On each host. The WSDL of the deployed OGSA-DQP evaluators can be viewed via the following URLs: http://host1:8081/dqp-evaluator/ services/queryevaluationservice http://host2:8081/dqp-evaluator/ services/queryevaluationservice http://host3:9080/dqp-evaluator/services/ QueryEvaluationService An OGSA-DQP coordinator can be installed on a separate host or on one of the three distributed SDSS hosts on the OGSA-DAI Data Services deployed earlier on. The client application is now ready to interact with the OGSA-DQP coordinator service to create an OGSA-DQP coordinator instance for executing queries. The architecture of the distributed SDSS MyBestDR5 database system is illustrated in Figure 1. We are now ready to run some OGSA-DQP queries against this distributed database. 3 Running OGSA-DQP Queries We already examined the OGSA-DQP query plan for a simple local query, and the different SOAP interactions between DQP components in [6]. This section discusses running more complicated queries through OGSA-DQP. We could now run DQP queries against a table on local databases and return one or more rows. In [6] we successfully ran the query: SELECT SKYVERSION FROM PHOTOOBJALL WHERE OBJID=587726014001315907 (1) through OGSA-DQP, against an SDSS Oracle database or SDSS SQL Server database. Next we wanted to try OGSA-DQP queries that can return many rows. We tried to run the following query on the Oracle database on the first host: SELECT SKYVERSION FROM PHOTOOBJALL (2) If successful, this query should return 68, 118 rows containing column value 1. Initially the query failed to complete. On investigation this proved to be caused by a timeout in the OGSA-DAI codes, which was resolved by modifying OGSA-DQP client class. We discovered we had to increase the memory on both DQP coordinator and OGSA-DQP evaluator to 256MB and 750MB respectively for running query 2. But the query still failed with a uk.org.ogsadai.client.toolkit.activity.- delivery.streamdataexception exception. The client returned this message after 11 minutes and 23 seconds. But on the server, the query went on running for 3-4 hours until we killed the Tomcats. We added some debugging lines in the OGSA-DAI source code and reran query 2. We proceeeded to investigate the suspected timeout problem that was preventing larger queries running. After looking into the code for OGSA-DAI and Axis we eventually discovered that this timeout could be disabled by modifying the OGSA-DQP Client class. The solution is to call the method: settimeout(0) on the DataService object. This allowed us for the first time to run queries against the distributed database returning some thousands of rows. The query 2 through OGSA-DQP now worked against the Oracle database on the first host, and returned the correct result 68, 118 rows of 1 s. However, it took a very long time: 452 minutes and 52 seconds in total. We needed to optimise this if we were to run much larger queries against the full SDSS database. Figure 2 shows the query plan for query 2. The query plan is very similar to the one in [6]. The only real difference is in a predicate in the TABLE SCAN node (the WHERE clause). This predicate is not shown in the figure. The SQL blob in Figure 2 is not really a feature of the query plan. It just helps explain how the plan is executed, as we will now discuss. We had already noticed from the SOAP messages captured in [6] that the TABLE SCAN operation was loading all columns of the selected rows of the target table. This might not be a problem if the target tables are small with few columns. But it is a big problem for larger databases with wide tables, and can easily be a performance bottleneck. Most of the tables in SDSS schema are large with many rows and many columns. In particular the table PHOTOOBJALL is a very wide table with 446 columns. The standard OGSA-DQP TABLE SCAN operation loads all of the 446 columns of the PHOTOOBJALL table even if only one is column required. The single column actually required in our query is projected out in the APPLY operation. Therefore the sample query returns 68, 118 rows of 1 (column SKYVERSION). But the standard OGSA- DQP first loads 68, 118 446 = 30, 380, 628 column values. Hence the poor query performance on large queries. To make meaningfully large scan-type queries practical, OGSA-DQP would need to be optimized to avoid read- 707

Host1 Host2 Host3 Figure 1. Distributing the SDSS MyBestDR5 database among three hosts ing all columns of tables. The ideal approach would be to change the query compiler to combine the APPLY and TABLE SCAN into a new kind of node that issued an SQL query to the data resource for just the columns needed. But this would involve changes to several modules of OGSA- DQP. Instead we devised an optimization of project/scantype queries that left the query plan unchanged, but worked by changing the way the evaluator s TableScanOp class and ReduceOp class process the query plan. With the implementation of the above optimization, the query performance is dramatically improved for the simple (medium sized) queries. The sample query 2 took only 2 minutes 37 seconds instead of 452 minutes 52 seconds 173 times faster 1. As explained above, we keep the same query plan, but the evaluator interpets it differently. The optimized query plan in Figure 3 is actually identical to the one in Figure 2, but the changed SQL blob now shows that the query sent to the data resource incorporates the pro- 1 Time for running a query similar to 2 against SDSS MyBestDR5 without using OGSA-DQP was 7 seconds. jection in the APPLY operation as well as the TABLE SCAN operation. We then tried it on another query for larger number of results with more complicated data type: SELECT OBJID FROM NEIGHBORS (3) If everything goes well when this query is run against the first partition of the database, it should return 355, 214 rows of OBJID. Unfortunately, there is a problem with client memory exhaustion when returning large number of rows. After examining the log files, we found the OGSA-DQP client program is buffering all results into one huge string before printing them. This might be OK if the results set are small for example lots of 1 as the previous query returning SKYVERSION column. However, column OBJID has a bigger data type than column SKYVERSION, and trying to buffer 355, 214 of data like 587726014001315907 no doubt leads to the java.lang.outofmemoryerror on the DQP client. 708

Buck Coordinator Buck Evaluator 1. PRINT 2. EXCHANGE 0. EXCHANGE 1. APPLY PROJECT: PHOTOOBJALL.SKYVERSION SQL 0. TABLE_SCAN: PHOTOOBJALL Figure 2. Query plan for query 2 running on the first host Buck Coordinator Buck Evaluator 1. PRINT 2. EXCHANGE SQL 0. EXCHANGE 1. APPLY PROJECT: PHOTOOBJALL.SKYVERSION 0. TABLE_SCAN: PHOTOOBJALL Figure 3. Optimized query plan for query 2 running against the databse on the first host The client memory exhaustion was fixed by changing the DQP Client class to avoid using the method getresults(), so that the DQP client no longer buffers all results before printing. After this modification, we successfully ran a number of OGSA-DQP queries against the Oracle databases, including local joins that involve two or three tables. One example is a sample query that joins two tables and returns 355, 214 rows from the database on the first host (total run time: 21 minutes 46 seconds): SELECT PHOTOOBJALL.OBJID, PHOTOOBJALL.RA, PHOTOOBJALL.DEC, NEIGHBORS.NEIGHBOROBJID, NEIGHBORS.DISTANCE FROM PHOTOOBJALL, NEIGHBORS WHERE PHOTOOBJALL.OBJID = NEIGHBORS.OBJID (4) Another example is a query that joins three tables and returns 327 rows from the database on the first host (total run time: 3 minutes 55 seconds): SELECT PHOTOOBJALL.OBJID FROM FIRST, PHOTOOBJALL, NEIGHBORS WHERE PHOTOOBJALL.OBJID = NEIGHBORS.OBJID AND PHOTOOBJALL.OBJID = FIRST.OBJID (5) Times for running similar queries against SDSS MyBestDR5 without using OGSA-DQP were only 54 seconds and 2 seconds respectively. 709

4. Conclusions This paper focussed on query execution in a wellknown (experimental) Grid-based middleware OGSA- DQP. It examined the OGASA-DQP operator and the running of some queries through OGSA-DQP. We looked at the OGSA-DQP query plan for some queries and made some adjustment when running the quires. This paper discussed some problems running OGSA- DQP queries on an Oracle database, in particular queries returning many rows. The memory of both OGSA-DQP evaluator and coordinator was increased. We modified the OGSA-DQP Client class and disabled a client time out. We also optimized the implementation of the OGSA- DQP TABLE SCAN and APPLY operations to load only required columns into memory, instead of all the columns in the target tables. This optimisation greatly improved the OGSA-DQP queries against large tables. Finally we fixed a OGSA-DQP Client memory exhaustion problem by avoiding buffering all results before printing. After the above improvements, various join queries ran successfully through OGSA-DQP. In future publications, we will describe how we have run more complex queries on a distributed version of full SDSS databases to further investigate the OGSA-DQP distributed query performance. [7] D. G. York et al. The Sloan Digital Sky Survey: Technical summary. Astronomical Journal, 120:1579 1587, 2000. http://www.sdss.org. References [1] The OGSA-DQP project. http://www.ogsadai.org/about/ogsa-dqp. [2] H. Xiang, M. Baker, and R. Nichol. Experiences mirroring and distributing the Sloan Digital Sky Survey. In Fifth International Conference on Grid and Cooperative Computing Workshops (GCC 2006), Changsha, China, pages 518 521. IEEE Computer Society, October 2006. [3] H. X. Xiang. Experiences acquiring and distributing a large scientific database. In Second International Conference on Future Generation Communication and Networking Symposia, volume 2, pages 14 19, Washington, DC, USA, December 2008. IEEE Computer Society. [4] H. X. Xiang. A grid-based distributed database solution for large astronomy datasets. In International Conference on Computer Science and Software Engineering, volume 3, pages 66 69, Washington, DC, USA, December 2008. IEEE Computer Society. [5] H. X. Xiang. Supporting complex scientific database schemas in a grid middleware. In International Conference on Advanced Information Networking and Applications, volume 0, pages 937 944, Bradford, UK, May 2009. IEEE Computer Society. [6] H. X. Xiang. Using grid middleware to query a heterogeneous distributed version of the sdss database. In The Fifteenth International Conference on Parallel and Distributed Systems, Shenzhen, China, December 2009. IEEE Computer Society. 710