SDSS Dataset and SkyServer Workloads

1 Overview

Understanding the SDSS dataset composition and typical usage patterns is important for identifying strategies to optimize the performance of the AstroPortal and to effectively test the performance of the stacking service. This document is a first attempt at describing both the dataset and the observed access patterns. The following sections cover the SDSS dataset and the workloads we plan to use to test the AstroPortal. These workloads include random workloads, random workloads with varying data locality, SkyServer workload traces, and realistic random workloads.

1.1 Dataset

About a year ago, I extracted some 320 million object coordinates from the SkyServer DR4 database. After performing a lookup for every one of the 320M objects, I obtained a histogram giving, for each of the 1.3 million files, the number of objects it contains (ordered by the number of objects). The histogram of the object distribution in the SDSS DR4 dataset is shown below in Figure 1.

Figure 1: Object distribution in the SDSS DR4 dataset files (object count per file vs. files, ordered by object count)

From Figure 1, of the 1.3 million files in DR4, only some 350K files contain any objects at all. What is also interesting is that the objects are not uniformly distributed over the data files; instead, there is a normal-like distribution to the number of files with a low/medium/high number of objects per file.
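As a rough illustration of how such a histogram can be produced, the sketch below counts objects per file from a hypothetical dump of the 320M object-to-file lookups; the objects.tsv file name and its two-column format are assumptions, not the actual lookup output.

    from collections import Counter

    # Hypothetical dump of the per-object lookups: one "object_id<TAB>file_id"
    # line per object (the real lookup output format may differ).
    objects_per_file = Counter()
    with open("objects.tsv") as lookups:
        for line in lookups:
            _object_id, file_id = line.rstrip("\n").split("\t")
            objects_per_file[file_id] += 1

    # Only files with at least one object appear here (~350K of the
    # ~1.3M DR4 files); ordering by count gives the curve in Figure 1.
    histogram = objects_per_file.most_common()
    print(len(histogram), "non-empty files;",
          "largest file holds", histogram[0][1], "objects")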

Figure 2 graphically shows the spatial distribution of the files in the SDSS DR4 dataset.

Figure 2: Spatial distribution (according to the RA and DEC coordinates) of the files in the SDSS DR4 dataset; each dot represents a file that could contain a varying number of objects; the more objects a file contains, the darker and bigger its dot

In essence, I believe we could use the data shown in Figure 1 and Figure 2 to augment and improve the various caching strategies. In principle, any caching strategy could be enhanced by factoring in the weight (number of objects) of each entry, ensuring that entries with many objects survive longer in the cache before being evicted. Since the object distribution does not change over the life of a dataset, using this information should be straightforward to implement.

1.2 Workloads

Choosing the right workloads to quantify the performance of the proposed systems is crucial to understanding the real-world benefits once the AstroPortal is in production use by the Astronomy community. We have defined several workloads that should exercise a wide range of performance metrics and allow us to identify which of the various optimizations offer the best performance improvements.

One important characteristic of the workloads we define is data locality. We have 320 million objects that are distributed over 1.2 million files. When a stacking query is made, we define the data locality metric for that query as a percentage according to Equation 1, where values close to 0% imply high data locality and values close to 100% imply low data locality:

Equation 1: Data Locality Metric = (unique_files_accessed / objects_accessed) * 100
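As a minimal sketch of Equation 1, assuming a hypothetical object_to_file dictionary built from the per-object lookups of Section 1.1:

    def data_locality_metric(objects_accessed, object_to_file):
        """Equation 1: unique files accessed / objects accessed * 100.
        Values near 0% mean high data locality (many objects share few
        files); values near 100% mean low locality (one file per object)."""
        unique_files = {object_to_file[obj] for obj in objects_accessed}
        return 100.0 * len(unique_files) / len(objects_accessed)

    # Example: 4 objects spread over 2 files -> 50%.
    object_to_file = {"o1": "f1", "o2": "f1", "o3": "f2", "o4": "f2"}
    print(data_locality_metric(["o1", "o2", "o3", "o4"], object_to_file))

The same function applies unchanged to a batch of stacking operations by concatenating their object lists.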

The motivation behind this data locality metric definition is that we perform data management at the file level, so access patterns in which many objects allow the same files to be read should produce better performance. The data locality metric can apply both 1) to a single stacking operation over multiple objects, and 2) to multiple stacking operations performed sequentially or in parallel, since the same computational/storage resources could be used to perform the set of stacking operations. The stacking sizes are also important to investigate, since they directly impact the communication cost: more control and data information needs to flow from one component to another. We intend to investigate stacking sizes ranging from 1 to 32768, which should cover typical real-world stacking operations.

1.2.1 Random Workload

The random workload represents a random sampling (uniformly distributed) of objects chosen from the collection of 320 million objects in the SDSS DR4 dataset. We will investigate both stackings of the same size and stackings of varying sizes. Once the workload is defined, the same workload will be used for all experiments in order to allow direct comparisons between the performance of the AstroPortal (as well as the related systems) and the various optimizations included. With such a large number of objects (and such a large dataset), a uniform sampling of objects will likely produce workloads with very little (if any) data locality.

1.2.2 Random Workload with Varying Data Locality

Building upon the random workload defined above, we can further refine the workload by varying the data locality from 0% to 100%. We will define 5 different workloads with data locality set to 0%, 25%, 50%, 75%, and 100%. At one extreme, we have a sweep-like access pattern in which each accessed object is in a unique file; this should expose any overhead (and potential slowdown) incurred by the data management system, as data caching will likely get very low hit rates. At the other extreme, we have full data locality, in which all accessed objects are contained in the same file, allowing the caching strategies to achieve maximal cache hit rates. The intermediate data localities (25%, 50%, and 75%) give us an overview of how well various caching strategies cope with varying data locality in the workloads, and at what point they might become ineffective.

1.2.3 SkyServer Workload Traces

The SkyServer [2] is the system most widely used by the Astronomy community for searching through the SDSS catalogs to locate needed information. All SQL queries against the SkyServer are logged with the following information: date and time of the query, client IP address, requestor, server, database name, access, elapsed time to complete the query, the number of rows retrieved, the SQL query, and any error message that might have been produced. The logs can be queried at http://skyserver.sdss.org/dr/en/tools/search/sql.asp with a sample query of the form "select top 10 * from sdssad2.weblog.dbo.sqllog". Although the retrieved queries are not from a stacking application, all of them target the SDSS dataset, and they are likely to come from the same set of users who will use the stacking application as well. Until the Astronomy community is using the AstroPortal and we can gather our own stacking traces, the best we can do is to capture workload traces of other applications using the same dataset.
The trace we plan to use covers the period from October 29th, 2006 through November 21st, 2006. Since a stacking application normally needs at least 2 objects to stack, we omitted any queries that returned either 0 or 1 objects. We were left with a trace produced by 900 distinct users (based on their IP addresses), containing 65K queries returning 90M objects.

1.2.3.1 Raw Logs

Figure 3, Figure 4, and Figure 5 show a pictorial representation of the raw logs obtained from the SkyServer between 10/29/06 and 11/21/06. In Figure 3 and Figure 4, we have plotted the throughput as both the number of queries per minute and the number of rows retrieved per second. Both of these metrics give us an overview of the sustained throughput that the AstroPortal would have to maintain to keep up with this particular workload. For example, the AstroPortal would need to perform on average about 5 queries per minute, translating to an average of about 285 stackings per second.
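As a sketch of how these throughput numbers could be recomputed from a flat export of the SQL log, assuming a hypothetical skyserver_log.csv file whose "time" and "rows" column names are illustrative rather than the actual SqlLog schema:

    import csv
    from collections import Counter

    queries_per_minute = Counter()
    rows_per_second = Counter()

    with open("skyserver_log.csv", newline="") as log:
        for row in csv.DictReader(log):
            rows_returned = int(row["rows"])
            if rows_returned < 2:          # drop queries returning 0 or 1 objects
                continue
            ts = row["time"]               # e.g. "2006-10-29 00:01:42"
            queries_per_minute[ts[:16]] += 1            # bucket by minute
            rows_per_second[ts[:19]] += rows_returned   # bucket by second

    # Averages over the buckets that actually saw traffic.
    print(sum(queries_per_minute.values()) / len(queries_per_minute), "queries/min")
    print(sum(rows_per_second.values()) / len(rows_per_second), "rows/sec")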

Figure 3: SkyServer logs (10/29/06 - 11/21/06): throughput in terms of SQL queries per minute

Figure 4: SkyServer logs (10/29/06 - 11/21/06): throughput in terms of rows returned per second by the performed SQL queries

Figure 5 shows the time it took the SkyServer to complete the queries, normalized by the number of rows retrieved. Notice that this graph has its y-axis on a logarithmic scale (unlike the previous two graphs). Figure 5 shows a widely varying time per retrieved row, reflecting the wide variance in performance depending on the complexity of the query as well as the load on the SkyServer as multiple users run queries concurrently.

Figure 5: SkyServer logs (10/29/06 - 11/21/06): elapsed time per retrieved row, in milliseconds

1.2.3.2 Summary of Logs

The following set of figures (Figure 6, Figure 7, Figure 8, and Figure 9) attempts to summarize the raw logs obtained from the SkyServer between 10/29/06 and 11/21/06. We have created histograms of various metrics computed from the raw logs:
- Inter-arrival time of queries
- Query size
- Time per retrieved row
- Time per query

There is another set of figures (Figure 10, Figure 11, Figure 12) which attempts to summarize the same logs in relation to the clients rather than the queries. Characterizing the clients allows us to understand how the workload is distributed among them. One thing we have not investigated yet is the data locality found in the query results, which we plan to do; this information is important because the performance of the various optimizations we have made to the AstroPortal will be heavily influenced by how low or high the data locality is in typical real-world workloads.

Notice that all the graphs have both the x-axis and the y-axis on a logarithmic scale. Some of these graphs tend to show Zipf distributions; Zipf curves follow a straight line when plotted on a double-logarithmic diagram. A simple description of data that follow a Zipf distribution is that they have:
- a few elements that score very high (the left tail in the diagrams)
- a medium number of elements with middle-of-the-road scores (the middle part of the diagrams)
- a huge number of elements that score very low (the right tail in the diagrams)

This property is desirable to identify in the captured logs, as it will help generate realistic random workloads based on the distribution of the various metrics that make up a workload: query size, inter-arrival time, time per query, time per row, and the distribution of the work among the various clients.
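One simple way to turn these measured distributions into a workload generator is to resample directly from the observed values, so the synthetic workload keeps whatever Zipf-like shape the logs exhibit. The sketch below uses made-up placeholder observations rather than the real log summaries:

    import random

    def empirical_sampler(observations):
        """Return a function that draws values from the empirical
        distribution of the observations (simple resampling)."""
        values = list(observations)
        return lambda: random.choice(values)

    # Placeholder observations; in practice these would be the query sizes
    # (rows) and inter-arrival gaps (seconds) extracted from the logs.
    observed_query_sizes = [2, 3, 7, 7, 12, 40, 545, 1200, 350000]
    observed_gaps = [0.5, 1.0, 1.0, 2.0, 5.0, 48.0]

    next_size = empirical_sampler(observed_query_sizes)
    next_gap = empirical_sampler(observed_gaps)

    # A realistic random workload: (seconds until next query, stacking size).
    workload = [(next_gap(), next_size()) for _ in range(10_000)]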

Figure 6: SkyServer logs (10/29/06 - 11/21/06): inter-arrival time of queries histogram, in seconds

Figure 7: SkyServer logs (10/29/06 - 11/21/06): query size histogram, in rows

Figure 8: SkyServer logs (10/29/06 - 11/21/06): time per retrieved row histogram, in milliseconds

Figure 9: SkyServer logs (10/29/06 - 11/21/06): time per query histogram, in seconds

The other set of figures (Figure 10, Figure 11, and Figure 12), which summarize the same logs in relation to the clients rather than the queries, cover the following metrics (a small aggregation sketch follows the list):
- Number of queries per client
- Rows retrieved per client
- Query size per client
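A per-client summary of this kind can be computed with a simple aggregation over the filtered log rows; the (client_ip, rows_returned) pairs below are placeholders standing in for the real trace:

    from collections import defaultdict

    # Placeholder (client_ip, rows_returned) pairs in place of the
    # filtered SqlLog rows.
    log_rows = [("ip-a", 84), ("ip-a", 12), ("ip-b", 350), ("ip-b", 2)]

    queries_per_client = defaultdict(int)
    rows_per_client = defaultdict(int)
    for client_ip, rows_returned in log_rows:
        queries_per_client[client_ip] += 1
        rows_per_client[client_ip] += rows_returned

    # Average query size per client (the quantity plotted in Figure 12).
    avg_query_size = {ip: rows_per_client[ip] / queries_per_client[ip]
                      for ip in queries_per_client}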

Figure 10: SkyServer logs (10/29/06 - 11/21/06): number of queries per client histogram

Figure 11: SkyServer logs (10/29/06 - 11/21/06): number of rows retrieved per client histogram

Figure 12: SkyServer logs (10/29/06 - 11/21/06): average query size per client histogram

1.2.4 Realistic Random Workload

We expect the summary of the SkyServer workload traces to give us enough information to build realistic (but randomized) workloads for the AstroPortal using the SDSS dataset.

2 Bibliography

[1] Tanu Malik, Alex Szalay, Tamas Budavari and Ani Thakar, SkyQuery: A Web Service Approach to Federate Databases, The First Biennial Conference on Innovative Database Systems Research (CIDR), 2003.

[2] Alex Szalay, Jim Gray, Ani Thakar, Peter Kunszt, Tanu Malik, Jordan Raddick, Chris Stoughton and Jan Vandenberg, The SDSS SkyServer - Public Access to the Sloan Digital Sky Server Data, ACM SIGMOD, 2002. http://www.skyserver.org/default.aspx