Scheduling Multiple Data Visualization Query Workloads on a Shared Memory Machine


Henrique Andrade†, Tahsin Kurc‡, Alan Sussman†, Joel Saltz‡

† Dept. of Computer Science, University of Maryland, College Park, MD 20742, {hcma,als}@cs.umd.edu
‡ Dept. of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, {kurc-1,saltz-1}@medctr.osu.edu

[Footnote: This research was supported by the National Science Foundation under Grants #ACI-96192 (UC Subcontract #115248) and #ACI-998287, Lawrence Livermore National Laboratory under Grant #B5288 (UC Subcontract #1184497), and the Department of Defense, Advanced Research Projects Agency, USAF, AFMC through Science Applications International Corporation under Grant #F362--C-9 (SAIC Subcontract #4425559).]

Abstract

Query scheduling plays an important role when systems are faced with limited resources and high workloads. It becomes even more relevant for servers applying multiple query optimization techniques to batches of queries, in which portions of datasets as well as intermediate results are maintained in memory to speed up query evaluation. In this work, we present a dynamic query scheduling model based on a priority queue implementation using a directed graph and a strategy for ranking queries. We examine the relative performance of several ranking strategies on a shared-memory machine using two different versions of an application, called the Virtual Microscope, for browsing digitized microscopy images.

1 Introduction

Multi-query optimization is becoming increasingly important in many application domains, such as relational databases, deductive databases, decision support systems, and data intensive analytical applications (or data analysis applications) [2]. Several optimizations can be applied to speed up query execution when multi-query workloads are presented to the server. Query scheduling is another common technique that can be applied in combination with other optimizations such as the use of materialized views (or intermediate results), and data prefetching and caching. The query optimization and scheduling problem has been extensively investigated in past surveys [4, 5]. Query scheduling techniques have been used to speed up the execution of queries in the face of limited resources [5] and in parallel environments [7]. The work of Mehta et al. [9] is one of the first to tackle the problem of scheduling queries in a parallel database by considering batches of queries, as opposed to one query at a time. Gupta et al. [6] present an approach that tackles this problem in the context of decision support queries. Sinha and Chase [10] present some heuristics to minimize the flow time of queries and to exploit inter-query locality for large distributed systems. Waas and Kersten [11] present scheduling techniques for multiple query workloads generated by web-based requests. Nevertheless, the study of the problem of scheduling multiple query workloads for highly data intensive visualization queries, and more generally for data analysis applications, is lacking.

In this paper, we develop a model for scheduling multiple queries in data analysis applications. Queries in this class of applications involve user-defined processing of data and complex, application-specific data structures to hold intermediate results during processing. The multi-query scheduling model described in this paper is based on a priority queue implementation and is dynamic in that we assume queries are submitted to the server continually. The priority queue is implemented using a directed graph to represent the commonalities and dependencies among queries in the system, and a ranking strategy is used to order the queries for execution. We describe several different ranking strategies. We perform an experimental evaluation of the strategies using two versions of a microscopy visualization application that were deployed in a generic, multithreaded, multiple query workload aware runtime system [2]. Each of these versions of the application has different CPU and I/O requirements, creating different application scenarios.

[Figure 1. The system architecture, and the query answering mechanism. The architecture comprises clients, a query server (query planning/query execution) with a pool of query threads, an index manager, a data store manager, and a page space manager, backed by data sources (disk farm, tape storage, relational database). Once a new query q_j with predicate M_j is submitted, the system tries to find a complete or a partial blob that can be used to compute q_j. Once it is found (blob I with region R_i, in our example), a data transformation is applied with the user-defined project method to compute region R_j. Sub-queries S_{j,1}, S_{j,2}, S_{j,3}, and S_{j,4} are generated to complete the query processing and produce the answer blob J.]

2 System Architecture

We have implemented the Virtual Microscope application [1] using a generic multiple query aware middleware for data analysis applications. This middleware infrastructure, which consists of a multithreaded query server engine, was previously described in [2], but we provide a brief description of some of its components here in order to help with the presentation of the scheduling model. Figure 1 illustrates the architecture of our framework, which consists of several service components, implemented as a C++ class library and a runtime system. The current runtime system supports multithreaded execution on a shared-memory multiprocessor machine.

Query Server: The query server is implemented as a fixed-size thread pool (typically the number of threads is the number of processors available in the SMP) that interacts with clients for receiving queries and returning query results. An application developer can implement one or more query objects for application-specific subsetting and processing of datasets. The implementation of a new query object is done through C++ class inheritance and the implementation of virtual methods. When a query is received from a client, the query server instantiates the corresponding query object and spawns a Query Thread (Figure 1) to execute the query. Specifically, when the system receives predicate meta-information M for a query, query processing begins. First, the runtime system searches for cached partial results that can be reused to either completely or partially answer the query. This lookup operation uses a user-defined overlap operator to test for potential matches. Note that the result for one query can also be viewed as an intermediate result for another query. A user-defined project method is then called so that the cached result can be projected, potentially performing a transformation on the cached data, to generate a portion of the output for the current query. Finally, if the current query is only partially answered by cached results, sub-queries are created to compute the results for the portions of the query that have not been computed from cached results. The sub-queries are processed just like any other query in the system.

The transformation of intermediate data structures and values via the project method is the cornerstone of our system architecture. First, it is responsible for the identification of reuse possibilities when a new query is being executed. Second, information about multiple possible overlaps and multiple scheduling possibilities can be used to process a query batch in a way that better utilizes available system resources.
Formally, the following equations describe the data transformation model our system uses to explore common subexpression elimination and partial reuse opportunities (and also define the functions that must be implemented by the application developer):

    cmp(M_i, M_j) = true or false                      (1)
    overlap_project(M_i, M_j) = k,  0 <= k <= 1        (2)
    project(M_i, M_j, I): I_{M_i} -> J_{M_j}           (3)

Equation 1 describes the comparison function cmp that returns true or false depending on whether intermediate data result I, described by its predicate M_i, is the same as J, described by M_j. When cmp returns true, the system has identified a common subexpression elimination opportunity, because query q_j can be completely answered by returning I. In some situations, queries q_i and q_j, albeit different, have some overlap, which means that cmp is false, but partial reuse is still possible. Equation 2 describes the overlap function that returns a value between 0 and 1 that represents the amount of overlap between intermediate data results I and J. This overlap is computed with respect to a data transformation function project, given in Equation 3, that takes one intermediate data result I whose predicate is M_i and projects it over M_j (i.e., performs a data transformation), as seen in Figure 1.

One additional function, referred to as qoutsize(M_i), also needs to be implemented. Function qoutsize returns the query output size in terms of the number of bytes needed to store the query results. Currently, this function is only used by the query scheduler. For queries whose predicates only allow the query server to know how much space is going to be needed by actually executing the query, it is not feasible to provide the exact amount of space needed to hold the query results beforehand. In these situations, an estimate of that amount should be returned.
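
To make this developer-facing contract concrete, here is a minimal C++ sketch of the user-defined functions of Equations 1-3 plus qoutsize. The class and method names are our own illustration; the actual class library in [2] may define them differently.

#include <cstddef>

// Hypothetical sketch of the user-defined hooks implied by Equations 1-3;
// names and signatures are illustrative, not the middleware's actual API.
struct Metadata;            // application-defined predicate meta-information M
struct IntermediateResult;  // application-defined data blob I (with its Metadata)

class QueryObject {
public:
    virtual ~QueryObject() = default;

    // Equation 1: does the cached result described by mi answer mj exactly?
    virtual bool cmp(const Metadata& mi, const Metadata& mj) const = 0;

    // Equation 2: fraction of mj's result obtainable from mi via project,
    // in [0, 1]; 0 means no reuse is possible.
    virtual double overlap(const Metadata& mi, const Metadata& mj) const = 0;

    // Equation 3: transform cached blob I (predicate mi) into the portion
    // of the output described by mj; returns the projected partial result.
    virtual IntermediateResult* project(const Metadata& mi, const Metadata& mj,
                                        const IntermediateResult& I) const = 0;

    // Size (in bytes) of the query output; used only by the scheduler.
    // May return an estimate when the exact size is unknown before execution.
    virtual std::size_t qoutsize(const Metadata& mi) const = 0;
};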

Data Store Manager: The data store manager (DS) is responsible for providing dynamic storage space for intermediate data structures generated as partial or final results for a query. The most important feature of the data store is that it records semantic information about intermediate data structures. This allows the use of intermediate results to answer queries later submitted to the system. A query thread interacts with the data store using a DataStore object, which provides functions similar to the C language function malloc. When a query wants to allocate space in the data store for an intermediate data structure, the size (in bytes) of the data structure and the corresponding accumulator meta-data object are passed as parameters to the malloc method of the data store object. The data store manager allocates the buffer space, internally records the pointer to the buffer space and the associated meta-data object, and returns the allocated buffer to the caller. The data store manager also provides a method called lookup. This method is used by the query server to check if a query can be answered entirely or partially using the intermediate results stored in the data store, as seen in Figure 1.

Page Space Manager: The page space manager (PS) controls the allocation and management of buffer space available for input data in terms of fixed-size pages. All interactions with data sources are done through the page space manager. The pages retrieved from a Data Source are cached in memory. The page space manager also keeps track of I/O requests received from multiple queries so that overlapping I/O requests are reordered and merged, and duplicate requests are eliminated, to minimize I/O overhead.
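
A minimal sketch of the DataStore interface just described, under our own assumed names and signatures (the paper does not give the actual API):

#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative sketch of the DataStore interface described above; the real
// middleware's types and method signatures may differ.
struct Metadata;  // accumulator meta-data describing a stored result

class DataStore {
public:
    // malloc-like allocation: reserves nbytes and records the associated
    // meta-data so the buffer can later be found (and reused) by lookup.
    void* malloc(std::size_t nbytes, const Metadata* meta) {
        void* buf = std::malloc(nbytes);
        entries_.push_back({buf, meta});
        return buf;
    }

    // Returns stored entries whose meta-data overlaps the query predicate,
    // using a caller-supplied, user-defined overlap test (cf. Equation 2).
    template <typename OverlapFn>
    std::vector<const Metadata*> lookup(const Metadata& query,
                                        OverlapFn overlaps) const {
        std::vector<const Metadata*> matches;
        for (const auto& e : entries_) {
            if (overlaps(*e.meta, query)) matches.push_back(e.meta);
        }
        return matches;
    }

private:
    struct Entry { void* buffer; const Metadata* meta; };
    std::vector<Entry> entries_;
};
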
3 Analysis of Microscopy Data with the Virtual Microscope

The Virtual Microscope (VM) application [1] implements a realistic digital emulation of a high power light microscope. One VM example application is in a teaching environment, where an entire class can access and individually manipulate the same slide at the same time, searching for a particular feature in the slide. In that case, the data server may have to process multiple queries simultaneously.

The raw input data for VM are 2-dimensional, digitized microscope slides, which are stored on disk at the highest magnification level, and can be up to several gigabytes in size, uncompressed. In order to achieve high I/O bandwidth during data retrieval, each slide is regularly partitioned into data chunks, each of which is a rectangular subregion of the 2D image. During query processing, the chunks that intersect the query region, which is a two-dimensional rectangle within the input image, are retrieved from disk. A retrieved chunk is first clipped to the query window. The clipped chunk is then processed to compute the output image at the desired magnification.

We have implemented two functions to process the chunks to produce lower resolution images. The first function employs a simple subsampling operation, and the second one implements an averaging operation over a window. For a magnification level of N given in a query, the subsampling function returns every Nth pixel from the region of the input image that intersects the query window, in both dimensions. The averaging function, on the other hand, computes the value of an output pixel by averaging the values of N x N pixels in the input image. The averaging function can be viewed as an image processing algorithm in the sense that it has to aggregate several input pixels in order to compute an output pixel. Figure 2 illustrates the application of the two functions.

[Figure 2. Operations to compute images at lower magnification levels from a higher resolution image: (i) the subsampling operation; (ii) the pixel averaging operation, in which the function f(pixels) computes the mean for a given number of pixels from an image at higher resolution.]
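
As a rough illustration of the two processing functions, here is a sketch assuming row-major RGB buffers and hypothetical types (not the paper's actual implementation):

#include <cstdint>
#include <vector>

// Sketch of the two VM chunk-processing functions; the Image type, row-major
// RGB layout, and these signatures are our assumptions for illustration.
struct Image {
    int width = 0, height = 0;
    std::vector<uint8_t> rgb;  // 3 bytes per pixel, row-major
    uint8_t* px(int x, int y) { return &rgb[3 * (y * width + x)]; }
    const uint8_t* px(int x, int y) const { return &rgb[3 * (y * width + x)]; }
};

// Subsampling: keep every Nth pixel of the clipped chunk, in both dimensions.
Image subsample(const Image& in, int N) {
    Image out{in.width / N, in.height / N, {}};
    out.rgb.resize(3 * out.width * out.height);
    for (int y = 0; y < out.height; ++y)
        for (int x = 0; x < out.width; ++x)
            for (int c = 0; c < 3; ++c)
                out.px(x, y)[c] = in.px(x * N, y * N)[c];
    return out;
}

// Averaging: each output pixel is the mean of an N x N input window.
Image average(const Image& in, int N) {
    Image out{in.width / N, in.height / N, {}};
    out.rgb.resize(3 * out.width * out.height);
    for (int y = 0; y < out.height; ++y)
        for (int x = 0; x < out.width; ++x)
            for (int c = 0; c < 3; ++c) {
                unsigned sum = 0;
                for (int dy = 0; dy < N; ++dy)
                    for (int dx = 0; dx < N; ++dx)
                        sum += in.px(x * N + dx, y * N + dy)[c];
                out.px(x, y)[c] = static_cast<uint8_t>(sum / (N * N));
            }
    return out;
}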

The resulting image blocks are directly sent to the client. The client assembles and displays the image blocks to form the query output.

The implementation of VM using the runtime system described in this paper is done by sub-classing the base classes provided by the system. We have added two query objects to the runtime system, one for each of the processing functions. The output image generated by a query is also available as an intermediate result, and is stored in the data store manager to be used by other queries. The magnification level, the processing function, and the bounding box of the output image in the entire dataset are stored as meta-data. An overlap function was implemented to intersect two regions and return an overlap index, which is computed as

    overlap index = (I_A / O_A) * (I_S / O_S)          (4)

In this equation, I_A is the area of intersection between the intermediate result in the data store and the query region, O_A is the area of the query region, I_S is the zooming factor used for generating the intermediate result, and O_S is the zooming factor specified by the current query. O_S should be a multiple of I_S so that the query can use the intermediate result. Otherwise, the value of the overlap index is 0.
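
A small sketch of this overlap index computation (Equation 4); the Rect type and the function name are our assumptions for illustration:

#include <algorithm>

// Sketch of the VM overlap index (Equation 4); axis-aligned rectangles and
// integer zoom factors are assumed here for illustration.
struct Rect { double x0, y0, x1, y1; };

static double area(const Rect& r) {
    double w = r.x1 - r.x0, h = r.y1 - r.y0;
    return (w > 0 && h > 0) ? w * h : 0.0;
}

// (I_A / O_A) * (I_S / O_S); 0 unless the query's zoom factor oS is a
// multiple of the cached result's zoom factor iS (no reuse otherwise).
double overlapIndex(const Rect& cached, int iS, const Rect& query, int oS) {
    if (oS % iS != 0) return 0.0;
    Rect inter{std::max(cached.x0, query.x0), std::max(cached.y0, query.y0),
               std::min(cached.x1, query.x1), std::min(cached.y1, query.y1)};
    double oA = area(query);
    if (oA == 0.0) return 0.0;
    return (area(inter) / oA) * (static_cast<double>(iS) / oS);
}
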
4 Dynamic Query Scheduling Model

The query scheduling model presented in this paper assumes the data server is dynamically receiving queries. This dynamic nature of requests creates a need to evaluate the current state of the server system before a new query is slated for execution. Our approach to the multiple query scheduling problem is based on the use of a priority queue. The priority queue is implemented as a directed graph, G(V, E), where V denotes the set of vertices and E is the set of edges, along with a set of strategies for ranking queries waiting for execution. The graph, referred to as the query scheduling graph, describes the dependencies and reuse possibilities among queries because of full or partial overlap, as is seen in Figure 3. Each vertex represents a query that is waiting to be computed, is being computed, or was recently computed and its results are cached. A directed edge in the graph connecting q_i to q_j, e_{i,j}, means that the results of q_j can be computed based on the results of q_i. In some situations a transformation may exist in only one direction due to the nature of the data transformation, i.e., the data transformation function is not invertible. In that case, the two nodes are connected in only one direction (e.g., e_{2,4} in Figure 3). Associated with each e_{i,j} is a weight w_{i,j}, which gives a quantitative measure of how much of q_i can be used to answer q_j through some transformation function. [Footnote 1: Currently, the system uses the weight for a single type of transformation function, namely the best one available.] The weight is computed as the value of overlap(q_i, q_j) * qoutsize(q_i), which is a measure of the number of bytes that can be reused. Associated with each node q_i in V there is a 2-tuple <r_i, s_i>, where r_i and s_i are the current rank and the state of q_i, respectively.

A query can be in one of four states: WAITING, EXECUTING, CACHED, or SWAPPED OUT. The rank is used to decide when a query in the WAITING state is going to be executed, so that when a dequeue operation is invoked on the priority queue, the query node with the highest rank is scheduled for execution. This scheduling model is dynamic in the sense that new queries can be inserted as they are received from clients. A new query triggers (1) the instantiation of a new node q_j in the query scheduling graph, (2) the addition of edges to all nodes q_k that have some overlap with q_j, by adding edges e_{k,j} and e_{j,k} (we refer to these nodes as "neighbor nodes"), and (3) the computation of r_j. Also, the q_k nodes have their r_k updated, because the new query may affect those nodes in terms of reuse opportunities. The state of a new node is set to WAITING. Once a query q_i has been scheduled, s_i is updated to EXECUTING, and once it has finished, its state is updated to CACHED, meaning that its results are available for reuse. If the system needs to reclaim memory, the state of the query q_i will be set to SWAPPED OUT, meaning that its results are no longer available for reuse. At that point the scheduler removes the node q_i and all edges whose source or destination is q_i from the query scheduling graph. This morphological transformation triggers a re-computation of the ranks of the nodes which were previously neighbors of q_i. Therefore the up-to-date state of the system is reflected to the query server. The ranking of nodes can be viewed as a topological sort operation. Updates to the query scheduling graph and topological sort are done in an incremental fashion to avoid performance degradation because of the cost of running the scheduling algorithm.

One objective of our multi-query scheduling model is to minimize the response time of each query submitted to the system. We define the response time of a query as the sum of the waiting time in the queue and the query execution time. Another objective is to decrease the overall execution time for all the queries in a batch (i.e., system throughput). Associated with both goals is the objective of making better use of cached data and data transformations, to prevent reissuing I/O operations or recomputing results from input data, by effectively ranking and sorting queries for execution.
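
As an illustration, one plausible C++ rendering of the query scheduling graph and its dequeue operation follows, including the weight formula w_{i,j} = overlap(q_i, q_j) * qoutsize(q_i); all names and layouts are our own assumptions rather than the system's actual structures.

#include <unordered_map>
#include <vector>

// Illustrative representation of the query scheduling graph; field and type
// names are ours, not the system's actual implementation.
enum class State { WAITING, EXECUTING, CACHED, SWAPPED_OUT };

struct Edge {
    int other;      // id of the neighbor query
    double weight;  // w = overlap(producer, consumer) * qoutsize(producer)
};

struct QueryNode {
    double rank = 0.0;            // r_i, recomputed incrementally
    State state = State::WAITING; // s_i
    std::vector<Edge> out;        // e_{i,k}: queries that can reuse our result
    std::vector<Edge> in;         // e_{k,i}: queries whose results we can reuse
};

struct SchedulingGraph {
    std::unordered_map<int, QueryNode> nodes;  // keyed by query id

    // Dequeue: the WAITING node with the highest rank runs next.
    int dequeue() const {
        int best = -1;
        double bestRank = 0.0;
        for (const auto& [id, n] : nodes)
            if (n.state == State::WAITING && (best == -1 || n.rank > bestRank)) {
                best = id;
                bestRank = n.rank;
            }
        return best;  // -1 means no query is waiting
    }
};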

[Figure 3. Sample queries over two parts of a sample dataset with different magnification levels, and the corresponding scheduling graph (nodes q_1 through q_5, with per-node tuples <r_i, s_i> and edge weights w_{i,j}).]

In this paper, we examine several ranking strategies:

1. First in First out (FIFO). Queries are served in the order they arrive at the system.

2. Most Useful First (MUF). The nodes in the WAITING state in the query scheduling graph are ranked according to a measure of how much other nodes depend on a given node, in terms of overlap * qoutsize. The intuition behind this policy is that it quantifies how many queries are going to benefit if we run query q_i next. Formally, the rank for each node in the WAITING state is computed as follows:

    r_i = Σ_{∀k | ∃e_{i,k} ∧ s_k = WAITING} w_{i,k}

3. Farthest First (FF). The nodes are ranked based on a measure of how much a node depends upon another node. When q_i depends on q_j, and q_i gets scheduled for execution while q_j is executing, the computation of q_i will stall until q_j finishes and its results can be used to generate a portion of the result for q_i. Although this behavior is correct and efficient in the sense that I/O is not duplicated, it wastes CPU resources, because other queries that do not overlap either q_i or q_j cannot be executed due to the limited number of threads in the query server thread pool. Therefore, farthest first ranking gives an idea of how probable a query is to block because it depends on a result that is either being computed or is still waiting to be computed. The rank is computed as follows:

    r_i = - Σ_{∀k | ∃e_{k,i} ∧ (s_k = WAITING ∨ s_k = EXECUTING)} w_{k,i}

The higher the rank for a node is, the greater its chance of being executed next. Therefore, the negative sign in the equation above makes nodes with more dependencies have smaller ranks.

4. Closest First (CF). The nodes are ranked using a measure of how many of the nodes a query depends upon have already executed (and whose results are therefore CACHED) or are currently being executed. The rank r_i is computed as:

    r_i = Σ_{∀j | ∃e_{j,i} ∧ s_j = CACHED} w_{j,i} + α Σ_{∀k | ∃e_{k,i} ∧ s_k = EXECUTING} w_{k,i}

α is a factor (0 < α < 1) that can be hand-tuned to give more or less weight to dependencies that rely on intermediate results still being computed. The fact that a query q_k is still executing may cause q_i to block until q_k has finished, in which case q_i sits idle wasting CPU resources while it waits for q_k to finish. The intuition behind this policy is that scheduling queries that are close has the potential to improve locality, making caching more beneficial.

5. Closest and Non-Blocking First (CNBF). Nodes are ranked by a measure of how many of the nodes that a query depends upon have been or are being executed. We deduct the weight for the nodes being executed, because we would like to decrease the chances of having little overlap due to the deadlock avoidance algorithm implemented in the query server, and minimize waiting time because of the query having to block until the availability of a needed data blob that is still being computed. On the other hand, CNBF also considers nodes that have already been executed and are going to be useful to improve locality. The rank is computed as follows:

    r_i = Σ_{∀k | ∃e_{k,i} ∧ s_k = CACHED} w_{k,i} - Σ_{∀j | ∃e_{j,i} ∧ s_j = EXECUTING} w_{j,i}

The intuition is the same as that for CF, but CNBF attempts to prevent interlock situations.

6. Shortest-Job-First (SJF). Queries are ranked by their estimated execution times. The shorter the execution time of a query is, the higher its rank will be. The size in bytes of the input data, qinputsize(M_i), for query q_i is used as an estimate of the relative execution time of the query. For the VM application, the value of qinputsize(M_i) is calculated in the index lookup step as the total size of the data chunks that intersect the query window. We should note that in general qinputsize(M_i) may not be as precise an estimate as it is for VM, because of more complicated index lookup and data processing functions.

The ranking strategies are inherently different in that each of them focuses on a different goal. FIFO targets fairness; queries are scheduled in the order they arrive. The goal of MUF is to schedule queries earlier that are going to be the most beneficial for other waiting queries. The objective of FF is to avoid scheduling queries that are mutually dependent and would have one of them wait for the other to complete, and hence not productively use system resources. CF and CNBF are similar, but CF aims to achieve higher locality to make better use of cached (or soon to be cached) results, while CNBF tries to improve locality without paying the price of having to wait for dependencies to be satisfied. The goal of SJF is to reduce average waiting time by scheduling queries with shorter execution time first.
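
As a concrete illustration, here is a sketch of how two of these ranks (MUF and CNBF) could be computed over the scheduling graph structures sketched in Section 4; again, the names are ours, and the α-weighted CF and the other policies would follow the same pattern.

// Illustrative rank computations over the SchedulingGraph sketched in
// Section 4; these are our own renderings, not the system's actual code.

// MUF: r_i = sum of w_{i,k} over out-edges whose consumer k is WAITING.
double mufRank(const SchedulingGraph& g, int i) {
    double r = 0.0;
    for (const Edge& e : g.nodes.at(i).out)
        if (g.nodes.at(e.other).state == State::WAITING) r += e.weight;
    return r;
}

// CNBF: r_i = sum of w_{k,i} from CACHED producers
//             minus sum of w_{j,i} from EXECUTING producers.
double cnbfRank(const SchedulingGraph& g, int i) {
    double r = 0.0;
    for (const Edge& e : g.nodes.at(i).in) {
        State s = g.nodes.at(e.other).state;
        if (s == State::CACHED) r += e.weight;
        else if (s == State::EXECUTING) r -= e.weight;
    }
    return r;
}
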
5 Experimental Results

We present an experimental evaluation of the ranking strategies on a 24-processor SMP running Solaris 8. [Footnote 2: Experiments on an 8-processor SMP, running Linux, also show similar results.] In the experiments, we turned off the system file cache via the directio function in Solaris, so that no I/O amortization happens other than that caused by the Page Space Manager (PS). We employed three datasets, each of which is an image of 30000x30000 3-byte pixels, requiring a total of 7.5GB of storage space. Each dataset was partitioned into 64KB pages, each representing a square region in the entire image, and stored on the local disk attached to the SMP machine.

We present experimental results for the two implementations of VM. The first implementation, with the subsampling function, is I/O intensive. The CPU time to I/O time ratio is between 0.04 and 0.06 (e.g., for each 100 seconds, roughly between 4 and 6 seconds are spent on computation, and between 94 and 96 seconds are spent doing I/O). The second implementation, with the averaging function, is more balanced, with the CPU to I/O time ratio near 1:1.

We have emulated 16 concurrent clients. Each client generated a workload of 16 queries, producing 1024x1024 RGB images (3MB in size) at various magnification levels. Output images were maintained in the Data Store Manager (DS) as intermediate results for possible reuse by new queries. Out of the 16 clients, 8 clients issued queries to the first dataset, 6 clients submitted queries to the second dataset, and 2 clients issued queries to the third dataset. We have used the driver program described in [3] to emulate the behavior of multiple simultaneous clients. We chose to use the driver for two reasons. First, extensive real user traces are very difficult to acquire. Second, the emulator allowed us to create different scenarios and vary the workload behavior (both the number of clients and the number of queries) in a controlled way.

In all of the experiments, the emulated clients were executed simultaneously on a cluster of PCs connected to the SMP machine via a Fast Ethernet network.

In one set of experiments, we evaluated the effect of data caching on the performance of the FIFO and SJF strategies. Currently, neither strategy takes the state of the cache into account when scheduling queries. Nevertheless, we observed that overall system performance improved by as much as 35% and 7% for FIFO, and 4% and 7% for SJF, for the subsampling and averaging implementations of VM, respectively. Performance also increased as more memory was allotted to DS. Our results show that even for FIFO and SJF, caching intermediate results can significantly improve performance.

Figure 4 shows the 95%-trimmed mean [Footnote 3: A 95%-trimmed mean is computed by discarding the lowest and highest 2.5% of the scores and taking the mean of the remaining scores [8].] of the query response time achieved by the different ranking strategies. The query response time is the sum of the waiting time in the queue and the execution time of the query, and is shown as Query Wait and Execution Time in the figures. In these experiments, each client submitted its queries independently from the other clients, but waited for the completion of a query before submitting another one. For the CF strategy, α was fixed at 0.2. The sizes of the memory allocated for PS and DS were 32MB and 64MB, respectively. As is seen from the figure, FIFO performs discernibly worse than the other ranking strategies, as expected. The MUF, FF, CF, and CNBF strategies perform slightly better than SJF in most cases. The SJF strategy aims to minimize average waiting time by executing shorter queries first, but may suffer from less reuse of cached results. The other four strategies try to optimize data reuse, but incur higher waiting time. We also see that as the number of concurrent queries (i.e., threads) increases, performance degrades after an optimal point (4 threads in this case).

For many threads, the I/O subsystem cannot keep up with the number of requests it receives. Thus, although the computation part of the queries scales well, the I/O part does not. The averaging implementation of VM displays better scalability with respect to the number of query threads because it is more balanced in terms of CPU and I/O usage; therefore, adding more processing power benefits that implementation more.

[Figure 4. The 95%-trimmed mean for the query response time (query wait and execution time), as the maximum number of concurrent queries allowed in the system is varied from 1 to 16 threads, for the subsampling implementation and the pixel averaging implementation. For all these charts, 64MB were allocated to DS.]

[Figure 5. Average overlap increases when more memory is added to DS; up to 4 queries are allowed to execute concurrently. Shown for the subsampling implementation and the pixel averaging implementation.]

Figure 5 displays the average overlap achieved as the amount of memory allocated for DS is varied. The amount of overlap increases with the increasing size of DS, as expected. For small cache sizes (e.g., 32MB), higher overlap figures are obtained using CF and CNBF. Both of these strategies try to optimize locality, resulting in better use of the cache. However, this may not translate into lower query response times, as is seen in Figure 6, because queries may wait longer in the queue for execution.

When clients wait for the completion of a query before submitting a new one, the primary objective is to minimize the response time observed by the client. However, in some cases a client may submit a batch of queries to the server. For example, if we want to create a movie from a case study using VM, we may submit a set of queries, each of which corresponds to a visualization of the slide being studied. In that case, it is important to decrease the overall execution time of the batch of queries. Figure 7 shows the overall execution time of 256 queries submitted as a single batch. CF and CNBF result in better performance than the other strategies, especially when resources are scarce (i.e., a small DS cache). Our results show that when the objective is to minimize the total execution time of queries, taking locality into account is important for better performance, since the cache is better utilized.

[Figure 6. The 95%-trimmed mean for the query response time (query wait and execution time), when more memory is added to DS; up to 4 queries are allowed to execute concurrently. Shown for the subsampling implementation and the pixel averaging implementation.]

[Figure 7. The total execution time of a single batch of 256 queries, as the memory allocated to DS is varied; a maximum of 4 queries are allowed to be evaluated simultaneously. Shown for the subsampling implementation and the pixel averaging implementation.]

6 Conclusions

We have presented a model for scheduling multiple queries in data analysis applications. Our results show that efficient scheduling of multiple queries is important for increasing the performance of the server. If minimizing response time is the primary concern, the performance of SJF is comparable to the other strategies that take into account the state of the system. SJF tries to explicitly minimize average waiting time. MUF, CF, and CNBF, however, aim at making better use of cached results, but may result in longer waiting times. For reducing the total execution time of a batch of queries, it becomes more important to exploit reuse opportunities, especially when resources are limited. Our experimental evaluation suggests that a combination of SJF and the other ranking strategies would provide a viable solution.

We plan to extend this work on three fronts: (1) the development of a combined strategy and of the capability for self-tuning, (2) additional data analysis applications (e.g., scientific visualization of 3-dimensional datasets), and (3) the incorporation of low level metrics (e.g., processing, I/O, and network bandwidth) into the query scheduling model. This capability can help achieve better speedups when the availability of the underlying hardware resources scales at different rates for different resources.

Acknowledgments. We would like to thank the anonymous reviewers of our paper. Their excellent comments helped us significantly in improving the presentation and content of the paper.

References

[1] A. Afework, M. D. Beynon, F. Bustamante, A. Demarzo, R. Ferreira, R. Miller, M. Silberman, J. Saltz, A. Sussman, and H. Tsang. Digital dynamic telepathology - the Virtual Microscope. In AMIA98. American Medical Informatics Association, Nov. 1998.

[2] H. Andrade, T. Kurc, A. Sussman, and J. Saltz. Efficient execution of multiple query workloads in data analysis applications. In Proceedings of the 2001 ACM/IEEE SC01 Conference, Denver, CO, Nov. 2001.

[3] M. D. Beynon, A. Sussman, and J. Saltz. Performance impact of proxies in data intensive client-server applications. In Proceedings of the 1999 International Conference on Supercomputing. ACM Press, June 1999.

[4] S. Chaudhuri. An overview of query optimization in relational systems. In Proceedings of the ACM Symposium on Principles of Database Systems, pages 34-43, Seattle, WA, 1998.

[5] G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), 1993.

[6] A. Gupta, S. Sudarshan, and S. Vishwanathan. Query scheduling in multi query optimization. In International Database Engineering and Applications Symposium (IDEAS'01), pages 11-19, Grenoble, France, 2001.

[7] Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4):406-471, 1999.

[8] D. M. Lane. HyperStat Online: An Introductory Statistics Book and Online Tutorial for Help in Statistics Courses. http://davidmlane.com/hyperstat/, 1999.

[9] M. Mehta, V. Soloviev, and D. J. DeWitt. Batch scheduling in parallel database systems. In Proceedings of the 9th International Conference on Data Engineering, Vienna, Austria, 1993.

[10] A. Sinha and C. Chase. Prefetching and caching for query scheduling in a special class of distributed applications. In Proceedings of the 1996 International Conference on Parallel Processing, pages 95-102, Bloomingdale, IL, 1996.

[11] F. Waas and M. L. Kersten. Memory aware query scheduling in a database cluster. Technical Report INS-R0021, Centrum voor Wiskunde en Informatica, Oct. 2000.