GBO Preliminary Research Proposal

Xiaodan Wang
Johns Hopkins University, USA
xwang@cs.jhu.edu

I. INTRODUCTION

Gray and Szalay [1] documented the data avalanche problem in the sciences, in which improvements in physical instruments and better data pipelines lead to exponential growth in data size. Paralleling this exponential trend is the accumulation of data at multiple, autonomous data sources. Exploring the resulting massive, widely distributed data is of immense scientific value: as Gray and Szalay observe, the number of scientific discoveries increases polynomially with the number of participating data sources [2].

Federated databases are an attractive solution for managing and sharing scientific data that are geographically distributed. Increasingly, scientific discoveries are made by scanning large portions of the data to find correlations, mine data, extract features, and compute joins across distributed data sources [3]. These needle-in-a-haystack queries are long running and data intensive, so query throughput limits performance. To facilitate data exploration, various scientific disciplines have built federations of databases. Federations allow data that are too large to be widely replicated or stored at a single site to be managed independently. Examples include SkyQuery [4], Genbank [5], and EcoliHub [6].

To ensure high job throughput and prevent starvation of traditional workloads in a federated environment, we propose new query processing disciplines. Scientific database federations built at a global scale render many goals of distributed query processing obsolete. Specifically, queries are I/O intensive and require non-indexed scans of multi-terabyte tables, which may take several hours to complete [3]. Data size and geography also dictate that transmitting data takes a large amount of time and has a profound impact on query performance.
Thus, workloads are not latency sensitive due to the large data sizes (accessing data at each site takes tens of seconds at best). Our goal is to maximize query throughput in the federation through query scheduling techniques that incorporate network structure (exploiting high-capacity network paths) and account for data access requirements (maximizing data sharing among queries).

A guiding principle of our work is that, rather than choosing the optimal plan for each query, we choose plans that penalize other concurrent queries minimally and improve overall query throughput. For example, algorithms that minimize completion time over-utilize the network by consuming all available resources to achieve a locally optimal plan [7]. Thus, we limit the amount of parallelism in query schedules to avoid multiple data transfers across large geographies. We also delay completion of individual queries to permit sharing of full table scans across multiple queries. While our approach increases the response time of certain queries, we observe orders of magnitude improvement in system throughput for astronomy workloads. This in turn expands the scale of exploration in astronomy by supporting queries over a larger federation and spatial region.

[Fig. 1. Spatial join in SkyQuery. Query shown: SELECT... FROM SDSS o, TWOMASS t, USNOB p WHERE XMATCH(o, t, p) < 3.5 and REGION( circle ). Diagram labels: Web, Mediator, three Wrappers, Probe Query, Result; SDSS CARD: 30, TWOMASS CARD: 100, USNOB CARD: 800.]

II. QUERY SCHEDULING IN SKYQUERY

Astronomy presents a good case study of data-intensive queries against a scientific database federation. SkyQuery, a federation of astronomy databases [4], typifies a data-intensive database federation with dozens of sites distributed across three continents. Each member is a multi-terabyte survey cataloging celestial objects from regions in the sky across multiple spectral properties.
However, SkyQuery is facing a scalability crisis in terms of ever-expanding data size and number of sites [8]. For example, it is not uncommon for join queries to yield intermediate results that are hundreds of megabytes in size [9]. Also, with roughly thirty sites distributed across three continents, the physical distribution and connectivity of sites in SkyQuery will grow considerably in the near future [8]. We study query performance in SkyQuery to gain insight into issues that are common among scientific database federations.

SkyQuery provides federated services to the public through Web services and Web forms. Users submit queries through a mediator, which communicates with member databases via a shared wrapper interface. The principal query is cross-match [4], which joins observations of the same astronomical object from several databases by correlating their location in space. Cross-match extends SQL by adding two clauses. The region clause specifies an area for conducting the search.

The xmatch clause specifies an unordered list of databases to visit. Figure 1 illustrates a sample execution of a cross-match query involving three sites. The mediator produces a serial schedule (a left-deep join order that visits each site serially) that joins sites by ascending cardinality, defined as the number of rows that satisfy the region clause at each site. This approach is currently employed in SkyQuery and is effective at the early elimination of tuples that do not participate in the join at subsequent sites. As a result, it minimizes computation for databases on uniform networks under two assumptions: uniform join probabilities, and I/O and processing costs that are linear in the number of tuples.

While the current approach to query scheduling performs reasonably well, improvements can be made to reduce the cost of transmitting join results between sites and the cost of computing spatial joins at each site. A global-scale federation exhibits network heterogeneity (differing capacity on network paths between sites), and communication overhead accounts for up to 90% of query response time in SkyQuery. Thus, schedules should account for network structure rather than joining sites in strictly ascending cardinality order. Another observation is that queries tend to overlap on specific spatial regions of interest. As a result, I/O cost can be shared by reusing results from the same region of interest across multiple queries that exhibit shared data access. Thus, our goal is to improve query performance by minimizing both network and I/O costs for data-intensive workloads in a federated environment.

III. CONTRIBUTIONS

In this section, we present our contributions from previously published work before mapping out a path for further research. Our first step is to model network costs in query scheduling decisions using a balanced network utilization metric.
The metric rewards schedules that utilize paths with excess capacity and produce small intermediate results. We then present algorithms for scheduling inter-site joins, which exploit network locality and avoid narrow, long-haul paths (e.g., transferring data between sites that cross continental boundaries) in the federation by making these paths more costly during optimization. Next, we introduce LifeRaft, a data-driven, batch processing algorithm designed to improve query throughput at a single site. LifeRaft batches queries with overlapping data requirements at each site and executes them against an ordering of the data that maximizes data sharing among queries. This decreases I/O and increases cache utility. Together, balanced network utilization and LifeRaft form a holistic solution for improving query throughput in a database federation.

A. Related Works

Kossmann [10] presents a detailed survey of both past and current query processing and optimization techniques that minimize computation and communication costs by exploiting, for instance, intra-query parallelism, caching, and data replication. Most commercial optimizers still rely on System R-style dynamic programming algorithms [11] that exhibit exponential-time complexity [12][13] and are unsuitable for large-scale database federations. Moreover, works that address the communication costs of distributed queries assume network uniformity [14][15]. They minimize the size of intermediate results, which in turn reduces computation and network costs on uniform networks. When compared with previous approaches, we simplify one aspect of query optimization (dealing with join selectivity), which allows us to consider non-uniform and non-metric network costs and balance network utilization over all paths. Our algorithms have low, polynomial-time complexity and, thus, they scale to large federations with hundreds of sites.
Evaluating distributed queries using semi-joins can provide substantial network savings [14][16][17]. Semi-joins ship only the attributes that are necessary for evaluating a join to another site in order to eliminate tuples that fail to satisfy the join predicate. In many settings, semi-joins are not attractive because computational overhead outweighs network savings on local area networks [18]. For network-bound queries on wide-area networks, semi-joins become more attractive because communication costs dominate performance. Our application of semi-joins differs in that we use them to limit attribute aggregation, rather than to reduce the cardinality of intermediate results.

We also look to data-centric routing in wireless sensor networks for inspiration. Several works perform in-network aggregation of data from multiple sensors to a single base station [19][20][21]. They organize sensors into a spanning tree rooted at the base station and aggregate data along the tree in order to conserve power by minimizing network usage. For example, Meliou et al. [22] study the NP-hard problem of finding an optimal tour for gathering data from a subset of sensors. They employ a similar metric to capture network utilization, but solve a different problem with different techniques: optimizing power consumption using dynamic programming alone. We explore more complex network structure and variable-size intermediate results, and employ dynamic programming on top of spanning tree solutions.

In addition to communication cost, we minimize I/O for scan-intensive queries through batch processing. The query batching paradigm was studied for workloads against large datasets on tertiary storage in order to minimize I/O cost [23-25]. Yu and DeWitt [25] explored this in the Paradise system by reordering queries over data stored on magnetic tape.
The reordering achieves sequential I/O by collecting data requirements during a pre-execution phase (without physically performing the I/O), reordering tape requests, and finally executing queries concurrently in one batch. However, queries that join the batch midway must wait until the entire batch finishes. Our approach is not limited to sequential data processing. Sarawagi [24] provides non-sequential processing by partitioning the data into fragments that are physically contiguous on the tertiary device and scheduling concurrent queries on a per-fragment basis. While our work leverages some of the ideas described in these works, we also explore additional metrics for high query throughput; namely, the amount of data contention and query starvation.

Google's Map-Reduce [26] is an attractive paradigm for parallel computation and is evolving to encompass more data-intensive tasks. Yang et al. [27] extend the Map-Reduce paradigm to more efficiently support relational joins by adding a merge phase that processes heterogeneous datasets simultaneously. More recently, Olston et al. [28] combined the procedural style of Map-Reduce with declarative SQL constructs in a parallel programming paradigm. Agrawal et al. [29] incorporate batch processing for Map-Reduce environments and identify data sharing among map tasks. Jobs that scan the same files are co-scheduled to maximize throughput. While their results are theoretical and do not consider caching, we plan to adapt their solution to query scheduling.

We also highlight the current approaches used in SkyQuery to achieve high throughput. The CasJobs [3] system avoids starvation of short queries by data-intensive scan queries through a multi-queue job submission system. The distinction between long and short queries is decided arbitrarily. Extra hardware is used to assign queries from each class to different servers, which is problematic since the longest short queries interfere with the short queue and the shortest long queries are starved. The throughput of long-running queries is further improved by partitioning the data and evaluating the queries in parallel across servers. However, achieving a balanced distribution of the workload across multiple servers is difficult because certain regions in the sky are accessed more frequently. Our work does not use ad hoc mechanisms to distinguish long- and short-running queries. Instead, queries of all sizes are supported in a single system.

B. Network-Aware Join Processing

As the SkyQuery federation expands geographically, the data and scientific queries become large and naturally distributed, which leads to poor query processing performance. Each site may produce hundreds of megabytes of data to be joined at the other sites. Query processing involves sending data from site to site, accumulating results, and eventually delivering query results to the scientist. Given the large data sizes and geographic distance between sites, query processing consumes vast amounts of network resources.

Traditional query processing techniques are poorly suited to the scale and heterogeneity of database federations deployed at a global scale. Previous work focuses on minimizing query completion time. This includes parallel computation at multiple sites and reducing the volume of network traffic. Algorithms that focus on reducing the volume of traffic underutilize the network because paths with excess capacity may be overlooked. Our solution balances the utilization of all network paths, improving performance by an order of magnitude for queries that include ten or more sites [30, 31]. Our algorithms identify network structure, such as the throughput of paths and clusters of sites. We then use this structure to identify excess capacity in the network and schedule joins on those paths.

[Fig. 2. Comparison of join schedules in SkyQuery: (a) Previous Scheduler, (b) Spanning Tree Approximation.]

Figure 2 illustrates the benefits of our scheduling techniques for a user query taken from SkyQuery's Web logs. Previous Scheduler denotes SkyQuery's existing algorithm, which minimizes the volume of network traffic but does not account for network distances. The resulting plan crosses the Atlantic several times and suffers from long data transfer times. In contrast, our Spanning Tree Approximation algorithm accounts for network structure in plan generation and produces a schedule that avoids long-haul paths.
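The contrast between the two schedules can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not the algorithm from [30, 31]: it charges each hop the intermediate result size divided by the path throughput, holds the result size constant along the schedule, and approximates a spanning-tree schedule with a preorder walk of a minimum spanning tree built with Prim's algorithm. The site names, cardinalities, and throughput values are hypothetical.

```python
import heapq


def utilization(order, result_mb, throughput):
    """Seconds that network paths spend carrying data for a serial schedule.

    Assumption of this sketch: every hop ships the same result_mb megabytes.
    """
    return sum(result_mb / throughput[frozenset((a, b))]
               for a, b in zip(order, order[1:]))


def spanning_tree_order(sites, start, throughput):
    """Preorder walk of a minimum spanning tree (Prim), edge cost = 1 / throughput."""
    children = {s: [] for s in sites}
    visited = {start}
    heap = [(1.0 / throughput[frozenset((start, s))], start, s)
            for s in sites if s != start]
    heapq.heapify(heap)
    while heap and len(visited) < len(sites):
        _, parent, site = heapq.heappop(heap)
        if site in visited:
            continue
        visited.add(site)
        children[parent].append(site)
        for nxt in sites:
            if nxt not in visited:
                cost = 1.0 / throughput[frozenset((site, nxt))]
                heapq.heappush(heap, (cost, site, nxt))
    # Iterative preorder traversal of the tree rooted at the start site.
    order, stack = [], [start]
    while stack:
        site = stack.pop()
        order.append(site)
        stack.extend(reversed(children[site]))
    return order


# Hypothetical federation: two US sites, two EU sites (values are made up).
cardinality = {"US1": 10, "EU1": 20, "US2": 30, "EU2": 40}
sites = list(cardinality)
throughput = {}  # MB/s: fast within a continent, slow across the Atlantic
for i, a in enumerate(sites):
    for b in sites[i + 1:]:
        throughput[frozenset((a, b))] = 100.0 if a[:2] == b[:2] else 5.0

by_cardinality = sorted(sites, key=cardinality.get)
tree_order = spanning_tree_order(sites, by_cardinality[0], throughput)

print(by_cardinality, utilization(by_cardinality, 100.0, throughput))
print(tree_order, utilization(tree_order, 100.0, throughput))
```

With these made-up numbers, the ascending-cardinality order crosses the slow transatlantic path three times (60 seconds of path usage for a 100 MB result), while the tree walk crosses it once (22 seconds).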
For this example, our spanning tree-based approximation algorithm achieves a factor of twelve reduction in network utilization. The previous scheduler, which minimizes the volume of network traffic, does not account for network distances. This results in a plan that crosses the Atlantic several times. In contrast, by extracting network structure and including it in plan generation, we produce schedules that avoid long-haul paths, thereby reducing network costs twelve-fold.

Our scheduling technique uses a balanced network utilization metric to capture network structure. The metric accounts for the capacity of network paths (measured by TCP throughput of bulk transfers between sites) and the size of intermediate join results that are produced by a join schedule. Minimizing this metric for all queries reduces the total time in which network paths are used to carry data associated with join queries, thereby reducing contention for network resources. However, there is no guarantee that the solution yields a minimum response time schedule for an individual query.

Scheduling is performed based on local information and aggregate statistics collected about the system prior to optimization. This allows for decentralized optimization decisions, achieving scale and incurring no communication overhead during optimization. Providing more timely knowledge during optimization about concurrent queries initiated from all sites in the network and the state of every network path would incur significant overhead and is not desirable.

A Spanning Tree Approximation algorithm is used to minimize network utilization. The algorithm adapts the two-approximate solution to the traveling salesman problem (TSP) to query scheduling. Namely, join queries initiate at the minimum-cardinality site and intermediate results traverse paths along a minimum spanning tree. We also explore parallel execution strategies by observing that geographically co-located sites form highly connected clusters, which can compute the join in parallel and send the results to sites that are close to the mediator. To achieve a polynomial-time solution, we assume perfect join selectivity: the join of three relations with r_m, r_n, and r_o tuples produces a result with min(r_m, r_n, r_o) tuples. This assumption holds for SkyQuery. We experimented with exhaustive, dynamic programming solutions that made no assumptions about selectivity at the expense of exponential complexity, but the performance benefits were negligible.

C. Data-Driven Batch Scheduling

Sites in SkyQuery service millions of queries each month [9] and are an ideal environment for batch processing. Many cross-match queries have long execution times (several hours or an entire day) and are not response time sensitive: they navigate the entire sky and perform full database scans. Thus, evaluating multiple scan-intensive queries concurrently places substantial demands on the disk and limits the scale of exploration. Fortunately, select data regions experience frequent reuse (e.g., queries concentrated around star clusters of interest): 2% of the data account for more than half of the I/O requests. Queries that overlap in data access also occur close together in time, which benefits caching. Relaxing in-order scheduling in SkyQuery can reduce redundant I/O and achieve large improvements in query throughput.
Specifically, rather than execute queries in arrival order, we can interleave I/O requests from multiple queries (possibly increasing the wait time of existing queries) based on the amount of contention between queries for shared data. This is accomplished by first pre-processing each query to identify its data access requirements. (Pre-processing should be inexpensive relative to I/O, which is true for scan-based workloads on data that is spatially or temporally defined.) We can then co-schedule queries that access the same data to 1) eliminate redundant accesses to the disk and 2) amortize the cost of data access over multiple queries. However, co-scheduling must also account for wait times of existing queries. Short-lived queries (minutes or seconds) that focus on a small region of the sky and are highly selective also exist in SkyQuery, and starving these queries is undesirable.

We developed a data-driven, batch processing scheduler, LifeRaft [32], that identifies data sharing among queries and co-schedules queries against data exhibiting the highest degree of sharing. Our approach is data-driven because it focuses on the data requirements of each query, instead of the arrival order, in order to coordinate query processing with access to secondary storage. This is accomplished by partitioning relational data tables into equal-sized (same number of objects) buckets. (We use a space-filling curve to order the spatial data while preserving spatial proximity within each bucket.) Incoming queries are pre-processed to determine a list of sub-queries that satisfy the following property: each sub-query operates on a single bucket and can be processed in any order. The result of the original query is obtained by combining the sub-query results.

[Fig. 3. Co-scheduling queries to amortize I/O. Panels: No Share, LifeRaft.]
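The bucket-based co-scheduling idea can be sketched as follows. This is a simplified illustration rather than the LifeRaft implementation from [32]: it assumes every query's bucket list is known up front, ignores arrival times, caching, and starvation, and always reads the bucket with the most pending sub-queries next. The query names and bucket numbers are hypothetical.

```python
def in_order_reads(workloads):
    """Bucket reads when queries run one after another with no sharing or caching."""
    return sum(len(buckets) for buckets in workloads.values())


def contention_schedule(workloads):
    """Read the most contended bucket next and serve every pending query on it."""
    pending = {q: set(buckets) for q, buckets in workloads.items()}
    schedule = []
    while any(pending.values()):
        # Contention of a bucket = number of queries with a pending sub-query on it.
        contention = {}
        for buckets in pending.values():
            for b in buckets:
                contention[b] = contention.get(b, 0) + 1
        # Highest contention first; ties broken by bucket id for determinism.
        bucket = min(contention, key=lambda b: (-contention[b], b))
        served = sorted(q for q, buckets in pending.items() if bucket in buckets)
        for q in served:
            pending[q].discard(bucket)
        schedule.append((bucket, served))
    return schedule


# Hypothetical workload: each query maps to the buckets its sub-queries touch.
workloads = {"q1": [1, 2, 3], "q2": [2, 4], "q3": [2, 5]}
schedule = contention_schedule(workloads)

print("reads without sharing:", in_order_reads(workloads))
print("reads with co-scheduling:", len(schedule))
for bucket, served in schedule:
    print("bucket", bucket, "serves", served)
```

In this toy workload, bucket 2 is read once and serves all three queries, so the co-scheduled plan needs five bucket reads instead of seven. A scheduler that follows contention alone can postpone queries on lightly shared buckets, which is why wait times also have to be taken into account.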
Finally, buckets are read from disk by the scheduler one at a time in order of decreasing contention (number of pending queries for shared data), so that queries whose workload (list of sub-queries) overlaps the bucket are processed concurrently, incurring no additional I/O.

Figure 3 illustrates the benefits of co-scheduling queries. The data table is partitioned into five buckets that are joined with three cross-match queries. We assume a uniform cost of one second for reading each bucket from disk. Also, only a single bucket can be cached at a time. The inter-query arrival time is one second; that is and arrives one and two seconds after respectively. No Share denotes in-order processing by the database, which first processes by reading and joining buckets through B5 in sequential order. This schedule does not account for data overlap (i.e. both and access bucket ) and results in redundant I/O. The LifeRaft scheduler illustrates an execution order with minimal I/O. Here, we delay the join of against buckets and B3 until the remaining queries arrive, which allows us to co-schedule queries that share data. However, such reordering leads to the completion of prior to both and. Potentially, this can delay the completion of certain queries indefinitely.

IV. FUTURE WORK

So far, we have decoupled the discussion of network-aware join processing and data-driven batch processing. LifeRaft schedules scan-based workloads in a single-system environment. Looking forward, we want to extend LifeRaft to a federated environment, where it is not clear how join processing can be coordinated across multiple databases. Batch processing is effective in a single system because the data requirements of each query are provided a priori. This allows the scheduler to anticipate data regions that are accessed in the future and reorder existing queries appropriately. Applied to a federation, this means that every site is aware of the data accessed by each query before queries visit and join these sites. This allows different sites to coordinate query execution order and maximize batch size over all sites (e.g., amortize I/O cost over more queries). In essence, this is a distributed version of data-driven batch processing.

[Fig. 4. Batch processing across sites.]

We can sub-divide queries into buckets across multiple databases to facilitate inter-database batch processing. Figure 4 illustrates coordinated batch processing across two sites. Given sites A and B, pending sub-queries exist at B that operate on buckets B1 and while query is pending at A, which also needs to visit and join against B. Note that since joins with bucket, B buffers work against and schedules sub-queries against B1 first. Once B receives the intermediate join results of from A, it then co-schedules previously pending work and incurs no additional I/O for. The drawback is that B may unduly starve requests while waiting for workload from A so that a desired batch size is achieved. Moreover, this approach constrains memory at B because, to achieve a large batch size, it needs to buffer more data.

Our immediate goal is to generalize and improve batch processing for a single site. Given our initial empirical results [32], we want to provide a theoretical treatment of the issues by adapting recent work [29], which studies a similar problem in the Map-Reduce framework. That work does not include caching, and we plan to explore the impact of cache eviction policies. We also plan to address workload overflow, in which intermediate join results of queries will need to be stored to disk and fetched into memory for processing. This requires the scheduler to migrate matching pairs of join results and buckets into memory for evaluation. We note that in federations, the queries may be quite large, because they include intermediate results from other sites that need to be joined with the current site. Moreover, we want to provide robust quality-of-service guarantees for interactive, short-lived queries in a unified framework. Short queries should finish quickly (regardless of the arrival order) and not risk starvation from prior, long-running queries.

Finally, we want to evaluate the applicability of our work for applications outside of scientific databases. With respect to network-aware scheduling, OLAP and DSS applications are starting to employ federations rather than summary databases and data warehouses [33][34], which exhibit similar challenges: real-time processing over data generated all over the globe. Even without perfect join selectivity, incorporating scheduling optimizations that capture network heterogeneity is valuable. We provide solutions for applications that depend on the accuracy of selectivity estimates and can tolerate exponential complexity. We also want to experiment with our batch processing algorithm on other temporal-spatial databases in which queries can be sub-divided into data-defined units of work. The Turbulence database [35] is one such example, which is evolving toward larger data sets and workloads.

REFERENCES

[1] J. Gray and A. Szalay, Where the Rubber Meets the Sky: Bridging the Gap Between Databases and Science, IEEE Data Engineering Bulletin, vol. 27, no. 4, pp. 3-11.
[2] J. Gray and A. Szalay. (2003) Online Science: The World-Wide Telescope as a Prototype for the New Computational Science. Presentation at the Supercomputing Conference. [Online]. Available: Gray/JimGrayTalks.htm
[3] W. O'Mullane, N. Li, M. Nieto-Santisteban, A. Szalay, and A. Thakar, Batch is Back: CasJobs, Serving Multi-TB Data on the Web, in ICWS.
[4] T. Malik, A. S. Szalay, A. S. Budavári, and A. R. Thakar, SkyQuery: A Web Service Approach to Federate Databases, in CIDR.
[5] A. Kementsietsidis, F. Neven, D. V. de Craen, and S. Vansummeren, Scalable Multi-Query Optimization for Exploratory Queries over Federated Scientific Databases, in VLDB.
[6] The EcoliHub Web Service.
[7] S. Ganguly, W. Hasan, and R. Krishnamurthy, Query Optimization for Parallel Execution, in SIGMOD.
[8] A. Szalay, J. Gray, A. Thakar, P. Kuntz, T. Malik, J. Raddick, C. Stoughton, and J. Vandenberg, The SDSS SkyServer - Public Access to the Sloan Digital Sky Server Data, in SIGMOD.
[9] T. Malik, R. Burns, and A. Chaudhary, Bypass Caching: Making Scientific Databases Good Network Citizens, in ICDE.
[10] D. Kossmann, The State of the Art in Distributed Query Processing, ACM Comput. Surv., vol. 32, no. 4.
[11] D. Kossmann and K. Stocker, Iterative Dynamic Programming: A New Class of Query Optimization Algorithms, ACM Trans. on Database Systems, vol. 25, no. 1.
[12] A. Deshpande and J. Hellerstein, Decoupled Query Optimization for Federated Database Systems, in ICDE.
[13] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, Access Path Selection in a Relational Database Management System, in SIGMOD.
[14] A. L. P. Chen and V. O. K. Li, Optimizing Star Queries in a Distributed Database System, in VLDB.
[15] P. Scheuermann and E. I. Chong, Distributed Join Processing Using Bipartite Graphs, in ICDCS.
[16] D.-M. Chiu, P. A. Bernstein, and Y.-C. Ho, Optimizing Chain Queries in a Distributed Database System, SIAM J. Comput., vol. 13, no. 1.
[17] Y. Kambayashi, M. Yoshikawa, and S. Yajima, Query Processing for Distributed Databases using Generalized Semi-Joins, in SIGMOD.
[18] H. Lu and M. J. Carey, Some Experimental Results on Distributed Join Algorithms in a Local Network, in VLDB.
[19] C. Intanagonwiwat, R. Govindan, and D. Estrin, Directed Diffusion: A Scalable and Robust Communication Paradigm for Sensor Networks, in MOBICOM.
[20] B. Krishnamachari, D. Estrin, and S. B. Wicker, The Impact of Data Aggregation in Wireless Sensor Networks, in ICDCSW.
[21] N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri, Medians and Beyond: New Aggregation Techniques for Sensor Networks, in SenSys.
[22] A. Meliou, D. Chu, J. Hellerstein, C. Guestrin, and W. Hong, Data Gathering Tours in Sensor Networks, in IPSN, 2006.

[23] J. Myllymaki and M. Livny, Relational Joins for Data on Tertiary Storage, in ICDE.
[24] S. Sarawagi, Query Processing in Tertiary Memory Databases, in VLDB.
[25] J.-B. Yu and D. J. DeWitt, Query Pre-Execution and Batching in Paradise: A Two-Pronged Approach to the Efficient Processing of Queries on Tape-Resident Raster Images, in SSDBM.
[26] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI.
[27] H. C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, in SIGMOD.
[28] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, Pig Latin: A Not-So-Foreign Language for Data Processing, in SIGMOD.
[29] P. Agrawal, D. Kifer, and C. Olston, Scheduling Shared Scans of Large Data Files, in VLDB.
[30] X. Wang, R. Burns, A. Terzis, and A. Deshpande, International Conference on Data Engineering, in ICDE.
[31] X. Wang, R. Burns, and A. Terzis, Throughput-Optimized, Global-Scale Join Processing in Scientific Federations, in NetDB.
[32] X. Wang, R. Burns, and T. Malik, LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases, in CIDR.
[33] J. M. Hellerstein, M. Stonebraker, and R. Caccia, Independent, Open Enterprise Data Integration, IEEE Data Engineering Bulletin, vol. 22, no. 1.
[34] M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu, Mariposa: A Wide-Area Distributed Database System, VLDB Journal, vol. 5, no. 1.
[35] E. Perlman, R. Burns, Y. Li, and C. Meneveau, Data Exploration of Turbulence Simulations Using a Database Cluster, in SC, 2007.

Throughput-Optimized, Global-Scale Join Processing in Scientific Federations
