Computing Data Cubes Using Massively Parallel Processors
Hongjun Lu, Xiaohui Huang, Zhixian Li
Department of Information Systems and Computer Science, National University of Singapore

Abstract

To better support decision making, it has been proposed to extend SQL to include data cube operations. Computation of a data cube requires computing a number of interrelated group-bys, which is a rather expensive operation when databases are large. In this paper, we propose to couple a relational database management system with massively parallel processors (MPP) to facilitate on-line analytical processing. Extended SQL queries involving complex data analysis, such as data cube computation, are decomposed. Data retrieved from the database are pipelined to the MPP machine, where data cubes are computed in parallel. The system architecture and issues related to parallel computation of data cubes are described. A brute force parallel data cube processing algorithm was implemented, and the results of some preliminary experiments are presented.

1. Introduction

Aggregation is widely used in on-line analytical processing. In relational database systems, SQL supports a set of aggregate functions (SUM, COUNT, AVG, MAX, and MIN). Together with another operator, GROUP BY, users can retrieve not only the data physically stored in a database, but also summaries of such data. For example, with a relation SALES (date, product, customer, amount), the total sales for each product can be easily obtained by issuing a single SQL query:

SELECT product, SUM (amount)
FROM SALES
GROUP BY product

The semantics of the GROUP BY operator is to partition a relation (or sub-relation) into disjoint sets based on the values of the group-by attributes, i.e., the attributes specified in the GROUP BY clause. Aggregate functions are then applied to each of these sets.
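The GROUP BY semantics just described, partitioning into disjoint sets and then aggregating each set, can be sketched in a few lines of Python (the SALES tuples below are made-up illustrative data, not from the paper):

```python
from collections import defaultdict

# Hypothetical SALES(date, product, customer, amount) relation.
sales = [
    ("d1", "p1", "c1", 100),
    ("d1", "p2", "c1", 200),
    ("d2", "p1", "c2", 50),
]

def group_by_sum(rows, key_cols, value_col):
    """Partition rows into disjoint sets on key_cols, then SUM value_col
    within each set."""
    groups = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        groups[key] += row[value_col]
    return dict(groups)

# SELECT product, SUM(amount) FROM SALES GROUP BY product
result = group_by_sum(sales, key_cols=[1], value_col=3)
print(result)  # {('p1',): 150, ('p2',): 200}
```

This dictionary-based formulation is essentially the hash-based evaluation strategy the paper adopts later in Section 3.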
Although relational database systems and the SQL language have been widely used in business applications for the past decade, certain common forms of data analysis, such as histograms, roll-up totals and sub-totals for drill-downs, and cross-tabulation, are difficult with these SQL aggregation constructs [GCB+97]. A new operator, CUBE BY, has been proposed recently to overcome these problems. The CUBE operator is the n-dimensional generalization of the group-by operator. It computes group-bys corresponding to all possible combinations of attributes in the CUBE BY clause. For example, the query

SELECT date, product, customer, SUM (amount)
FROM SALES
CUBE BY date, product, customer

will produce the SUM of amount over all tuples in the database and the results of 7 group-bys, i.e., (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product) and (customer). To make these group-by results union compatible, empty attributes in group-bys are denoted by ALL.

The CUBE BY operator can be simply implemented as a union of a series of group-bys. Obviously, this is quite an expensive approach, especially when the number of cube-by attributes and the database are large. In this paper, we report our study on using massively parallel processors to compute data cubes. It is the first phase of a project on parallel on-line analytical processing. It is our belief that designing and implementing a fully fledged parallel on-line analytical processing system is still not an easy decision for most organizations to make. However, it is a practical solution to couple a commercial relational DBMS with massively parallel processors (or a cluster of general-purpose processors) to form a system.
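The naive union-of-group-bys implementation of CUBE BY, producing 2^n group-bys with ALL marking the absent attributes, might look like the following sketch (the sample tuples are made-up):

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for attributes absent from a group-by

sales = [
    ("d1", "p1", "c1", 100),
    ("d1", "p2", "c1", 200),
    ("d2", "p1", "c2", 50),
]

def naive_cube(rows, n_dims):
    """Compute CUBE BY as a union of 2^n independent group-bys (SUM)."""
    result = {}
    for j in range(n_dims + 1):
        for subset in combinations(range(n_dims), j):
            groups = defaultdict(int)
            for row in rows:
                # Union-compatible key: ALL for attributes not in the subset.
                key = tuple(row[i] if i in subset else ALL
                            for i in range(n_dims))
                groups[key] += row[n_dims]
            result.update(groups)
    return result

cube = naive_cube(sales, 3)
print(len(cube))         # 18 result tuples across all 8 group-bys
print(cube[(ALL,) * 3])  # grand total: 350
```

The cost of this approach is evident: every one of the 2^n group-bys rescans the full input, which is exactly the inefficiency the paper sets out to address.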
In such a system, the DBMS provides efficient storage and retrieval of large volumes of data, and the massively parallel processors provide the computing power required by analytical processing. As a result, queries involving complex data analysis on large volumes of data can be answered with reasonable response time. The remainder of the paper is organized as follows. Section 2 describes a system architecture that uses massively parallel processors as an OLAP engine of a conventional DBMS. Issues related to parallel implementation of the CUBE BY operator are discussed in Section 3. Section 4 presents some preliminary experimental results. Section 5 concludes the paper.

Figure 2.1: Coupling MPP with a DBMS to serve as an OLAP engine (components: Query Analyst, Cost Estimator, Query Dispatcher, Plan Generator, Execution Manager, Result Synthesizer, DBMS, MPP, Database).

2. Using MPP as an OLAP engine

The reference architecture of a system that couples an MPP with a DBMS as an OLAP engine is shown in Figure 2.1. The shadowed portion is a piece of middleware to be designed and implemented in our project. A user query is first analyzed by the Query Analyst to see whether the MPP should be invoked. If so, the query will be decomposed into sub-queries. The Query Dispatcher will send the sub-queries to the DBMS and/or the Plan Generator, which is responsible for generating a parallel plan to be executed by the MPP. The DBMS will retrieve the required data from the database and send them to the MPP for further processing. The execution of the parallel plan is controlled and monitored by the Execution Manager. The Result Synthesizer will assemble the final results of the query and deliver them to the user. To assist the Query Analyst in determining whether the MPP's involvement in processing a particular query is beneficial, a module, the Cost Estimator, is included in the system to estimate the costs of different query plans.
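The routing decision made by the Query Analyst, informed by the Cost Estimator, can be caricatured as a simple cost comparison. The function and the cost figures below are purely illustrative assumptions, not an interface from the paper:

```python
def choose_engine(dbms_cost, mpp_cost, transfer_cost):
    """Route a query to the MPP only if its total estimated cost,
    including shipping data from the DBMS to the MPP, beats the
    DBMS-only plan."""
    return "MPP" if mpp_cost + transfer_cost < dbms_cost else "DBMS"

# A cube query: expensive in the DBMS, cheap on the MPP even after transfer.
assert choose_engine(dbms_cost=100.0, mpp_cost=20.0, transfer_cost=30.0) == "MPP"
# A simple selection: not worth shipping the data to the MPP.
assert choose_engine(dbms_cost=5.0, mpp_cost=1.0, transfer_cost=30.0) == "DBMS"
```

The point of the comparison is that data-transfer cost must be charged to the MPP plan; a loosely coupled MPP only pays off for computation-heavy queries such as cube computation.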
The proposed architecture has a number of features. First, the coupling between the DBMS and the MPP is quite loose. It can be fully controlled through software options, so that users can cut off the coupling simply by turning the option off. Without coupling, user queries are directed to the DBMS as before, without overhead.
Second, both the DBMS and the MPP are highly autonomous. No modification is required to the original hardware and software on either side.

It is obvious that the Query Analyst and the Plan Generator are two key components of the system. While the Query Analyst needs to rewrite a user query once it is determined that parallel processing is beneficial, the Plan Generator is responsible for generating query execution plans for parallel execution. A great deal of research has been reported in the field of parallel query processing and optimization [LOT94], and some of the results can be readily applied. We therefore concentrate on issues that have been less addressed, especially those particularly important to on-line analytical processing, such as the recently proposed CUBE BY operation.

3. Computing data cubes in parallel

A large amount of work has been done on parallel query processing in relational database systems [DeGr92, LOT94]. In the recent surge of research interest in OLAP and multidimensional databases [DEBu95], efficient implementation of the data cube operation has also attracted researchers' attention [AAD+96, DANR96, HaRU96]. However, relatively little work has been devoted to parallel processing of aggregates [ShNa94]. In this section, we discuss some interesting properties and issues related to data cube computation using parallel processors.

3.1 Data cube and cuboids

We adopt the notation used in [DANR96]. Let R be a relation with k+1 attributes X = {A1, A2, ..., Ak, V}. A cuboid on j attributes S = {Ai1, Ai2, ..., Aij} is defined as a group-by on attributes Ai1, Ai2, ..., Aij using an aggregate function F(.) applied to attribute V. This cuboid can be represented as a (k+1)-attribute relation by using the special value ALL for the remaining k-j attributes [GBLP96]. The CUBE on attribute set X is the union of the cuboids on all subsets of attributes of X. The cuboid on all attributes in X is called the base cuboid.
To compute the CUBE, we need to compute all the cuboids that together form the CUBE. Among those cuboids, the base cuboid has to be computed from the original relation. If the aggregate function is distributive (1), the other cuboids can be computed from other cuboids. The five aggregate functions supported by SQL are in fact all distributive, so that a cuboid on attribute set Si can be computed from any cuboid on attribute set Sj if Si is a subset of Sj. Figure 3.1 shows a lattice of cuboids for a relation with 4+1 attributes. The nodes in the lattice are the cuboids to be computed, and the edges indicate that the lower-level cuboids can be computed from the upper-level ones. The numbers in brackets are sample sizes of the cuboids in terms of number of tuples.

Figure 3.1: Cuboids for a relation with 4+1 attributes (R = 500,000 tuples), with sample sizes: (A,B,C,D) [49,998]; (A,B,C) [10,000]; (A,B,D) [5,000]; (A,C,D) [2,500]; (B,C,D) [1,000]; (A,B) [1,000]; (A,C) [500]; (A,D) [250]; (B,C) [200]; (B,D) [100]; (C,D) [50]; (A) [50]; (B) [20]; (C) [10]; (D) [5]; ( ) [1].

There are two basic approaches to computing a group-by: sorting and hashing [GRAE93]. Since hash-based approaches are usually suitable for parallel processing, we consider only hash-based methods in this study. The basic hash-based approach for computing a cuboid is rather straightforward. A hash table is built whose entries are the distinct values of the group-by attributes together with the aggregation value. For each source tuple, a hash function is applied to the group-by attributes. If the values have not been inserted in the hash table, a new entry is created. The aggregate function is applied and the result is used to update the entry.

(1) Aggregate function F() is distributive if there is a function G() such that F({X_ij}) = G({F({X_ij | i = 1, ..., I}) | j = 1, ..., J}).
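The hash-based cuboid computation, and the fact that a distributive function such as SUM lets a cuboid be computed from any parent cuboid rather than from the base relation, can be sketched as follows (the data are hypothetical; this is not the paper's implementation):

```python
from collections import defaultdict

def cuboid(rows, group_cols, value_col):
    """Hash-based group-by: one hash-table entry per distinct key,
    updated in place by the aggregate function (SUM here)."""
    table = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        table[key] += row[value_col]
    return table

base = [("a1", "b1", "c1", 10), ("a1", "b1", "c2", 20), ("a2", "b2", "c1", 5)]

# Base cuboid (A, B, C) computed from the original relation.
abc = cuboid(base, [0, 1, 2], 3)

# Because SUM is distributive, (A) can be computed either from the base
# relation or from the smaller (A, B, C) cuboid -- the results agree.
a_from_base = cuboid(base, [0], 3)
a_from_abc = cuboid([(*k, v) for k, v in abc.items()], [0], 3)
assert a_from_base == a_from_abc  # {('a1',): 30, ('a2',): 5}
```

Computing from the smallest available parent rather than the base relation is what makes the lattice edges of Figure 3.1 worth exploiting.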
The above description assumes that the hash table resides in memory. If the size of available memory is smaller than required, a cuboid can be computed in a number of iterations; in each iteration, a sub-cuboid whose hash table fits in memory is computed.

Computing a cube requires computing a number of interrelated cuboids. Agrawal et al. summarized the optimization techniques that can be used [AAD+96]:

Smallest-parent: computing a cuboid from the smallest previously computed cuboid. In Figure 3.1, edges in solid lines connect a cuboid to the smallest parent from which it can be computed.
Cache-results: caching the results of a cuboid from which other cuboids are computed, to reduce disk I/Os.
Amortize-scans: computing as many cuboids as possible at the same time, to amortize disk reads.
Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used.
Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used.

A number of algorithms that incorporate some of the above techniques have also been proposed and studied.

3.2 Parallel evaluation of data cubes

In an MPP environment, it is easy to have a much larger aggregate memory than on a uniprocessor machine, which makes it attractive to use MPP machines to compute data cubes. However, for most OLAP applications, the data to be aggregated are usually too large even for MPP machines. In particular, those processors are not dedicated database machines, and the memory available to database processes will still be limited. Careful memory management for multiple-cuboid evaluation is therefore still a key issue in parallel evaluation of data cubes. Furthermore, when more than one cuboid is to be computed using more than one processor, cuboid allocation is another issue.

Data partitioning strategy. The standard way to deal with limited memory in hash-based algorithms is to partition the data.
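Iterating over hash partitions so that only one partition's hash table is memory-resident at a time can be sketched as follows (the data and the partition count are illustrative assumptions):

```python
from collections import defaultdict

def partitioned_cuboid(rows, group_cols, value_col, n_partitions):
    """Compute a cuboid one hash partition at a time, so that only one
    partition's hash table needs to be memory-resident per iteration."""
    result = {}
    for p in range(n_partitions):
        table = defaultdict(int)  # hash table for this iteration only
        for row in rows:
            key = tuple(row[c] for c in group_cols)
            if hash(key) % n_partitions == p:  # tuple belongs to partition p
                table[key] += row[value_col]
        result.update(table)      # partitions are disjoint sub-cuboids
    return result

rows = [("a1", "b1", 10), ("a1", "b2", 20), ("a2", "b1", 5), ("a1", "b1", 1)]
assert partitioned_cuboid(rows, [0, 1], 2, n_partitions=3) == {
    ("a1", "b1"): 11, ("a1", "b2"): 20, ("a2", "b1"): 5,
}
```

Each key falls in exactly one partition, so the union of the per-iteration sub-cuboids equals the full cuboid regardless of how many partitions are used; the price is one scan of the input per iteration.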
When multiple interrelated cuboids are to be evaluated, the data can be partitioned based on either attributes or cuboids. With attribute-based partitioning, when data is partitioned on some attribute, say A, all cuboids that contain A are partitioned on A and computed at the same time; in other words, a number of partial cuboids are computed concurrently in an iteration. With cuboid-based partitioning, data are partitioned only if a cuboid is too large, and more than one cuboid is evaluated in the same iteration only when the available memory can accommodate more than one cuboid.

Cuboid allocation strategy. When more than one processor is available to evaluate a partition that contains more than one cuboid, or partitions from more than one cuboid, the cuboid allocation strategy distributes the computation among the available nodes. There are two possible strategies. The concurrent allocation strategy allows all cuboids or cuboid partitions to be evaluated concurrently: a minimum number of nodes is allocated to evaluate each cuboid (or cuboid partition), based on the size of the cuboid and the size of available memory. The sequential allocation strategy evaluates the cuboids sequentially and evenly distributes the computation of each cuboid (or cuboid partition) among all available nodes. The advantage of concurrent allocation is that each node maintains fewer hash tables, so fewer hash values are computed at each node. With sequential allocation, a node may be required to evaluate all cuboids in an iteration, so more hash tables and more hash operations are required. However, some optimization techniques, such as caching results, are easier to incorporate into the processing algorithm.

3.3 A brute force evaluation algorithm

To gain some hands-on experience with parallel evaluation of data cubes, a brute force algorithm was implemented:

One node is allocated as the execution coordinator.
It reads the data, both the original relation and the computed cuboids, and broadcasts the tuples over the network.
A greedy algorithm is used to determine the cuboids, or portions of cuboids, to be evaluated during each iteration. A cuboid-based partitioning strategy is used. That is, given the aggregate memory of the participating processors, the algorithm finds the next set of cuboids based on estimates of the sizes of the cuboids. All cuboids are sorted on the size of their parents to form a list, and each iteration chooses the next set of cuboids that can fit in memory. Since all cuboids are computed from the same set of source tuples during an iteration, a cuboid may not be computed from its direct parent. The possible increase in CPU cost is estimated to determine whether it is beneficial to include a cuboid in the iteration.

During each iteration, the cuboids to be evaluated are allocated to the available processors. Two cuboid allocation strategies were implemented. The minimum-node strategy allocates the minimum number of nodes to compute each cuboid, based on the sizes of the cuboids and the memory. The even-distribution strategy distributes a cuboid over all available nodes for parallel execution.

The execution coordinator broadcasts the tuples over the network. Participating nodes build hash tables for the cuboid partitions to be evaluated locally. On receiving an input tuple, a node applies hash functions to determine whether the tuple should be evaluated; if so, the aggregate function is applied and the value is used to update the hash table. There is no data transfer among the participating nodes. When a node completes its computation, a message is sent to the coordinator, and the completed cuboid is then available for evaluating other cuboids.

We call the algorithm a brute force algorithm, as it is a straightforward implementation of parallel evaluation of data cubes with little optimization incorporated.
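The two implemented allocation strategies can be contrasted with a small sketch. The node counts, per-node memory, and cuboid sizes below are made-up (sizes are in tuples); the minimum-node strategy packs each cuboid onto just enough nodes, while the even-distribution strategy spreads every cuboid over all nodes:

```python
import math

def minimum_node_allocation(cuboid_sizes, n_nodes, mem_per_node):
    """Give each cuboid the minimum number of nodes whose combined
    memory holds its hash table; cuboids then proceed concurrently."""
    plan, next_node = {}, 0
    for name, size in cuboid_sizes.items():
        need = math.ceil(size / mem_per_node)
        plan[name] = list(range(next_node, next_node + need))
        next_node += need
    assert next_node <= n_nodes, "not enough nodes for this iteration"
    return plan

def even_distribution_allocation(cuboid_sizes, n_nodes):
    """Spread every cuboid over all available nodes; each node then
    holds one hash-table partition per cuboid in the iteration."""
    return {name: list(range(n_nodes)) for name in cuboid_sizes}

sizes = {"AB": 1000, "AC": 500, "BC": 200}
print(minimum_node_allocation(sizes, n_nodes=4, mem_per_node=500))
print(even_distribution_allocation(sizes, n_nodes=4))
```

Under minimum-node allocation each node handles at most one or two cuboids, so each input tuple costs few hash operations per node; under even distribution every node hashes every tuple once per cuboid in the iteration.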
This is because the main objective of this first implementation is not to pursue high performance but to become familiar with the facilities provided by the AP3000 and the properties of data cube computation.

4. A preliminary performance study

The brute force algorithm described in the previous section was implemented in C. In this section, we report the results of some initial experiments. The experiments were conducted on a Fujitsu AP3000 with 32 nodes. The relation used for computing the data cube contains 500,000 tuples of 5 attributes, A, B, C, D and V, where A, B, C, and D are group-by attributes. The aggregate function SUM is applied to attribute V. The cardinalities of the four attributes are 50, 20, 10, and 5 respectively, and the attribute values are uniformly distributed within their respective ranges. The sizes of the cuboids are those shown in Figure 3.1. To simulate memory constraints, the test program uses only a buffer area equivalent to a certain number of tuples for the cube computation. For each experiment, the CPU time, the message time (i.e., time used to receive broadcast data), and the output time (i.e., time for writing result tuples to disk) were recorded. Because the memory allocated is smaller than the total size required, the computation requires a number of iterations. For each iteration, the longest time among all participating nodes is taken as the processing time of the iteration. The total processing time is the sum of the processing times of all iterations. That is, the processing time

T_o = sum from i = 1 to k of max(T_oi),  where o is in {CPU, message, output},

k is the number of iterations, and max(T_oi) is the maximum of the processing times among all the participating nodes used to complete iteration i.

We report here the results of three sets of experiments that investigated the effects of the number of nodes, scheduling strategies, and memory size on the cube processing time.

4.1 Experiment One

The first set of experiments studies the processing time as the number of nodes used for computing the cube varies. The results are shown in Figure 4.1.
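The per-component total processing time, summing over iterations the time of the slowest participating node in each, can be computed directly (the timings below are made-up numbers, not experimental data):

```python
def total_time(times_per_iteration):
    """T_o = sum over iterations i of max over nodes of T_oi:
    each iteration lasts as long as its slowest participating node."""
    return sum(max(node_times) for node_times in times_per_iteration)

# Two iterations, three nodes each (seconds for one cost component, e.g. CPU).
cpu = [[4.0, 5.0, 3.5],   # iteration 1: slowest node takes 5.0 s
       [2.0, 2.5, 2.2]]   # iteration 2: slowest node takes 2.5 s
assert total_time(cpu) == 7.5
```

The max inside the sum is why reducing the number of iterations, rather than merely adding nodes, dominates the results reported below.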
Figure 4.1: Time for computing the sample cube (CPU, Message, and Output time in seconds vs. number of nodes).

From Figure 4.1, we can see that the processing time drops dramatically when we increase the number of participating nodes from 1 to 4. With 5 or more nodes, the speed-up is not as significant. The major reason is that the processing time depends largely on the number of iterations required to compute all the cuboids. Table 4.1 lists the iterations and the cuboids or cuboid partitions processed when different numbers of nodes were used. With 5 or more nodes, the hash table of the largest cuboid, the base cuboid, can be held in memory, so the computation can be completed in two iterations. The number of input tuples processed is the same; the only benefit of more nodes is that the portion of the cuboids to be computed by each node becomes smaller, and this saving is marginal.

Table 4.1: Iterations and cuboids computed
Nodes  Iterations  Cuboids computed
1      7           {ABCD_0} {ABCD_1} {ABCD_2} {ABCD_3} {ABCD_4} {ABC, BCD} {ABD, ...}
2      4           {ABCD_0} {ABCD_1} {ABCD_2} {ABC, ...}
3      3           {ABCD_0} {ABCD_1} {ABC, ...}
4      3           {ABCD_0} {ABCD_1} {ABC, ...}
5      2           {ABCD} {ABC, ...}
6      2           {ABCD} {ABC, ...}
7      2           {ABCD} {ABC, ...}
8      2           {ABCD} {ABC, ...}

4.2 Experiment Two

In the first experiment, a cuboid was evaluated using as many nodes as possible; that is, the even-distribution allocation strategy was used. In the second experiment, when more than one cuboid was evaluated in the same iteration, each cuboid was assigned to the minimum number of nodes based on the size of the cuboid and the size of the aggregate memory of those nodes. The results are shown in Figure 4.2.

Figure 4.2: Time for computing the sample cube (CPU, Message, and Output time in seconds vs. number of nodes).

Comparing Figure 4.2 with Figure 4.1, we can see that the CPU time using the second strategy is about 70-75% of the first one when the number of nodes increases beyond four. This can be explained as follows. If a node is responsible for computing n cuboids, n hash operations are required for each input tuple.
With 5-8 nodes, the second iteration evaluates 15 cuboids and no node needs to evaluate more than 2 cuboids; with the first strategy, each node may be required to evaluate more than 10 cuboids.

4.3 Memory size and processing time

The size of available memory determines the number of iterations, which in turn has a dramatic effect on the total processing time. In the third experiment, we fixed the number of nodes participating in the cube computation at two and varied the size of memory available at each node. The results are shown in Figure 4.3. The curves in Figure 4.3 show the same trend as Figures 4.1 and 4.2. However, the speed-up as the memory size increases is not as large as in the previous cases. For example, the CPU time for 2 nodes with memory of 44,000 tuples each is about
Figure 4.3: Processing time vs. memory size (aggregate memory in units of 11,000 tuples; CPU, Message, and Output time in seconds).

twice as much as when 8 nodes with memory of 11,000 tuples each are used. In other words, in addition to the effect of a larger aggregate memory, parallel processing does bring further benefit to cube computation.

5. Discussion

To study the feasibility and benefit of using massively parallel processors to compute data cubes, a brute force algorithm was implemented. A preliminary study indicates that, even without much optimization, massively parallel processors can indeed speed up the computation of data cubes, partly because of the large aggregate memory. To develop high-performance parallel data cube processing algorithms, a number of issues need to be carefully considered, among them data partitioning methods and cuboid allocation strategies. Designing and implementing an algorithm that incorporates various optimization techniques is our immediate task.

In our experiments, data were transferred using a broadcast method, and an execution coordinator is responsible for reading and transmitting the data. Comparing the different data transfer methods provided by the system is another of our tasks.

References

[AAD+96] S. Agrawal et al., On the computation of multidimensional aggregates. In Proc. of the 22nd VLDB Conf., Mumbai, India, 1996.
[BCL93] K.P. Brown, M.J. Carey, and M. Livny, Managing memory to meet multiclass workload response time goals. In Proc. of the 19th VLDB Conf., 1993.
[DANR96] P.M. Deshpande et al., Computation of multidimensional aggregates. Technical Report 1314, Computer Sciences Department, University of Wisconsin-Madison, 1996.
[DeGr92] D.J. DeWitt and J. Gray, Parallel database systems: the future of high performance database systems. CACM, June 1992.
[GCB+97] J. Gray et al., Data Cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, Vol. 1, No. 1, 1997.
[GRAE93] G. Graefe, Query evaluation techniques for large databases. ACM Computing Surveys, Vol. 25, No. 2, 1993.
[LOT94] H. Lu, B.C. Ooi, and K.-L. Tan, Query Processing in Parallel Relational Database Systems. IEEE Computer Society Press, 1994.
[ShNa94] A. Shatdal and J.F. Naughton, Processing aggregates in parallel database systems. Technical Report 1233, Computer Sciences Department, University of Wisconsin-Madison, 1994.
More informationCS570 Introduction to Data Mining
CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay Partial slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns,
More informationAdvanced Databases: Parallel Databases A.Poulovassilis
1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger
More informationDifferent Cube Computation Approaches: Survey Paper
Different Cube Computation Approaches: Survey Paper Dhanshri S. Lad #, Rasika P. Saste * # M.Tech. Student, * M.Tech. Student Department of CSE, Rajarambapu Institute of Technology, Islampur(Sangli), MS,
More informationEvaluation of relational operations
Evaluation of relational operations Iztok Savnik, FAMNIT Slides & Textbook Textbook: Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, McGraw-Hill, 3 rd ed., 2007. Slides: From Cow Book
More informationEvaluation of Relational Operations. Relational Operations
Evaluation of Relational Operations Chapter 14, Part A (Joins) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Relational Operations v We will consider how to implement: Selection ( )
More informationAn Overview of various methodologies used in Data set Preparation for Data mining Analysis
An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of
More informationOptimization of Queries in Distributed Database Management System
Optimization of Queries in Distributed Database Management System Bhagvant Institute of Technology, Muzaffarnagar Abstract The query optimizer is widely considered to be the most important component of
More informationDeccansoft Software Services Microsoft Silver Learning Partner. SSAS Syllabus
Overview: Analysis Services enables you to analyze large quantities of data. With it, you can design, create, and manage multidimensional structures that contain detail and aggregated data from multiple
More informationBuilding Large ROLAP Data Cubes in Parallel
Building Large ROLAP Data Cubes in Parallel Ying Chen Dalhousie University Halifax, Canada ychen@cs.dal.ca Frank Dehne Carleton University Ottawa, Canada www.dehne.net A. Rau-Chaplin Dalhousie University
More informationFrom SQL-query to result Have a look under the hood
From SQL-query to result Have a look under the hood Classical view on RA: sets Theory of relational databases: table is a set Practice (SQL): a relation is a bag of tuples R π B (R) π B (R) A B 1 1 2
More informationImproved Data Partitioning For Building Large ROLAP Data Cubes in Parallel
Improved Data Partitioning For Building Large ROLAP Data Cubes in Parallel Ying Chen Dalhousie University Halifax, Canada ychen@cs.dal.ca Frank Dehne Carleton University Ottawa, Canada www.dehne.net frank@dehne.net
More informationQuery Optimization in Distributed Databases. Dilşat ABDULLAH
Query Optimization in Distributed Databases Dilşat ABDULLAH 1302108 Department of Computer Engineering Middle East Technical University December 2003 ABSTRACT Query optimization refers to the process of
More informationDATA MINING II - 1DL460
DATA MINING II - 1DL460 Spring 2013 " An second class in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationTHE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER
THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose
More informationFast Discovery of Sequential Patterns Using Materialized Data Mining Views
Fast Discovery of Sequential Patterns Using Materialized Data Mining Views Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo
More informationCommunication and Memory Optimal Parallel Data Cube Construction
Communication and Memory Optimal Parallel Data Cube Construction Ruoming Jin Ge Yang Karthik Vaidyanathan Gagan Agrawal Department of Computer and Information Sciences Ohio State University, Columbus OH
More informationRELATIONAL OPERATORS #1
RELATIONAL OPERATORS #1 CS 564- Spring 2018 ACKs: Jeff Naughton, Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? Algorithms for relational operators: select project 2 ARCHITECTURE OF A DBMS query
More informationData Cube Technology
Data Cube Technology Erwin M. Bakker & Stefan Manegold https://homepages.cwi.nl/~manegold/dbdm/ http://liacs.leidenuniv.nl/~bakkerem2/dbdm/ s.manegold@liacs.leidenuniv.nl e.m.bakker@liacs.leidenuniv.nl
More informationANU MLSS 2010: Data Mining. Part 2: Association rule mining
ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements
More informationDATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843
DATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843 WHAT IS A DATA CUBE? The Data Cube or Cube operator produces N-dimensional answers
More informationFrequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar
Frequent Pattern Mining Based on: Introduction to Data Mining by Tan, Steinbach, Kumar Item sets A New Type of Data Some notation: All possible items: Database: T is a bag of transactions Transaction transaction
More informationInternational Journal of Computer Sciences and Engineering. Research Paper Volume-6, Issue-1 E-ISSN:
International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-6, Issue-1 E-ISSN: 2347-2693 Precomputing Shell Fragments for OLAP using Inverted Index Data Structure D. Datta
More informationMap-Reduce for Cube Computation
299 Map-Reduce for Cube Computation Prof. Pramod Patil 1, Prini Kotian 2, Aishwarya Gaonkar 3, Sachin Wani 4, Pramod Gaikwad 5 Department of Computer Science, Dr.D.Y.Patil Institute of Engineering and
More informationEvaluation of Relational Operations
Evaluation of Relational Operations Yanlei Diao UMass Amherst March 13 and 15, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection
More informationWeb page recommendation using a stochastic process model
Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,
More informationA Graph-Based Approach for Mining Closed Large Itemsets
A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and
More informationParser: SQL parse tree
Jinze Liu Parser: SQL parse tree Good old lex & yacc Detect and reject syntax errors Validator: parse tree logical plan Detect and reject semantic errors Nonexistent tables/views/columns? Insufficient
More informationHorizontal Aggregations in SQL to Prepare Data Sets Using PIVOT Operator
Horizontal Aggregations in SQL to Prepare Data Sets Using PIVOT Operator R.Saravanan 1, J.Sivapriya 2, M.Shahidha 3 1 Assisstant Professor, Department of IT,SMVEC, Puducherry, India 2,3 UG student, Department
More informationAlgorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)
Chapter 19 Algorithms for Query Processing and Optimization 0. Introduction to Query Processing (1) Query optimization: The process of choosing a suitable execution strategy for processing a query. Two
More informationOLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube
OLAP2 outline Multi Dimensional Data Model Need for Multi Dimensional Analysis OLAP Operators Data Cube Demonstration Using SQL Multi Dimensional Data Model Multi dimensional analysis is a popular approach
More informationDW Performance Optimization (II)
DW Performance Optimization (II) Overview Data Cube in ROLAP and MOLAP ROLAP Technique(s) Efficient Data Cube Computation MOLAP Technique(s) Prefix Sum Array Multiway Augmented Tree Aalborg University
More informationData Cube Technology. Chapter 5: Data Cube Technology. Data Cube: A Lattice of Cuboids. Data Cube: A Lattice of Cuboids
Chapter 5: Data Cube Technology Data Cube Technology Data Cube Computation: Basic Concepts Data Cube Computation Methods Erwin M. Bakker & Stefan Manegold https://homepages.cwi.nl/~manegold/dbdm/ http://liacs.leidenuniv.nl/~bakkerem2/dbdm/
More informationMaterialized Data Mining Views *
Materialized Data Mining Views * Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland tel. +48 61
More informationOLAP Introduction and Overview
1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata
More informationChapter 18: Parallel Databases Chapter 19: Distributed Databases ETC.
Chapter 18: Parallel Databases Chapter 19: Distributed Databases ETC. Introduction Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply
More information2.3 Algorithms Using Map-Reduce
28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure
More informationEvaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi
Evaluation of Relational Operations: Other Techniques Chapter 14 Sayyed Nezhadi Schema for Examples Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer,
More informationOn Multiple Query Optimization in Data Mining
On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl
More informationModel for Load Balancing on Processors in Parallel Mining of Frequent Itemsets
American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationCompSci 516 Data Intensive Computing Systems
CompSci 516 Data Intensive Computing Systems Lecture 20 Data Mining and Mining Association Rules Instructor: Sudeepa Roy CompSci 516: Data Intensive Computing Systems 1 Reading Material Optional Reading:
More informationItem Set Extraction of Mining Association Rule
Item Set Extraction of Mining Association Rule Shabana Yasmeen, Prof. P.Pradeep Kumar, A.Ranjith Kumar Department CSE, Vivekananda Institute of Technology and Science, Karimnagar, A.P, India Abstract:
More informationData Mining Part 3. Associations Rules
Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets
More informationACM-ICPC Indonesia National Contest Problem A. The Best Team. Time Limit: 2s
Problem A The Best Team Time Limit: 2s ACM-ICPC 2010 is drawing near and your university want to select three out of N students to form the best team. The university however, has a limited budget, so they
More informationData Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..
.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Mining Association Rules Definitions Market Baskets. Consider a set I = {i 1,...,i m }. We call the elements of I, items.
More informationEvaluation of Relational Operations
Evaluation of Relational Operations Chapter 14 Comp 521 Files and Databases Fall 2010 1 Relational Operations We will consider in more detail how to implement: Selection ( ) Selects a subset of rows from
More informationNovel Materialized View Selection in a Multidimensional Database
Graphic Era University From the SelectedWorks of vijay singh Winter February 10, 2009 Novel Materialized View Selection in a Multidimensional Database vijay singh Available at: https://works.bepress.com/vijaysingh/5/
More informationParallelizing Frequent Itemset Mining with FP-Trees
Parallelizing Frequent Itemset Mining with FP-Trees Peiyi Tang Markus P. Turkia Department of Computer Science Department of Computer Science University of Arkansas at Little Rock University of Arkansas
More informationCost Models for Query Processing Strategies in the Active Data Repository
Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272
More informationImproving the Performance of OLAP Queries Using Families of Statistics Trees
Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University
More informationData Communication and Parallel Computing on Twisted Hypercubes
Data Communication and Parallel Computing on Twisted Hypercubes E. Abuelrub, Department of Computer Science, Zarqa Private University, Jordan Abstract- Massively parallel distributed-memory architectures
More informationThe cgmcube project: Optimizing parallel data cube generation for ROLAP
Distrib Parallel Databases (2006) 19: 29 62 DOI 10.1007/s10619-006-6575-6 The cgmcube project: Optimizing parallel data cube generation for ROLAP Frank Dehne Todd Eavis Andrew Rau-Chaplin C Science + Business
More informationMitigating Data Skew Using Map Reduce Application
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationIncognito: Efficient Full Domain K Anonymity
Incognito: Efficient Full Domain K Anonymity Kristen LeFevre David J. DeWitt Raghu Ramakrishnan University of Wisconsin Madison 1210 West Dayton St. Madison, WI 53706 Talk Prepared By Parul Halwe(05305002)
More informationHANA Performance. Efficient Speed and Scale-out for Real-time BI
HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business
More informationPerformance and Scalability: Apriori Implementa6on
Performance and Scalability: Apriori Implementa6on Apriori R. Agrawal and R. Srikant. Fast algorithms for mining associa6on rules. VLDB, 487 499, 1994 Reducing Number of Comparisons Candidate coun6ng:
More informationColumn-Oriented Database Systems. Liliya Rudko University of Helsinki
Column-Oriented Database Systems Liliya Rudko University of Helsinki 2 Contents 1. Introduction 2. Storage engines 2.1 Evolutionary Column-Oriented Storage (ECOS) 2.2 HYRISE 3. Database management systems
More informationDatabase design View Access patterns Need for separate data warehouse:- A multidimensional data model:-
UNIT III: Data Warehouse and OLAP Technology: An Overview : What Is a Data Warehouse? A Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to
More informationA Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective
A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective B.Manivannan Research Scholar, Dept. Computer Science, Dravidian University, Kuppam, Andhra Pradesh, India
More informationHorizontal Aggregations for Mining Relational Databases
Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh,
More informationDistributed DBMS. Concepts. Concepts. Distributed DBMS. Concepts. Concepts 9/8/2014
Distributed DBMS Advantages and disadvantages of distributed databases. Functions of DDBMS. Distributed database design. Distributed Database A logically interrelated collection of shared data (and a description
More informationCAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1
CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query Sub-System Queries Select * From Blah B Where B.blah = blah Query Parser Query Optimizer Plan Generator Plan Cost
More information