Computing Data Cubes Using Massively Parallel Processors
Hongjun Lu, Xiaohui Huang, Zhixian Li
Department of Information Systems and Computer Science, National University of Singapore

Abstract

To better support decision making, it has been proposed to extend SQL to include data cube operations. Computation of a data cube requires computing a number of interrelated group-bys, which is a rather expensive operation when databases are large. In this paper, we propose to couple a relational database management system with massively parallel processors (MPP) to facilitate on-line analytical processing. Extended SQL queries involving complex data analysis, such as data cube computation, are decomposed. Data retrieved from the database are pipelined to the MPP machine, where data cubes are computed in parallel. The system architecture and issues related to parallel computation of data cubes are described. A brute force parallel data cube processing algorithm was implemented, and the results of some preliminary experiments are presented.

1. Introduction

Aggregation is widely used in on-line analytical processing. In relational database systems, SQL supports a set of aggregate functions (SUM, COUNT, AVG, MAX, and MIN). Together with another operator, GROUP BY, users can retrieve not only the data physically stored in a database, but also summaries of such data. For example, with a relation SALES (date, product, customer, amount), the total sales for each product can be easily obtained by issuing a single SQL query:

SELECT product, SUM (amount)
FROM SALES
GROUP BY product

The semantics of the GROUP BY operator is to partition a relation (or sub-relation) into disjoint sets based on the values of the group-by attributes, i.e., the attributes specified in the GROUP BY clause. Aggregate functions are then applied to each of these sets.
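The GROUP BY semantics just described, partitioning into disjoint sets and then aggregating each set, can be sketched in a few lines of Python (the SALES tuples below are made-up illustrative data, not from the paper):

```python
from collections import defaultdict

# Hypothetical SALES(date, product, customer, amount) relation.
sales = [
    ("d1", "p1", "c1", 100),
    ("d1", "p2", "c1", 200),
    ("d2", "p1", "c2", 50),
]

def group_by_sum(rows, key_cols, value_col):
    """Partition rows into disjoint sets on key_cols, then SUM value_col
    within each set."""
    groups = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        groups[key] += row[value_col]
    return dict(groups)

# SELECT product, SUM(amount) FROM SALES GROUP BY product
result = group_by_sum(sales, key_cols=[1], value_col=3)
print(result)  # {('p1',): 150, ('p2',): 200}
```

This dictionary-based formulation is essentially the hash-based evaluation strategy the paper adopts later in Section 3.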
Although relational database systems and the SQL language have been widely used in business applications for the past decade, certain common forms of data analysis, such as histograms, roll-up totals and sub-totals for drill-downs, and cross-tabulation, are difficult with these SQL aggregation constructs [GCB+97]. A new operator, CUBE BY, has been proposed recently to overcome these problems. The CUBE operator is the n-dimensional generalization of the group-by operator. It computes group-bys corresponding to all possible combinations of attributes in the CUBE BY clause. For example, the query

SELECT date, product, customer, SUM (amount)
FROM SALES
CUBE BY date, product, customer

will produce the SUM of amount over all tuples in the database and the results of 7 group-bys, i.e., (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product) and (customer). To make these group-by results union compatible, empty attributes in group-bys are denoted by ALL.

The CUBE BY operator can be simply implemented as a union of a series of group-bys. Obviously, this is quite an expensive approach, especially when the number of cube-by attributes and the database are large. In this paper, we report our study on using massively parallel processors to compute data cubes. It is the first phase of a project on parallel on-line analytical processing. It is our belief that designing and implementing a fully fledged parallel on-line analytical processing system is still not an easy decision for most organizations to make. However, it is a practical solution to couple a commercial relational DBMS with massively parallel processors (or a cluster of general-purpose processors) to form a system.
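The naive union-of-group-bys implementation of CUBE BY, producing 2^n group-bys with ALL marking the absent attributes, might look like the following sketch (the sample tuples are made-up):

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for attributes absent from a group-by

sales = [
    ("d1", "p1", "c1", 100),
    ("d1", "p2", "c1", 200),
    ("d2", "p1", "c2", 50),
]

def naive_cube(rows, n_dims):
    """Compute CUBE BY as a union of 2^n independent group-bys (SUM)."""
    result = {}
    for j in range(n_dims + 1):
        for subset in combinations(range(n_dims), j):
            groups = defaultdict(int)
            for row in rows:
                # Union-compatible key: ALL for attributes not in the subset.
                key = tuple(row[i] if i in subset else ALL
                            for i in range(n_dims))
                groups[key] += row[n_dims]
            result.update(groups)
    return result

cube = naive_cube(sales, 3)
print(len(cube))         # 18 result tuples across all 8 group-bys
print(cube[(ALL,) * 3])  # grand total: 350
```

The cost of this approach is evident: every one of the 2^n group-bys rescans the full input, which is exactly the inefficiency the paper sets out to address.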
In such a system, the DBMS provides efficient storage and retrieval of large volumes of data, and the massively parallel processors provide the computing power required by analytical processing. As a result, queries involving complex data analysis on large volumes of data can be answered with reasonable response time. The remainder of the paper is organized as follows. Section 2 describes a system architecture that uses massively parallel processors as an OLAP engine of a conventional DBMS. Issues related to parallel implementation of the CUBE BY operator are discussed in Section 3. Section 4 presents some preliminary experimental results. Section 5 concludes the paper.

Figure 2.1: Coupling MPP with a DBMS to serve as an OLAP engine (components: Query Analyst, Cost Estimator, Query Dispatcher, Plan Generator, Execution Manager, Result Synthesizer, DBMS, MPP, Database).

2. Using MPP as an OLAP engine

The reference architecture of a system that couples an MPP with a DBMS as an OLAP engine is shown in Figure 2.1. The shadowed portion is a piece of middleware to be designed and implemented in our project. A user query is first analyzed by the Query Analyst to see whether the MPP should be invoked. If so, the query will be decomposed into sub-queries. The Query Dispatcher will send the sub-queries to the DBMS and/or the Plan Generator, which is responsible for generating a parallel plan to be executed by the MPP. The DBMS will retrieve the required data from the database and send them to the MPP for further processing. The execution of the parallel plan is controlled and monitored by the Execution Manager. The Result Synthesizer will assemble the final results of the query and deliver them to the user. To assist the Query Analyst in determining whether the MPP's involvement in processing a particular query is beneficial, a module, the Cost Estimator, is included in the system to estimate the costs of different query plans.
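The routing decision made by the Query Analyst, informed by the Cost Estimator, can be caricatured as a simple cost comparison. The function and the cost figures below are purely illustrative assumptions, not an interface from the paper:

```python
def choose_engine(dbms_cost, mpp_cost, transfer_cost):
    """Route a query to the MPP only if its total estimated cost,
    including shipping data from the DBMS to the MPP, beats the
    DBMS-only plan."""
    return "MPP" if mpp_cost + transfer_cost < dbms_cost else "DBMS"

# A cube query: expensive in the DBMS, cheap on the MPP even after transfer.
assert choose_engine(dbms_cost=100.0, mpp_cost=20.0, transfer_cost=30.0) == "MPP"
# A simple selection: not worth shipping the data to the MPP.
assert choose_engine(dbms_cost=5.0, mpp_cost=1.0, transfer_cost=30.0) == "DBMS"
```

The point of the comparison is that data-transfer cost must be charged to the MPP plan; a loosely coupled MPP only pays off for computation-heavy queries such as cube computation.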
The proposed architecture has a number of features. First, the coupling between the DBMS and the MPP is quite loose. It can be fully controlled through software options, so that users can cut off the coupling simply by turning the option off. Without coupling, user queries are directed to the DBMS as before, without overhead.
Second, both the DBMS and the MPP are highly autonomous. No modification is required to the original hardware and software on either side.

It is obvious that the Query Analyst and the Plan Generator are two key components of the system. While the Query Analyst needs to rewrite a user query once it is determined that parallel processing is beneficial, the Plan Generator is responsible for generating query execution plans for parallel execution. A great deal of research has been reported in the field of parallel query processing and optimization [LOT94], and some of the results can be readily applied. We therefore concentrate on issues that have been less addressed, especially those particularly important to on-line analytical processing, such as the recently proposed CUBE BY operation.

3. Computing data cubes in parallel

A large amount of work has been done on parallel query processing in relational database systems [DeGr92, LOT94]. In the recent surge of research interest in OLAP and multidimensional databases [DEBu95], efficient implementation of the data cube operation has also attracted researchers' attention [AAD+96, DANR96, HaRU96]. However, relatively little work has been devoted to parallel processing of aggregates [ShNa94]. In this section, we discuss some interesting properties and issues related to data cube computation using parallel processors.

3.1 Data cube and cuboids

We adopt the notation used in [DANR96]. Let R be a relation with k+1 attributes X = {A1, A2, ..., Ak, V}. A cuboid on j attributes S = {Ai1, Ai2, ..., Aij} is defined as a group-by on attributes Ai1, Ai2, ..., Aij using an aggregate function F(.) applied to attribute V. This cuboid can be represented as a (k+1)-attribute relation by using the special value ALL for the remaining k-j attributes [GBLP96]. The CUBE on attribute set X is the union of the cuboids on all subsets of attributes of X. The cuboid on all attributes in X is called the base cuboid.
To compute the CUBE, we need to compute all the cuboids that together form the CUBE. Among those cuboids, the base cuboid has to be computed from the original relation. If the aggregate function is distributive (1), the other cuboids can be computed from other cuboids. The five aggregate functions supported by SQL are in fact all distributive, so that a cuboid on attribute set Si can be computed from any cuboid on attribute set Sj if Si is a subset of Sj. Figure 3.1 shows a lattice of cuboids for a relation with 4+1 attributes. The nodes in the lattice are the cuboids to be computed, and the edges indicate that the lower-level cuboids can be computed from the upper-level ones. The numbers in brackets are sample sizes of the cuboids in terms of number of tuples.

Figure 3.1: Cuboids for a relation with 4+1 attributes (R = 500,000 tuples), with sample sizes: (A,B,C,D) [49,998]; (A,B,C) [10,000]; (A,B,D) [5,000]; (A,C,D) [2,500]; (B,C,D) [1,000]; (A,B) [1,000]; (A,C) [500]; (A,D) [250]; (B,C) [200]; (B,D) [100]; (C,D) [50]; (A) [50]; (B) [20]; (C) [10]; (D) [5]; ( ) [1].

There are two basic approaches to computing a group-by: sorting and hashing [GRAE93]. Since hash-based approaches are usually suitable for parallel processing, we consider only hash-based methods in this study. The basic hash-based approach for computing a cuboid is rather straightforward. A hash table is built whose entries are the distinct values of the group-by attributes together with the aggregation value. For each source tuple, a hash function is applied to the group-by attributes. If the values have not been inserted in the hash table, a new entry is created. The aggregate function is applied and the result is used to update the entry.

(1) Aggregate function F() is distributive if there is a function G() such that F({X_ij}) = G({F({X_ij | i = 1, ..., I}) | j = 1, ..., J}).
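The hash-based cuboid computation, and the fact that a distributive function such as SUM lets a cuboid be computed from any parent cuboid rather than from the base relation, can be sketched as follows (the data are hypothetical; this is not the paper's implementation):

```python
from collections import defaultdict

def cuboid(rows, group_cols, value_col):
    """Hash-based group-by: one hash-table entry per distinct key,
    updated in place by the aggregate function (SUM here)."""
    table = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        table[key] += row[value_col]
    return table

base = [("a1", "b1", "c1", 10), ("a1", "b1", "c2", 20), ("a2", "b2", "c1", 5)]

# Base cuboid (A, B, C) computed from the original relation.
abc = cuboid(base, [0, 1, 2], 3)

# Because SUM is distributive, (A) can be computed either from the base
# relation or from the smaller (A, B, C) cuboid -- the results agree.
a_from_base = cuboid(base, [0], 3)
a_from_abc = cuboid([(*k, v) for k, v in abc.items()], [0], 3)
assert a_from_base == a_from_abc  # {('a1',): 30, ('a2',): 5}
```

Computing from the smallest available parent rather than the base relation is what makes the lattice edges of Figure 3.1 worth exploiting.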
The above description assumes that the hash table resides in memory. If the size of available memory is smaller than required, a cuboid can be computed in a number of iterations; in each iteration, a sub-cuboid whose hash table fits in memory is computed.

Computing a cube requires computing a number of interrelated cuboids. Agrawal et al. summarized the optimization techniques that can be used [AAD+96]:

Smallest-parent: computing a cuboid from the smallest previously computed cuboid. In Figure 3.1, edges in solid lines connect a cuboid to the smallest parent from which it can be computed.
Cache-results: caching the results of a cuboid from which other cuboids are computed, to reduce disk I/Os.
Amortize-scans: computing as many cuboids as possible at the same time, to amortize disk reads.
Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used.
Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used.

A number of algorithms that incorporate some of the above techniques have also been proposed and studied.

3.2 Parallel evaluation of data cubes

In an MPP environment, it is easy to have a much larger aggregate memory than on a uniprocessor machine, which makes it attractive to use MPP machines to compute data cubes. However, for most OLAP applications, the data to be aggregated are usually too large even for MPP machines. In particular, those processors are not dedicated database machines, and the memory available to database processes will still be limited. Careful memory management for multiple-cuboid evaluation is therefore still a key issue in parallel evaluation of data cubes. Furthermore, when more than one cuboid is to be computed using more than one processor, cuboid allocation is another issue.

Data partitioning strategy. The standard way to deal with limited memory in hash-based algorithms is to partition the data.
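Iterating over hash partitions so that only one partition's hash table is memory-resident at a time can be sketched as follows (the data and the partition count are illustrative assumptions):

```python
from collections import defaultdict

def partitioned_cuboid(rows, group_cols, value_col, n_partitions):
    """Compute a cuboid one hash partition at a time, so that only one
    partition's hash table needs to be memory-resident per iteration."""
    result = {}
    for p in range(n_partitions):
        table = defaultdict(int)  # hash table for this iteration only
        for row in rows:
            key = tuple(row[c] for c in group_cols)
            if hash(key) % n_partitions == p:  # tuple belongs to partition p
                table[key] += row[value_col]
        result.update(table)      # partitions are disjoint sub-cuboids
    return result

rows = [("a1", "b1", 10), ("a1", "b2", 20), ("a2", "b1", 5), ("a1", "b1", 1)]
assert partitioned_cuboid(rows, [0, 1], 2, n_partitions=3) == {
    ("a1", "b1"): 11, ("a1", "b2"): 20, ("a2", "b1"): 5,
}
```

Each key falls in exactly one partition, so the union of the per-iteration sub-cuboids equals the full cuboid regardless of how many partitions are used; the price is one scan of the input per iteration.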
When multiple interrelated cuboids are to be evaluated, the data can be partitioned based on either attributes or cuboids. With attribute-based partitioning, when data is partitioned on some attribute, say A, all cuboids that contain A are partitioned on A and computed at the same time; in other words, a number of partial cuboids are computed concurrently in an iteration. With cuboid-based partitioning, data are partitioned only if a cuboid is too large, and more than one cuboid is evaluated in the same iteration only when the available memory can accommodate more than one cuboid.

Cuboid allocation strategy. When more than one processor is available to evaluate a partition that contains more than one cuboid, or partitions from more than one cuboid, the cuboid allocation strategy distributes the computation among the available nodes. There are two possible strategies. The concurrent allocation strategy allows all cuboids or cuboid partitions to be evaluated concurrently: a minimum number of nodes is allocated to evaluate each cuboid (or cuboid partition), based on the size of the cuboid and the size of available memory. The sequential allocation strategy evaluates the cuboids sequentially and evenly distributes the computation of each cuboid (or cuboid partition) among all available nodes. The advantage of concurrent allocation is that each node maintains fewer hash tables, so fewer hash values are computed at each node. With sequential allocation, a node may be required to evaluate all cuboids in an iteration, so more hash tables and more hash operations are required. However, some optimization techniques, such as caching results, are easier to incorporate into the processing algorithm.

3.3 A brute force evaluation algorithm

To gain some hands-on experience with parallel evaluation of data cubes, a brute force algorithm was implemented:

One node is allocated as the execution coordinator.
It reads the data, both the original relation and the computed cuboids, and broadcasts the tuples over the network.
A greedy algorithm is used to determine the cuboids, or portions of cuboids, to be evaluated during each iteration. A cuboid-based partitioning strategy is used. That is, given the aggregate memory of the participating processors, the algorithm finds the next set of cuboids based on estimates of the sizes of the cuboids. All cuboids are sorted on the size of their parents to form a list, and each iteration chooses the next set of cuboids that can fit in memory. Since all cuboids are computed from the same set of source tuples during an iteration, a cuboid may not be computed from its direct parent. The possible increase in CPU cost is estimated to determine whether it is beneficial to include a cuboid in the iteration.

During each iteration, the cuboids to be evaluated are allocated to the available processors. Two cuboid allocation strategies were implemented. The minimum-node strategy allocates the minimum number of nodes to compute each cuboid, based on the sizes of the cuboids and the memory. The even-distribution strategy distributes a cuboid over all available nodes for parallel execution.

The execution coordinator broadcasts the tuples over the network. Participating nodes build hash tables for the cuboid partitions to be evaluated locally. On receiving an input tuple, a node applies hash functions to determine whether the tuple should be evaluated; if so, the aggregate function is applied and the value is used to update the hash table. There is no data transfer among the participating nodes. When a node completes its computation, a message is sent to the coordinator, and the completed cuboid is then available for evaluating other cuboids.

We call the algorithm a brute force algorithm, as it is a straightforward implementation of parallel evaluation of data cubes with little optimization incorporated.
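The two implemented allocation strategies can be contrasted with a small sketch. The node counts, per-node memory, and cuboid sizes below are made-up (sizes are in tuples); the minimum-node strategy packs each cuboid onto just enough nodes, while the even-distribution strategy spreads every cuboid over all nodes:

```python
import math

def minimum_node_allocation(cuboid_sizes, n_nodes, mem_per_node):
    """Give each cuboid the minimum number of nodes whose combined
    memory holds its hash table; cuboids then proceed concurrently."""
    plan, next_node = {}, 0
    for name, size in cuboid_sizes.items():
        need = math.ceil(size / mem_per_node)
        plan[name] = list(range(next_node, next_node + need))
        next_node += need
    assert next_node <= n_nodes, "not enough nodes for this iteration"
    return plan

def even_distribution_allocation(cuboid_sizes, n_nodes):
    """Spread every cuboid over all available nodes; each node then
    holds one hash-table partition per cuboid in the iteration."""
    return {name: list(range(n_nodes)) for name in cuboid_sizes}

sizes = {"AB": 1000, "AC": 500, "BC": 200}
print(minimum_node_allocation(sizes, n_nodes=4, mem_per_node=500))
print(even_distribution_allocation(sizes, n_nodes=4))
```

Under minimum-node allocation each node handles at most one or two cuboids, so each input tuple costs few hash operations per node; under even distribution every node hashes every tuple once per cuboid in the iteration.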
This is because the main objective of this first implementation is not to pursue high performance but to become familiar with the facilities provided by the AP3000 and the properties of data cube computation.

4. A preliminary performance study

The brute force algorithm described in the previous section was implemented in C. In this section, we report the results of some initial experiments. The experiments were conducted on a Fujitsu AP3000 with 32 nodes. The relation used for computing the data cube contains 500,000 tuples of 5 attributes, A, B, C, D and V, where A, B, C, and D are group-by attributes. The aggregate function SUM is applied to attribute V. The cardinalities of the four attributes are 50, 20, 10, and 5 respectively, and the attribute values are uniformly distributed within their respective ranges. The sizes of the cuboids are those shown in Figure 3.1. To simulate memory constraints, the test program uses only a buffer area equivalent to a certain number of tuples for the cube computation. For each experiment, the CPU time, the message time (i.e., time used to receive broadcast data), and the output time (i.e., time for writing result tuples to disk) were recorded. Because the memory allocated is smaller than the total size required, the computation requires a number of iterations. For each iteration, the longest time among all participating nodes is taken as the processing time of the iteration. The total processing time is the sum of the processing times of all iterations. That is, the processing time

T_o = sum from i = 1 to k of max(T_oi),  where o is in {CPU, message, output},

k is the number of iterations, and max(T_oi) is the maximum of the processing times among all the participating nodes used to complete iteration i.

We report here the results of three sets of experiments that investigated the effects of the number of nodes, scheduling strategies, and memory size on the cube processing time.

4.1 Experiment One

The first set of experiments studies the processing time as the number of nodes used for computing the cube varies. The results are shown in Figure 4.1.
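The per-component total processing time, summing over iterations the time of the slowest participating node in each, can be computed directly (the timings below are made-up numbers, not experimental data):

```python
def total_time(times_per_iteration):
    """T_o = sum over iterations i of max over nodes of T_oi:
    each iteration lasts as long as its slowest participating node."""
    return sum(max(node_times) for node_times in times_per_iteration)

# Two iterations, three nodes each (seconds for one cost component, e.g. CPU).
cpu = [[4.0, 5.0, 3.5],   # iteration 1: slowest node takes 5.0 s
       [2.0, 2.5, 2.2]]   # iteration 2: slowest node takes 2.5 s
assert total_time(cpu) == 7.5
```

The max inside the sum is why reducing the number of iterations, rather than merely adding nodes, dominates the results reported below.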
Figure 4.1: Time for computing the sample cube (CPU, Message, and Output time in seconds vs. number of nodes).

From Figure 4.1, we can see that the processing time drops dramatically when we increase the number of participating nodes from 1 to 4. With 5 or more nodes, the speed-up is not as significant. The major reason is that the processing time depends largely on the number of iterations required to compute all the cuboids. Table 4.1 lists the iterations and the cuboids or cuboid partitions processed when different numbers of nodes were used. With 5 or more nodes, the hash table of the largest cuboid, the base cuboid, can be held in memory, so the computation can be completed in two iterations. The number of input tuples processed is the same; the only benefit of more nodes is that the portion of the cuboids to be computed by each node becomes smaller, and this saving is marginal.

Table 4.1: Iterations and cuboids computed
Nodes  Iterations  Cuboids computed
1      7           {ABCD_0} {ABCD_1} {ABCD_2} {ABCD_3} {ABCD_4} {ABC, BCD} {ABD, ...}
2      4           {ABCD_0} {ABCD_1} {ABCD_2} {ABC, ...}
3      3           {ABCD_0} {ABCD_1} {ABC, ...}
4      3           {ABCD_0} {ABCD_1} {ABC, ...}
5      2           {ABCD} {ABC, ...}
6      2           {ABCD} {ABC, ...}
7      2           {ABCD} {ABC, ...}
8      2           {ABCD} {ABC, ...}

4.2 Experiment Two

In the first experiment, a cuboid was evaluated using as many nodes as possible; that is, the even-distribution allocation strategy was used. In the second experiment, when more than one cuboid was evaluated in the same iteration, each cuboid was assigned to the minimum number of nodes based on the size of the cuboid and the size of the aggregate memory of those nodes. The results are shown in Figure 4.2.

Figure 4.2: Time for computing the sample cube (CPU, Message, and Output time in seconds vs. number of nodes).

Comparing Figure 4.2 with Figure 4.1, we can see that the CPU time using the second strategy is about 70-75% of the first one when the number of nodes increases beyond four. This can be explained as follows. If a node is responsible for computing n cuboids, n hash operations are required for each input tuple.
With 5-8 nodes, the second iteration evaluates 15 cuboids and no node needs to evaluate more than 2 cuboids; with the first strategy, each node may be required to evaluate more than 10 cuboids.

4.3 Memory size and processing time

The size of available memory determines the number of iterations, which in turn has a dramatic effect on the total processing time. In the third experiment, we fixed the number of nodes participating in the cube computation at two and varied the size of memory available at each node. The results are shown in Figure 4.3. The curves in Figure 4.3 show the same trend as Figures 4.1 and 4.2. However, the speed-up as the memory size increases is not as large as in the previous cases. For example, the CPU time for 2 nodes with memory of 44,000 tuples each is about
Figure 4.3: Processing time vs. memory size (aggregate memory in units of 11,000 tuples; CPU, Message, and Output time in seconds).

twice as much as when 8 nodes with memory of 11,000 tuples each are used. In other words, in addition to the effect of a larger aggregate memory, parallel processing does bring further benefit to cube computation.

5. Discussion

To study the feasibility and benefit of using massively parallel processors to compute data cubes, a brute force algorithm was implemented. A preliminary study indicates that, even without much optimization, massively parallel processors can indeed speed up the computation of data cubes, partly because of the large aggregate memory. To develop high-performance parallel data cube processing algorithms, a number of issues need to be carefully considered, among them data partitioning methods and cuboid allocation strategies. Designing and implementing an algorithm that incorporates various optimization techniques is our immediate task.

In our experiments, data were transferred using a broadcast method, and an execution coordinator is responsible for reading and transmitting the data. Comparing the different data transfer methods provided by the system is another of our tasks.

References

[AAD+96] S. Agrawal et al., On the computation of multidimensional aggregates. In Proc. of the 22nd VLDB Conf., Mumbai, India, 1996.
[BCL93] K.P. Brown, M.J. Carey, and M. Livny, Managing memory to meet multiclass workload response time goals. In Proc. of the 19th VLDB Conf., 1993.
[DANR96] P.M. Deshpande et al., Computation of multidimensional aggregates. Technical Report 1314, Computer Sciences Department, University of Wisconsin-Madison, 1996.
[DeGr92] D.J. DeWitt and J. Gray, Parallel database systems: the future of high performance database systems. CACM, June 1992.
[GCB+97] J. Gray et al., Data Cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, Vol. 1, No. 1, 1997.
[GRAE93] G. Graefe, Query evaluation techniques for large databases. ACM Computing Surveys, Vol. 25, No. 2, 1993.
[LOT94] H. Lu, B.C. Ooi, and K.-L. Tan, Query Processing in Parallel Relational Database Systems. IEEE Computer Society Press, 1994.
[ShNa94] A. Shatdal and J.F. Naughton, Processing aggregates in parallel database systems. Technical Report 1233, Computer Sciences Department, University of Wisconsin-Madison, 1994.
More informationCS570 Introduction to Data Mining
CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay Partial slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns,
More informationAdvanced Databases: Parallel Databases A.Poulovassilis
1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger
More informationDifferent Cube Computation Approaches: Survey Paper
Different Cube Computation Approaches: Survey Paper Dhanshri S. Lad #, Rasika P. Saste * # M.Tech. Student, * M.Tech. Student Department of CSE, Rajarambapu Institute of Technology, Islampur(Sangli), MS,
More informationEvaluation of relational operations
Evaluation of relational operations Iztok Savnik, FAMNIT Slides & Textbook Textbook: Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, McGraw-Hill, 3 rd ed., 2007. Slides: From Cow Book
More informationEvaluation of Relational Operations. Relational Operations
Evaluation of Relational Operations Chapter 14, Part A (Joins) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Relational Operations v We will consider how to implement: Selection ( )
More informationAn Overview of various methodologies used in Data set Preparation for Data mining Analysis
An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of
More informationOptimization of Queries in Distributed Database Management System
Optimization of Queries in Distributed Database Management System Bhagvant Institute of Technology, Muzaffarnagar Abstract The query optimizer is widely considered to be the most important component of
More informationDeccansoft Software Services Microsoft Silver Learning Partner. SSAS Syllabus
Overview: Analysis Services enables you to analyze large quantities of data. With it, you can design, create, and manage multidimensional structures that contain detail and aggregated data from multiple
More informationBuilding Large ROLAP Data Cubes in Parallel
Building Large ROLAP Data Cubes in Parallel Ying Chen Dalhousie University Halifax, Canada ychen@cs.dal.ca Frank Dehne Carleton University Ottawa, Canada www.dehne.net A. Rau-Chaplin Dalhousie University
More informationFrom SQL-query to result Have a look under the hood
From SQL-query to result Have a look under the hood Classical view on RA: sets Theory of relational databases: table is a set Practice (SQL): a relation is a bag of tuples R π B (R) π B (R) A B 1 1 2
More informationImproved Data Partitioning For Building Large ROLAP Data Cubes in Parallel
Improved Data Partitioning For Building Large ROLAP Data Cubes in Parallel Ying Chen Dalhousie University Halifax, Canada ychen@cs.dal.ca Frank Dehne Carleton University Ottawa, Canada www.dehne.net frank@dehne.net
More informationQuery Optimization in Distributed Databases. Dilşat ABDULLAH
Query Optimization in Distributed Databases Dilşat ABDULLAH 1302108 Department of Computer Engineering Middle East Technical University December 2003 ABSTRACT Query optimization refers to the process of
More informationDATA MINING II - 1DL460
DATA MINING II - 1DL460 Spring 2013 " An second class in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationTHE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER
THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose
More informationFast Discovery of Sequential Patterns Using Materialized Data Mining Views
Fast Discovery of Sequential Patterns Using Materialized Data Mining Views Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo
More informationCommunication and Memory Optimal Parallel Data Cube Construction
Communication and Memory Optimal Parallel Data Cube Construction Ruoming Jin Ge Yang Karthik Vaidyanathan Gagan Agrawal Department of Computer and Information Sciences Ohio State University, Columbus OH
More informationRELATIONAL OPERATORS #1
RELATIONAL OPERATORS #1 CS 564- Spring 2018 ACKs: Jeff Naughton, Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? Algorithms for relational operators: select project 2 ARCHITECTURE OF A DBMS query
More informationData Cube Technology
Data Cube Technology Erwin M. Bakker & Stefan Manegold https://homepages.cwi.nl/~manegold/dbdm/ http://liacs.leidenuniv.nl/~bakkerem2/dbdm/ s.manegold@liacs.leidenuniv.nl e.m.bakker@liacs.leidenuniv.nl
More informationANU MLSS 2010: Data Mining. Part 2: Association rule mining
ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements
More informationDATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843
DATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843 WHAT IS A DATA CUBE? The Data Cube or Cube operator produces N-dimensional answers
More informationFrequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar
Frequent Pattern Mining Based on: Introduction to Data Mining by Tan, Steinbach, Kumar Item sets A New Type of Data Some notation: All possible items: Database: T is a bag of transactions Transaction transaction
More informationInternational Journal of Computer Sciences and Engineering. Research Paper Volume-6, Issue-1 E-ISSN:
International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-6, Issue-1 E-ISSN: 2347-2693 Precomputing Shell Fragments for OLAP using Inverted Index Data Structure D. Datta
More informationMap-Reduce for Cube Computation
299 Map-Reduce for Cube Computation Prof. Pramod Patil 1, Prini Kotian 2, Aishwarya Gaonkar 3, Sachin Wani 4, Pramod Gaikwad 5 Department of Computer Science, Dr.D.Y.Patil Institute of Engineering and
More informationEvaluation of Relational Operations
Evaluation of Relational Operations Yanlei Diao UMass Amherst March 13 and 15, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection
More informationWeb page recommendation using a stochastic process model
Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,
More informationA Graph-Based Approach for Mining Closed Large Itemsets
A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and
More informationParser: SQL parse tree
Jinze Liu Parser: SQL parse tree Good old lex & yacc Detect and reject syntax errors Validator: parse tree logical plan Detect and reject semantic errors Nonexistent tables/views/columns? Insufficient
More informationHorizontal Aggregations in SQL to Prepare Data Sets Using PIVOT Operator
Horizontal Aggregations in SQL to Prepare Data Sets Using PIVOT Operator R.Saravanan 1, J.Sivapriya 2, M.Shahidha 3 1 Assisstant Professor, Department of IT,SMVEC, Puducherry, India 2,3 UG student, Department
More informationAlgorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)
Chapter 19 Algorithms for Query Processing and Optimization 0. Introduction to Query Processing (1) Query optimization: The process of choosing a suitable execution strategy for processing a query. Two
More informationOLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube
OLAP2 outline Multi Dimensional Data Model Need for Multi Dimensional Analysis OLAP Operators Data Cube Demonstration Using SQL Multi Dimensional Data Model Multi dimensional analysis is a popular approach
More informationDW Performance Optimization (II)
DW Performance Optimization (II) Overview Data Cube in ROLAP and MOLAP ROLAP Technique(s) Efficient Data Cube Computation MOLAP Technique(s) Prefix Sum Array Multiway Augmented Tree Aalborg University
More informationData Cube Technology. Chapter 5: Data Cube Technology. Data Cube: A Lattice of Cuboids. Data Cube: A Lattice of Cuboids
Chapter 5: Data Cube Technology Data Cube Technology Data Cube Computation: Basic Concepts Data Cube Computation Methods Erwin M. Bakker & Stefan Manegold https://homepages.cwi.nl/~manegold/dbdm/ http://liacs.leidenuniv.nl/~bakkerem2/dbdm/
More informationMaterialized Data Mining Views *
Materialized Data Mining Views * Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland tel. +48 61
More informationOLAP Introduction and Overview
1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata
More informationChapter 18: Parallel Databases Chapter 19: Distributed Databases ETC.
Chapter 18: Parallel Databases Chapter 19: Distributed Databases ETC. Introduction Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply
More information2.3 Algorithms Using Map-Reduce
28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure
More informationEvaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi
Evaluation of Relational Operations: Other Techniques Chapter 14 Sayyed Nezhadi Schema for Examples Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer,
More informationOn Multiple Query Optimization in Data Mining
On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl
More informationModel for Load Balancing on Processors in Parallel Mining of Frequent Itemsets
American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationCompSci 516 Data Intensive Computing Systems
CompSci 516 Data Intensive Computing Systems Lecture 20 Data Mining and Mining Association Rules Instructor: Sudeepa Roy CompSci 516: Data Intensive Computing Systems 1 Reading Material Optional Reading:
More informationItem Set Extraction of Mining Association Rule
Item Set Extraction of Mining Association Rule Shabana Yasmeen, Prof. P.Pradeep Kumar, A.Ranjith Kumar Department CSE, Vivekananda Institute of Technology and Science, Karimnagar, A.P, India Abstract:
More informationData Mining Part 3. Associations Rules
Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets
More informationACM-ICPC Indonesia National Contest Problem A. The Best Team. Time Limit: 2s
Problem A The Best Team Time Limit: 2s ACM-ICPC 2010 is drawing near and your university want to select three out of N students to form the best team. The university however, has a limited budget, so they
More informationData Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..
.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Mining Association Rules Definitions Market Baskets. Consider a set I = {i 1,...,i m }. We call the elements of I, items.
More informationEvaluation of Relational Operations
Evaluation of Relational Operations Chapter 14 Comp 521 Files and Databases Fall 2010 1 Relational Operations We will consider in more detail how to implement: Selection ( ) Selects a subset of rows from
More informationNovel Materialized View Selection in a Multidimensional Database
Graphic Era University From the SelectedWorks of vijay singh Winter February 10, 2009 Novel Materialized View Selection in a Multidimensional Database vijay singh Available at: https://works.bepress.com/vijaysingh/5/
More informationParallelizing Frequent Itemset Mining with FP-Trees
Parallelizing Frequent Itemset Mining with FP-Trees Peiyi Tang Markus P. Turkia Department of Computer Science Department of Computer Science University of Arkansas at Little Rock University of Arkansas
More informationCost Models for Query Processing Strategies in the Active Data Repository
Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272
More informationImproving the Performance of OLAP Queries Using Families of Statistics Trees
Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University
More informationData Communication and Parallel Computing on Twisted Hypercubes
Data Communication and Parallel Computing on Twisted Hypercubes E. Abuelrub, Department of Computer Science, Zarqa Private University, Jordan Abstract- Massively parallel distributed-memory architectures
More informationThe cgmcube project: Optimizing parallel data cube generation for ROLAP
Distrib Parallel Databases (2006) 19: 29 62 DOI 10.1007/s10619-006-6575-6 The cgmcube project: Optimizing parallel data cube generation for ROLAP Frank Dehne Todd Eavis Andrew Rau-Chaplin C Science + Business
More informationMitigating Data Skew Using Map Reduce Application
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationIncognito: Efficient Full Domain K Anonymity
Incognito: Efficient Full Domain K Anonymity Kristen LeFevre David J. DeWitt Raghu Ramakrishnan University of Wisconsin Madison 1210 West Dayton St. Madison, WI 53706 Talk Prepared By Parul Halwe(05305002)
More informationHANA Performance. Efficient Speed and Scale-out for Real-time BI
HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business
More informationPerformance and Scalability: Apriori Implementa6on
Performance and Scalability: Apriori Implementa6on Apriori R. Agrawal and R. Srikant. Fast algorithms for mining associa6on rules. VLDB, 487 499, 1994 Reducing Number of Comparisons Candidate coun6ng:
More informationColumn-Oriented Database Systems. Liliya Rudko University of Helsinki
Column-Oriented Database Systems Liliya Rudko University of Helsinki 2 Contents 1. Introduction 2. Storage engines 2.1 Evolutionary Column-Oriented Storage (ECOS) 2.2 HYRISE 3. Database management systems
More informationDatabase design View Access patterns Need for separate data warehouse:- A multidimensional data model:-
UNIT III: Data Warehouse and OLAP Technology: An Overview : What Is a Data Warehouse? A Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to
More informationA Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective
A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective B.Manivannan Research Scholar, Dept. Computer Science, Dravidian University, Kuppam, Andhra Pradesh, India
More informationHorizontal Aggregations for Mining Relational Databases
Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh,
More informationDistributed DBMS. Concepts. Concepts. Distributed DBMS. Concepts. Concepts 9/8/2014
Distributed DBMS Advantages and disadvantages of distributed databases. Functions of DDBMS. Distributed database design. Distributed Database A logically interrelated collection of shared data (and a description
More informationCAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1
CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query Sub-System Queries Select * From Blah B Where B.blah = blah Query Parser Query Optimizer Plan Generator Plan Cost
More information