Parallel Processing of GroupBy-Before-Join Queries in Cluster Architecture
David Taniar
School of Business Systems, Monash University, PO Box 63B, Clayton, Vic 3800, Australia

J. Wenny Rahayu
Department of Comp. Sc. and Comp. Eng., La Trobe University, Bundoora, Vic 3823, Australia

Abstract

SQL queries in the real world are replete with group-by and join operations. This type of query is often known as a GroupBy-Join query. In some GroupBy-Join queries, it is desirable to perform the group-by before the join in order to achieve better performance. This subset of GroupBy-Join queries is called GroupBy-Before-Join queries. In this paper, we present a study on parallelization of GroupBy-Before-Join queries, particularly by exploiting cluster architectures. From our study, we have learned that in parallel query optimization, processing the group-by as early as possible is not always desirable. On many occasions, performing data distribution before the group-by offers performance advantages. In this study, we also describe our cluster-based scheme.

1 Introduction

Queries involving aggregates are very common in database processing, especially in On-Line Analytical Processing (OLAP) and Data Warehousing [2,4]. These queries are often used as a tool for strategic decision making. Queries containing aggregate functions summarize a large set of records based on the designated grouping. The input set of records may be derived from multiple tables using a join operation. In this paper, we concentrate on queries of this kind, that is, queries containing a group-by clause/aggregate functions and join operations. We refer to such a query as a GroupBy-Join query. As the data repositories supporting integrated decision making grow, aggregate queries must be executed efficiently.
Large historical tables need to be joined and aggregated with each other; consequently, effective optimization of aggregate functions has the potential to yield huge performance gains. In this paper, we focus on the use of parallel processing techniques. The motivation for efficient parallel query processing stems not only from the need for performance improvement, but also from the fact that parallel architectures are now available in many forms, such as systems consisting of a small number of powerful processors (i.e. SMP machines), clusters of workstations (i.e. loosely coupled shared-nothing architectures), massively parallel processors (i.e. MPP), and clusters of SMP machines (i.e. hybrid architectures) [1]. Parallelism of GroupBy queries involving aggregate functions is not a new subject in the parallel database community. Many researchers have produced techniques for parallelizing such queries. However, most of them target a general architecture, mostly shared-nothing architectures. As cluster architectures are now becoming the de facto platform for parallel computing [9,12], there is a need to shift attention to this platform. In this paper, we propose parallelization schemes based on a cluster environment. The proposed technique takes into account that processors within a cluster node share main memory, but communicate through a slower network with other cluster nodes. The work presented in this paper is part of a larger project on parallel aggregate query processing. Parallelization of GroupBy-Before-Join queries is the third and final stage of the project. The first stage of this project dealt with parallelization of GroupBy queries on single tables (i.e. no join operation involved). The results were reported at the PART 2000 conference [10].
The second stage focused on parallelization of GroupBy-Join queries where the GroupBy attributes differ from the Join attributes, with the consequence that the join operation must be carried out first, followed by the group-by operation. We presented the outcome of the second stage at the HPCAsia 2000 conference [11]. The third and final stage, which is the main focus of this paper, concentrates on GroupBy-Join queries (as in stage two), but where the join attribute is the same as the group-by attribute, so that the group-by operation can be performed before the join for optimization purposes (i.e. GroupBy-Before-Join queries). More details on the three types of group-by queries, our previous work, and the focus of this paper are given in the next section.
The rest of this paper is organized as follows. Section 2 explains the background of this work. Section 3 describes general parallel algorithms for processing GroupBy-Before-Join queries. Section 4 defines the problem to be solved. Section 5 presents our proposed parallel scheme based on a cluster architecture. Section 6 presents performance evaluation results. Finally, Section 7 gives the conclusions.

2 Background

As background to the work presented in this paper, we need to explain two aspects in particular: first, an overview of GroupBy queries; and second, our previous work on parallel GroupBy queries and the focus of this paper.

2.1 GroupBy Queries

GroupBy queries in SQL can be divided into two broad categories: group-by on one table (we call these purely GroupBy queries), and a mixture of group-by and join (we call these GroupBy-Join queries). In either category, aggregate functions are normally involved in the query. To illustrate these two types of GroupBy queries, we use the following tables from a Suppliers-Parts-Projects database:

SUPPLIER (S#, Sname, Status, City)
PARTS (P#, Pname, Weight, Price, City)
PROJECT (J#, Jname, City, Budget)
SHIPMENT (S#, P#, J#, Qty)

An example of a GroupBy query on a single table is to "retrieve the number of suppliers for each city". The table used in this query is Supplier, and the supplier records are grouped according to their city. For each group, the number of records is counted; these counts then represent the number of suppliers in each city. The SQL for this query is given below.

QUERY 1:
Select City, COUNT(*)
From SUPPLIER
Group By City

The next category is GroupBy-Join queries. For simplicity of description and without loss of generality, we consider queries that involve only one aggregate function and a single join. The following two queries illustrate GroupBy-Join queries. Query 2 is to "group the part shipments by their city locations".
The query written in SQL is as follows.

QUERY 2:
Select PARTS.City, AVG(Qty)
From PARTS, SHIPMENT
Where PARTS.P# = SHIPMENT.P#
Group By PARTS.City

Another example is to "retrieve project numbers, names, and total quantity of shipments for each project".

QUERY 3:
Select PROJECT.J#, PROJECT.Jname, SUM(Qty)
From PROJECT, SHIPMENT
Where PROJECT.J# = SHIPMENT.J#
Group By PROJECT.J#, PROJECT.Jname

The main difference between Query 2 and Query 3 lies in the join attributes and group-by attributes. In Query 3, the join attribute is also one of the group-by attributes. This is not the case with Query 2, where the join attribute is entirely different from the group-by attribute. This difference is a critical factor in processing GroupBy-Join queries, as a decision must be made about which operation to perform first: the group-by or the join. When the join attribute and the group-by attribute are different, as in Query 2, there is no choice but to invoke the join operation first and then the group-by operation. However, when the join attribute and the group-by attribute are the same, as in Query 3 (e.g. attribute J# of both the Project and Shipment tables), the group-by operation should be carried out first, followed by the join. Hence, we call the latter kind of query (e.g. Query 3) a "GroupBy-Before-Join" query. In Query 3, all Shipment records are grouped based on the J# attribute. After this grouping, the result is joined with table Project. As is widely known, join is a more expensive operation than group-by, and it is beneficial to reduce the join relation sizes by applying the group-by first. Generally, the group-by operation should precede the join whenever possible. Early processing of the group-by before the join reduces the overall execution time, in line with the general query optimization rule that unary operations should be executed before binary operations whenever possible.
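To make the reduction concrete, here is a minimal Python sketch of Query 3's group-by-before-join strategy. The sample SHIPMENT and PROJECT rows (and their values) are invented for illustration; they are not data from the paper.

```python
from collections import defaultdict

# Hypothetical sample rows: SHIPMENT(S#, P#, J#, Qty) and PROJECT(J#, Jname).
shipment = [("s1", "p1", "j1", 300), ("s2", "p1", "j1", 200), ("s3", "p2", "j2", 400)]
project = [("j1", "CRM rollout"), ("j2", "Data warehouse")]

# Group-by before join: aggregate SHIPMENT on J# first, shrinking the join input.
sums = defaultdict(int)
for _, _, j, qty in shipment:
    sums[j] += qty

# Then join the (much smaller) aggregate result with PROJECT on J#.
result = [(j, jname, sums[j]) for j, jname in project if j in sums]
print(result)  # [('j1', 'CRM rollout', 500), ('j2', 'Data warehouse', 400)]
```

The join now touches one row per project rather than one row per shipment, which is exactly why the optimization rule favors the early group-by.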
The semantic issues concerning aggregate functions and joins, and the conditions under which a group-by can be performed before a join, can be found in the literature [3,5,8,13]. In this paper, we focus on cases where the group-by operation is performed before the join operation. Therefore, we will use Query 3 as a running example throughout this paper.

2.2 Our Previous Work and Focus of This Paper

Our previous work on parallelization of GroupBy queries mainly focused on GroupBy queries on a single table (e.g. Query 1) [10], and GroupBy-Join queries where the join attribute is different from the group-by attribute (e.g. Query 2) [11]. Parallelization of GroupBy queries on single tables (i.e. Query 1) exists in several forms. In Taniar and Rahayu [10], we presented three parallel algorithms. The first two were general algorithms, and the third was a
specialized algorithm for cluster architectures. The main issue in parallelizing single-table GroupBy queries was whether to perform distribution after local aggregation (the Two Phase method) or to perform distribution without local aggregation (the Redistribution method). With the Two Phase method, communication costs may be reduced by the group-by selectivity factor. However, if the reduction is minimal, local aggregation may not offer much benefit. With cluster architectures, these two approaches can be combined: group-by processing within a cluster node (a cluster node consists of several processors sharing the same main memory) can be done in the manner of the Redistribution method, and global aggregate processing among cluster nodes through an interconnecting network can be done in the manner of the Two Phase method. This method proves efficient on cluster platforms, because redistribution within each node is done through shared memory, whereas communication among nodes, as in the Two Phase method, occurs only after local aggregate filtering has been carried out by each cluster node. Parallelization of GroupBy-Join queries where the join attribute is different from the group-by attribute (i.e. Query 2) also exists in several forms. In Taniar, Jiang, Liu, and Leung [11], we presented three parallelization techniques. The main issue in parallelizing such a query was that a decision had to be made whether to use the join attribute or the group-by attribute as the partitioning attribute for data distribution. If we choose the join attribute as the partitioning attribute (the Join Partition method), after data partitioning each processor performs a local join and local aggregation. The results from each processor then need to be redistributed according to the group-by attribute in order to perform global aggregation on the temporary join result.
If we choose the group-by attribute as the partitioning attribute (the Aggregate Partition method), only the table associated with the group-by attribute can be partitioned, whereas the other table must be replicated. In a cluster architecture, a hybrid approach was adopted, where within each node parallelization is carried out as in the Aggregate Partition method (this can be efficient because data replication is done within shared memory), and among cluster nodes parallelization is performed as in the Join Partition method. In this paper, we focus on GroupBy-Before-Join queries (i.e. Query 3). The main differences between this work and our previous work can be outlined as follows. Unlike Query 2, which has two candidate partitioning attributes (the join attribute and the group-by attribute), Query 3 has only one partitioning attribute, since the join attribute is the same as the group-by attribute. Therefore, the complexity of this work lies not in choosing the correct partitioning attribute, but in the fact that the group-by clause has to be carried out before the join, and this affects the parallelization techniques. Unlike Query 1, which does not involve a join, Query 3 involves joining tables. Therefore, the complexity of this work is due to the join operation involved in the query, which affects the decision on when to perform data distribution for calculating the aggregates, since the join operation is also involved. Because the foundation for processing Query 3 differs from that of Query 1 and Query 2, parallelization of Query 3 needs special attention.

3 General Parallel Algorithms for GroupBy-Before-Join Queries

In the following sections, we describe two general parallel algorithms for GroupBy-Before-Join query processing, namely the Early Distribution scheme and the Early GroupBy scheme.

3.1 Early Distribution Scheme

As the name suggests, the Early Distribution scheme performs data distribution before anything else (i.e. before the group-by and join operations).
This scheme is influenced by the practice of parallel join algorithms, where raw records are first partitioned/distributed and allocated to each processor, and then each processor performs its operation [6]. The scheme is motivated by fast message-passing multiprocessor systems. The Early Distribution scheme is divided into two phases: the distribution phase and the group-by-join phase. Using Query 3, the two tables to be joined are Project and Shipment, joined on attribute J#, and the group-by is based on table Shipment. For simplicity of notation, the table that forms the basis for the group-by is called table R (e.g. table Shipment), and the other table is called table S (e.g. table Project). From now on, we will refer to them as tables R and S. In the distribution phase, raw records from both tables (i.e. tables R and S) are distributed based on the join/group-by attribute according to a data partitioning function. An example of a partitioning function is to allocate to each processor project numbers within certain ranges. For example, project numbers (i.e. attribute J#) p1 to p99 go to processor 1, project numbers p100-p199 to processor 2, project numbers p200-p299 to processor 3, and so on. We emphasize that the two tables R and S are both distributed. As a result, for example, processor 1 will have records from the Shipment table with J# between p1 and p99, inclusive, as well as records from the Project table with J# p1-p99. This distribution scheme is commonly used in parallel join, where raw records are partitioned into buckets based on an adopted partitioning scheme such as the above range partitioning [6]. Once the distribution is completed, each processor will have records within certain groups identified by the group-by/join attribute. Subsequently, the second phase (the group-by-join phase) aggregates the records of table R
based on the group-by attribute and calculates the aggregate values for each group. Aggregation in each processor can be carried out through a sort or a hash function. After table R is grouped in each processor, it is joined with table S in the same processor. After joining, each processor holds a local query result. The final query result is the union of all sub-results produced by the processors. Figure 1 illustrates the Early Distribution scheme. Notice that partitioning is applied to the raw records of both tables R and S, and that the aggregation of table R and the join with table S in each processor are carried out after the distribution phase.

Figure 1. Early Distribution Scheme

There are several things that need to be highlighted in this scheme. First, the grouping is still performed before the join (although after data distribution). This conforms to the optimization rule for this kind of query that the group-by clause must be carried out before the join in order to achieve more efficient query processing. Second, the distribution of records from both tables can be expensive, as all raw records are distributed and no prior filtering is applied to either table. It becomes more desirable to carry out the grouping (and aggregate function) even before the distribution, in order to reduce the distribution cost, especially for table R. This leads to the next scheme, called the Early GroupBy scheme, which reduces communication costs during the distribution phase.

3.2 Early GroupBy Scheme

As the name suggests, the Early GroupBy scheme performs the group-by operation first (before data distribution). This scheme is divided into three phases: (i) local grouping phase, (ii) distribution phase, and (iii) final grouping and join phase.
In the local grouping phase, each processor performs its group-by operation and calculates its local aggregate values on records of table R. In this phase, each processor groups its local records of R according to the designated group-by attribute and performs the aggregate function. Using the same example as in the previous section, one processor may produce, for example, (p1, 5000) and (p140, 8000), and another processor (p100, 7000) and (p140, 4000); the numerical figures indicate the SUM(Qty) of each project. In the second phase (the distribution phase), the local aggregate results from each processor, together with the records of table S, are distributed to all processors according to a partitioning function. The partitioning function is based on the join/group-by attribute, which in this case is attribute J# of tables Project and Shipment. Again using the same partitioning function as in the previous section, J# of p1-p99 goes to processor 1, J# of p100-p199 to processor 2, and so on. In the third phase (the final grouping and join phase), two operations are carried out: the final aggregation or grouping of R, and its join with S. The final grouping can be carried out by merging all temporary aggregate results obtained by each processor. Global aggregation in each processor is simply done by merging all entries with an identical project number (J#) into one aggregate value. For example, processor 2 will merge (p140, 8000) from one processor and (p140, 4000) from another to produce (p140, 12000), which is the final aggregate value for this project number. Global aggregation can be tricky depending on the complexity of the aggregate functions used in the actual query. If, for example, an AVG function were used instead of SUM in Query 3, calculating an average value based on temporary averages must take into account the actual raw records involved in each processor.
Therefore, for these kinds of aggregate functions, the local aggregation must also produce the number of raw records in each processor, even though this count is not specified in the query. This is needed for the global aggregation to produce correct values. For example, one processor may produce (p140, 8000, 5) and another (p140, 4000, 1). After distribution, supposing processor 2 received all p140 records, the average for project p140 is calculated by dividing the sum of the two quantities (i.e. 8000 and 4000) by the total number of shipment records for that project (i.e. (8000+4000)/(5+1) = 2000). The total number of shipments in each project must be determined in each processor even though it is not specified in the query.

Figure 2. Early GroupBy Scheme
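The global AVG computation described above can be sketched in a few lines of Python. The (sum, count) partials mirror the p140 worked example; shipping both values is what makes the global average correct when group sizes differ across processors.

```python
# Each processor ships (J#, local SUM(Qty), local record count), because a
# correct global AVG cannot be computed from local averages alone.
partials = [("p140", 8000, 5), ("p140", 4000, 1)]  # from two different processors

total_qty = sum(s for _, s, _ in partials)  # 8000 + 4000 = 12000
total_cnt = sum(c for _, _, c in partials)  # 5 + 1 = 6
avg = total_qty / total_cnt
print(avg)  # 2000.0, matching the worked example above
```

Averaging the two local averages instead (1600 and 4000) would give 2800, not 2000, which is why the extra count column is mandatory.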
After the global aggregation results are obtained, they are joined with table S in each processor. Figure 2 shows an illustration of this scheme. There are several things worth noting. First, the records of R in each processor are aggregated/grouped before being distributed. Consequently, the communication costs associated with table R can be expected to fall, depending on the group-by selectivity factor. Second, we observe that if the number of groups is less than the number of available processors, not all processors can be exploited, reducing the degree of parallelism.

4 Problem Formulation

Despite the usefulness of the above two general algorithms for parallel processing of GroupBy-Before-Join queries, a number of issues are worth considering, as follows. All of the schemes described previously target general shared-nothing architectures, where each processor is equipped with its own local memory and disk and communicates with other processors through message passing. They do not consider whether some of the processors share one memory (and disks), as is the case with cluster architectures. With a slower network (in particular, slower than the system bus), it is commonly understood that communication via the network should be minimized. This is not clearly identified in the previous schemes. In a shared-memory environment, where the memory is shared, we should take advantage of load balancing and load sharing. This is not identified in the existing schemes either. Based on these factors, we propose a scheme for parallel GroupBy-Before-Join queries especially designed for a cluster environment, where communication among nodes is done through message passing via an interconnection network, and processors within the same node share the same memory. We need to identify how to minimize communication costs among nodes and how to group shared data in shared memory.
Figure 3. Clusters of SMP

There are a number of variations of the cluster architecture. In this paper, each cluster node is a shared-memory architecture, and the nodes are connected to an interconnection network in a shared-nothing fashion. As each shared-memory node (i.e. SMP machine) maintains a group of processing elements, a collection of such nodes is often called "Clusters of SMP" [9]. Figure 3 shows an architecture of clusters of SMP.

5 Proposed Algorithm for Cluster Architecture

Like the Early GroupBy scheme, the Cluster-based scheme is divided into three phases: the local grouping, distribution, and final grouping/joining phases. In the first phase (the local grouping phase), each cluster node (i.e. an SMP node consisting of several processors) logically distributes table R based on the group-by attribute. In this phase, the processors within a cluster node in turn load each record of R from the shared disk and decide to which processor the record should be allocated. Since all processors within a cluster node share main memory, data distribution or data partitioning can be achieved by creating a fragment table for each processor. At the end of this distribution, each processor will have a table fragment of R to work with. Once each processor in a node has a distinct fragment table to work with, the aggregation operation can be carried out by each processor within the cluster node, producing a set of distinct group-by values. Each node will have one set of local aggregates, which is the union of the results produced by each processor within that cluster node. The second phase (the distribution phase) distributes the local aggregates produced from table R, as well as the non-group-by table S, from each cluster node to the other nodes. The distribution is based on the group-by/join attribute. Remember that distribution is done at the node level, not at the processor level; in other words, from each node there is one outgoing communication stream to another node.
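The local grouping phase can be sketched as follows. This is an illustrative single-process stand-in: in the actual scheme the fragments are processed concurrently by the processors of one SMP node, and the sample rows and the modulo hash on the project number are our own invented choices, not the paper's.

```python
from collections import defaultdict

# Hypothetical rows of table R (J#, Qty) stored on one node's shared disk.
node_rows = [("p1", 500), ("p2", 300), ("p1", 250), ("p3", 100)]
N_PROCS = 2  # processors sharing this node's memory

# Phase 1a: logical partitioning of R on the group-by attribute J#, so each
# processor owns whole groups (toy hash: project number modulo N_PROCS).
frags = [[] for _ in range(N_PROCS)]
for j, qty in node_rows:
    frags[int(j[1:]) % N_PROCS].append((j, qty))

# Phase 1b: each processor aggregates its own fragment; because the groups
# are disjoint, the node-level result is simply the union of the partials.
node_aggregates = {}
for frag in frags:
    local = defaultdict(int)
    for j, qty in frag:
        local[j] += qty
    node_aggregates.update(local)  # disjoint keys, so no merging is needed

print(sorted(node_aggregates.items()))  # [('p1', 750), ('p2', 300), ('p3', 100)]
```

In the distribution phase, `node_aggregates` (one set per node, not per processor) is what crosses the network.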
The third phase (the final grouping/joining phase) consists of merging the local aggregates from the first phase and joining them with table S. The merging process can be explained as follows. After each node has been reallocated local aggregates from different places, each node merges the identical aggregate values it has received, possibly from other nodes. Since multiple processors exist in each node, the merging process can also be done in parallel by having all processors in that node participate. The result of this final grouping is that each node produces a set of distinct aggregate values. The joining operation is basically a join between the result of the final grouping and the non-group-by table S. Since each node consists of several processors, the joining method adopted is a shared-memory join operation, which can be explained as follows. First, each processor reads in an aggregate value of R and hashes it into a shared hash table. Reading and hashing are carried out concurrently by all processors within each SMP node. Second, each processor
reads in a record of S and hashes/probes it into the shared hash table. This is also done concurrently by all processors within one node. Any match is stored in the query result. Figure 4 shows an illustration of the Cluster-based scheme; the diagram shows how the new scheme works with three cluster nodes and four processors in each cluster node.

Figure 4. Cluster-based Scheme

The main differences between the Cluster-based scheme and the Early GroupBy scheme can be outlined as follows. In the Early GroupBy scheme, each processor is considered independent; the scheme does not take into account that some processors share the same memory. Consequently, local aggregation is done at the processor level instead of at the node level. Since the table in each node is stored on a shared disk as one piece, the processors need to logically divide that one piece of data (table) into fragments. Because the disk is shared, it is common to adopt round-robin logical partitioning in order to maintain load balance across the processors. Since it is round-robin, each processor will likely produce the same groups, with different aggregate values, as the other processors in the same cluster node. In contrast, the Cluster-based scheme adopts semantic logical partitioning, that is, partitioning based on the group-by attribute; consequently, each processor produces a distinct set of aggregate values (groups), thus reducing the number of groups in the node. The impact of this propagates to the data distribution among nodes, since fewer local aggregate values (groups) are distributed across the network. Another difference is in the final grouping/joining phase.
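The contrast between round-robin and semantic partitioning within a node can be sketched as follows; the rows and the modulo-based placement are invented for illustration.

```python
# Rows of R keyed by the group-by attribute J# (hypothetical values).
rows = [("p1", 10), ("p1", 30), ("p2", 20), ("p2", 40)]

# Round-robin logical partitioning (Early GroupBy within a node): each
# processor sees an interleaved slice, so the same group shows up in
# several processors' partial results.
rr_frags = [rows[0::2], rows[1::2]]
rr_groups = [sorted({j for j, _ in frag}) for frag in rr_frags]
print(rr_groups)  # [['p1', 'p2'], ['p1', 'p2']]: every group is duplicated

# Semantic partitioning on the group-by attribute (Cluster-based scheme):
# each processor owns whole groups, so the partials are already distinct
# and fewer (group, value) pairs cross the network afterwards.
sem_frags = [[r for r in rows if int(r[0][1:]) % 2 == p] for p in range(2)]
sem_groups = [sorted({j for j, _ in frag}) for frag in sem_frags]
print(sem_groups)  # [['p2'], ['p1']]: no group appears on two processors
```

With round-robin, four (group, value) pairs leave this node; with semantic partitioning, only two do, which is the reduction the scheme exploits.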
After the distribution phase, using the general version of the scheme, each processor will have its fragments of the local aggregate results and of table S, which are then joined. The processors are likely to have fragments of different sizes, and this makes the processor loads imbalanced. On the other hand, using the Cluster-based scheme, data distribution is done at the node level, and hence each node will have its fragments of the local aggregates and of table S. Suppose a node consists of four processors; using the Cluster-based scheme, there will be one fragment of R and one fragment of S. Using the Early GroupBy scheme, there will be four smaller fragments of R and four smaller fragments of S. Assuming that the four processors within each node occupy consecutive positions in the hash function, the four fragments (of the Early GroupBy scheme) together equal the one bigger fragment (of the Cluster-based scheme). However, the four smaller fragments will likely differ in size, causing load imbalance among the processors. In contrast, the one bigger fragment, as it resides in shared memory, can be divided evenly among all processors during processing. Therefore, load balancing within the node is achieved with the Cluster-based scheme. However, we need to emphasize that a skew problem among nodes may still occur.

6 Performance Evaluation

In order to study the behavior of the three schemes presented in this paper and to compare their performance, we carried out a sensitivity analysis, which is done by varying performance parameters. For this purpose, a simulation package called Transim [7] was used in the experimentation. Transim is a transputer-based simulator, which adopts an Occam-like language. Using Transim, the number of processors and the architecture topology can be configured. In the experimentation, 64 processors were used, and the table sizes were between 1 and 10 GB, representing between 10 and 100 million records.
The maximum number of entries in each hash table is 10,000, and the projectivity ratio is 15%.

6.1 GroupBy Selectivity

The graph in Figure 5 shows the comparative performance of the three parallel schemes as the GroupBy selectivity ratio (i.e. the number of groups produced by the query) is varied. The selectivity ratio is varied from 0.0000001 to 0.01. With 100 million records as input, a selectivity ratio of 0.0000001 produces 10 groups, whereas the other end of the range, a selectivity ratio of 0.01, produces 1 million groups. The cluster configuration consists of 8 nodes with 8 processors in each node. Using the Early Distribution scheme, the major cost is the scanning cost in phase one of the processing. The total cost of phase one is constant regardless of the selectivity ratio, as no grouping is done in phase one. Data transfer from phase one to phase two is not excessive, for two reasons: first, the records to be transferred have been projected, and hence each record size is reduced; and second,
the communication unit cost is much smaller than the disk access unit cost. It should be noted that when the number of groups is smaller than the number of processors, not all processors are used. Looking at the graph in Figure 5, we notice that the cost line of the Early Distribution scheme goes down from 10 groups to 100 groups. This is because the experimentation used 64 processors in total; consequently, when 10 groups were produced by the query, not all 64 processors were used, which degrades performance. When 100 groups are produced, all available processors are used. Using the Early GroupBy scheme, the majority of the processing cost lies in data scanning and loading. Notice that the performance of this scheme is quite steady when the number of groups is small, and the cost increases suddenly when the number of groups in the query output grows. This is primarily caused by the overhead produced by overflowing hash tables. We also notice that the distribution costs do not play an important role, since the communication unit cost is far smaller than the disk unit cost. Using the Cluster-based scheme, the cost components are similar to those of the Early GroupBy scheme, the major cost being data scanning and loading. The local partitioning cost, incurred by the local partitioning within each node, appears negligible. Overall, the Cluster-based scheme is better than the Early GroupBy scheme for two reasons. One is the data transfer cost along the network: the Cluster-based scheme produces relatively fewer groups in each node than the Early GroupBy scheme does, which also lowers the final aggregation cost. The other is that the hash table overflow overhead appears later, at larger numbers of groups, and consequently the cost line of the Cluster-based scheme goes up later than that of the Early GroupBy scheme.
Comparing the three schemes, in general the Cluster-based scheme delivers better performance than the other two, except in a few situations, such as an extremely large number of groups produced by the query, in which case the Early Distribution scheme performs better. The experimentation results also show that when the number of groups is small, the Early GroupBy scheme is good, as it has filtered out records in the first phase of processing. When filtering in the first phase is limited, the Early GroupBy scheme is not good at all, as it requires double processing. On the other hand, the Early Distribution scheme, which does not filter in the first phase, is good for a large number of groups. The proposed Cluster-based scheme strikes a balance between the previous two schemes: it behaves like the Early GroupBy scheme but does not increase the cost too much when filtering in the first phase of processing is insufficient. Based on these performance results, we conclude that the Cluster-based scheme is beneficial, as it delivers better performance in most circumstances. We can also conclude that the Early Distribution scheme does not perform too poorly even though the group-by operation is not the first operation performed. In fact, in many cases, Early Distribution performs better than Early GroupBy. This is an interesting conclusion from a parallel processing perspective: optimizations for sequential processors may not necessarily apply to parallel processors.

Figure 5. Varying GroupBy Selectivity Ratio

6.2 Cluster Configuration

Figure 6 shows the comparative performance of the three schemes when the number of clusters and the cluster size are varied. The number of clusters is varied from 1 to 8, and each cluster has 4 or 8 processors.
The graphs shown in Figure 6 are the experimental results using the following parameters: the number of groups produced is 10,000 (a selectivity ratio of 0.0001), and the maximum hash table size is 10,000 entries. From the graphs, we notice that the Cluster-based scheme works well when there are more processors in a cluster node and when more nodes are used in the system. Generally, the Cluster-based scheme delivers the best performance, ranging from an 8% to 15% improvement over the other two schemes. With a large number of processors, the data transfer cost of the Early Distribution scheme is expensive. The data transfer cost of the Early GroupBy scheme can also be significant, although not as much as that of Early Distribution. For the Cluster-based scheme, on the other hand, the selectivity ratio within each cluster node can often be lower than the selectivity factor of the first phase of the Early GroupBy scheme, so its data transfer cost is trivial. With a small number of processors, the Cluster-based scheme imposes additional overhead associated with local data partitioning. Like the Early GroupBy scheme, the Cluster-based scheme also has some overhead associated with hash table overflow; this can only be minimized if more processors are used so that the workload is spread. The graphs in Figure 6 also indicate that, with the current parameters, the Early GroupBy scheme performs the worst. This is due to the small reduction in the original
number of records R. The Early Distribution scheme is, to some degree, better than Early GroupBy, for exactly the opposite reasons. The Cluster-based scheme in this case takes advantage of low hash table overflow overhead, like the Early Distribution scheme, but without an expensive data transfer cost. As a result, in most cases, the Cluster-based scheme offers the best performance.

Figure 6. Varying Cluster Configuration (query cost in seconds against the number of cluster nodes: (a) 4 processors per node; (b) 8 processors per node)

7 Conclusions

In this paper, we have studied three parallel algorithms for processing "GroupBy-Before-Join" queries (i.e. GroupBy queries where the group-by operation can be performed before the join operation) in high performance parallel database systems. These algorithms are the Early Distribution scheme, the Early GroupBy scheme, and the Cluster-based scheme. The rationale for the development of the Cluster-based scheme is twofold: one, to take advantage of cluster architectures; and two, the two existing methods were not specifically designed for cluster architectures. In the Cluster-based scheme, local aggregation in each SMP node is done through a shared-memory group-by operation whereby raw records are logically partitioned according to the group-by attribute. This scheme takes advantage of the shared memory within each cluster node, in which logical data partitioning can be done easily and efficiently (with balanced load), and filtering through the group-by selectivity yields better results. Our performance evaluation shows that in most cases the Cluster-based scheme delivers better performance than the other two schemes. Surprisingly, the Early Distribution scheme is not as bad as we initially thought, even though the group-by operation is not performed first.
This strongly indicates that optimization rules for sequential processors should not be blindly adopted for parallel processors. In parallel query optimization, we must consider other elements, such as distribution cost and parallel architecture. In this paper, we have also shown how parallel query optimization must be tailored to a specific architecture, such as the cluster architecture, which is the main platform of the experiments in this paper.

References

[1] Almasi, G. and Gottlieb, A., Highly Parallel Computing, 2nd ed., The Benjamin/Cummings Publishing Co. Inc.
[2] Bedell, J.A., "Outstanding Challenges in OLAP", Proc. of the 14th Intl. Conf. on Data Engineering.
[3] von Bültzingslöwen, G., "Translating and optimizing SQL queries having aggregates", Proc. of the 13th Intl. Conf. on Very Large Data Bases.
[4] Datta, A. and Moon, B., "A case for parallelism in data warehousing and OLAP", Proc. of the 9th Intl. Workshop on Database and Expert Systems Applications.
[5] Dayal, U., "Of nests and trees: a unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers", Proc. of the 13th Intl. Conf. on Very Large Data Bases, Brighton, UK.
[6] DeWitt, D.J. and Gray, J., "Parallel Database Systems: The Future of High Performance Database Systems", Comm. of the ACM, vol. 35, no. 6.
[7] Hart, E., Transim: Prototyping Parallel Algorithms, User Guide & Reference Manual, ver. 3.5, Westminster University.
[8] Kim, W., "On optimizing an SQL-like nested query", ACM Transactions on Database Systems, vol. 7, no. 3, Sept.
[9] Pfister, G.F., In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, 2nd ed., Prentice Hall.
[10] Taniar, D. and Rahayu, J.W., "Parallel Processing of Aggregate Queries in a Cluster Architecture", Proc. of the 7th Australasian Conf. on Parallel and Real-Time Systems (PART 2000), Springer-Verlag, Nov. 2000.
[11] Taniar, D., Jiang, Y., Liu, K.H., and Leung, C.H.C., "Aggregate-Join Query Processing in Parallel Database Systems", Proc.
of the 4th HPCAsia 2000 Intl. Conf., vol. 2, IEEE CS Press.
[12] Wilkinson, B. and Allen, M., Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall.
[13] Yan, W.P. and Larson, P.-Å., "Performing group-by before join", Proc. of the Intl. Conf. on Data Engineering.
Parallel DBMS Prof. Yanlei Diao University of Massachusetts Amherst Slides Courtesy of R. Ramakrishnan and J. Gehrke I. Parallel Databases 101 Rise of parallel databases: late 80 s Architecture: shared-nothing
More information