Massive Parallel Join in NUMA Architecture

2013 IEEE International Congress on Big Data

Massive Parallel Join in NUMA Architecture

Wei He, Minqi Zhou, Xueqing Gong, Xiaofeng He (corresponding author)
Software Engineering Institute, East China Normal University, Shanghai, China

Abstract

Advances in hardware technology and the growing demand for fast response from database applications have led to active research on In-Memory Databases (IMDB). Compared to a traditional on-disk database, an IMDB offers advantages such as faster access to storage and simpler internal optimization algorithms. Because of the importance of the join operation in database systems, join algorithms have long been a hot research topic, and many join algorithms have been proposed for distributed database systems. Nevertheless, due to the nature of memory access in the Non-Uniform Memory Access (NUMA) architecture, most existing join algorithms designed for the classic Symmetric Multi-Processing (SMP) architecture cannot be applied to NUMA directly. In this work, we present the Distributed Bitmap Join algorithm, designed exclusively for IMDBs in NUMA architecture. The algorithm aims at improving the overall performance of groups of queries, rather than a single query, by utilizing bitmaps to reduce the communication cost in NUMA architecture. Comparative experiments of Distributed Bitmap Join against no-partition hash join show that although no-partition hash join is faster in the single-query case, our algorithm outperforms it for a group of queries.

I. INTRODUCTION

Traditional database management systems mainly rely on hard disks for data storage. For applications such as mobile computing and online advertising [1], where real-time response is one of the major concerns, operations that must access data stored on disk no longer satisfy the requirements. A system with fast data access capability hence becomes much desired. As a result of hardware technology improvements, computer systems have developed from single-CPU to multi-core and many-core processors, which dramatically accelerates processing speed. On the other hand, this improvement worsens the memory wall problem, because contention for memory access becomes more prominent even as larger memories become available, and it is the larger memories that make it possible for entire data sets to reside in memory. To partially address this issue, the Non-Uniform Memory Access (NUMA) architecture was designed for multi-core systems such that each processor accesses its local memory faster than shared memory. NUMA effectively mitigates the memory-access-related starvation problem of the Symmetric Multi-Processing (SMP) architecture [2] [3].

In-Memory Database (IMDB) refers to database systems that primarily rely on main memory for data storage [4]. In recent years, rapid development in memory, network, and processor technology has greatly changed database systems. With cheaper RAM, IMDBs have become viable for database applications. Faster memory access ensures better performance for an IMDB than for a traditional on-disk database system, and the elimination of sophisticated data scheduling strategies simplifies internal optimization algorithms as well [5] [6]. As a frequently used yet expensive operation in database applications, the join has been studied for a long time, and many join algorithms have been proposed for various application scenarios.
Despite the overall performance gains of these algorithms, several problems remain in optimizing join algorithms for distributed IMDBs: data skew, synchronization cost, communication cost, and computation cost [5]. Due to the advantages of NUMA over SMP, the optimization of join algorithms for IMDBs in NUMA architecture has attracted many research efforts [7].

In this paper, we analyze two typical join algorithms, both of which are suitable for multiprocessing yet carry some inherent flaws. We gain insight by applying these two algorithms to the NUMA architecture and analyzing their respective advantages and disadvantages under NUMA. Inspired by this analysis, we propose the Distributed Bitmap Join algorithm for distributed IMDBs in NUMA architecture. In our experiments, we compare the Distributed Bitmap Join algorithm with the no-partition hash join algorithm under NUMA. No-partition hash join is faster than Distributed Bitmap Join in the single-query case, but when applied to a group of queries, our algorithm is more efficient.

This paper makes three major contributions:
- We analyze two typical optimized join algorithms, radix join and Massively Parallel Sort-Merge join (MPSM), under NUMA and compare their respective pros and cons. This analysis reveals the root of their problems and inspires the design of the Distributed Bitmap Join algorithm.
- We propose the Distributed Bitmap Join algorithm, which optimizes a group of queries rather than a single query.
- We perform extensive experiments comparing the Distributed Bitmap Join algorithm and the no-partition hash join algorithm in NUMA architecture. The results show that our method outperforms no-partition hash join and is not affected by data skew. Considering that skewness is ubiquitous in real-life data, this result is significant.

The rest of this paper is organized as follows. In Section II, we introduce related work on the join operation in the IMDB setting. Section III presents our Distributed Bitmap Join algorithm. We present the algorithm implementation and experimental results in Section IV and Section V respectively, and conclude the paper in Section VI.

II. RELATED WORK

The first database engine that supports both in-memory and on-disk tables in a single database was released in 2003 [8]. Since then, research on IMDBs has been very active. In addition to faster memory access and simpler internal optimization algorithms, an IMDB also avoids the long seek time of disk storage, which makes it very useful in applications requiring real-time response. SAP's HANA DB [9] is one of the most famous IMDBs; HANA is designed as a new application architecture that offers real-time analytics and aggregation capabilities. To keep pace with the rapid development of the Internet and multimedia, as well as the growing requirements for semi-structured, unstructured, and text data processing, HANA DB provides three in-memory storage engines: a Relational Engine, a Graph Engine, and a Text Engine, which allow users to store relational data, graph data, and text data in one system. Furthermore, HANA offers a memory-based solution to big data challenges that processes big data problems in large memory. At its first launch, HANA DB had 1 TB of RAM, enough to support 5 TB of uncompressed data; by the end of 2011, 8 TB of RAM was available on the market, and HANA DB can now run on servers with 100 TB memory capacity. HANA's success in IMDBs further stimulates research in this field [10] [11].

Join is one of the most used operations in relational databases, and its optimization is a hot topic for database researchers due to its high complexity. For the traditional nested-loop join, the time complexity is O(|R| × |S|), where R and S are the input relations of the join operation and |·| denotes the size of a relation. A naive join between two large tables is clearly not practical. Many optimized join algorithms have been proposed for the SMP architecture, such as sort-merge join and hash join. The efficiency of sort-merge join is limited by sorting algorithms, while the performance of hash join is negatively influenced by data skew in many applications. In the meantime, the design of high-performance join algorithms for IMDBs has developed in two directions: minimizing the number of processor cache misses and minimizing processor synchronization cost [5]. In general, processor cache misses can be minimized by accessing data sequentially and by avoiding working sets too large to fit in the data cache; the radix join algorithm is a positive example here, while no-partition hash join is a negative one [12]. Minimizing processor synchronization cost, on the other hand, always requires finding a tradeoff between synchronization cost and thread safety.

The radix join algorithm [13] is a typical approach for minimizing the number of processor cache misses. In radix join, the partitioning phase divides the input tuples based on the values of their keys, which ensures that memory can be read sequentially, helping the prefetcher hide remote access latency. With the utilization of prefetching [14], partitioning can improve the performance of the join operation [12]. The synchronization cost of radix join under NUMA, however, is not negligible: almost every step in radix join requires the outcome of the previous steps, so radix join needs more synchronization points than other join algorithms. Additionally, because radix join is essentially a hash join algorithm, data skew causes performance issues.

Another join algorithm is the MPSM join, which aims at minimizing processor synchronization cost. In the MPSM algorithm, every node in the system is called a worker. In phase 1, each worker sorts its chunk of input relation S locally, resulting in runs S_i. In phase 2, each worker range-partitions its chunk of input relation R into runs R_i. In phase 3, each worker sorts its R_i locally, and in phase 4, it merge-joins its R_i with all input runs S_j. The MPSM join requires no synchronization, and the range partitioning phase saves work during the join phase. Nonetheless, range partitioning also needs to move tuples between workers at large scale, which leads to a large communication cost under NUMA.
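For concreteness, the phase structure of MPSM can be sketched as follows. This is an illustrative C++ sketch under assumed types (Tuple, the chunk vectors, and the partition boundaries are ours, not from [15]); duplicate keys and the original algorithm's synchronization-free scheduling are glossed over.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// One MPSM worker i. bounds holds the upper key boundary of each worker's
// range partition; partsR[j] collects the R tuples routed to worker j.
void mpsm_worker(size_t i,
                 std::vector<std::vector<Tuple>>& chunksS,
                 std::vector<std::vector<Tuple>>& chunksR,
                 const std::vector<uint64_t>& bounds,
                 std::vector<std::vector<Tuple>>& partsR,
                 std::vector<std::pair<Tuple, Tuple>>& out) {
  auto byKey = [](const Tuple& a, const Tuple& b) { return a.key < b.key; };
  // Phase 1: sort the local S chunk, producing run S_i.
  std::sort(chunksS[i].begin(), chunksS[i].end(), byKey);
  // Phase 2: range-partition the local R chunk across all workers
  // (this is the step that moves tuples through the NUMA interconnect).
  for (const Tuple& t : chunksR[i]) {
    size_t j = std::upper_bound(bounds.begin(), bounds.end(), t.key) - bounds.begin();
    partsR[j].push_back(t);
  }
  // Phase 3: sort the R run assigned to this worker.
  std::sort(partsR[i].begin(), partsR[i].end(), byKey);
  // Phase 4: merge-join R_i against every sorted run S_j (unique keys assumed).
  for (const std::vector<Tuple>& Sj : chunksS) {
    size_t r = 0, s = 0;
    while (r < partsR[i].size() && s < Sj.size()) {
      if (partsR[i][r].key < Sj[s].key) ++r;
      else if (partsR[i][r].key > Sj[s].key) ++s;
      else out.push_back({partsR[i][r++], Sj[s++]});
    }
  }
}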
Data skew is a difficult problem for everyone studying hash join algorithms. Kim et al. indicate that data skew is a troublemaker for hash join algorithms [13], which makes the optimization of join algorithms even harder. Some efforts have been made to attack the data skew problem. Blanas et al. [5] propose a series of hash join algorithms for multiprocessing environments and find that the no-partition hash join algorithm can improve performance in the presence of data skew, although even it cannot eliminate the influence entirely. The recently proposed MPSM algorithm tries to solve this problem with a histogram-based technique [15], but this method increases the computation cost across workers.

In this paper, we present the Distributed Bitmap Join algorithm, which aims at optimizing the join operation for IMDBs in NUMA architecture by reducing the communication cost during processing. Distributed Bitmap Join also tries to reduce the influence of data skew while minimizing synchronization cost and computation cost at the same time. Note that our algorithm optimizes the join operation over a group of queries, instead of a single query.

III. DISTRIBUTED BITMAP JOIN

The NUMA architecture is logically consistent with the SMP architecture, but the physical implementation is quite different. In SMP, threads communicate through shared memory, so optimized join algorithms for SMP must pay more attention to processor synchronization cost when accessing shared memory. In NUMA, on the other hand, nodes are connected by QuickPath Interconnect (QPI) and accessing remote memory is much slower than accessing local memory, so special attention must be paid to communication cost when designing optimized join algorithms.

The Distributed Bitmap Join algorithm is designed as an optimized equi-join algorithm for IMDBs under NUMA. It focuses on reducing both computation cost and remote memory access by reusing part of the intermediate results and by utilizing bitmaps. The algorithm handles queries of the form:

Select * from R, S where R.key = S.key and other conditions

All tuples reside in main memory. For ease of reading, we list the notations used in Table I.

TABLE I
NOTATIONS

Notation             Meaning
R                    Input relation R
S                    Input relation S
M_{S⋈R}              Intermediate result of R ⋈ S
N                    Number of workers in the NUMA architecture
w_i                  The i-th worker
R_i                  The R run on w_i
R_i.bitmap           Bitmap of R_i
S_i                  The S run on w_i
S_i.bitmap           Bitmap of S_i
M_{S⋈Ri}             Intermediate result of R ⋈ S on w_i
M_{S⋈Ri}.bitmap      Bitmap of M_{S⋈Ri}
T_i                  Temporary result on w_i
t_i                  Number of R_i tuples that find a match
t                    Number of R tuples that find a match
kr                   Number of distinct key values in R and S

In the following sections, we first present a basic optimized join algorithm whose main idea is the reuse of intermediate results. After analyzing the complexity of this basic algorithm, we present our Distributed Bitmap Join algorithm and show how the use of bitmaps reduces communication cost in NUMA architecture.

A. Basic optimized join algorithm

One simple way to optimize join algorithms in NUMA architecture is to record intermediate results that are already known. Since communication cost is a major problem in NUMA architecture, we need to minimize remote memory accesses in join algorithms. Take as an example a query involving a join between relation R and relation S. In the basic optimized join algorithm, when we need to match pairs of tuples, we reuse the intermediate results and check only the pairs with unknown results instead of all pairs. The algorithm maintains special relations that record the results of join operations and updates these relations every time new results are generated.

The inputs of the basic optimized join algorithm are relations R and S, divided into equal-sized chunks R_i and S_i among all workers. The algorithm consists of four steps.

Step 1: w_i iterates over all tuples in R_i and eliminates those not satisfying the join conditions.

Step 2: w_i checks whether an intermediate result relation M_{S⋈Ri} exists in local memory. If so, w_i matches the tuples in R_i against M_{S⋈Ri}, adding successful matches to relation T_i. The tuples in both M_{S⋈Ri} and T_i have the form (v, sid), where v is a key value and sid is the address of an S tuple in the NUMA architecture.

Step 3: w_i accesses all chunks S_j in the NUMA architecture to match the tuples in R_i that were not matched against M_{S⋈Ri}, adding successful matches to T_i.

Step 4: w_i performs the actual join according to the tuples in T_i and checks whether a relation M_{S⋈Ri} exists in local memory. If so, w_i updates M_{S⋈Ri} with the tuples in T_i; otherwise, it creates M_{S⋈Ri} before updating it.

The third step requires matching the tuples in R_i against all chunks S_j in the NUMA architecture, where any hash join algorithm can be adopted. If no-partition hash join is used in step 3, the basic optimized join algorithm requires only one synchronization point there, so the synchronization cost is low.
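As a point of reference for step 3, the no-partition hash join idea can be sketched as follows: all workers build one shared hash table over S, synchronize once, and then probe it with their R chunks. The types and the coarse mutex are illustrative assumptions; the implementation of [5] uses finer-grained synchronization.

#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

std::unordered_multimap<uint64_t, const Tuple*> shared_index;  // over all of S
std::mutex index_lock;

// Build phase: every worker inserts its S chunk into the shared table.
void build_chunk(const std::vector<Tuple>& Sj) {
  std::lock_guard<std::mutex> g(index_lock);   // coarse lock for the sketch
  for (const Tuple& s : Sj) shared_index.emplace(s.key, &s);
}

// After one barrier, each worker probes with its own R chunk, without locks.
void probe_chunk(const std::vector<Tuple>& Ri,
                 std::vector<std::pair<const Tuple*, const Tuple*>>& out) {
  for (const Tuple& r : Ri) {
    auto range = shared_index.equal_range(r.key);
    for (auto it = range.first; it != range.second; ++it)
      out.push_back({&r, it->second});
  }
}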
B. Pseudo-code of basic optimized join algorithm

We present pseudo-code for the basic optimized join algorithm in Algorithm 1 below.

Algorithm 1 BASIC OPTIMIZED JOIN ALGORITHM
Require: Input relations R and S for the join operation
Ensure: A relation Q_i containing the results of S ⋈ R_i
1:  build a relation T_i with schema T_i(key, sid)
2:  // Step 1: iterate over all tuples in R_i
3:  for each tuple tr in R_i do
4:      check whether tr satisfies the join conditions other than key equality
5:  end for
6:  // Step 2: match R_i tuples with M_{S⋈Ri} tuples
7:  if there is a relation M_{S⋈Ri} in local memory then
8:      for each tuple tr in R_i do
9:          if there is a tuple tm in M_{S⋈Ri} such that tm.key = tr.key then
10:             add (tm.key, tm.sid) to T_i
11:         end if
12:     end for
13: end if
14: // Step 3: match R_i tuples with S tuples
15: for each tuple tr in R_i that cannot be matched with tuples in M_{S⋈Ri} do
16:     for each worker w_j in the NUMA architecture do
17:         if there is a tuple ts in S_j such that tr.key = ts.key then
18:             add (tr.key, ts.sid) to T_i
19:         end if
20:     end for
21: end for
22: // Step 4: perform the join and update M_{S⋈Ri}
23: for each tuple (key, sid) in T_i do
24:     join the tuple tr in R_i with tr.key = key with the tuple ts in S with ts.sid = sid
25:     put the result of the join into Q_i
26: end for
27: if there is no relation M_{S⋈Ri} in local memory then
28:     build M_{S⋈Ri}
29: end if
30: update M_{S⋈Ri} with T_i
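A minimal single-worker sketch of Algorithm 1 in C++ follows. The types are assumed for illustration: M_{S⋈Ri} is modeled as a hash map from key value to S-tuple address, the remote chunks S_j are reached through plain pointers, and step 1 is taken as already done. This sketches the control flow only, not our actual implementation.

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };
using Sid = const Tuple*;                                // address of an S tuple
using Intermediate = std::unordered_map<uint64_t, Sid>;  // M_{S join Ri}: key -> sid

void basic_optimized_join(const std::vector<Tuple>& Ri,
                          const std::vector<const std::vector<Tuple>*>& allSj,
                          Intermediate& M,  // empty on the first query for R join S
                          std::vector<std::pair<const Tuple*, Sid>>& Qi) {
  std::vector<std::pair<uint64_t, Sid>> Ti;  // temporary result (key, sid)
  std::vector<const Tuple*> unmatched;
  // Step 2: resolve what we can from the local intermediate result.
  for (const Tuple& r : Ri) {
    auto it = M.find(r.key);
    if (it != M.end()) Ti.push_back({it->first, it->second});
    else unmatched.push_back(&r);
  }
  // Step 3: only unresolved tuples pay for remote accesses to the chunks Sj.
  for (const Tuple* r : unmatched)
    for (const std::vector<Tuple>* Sj : allSj)
      for (const Tuple& s : *Sj)
        if (s.key == r->key) Ti.push_back({r->key, &s});
  // Step 4: do the actual join and fold Ti back into M for later queries.
  for (const auto& kv : Ti) {
    for (const Tuple& r : Ri)                // a real implementation indexes Ri
      if (r.key == kv.first) Qi.push_back({&r, kv.second});
    M[kv.first] = kv.second;
  }
}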

C. Complexity of basic optimized join algorithm

In step 1, w_i walks through all tuples in R_i. In step 2, w_i iterates over all tuples in R_i and tries to find a matching tuple in M_{S⋈Ri}; a hash index on M_{S⋈Ri} can be used to reduce the computation cost of this step. In step 3, for each unresolved tuple in R_i, all chunks S_j in the NUMA architecture are accessed sequentially to find matching tuples. Finally, in step 4, each worker iterates over the tuples in T_i, performs the actual join actions accordingly, and updates M_{S⋈Ri} with the T_i tuples. The per-worker costs of the basic optimized join algorithm are shown in Table II, where R_r denotes the set of R tuples not resolved by the intermediate results.

TABLE II
COMPLEXITY OF BASIC OPTIMIZED JOIN ALGORITHM

Complexity         Action
|R|/N              Iterate over all tuples in R_i
|R|/N              Match all tuples in R_i with tuples in M_{S⋈Ri}
|S|/N              Build hash indexes for relation S
(|R_r|/N) × N      Match each unresolved tuple in R_i with tuples in all chunks S_j
t_i                Perform the actual join actions
t_i                Update M_{S⋈Ri} with T_i

The overall complexity of the basic optimized join algorithm is O(|R|/N + |S|/N + |R_r| + t_i). Because the algorithm requires only one synchronization point, the synchronization cost is low. The communication cost, on the contrary, is a serious problem in NUMA architecture, because each unresolved tuple in R_i has to access all chunks S_j in step 3. To solve this problem, we introduce bitmaps on top of the basic optimized join algorithm.

D. Distributed Bitmap Join

Traditional join algorithms usually have to check each pair of an R tuple and an S tuple to determine whether they match. Most of these checks are wasted, because only a fraction of the pairs actually match; this forms a major bottleneck for the join operation. Optimizing this check process is especially critical for join algorithms in NUMA architecture, because a useless check means an unnecessary access to remote memory. Similarly, re-checking pairs whose results are already known should also be avoided. The Distributed Bitmap Join algorithm takes advantage of intermediate results to eliminate checks with known results, and utilizes bitmaps to minimize the communication cost caused by useless checks. Every time the results of checks become available, they are recorded in a special relation M_{S⋈Ri} in local memory. Thus, when we need to check pairs of tuples, we first match them against the tuples in M_{S⋈Ri}, so the pairs with known results need not be checked across NUMA again. For the pairs whose results are unknown, we use a bitmap to find out which tuples can possibly match and avoid checking the ones that cannot.

The inputs of the Distributed Bitmap Join algorithm are relations R and S, divided into equal-sized chunks R_i and S_i among all workers. The algorithm consists of four steps.

Step 1: w_i iterates over all tuples in R_i and eliminates those not satisfying the join conditions.

Step 2: w_i looks for M_{S⋈Ri} for R ⋈ S in local memory. If M_{S⋈Ri} exists, w_i matches R_i tuples with M_{S⋈Ri} tuples via bitmap operations and adds the successful matches to T_i. The tuples in both M_{S⋈Ri} and T_i have the form (v, sid), where v is a key value and sid is the address of an S tuple in the NUMA architecture. Bitmap bm2 is assigned the outcome of R_i.bitmap ∧ M_{S⋈Ri}.bitmap, and bm1 the outcome of bm2 ⊕ R_i.bitmap.

Step 3: w_i accesses S_j.bitmap in the memory of every worker w_j and obtains bitmaps bm3 as the outcomes of bm1 ∧ S_j.bitmap. It then looks up each S_j tuple whose key value corresponds to a 1-bit in bm3, and puts the addresses of these S_j tuples, along with their key values, into T_i.

Step 4: w_i performs the actual join according to the tuples in T_i and checks whether a relation M_{S⋈Ri} exists in local memory. If so, w_i updates M_{S⋈Ri} with the tuples in T_i; otherwise, it creates M_{S⋈Ri} before updating it.
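On uncompressed bitmaps, the bitmap logic of steps 2 and 3 reduces to word-wise operations, as the following sketch shows. Bit position k standing for key value k is an assumed, dictionary-defined encoding, and WAH compression, discussed below, would replace the raw words.

#include <cstdint>
#include <vector>

using Bitmap = std::vector<uint64_t>;   // bit k set <=> key value k occurs

// bm2 = Ri.bitmap AND M.bitmap : keys whose results are already known locally.
// bm1 = bm2 XOR Ri.bitmap      : keys of Ri still unresolved (bm2 is a subset).
// bm3 = bm1 AND Sj.bitmap      : unresolved keys that actually occur in Sj.
// Both helpers assume equal-length operands.
Bitmap bit_and(const Bitmap& a, const Bitmap& b) {
  Bitmap r(a.size());
  for (size_t w = 0; w < a.size(); ++w) r[w] = a[w] & b[w];
  return r;
}

Bitmap bit_xor(const Bitmap& a, const Bitmap& b) {
  Bitmap r(a.size());
  for (size_t w = 0; w < a.size(); ++w) r[w] = a[w] ^ b[w];
  return r;
}

// Step 2/3 wiring for worker w_i against one remote chunk S_j:
//   Bitmap bm2 = bit_and(Ri_bitmap, M_bitmap);
//   Bitmap bm1 = bit_xor(bm2, Ri_bitmap);
//   Bitmap bm3 = bit_and(bm1, Sj_bitmap);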
The Distributed Bitmap Join algorithm minimizes the number of S tuples that each worker must access. The reuse of intermediate results reduces the computation cost, while the bitmaps let us avoid useless actions. In addition, with bitmaps the Distributed Bitmap Join can process data in batches, which is desirable in massive data situations. Throughout all steps of the algorithm, only local memory is written, so Distributed Bitmap Join requires only one synchronization point, before step 3, to make sure that every S_j.bitmap has already been created. Note that R_i.bitmap and M_{S⋈Ri}.bitmap can be built before the join operation, so no synchronization is required for them. The algorithm thus needs a single synchronization point and has low synchronization cost.

Since the Distributed Bitmap Join algorithm relies on bitmaps, it may have high space complexity. To reduce the space demand, we can use bitmap compression algorithms such as the Word-Aligned Hybrid (WAH) code proposed by Wu et al. [16] [17] [18]. WAH is a run-length encoding scheme whose main advantage is its efficient support for logical operations on bitmaps. Given that the Distributed Bitmap Join algorithm performs frequent logical operations on bitmaps, WAH is a good compression choice.

The lookups of M_{S⋈Ri}, R_i, and S_j tuples by key value involved in steps 3 and 4 can be implemented with the help of dictionaries on the join keys. The dictionary of a relation is a data structure from information retrieval that allows us to build bitmaps and search data quickly. In our algorithm, the terms of the dictionaries are key values, and the indexes are sids, the addresses of S tuples in the NUMA architecture. The dictionaries of relation R_i can be created before the join operation, and those of relation M_{S⋈Ri} are maintained whenever M_{S⋈Ri} is updated. The algorithm builds the dictionaries of S_j at the first query dealing with R ⋈ S and reuses them for follow-up queries. Therefore, the cost of creating the dictionaries can be amortized over all queries.
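The following sketch shows one possible (assumed, not prescribed) layout for such a dictionary: a hash map from key value to its bit position, plus the sid lists per bit. Bit positions must be assigned consistently across the R_i, S_j, and M_{S⋈Ri} bitmaps, e.g., by a shared encoding of key values, for the logical operations of steps 2 and 3 to be meaningful.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Dictionary over one relation chunk: maps each distinct key value to the
// bit position it occupies in the chunk's bitmap and to the tuple addresses.
struct Dictionary {
  std::unordered_map<uint64_t, uint32_t> bit_of_key;  // key value -> bit index
  std::vector<std::vector<const Tuple*>> sids;        // bit index -> tuple addresses
  std::vector<uint64_t> bitmap;                       // one bit per key value

  // bit must come from the globally agreed key-value encoding.
  void insert(const Tuple& t, uint32_t bit) {
    bit_of_key.emplace(t.key, bit);
    if (sids.size() <= bit) sids.resize(bit + 1);
    sids[bit].push_back(&t);
    if (bitmap.size() <= bit / 64) bitmap.resize(bit / 64 + 1, 0);
    bitmap[bit / 64] |= uint64_t(1) << (bit % 64);
  }
};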

E. Pseudo-code of Distributed Bitmap Join

We present pseudo-code for the Distributed Bitmap Join algorithm in Algorithm 2 below.

Algorithm 2 DISTRIBUTED BITMAP JOIN
Require: Input relations R and S for the join operation
Ensure: A relation Q_i containing the results of S ⋈ R_i
1:  build a relation T_i with schema T_i(key, sid)
2:  build bitmaps bm1, bm2, bm3
3:  // Step 1: iterate over all tuples in R_i
4:  for each tuple tr in R_i do
5:      check whether tr satisfies the join conditions other than key equality
6:  end for
7:  // Step 2: match R_i tuples with M_{S⋈Ri} tuples
8:  if there is a relation M_{S⋈Ri} in local memory then
9:      bm2 ← R_i.bitmap ∧ M_{S⋈Ri}.bitmap
10:     bm1 ← bm2 ⊕ R_i.bitmap
11: end if
12: for i = 0 to bm2.length do
13:     if the i-th bit of bm2 equals 1 then
14:         find the tuple tm in M_{S⋈Ri} whose key equals the key value represented by the i-th bit of bm2
15:         add (tm.key, tm.sid) to T_i
16:     end if
17: end for
18: // Step 3: match R_i tuples with S tuples
19: for each worker w_j in the NUMA architecture do
20:     bm3 ← bm1 ∧ S_j.bitmap
21:     for i = 0 to bm3.length do
22:         if the i-th bit of bm3 equals 1 then
23:             find the tuple ts in S_j whose key equals the key value represented by the i-th bit of bm3
24:             add (ts.key, ts.sid) to T_i
25:         end if
26:     end for
27: end for
28: // Step 4: perform the join and update M_{S⋈Ri}
29: for each tuple (key, sid) in T_i do
30:     join the tuple tr in R_i with tr.key = key with the tuple ts in S with ts.sid = sid
31:     put the result of the join into Q_i
32: end for
33: if there is no relation M_{S⋈Ri} in local memory then
34:     build M_{S⋈Ri}
35: end if
36: update M_{S⋈Ri} with T_i
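Lines 19-27 of Algorithm 2 amount to scanning the set bits of bm3 and translating each one back into S_j tuple addresses through the chunk's dictionary. A compact sketch follows (assumed types as before; __builtin_ctzll is the GCC/Clang count-trailing-zeros intrinsic):

#include <cstdint>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Only keys whose bit survives the AND are looked up, so no non-matching
// Sj tuple is ever touched across the interconnect.
void scan_bm3(const std::vector<uint64_t>& bm3,
              const std::vector<std::vector<const Tuple*>>& sids_of_bit,
              std::vector<std::pair<uint64_t, const Tuple*>>& Ti) {
  for (size_t w = 0; w < bm3.size(); ++w) {
    uint64_t word = bm3[w];
    while (word) {
      int b = __builtin_ctzll(word);           // index of the lowest set bit
      size_t bit = w * 64 + b;
      for (const Tuple* s : sids_of_bit[bit])  // addresses recorded for this key
        Ti.push_back({s->key, s});
      word &= word - 1;                        // clear the lowest set bit
    }
  }
}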
F. Complexity of Distributed Bitmap Join

The Distributed Bitmap Join algorithm is executed in parallel; assume the number of workers in the NUMA system is N. In step 1, all tuples in R_i are accessed sequentially. In step 2, each worker matches R_i tuples with M_{S⋈Ri} tuples locally through bitmap operations and puts the successful matches into T_i. In step 3, each worker combines bm1, one of the results of the bitmap operations in step 2, with the S_j.bitmap of each of the N workers. Finally, in step 4, each worker iterates over the tuples in T_i, performs the actual join actions accordingly, and updates M_{S⋈Ri} with the T_i tuples. The performance of the bitmap operations depends on the bandwidth of the NUMA architecture as well as on the compression ratio of the WAH algorithm. Assuming the NUMA bandwidth is B and the WAH compression ratio in this case is c (0 < c < 1), the approximate per-worker costs for worker w_i are shown in Table III.

TABLE III
COMPLEXITY OF DISTRIBUTED BITMAP JOIN

Complexity         Action
|R|/N              Iterate over all tuples in R_i
kr × c / B         Logical operation on bitmaps
kr × c             Iterate over bitmap bm2
N × kr × c / B     Logical operations on bm1 and all S_j.bitmap
N × kr × c         Iterate over the bitmaps bm3
t_i                Perform the actual join actions
t_i                Update M_{S⋈Ri} with T_i

The overall complexity of the Distributed Bitmap Join algorithm is O(|R|/N + N × kr × c + t_i). The complexity analysis shows that the calculation involves more terms in kr than terms in |R| and |S|. This is because the Distributed Bitmap Join algorithm minimizes accesses to S tuples by utilizing bitmaps, in order to reduce communication cost in NUMA architecture. kr is the number of distinct key values across relations R and S, and it depends on the layout of the key values of the two relations in real cases. In the best case, when most R tuples match S tuples, kr is approximately |R|. In the worst case, when almost no R tuple matches an S tuple, kr is approximately |R| + |S|. Fortunately, in that case, with few R tuples matching S tuples, the bitmaps contain long runs of consecutive 0-bits, which makes the WAH compression ratio very small, so the Distributed Bitmap Join algorithm suffers no performance degradation.

Because the Distributed Bitmap Join algorithm is designed exclusively for IMDBs in NUMA architecture, we must take the communication cost of NUMA into consideration. Li et al. [2] evaluated the bandwidth and latency of the IBM X Series x3850 X5 system, a NUMA system, to reveal the difference between local and remote memory access speeds. In general, local memory access is more than twice as fast as remote memory access for sequential access, which underlines the importance of reducing the communication cost of algorithms. To simplify the analysis of communication cost, we focus on remote memory accesses and ignore all local memory accesses.

In the Distributed Bitmap Join algorithm, most remote memory accesses occur in step 3, where each worker needs to access N − 1 remote bitmaps S_j.bitmap and a dictionary of each S_j. The approximate communication cost is given in Table IV.

TABLE IV
COMMUNICATION COST OF DISTRIBUTED BITMAP JOIN

Remote Memory Access                Action
N × (N − 1) × sizeof(bitmap)        Access the bitmaps S_j.bitmap
t × sizeof(dictionary entry)        Find S_j tuples according to bm3
t × sizeof(S tuple)                 Perform the actual join actions

The overall communication cost of the Distributed Bitmap Join algorithm is (N × (N − 1) × sizeof(bitmap) + t × sizeof(dictionary entry) + t × sizeof(S tuple)) / speedof(QPI), where t denotes the total number of successful matches, which is usually far smaller than |S| in join operations. Traditional join algorithms always require more accesses to S tuples, including many useless ones, whereas in the Distributed Bitmap Join algorithm only t tuples need to be accessed, which largely reduces the communication cost of the join. Although our algorithm also needs to access some bitmaps, the size of the bitmaps is much smaller than the size of relation S.

IV. IMPLEMENTATION

Considering that the focus of the Distributed Bitmap Join algorithm is to optimize a group of queries instead of a single query, we need to pay special attention to the following components when implementing the algorithm.

A. Dictionary and bitmap

In steps 2, 3, and 4 of the Distributed Bitmap Join algorithm, workers need to search for tuples by key value. Since the time complexity of sorting algorithms is O(n log n), we decided not to use sorting in our algorithm [19] [20]; instead, we use dictionaries to find tuples quickly. In the Distributed Bitmap Join algorithm, all relations R_i, S_i, and M_{S⋈Ri} require dictionaries. Because creating dictionaries requires iterating over the entire relations, we pre-compute some of them or reuse them to amortize the cost, as follows. For relation R_i, the bitmap and dictionary on its key attribute can be pre-computed before calling the join algorithm; this not only reduces the computation cost of each join operation, but the structures can also be reused by later queries, amortizing the cost. For relation S_i, the bitmaps and dictionaries are built inside the Distributed Bitmap Join algorithm the first time R ⋈ S is executed; keeping these structures in memory enables us to reuse them and amortize the cost. For relation M_{S⋈Ri}, we build an empty bitmap and an empty dictionary the first time R ⋈ S is executed and update them whenever we update M_{S⋈Ri} with T_i tuples. In this way, the computation cost of the algorithm is amortized, improving the overall performance of a group of queries.

B. Bitmap compression

We use the WAH algorithm to compress the bitmaps in our implementation. Because the Distributed Bitmap Join algorithm relies on bitmaps for its performance, it must deal with the space waste caused by the often very sparse distribution of 1-bits in the bitmaps. According to the experiments by Wu et al. [16], the performance of the WAH algorithm is only slightly worse than that of Oracle's BBC algorithm [21] when the bit density is in [0.0001, 0.5], while logical operations on WAH-compressed bitmaps are much faster than on BBC. Because WAH allows us to perform logical operations on compressed bitmaps directly, without decompression, it is desirable in our algorithm.
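To make the scheme concrete, a simplified WAH-style encoder over 31-bit groups might look as follows. This is an illustrative sketch of the encoding rule from [16], not a complete implementation: decoding and the logical operations that run directly on the compressed words are omitted.

#include <cstdint>
#include <vector>

// Each output word is either a literal (MSB 0, 31 payload bits) or a fill
// (MSB 1, next bit = fill value, low 30 bits = run length in 31-bit groups).
std::vector<uint32_t> wah_encode(const std::vector<uint32_t>& groups) {
  const uint32_t ALL1 = 0x7FFFFFFFu;        // a group carries 31 payload bits
  std::vector<uint32_t> out;
  for (size_t i = 0; i < groups.size(); ) {
    uint32_t g = groups[i] & ALL1;
    if (g == 0 || g == ALL1) {              // run of identical groups -> fill word
      uint32_t fill = (g == ALL1);
      uint32_t len = 0;
      while (i < groups.size() && (groups[i] & ALL1) == g && len < (1u << 30) - 1) {
        ++len;
        ++i;
      }
      out.push_back(0x80000000u | (fill << 30) | len);
    } else {                                // mixed group -> literal word
      out.push_back(g);
      ++i;
    }
  }
  return out;
}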
V. EXPERIMENTS

In this section we provide comparative experiments between our algorithm and the no-partition hash join algorithm. We choose no-partition hash join for comparison because hash join is better than sort-merge join in general [13], and no-partition hash join performs well in the presence of data skew compared to the shared-partition, independent-partition, and radix-partition hash join algorithms [5]. The experiments are implemented in C++ and use the interfaces provided by the C/C++ header files pthread.h and numa.h [22] [23] to control threads and memory access in the NUMA architecture. The source code of our experiments is partly based on the code provided by Kim et al. [13], which is written in C and implements several parallel hash join algorithms for the SMP architecture; we reuse its thread control and relation table structures in our implementation.
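For reference, pinning a worker thread to a NUMA node and allocating its chunk from that node's local memory follows a pattern like the sketch below (a minimal illustration with libnuma; error handling, the thread-count choices, and the actual worker logic are elided, and the chunk size is an arbitrary placeholder):

#include <cstddef>
#include <cstdio>
#include <vector>
#include <numa.h>        // link with -lnuma
#include <pthread.h>

struct WorkerArg { int node; size_t bytes; };

void* worker(void* p) {
  WorkerArg* arg = static_cast<WorkerArg*>(p);
  numa_run_on_node(arg->node);              // pin this thread to its NUMA node
  void* chunk = numa_alloc_onnode(arg->bytes, arg->node);  // node-local memory
  // ... build the local bitmaps/dictionaries and run the join on this chunk ...
  numa_free(chunk, arg->bytes);
  return nullptr;
}

int main() {
  if (numa_available() < 0) { std::fprintf(stderr, "no NUMA support\n"); return 1; }
  int nodes = numa_max_node() + 1;
  std::vector<pthread_t> threads(nodes);
  std::vector<WorkerArg> args(nodes);
  for (int n = 0; n < nodes; ++n) {
    args[n] = {n, size_t(1) << 20};         // 1 MB placeholder chunk per node
    pthread_create(&threads[n], nullptr, worker, &args[n]);
  }
  for (int n = 0; n < nodes; ++n) pthread_join(threads[n], nullptr);
  return 0;
}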

A. Platform and experimental data design

Our experiments run on an IBM X Series X3950 X5 server with 1 TB of memory and eight CPUs clocked at 2.40 GHz with 10 physical cores each. The operating system is 64-bit OpenSUSE Enterprise Edition. The data set used in the experiments includes three relations, A, B, and C, with 1,000,000 tuples, 600,000 tuples, and 800,000 tuples respectively. 300,000 tuples can be joined between A and B, while 400,000 tuples can be joined between A and C. In the experiments, all data resides in main memory.

The query set used in the experiments has the following properties. Assuming the first query in the query set is A ⋈ B, a subsequent query is regarded as overlapped with the previous one if it is also A ⋈ B, and as independent otherwise. Overlapped queries can reuse intermediate results, so the performance of the Distributed Bitmap Join algorithm depends to a large extent on the overlap ratio of the query set. In the experiments, only the intermediate result of A ⋈ B is recorded; in this way we simulate the condition that w_i can find M_{S⋈Ri} in its local memory for A ⋈ B but cannot find it for A ⋈ C. We evaluate our algorithm with query sets of different overlap ratios.

Fig. 1. Comparison of execution time under different overlap ratios.
Fig. 2. Comparison of execution time for each phase of both algorithms.

The queries have the form:

Select * from R, S where R.key = S.key

Each tuple in relations R and S consists of a 64-bit key and a 64-bit payload. We compare the performance of Distributed Bitmap Join and no-partition hash join from two aspects: the running time of both algorithms under different overlap ratios, and the running time of both algorithms with different numbers of threads.

1) Comparison under different overlap ratios: Fig. 1 shows the comparison between Distributed Bitmap Join and no-partition hash join as the overlap ratio of the query set ranges from 40% to 100%. The performance of Distributed Bitmap Join is better than that of no-partition hash join, and its running time keeps decreasing as the overlap ratio grows. The execution time of no-partition hash join, on the other hand, barely changes with the overlap ratio, because the hash join algorithm itself is not affected by overlapped queries.

If we view both join algorithms as consisting of a build phase, which establishes the dictionaries, bitmaps, and hash indexes, and a join phase, which matches tuples, then the execution time can be divided into two parts, as shown in Fig. 2. When the overlap ratio increases, the running time of both the build phase and the join phase of Distributed Bitmap Join decreases continuously, while the build phase of no-partition hash join increases slightly. For the join phase of our algorithm, the decrease is due to the reuse of intermediate results, which reduces both the computation cost and the communication cost of the join operations; the decrease in the build phase can be attributed to the reuse of the various structures, e.g., the bitmaps and dictionaries. Once one join between relation A and relation B has been performed, there is no need to create the dictionaries and bitmaps again. Thus, the amortization of the build-phase computation cost leads to relatively high overall performance. The slight increase in the build phase of no-partition hash join occurs because relation C is larger than relation B.

Fig. 3. Comparison of execution time under different numbers of threads.

2) Comparison under different numbers of threads: Fig. 3 shows the execution time of both algorithms as the number of threads varies. In this experiment the relation sizes do not change, and every relation is divided into equal-sized runs among all threads. The figure clearly shows that the performance of Distributed Bitmap Join deteriorates with an increasing number of threads, while that of no-partition hash join improves. The deterioration in the join phase of Distributed Bitmap Join is due to the increase in communication cost: since the communication cost is (N × (N − 1) × sizeof(bitmap) + t × sizeof(dictionary entry) + t × sizeof(S tuple)) / speedof(QPI), an increase in N increases the communication cost. The degradation in the build phase, on the other hand, can be attributed to the growing computation cost of creating the bitmaps and dictionaries. Although Distributed Bitmap Join is worse than no-partition hash join in Fig. 3, we emphasize again that the focus of our algorithm is the optimization of a group of queries rather than a single query.
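To see why, note that in the cost model above the bitmap-exchange term grows quadratically in N while the t-dependent terms are independent of N: going from N = 4 to N = 8 threads, for example, scales N × (N − 1) × sizeof(bitmap) by (8 × 7) / (4 × 3) ≈ 4.7, while t × sizeof(dictionary entry) and t × sizeof(S tuple) stay unchanged.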
In fact, the major reason for the performance degradation of the Distributed Bitmap Join algorithm is that it has to build bitmaps and dictionaries for all threads.

When it comes to a group of queries, this cost can be amortized among all the queries, thus optimizing the overall performance, as illustrated in Fig. 1. The experimental results show that even though the Distributed Bitmap Join algorithm is worse than no-partition hash join for a single query, it outperforms no-partition hash join for a group of queries. In addition, because it is not sensitive to data skew, the Distributed Bitmap Join algorithm performs better in that case as well.

VI. CONCLUSION

With the advances in hardware technology and the growing demand for real-time response from database applications, the IMDB has become a central focus in this field. Because of its architectural advantages, NUMA, which is designed for multi-core systems such that each processor accesses its local memory faster than shared memory, effectively mitigates the memory-access-related starvation problem. In this paper, we propose the Distributed Bitmap Join algorithm, which aims at optimizing the join, one of the most expensive operations in database applications, by reducing communication cost and reusing intermediate results. The comparison with no-partition hash join shows that our method outperforms the competitor in execution time for a group of queries, and the performance improves further as the overlap ratio of the query set increases.

Acknowledgement: This work is partially supported by Shanghai Leading Academic Discipline Project No. B412, the National Science Foundation of China under grants No. , No. , and No. , and the National 973 Program under grant No. 2010CB.

REFERENCES

[1] S. Negash, "Business intelligence," Communications of the Association for Information Systems, vol. 13, no. 1.
[2] Y. Li, I. Pandis, R. Mueller, V. Raman, and G. Lohman, "NUMA-aware algorithms: the case of data shuffling," in CIDR.
[3] Reilly, "When multicore isn't enough: Trends and the future for multi-multicore systems," in Proceedings of the Workshop on High Performance Embedded Computing.
[4] H. Garcia-Molina and K. Salem, "Main memory database systems: An overview," IEEE Transactions on Knowledge and Data Engineering, vol. 4, no. 6.
[5] S. Blanas, Y. Li, and J. M. Patel, "Design and evaluation of main memory hash join algorithms for multi-core CPUs," in SIGMOD Conference, 2011.
[6] Z. Majo and T. R. Gross, "Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead," in ACM SIGPLAN Notices, vol. 46, no. 11. ACM, 2011.
[7] M. Hassan and M. Bamha, "Semi-join computation on distributed file systems using map-reduce-merge model," in Proceedings of the 2010 ACM Symposium on Applied Computing. ACM, 2010.
[8] "Solid announces general availability of BoostEngine 4.0, the first on-disk/in-memory hybrid database manager," PR Newswire.
[9] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, "SAP HANA database: data management for modern business applications," ACM SIGMOD Record, vol. 40, no. 4.
[10] S. Manegold, P. Boncz, and M. Kersten, "Optimizing main-memory join on modern hardware," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 4.
[11] A. Kemper and T. Neumann, "HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots," in Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 2011.
[12] S. Blanas and J. M. Patel, "How efficient is our radix join implementation?"
[13] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, and P. Dubey, "Sort vs. hash revisited: fast join implementation on modern multi-core CPUs," Proceedings of the VLDB Endowment, vol. 2, no. 2.
[14] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry, "Improving hash join performance through prefetching," ACM Transactions on Database Systems (TODS), vol. 32, no. 3.
[15] M.-C. Albutiu, A. Kemper, and T. Neumann, "Massively parallel sort-merge joins in main memory multi-core database systems," Proceedings of the VLDB Endowment, vol. 5, no. 10.
[16] K. Wu, E. J. Otoo, and A. Shoshani, "Compressing bitmap indexes for faster search operations," in Scientific and Statistical Database Management, Proceedings of the 14th International Conference on. IEEE, 2002.
[17] K. Wu, E. Otoo, and A. Shoshani, "On the performance of bitmap indices for high cardinality attributes," in Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30. VLDB Endowment, 2004.
[18] K. Madduri and K. Wu, "Efficient joins with compressed bitmap indexes," in Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 2009.
[19] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey, "Efficient implementation of sorting on multi-core SIMD CPU architecture," Proceedings of the VLDB Endowment, vol. 1, no. 2.
[20] D. E. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching.
[21] G. Antoshenkov, "Byte-aligned bitmap compression," in Data Compression Conference, DCC '95, Proceedings. IEEE, 1995.
[22] numa(3) - Linux man page.
[23] A. Kleen, "A NUMA API for Linux," Novell Inc.


More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Crescando: Predictable Performance for Unpredictable Workloads

Crescando: Predictable Performance for Unpredictable Workloads Crescando: Predictable Performance for Unpredictable Workloads G. Alonso, D. Fauser, G. Giannikis, D. Kossmann, J. Meyer, P. Unterbrunner Amadeus S.A. ETH Zurich, Systems Group (Funded by Enterprise Computing

More information

Job Re-Packing for Enhancing the Performance of Gang Scheduling

Job Re-Packing for Enhancing the Performance of Gang Scheduling Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT

More information

Composite Group-Keys

Composite Group-Keys Composite Group-Keys Space-efficient Indexing of Multiple Columns for Compressed In-Memory Column Stores Martin Faust, David Schwalb, and Hasso Plattner Hasso Plattner Institute for IT Systems Engineering

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Accelerating RDBMS Operations Using GPUs

Accelerating RDBMS Operations Using GPUs Ryerson University From the SelectedWorks of Jason V Ma Fall 2013 Accelerating RDBMS Operations Using GPUs Jason V Ma, Ryerson University Available at: https://works.bepress.com/jason_ma/1/ Accelerating

More information

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) 4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

CS122 Lecture 15 Winter Term,

CS122 Lecture 15 Winter Term, CS122 Lecture 15 Winter Term, 2014-2015 2 Index Op)miza)ons So far, only discussed implementing relational algebra operations to directly access heap Biles Indexes present an alternate access path for

More information

TagFS: A Fast and Efficient Tag-Based File System

TagFS: A Fast and Efficient Tag-Based File System TagFS: A Fast and Efficient Tag-Based File System 6.033 Design Project 1 Yanping Chen yanpingc@mit.edu Dan Ports (TR11) 3/17/2011 1. Overview A typical problem in directory-based file-systems is searching

More information

Bitmap Index Partition Techniques for Continuous and High Cardinality Discrete Attributes

Bitmap Index Partition Techniques for Continuous and High Cardinality Discrete Attributes Bitmap Index Partition Techniques for Continuous and High Cardinality Discrete Attributes Songrit Maneewongvatana Department of Computer Engineering King s Mongkut s University of Technology, Thonburi,

More information

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES

CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES Angad Kataria, Simran Khurana Student,Department Of Information Technology Dronacharya College Of Engineering,Gurgaon Abstract- Hardware trends

More information

Accelerating Foreign-Key Joins using Asymmetric Memory Channels

Accelerating Foreign-Key Joins using Asymmetric Memory Channels Accelerating Foreign-Key Joins using Asymmetric Memory Channels Holger Pirk Stefan Manegold Martin Kersten holger@cwi.nl manegold@cwi.nl mk@cwi.nl Why? Trivia: Joins are important But: Many Joins are (Indexed)

More information

class 9 fast scans 1.0 prof. Stratos Idreos

class 9 fast scans 1.0 prof. Stratos Idreos class 9 fast scans 1.0 prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/ 1 pass to merge into 8 sorted pages (2N pages) 1 pass to merge into 4 sorted pages (2N pages) 1 pass to merge into

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Modification and Evaluation of Linux I/O Schedulers

Modification and Evaluation of Linux I/O Schedulers Modification and Evaluation of Linux I/O Schedulers 1 Asad Naweed, Joe Di Natale, and Sarah J Andrabi University of North Carolina at Chapel Hill Abstract In this paper we present three different Linux

More information

Record Placement Based on Data Skew Using Solid State Drives

Record Placement Based on Data Skew Using Solid State Drives Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories, NEC j-suzuki@ax.jp.nec.com

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Semi supervised clustering for Text Clustering

Semi supervised clustering for Text Clustering Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering

More information

Database Architectures

Database Architectures Database Architectures CPS352: Database Systems Simon Miner Gordon College Last Revised: 4/15/15 Agenda Check-in Parallelism and Distributed Databases Technology Research Project Introduction to NoSQL

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

HYRISE In-Memory Storage Engine

HYRISE In-Memory Storage Engine HYRISE In-Memory Storage Engine Martin Grund 1, Jens Krueger 1, Philippe Cudre-Mauroux 3, Samuel Madden 2 Alexander Zeier 1, Hasso Plattner 1 1 Hasso-Plattner-Institute, Germany 2 MIT CSAIL, USA 3 University

More information

New Bucket Join Algorithm for Faster Join Query Results

New Bucket Join Algorithm for Faster Join Query Results The International Arab Journal of Information Technology, Vol. 12, No. 6A, 2015 701 New Bucket Algorithm for Faster Query Results Hemalatha Gunasekaran 1 and ThanushkodiKeppana Gowder 2 1 Department Of

More information

Efficient Common Items Extraction from Multiple Sorted Lists

Efficient Common Items Extraction from Multiple Sorted Lists 00 th International Asia-Pacific Web Conference Efficient Common Items Extraction from Multiple Sorted Lists Wei Lu,, Chuitian Rong,, Jinchuan Chen, Xiaoyong Du,, Gabriel Pui Cheong Fung, Xiaofang Zhou

More information

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE) COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE) PRESENTATION BY PRANAV GOEL Introduction On analytical workloads, Column

More information

Part IV. Chapter 15 - Introduction to MIMD Architectures

Part IV. Chapter 15 - Introduction to MIMD Architectures D. Sima, T. J. Fountain, P. Kacsuk dvanced Computer rchitectures Part IV. Chapter 15 - Introduction to MIMD rchitectures Thread and process-level parallel architectures are typically realised by MIMD (Multiple

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

C has been and will always remain on top for performancecritical

C has been and will always remain on top for performancecritical Check out this link: http://spectrum.ieee.org/static/interactive-the-top-programminglanguages-2016 C has been and will always remain on top for performancecritical applications: Implementing: Databases

More information

File System Interface and Implementation

File System Interface and Implementation Unit 8 Structure 8.1 Introduction Objectives 8.2 Concept of a File Attributes of a File Operations on Files Types of Files Structure of File 8.3 File Access Methods Sequential Access Direct Access Indexed

More information

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup Yan Sun and Min Sik Kim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington

More information

Data Structure Optimization of AS_PATH in BGP

Data Structure Optimization of AS_PATH in BGP Data Structure Optimization of AS_PATH in BGP Weirong Jiang Research Institute of Information Technology, Tsinghua University, Beijing, 100084, P.R.China jwr2000@mails.tsinghua.edu.cn Abstract. With the

More information

Baoping Wang School of software, Nanyang Normal University, Nanyang , Henan, China

Baoping Wang School of software, Nanyang Normal University, Nanyang , Henan, China doi:10.21311/001.39.7.41 Implementation of Cache Schedule Strategy in Solid-state Disk Baoping Wang School of software, Nanyang Normal University, Nanyang 473061, Henan, China Chao Yin* School of Information

More information

Architecture and Implementation of Database Systems (Winter 2014/15)

Architecture and Implementation of Database Systems (Winter 2014/15) Jens Teubner Architecture & Implementation of DBMS Winter 2014/15 1 Architecture and Implementation of Database Systems (Winter 2014/15) Jens Teubner, DBIS Group jens.teubner@cs.tu-dortmund.de Winter 2014/15

More information