Join Processing for Flash SSDs: Remembering Past Lessons


Jaeyoung Do
Univ. of Wisconsin-Madison

Jignesh M. Patel
Univ. of Wisconsin-Madison

ABSTRACT

Flash solid state drives (SSDs) provide an attractive alternative to traditional magnetic hard disk drives (HDDs) for DBMS applications. Naturally, there is substantial interest in redesigning critical database internals, such as join algorithms, for flash SSDs. However, we must carefully consider the lessons that we have learnt from over three decades of designing and tuning algorithms for magnetic HDD-based systems, so that we continue to reuse techniques that worked for magnetic HDDs and also work with flash SSDs. The focus of this paper is on recalling some of these lessons in the context of ad hoc join algorithms. Based on an actual implementation of four common ad hoc join algorithms on both a magnetic HDD and a flash SSD, we show that many of the surprising results from magnetic HDD-based join methods also hold for flash SSDs. These results include the superiority of block nested loops join over sort-merge join and Grace hash join in many cases, and the benefits of blocked I/Os. In addition, we find that simply looking at the I/O costs when designing new flash SSD join algorithms can be problematic, as the CPU cost is often a bigger component of the total join cost with SSDs. We hope that these results provide insights and better starting points for researchers designing new join algorithms for flash SSDs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Proceedings of the Fifth International Workshop on Data Management on New Hardware (DaMoN 2009), June 28, 2009, Providence, Rhode Island. Copyright 2009 ACM ...$10.00.

1. INTRODUCTION

Flash solid state drives (SSDs) are actively being considered as storage alternatives to replace, or dramatically reduce the central role of, magnetic hard disk drives (HDDs) as the main choice for storing persistent data. Jim Gray's prediction of "flash is disk, disk is tape, and tape is dead" [7] is coming close to reality in many applications. Flash SSDs, which are made by packaging (NAND) flash chips, offer several advantages over magnetic HDDs, including faster random reads and lower power consumption. Moreover, as flash densities continue to double as predicted in [9], and prices continue to drop, the appeal of flash SSDs for DBMSs increases. In fact, vendors such as Fusion-IO and HP sell flash-based devices as I/O accelerators for many data-intensive workloads.

The appeal of flash SSDs is also attracting interest in redesigning various aspects of DBMS internals for flash SSDs. One such aspect that is becoming attractive as a research topic is join processing algorithms, as it is well known that joins can be expensive and can play a critical role in determining the overall performance of the DBMS. While such efforts are well-motivated, we want to approach a redesign of database query processing algorithms by clearly recalling the lessons that the community has learnt from over three decades of research in query processing algorithms.
The focus of this paper is on recalling some of the important lessons that we have learnt about efficient join processing on magnetic HDDs, and on determining whether these lessons also apply to joins using flash SSDs. In addition, if previous techniques for tuning and improving join performance also work for flash SSDs, then that changes what the interesting starting points are for comparing new SSD-based join algorithms.

In the case of join algorithms, a lot is known about how to optimize joins with magnetic HDDs to use the available buffer memory effectively, and to account for the characteristics of the I/O subsystem. Specifically, the paper by Haas et al. [8] derives detailed formulae for buffer allocation for the various phases of common join algorithms such as block nested loops join, sort-merge join, Grace hash join, and hybrid hash join. Their results show that the right buffer pool allocation strategy can have a huge impact (up to 400% improvement in some cases). Furthermore, the relative performance of the join algorithms changes once the buffer allocations are optimized: block nested loops join is much more versatile, and Grace hash join is often not very competitive.

Forgetting these lessons can lead to an incorrect starting point for comparing new flash SSD join algorithms. For example, the comparison of RARE-join [16] with Grace hash join [10] to show the superiority of the RARE-join algorithm on flash SSDs is potentially not the right starting point. (It is possible that RARE-join is superior to the best magnetic HDD-based join method when run over flash SSDs, but this question has not been answered conclusively.) As we show in this paper, even block nested loops join far outperforms Grace hash join in most cases, on both magnetic HDDs and flash SSDs. Cautiously, we note that we have tried only one specific magnetic HDD and one specific flash SSD, but even this very first test produced interesting results. (As part of future work, we want to look at a wider range of hardware for both flash SSDs and magnetic HDDs.)

The focus of this paper is on investigating four popular ad hoc join algorithms, namely block nested loops join, sort-merge join, Grace hash join, and hybrid hash join, on both flash SSDs and magnetic HDDs. We start with the best buffer allocation methods proposed in [8] for these join algorithms, and first ask the question: What changes for these algorithms as we replace a magnetic HDD with a flash SSD? Then, we study the effect of changing the buffer pool sizes and the page sizes, and examine the impact of these changes on the join algorithms. Our results show that many of the techniques that were invented for joins on magnetic HDDs continue to hold for flash SSDs. As an example, blocked I/O is useful on both magnetic HDDs and flash SSDs, though for different reasons. In the case of magnetic HDDs, the use of blocked I/O amortizes the cost of disk seeks and rotational delays. On the other hand, the benefit of blocked I/O with flash SSDs comes from amortizing the latency associated with the software layers of flash SSDs, and from generating fewer erase operations when writing data.

The remainder of this paper is organized as follows. In Section 2, we briefly introduce the characteristics of flash SSDs. Then we introduce the four classic join algorithms, with the appropriate assumptions and buffer allocation strategies, in Section 3. In Section 4, we explain and discuss the experimental results. After reviewing related work in Section 5, we conclude in Section 6.

2. CHARACTERISTICS OF FLASH SSD

Flash SSDs are based on NAND flash memory chips and use a controller to provide a persistent block device interface. A flash chip stores information in an array of memory cells. A chip is divided into a number of flash blocks, and each flash block contains several flash pages. Each memory cell is set to 1 by default. To change a value to 0, the entire block has to be erased (setting every cell back to 1), followed by selectively programming the desired cells to 0. Read and write operations are performed at the granularity of a flash page. On the other hand, the time-consuming erase operations can only be done at the level of a flash block. Considering the typical size of a flash page (4 KB) and a flash block (64 flash pages), the erase-before-write constraint can significantly degrade write performance. In addition, most flash chips only support a limited number of erase operations per flash block. Therefore, erase operations should be distributed across the flash blocks to prolong the service life of the flash chips.

These kinds of constraints are handled by a software layer known as the flash translation layer (FTL). The major role of the FTL is to provide address mappings between the logical block addresses (LBAs) and flash pages. The FTL maintains two kinds of data structures: a direct map from LBAs to flash pages, and an inverse map for rebuilding the direct map during recovery. While the inverse map is stored on flash, the direct map is stored on flash and at least partially in RAM to support fast lookups. If the necessary portion of the direct map is not in RAM, it must be swapped in from flash as required. While flash SSDs have no mechanically moving parts, data access still incurs some latency, due to the overheads associated with the FTL logic. However, the latencies of flash SSDs are typically much smaller than those of magnetic HDDs [4].
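To make the role of the FTL concrete, the following minimal sketch models a page-mapping FTL. It is illustrative only: the class, constants, and policy below are our own simplifications, not the design of the device used in this paper or of any real controller. It shows two points made above: writes are serviced out of place (so a small write does not force an immediate block erase), and every request goes through a direct map lookup.

```python
# Toy page-mapping FTL sketch (illustrative names and geometry; not a real device's design).
PAGES_PER_BLOCK = 64      # a flash block holds 64 flash pages (Section 2)
NUM_BLOCKS = 1024         # assumed capacity for this example

class ToyFTL:
    def __init__(self):
        self.direct_map = {}                    # LBA -> physical (block, page); kept (partly) in RAM
        self.next_block, self.next_page = 0, 0  # current write frontier
        self.erase_counts = [0] * NUM_BLOCKS    # per-block wear information

    def write(self, lba, data):
        """Out-of-place update: program a fresh flash page and remap the LBA,
        instead of erasing the block that holds the old copy of this page."""
        if self.next_page == PAGES_PER_BLOCK:             # current block is full
            self.next_block = (self.next_block + 1) % NUM_BLOCKS
            self.erase_counts[self.next_block] += 1       # erase before reuse (copy-out of live pages elided)
            self.next_page = 0
        self.direct_map[lba] = (self.next_block, self.next_page)
        self.next_page += 1                               # the actual flash program of `data` is omitted

    def read(self, lba):
        """Return the physical location recorded in the direct map; this lookup is
        part of the FTL overhead that gives SSDs latency despite having no moving parts."""
        return self.direct_map.get(lba)
```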
3. JOINS

In this section, we introduce the four classic ad hoc join algorithms that we consider in this paper, namely: block nested loops join, sort-merge join, Grace hash join, and hybrid hash join.

The two relations being joined are denoted as R and S. We use |R|, |S|, and B to denote the sizes of the two relations and the size of the buffer pool, all in pages, respectively. We also assume that |R| <= |S|. Each join algorithm needs some extra space to build and maintain specific data structures, such as a hash table or a tournament tree. In order to model these structures, we use a multiplicative fudge factor, denoted as F.

Next we briefly describe each join algorithm, and outline the buffer allocation strategy for each one. The I/O costs for writing the final results are omitted in the discussion below, as this cost is identical for all the join methods. For the buffer allocation strategy, we directly use the recommendations of Haas et al. [8] (derived for magnetic HDDs), which show that optimizing the buffer allocations can dramatically improve the performance of join algorithms (by up to 400% in some cases).

3.1 Block Nested Loops Join

Block nested loops join first logically splits the smaller relation R into same-sized chunks. For each chunk of R that is read, a hash table is built to efficiently find matching pairs of tuples. Then, all of S is scanned, and the hash table is probed with its tuples. To model the additional space required to build a hash table for a chunk of R, we use the fudge factor F, so a chunk of C pages uses F*C pages in memory to store the chunk together with its hash table. The buffer pool is simply divided into two parts: one part, I_outer, is an input buffer that holds an R chunk along with its hash table, and the other, I_inner, is an input buffer used to scan S. Note that reading R in chunks of size I_outer/F (= C) guarantees sufficient memory to build an in-memory hash table for that chunk [5]. (A simplified sketch of this algorithm is given below, after Section 3.2.)

3.2 Sort-Merge Join

Sort-merge join starts by producing sorted runs of each of R and S. After R and S are sorted into runs on disk, sort-merge join reads the runs of both relations and merges/joins them. We use tournament sort (a.k.a. heap sort) in the first pass, which produces runs that on average are twice the size of the memory used for the initial sorting [11]. We also assume B > sqrt(F*|S|), so that the sort-merge join, which uses a tournament tree, can be executed in two passes [17]. In the first pass, the buffer pool is divided into three parts: an input buffer, an output buffer, and working space (WS) to maintain the tournament tree. During the second pass, the buffer pool is split across all the runs of R and S as evenly as possible.
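The following minimal sketch (illustrative Python; the function and parameter names are ours, and this is not the paper's implementation) shows the structure of the block nested loops join described in Section 3.1: each chunk of R is hashed in memory, and all of S is scanned once per chunk.

```python
from collections import defaultdict

def block_nested_loops_join(scan_r, scan_s, key_r, key_s, chunk_tuples):
    """scan_r()/scan_s() re-scan the relations; chunk_tuples plays the role of
    I_outer/F, i.e., how many R tuples fit in memory together with their hash table."""
    results = []

    def join_chunk(chunk):
        table = defaultdict(list)            # in-memory hash table on this chunk of R
        for r in chunk:
            table[key_r(r)].append(r)
        for s in scan_s():                   # one full scan of S per chunk of R
            for r in table.get(key_s(s), []):
                results.append((r, s))

    chunk = []
    for r in scan_r():
        chunk.append(r)
        if len(chunk) == chunk_tuples:       # chunk (plus its hash table) fills I_outer
            join_chunk(chunk)
            chunk = []
    if chunk:
        join_chunk(chunk)
    return results
```

Since the chunk size corresponds to I_outer/F, the number of scans of S shrinks as the buffer pool grows, which is one reason block nested loops join benefits so much from a larger buffer pool.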

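A similarly compact sketch of the two-pass sort-merge join of Section 3.2 is shown below. It is again illustrative rather than the paper's engine: for brevity, run generation sorts fixed-size chunks instead of using replacement selection with a tournament tree, and the runs are kept in memory rather than written to disk.

```python
import heapq

def make_runs(scan, key, run_tuples):
    # Pass 1: produce sorted runs. A real engine writes each run to disk;
    # here the runs are kept as in-memory lists for clarity.
    runs, buf = [], []
    for t in scan():
        buf.append(t)
        if len(buf) == run_tuples:
            runs.append(sorted(buf, key=key))
            buf = []
    if buf:
        runs.append(sorted(buf, key=key))
    return runs

def sort_merge_join(scan_r, scan_s, key_r, key_s, run_tuples):
    # Pass 2: merge all runs of R and of S, and join groups of equal keys.
    r_iter = heapq.merge(*make_runs(scan_r, key_r, run_tuples), key=key_r)
    s_iter = heapq.merge(*make_runs(scan_s, key_s, run_tuples), key=key_s)
    results = []
    r, s = next(r_iter, None), next(s_iter, None)
    while r is not None and s is not None:
        if key_r(r) < key_s(s):
            r = next(r_iter, None)
        elif key_r(r) > key_s(s):
            s = next(s_iter, None)
        else:
            k = key_s(s)
            group = []                       # all S tuples with the current key
            while s is not None and key_s(s) == k:
                group.append(s)
                s = next(s_iter, None)
            while r is not None and key_r(r) == k:
                results.extend((r, t) for t in group)
                r = next(r_iter, None)
    return results
```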
3.3 Grace Hash Join

Grace hash join has two phases. In the first phase, it reads each relation, applies a hash function to the input tuples, and hashes the tuples into buckets that are written to disk. In the second phase, the first bucket of R is loaded into the buffer pool, and a hash table is built on it. Then, the corresponding bucket of S is read and used to probe the hash table. The remaining buckets of R and S are handled in the same way, iteratively. We assume B > sqrt(F*|R|), to allow for a two-phase Grace hash join [17].

There are two sections in the buffer pool during the first (partitioning) phase: one input buffer, and an output buffer for each of the k buckets. We subdivide the output buffer as evenly as possible based on the number of buckets, and then give the remaining pages, if any, to the input buffer. In the second phase, a portion of the buffer pool (WS) is used for the i-th bucket of R and its hash table, and the remaining pages are used as input buffer pages to read S.

3.4 Hybrid Hash Join

As in Grace hash join, there are two phases in this algorithm, again assuming B > sqrt(F*|R|). In the first phase, the relation R is read and hashed into buckets. Since a portion of the buffer pool is reserved as an in-memory hash bucket for R, this bucket of R is not written to the storage device, while the other buckets are. Furthermore, as S is read and hashed, tuples of S that match the in-memory R bucket can be joined immediately, and need not be written to disk. The second phase is the same as in Grace hash join.

During the first phase, the buffer pool is divided into three parts: one for the input, one for the output of the k buckets excluding the in-memory bucket, and the working space (WS) for the in-memory bucket. The buffer allocation scheme for the second phase is the same as in Grace hash join.

3.5 Buffer Allocation

Table 1 shows the optimal buffer allocations for the four join algorithms, for each phase of these algorithms. Note that block nested loops join does not distinguish between the different passes. In this paper, we use the same buffer allocation method for both flash SSDs and magnetic HDDs. While these allocations may not be optimal for flash SSDs, our goal here is to start with the best allocation strategy for magnetic HDDs and explore what happens if we simply use the same settings when replacing a magnetic HDD with a flash SSD.

[Table 1: Buffer allocations for the join algorithms, from Haas et al. [8]. For each phase of each algorithm, the table gives formulas for the input and output buffer sizes (e.g., I_outer and I_inner for block nested loops join), the working space WS, and the number of buckets k; Ds, Dl, and Dx denote the average seek time, latency, and page transfer time of the magnetic HDD, respectively. The detailed formulas are not reproduced here.]
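To make the partition and probe structure of Sections 3.3 and 3.4 concrete, the sketch below (illustrative Python, not the paper's implementation) shows a Grace-style hash join; the comments in the partitioning loops indicate where the hybrid variant would keep bucket 0 resident in memory and join against it immediately instead of writing it out.

```python
from collections import defaultdict

def grace_hash_join(scan_r, scan_s, key_r, key_s, k):
    """Phase 1: partition both relations into k buckets with the same hash function
    (a real engine writes the buckets to disk; here they are in-memory lists).
    Phase 2: for each bucket pair, build a hash table on R's bucket and probe it with S's."""
    r_buckets = [[] for _ in range(k)]
    s_buckets = [[] for _ in range(k)]
    for r in scan_r():
        r_buckets[hash(key_r(r)) % k].append(r)   # hybrid hash join would keep bucket 0
                                                  # in memory instead of writing it out
    for s in scan_s():
        s_buckets[hash(key_s(s)) % k].append(s)   # ...and join bucket-0 S tuples immediately,
                                                  # never writing them to disk
    results = []
    for rb, sb in zip(r_buckets, s_buckets):      # Phase 2: build and probe, bucket by bucket
        table = defaultdict(list)
        for r in rb:
            table[key_r(r)].append(r)
        for s in sb:
            for r in table.get(key_s(s), []):
                results.append((r, s))
    return results
```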
4. EXPERIMENTS

In this section, we compare the performance of the join algorithms, using the optimal buffer allocations, when a magnetic HDD and when a flash SSD is used for storing the input data sets. Each data point presented here is the average over three runs.

4.1 Experimental Setup

We implemented a single-threaded, light-weight database engine that uses the SQLite3 [1] page format for storing relational data in heap files. Each heap file is stored as a file in the operating system, and the average page utilization is 80%. Our engine has a buffer pool that allows us to control the allocation of pages to the different components of the join algorithms. The engine has no concurrency control or recovery mechanisms.

Our experiments were performed on a dual-core 3.2 GHz Intel Pentium machine with 1 GB of RAM, running Red Hat Enterprise Linux 5. For the comparison, we used a 5400 RPM TOSHIBA 320 GB external HDD and an OCZ Core Series 60 GB SATA II 2.5 inch flash SSD. We used wall clock time as the measure of execution time, and calculated the I/O time by subtracting the reported CPU time from the wall clock time. Since synchronous I/Os were used for all tests, we assumed that there is no overlap between the I/O and the computation. We also used direct I/O, so that the database engine transfers data directly from/to the buffer pool, bypassing the OS cache (so there is no prefetching and no double buffering). With this setup, all the join numbers reported here are cold numbers.
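As an illustration of the kind of I/O request the engine issues, the snippet below (a Linux-specific sketch, not the paper's code; the constants and the file name in the usage comment are assumptions) opens a heap file with O_DIRECT so that reads bypass the OS page cache, and fetches one aligned, block-sized chunk per system call, as blocked I/O does.

```python
import os, mmap

PAGE = 8 * 1024          # database page size (one of the sizes studied in Section 4.4)
BLOCK = 16 * PAGE        # blocked I/O: many pages per request (the factor is illustrative)

def read_block_direct(path, offset):
    """Read BLOCK bytes at `offset` using O_DIRECT (Linux only). O_DIRECT requires the
    buffer and the offset to be suitably aligned; an anonymous mmap is page-aligned."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, BLOCK)              # aligned destination buffer
        nread = os.preadv(fd, [buf], offset)    # single request for the whole block
        return buf[:nread]
    finally:
        os.close(fd)

# Example (hypothetical file): read_block_direct("heapfile.db", 0)
```

Page-sized I/O corresponds to setting BLOCK equal to PAGE; Section 4.5 compares the two.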

4.2 Data Set and Join Query

As our test query, we used a primary key/foreign key join between the TPC-H [2] customer and orders tables, generated with a scale factor of 30. The customer table contains 4,500,000 tuples (730 MB), and the orders table has 45,000,000 tuples (5 GB). Each tuple of both tables contains an unsigned 4 byte integer key (the customer key), and on average 130 and 90 bytes of padding for the customer and the orders tables, respectively. The data for both tables were stored in random order in the corresponding heap files. The characteristics of the magnetic HDD and the flash SSD, and the parameter values used in these experiments, are shown in Table 2.

Table 2: Device characteristics and parameter values
Magnetic HDD cost              | 0.36 $/GB
Magnetic HDD average seek time | 12 ms
Magnetic HDD latency           | 5.56 ms
Magnetic HDD transfer rate     | 34 MB/s
Flash SSD cost                 | 3.85 $/GB
Flash SSD latency              | 0.35 ms
Flash SSD read transfer rate   | 120 MB/s
Flash SSD write transfer rate  | 80 MB/s
Page size                      | 2 KB to 32 KB
Buffer pool size               | 100 MB to 600 MB
Fudge factor                   | 1.2
orders table size              | 5 GB
customer table size            | 730 MB

4.3 Effect of Varying the Buffer Pool Size

The effect of varying the buffer pool size from 100 MB to 600 MB is shown in Figure 1, for both the magnetic HDD and the flash SSD. We used blocked I/O to sequentially read and write multiple pages in each I/O operation; the size of the I/O block is calculated for each algorithm using the equations shown in Table 1. In Figure 1, the error bars denote the minimum and the maximum measured I/O times (across the three runs). The error bars for the CPU times are omitted, as their variation is usually less than 1% of the total join time.

[Figure 1: Varying the size of the buffer pool (8 KB page, blocked I/O); (a) magnetic HDD, (b) flash SSD.]

Table 3 shows the speedups of the total join times and of the I/O times for the four join algorithms under different buffer pool sizes. The results in Table 3 show that replacing the magnetic HDD with the flash SSD benefits all the join methods. Block nested loops join, whose I/O pattern is sequential reads, shows the biggest performance improvement, with speedup factors between 1.59X and 1.73X. (Interestingly, a case can be made that for sequential reads and writes, comparable or much higher speedups can be achieved with striped magnetic HDDs, for the same dollar cost [15].) The other join algorithms also performed better on the flash SSD than on the magnetic HDD, but with smaller speedups than block nested loops join. This is because the write transfer rate is slower than the read transfer rate on the flash SSD (see Table 2), and unexpected erase operations might degrade write performance further.

Table 3: Speedups of total join times and I/O times with the flash SSD
Buffer pool size | 100 MB       | 200 MB       | 300 MB       | 400 MB       | 500 MB       | 600 MB
Algorithm        | Join   I/O   | Join   I/O   | Join   I/O   | Join   I/O   | Join   I/O   | Join   I/O
BNL              | 1.64X  2.86X | 1.59X  2.62X | 1.72X  3.4X  | 1.73X  2.87X | 1.67X  2.79X | 1.65X  2.61X
SM               | 1.41X  1.81X | 1.45X  2.5X  | 1.44X  2.6X  | 1.45X  2.8X  | 1.43X  2.4X  | 1.48X  2.2X
GH               | 1.34X  1.54X | 1.29X  1.59X | 1.41X  1.62X | 1.33X  1.62X | 1.39X  1.77X | 1.3X   1.55X
HH               | 1.45X  1.77X | 1.55X  1.9X  | 1.35X  1.51X | 1.51X  1.89X | 1.5X   1.78X | 1.65X  2.9X

Table 4: Speedups of I/O times with the flash SSD, broken down by the first and second phases
Buffer pool size | Sort-Merge Join    | Grace Hash Join    | Hybrid Hash Join
                 | First    Second    | First    Second    | First    Second
100 MB           | 1.52X    3.0X      | 1.34X    2.27X     | 1.57X    2.61X
200 MB           | 1.83X    2.81X     | 1.43X    2.19X     | 1.66X    3.9X
300 MB           | 1.86X    2.79X     | 1.47X    2.12X     | 1.34X    2.32X
400 MB           | 1.9X     2.63X     | 1.47X    2.13X     | 1.7X     2.91X
500 MB           | 1.81X    2.86X     | 1.59X    2.44X     | 1.63X    2.83X
600 MB           | 2.0X     2.89X     | 1.31X    2.68X     | 1.98X    2.84X

As an example, different I/O speedups were achieved in the first and the second phases of the sort-merge join, as shown in Table 4. While the I/O speedup of the second phase was between 2.63X and 3.0X, due to faster random reads, the I/O speedup in the first phase (which has sequential writes as the dominant I/O pattern) was only between 1.52X and 2.0X, which reduced the overall speedup for sort-merge join.

In the case of Grace hash join, all the phases were executed with lower I/O speedups than those of the sort-merge join (see Table 4). Note that the dominant I/O pattern of Grace hash join is random writes in the first phase, followed by sequential reads in the second phase. While an I/O speedup between 2.12X and 2.68X was observed for the second phase of Grace hash join, the I/O speedup of its first phase was only between 1.31X and 1.59X, due to expensive erase operations. This indicates that algorithms that stress random reads, and avoid random writes as much as possible, are likely to see bigger improvements on flash SSDs (over magnetic HDDs).

While there is little variation in the I/O times with the magnetic HDD (see the error bars on the I/O bars in Figure 1(a)), we observed higher variations in the I/O times with the flash SSD (Figure 1(b)), resulting from the varying write performance. Note that since random writes cause more erase operations than sequential writes, the hash-based joins show a wider range of I/O variation than sort-merge join. On the other hand, there is little variation in the I/O costs for block nested loops join, regardless of the buffer pool size, since it does not incur any writes.

Another interesting observation that can be made here (Figure 1) is the relative I/O performance of the sort-merge join and Grace hash join. Both have similar I/O costs with the magnetic HDD, but sort-merge join has lower I/O costs with the flash SSD. This is mainly due to the different output patterns of the two join methods. In the first phase of the joins, where both algorithms incur about 80% of the total I/O cost, each writes intermediate results (sorted runs for sort-merge join, and buckets for Grace hash join) to disk in different ways: sort-merge join incurs sequential writes, as opposed to the random writes that are incurred by Grace hash join. While this difference in the output patterns has a substantial impact on the join performance with the flash SSD, because random writes generate more erase operations than sequential writes, the impact is relatively small with the magnetic HDD.

In addition, we cannot argue that the sort-merge join algorithm is better than the Grace hash join algorithm on flash SSDs based solely on the I/O performance results (as in [16]).

While looking at only the I/O characteristics is sometimes acceptable for magnetic HDDs (when the join algorithm is I/O bound), using flash SSDs makes the CPU costs more prominent. As seen in Figure 1(b), the CPU cost now dominates the I/O cost in most cases for block nested loops join and sort-merge join. In general, with flash SSDs the balance between CPU and I/O changes, which implies that when building systems with flash SSDs we may want to consider adding more CPU processing power.

From Figure 1, we notice that hybrid hash join outperformed all the other algorithms on both the magnetic HDD and the flash SSD, across the entire range of buffer pool sizes that we tested. Hybrid hash join is 2.81X faster than Grace hash join with a 600 MB buffer pool, indicating that comparing only with Grace hash join (as done in [14, 16]) could be misleading. Finally, note that in our experiments with the flash SSD, block nested loops join is faster than sort-merge join and Grace hash join for large buffer pool sizes, but slower than hybrid hash join, which is more CPU efficient (though it incurs a higher I/O cost!).

In summary:
1. Joins on flash SSDs have a greater tendency to become CPU-bound (rather than I/O-bound), so ways to improve the CPU performance, such as better cache utilization, are of greater importance with flash SSDs.
2. Trading random writes for random reads is likely a good design choice for flash SSDs.
3. Compared to sequential writes, random writes produce more I/O variation with flash SSDs, which makes the join performance less predictable.

4.4 Effect of Varying the Page Size

In this experiment, we continue to use blocked I/O, but vary the page size from 2 KB to 32 KB. (The maximum page size allowed by SQLite3 is currently 32 KB.) The size of the I/O block is again calculated using the equations shown in Table 1, as in the previous section. The results are shown in Figure 2 for a 500 MB buffer pool.

[Figure 2: Varying the page size (500 MB buffer pool, blocked I/O); (a) magnetic HDD, (b) flash SSD.]

As can be seen from Figure 2, when blocked I/O is used, the page size has a small impact on the join performance in both the magnetic HDD and the flash SSD cases. The key lesson here is that if blocked I/O is used, the database system can likely set the page size based on criteria other than join performance.

4.5 Effect of Blocked I/O

The major reason for using blocked I/O with magnetic HDDs is to amortize the high cost of disk seeks and rotational delays. An interesting question is: Does blocked I/O still make sense for flash SSDs? To answer this question, we repeated the experiment described in the previous section with page-sized I/O instead of blocked I/O. These results are shown in Figure 3.

[Figure 3: Varying the page size (500 MB buffer pool, page-sized I/O); (a) magnetic HDD, (b) flash SSD.]

Comparing Figure 3 with Figure 2, we can clearly see that blocked I/O is still valuable for the flash SSD, often improving the performance by 2X.
The reasons for this are: a) the software layers of the flash SSD still incur some latency [4], making larger I/O sizes attractive, and b) write operations with larger I/O sizes generate fewer erase operations, as the software layer is able to manage a pool of pre-erased blocks more efficiently.
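The effect of the request size is easy to observe with a small micro-benchmark along the following lines (an illustrative sketch, not the paper's measurement harness; the file name in the usage comment is hypothetical, and a fair comparison needs cold reads or O_DIRECT, as in Section 4.1):

```python
import os, time

PAGE = 8 * 1024
PAGES_PER_BLOCK_IO = 16       # illustrative blocked-I/O factor

def time_reads(path, nbytes, request_size):
    """Read `nbytes` from the start of `path` using requests of `request_size`
    bytes each, and return the elapsed wall-clock time."""
    fd = os.open(path, os.O_RDONLY)
    start = time.perf_counter()
    offset = 0
    while offset < nbytes:
        data = os.pread(fd, request_size, offset)
        if not data:
            break
        offset += len(data)
    os.close(fd)
    return time.perf_counter() - start

# Page-sized I/O issues many small requests; blocked I/O issues far fewer, amortizing
# per-request overheads (seeks and rotation on HDDs, FTL latency and erases on SSDs).
# t_page  = time_reads("heapfile.db", 1 << 30, PAGE)
# t_block = time_reads("heapfile.db", 1 << 30, PAGE * PAGES_PER_BLOCK_IO)
```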

When the page size is 2 KB, the performance of sort-merge join and Grace hash join on the flash SSD is worse than on the magnetic HDD. When the I/O size is less than the flash page size (4 KB), every write operation is likely to generate an erase operation, which severely degrades performance. This result re-confirms the observation (but now for joins) that I/O blocks should be aligned to flash pages [4].

We also observed that the CPU costs on the flash SSD and on the magnetic HDD are different for the same page size. CPU costs are generally larger with the flash SSD, due to the complex nature of the FTL on the flash SSD. As described in Section 2, the FTL not only provides the ability to map addresses, but also needs to support many other functionalities, such as updating the map structures, maintaining a pool of pre-erased blocks, wear-leveling, and parallelism to improve performance. As an example, Table 5 shows the CPU times of the block nested loops join, broken down by the time spent in the user and the kernel modes. (The other join methods showed similar behavior.)

[Table 5: CPU times for block nested loops join on the magnetic HDD and the flash SSD, broken down into user and kernel time, for page sizes from 2 KB to 32 KB; the individual values are not reproduced here.]

From this table, we observe that the kernel CPU times are larger with the flash SSD. This gap between the CPU costs for the flash SSD and the magnetic HDD is larger for smaller page sizes, since smaller page sizes result in more I/O requests, which keeps the FTL busier with frequent direct map lookups and indirect map updates. On the other hand, the gap is smaller with larger page sizes, and eventually the CPU costs for the SSD and the magnetic disk are almost the same when blocked I/O is used, as in Figure 2.

In summary:
1. Using blocked I/O significantly improves the join performance on flash SSDs, just as it does on magnetic HDDs.
2. The I/O size should be a multiple of the flash page size.

5. RELATED WORK

There is substantial related work on the use of flash memory for DBMSs. Graefe [6] revisits the famed five-minute rule based on flash, and points out several potential uses of flash in DBMSs. Lee and Moon [12] suggest a log-based storage system that can convert random writes into sequential writes. In that paper, they present a new design that logs updates to the end of each flash erase block, rather than doing in-place updates, to avoid expensive erase operations. Lee et al. [13] observe the characteristics of hash join and sort-merge join in a commercial system, and conclude that for that system, sort-merge join is better suited for flash SSDs; however, the implementation details of the hash join method in that commercial system are not known. Myers [14] examines join performance on flash SSDs under a set of realistic I/O workloads with Berkeley DB and a fixed-size buffer pool. Of the hash-based join algorithms, only Grace hash join is considered in that study. Shah et al. [16] show that for flash memory the PAX [3] layout of data pages is better than a row-based layout for scans. They also suggest a new join algorithm for flash SSDs. No direct implementation of the algorithm is presented, but the potential benefits of the new algorithm are shown using an analytical model and a comparison to Grace hash join. Tsirogiannis et al. [18] present a new pipelined join algorithm in combination with the PAX layout. A key aspect of their new algorithm is to minimize I/Os by retrieving only the required attributes, as late as possible. They show that their algorithm is much more efficient on flash SSDs compared to hybrid hash join, especially when either few attributes or few rows are selected in the join result.
More recently, Bouganim et al. [4] have provided a collection of nine micro-benchmarks based on various I/O patterns to understand flash device performance.

6. CONCLUSIONS

In this paper, we have presented and discussed the performance characteristics of four well-known ad hoc join algorithms on a magnetic HDD and a flash SSD. Based on our evaluation, we conclude that a) the buffer allocation strategy has a critical impact on the performance of join algorithms for both magnetic HDDs and flash SSDs, b) despite the absence of mechanically moving parts, blocked I/O plays an important role for flash SSDs, and c) both CPU times and I/O costs must be considered when comparing the performance of join algorithms, as the CPU time can be a larger (and sometimes dominating) proportion of the overall join cost with flash SSDs. Many of these observations are lessons that we have learnt from previous work on optimizing join algorithms for magnetic HDDs, and they continue to be important when studying the performance of join algorithms on flash SSDs, though with different emphases.

For future work, we plan to expand the range of hardware that we consider in this study, including using enterprise flash SSDs and other magnetic disk-based I/O configurations. We also plan on deriving detailed analytical models for existing join algorithms on flash SSDs, and exploring whether the optimal buffer allocations differ from those for magnetic HDDs.

7. ACKNOWLEDGEMENTS

We thank Jeff Naughton for providing useful feedback on various parts of this work. This research was supported by a grant from Microsoft. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Microsoft.

8. REFERENCES

[1] SQLite3. http://www.sqlite.org.
[2] Transaction Processing Performance Council. http://www.tpc.org.
[3] A. Ailamaki, D. DeWitt, M. Hill, and M. Skounakis. Weaving Relations for Cache Performance. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), 2001.
[4] L. Bouganim, B. Jonsson, and P. Bonnet. uFLIP: Understanding Flash IO Patterns. In Proceedings of the 4th Biennial Conference on Innovative Data Systems Research (CIDR), 2009.
[5] K. Bratbergsengen. Hashing Methods and Relational Algebra Operations. In Proceedings of the 10th International Conference on Very Large Data Bases (VLDB), 1984.
[6] G. Graefe. The five-minute rule twenty years later, and how flash memory changes the rules. In Proceedings of the 3rd International Workshop on Data Management on New Hardware (DaMoN), 2007.

[7] J. Gray and B. Fitzgerald. Flash Disk Opportunity for Server-Applications. ACM Queue, 6(4):18-23, July 2008.
[8] L. Haas, M. Carey, M. Livny, and A. Shukla. SEEKing the truth about ad hoc join costs. The VLDB Journal, 6(3), 1997.
[9] C. Hwang. Nanotechnology Enables a New Memory Growth Model. Proceedings of the IEEE, 91(11), November 2003.
[10] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of Hash to Database Machine and Its Architecture. New Generation Computing, 1(1):63-74, 1983.
[11] D. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, Mass.
[12] S. Lee and B. Moon. Design of Flash-Based DBMS: An In-Page Logging Approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 55-66, 2007.
[13] S. Lee, B. Moon, C. Park, J. Kim, and S. Kim. A Case for Flash Memory SSD in Enterprise Database Applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2008.
[14] D. Myers. On the Use of NAND Flash Memory in High-Performance Relational Databases. Master's Thesis, MIT, 2008.
[15] M. Polte, J. Simsa, and G. Gibson. Comparing Performance of Solid State Devices and Mechanical Disks. In Proceedings of the 3rd Petascale Data Storage Workshop (PDSW), 2008.
[16] M. Shah, S. Harizopoulos, J. Wiener, and G. Graefe. Fast Scans and Joins Using Flash Drives. In Proceedings of the 4th International Workshop on Data Management on New Hardware (DaMoN), 2008.
[17] L. Shapiro. Join Processing in Database Systems with Large Main Memories. ACM Transactions on Database Systems, 11(3), September 1986.
[18] D. Tsirogiannis, S. Harizopoulos, M. Shah, J. Wiener, and G. Graefe. Query Processing Techniques for Solid State Drives. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2009.


More information

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture By Gaurav Sheoran 9-Dec-08 Abstract Most of the current enterprise data-warehouses

More information

Computer Science 61C Spring Friedland and Weaver. Input/Output

Computer Science 61C Spring Friedland and Weaver. Input/Output Input/Output 1 A Computer is Useless without I/O I/O handles persistent storage Disks, SSD memory, etc I/O handles user interfaces Keyboard/mouse/display I/O handles network 2 Basic I/O: Devices are Memory

More information

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM Module III Overview of Storage Structures, QP, and TM Sharma Chakravarthy UT Arlington sharma@cse.uta.edu http://www2.uta.edu/sharma base Management Systems: Sharma Chakravarthy Module I Requirements analysis

More information

C has been and will always remain on top for performancecritical

C has been and will always remain on top for performancecritical Check out this link: http://spectrum.ieee.org/static/interactive-the-top-programminglanguages-2016 C has been and will always remain on top for performancecritical applications: Implementing: Databases

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

An Optimization of Disjunctive Queries : Union-Pushdown *

An Optimization of Disjunctive Queries : Union-Pushdown * An Optimization of Disjunctive Queries : Union-Pushdown * Jae-young hang Sang-goo Lee Department of omputer Science Seoul National University Shilim-dong, San 56-1, Seoul, Korea 151-742 {jychang, sglee}@mercury.snu.ac.kr

More information

GSLPI: a Cost-based Query Progress Indicator

GSLPI: a Cost-based Query Progress Indicator GSLPI: a Cost-based Query Progress Indicator Jiexing Li #1, Rimma V. Nehme 2, Jeffrey Naughton #3 # Computer Sciences, University of Wisconsin Madison 1 jxli@cs.wisc.edu 3 naughton@cs.wisc.edu Microsoft

More information

Comparing Performance of Solid State Devices and Mechanical Disks

Comparing Performance of Solid State Devices and Mechanical Disks Comparing Performance of Solid State Devices and Mechanical Disks Jiri Simsa Milo Polte, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University Motivation Performance gap [Pugh71] technology

More information

Preface. Fig. 1 Solid-State-Drive block diagram

Preface. Fig. 1 Solid-State-Drive block diagram Preface Solid-State-Drives (SSDs) gained a lot of popularity in the recent few years; compared to traditional HDDs, SSDs exhibit higher speed and reduced power, thus satisfying the tough needs of mobile

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

Object Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ

Object Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ 45 Object Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ Department of Computer Science The Australian National University Canberra, ACT 2611 Email: fzhen.he, Jeffrey.X.Yu,

More information

STORAGE LATENCY x. RAMAC 350 (600 ms) NAND SSD (60 us)

STORAGE LATENCY x. RAMAC 350 (600 ms) NAND SSD (60 us) 1 STORAGE LATENCY 2 RAMAC 350 (600 ms) 1956 10 5 x NAND SSD (60 us) 2016 COMPUTE LATENCY 3 RAMAC 305 (100 Hz) 1956 10 8 x 1000x CORE I7 (1 GHZ) 2016 NON-VOLATILE MEMORY 1000x faster than NAND 3D XPOINT

More information

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction

More information

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24 FILE SYSTEMS, PART 2 CS124 Operating Systems Fall 2017-2018, Lecture 24 2 Last Time: File Systems Introduced the concept of file systems Explored several ways of managing the contents of files Contiguous

More information

Join algorithm costs revisited

Join algorithm costs revisited The VLDB Journal (1996) 5: 64 84 The VLDB Journal c Springer-Verlag 1996 Join algorithm costs revisited Evan P. Harris, Kotagiri Ramamohanarao Department of Computer Science, The University of Melbourne,

More information

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose

More information

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Kefei Wang and Feng Chen Louisiana State University SoCC '18 Carlsbad, CA Key-value Systems in Internet Services Key-value

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University Chapter 10: File System Chapter 11: Implementing File-Systems Chapter 12: Mass-Storage

More information

Performance Modeling and Analysis of Flash based Storage Devices

Performance Modeling and Analysis of Flash based Storage Devices Performance Modeling and Analysis of Flash based Storage Devices H. Howie Huang, Shan Li George Washington University Alex Szalay, Andreas Terzis Johns Hopkins University MSST 11 May 26, 2011 NAND Flash

More information

Do Query Op5mizers Need to be SSD- aware?

Do Query Op5mizers Need to be SSD- aware? Do Query Op5mizers Need to be SSD- aware? Steven Pelley, Kristen LeFevre, Thomas F. Wenisch, University of Michigan CSE ADMS 2011 1 Enterprise flash: a new hope Disk Flash Model WD 10Krpm OCZ RevoDrive

More information

Disks and Files. Storage Structures Introduction Chapter 8 (3 rd edition) Why Not Store Everything in Main Memory?

Disks and Files. Storage Structures Introduction Chapter 8 (3 rd edition) Why Not Store Everything in Main Memory? Why Not Store Everything in Main Memory? Storage Structures Introduction Chapter 8 (3 rd edition) Sharma Chakravarthy UT Arlington sharma@cse.uta.edu base Management Systems: Sharma Chakravarthy Costs

More information

Flash memory: Models and Algorithms. Guest Lecturer: Deepak Ajwani

Flash memory: Models and Algorithms. Guest Lecturer: Deepak Ajwani Flash memory: Models and Algorithms Guest Lecturer: Deepak Ajwani Flash Memory Flash Memory Flash Memory Flash Memory Flash Memory Flash Memory Flash Memory Solid State Disks A data-storage device that

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information