DBMSs on a Modern Processor: Where Does Time Go? Revisited


Matthew Becker, Information Systems, Carnegie Mellon University (mbecker+@cmu.edu)
Naju Mancheril, School of Computer Science, Carnegie Mellon University (naju@cmu.edu)
Steven Okamoto, School of Computer Science, Carnegie Mellon University (sokamoto@cs.cmu.edu)

Abstract

In 1999, the performance of four commercial DBMSs was analyzed and broken down into the amount of time spent doing useful computation and the amount of time spent stalling for data to be retrieved from memory. In this paper, we aim to revisit two of those DBMSs and see if their interactions with the underlying hardware architecture have improved. We use the same metrics as the previous paper to facilitate comparison between the two papers. We divide total execution time into time spent on useful computation, L1 and L2 cache misses, resource (functional unit) stalls, and branch misprediction (pipeline draining). It is important to note that we are not interested in analyzing how fast these DBMSs are. Instead, we are interested in determining the relative contributions of stalls and letting both DBMS and microprocessor designers locate these performance bottlenecks in their systems and ameliorate them. We find that despite the performance optimizations found in today's database systems, they are not able to take full advantage of many recent improvements in processor technology and data placement.

1. Introduction

A major aspect of studying a piece of software with the intention of improving its performance is analyzing what the processor is doing while that software executes. This kind of inspection can reveal programmatic problems which, when changed, can greatly improve the performance of a particular program. Today's modern processors employ a number of sophisticated techniques that attempt to enhance performance by overlapping the execution of computational and memory-related operations.

In 1999, the performance of four commercial DBMSs was analyzed and broken down into the amount of time spent doing useful computation and the amount of time spent stalling for data to be retrieved from memory [2]. The 1999 results showed that:

- On average, at least half the time is spent in stalls (implying that database designers could improve performance by programmatically attacking the causes of these stalls).
- In all cases the main causes of the stalls were L2 cache data misses (implying that data placement should focus on the L2 cache) and L1 instruction misses (implying that instruction placement should focus on the L1 instruction cache).
- About 20 percent of the stalls are caused by subtle implementation details, the most notable being the number of branch mispredictions.

In this paper, we aim to revisit two of those DBMSs and see if their interactions with the underlying hardware architecture have improved. We use the same metrics as the previous paper to facilitate comparison between the two papers. We divide total execution time into time spent on useful computation, L1 and L2 cache misses, resource (functional unit) stalls, and branch misprediction (pipeline draining). It is important to note that we are not interested in analyzing how fast these DBMSs are. Instead, we are interested in determining the relative contributions of stalls and letting both DBMS and microprocessor designers locate these performance bottlenecks in their systems and ameliorate them.
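Concretely, the breakdown used throughout this paper follows the framework of [2]. Treating the components as additive is itself an approximation, since out-of-order hardware overlaps part of the stall time with computation:

    T_Q = T_C + T_M + T_B + T_R

where T_Q is the total query execution time, T_C the useful computation time, T_M the memory stall time, T_B the branch misprediction penalty, and T_R the resource stall time. Each stall term is estimated from hardware event counts multiplied by an assumed per-event penalty for the processor at hand.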
We perform our measurements with two leading commercial DBMS releases on the following modern hardware platforms:

- an Intel Pentium 4 processor system with 1 GB of RAM
- a quad Intel Pentium III processor system with 4 GB of RAM

Due to time constraints we are unable to present results for both DBMSs on both hardware platforms. Instead we present results for one DBMS running on the Pentium III and another set of results for the other DBMS running on the Pentium 4. As future work we would run both DBMSs on both hardware platforms and present two sets of results for each. For the scope of this paper we run System C from [2] on the Pentium III processor and System B, again from [2], on the Pentium 4 processor. Unless otherwise mentioned or indicated, we refer to both systems concurrently.

If the conclusions of [2] impacted DBMS design, we expect to see more time devoted to useful computation and to resource stalls. An increase in resource stalls would imply that the processor is no longer fast enough to keep up with the database, thus shifting the bottleneck from memory to the processor.

2. Database Workload

The workloads that we used in this paper are divided into two main groups: single-table range selections and two-table equijoins, all of which are performed over a memory-resident database running a single command stream. The characteristics of this type of workload divorce our analysis from the need to consider dynamic parameters, such as concurrency control among multiple transactions or disk I/O speed, and instead isolate basic operations such as sequential access and index selection [2]. We ran all queries against the two following relations to collect data (note: only fields that are actually used in our queries are listed; all other fields have placeholders):

CREATE TABLE LINEITEM (
    L_ORDERKEY INTEGER NOT NULL,
    L_PARTKEY INTEGER NOT NULL,
    <5 other fields>,
    L_TAX FLOAT NOT NULL,
    <rest of fields> )

CREATE TABLE ORDERS (
    O_ORDERKEY INTEGER NOT NULL,
    <rest of fields> )

The lineitem table contains variable-length tuples. L_ORDERKEY has uniformly distributed values starting at 1, stored in sorted order in the relation. L_PARTKEY has uniformly distributed values, randomly ordered in the database. The 5 fields between L_PARTKEY and L_TAX are used to prevent a record's L_TAX from being in the same cache line as its L_PARTKEY. The orders table contains variable-length records; its O_ORDERKEY values also start at 1. This table is used to join with the lineitem table for the queries that require a join.
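To make the cache-line argument behind the lineitem layout concrete, the following is a hypothetical C rendering of the record; the filler size is an assumption chosen so that L_PARTKEY and L_TAX can never share a 64-byte cache line, which is what the five intervening fields are meant to guarantee. The real systems' storage formats are, of course, not visible to us.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical lineitem record layout. The point is only that the filler
     * between L_PARTKEY and L_TAX spans more than one 64-byte cache line, so
     * touching the predicate column does not drag the aggregated column into
     * the cache. */
    struct lineitem_rec {
        int    l_orderkey;
        int    l_partkey;
        char   filler[96];     /* stands in for the 5 unused fields (assumed size) */
        double l_tax;
        /* <rest of fields> */
    };

    int main(void)
    {
        printf("L_PARTKEY at offset %zu, L_TAX at offset %zu\n",
               offsetof(struct lineitem_rec, l_partkey),
               offsetof(struct lineitem_rec, l_tax));
        /* the offsets differ by well over 64 bytes, so the two fields
         * always land on different cache lines */
        return 0;
    }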
2.1 Sequential Range Selection

This is the first of the two groups of queries that consist of single-table range selections. It emphasizes the performance of the DBMS's straight sequential scan over a file. The query used in this case is:

SELECT AVG(L_TAX)
FROM lineitem
WHERE L_PARTKEY > min AND L_PARTKEY <= max;

This query is performed in the absence of an index on L_PARTKEY. We verified that the access path for this query involved a sequential scan of the entire table. In this case we want to examine the effects of query selectivity on the performance of the sequential scan. The values used for min and max above define the range of tuples that we are interested in; by changing the difference between these values we are able to change the selectivity of the queries. We measured data for the above query using the following selectivities: 1%, 10%, 50% and 100%. For each selectivity the query was run multiple times selecting different ranges of data (with the same selectivity), with the obvious exception of the 100% selectivity.

We chose to use an aggregate operation to match the methodology used in [2]. There are two explanations presented in that paper as to why the use of an aggregate operation is a good idea, only one of which is still pertinent in this paper. By using an aggregate operation, the number of results that actually need to be returned is very small (in this case it is one). This is a good idea because our measurements will not be affected by the overhead associated with client/server communication. Alternatively, storing a large number of records in a temporary relation to be returned as results would affect the measurements due to the extra insertion operations.
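As an illustration of how such ranges can be generated, the sketch below picks a random (min, max] window whose width is the desired fraction of the key domain. KEY_MAX is a placeholder, since the actual maximum L_PARTKEY value is not preserved above; nothing here is the harness we actually used.

    #include <stdio.h>
    #include <stdlib.h>

    #define KEY_MAX 1000000L   /* placeholder for the maximum L_PARTKEY value */

    /* Choose a random (min, max] range covering the requested fraction of a
     * key uniformly distributed on [1, KEY_MAX]. */
    static void pick_range(double selectivity, long *min, long *max)
    {
        long width = (long)(selectivity * KEY_MAX);   /* fraction of the key domain */
        long lo = rand() % (KEY_MAX - width + 1);     /* random placement of the range */
        *min = lo;
        *max = lo + width;                            /* predicate: key > min AND key <= max */
    }

    int main(void)
    {
        double sels[] = { 0.01, 0.10, 0.50, 1.00 };
        srand(42);
        for (int s = 0; s < 4; s++) {
            long min, max;
            pick_range(sels[s], &min, &max);
            printf("selectivity %.0f%%: L_PARTKEY > %ld AND L_PARTKEY <= %ld\n",
                   sels[s] * 100, min, max);
        }
        return 0;
    }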

2.2 Indexed Range Selection

This is the second group of queries that make up the single-table selection group. It is used to reason about the performance of the DBMS when it goes through an index. We used a query similar to the one used in the sequential range selection, but we augmented the database with an index on the L_PARTKEY attribute. We again verified that the access path for each of the queries went through that index. We also used the same set of selectivities as above.

2.3 Sequential Join

This is the only group of queries that we perform in the category of two-table equijoins. It allows us to examine the performance of the DBMS while it is carrying out a sequential join between two tables. In this case we ensure that the join is performed between two sequential scans of a file. We chose to join the orders table with the lineitem table on the common attribute O_ORDERKEY (L_ORDERKEY in the case of lineitem). The query that we used was:

SELECT AVG(L_TAX)
FROM lineitem, orders
WHERE L_ORDERKEY = O_ORDERKEY
  AND L_PARTKEY > min AND L_PARTKEY <= max;

Each query was performed in the absence of the above-mentioned index. We also verified that the access patterns for both tables were sequential scans. Again we chose the aggregate operation to minimize the amount of time the DBMS spends creating the result set, and we used min and max to enforce different selectivities for the join just as we did in the previous queries.

3. Pentium III Experimental Setup - System C

3.1 Hardware Platform

  Processor:  (4 x) Pentium III, 700 MHz
  L1 Cache:   16 KB L1 cache on each processor
  L2 Cache:   2 MB unified cache
  Memory:     4 GB
  Bus Speed:  100 MHz

The computer that we used to run the experiments on System C contains four Pentium III processors running at 700 MHz, each with 16 KB L1 instruction and data caches and a unified 2 MB L2 cache. The system contains 4 GB of main memory connected to the processor chips through a 100 MHz system bus. The Pentium III is a powerful server processor with an out-of-order engine and speculative instruction execution. The x86 instruction set is composed of CISC instructions; each instruction is translated into no more than three RISC instructions (µops) during the decode phase of the pipeline.

There are two levels of non-blocking cache in the system. There are separate first-level caches for instructions and data, whereas at the second level the cache is unified. The cache characteristics are summarized in Figure 1.

Figure 1. A diagram of the Pentium III's memory hierarchy.

3.2 Software Platform

Experiments were conducted on System C, the name of which cannot be disclosed here due to legal restrictions. System C was installed on Redhat Linux 7.1 (Seawolf). The DBMSs were configured the same way in order to achieve as much consistency as possible. The buffer pool size was made large enough to fit the datasets for all the queries into main memory because the objective is to measure pure processor and memory performance. To define the schema and execute the queries, the same commands and datasets were used for all the DBMSs, with no vendor-specific SQL extensions.

3.3 Measurement Tools and Methodology

The Pentium III processor provides two hardware-based counters for event measurements. We used emon, a tool graciously provided by Intel, to control these counters. emon can set the counters to zero, assign event codes to them, and read their values either after a pre-specified amount of time or after a program has completed execution. Emon was used to measure 124 event types for the results presented in this report. We measured each event type in both user and kernel mode and summed the results from each of the four processors.
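emon itself is an Intel-internal tool, so its interface cannot be reproduced here. As a stand-in, the sketch below shows the same measurement cycle (zero a counter, select an event code, run a unit of work, read the count) using the Linux perf_event interface available on current kernels; the event choice and run_query_batch() are placeholders, not part of the original setup.

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static void run_query_batch(void)
    {
        /* placeholder: issue the ten-query unit against the DBMS here */
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type     = PERF_TYPE_HARDWARE;
        attr.size     = sizeof(attr);
        attr.config   = PERF_COUNT_HW_CACHE_MISSES;  /* one event code per counter */
        attr.disabled = 1;                           /* start stopped, like zeroing emon's counters */

        int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) return 1;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);          /* set the counter to zero   */
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        run_query_batch();                           /* the measured unit of work */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        read(fd, &count, sizeof(count));             /* read the accumulated event count */
        printf("events counted: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }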
Before taking measurements for a query, the main memory and caches were warmed up with multiple runs of this query. In order to distribute and minimize the effects of the client/server startup overhead, the unit of execution consisted of ten different queries on the same database, with the same selectivity. Each time emon executed one such unit, it measured a pair of events. To tighten the confidence intervals, the experiments were repeated several times; the final sets of numbers exhibit a standard deviation of less than 5 percent. Finally, the execution time breakdown was computed using a set of formulae similar to the ones used in [2], updated with the new events and with new penalty estimates for the new architectures.

4. Pentium 4 Experimental Setup - System B

4.1 Hardware Platform

System B was tested on a 3 GHz uniprocessor Pentium 4 machine with 1 GB of RAM, a 512 KB on-die unified L2 cache, an 8 KB L1 data cache, and a 12 Kµop trace cache. The trace cache replaces the traditional L1 instruction cache: an instruction cache holds CISC macroinstructions in spatially localized cache lines, whereas the trace cache contains decoded µops from an execution trace, thereby exploiting temporal locality. This greatly helps both the hit rate of the cache (µops within basic blocks always hit) and its storage efficiency (µops from spatially distant but temporally close locations, i.e., branch targets, can be stored on the same cache line). In addition, the trace cache obviates the need for macroinstruction decoding of cached µops. On a trace cache miss, the instruction is fetched and decoded, and a new trace is built.

The Pentium 4 also features a much improved branch predictor, with a branch target buffer eight times larger than the Pentium III's. This is needed both to improve trace cache performance and to avoid the costly branch misprediction penalty imposed by the Pentium 4's deeper pipeline (20 stages versus 10 for the Pentium III). The Pentium 4 also supports hardware prefetching to reduce memory latency. Deeper buffering of the out-of-order execution logic and more powerful register renaming in the Pentium 4 are designed to reduce the number of resource stalls and increase the number of simultaneous in-flight operations. Hyperthreading, which allows each physical processor to act as two logical processors, was disabled for these experiments as a simplification. (Preliminary work indicated that the simple microbenchmarks we tested were overwhelmingly executed on one logical processor; the sharing of performance counters between logical processors for some events added unnecessary complexity given the marginal advantages of enabling Hyperthreading.) Technical details on the Pentium 4 hardware are given in Table 1.

Table 1. Pentium 4 hardware system
  Processor:       Pentium 4 (Northwood core), 3 GHz
  Trace cache:     8-way, 12 Kµop trace cache, 6 µop line
  L1 data cache:   4-way, 8 KB, 64 B line, non-blocking, write-through, 2 cycle latency
  L2 cache:        8-way, 512 KB unified, 64 B line, non-blocking, write-back, 18 cycle latency
  Memory:          1 GB, latency 300 cycles
  Bus:             200 MHz, quad-pumped
  Prefetching:     Enabled
  Hyperthreading:  Disabled

4.2 Software Platform

Experiments were conducted on System B, the name of which cannot be disclosed here due to legal restrictions. System B was installed on Redhat Linux 9 (Shrike). The DBMS was configured the same way as System C in order to achieve as much consistency as possible. The buffer pool size was made large enough to fit the tables for all the queries into main memory because the objective is to measure pure processor and memory performance. To define the schema and execute the queries, the same commands and datasets were used for all the DBMSs, with no vendor-specific SQL extensions.
4.3 Measurement Tools and Methodology

The Pentium 4 processor provides 18 hardware counters for event measurements. Emon was also used on the Pentium 4, although the different microarchitecture led to slightly different events being measured than with the Pentium III. Again, we measured events generated by kernel and user programs separately. As with System C, the DBMS was first warmed up, and queries were run in batches of ten to amortize overhead. Memory latencies were calculated from the event results; this was cross-checked using LMBench, which we also used to verify the L1D and L2 latencies.
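A latency cross-check in the spirit of the LMBench one can be sketched with the classic pointer-chasing loop: each load depends on the previous one and the ring is randomly linked, so prefetching cannot help and the elapsed time per access approximates the load latency of whatever cache level the ring fits in. The sizes below are illustrative assumptions, not our actual configuration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)          /* 4M pointers: ~32 MB, far larger than the L2 */
    #define ACCESSES 10000000L

    int main(void)
    {
        void **ring = malloc(N * sizeof(void *));
        size_t *perm = malloc(N * sizeof(size_t));
        for (size_t i = 0; i < N; i++) perm[i] = i;
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
            size_t j = rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < N; i++)              /* link each element to the next in the permutation */
            ring[perm[i]] = &ring[perm[(i + 1) % N]];

        void **p = &ring[perm[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ACCESSES; i++)
            p = (void **)*p;                        /* each load depends on the previous one */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.1f ns per dependent load (p=%p)\n", ns / ACCESSES, (void *)p);
        free(ring); free(perm);
        return 0;
    }

Shrinking N until the ring fits in the L1D or the L2 gives the corresponding hit latencies.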

5. Pentium III Results

In this section, we first present an overview of the execution time breakdown. Then, we focus on the chief barrier to computation: memory stalls. We divide memory stalls into L1 versus L2 cache misses, and again into data and instruction misses. Since the active processor executed in user mode more than 95% of the time, all of the measurements shown in this section reflect user-mode execution, unless stated otherwise.

5.1 Execution Time Breakdown

Figure 2 shows three graphs, each summarizing the average execution time breakdown for one of the queries. Each bar shows the contribution of the four components (T_C, T_M, T_B, and T_R) as a percentage of the total query execution time. Although the workload is much simpler than the TPC benchmarks, the computation time is always less than half the execution time; thus, the processor spends most of the time stalled. As processor clocks become faster, the computation component will become a smaller part of overall execution time, while miss penalties will remain high since memory access times do not decrease as quickly.

The memory stall time contribution varies across different queries. For example, Figure 2 shows that when System C executes the sequential range selection, it spends 30% of the time in memory stalls. When the same system executes the indexed range selection, the memory stall time contribution becomes 50%. Although the indexed range selection accesses fewer records, its memory stall component is larger than in the sequential selection, probably because the index traversal has less spatial locality than the sequential scan. Analysis of the memory behavior shows that 90% of T_M is due to L1 I-cache and L2 data misses in all of the experiments. Minimizing memory stalls has been a major focus of database research on performance improvement [1]. Although in most cases the memory stall time (T_M) accounts for most of the overall stall time, the other two components are always significant: branch misprediction stalls account for 10-20% of the execution time, and the resource stall time contribution stays roughly constant around 25%.

Figure 2. Graphs of the overall execution time breakdown for each type of query.

5.2 Memory Stalls

Since the memory stalls comprise such a large portion of the overall execution time, it is important to determine their causes. This section discusses the significance of the memory stall components to the query execution time, according to the framework discussed earlier. Figure 3 shows the breakdown of T_M into the following stall time components: T_L1D (L1 D-cache miss stalls), T_L1I (L1 I-cache miss stalls), T_L2D (L2 cache data miss stalls), T_L2I (L2 cache instruction miss stalls), and T_ITLB (ITLB miss stalls), with one graph for each type of query.

It is clear that L1 D-cache stall time is insignificant. An L1 D-cache miss that hits in the L2 cache incurs low latency, which can usually be overlapped with other computation. The stall time caused by L2 cache instruction misses (T_L2I) and ITLB misses (T_ITLB) is also insignificant in all the experiments. T_L2I contributes little to the overall execution time because second-level cache instruction misses are two to three orders of magnitude fewer than first-level instruction cache misses. The low T_ITLB indicates that the systems use few instruction pages, and the ITLB is enough to store the translations for their addresses.
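Using these component names, T_M is estimated with the same count-times-penalty form as in [2]; the penalty factors are notation for assumed per-miss costs on this processor, not separately measured values:

    T_M = T_L1D + T_L1I + T_L2D + T_L2I + T_ITLB
        ≈ (#L1D misses × L2 hit latency) + (#L1I misses × L2 hit latency)
          + (#L2 data misses + #L2 instruction misses) × main-memory latency
          + (#ITLB misses × ITLB miss penalty)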

The rest of this section discusses the two major memory-related stall components, T_L2D and T_L1I.

For all of the queries, T_L2D (the time spent on L2 data stalls) is one of the most significant components of the execution time. This is most evident in the case of the sequential scan and join queries, in which L2 data stalls comprise 70% of all memory stall time. We believe that the index scan's L2 data performance is just as poor, but because the number of L1 I-cache stalls is so large, the data load penalties are hidden from us. In other words, data loads into the L2 cache can be overlapped with instruction loads into the L1.

Stall time due to misses at the first-level instruction cache (T_L1I) is a major memory stall component for all three sets of queries. The results in this study reflect the real I-cache stall time, with no approximations. The Pentium III uses stream buffers for instruction prefetching, but L1 I-cache misses are still a bottleneck. T_L1I is difficult to overlap, because L1 I-cache misses cause a serial bottleneck in the pipeline. The average contribution of T_L1I to the execution time is 20%.

Figure 3. Graphs of the breakdown of memory stalls into different components.

5.3 Branch Mispredictions

Branch mispredictions have serious performance implications, because (a) they cause a serial bottleneck in the pipeline and (b) they cause instruction cache misses, which in turn incur additional stalls. Branch instructions account for 20% of the total instructions retired in our index scans.

5.4 Resource Stalls

Resource-related stall time is the time during which the processor must wait for a resource to become available. Such resources include functional units in the execution stage, registers for handling dependencies between instructions, and other platform-dependent resources. The contribution of resource stalls to the overall execution time is fairly stable across the queries. In all cases, resource stalls are dominated by dependency and/or functional unit stalls. Functional unit availability stalls are caused by bursts of instructions that create contention in the execution unit. Memory references account for at least half of the instructions retired, so it is possible that one of the resources causing these stalls is a memory buffer.

Resource stalls are an artifact of the lowest-level details of the hardware. The compiler can produce code that avoids resource contention and exploits instruction-level parallelism, but this is difficult with the x86 instruction set, because each CISC instruction is internally translated into smaller instructions (µops). Thus, there is no easy way for the compiler to see the correlation across multiple x86 instructions and optimize the instruction stream at the processor execution level.
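One software-level attack on the branch misprediction component described above is to evaluate the selection predicate without a data-dependent conditional jump, which is essentially what the fall-through trick attributed to System A in Section 7 accomplishes. The sketch below contrasts the two forms for the sequential-scan aggregate; the names are hypothetical and this is not any vendor's actual code.

    #include <stddef.h>

    struct rec { int l_partkey; double l_tax; };

    /* Conventional form: one conditional branch per record, which mispredicts
     * most often when selectivity is near 50%. */
    double avg_tax_branching(const struct rec *r, size_t n, int min, int max)
    {
        double sum = 0.0; size_t cnt = 0;
        for (size_t i = 0; i < n; i++) {
            if (r[i].l_partkey > min && r[i].l_partkey <= max) {
                sum += r[i].l_tax;
                cnt++;
            }
        }
        return cnt ? sum / cnt : 0.0;
    }

    /* Branch-free form: the predicate becomes a 0/1 value that scales the update,
     * so there is no data-dependent jump for the predictor to guess. */
    double avg_tax_branchfree(const struct rec *r, size_t n, int min, int max)
    {
        double sum = 0.0; size_t cnt = 0;
        for (size_t i = 0; i < n; i++) {
            int keep = (r[i].l_partkey > min) & (r[i].l_partkey <= max);
            sum += keep * r[i].l_tax;
            cnt += (size_t)keep;
        }
        return cnt ? sum / cnt : 0.0;
    }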

Figure 4. Execution time breakdown for sequential scans, index scans, and joins on System B on the Pentium 4.

6. Pentium 4 Results

System B was tested on the Pentium 4 processor using the same queries as were used with System C. The execution time was again broken down into the four principal components: useful computation time, memory stalls, branch misprediction penalties, and resource stalls (Figure 4).

6.1 Computation and Resource Stalls

As we observed earlier with System C, very little time is actually spent doing useful computation. This is most likely more an effect of the underlying hardware than of the DBMS itself. The memory and resource stalls on the Pentium 4 are exacerbated by the high clock speed, and they prevent the processor from performing useful operations. The majority of the resource stalls are due to a lack of store buffers in the out-of-order execution engine.

6.2 Branch Mispredictions

Only a tiny fraction of the execution time goes to branch mispredictions, which stands in marked contrast to the results obtained with System C on the Pentium III. This is because the Pentium 4's branch predictor is much more accurate than the Pentium III's. As expected, the contribution of branch mispredictions is greater for index scans (which must branch through the index) than for the highly regular sequential scans (which involve only a very tight loop). Also, branch misprediction reaches a maximum at a selectivity of 50% for sequential scans, since this is when a tuple is equally likely to be selected as rejected; because joins use sequential scans to read the relations, the same phenomenon is seen with joins. Since the branch mispredictions for index scans are dominated by the effects of the index, no such effect is seen in the middle graph.

6.3 Memory Stalls

Memory stalls remain the dominant contributor to execution time. We show the memory stall breakdown in Figure 5. Due to a limitation in the performance events supported by the Pentium 4, we are not able to distinguish L2 stalls caused by instruction misses from those caused by data misses. However, it is reasonable to assume that most of the misses are caused by data, since most instructions will hit in the L2. The memory stalls are due almost entirely to L2 and L1D stalls. The bottleneck previously posed by the L1 instruction cache has been removed through the introduction of the Pentium 4's trace cache. For the highly regular sequential scans and sequential joins, trace cache stalls make up less than 3% of memory stall time. For the less predictable index scans, a higher branch misprediction rate leads to a higher trace cache miss rate (around 20% for index scans versus about 5% for sequential scans and joins) and a correspondingly higher stall contribution. The ITLB's contribution to memory stall time is insignificant for all query types tested. This is not surprising, since the ITLB is no longer even accessed for the majority of µops: it is only needed on a trace cache miss, and the low miss rate of the trace cache more than makes up for the longer ITLB miss penalty.

8 " * ) ( $ ' $ %& # " ( ' & " % " #$ " #$ %&& '&' ( )& *&+ '&' #, '&' -( #. )&/ '&' Figure 5. Memory stall breakdown for sequential scans, index scans, and joins on System B on Pentium 4 the ITLB is no longer even accessed for the majority of µops. It is only needed on a trace cache miss, and the low miss rate of the trace cache more than makes up for the longer ITLB miss penalty. The large contribution from L1 data stalls is attributable to the Pentium 4 microarchitecture. The Pentium 4 s L1 data cache is both smaller than the Pentium III s, and has a larger cache line. This was a design choice to reduce hit latency and allow the L1D to keep up with the processor. However, it has the effect of increasing the miss rate for two memory locations that are not stored on the same cache line. This problem is compounded by the much higher clock speed of the Pentium 4, which effectively increases the miss penalty because the L2 latency (measured in clock cycles) is much higher than in the Pentium III. Indeed, when we reduce the miss penalty to that of the Pentium III, the L1D stall contribution drops to 20%. Correcting for the higher hit rate lowers the contribution by still more. In any case, because this is a percentage (not absolute) graph, the L1D contribution would have increased when the L1 instruction (trace cache) contribution decreased. 7. Where Do 5 Years Go? Five years ago, [2] determined that the main barriers to performance were L1 instruction stalls, L2 data stalls, branch mispredictions, and resource stalls. As mentioned earlier, there is a limit to how effectively the compiler can optimize for resource stalls due to the restrictions imposed by µop translation on X86. The other stalls, however, may be improved through better data placement schemes, improved branch predictors, and smarter compilation techniques. One system (System A), even eliminated branch mispredictions entirely on sequential scan queries by making all conditional jumps fall-through. In this section, we compare System C s performance with what was measured five years ago. The designers have clearly optimized the behavior of some queries, but there is still considerable room for improvement in others. The dramatic change in hardware from the P6 microarchitecture (used in the Pentium II tested in [2] and in the Pentium III) to the NetBurst microarchitecture used in the Pentium 4 precludes us from effectively carrying out a similar analysis for the System B at this time Sequential Scans Either the DBMS developers have tightened up their sequential scan code, or the L1 I-cache is now large enough to accomodate all of it. L1-I misses have shrunk from 40 to 20 percent. ITLB misses have shrunk from 10 percent to around 5. It appears that the bottleneck has shifted to L2 data misses which have increased from 60% to 80% of total execution time. [2] referenced some techniques to use the L1 I-cache more effectively. The authors correctly predicted that the first-level cache size will not increase at the same rate as the second-level cache size, because large L1 caches are not as fast and may slow down the processor clock. Our system has an L2 cache that is 4x the size of the original system, but the L1 cache is the same size. Consequently, the DBMSs must improve spatial locality in the instruction stream. Possible techniques include storing together

Consequently, the bottleneck in scan execution has shifted to L2 data stalls. As mentioned earlier, our records were created so that fetching the attribute referenced in the WHERE clause would not also bring the attribute that we were selecting into the cache. The cache logic does not know that we are performing a scan on a single attribute; its hands are tied by the fact that it must pull in a contiguous cache line. To improve cache performance, the DBMS developers must improve data placement policies so that attributes that are often scanned individually are stored together (i.e., not buried in some record) [1].

Branch misprediction stalls are tightly connected to instruction stalls on all superscalar processors. For the Pentium III this connection is tighter, because it uses instruction prefetching. [2] argued the importance of processors being able to efficiently execute even unoptimized instruction streams, so a different prediction mechanism could reduce branch misprediction stalls caused by database workloads. Our results show that branch mispredictions have dropped from 20% to 7%, indicating improvements in object code layout and in the branch prediction logic.

7.2 Index Scans

The results for index scans show the least improvement. The processors need bigger L1 I-caches for index queries, smarter branch predictors, and/or more optimized object code. Memory stalls still account for 50 percent of execution time, and the memory stall breakdown shows that instruction misses are the major barrier to performance. The processor is starved most of the time because it cannot get enough instructions to execute. Any improvements are limited by the fact that the code to navigate an index is larger and more complicated than the code to read sequential records. Unless the system knows exactly what index traversals are going to be made ahead of time, it cannot prefetch the next page or even reference pages in parallel. The latency of reading pages is considerably reduced since we are using a memory-resident database, but there is still a bottleneck; it has simply shifted from disk to main memory.

Once again, branch mispredictions are probably a major cause of instruction stalls. Branch mispredictions here account for 20 percent of execution time, exactly the same percentage as five years ago. The total lack of improvement here may imply that the major reasons for improved sequential scan performance are tighter object code and a larger L1 I-cache, not any improvements in the branch predictor.

7.3 Joins

Surprisingly, join execution utilizes the hardware better than sequential scan or index queries. One reason could simply be that DBMS designers have spent much time optimizing join code. A second reason could be that there is more computation to be done: the operations on a single record during a join consist of a range check, an equality comparison (since it is an equijoin), and an application of the aggregate operation, whereas the sequential and index scans perform only a range check and the aggregate operation. Since each of these steps can be implemented with a single x86 assembly instruction, the join has 50% more computation to perform.

The memory stall behavior of the join is most similar to that of the sequential scan. This is not surprising, since the query plan on System C shows that the join is executed as a nested loop of sequential scans.
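The access pattern implied by that plan can be sketched as follows; the structs and in-memory arrays are hypothetical stand-ins for the DBMS's page and record structures, not its actual code.

    #include <stddef.h>

    struct order    { int o_orderkey; };
    struct lineitem { int l_orderkey; int l_partkey; double l_tax; };

    /* Nested loop of sequential scans computing AVG(L_TAX) for the equijoin
     * with a range predicate on L_PARTKEY. */
    double join_avg_tax(const struct order *o, size_t no,
                        const struct lineitem *l, size_t nl,
                        int min, int max)
    {
        double sum = 0.0; size_t cnt = 0;
        for (size_t i = 0; i < no; i++) {                 /* outer sequential scan */
            for (size_t j = 0; j < nl; j++) {             /* inner sequential scan */
                if (l[j].l_orderkey == o[i].o_orderkey && /* equijoin predicate    */
                    l[j].l_partkey > min && l[j].l_partkey <= max) {
                    sum += l[j].l_tax;                    /* aggregate update      */
                    cnt++;
                }
            }
        }
        return cnt ? sum / cnt : 0.0;
    }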
These loops are probably very tight, since only 3-5 instructions are required per inner-relation record. More code is required to fetch the next page when the loop is done with one, but the low percentage of L1 I-cache miss time indicates that either most of this code fits into the L1 I-cache, or it is called a small number of times. The L2 data misses can probably be improved in the same way as for sequential scans: better data placement policies so that attributes that are often scanned individually are stored together [1].

8. Conclusions

Despite the performance optimizations found in today's database systems, they are not able to take full advantage of many recent improvements in processor technology and data placement. Based on a simple query execution time framework, we analyzed the behavior of two commercial DBMSs running simple selection and join queries on two different modern processors and memory architectures. The results illustrate the close interplay between hardware and software that database designers must keep in mind if they are to continue to improve DBMS performance.

The results from our experiments on the Pentium III suggest that database developers should pay more attention to data layout at the second-level cache rather than the first, because L2 data stalls are a major component of the query execution time, whereas L1 D-cache stalls are insignificant. In addition, although first-level instruction cache misses have been reduced for sequential scans, they still dominate the memory stalls of index scan queries; there should be more focus on optimizing the critical paths for the instruction cache. Performance improvements should address all of the stall components in order to effectively increase the percentage of execution time spent in useful computation.

However, our results on the Pentium 4 indicate that some of these problems, such as L1 instruction cache stalls, can be addressed by the hardware. In their place, other bottlenecks arise, such as the L1 data cache. What works very well on one hardware system may perform very poorly on another, and the DBMS designer is well advised to keep these differences in mind when optimizing a product.

References

[1] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. In Proceedings of VLDB, 2001.

[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In Proceedings of VLDB, 1999.


More information

Intel released new technology call P6P

Intel released new technology call P6P P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Lecture 1 An Overview of High-Performance Computer Architecture. Automobile Factory (note: non-animated version)

Lecture 1 An Overview of High-Performance Computer Architecture. Automobile Factory (note: non-animated version) Lecture 1 An Overview of High-Performance Computer Architecture ECE 463/521 Fall 2002 Edward F. Gehringer Automobile Factory (note: non-animated version) Automobile Factory (note: non-animated version)

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

SAE5C Computer Organization and Architecture. Unit : I - V

SAE5C Computer Organization and Architecture. Unit : I - V SAE5C Computer Organization and Architecture Unit : I - V UNIT-I Evolution of Pentium and Power PC Evolution of Computer Components functions Interconnection Bus Basics of PCI Memory:Characteristics,Hierarchy

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

211: Computer Architecture Summer 2016

211: Computer Architecture Summer 2016 211: Computer Architecture Summer 2016 Liu Liu Topic: Assembly Programming Storage - Assembly Programming: Recap - Call-chain - Factorial - Storage: - RAM - Caching - Direct - Mapping Rutgers University

More information

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

I/O Buffering and Streaming

I/O Buffering and Streaming I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks

More information

CS 136: Advanced Architecture. Review of Caches

CS 136: Advanced Architecture. Review of Caches 1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

Wide Instruction Fetch

Wide Instruction Fetch Wide Instruction Fetch Fall 2007 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs470 edu/courses/eecs470 block_ids Trace Table pre-collapse trace_id History Br. Hash hist. Rename Fill Table

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

Show Me the $... Performance And Caches

Show Me the $... Performance And Caches Show Me the $... Performance And Caches 1 CPU-Cache Interaction (5-stage pipeline) PCen 0x4 Add bubble PC addr inst hit? Primary Instruction Cache IR D To Memory Control Decode, Register Fetch E A B MD1

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Week 6 out-of-class notes, discussions and sample problems

Week 6 out-of-class notes, discussions and sample problems Week 6 out-of-class notes, discussions and sample problems We conclude our study of ILP with a look at the limitations of ILP and the benefits and costs of dynamic versus compiler-based approaches to promote

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Supra-linear Packet Processing Performance with Intel Multi-core Processors

Supra-linear Packet Processing Performance with Intel Multi-core Processors White Paper Dual-Core Intel Xeon Processor LV 2.0 GHz Communications and Networking Applications Supra-linear Packet Processing Performance with Intel Multi-core Processors 1 Executive Summary Advances

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Skewed-Associative Caches: CS752 Final Project

Skewed-Associative Caches: CS752 Final Project Skewed-Associative Caches: CS752 Final Project Professor Sohi Corey Halpin Scot Kronenfeld Johannes Zeppenfeld 13 December 2002 Abstract As the gap between microprocessor performance and memory performance

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

2

2 1 2 3 4 5 6 For more information, see http://www.intel.com/content/www/us/en/processors/core/core-processorfamily.html 7 8 The logic for identifying issues on Intel Microarchitecture Codename Ivy Bridge

More information