DBMSs on a Modern Processor: Where Does Time Go? Revisited


Matthew Becker, Information Systems, Carnegie Mellon University (mbecker+@cmu.edu)
Naju Mancheril, School of Computer Science, Carnegie Mellon University (naju@cmu.edu)
Steven Okamoto, School of Computer Science, Carnegie Mellon University (sokamoto@cs.cmu.edu)

Abstract

In 1999, the performance of four commercial DBMSs was analyzed and broken down into the amount of time spent doing useful computation and the amount of time spent stalling for data to be retrieved from memory. In this paper, we aim to revisit two of those DBMSs and see if their interactions with the underlying hardware architecture have improved. We use the same metrics as the previous paper to facilitate comparison between the two papers. We divide total execution time into time spent on useful computation, L1 and L2 cache misses, resource (functional unit) stalls, and branch misprediction (pipeline draining). It is important to note that we are not interested in analyzing how fast these DBMSs are. Instead, we are interested in determining the relative contributions of stalls and letting both DBMS and microprocessor designers locate these performance bottlenecks in their systems and ameliorate them. We find that despite the performance optimizations found in today's database systems, they are not able to take full advantage of many recent improvements in processor technology and data placement.

1. Introduction

A major aspect of studying a piece of software with the intention of improving its performance is analyzing what the processor is doing while that software executes. This kind of inspection can reveal programmatic problems which, when changed, can greatly improve the performance of a particular program. Today's modern processors employ a number of sophisticated techniques that attempt to enhance performance by overlapping the execution of computational and memory-related operations.

In 1999, the performance of four commercial DBMSs was analyzed and broken down into the amount of time spent doing useful computation and the amount of time spent stalling for data to be retrieved from memory [2]. The 1999 results showed that:

- On average, at least half the time is spent in stalls (implying that database designers could improve performance by programmatically attacking the causes of these stalls).
- In all cases the main causes of the stalls were L2 cache data misses (implying that data placement should focus on the L2 cache) and L1 instruction misses (implying that instruction placement should focus on the L1 instruction cache).
- About 20 percent of the stalls are caused by subtle implementation details, the most notable being the number of branch mispredictions.

In this paper, we aim to revisit two of those DBMSs and see if their interactions with the underlying hardware architecture have improved. We use the same metrics as the previous paper to facilitate comparison between the two papers. We divide total execution time into time spent on useful computation, L1 and L2 cache misses, resource (functional unit) stalls, and branch misprediction (pipeline draining). It is important to note that we are not interested in analyzing how fast these DBMSs are. Instead, we are interested in determining the relative contributions of stalls and letting both DBMS and microprocessor designers locate these performance bottlenecks in their systems and ameliorate them.
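Concretely, the breakdown used throughout this paper follows the framework of [2]. Treating the components as additive is itself an approximation, since out-of-order hardware overlaps part of the stall time with computation:

    T_Q = T_C + T_M + T_B + T_R

where T_Q is the total query execution time, T_C the useful computation time, T_M the memory stall time, T_B the branch misprediction penalty, and T_R the resource stall time. Each stall term is estimated from hardware event counts multiplied by an assumed per-event penalty for the processor at hand.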
We perform our measurements with two leading commercial DBMS releases on the following modern hardware platforms:

- an Intel Pentium 4 processor system with 1 GB of RAM
- a quad Intel Pentium III processor system with 4 GB of RAM

Due to time constraints we are unable to present results for both DBMSs on both hardware platforms. Instead we present results for one DBMS running on the Pentium III and another set of results for the other DBMS running on the Pentium 4. As future work we would run both DBMSs on both hardware platforms and present two sets of results for each. For the scope of this paper we run System C from [2] on the Pentium III processor and System B, again from [2], on the Pentium 4 processor. Unless otherwise mentioned or indicated, we refer to both systems concurrently.

If the conclusions of [2] impacted DBMS design, we expect to see more time devoted to useful computation and to resource stalls. An increase in resource stalls would imply that the processor is no longer fast enough to keep up with the database, thus shifting the bottleneck from memory to the processor.

2. Database Workload

The workloads that we used in this paper are divided into two main groups: single-table range selections and two-table equijoins, all of which are performed over a memory-resident database running a single command stream. The characteristics of this type of workload divorce our analysis from the need to consider dynamic parameters, such as concurrency control among multiple transactions or disk I/O speed, and instead isolate basic operations such as sequential access and index selection [2]. We ran all queries against the two following relations to collect data (note: only fields that are actually used in our queries are listed; all other fields have placeholders):

CREATE TABLE LINEITEM (
    L_ORDERKEY INTEGER NOT NULL,
    L_PARTKEY INTEGER NOT NULL,
    <5 other fields>,
    L_TAX FLOAT NOT NULL,
    <rest of fields> )

CREATE TABLE ORDERS (
    O_ORDERKEY INTEGER NOT NULL,
    <rest of fields> )

The lineitem table contains variable-length tuples. L_ORDERKEY has uniformly distributed values starting at 1, stored in sorted order in the relation. L_PARTKEY has uniformly distributed values, randomly ordered in the database. The 5 fields between L_PARTKEY and L_TAX are used to prevent a record's L_TAX from being in the same cache line as its L_PARTKEY. The orders table contains variable-length records; its O_ORDERKEY values also start at 1. This table is used to join with the lineitem table for the queries that require a join.
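To make the cache-line argument behind the lineitem layout concrete, the following is a hypothetical C rendering of the record; the filler size is an assumption chosen so that L_PARTKEY and L_TAX can never share a 64-byte cache line, which is what the five intervening fields are meant to guarantee. The real systems' storage formats are, of course, not visible to us.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical lineitem record layout. The point is only that the filler
     * between L_PARTKEY and L_TAX spans more than one 64-byte cache line, so
     * touching the predicate column does not drag the aggregated column into
     * the cache. */
    struct lineitem_rec {
        int    l_orderkey;
        int    l_partkey;
        char   filler[96];     /* stands in for the 5 unused fields (assumed size) */
        double l_tax;
        /* <rest of fields> */
    };

    int main(void)
    {
        printf("L_PARTKEY at offset %zu, L_TAX at offset %zu\n",
               offsetof(struct lineitem_rec, l_partkey),
               offsetof(struct lineitem_rec, l_tax));
        /* the offsets differ by well over 64 bytes, so the two fields
         * always land on different cache lines */
        return 0;
    }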
2.1 Sequential Range Selection

This is the first of the two groups of queries that consist of single-table range selections. It emphasizes the performance of the DBMS's straight sequential scan over a file. The query used in this case is:

SELECT AVG(L_TAX)
FROM lineitem
WHERE L_PARTKEY > min AND L_PARTKEY <= max;

This query is performed in the absence of an index on L_PARTKEY. We verified that the access path for this query involved a sequential scan of the entire table. In this case we want to examine the effects of query selectivity on the performance of the sequential scan. The values used for min and max above define the range of tuples that we are interested in; by changing the difference between these values we are able to change the selectivity of the queries. We measured data for the above query using the following selectivities: 1%, 10%, 50% and 100%. For each selectivity the query was run multiple times selecting different ranges of data (with the same selectivity), with the obvious exception of the 100% selectivity.

We chose to use an aggregate operation to match the methodology used in [2]. There are two explanations presented in that paper as to why the use of an aggregate operation is a good idea, only one of which is still pertinent in this paper. By using an aggregate operation, the number of results that actually need to be returned is very small (in this case it is one). This is a good idea because our measurements will not be affected by the overhead associated with client/server communication. Alternatively, storing a large number of records in a temporary relation to be returned as results would affect the measurements due to the extra insertion operations.
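As an illustration of how such ranges can be generated, the sketch below picks a random (min, max] window whose width is the desired fraction of the key domain. KEY_MAX is a placeholder, since the actual maximum L_PARTKEY value is not preserved above; nothing here is the harness we actually used.

    #include <stdio.h>
    #include <stdlib.h>

    #define KEY_MAX 1000000L   /* placeholder for the maximum L_PARTKEY value */

    /* Choose a random (min, max] range covering the requested fraction of a
     * key uniformly distributed on [1, KEY_MAX]. */
    static void pick_range(double selectivity, long *min, long *max)
    {
        long width = (long)(selectivity * KEY_MAX);   /* fraction of the key domain */
        long lo = rand() % (KEY_MAX - width + 1);     /* random placement of the range */
        *min = lo;
        *max = lo + width;                            /* predicate: key > min AND key <= max */
    }

    int main(void)
    {
        double sels[] = { 0.01, 0.10, 0.50, 1.00 };
        srand(42);
        for (int s = 0; s < 4; s++) {
            long min, max;
            pick_range(sels[s], &min, &max);
            printf("selectivity %.0f%%: L_PARTKEY > %ld AND L_PARTKEY <= %ld\n",
                   sels[s] * 100, min, max);
        }
        return 0;
    }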

2.2 Indexed Range Selection

This is the second group of queries that make up the single-table selection group. It is used to reason about the performance of the DBMS when it goes through an index. We used a query similar to the one used in the sequential range selection, but we augmented the database with an index on the L_PARTKEY attribute. We again verified that the access path for each of the queries went through that index. We also used the same set of selectivities as above.

2.3 Sequential Join

This is the only group of queries that we perform in the category of two-table equijoins. It allows us to examine the performance of the DBMS while it is carrying out a sequential join between two tables. In this case we ensure that the join is performed between two sequential scans of a file. We chose to join the orders table with the lineitem table on the common attribute O_ORDERKEY (L_ORDERKEY in the case of lineitem). The query that we used was:

SELECT AVG(L_TAX)
FROM lineitem, orders
WHERE L_ORDERKEY = O_ORDERKEY
  AND L_PARTKEY > min AND L_PARTKEY <= max;

Each query was performed in the absence of the above-mentioned index. We also verified that the access patterns for both tables were sequential scans. Again we chose the aggregate operation to minimize the amount of time the DBMS spends creating the result set, and we used min and max to enforce different selectivities for the join just as we did in the previous queries.

3. Pentium III Experimental Setup - System C

3.1 Hardware Platform

  Processor:  (4 x) Pentium III, 700 MHz
  L1 Cache:   16 KB L1 cache on each processor
  L2 Cache:   2 MB unified cache
  Memory:     4 GB
  Bus Speed:  100 MHz

The computer that we used to run the experiments on System C contains four Pentium III processors running at 700 MHz, each with 16 KB L1 instruction and data caches and a unified 2 MB L2 cache. The system contains 4 GB of main memory connected to the processor chips through a 100 MHz system bus. The Pentium III is a powerful server processor with an out-of-order engine and speculative instruction execution. The x86 instruction set is composed of CISC instructions; each instruction is translated into no more than three RISC instructions (µops) during the decode phase of the pipeline.

There are two levels of non-blocking cache in the system. There are separate first-level caches for instructions and data, whereas at the second level the cache is unified. The cache characteristics are summarized in Figure 1.

Figure 1. A diagram of the Pentium III's memory hierarchy.

3.2 Software Platform

Experiments were conducted on System C, the name of which cannot be disclosed here due to legal restrictions. System C was installed on Redhat Linux 7.1 (Seawolf). The DBMSs were configured the same way in order to achieve as much consistency as possible. The buffer pool size was made large enough to fit the datasets for all the queries into main memory because the objective is to measure pure processor and memory performance. To define the schema and execute the queries, the same commands and datasets were used for all the DBMSs, with no vendor-specific SQL extensions.

3.3 Measurement Tools and Methodology

The Pentium III processor provides two hardware-based counters for event measurements. We used emon, a tool graciously provided by Intel, to control these counters. emon can set the counters to zero, assign event codes to them, and read their values either after a pre-specified amount of time or after a program has completed execution. Emon was used to measure 124 event types for the results presented in this report. We measured each event type in both user and kernel mode and summed the results from each of the four processors.
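emon itself is an Intel-internal tool, so its interface cannot be reproduced here. As a stand-in, the sketch below shows the same measurement cycle (zero a counter, select an event code, run a unit of work, read the count) using the Linux perf_event interface available on current kernels; the event choice and run_query_batch() are placeholders, not part of the original setup.

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static void run_query_batch(void)
    {
        /* placeholder: issue the ten-query unit against the DBMS here */
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type     = PERF_TYPE_HARDWARE;
        attr.size     = sizeof(attr);
        attr.config   = PERF_COUNT_HW_CACHE_MISSES;  /* one event code per counter */
        attr.disabled = 1;                           /* start stopped, like zeroing emon's counters */

        int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) return 1;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);          /* set the counter to zero   */
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        run_query_batch();                           /* the measured unit of work */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        read(fd, &count, sizeof(count));             /* read the accumulated event count */
        printf("events counted: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }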
Before taking measurements for a query, the main memory and caches were warmed up with multiple runs of this query. In order to distribute and minimize the effects of the client/server startup overhead, the unit of execution consisted of ten different queries on the same database, with the same selectivity. Each time emon executed one such unit, it measured a pair of events. To tighten the confidence intervals, the experiments were repeated several times; the final sets of numbers exhibit a standard deviation of less than 5 percent. Finally, the execution time breakdown was computed using a set of formulae similar to the ones used in [2], updated with the new events and with new penalty estimates for the new architectures.

4. Pentium 4 Experimental Setup - System B

4.1 Hardware Platform

System B was tested on a 3 GHz uniprocessor Pentium 4 machine with 1 GB of RAM, a 512 KB on-die unified L2 cache, an 8 KB L1 data cache, and a 12 Kµop trace cache. The trace cache replaces the traditional L1 instruction cache: an instruction cache holds CISC macroinstructions in spatially localized cache lines, whereas the trace cache contains decoded µops from an execution trace, thereby exploiting temporal locality. This greatly helps both the hit rate of the cache (µops within basic blocks always hit) and its storage efficiency (µops from spatially distant but temporally close locations, i.e., branch targets, can be stored on the same cache line). In addition, the trace cache obviates the need for macroinstruction decoding of cached µops. On a trace cache miss, the instruction is fetched and decoded, and a new trace is built.

The Pentium 4 also features a much improved branch predictor, with a branch target buffer eight times larger than the Pentium III's. This is needed both to improve trace cache performance and to avoid the costly branch misprediction penalty imposed by the Pentium 4's deeper pipeline (20 stages versus 10 for the Pentium III). The Pentium 4 also supports hardware prefetching to reduce memory latency. Deeper buffering of the out-of-order execution logic and more powerful register renaming in the Pentium 4 are designed to reduce the number of resource stalls and increase the number of simultaneous in-flight operations. Hyperthreading, which allows each physical processor to act as two logical processors, was disabled for these experiments as a simplification. (Preliminary work indicated that the simple microbenchmarks we tested were overwhelmingly executed on one logical processor; the sharing of performance counters between logical processors for some events added unnecessary complexity given the marginal advantages of enabling Hyperthreading.) Technical details on the Pentium 4 hardware are given in Table 1.

Table 1. Pentium 4 hardware system
  Processor:       Pentium 4 (Northwood core), 3 GHz
  Trace cache:     8-way, 12 Kµop trace cache, 6 µop line
  L1 data cache:   4-way, 8 KB, 64 B line, non-blocking, write-through, 2 cycle latency
  L2 cache:        8-way, 512 KB unified, 64 B line, non-blocking, write-back, 18 cycle latency
  Memory:          1 GB, latency 300 cycles
  Bus:             200 MHz, quad-pumped
  Prefetching:     Enabled
  Hyperthreading:  Disabled

4.2 Software Platform

Experiments were conducted on System B, the name of which cannot be disclosed here due to legal restrictions. System B was installed on Redhat Linux 9 (Shrike). The DBMS was configured the same way as System C in order to achieve as much consistency as possible. The buffer pool size was made large enough to fit the tables for all the queries into main memory because the objective is to measure pure processor and memory performance. To define the schema and execute the queries, the same commands and datasets were used for all the DBMSs, with no vendor-specific SQL extensions.
4.3 Measurement Tools and Methodology

The Pentium 4 processor provides 18 hardware counters for event measurements. Emon was also used on the Pentium 4, although the different microarchitecture led to slightly different events being measured than with the Pentium III. Again, we measured events generated by kernel and user programs separately. As with System C, the DBMS was first warmed up, and queries were run in batches of ten to amortize overhead. Memory latencies were calculated from the event results; this was cross-checked using LMBench, which we also used to verify the L1D and L2 latencies.
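A latency cross-check in the spirit of the LMBench one can be sketched with the classic pointer-chasing loop: each load depends on the previous one and the ring is randomly linked, so prefetching cannot help and the elapsed time per access approximates the load latency of whatever cache level the ring fits in. The sizes below are illustrative assumptions, not our actual configuration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)          /* 4M pointers: ~32 MB, far larger than the L2 */
    #define ACCESSES 10000000L

    int main(void)
    {
        void **ring = malloc(N * sizeof(void *));
        size_t *perm = malloc(N * sizeof(size_t));
        for (size_t i = 0; i < N; i++) perm[i] = i;
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
            size_t j = rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < N; i++)              /* link each element to the next in the permutation */
            ring[perm[i]] = &ring[perm[(i + 1) % N]];

        void **p = &ring[perm[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ACCESSES; i++)
            p = (void **)*p;                        /* each load depends on the previous one */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.1f ns per dependent load (p=%p)\n", ns / ACCESSES, (void *)p);
        free(ring); free(perm);
        return 0;
    }

Shrinking N until the ring fits in the L1D or the L2 gives the corresponding hit latencies.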

5. Pentium III Results

In this section, we first present an overview of the execution time breakdown. Then, we focus on the chief barrier to computation: memory stalls. We divide memory stalls into L1 versus L2 cache misses, and again into data and instruction misses. Since the active processor executed in user mode more than 95% of the time, all of the measurements shown in this section reflect user-mode execution, unless stated otherwise.

5.1 Execution Time Breakdown

Figure 2 shows three graphs, each summarizing the average execution time breakdown for one of the queries. Each bar shows the contribution of the four components (T_C, T_M, T_B, and T_R) as a percentage of the total query execution time. Although the workload is much simpler than the TPC benchmarks, the computation time is always less than half the execution time; thus, the processor spends most of the time stalled. As processor clocks become faster, the computation component will become a smaller part of overall execution time, while miss penalties will remain high since memory access times do not decrease as quickly.

The memory stall time contribution varies across different queries. For example, Figure 2 shows that when System C executes the sequential range selection, it spends 30% of the time in memory stalls. When the same system executes the indexed range selection, the memory stall time contribution becomes 50%. Although the indexed range selection accesses fewer records, its memory stall component is larger than in the sequential selection, probably because the index traversal has less spatial locality than the sequential scan. Analysis of the memory behavior shows that 90% of T_M is due to L1 I-cache and L2 data misses in all of the experiments. Minimizing memory stalls has been a major focus of database research on performance improvement [1]. Although in most cases the memory stall time (T_M) accounts for most of the overall stall time, the other two components are always significant: branch misprediction stalls account for 10-20% of the execution time, and the resource stall time contribution stays roughly constant around 25%.

Figure 2. Graphs of the overall execution time breakdown for each type of query.

5.2 Memory Stalls

Since the memory stalls comprise such a large portion of the overall execution time, it is important to determine their causes. This section discusses the significance of the memory stall components to the query execution time, according to the framework discussed earlier. Figure 3 shows the breakdown of T_M into the following stall time components: T_L1D (L1 D-cache miss stalls), T_L1I (L1 I-cache miss stalls), T_L2D (L2 cache data miss stalls), T_L2I (L2 cache instruction miss stalls), and T_ITLB (ITLB miss stalls), with one graph for each type of query.

It is clear that L1 D-cache stall time is insignificant. An L1 D-cache miss that hits in the L2 cache incurs low latency, which can usually be overlapped with other computation. The stall time caused by L2 cache instruction misses (T_L2I) and ITLB misses (T_ITLB) is also insignificant in all the experiments. T_L2I contributes little to the overall execution time because second-level cache instruction misses are two to three orders of magnitude fewer than first-level instruction cache misses. The low T_ITLB indicates that the systems use few instruction pages, and the ITLB is enough to store the translations for their addresses.
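Using these component names, T_M is estimated with the same count-times-penalty form as in [2]; the penalty factors are notation for assumed per-miss costs on this processor, not separately measured values:

    T_M = T_L1D + T_L1I + T_L2D + T_L2I + T_ITLB
        ≈ (#L1D misses × L2 hit latency) + (#L1I misses × L2 hit latency)
          + (#L2 data misses + #L2 instruction misses) × main-memory latency
          + (#ITLB misses × ITLB miss penalty)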

The rest of this section discusses the two major memory-related stall components, T_L2D and T_L1I.

For all of the queries, T_L2D (the time spent on L2 data stalls) is one of the most significant components of the execution time. This is most evident in the case of the sequential scan and join queries, in which L2 data stalls comprise 70% of all memory stall time. We believe that the index scan's L2 data performance is just as poor, but because the number of L1 I-cache stalls is so large, the data load penalties are hidden from us. In other words, data loads into the L2 cache can be overlapped with instruction loads into the L1.

Stall time due to misses at the first-level instruction cache (T_L1I) is a major memory stall component for all three sets of queries. The results in this study reflect the real I-cache stall time, with no approximations. The Pentium III uses stream buffers for instruction prefetching, but L1 I-cache misses are still a bottleneck. T_L1I is difficult to overlap, because L1 I-cache misses cause a serial bottleneck in the pipeline. The average contribution of T_L1I to the execution time is 20%.

Figure 3. Graphs of the breakdown of memory stalls into different components.

5.3 Branch Mispredictions

Branch mispredictions have serious performance implications, because (a) they cause a serial bottleneck in the pipeline and (b) they cause instruction cache misses, which in turn incur additional stalls. Branch instructions account for 20% of the total instructions retired in our index scans.

5.4 Resource Stalls

Resource-related stall time is the time during which the processor must wait for a resource to become available. Such resources include functional units in the execution stage, registers for handling dependencies between instructions, and other platform-dependent resources. The contribution of resource stalls to the overall execution time is fairly stable across the queries. In all cases, resource stalls are dominated by dependency and/or functional unit stalls. Functional unit availability stalls are caused by bursts of instructions that create contention in the execution unit. Memory references account for at least half of the instructions retired, so it is possible that one of the resources causing these stalls is a memory buffer.

Resource stalls are an artifact of the lowest-level details of the hardware. The compiler can produce code that avoids resource contention and exploits instruction-level parallelism, but this is difficult with the x86 instruction set, because each CISC instruction is internally translated into smaller instructions (µops). Thus, there is no easy way for the compiler to see the correlation across multiple x86 instructions and optimize the instruction stream at the processor execution level.
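One software-level attack on the branch misprediction component described above is to evaluate the selection predicate without a data-dependent conditional jump, which is essentially what the fall-through trick attributed to System A in Section 7 accomplishes. The sketch below contrasts the two forms for the sequential-scan aggregate; the names are hypothetical and this is not any vendor's actual code.

    #include <stddef.h>

    struct rec { int l_partkey; double l_tax; };

    /* Conventional form: one conditional branch per record, which mispredicts
     * most often when selectivity is near 50%. */
    double avg_tax_branching(const struct rec *r, size_t n, int min, int max)
    {
        double sum = 0.0; size_t cnt = 0;
        for (size_t i = 0; i < n; i++) {
            if (r[i].l_partkey > min && r[i].l_partkey <= max) {
                sum += r[i].l_tax;
                cnt++;
            }
        }
        return cnt ? sum / cnt : 0.0;
    }

    /* Branch-free form: the predicate becomes a 0/1 value that scales the update,
     * so there is no data-dependent jump for the predictor to guess. */
    double avg_tax_branchfree(const struct rec *r, size_t n, int min, int max)
    {
        double sum = 0.0; size_t cnt = 0;
        for (size_t i = 0; i < n; i++) {
            int keep = (r[i].l_partkey > min) & (r[i].l_partkey <= max);
            sum += keep * r[i].l_tax;
            cnt += (size_t)keep;
        }
        return cnt ? sum / cnt : 0.0;
    }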

Figure 4. Execution time breakdown for sequential scans, index scans, and joins on System B on the Pentium 4.

6. Pentium 4 Results

System B was tested on the Pentium 4 processor using the same queries as were used with System C. The execution time was again broken down into the four principal components: useful computation time, memory stalls, branch misprediction penalties, and resource stalls (Figure 4).

6.1 Computation and Resource Stalls

As we observed earlier with System C, very little time is actually spent doing useful computation. This is most likely more an effect of the underlying hardware than of the DBMS itself. The memory and resource stalls on the Pentium 4 are exacerbated by the high clock speed, and they prevent the processor from performing useful operations. The majority of the resource stalls are due to a lack of store buffers in the out-of-order execution engine.

6.2 Branch Mispredictions

Only a tiny fraction of the execution time goes to branch mispredictions, which stands in marked contrast to the results obtained with System C on the Pentium III. This is because the Pentium 4's branch predictor is much more accurate than the Pentium III's. As expected, the contribution of branch mispredictions is greater for index scans (which must branch through the index) than for the highly regular sequential scans (which involve only a very tight loop). Also, branch misprediction reaches a maximum at a selectivity of 50% for sequential scans, since this is when a tuple is equally likely to be selected as rejected; because joins use sequential scans to read the relations, the same phenomenon is seen with joins. Since the branch mispredictions for index scans are dominated by the effects of the index, no such effect is seen in the middle graph.

6.3 Memory Stalls

Memory stalls remain the dominant contributor to execution time. We show the memory stall breakdown in Figure 5. Due to a limitation in the performance events supported by the Pentium 4, we are not able to distinguish L2 stalls caused by instruction misses from those caused by data misses. However, it is reasonable to assume that most of the misses are caused by data, since most instructions will hit in the L2. The memory stalls are due almost entirely to L2 and L1D stalls. The bottleneck previously posed by the L1 instruction cache has been removed through the introduction of the Pentium 4's trace cache. For the highly regular sequential scans and sequential joins, trace cache stalls make up less than 3% of memory stall time. For the less predictable index scans, a higher branch misprediction rate leads to a higher trace cache miss rate (around 20% for index scans versus about 5% for sequential scans and joins) and a correspondingly higher stall contribution. The ITLB's contribution to memory stall time is insignificant for all query types tested. This is not surprising, since the ITLB is no longer even accessed for the majority of µops: it is only needed on a trace cache miss, and the low miss rate of the trace cache more than makes up for the longer ITLB miss penalty.

8 " * ) ( $ ' $ %& # " ( ' & " % " #$ " #$ %&& '&' ( )& *&+ '&' #, '&' -( #. )&/ '&' Figure 5. Memory stall breakdown for sequential scans, index scans, and joins on System B on Pentium 4 the ITLB is no longer even accessed for the majority of µops. It is only needed on a trace cache miss, and the low miss rate of the trace cache more than makes up for the longer ITLB miss penalty. The large contribution from L1 data stalls is attributable to the Pentium 4 microarchitecture. The Pentium 4 s L1 data cache is both smaller than the Pentium III s, and has a larger cache line. This was a design choice to reduce hit latency and allow the L1D to keep up with the processor. However, it has the effect of increasing the miss rate for two memory locations that are not stored on the same cache line. This problem is compounded by the much higher clock speed of the Pentium 4, which effectively increases the miss penalty because the L2 latency (measured in clock cycles) is much higher than in the Pentium III. Indeed, when we reduce the miss penalty to that of the Pentium III, the L1D stall contribution drops to 20%. Correcting for the higher hit rate lowers the contribution by still more. In any case, because this is a percentage (not absolute) graph, the L1D contribution would have increased when the L1 instruction (trace cache) contribution decreased. 7. Where Do 5 Years Go? Five years ago, [2] determined that the main barriers to performance were L1 instruction stalls, L2 data stalls, branch mispredictions, and resource stalls. As mentioned earlier, there is a limit to how effectively the compiler can optimize for resource stalls due to the restrictions imposed by µop translation on X86. The other stalls, however, may be improved through better data placement schemes, improved branch predictors, and smarter compilation techniques. One system (System A), even eliminated branch mispredictions entirely on sequential scan queries by making all conditional jumps fall-through. In this section, we compare System C s performance with what was measured five years ago. The designers have clearly optimized the behavior of some queries, but there is still considerable room for improvement in others. The dramatic change in hardware from the P6 microarchitecture (used in the Pentium II tested in [2] and in the Pentium III) to the NetBurst microarchitecture used in the Pentium 4 precludes us from effectively carrying out a similar analysis for the System B at this time Sequential Scans Either the DBMS developers have tightened up their sequential scan code, or the L1 I-cache is now large enough to accomodate all of it. L1-I misses have shrunk from 40 to 20 percent. ITLB misses have shrunk from 10 percent to around 5. It appears that the bottleneck has shifted to L2 data misses which have increased from 60% to 80% of total execution time. [2] referenced some techniques to use the L1 I-cache more effectively. The authors correctly predicted that the first-level cache size will not increase at the same rate as the second-level cache size, because large L1 caches are not as fast and may slow down the processor clock. Our system has an L2 cache that is 4x the size of the original system, but the L1 cache is the same size. Consequently, the DBMSs must improve spatial locality in the instruction stream. Possible techniques include storing together

Consequently, the bottleneck in scan execution has shifted to L2 data stalls. As mentioned earlier, our records were created so that fetching the attribute referenced in the WHERE clause would not also bring the attribute that we were selecting into the cache. The cache logic does not know that we are performing a scan on a single attribute; its hands are tied by the fact that it must pull in a contiguous cache line. To improve cache performance, the DBMS developers must improve data placement policies so that attributes that are often scanned individually are stored together (i.e., not buried in some record) [1].

Branch misprediction stalls are tightly connected to instruction stalls on all superscalar processors. For the Pentium III this connection is tighter, because it uses instruction prefetching. [2] argued the importance of processors being able to efficiently execute even unoptimized instruction streams, so a different prediction mechanism could reduce branch misprediction stalls caused by database workloads. Our results show that branch mispredictions have dropped from 20% to 7%, indicating improvements in object code layout and in the branch prediction logic.

7.2 Index Scans

The results for index scans show the least improvement. The processors need bigger L1 I-caches for index queries, smarter branch predictors, and/or more optimized object code. Memory stalls still account for 50 percent of execution time, and the memory stall breakdown shows that instruction misses are the major barrier to performance. The processor is starved most of the time because it cannot get enough instructions to execute. Any improvements are limited by the fact that the code to navigate an index is larger and more complicated than the code to read sequential records. Unless the system knows exactly what index traversals are going to be made ahead of time, it cannot prefetch the next page or even reference pages in parallel. The latency of reading pages is considerably reduced since we are using a memory-resident database, but there is still a bottleneck; it has simply shifted from disk to main memory.

Once again, branch mispredictions are probably a major cause of instruction stalls. Branch mispredictions here account for 20 percent of execution time, exactly the same percentage as five years ago. The total lack of improvement here may imply that the major reasons for improved sequential scan performance are tighter object code and a larger L1 I-cache, not any improvements in the branch predictor.

7.3 Joins

Surprisingly, join execution utilizes the hardware better than sequential scan or index queries. One reason could simply be that DBMS designers have spent much time optimizing join code. A second reason could be that there is more computation to be done: the operations on a single record during a join consist of a range check, an equality comparison (since it is an equijoin), and an application of the aggregate operation, whereas the sequential and index scans perform only a range check and the aggregate operation. Since each of these steps can be implemented with a single x86 assembly instruction, the join has 50% more computation to perform.

The memory stall behavior of the join is most similar to that of the sequential scan. This is not surprising, since the query plan on System C shows that the join is executed as a nested loop of sequential scans.
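The access pattern implied by that plan can be sketched as follows; the structs and in-memory arrays are hypothetical stand-ins for the DBMS's page and record structures, not its actual code.

    #include <stddef.h>

    struct order    { int o_orderkey; };
    struct lineitem { int l_orderkey; int l_partkey; double l_tax; };

    /* Nested loop of sequential scans computing AVG(L_TAX) for the equijoin
     * with a range predicate on L_PARTKEY. */
    double join_avg_tax(const struct order *o, size_t no,
                        const struct lineitem *l, size_t nl,
                        int min, int max)
    {
        double sum = 0.0; size_t cnt = 0;
        for (size_t i = 0; i < no; i++) {                 /* outer sequential scan */
            for (size_t j = 0; j < nl; j++) {             /* inner sequential scan */
                if (l[j].l_orderkey == o[i].o_orderkey && /* equijoin predicate    */
                    l[j].l_partkey > min && l[j].l_partkey <= max) {
                    sum += l[j].l_tax;                    /* aggregate update      */
                    cnt++;
                }
            }
        }
        return cnt ? sum / cnt : 0.0;
    }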
These loops are probably very tight, since only 3-5 instructions are required per inner-relation record. More code is required to fetch the next page when the loop is done with one, but the low percentage of L1 I-cache miss time indicates that either most of this code fits into the L1 I-cache, or it is called a small number of times. The L2 data misses can probably be improved in the same way as for sequential scans: better data placement policies so that attributes that are often scanned individually are stored together [1].

8. Conclusions

Despite the performance optimizations found in today's database systems, they are not able to take full advantage of many recent improvements in processor technology and data placement. Based on a simple query execution time framework, we analyzed the behavior of two commercial DBMSs running simple selection and join queries on two different modern processors and memory architectures. The results illustrate the close interplay between hardware and software that database designers must keep in mind if they are to continue to improve DBMS performance.

The results from our experiments on the Pentium III suggest that database developers should pay more attention to data layout at the second-level cache rather than the first, because L2 data stalls are a major component of the query execution time, whereas L1 D-cache stalls are insignificant. In addition, although first-level instruction cache misses have been reduced for sequential scans, they still dominate the memory stalls of index scan queries; there should be more focus on optimizing the critical paths for the instruction cache. Performance improvements should address all of the stall components in order to effectively increase the percentage of execution time spent in useful computation.

However, our results on the Pentium 4 indicate that some of these problems, such as L1 instruction cache stalls, can be addressed by the hardware. In their place, other bottlenecks arise, such as the L1 data cache. What works very well on one hardware system may perform very poorly on another, and the DBMS designer is well advised to keep these differences in mind when optimizing a product.

References

[1] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. In Proceedings of VLDB, 2001.

[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In Proceedings of VLDB, 1999.


More information

Intel released new technology call P6P

Intel released new technology call P6P P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Lecture 1 An Overview of High-Performance Computer Architecture. Automobile Factory (note: non-animated version)

Lecture 1 An Overview of High-Performance Computer Architecture. Automobile Factory (note: non-animated version) Lecture 1 An Overview of High-Performance Computer Architecture ECE 463/521 Fall 2002 Edward F. Gehringer Automobile Factory (note: non-animated version) Automobile Factory (note: non-animated version)

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

SAE5C Computer Organization and Architecture. Unit : I - V

SAE5C Computer Organization and Architecture. Unit : I - V SAE5C Computer Organization and Architecture Unit : I - V UNIT-I Evolution of Pentium and Power PC Evolution of Computer Components functions Interconnection Bus Basics of PCI Memory:Characteristics,Hierarchy

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

211: Computer Architecture Summer 2016

211: Computer Architecture Summer 2016 211: Computer Architecture Summer 2016 Liu Liu Topic: Assembly Programming Storage - Assembly Programming: Recap - Call-chain - Factorial - Storage: - RAM - Caching - Direct - Mapping Rutgers University

More information

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

I/O Buffering and Streaming

I/O Buffering and Streaming I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks

More information

CS 136: Advanced Architecture. Review of Caches

CS 136: Advanced Architecture. Review of Caches 1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

Wide Instruction Fetch

Wide Instruction Fetch Wide Instruction Fetch Fall 2007 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs470 edu/courses/eecs470 block_ids Trace Table pre-collapse trace_id History Br. Hash hist. Rename Fill Table

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

Show Me the $... Performance And Caches

Show Me the $... Performance And Caches Show Me the $... Performance And Caches 1 CPU-Cache Interaction (5-stage pipeline) PCen 0x4 Add bubble PC addr inst hit? Primary Instruction Cache IR D To Memory Control Decode, Register Fetch E A B MD1

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Week 6 out-of-class notes, discussions and sample problems

Week 6 out-of-class notes, discussions and sample problems Week 6 out-of-class notes, discussions and sample problems We conclude our study of ILP with a look at the limitations of ILP and the benefits and costs of dynamic versus compiler-based approaches to promote

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Supra-linear Packet Processing Performance with Intel Multi-core Processors

Supra-linear Packet Processing Performance with Intel Multi-core Processors White Paper Dual-Core Intel Xeon Processor LV 2.0 GHz Communications and Networking Applications Supra-linear Packet Processing Performance with Intel Multi-core Processors 1 Executive Summary Advances

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Skewed-Associative Caches: CS752 Final Project

Skewed-Associative Caches: CS752 Final Project Skewed-Associative Caches: CS752 Final Project Professor Sohi Corey Halpin Scot Kronenfeld Johannes Zeppenfeld 13 December 2002 Abstract As the gap between microprocessor performance and memory performance

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

2

2 1 2 3 4 5 6 For more information, see http://www.intel.com/content/www/us/en/processors/core/core-processorfamily.html 7 8 The logic for identifying issues on Intel Microarchitecture Codename Ivy Bridge

More information