Data Prefetching by Dependence Graph Precomputation

Murali Annavaram, Jignesh M. Patel, Edward S. Davidson
Electrical Engineering and Computer Science Department
The University of Michigan, Ann Arbor
{annavara, jignesh, davidson}

Abstract

Data cache misses reduce the performance of wide-issue processors by stalling the data supply to the processor. Prefetching data by predicting the miss address is one way to tolerate cache miss latencies. But current applications with irregular access patterns make it difficult to predict the address accurately and sufficiently early to mask large cache miss latencies. This paper explores an alternative to predicting prefetch addresses, namely precomputing them. The Dependence Graph Precomputation scheme (DGP) introduced in this paper is a novel approach for dynamically identifying and precomputing the instructions that determine the addresses accessed by those load/store instructions marked as being responsible for most data cache misses. DGP's dependence graph generator efficiently generates the required dependence graphs at run time. A separate precomputation engine executes these graphs to generate the data addresses of the marked load/store instructions early enough for accurate prefetching. Our results show that 94% of the prefetches issued by DGP are useful, reducing the D-cache miss stall time by 47%. Thus DGP takes us about half way from an already highly tuned baseline system toward perfect D-cache performance. DGP improves the overall performance of a wide range of applications by 7% over tagged next line prefetching, by 13% over a baseline processor with no prefetching, and is within 15% of the perfect D-cache performance.

1 Address Prediction vs. Precomputation

Out-of-order execution is the norm in current superscalar processor designs; it is intended to allow processors to tolerate pipeline stalls due to data dependences, resource conflicts, cache misses, etc., by buffering stalled instructions in reservation stations and executing other ready instructions out of program order. However, today's dominant application domains, including databases, multimedia, and games, have large memory footprints and do not use processor caches effectively, resulting in many cache misses. Furthermore, processor clock speeds are continuing to outpace memory access speeds, resulting in larger cache miss latencies (measured in processor cycles). Thus the large number of cache misses in today's dominant application domains, coupled with the increasing cache miss latencies of current processor designs, causes significant performance degradation, even in aggressive out-of-order processors. Prefetching data is one way to reduce the high miss rates of these irregular applications, but most existing data prefetch schemes are not effective in predicting prefetch addresses for irregular applications. Furthermore, previous work [11] has shown that for integer applications it is difficult to both accurately predict prefetch addresses and issue the prefetches in a timely fashion, i.e. sufficiently ahead of the corresponding demand reference to mask the large cache miss latencies. In this paper we explore an alternative to predicting prefetch addresses for irregular access patterns, namely precomputing them. The Dependence Graph Precomputation scheme (DGP) introduced in this paper is a novel approach to prefetching.
Once an instruction is fetched from the I-cache into the Instruction Fetch Queue (IFQ), its dependences are determined and stored as pointers with the instruction in the IFQ. When a load/store instruction that is deemed likely to cause a cache miss enters the IFQ, a Dependence Graph Generator (DGG) follows the dependence pointers still within the IFQ to generate the dependence graph of those instructions yet to be executed that will determine the address of the load/store instruction. The dependence graphs generated by DGG are fed to a separate Precomputation Engine (PE) that executes them to generate the load/store addresses early enough for timely prefetching. DGG does not remove any instructions from the IFQ; consequently all precomputed instructions will be executed in the normal manner by the main processor pipeline. Furthermore, the results generated by PE are used only for prefetching data; in particular they are not used for updating the architected state of the main processor. Since precomputation of the dependence graphs is speculative, it runs ahead of the normal computation by avoiding the fetch queue and reorder buffer delays that the normal computation experiences due to the in-order decode and the instruction scheduling priorities of the main processor pipeline.

Our results show that 98% of the precomputed addresses match the actual address accessed by the marked load/store instructions, indicating that prefetch addresses generated by DGP are highly accurate. DGP actually issued a prefetch for only 15% of these addresses, since 85% of the precomputed addresses were found to be already in the L1 cache. Of the prefetches issued by DGP, 94% are useful, reducing the D-cache miss stall time by 47%. Thus DGP takes us about half way from an already highly tuned baseline system toward perfect D-cache performance. DGP improves the overall performance of a wide range of applications by 7% over tagged next line prefetching, by 13% over a baseline processor with no prefetching, and is within 15% of the performance obtained with a perfect D-cache.

The rest of this paper is organized as follows. Section 2 describes a hardware implementation for generating the dependence graphs of the load/store instructions that are marked as likely to miss. It then describes a mechanism to precompute the instructions in the dependence graphs and generate the prefetch addresses. Section 3 describes the simulation environment and performance analysis tools that we used to assess the effectiveness of DGP. Results of this assessment are presented in Section 4. Section 5 compares DGP with the most relevant prior work in the area of dependence graph generation; it also contrasts DGP with prior data prefetching techniques by highlighting the applicability and constraints of prior schemes that use either address prediction or precomputation for prefetching. Conclusions and some future directions for DGP are presented in Section 6.

2 Dependence Graph Generation and Precomputation

Figure 1 highlights (in bold capital font) the additional hardware added by DGP. For the simulation results of this paper, the set of load/stores that generated 90% of the D-cache misses in a profile run were marked for prefetching during the production execution of each benchmark. The dependence graph of a load/store instruction, I, in the Instruction Fetch Queue (IFQ) is the set of all unexecuted instructions, waiting either in the IFQ or in the Reorder Buffer (ROB), that contribute to the address accessed by I. In this paper we approximate the dependence graph by considering only the set of instructions that are in the IFQ at the time that the dependence graph of I is being generated. Note that the approximate definition ignores the dependence of I on unexecuted instructions waiting in the ROB. To build this approximate dependence graph for a marked load/store instruction, I, in the IFQ, we place I in the graph and recursively find and add to the graph earlier instructions in the IFQ which define the register operands of instructions already in the graph. In the Alpha instruction set architecture used in this study, all instruction operands are either registers or immediate values. Since an immediate operand value is available in the instruction itself, these operands are not dependent on any other instruction, and hence are not considered while generating dependence graphs. In our processor model the fetch stage of the processor pipeline fetches instructions from the I-cache and buffers them in the IFQ. The IFQ is a circular FIFO buffer with 2^N entries that store the fetched instructions in program order.
We modified the IFQ entry format by adding three new fields, namely the Index-Tag bit (IT), OP1 and OP2. Each instruction, as it enters the IFQ, is assigned the next sequential (N+1)-bit index value (modulo 2^(N+1)). The lower N bits of the index determine the IFQ entry number where the instruction will be buffered; the most significant bit (MSB) of the index is stored in the IT bit of that entry. IT is thus toggled between 0 and 1 each time the IFQ wraps around, e.g. the sequence of instructions in the 4th entry of a 16 entry IFQ will have indices that alternate between 00100 and 10100. The IT bit helps prevent stale or incorrect instructions from being included in dependence graphs (see Section 2.1).

[Figure 1. DGP hardware: the main pipeline (Fetch, Predecode, Decode, Execute, Writeback, Commit) with the I-cache, RDV, IFQ (IT, Inst, OP1, OP2 fields), and ROB, plus the added Dependence Graph Generator, Dependence Graph Buffer, Precomputation Engine, and SRF, which issues data prefetches.]

The OP1 field of the IFQ entry for instruction I stores the (N+1)-bit index of an earlier instruction in the IFQ that defines the value of input operand 1 of instruction I; if no earlier instruction still residing in the IFQ defines operand 1, or if I has no input operand 1, then OP1 is set to NULL. OP2 is defined similarly for operand 2. The Predecode stage of the processor pipeline is responsible for carrying out the dependence analysis. To do so, the predecoder is augmented with a new hardware structure, called the Register Definition Vector (RDV). RDV has one slot for each architected register of the processor. RDV slot j stores the (N+1)-bit index value of the instruction that most recently defined register R_j. When the predecoder decodes instruction I, stored in entry i of the IFQ, it accesses the RDV slots corresponding to the input operand fields of I and sets the OP1 and OP2 fields of the i-th IFQ entry to the corresponding indices read from RDV. The predecoder then considers the result register, R_d, of instruction I and sets RDV slot d to the (N+1)-bit index of instruction I (namely i or 2^N + i, according to the IT bit of instruction I).
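To make the bookkeeping concrete, here is a minimal behavioral sketch of the predecode-stage dependence analysis; it is ours, not the authors' hardware description, and it assumes N = 5 (a 32-entry IFQ) and the 32 architected integer registers of the Alpha ISA. All type and field names are illustrative.

```cpp
#include <array>
#include <cstdint>
#include <optional>

constexpr int N = 5;                                // IFQ has 2^N = 32 entries
constexpr uint8_t IDX_MASK = (1u << (N + 1)) - 1;   // (N+1)-bit index space

struct IfqEntry {
    bool     it = false;                 // Index-Tag bit: MSB of this inst's index
    uint32_t inst = 0;                   // raw instruction word
    std::optional<uint8_t> op1, op2;     // (N+1)-bit producer indices, or NULL
};

struct Predecoder {
    std::array<IfqEntry, 1u << N> ifq{};
    // RDV: one slot per architected register, holding the (N+1)-bit index of
    // the instruction that most recently defined that register.
    std::array<std::optional<uint8_t>, 32> rdv{};
    uint8_t next_index = 0;              // sequential (N+1)-bit index, mod 2^(N+1)

    // Called once per fetched instruction, in program order.
    void predecode(uint32_t inst, std::optional<int> src1,
                   std::optional<int> src2, std::optional<int> dest) {
        uint8_t idx = next_index;
        next_index  = (next_index + 1) & IDX_MASK;
        IfqEntry& e = ifq[idx & ((1u << N) - 1)];   // low N bits pick the entry
        e.it   = ((idx >> N) & 1) != 0;             // MSB stored as the IT bit
        e.inst = inst;
        e.op1  = src1 ? rdv[*src1] : std::nullopt;  // producer of operand 1, if any
        e.op2  = src2 ? rdv[*src2] : std::nullopt;  // producer of operand 2, if any
        if (dest) rdv[*dest] = idx;                 // this inst now defines R_dest
    }
};
```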

DGP is not concerned with the remaining stages of the pipeline or the ROB shown in Figure 1. They operate as in a traditional out-of-order superscalar processor and hence are not described here.

2.1 What to Precompute

The Dependence Graph Generator (DGG) scans the predecoded instructions in the IFQ in program order. Whenever it finds a load/store instruction, L, that is marked for prefetching, it initiates the process of building a dependence graph for L. DGG begins this process by storing the marked load/store instruction, L, along with its IT, OP1, and OP2 fields from the IFQ, in a new dependence graph. Then for each instruction, I, added to the dependence graph, DGG follows the pointer in its OP1 field, if any, to an earlier instruction, E, in the IFQ that defines the first input operand of I. It is possible, however, that E has left the IFQ by this time. In this case the IFQ entry vacated by E may be empty, or it may contain a new instruction due to the circular FIFO ordering of the IFQ. If a new instruction, N, occupies the IFQ entry, then the OP1 field of I points to N, which does not define the input operand of I; furthermore N is to be executed after I in program order. DGG therefore adds an instruction to the dependence graph only if OP1 is not NULL, the IFQ entry it points to is not empty, the register defined by the instruction pointed to by OP1 matches the operand 1 register of I, and the IT bit in the entry pointed to by OP1 matches the MSB of the OP1 index. The OP2 field of instruction I is then used similarly to determine the instruction, if any, in the IFQ that defines the second input operand of I. Note that this scheme does not always prevent a newer instruction from being included in the dependence graph: the RDV slot used to define OP1 may hold a stale value, i.e. the IFQ entry it points to may have been overwritten by a newer instruction since the RDV slot was last defined. Usually the IT bit will then fail to match; however, if the IFQ entry that the RDV slot points to has been overwritten an even number of times since the RDV slot was last defined, the IT bit will match the MSB of the OP field. But even in this case, given 32 registers, it is possible but unlikely that the result register of the new instruction in the IFQ entry will match the input operand register of I. In our precomputation method (see Section 2.2), precomputed instructions from the dependence graphs never modify the architected state of the main processor. Thus although incorrect instructions in the dependence graphs, if any, will almost certainly result in generating a wrong prefetch address, they do not corrupt the architected state of the machine; they simply reduce the DGP performance gain. DGG builds the dependence graph by repeating the above operations until it finds no new (unanalyzed) instructions in the dependence graph, or the Dependence Graph Buffer (DGB) is full. If the DGB is full, the DGG squashes the dependence graph that it is currently generating, as it is too large to store in the DGB and would also be relatively unlikely to result in a timely prefetch due to its size. Otherwise, the graph for L is deemed complete and is stored in the DGB. Each entry in the DGB holds one dependence graph; the order of the instructions in the entry is the order in which DGG added them to the graph. DGG then resumes searching the IFQ for the next marked load/store instruction.
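The sketch below, again ours and under the same illustrative conventions as the predecode example, shows the DGG worklist walk with the four validity tests named above; the 16-instruction limit stands in for the DGB entry capacity (64 bytes, 16 instructions) discussed in the next paragraph.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

constexpr int N = 5;                        // 2^N-entry IFQ
constexpr size_t DGB_ENTRY_INSTS = 16;      // one DGB entry: 64 bytes = 16 insts

struct IfqEntry {
    bool valid = false;                     // entry currently holds an instruction
    bool it = false;                        // Index-Tag bit
    int dest = -1;                          // destination register, -1 if none
    int src1 = -1, src2 = -1;               // source registers, -1 if none
    std::optional<uint8_t> op1, op2;        // producer indices from predecode
};

// Follow one producer pointer, applying the four validity tests: non-NULL,
// entry occupied, destination register matches the consumer's source register,
// and IT bit matches the MSB of the stored (N+1)-bit index.
static std::optional<int> follow(const std::vector<IfqEntry>& ifq,
                                 std::optional<uint8_t> op, int src_reg) {
    if (!op) return std::nullopt;                        // OP field is NULL
    int slot = *op & ((1 << N) - 1);
    const IfqEntry& p = ifq[slot];
    if (!p.valid) return std::nullopt;                   // entry since vacated
    if (p.dest != src_reg) return std::nullopt;          // different instruction now
    if (p.it != bool((*op >> N) & 1)) return std::nullopt; // stale index (IFQ wrapped)
    return slot;
}

// Build the approximate dependence graph rooted at IFQ slot `root` (the marked
// load/store). Returns the slots in the order DGG added them, or an empty
// vector if the graph would overflow a DGB entry and must be squashed.
std::vector<int> build_graph(const std::vector<IfqEntry>& ifq, int root) {
    std::vector<int> graph{root};
    for (size_t i = 0; i < graph.size(); ++i) {          // worklist of unanalyzed insts
        const IfqEntry& cur = ifq[graph[i]];
        for (auto [op, reg] : {std::pair{cur.op1, cur.src1},
                               std::pair{cur.op2, cur.src2}}) {
            auto slot = follow(ifq, op, reg);
            if (slot && std::find(graph.begin(), graph.end(), *slot) == graph.end())
                graph.push_back(*slot);                  // new producer joins the graph
            if (graph.size() > DGB_ENTRY_INSTS) return {};  // too large: squash
        }
    }
    return graph;
}
```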
It is possible that a new marked load/store instruction, M, enters the IFQ while DGG is still in the process of building the dependence graph of an earlier load/store instruction, L. The graph generation for M is delayed until DGG completes building the dependence graph for L. In our evaluations we assume a simple implementation of DGG that takes one cycle for each instruction added to the dependence graph. As shown in Section 4.2, 92% of the dependence graphs have 8 or fewer instructions, and the largest has only 15 instructions. In our evaluations, the IFQ holds up to 32 (or 64) instructions, the DGB has 2 entries, and the size of each DGB entry is 64 bytes (16 instructions). No dependence graph encountered in our benchmark evaluations exceeded this size. Of the two DGB entries, one is used by DGG to store the dependence graph that it is generating; the other holds a completed dependence graph that is ready for execution or is being executed by the Precomputation Engine. Since the dependence graph of one load/store instruction may use the register value loaded by another load instruction, dependence graphs may contain intermediate load instructions. The dependence graph generator follows the OP1 and OP2 pointers of an intermediate load instruction, IL, as usual to identify the earlier instructions in the IFQ that define the address accessed by IL. During the precomputation of IL, the processor's cache is accessed using the precomputed address to get IL's operand value. However, if there is an uncommitted store to the same address which precedes IL in program order, this hazard cannot be detected by DGG, since the operand addresses of load/store instructions in the IFQ are unknown. Such hazards may cause incorrect addresses to be generated for marked load/store instructions. Nevertheless, as shown in Section 4, 94% of the prefetches issued by DGP are useful. Thus such hazards are not a significant hindrance to generating accurate prefetch addresses using DGP.

2.2 How to Precompute

A separate Precomputation Engine (PE) is used to execute the dependence graphs stored in the Dependence Graph Buffer. Since DGG generates the dependence graphs by tracking dependences backwards from a marked load/store in the IFQ, PE executes the dependence graph by beginning with the last instruction added to the graph and continuing forward until it executes the marked load/store instruction at the root of the graph.

PE executes instructions speculatively. The results generated by PE are used only for prefetching data, and in particular they never update the architected state of the main processor. When executing a marked instruction, PE generates the load/store address and prefetches the data into the L1 cache without loading a value into the main processor's register (for a load) and without storing a value to the prefetched address (for a store). Since PE is prevented from modifying the architected state of the main processor, it can speculatively run ahead of the actual computation without having to guarantee the correctness of the precomputation. Note that DGG does not remove any instructions from the IFQ; consequently all precomputed instructions will be executed in the normal manner by the main processor pipeline. The PE is a simple single issue, in-order execution engine that executes one integer instruction every cycle (except that multiply takes 2 cycles, and integer divide takes 4). As soon as it finishes one graph it proceeds to the next buffered graph. Since address computation arithmetic does not involve any floating point operations, and DGG (since it operates on instructions in the IFQ) follows only the predicted control flow path from the branch predictor while building the dependence graphs, the dependence graphs never contain floating point operations or branch instructions. Hence PE needs no floating point or branch execution units or branch prediction hardware. The precomputation engine has a scratch register file (SRF) to store the live results of precomputed instructions. The maximum number of SRF registers needed is equal to the maximum width of the dependence graph, which never exceeded 4 instructions in our benchmarks. Since PE executes at most one instruction every cycle, the SRF needs only two read ports and one write port. PE is thus a simple fixed-point engine that can be implemented with a modest transistor count. The SRF registers are all set to unassigned before beginning a new graph. Before PE executes an instruction, I, an SRF register is assigned to store its result. To get the necessary operand values for I, PE accesses the SRF only if the corresponding OP field of I is not NULL, in which case an already executed instruction in the dependence graph will have generated the operand value and stored it in the SRF. When the OP field is NULL, the PE obtains the corresponding operand value by accessing the processor's register file and the ROB for forwarding uncommitted register values. Hence in our implementation, the register file and the ROB of the main processor each need two additional read ports for PE accesses. Due to the lack of interlocks between PE and the main processor, operand values read by PE from the main processor's register file and ROB could be incorrect. At the time PE accesses a main processor register, the desired value might not yet be defined (because the instruction that defines it has not yet executed), or might have already been overwritten by a later instruction if the main processor has run ahead of the PE. Accessing the ROB as well as the register file shifts (and generally reduces) this problem, but does not eliminate it. Thus we note that register hazards may arise due to the lack of interlocks between the main processor pipeline and PE; however, we must be careful not to delay the operation of the main processor pipeline. Furthermore we note that any hazards that do manifest themselves will simply reduce DGP performance, and our experimental results indicate that DGP is quite robust in practice. The PE accesses the processor's cache to get the values for intermediate load instructions, if any, in the dependence graph; if such an access results in a miss, the PE squashes the remaining precomputation: it does not issue a prefetch, and moves on to begin precomputing the next dependence graph in the dependence graph buffer, or await its arrival. After the last (load/store) instruction in a dependence graph is executed, the address generated is used to prefetch the data into the processor's L1 cache. The prefetch is squashed by L1 if the data is already present.
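A behavioral sketch of the PE loop just described, under the same illustrative conventions as the earlier sketches: the graph is executed back to front, operands come from the SRF when an OP pointer exists and from the register file/ROB otherwise, an intermediate-load miss squashes the graph, and the root yields a prefetch. The hook functions are hypothetical stand-ins, stubbed here, for hardware the paper does not spell out.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct GraphInst {
    bool is_load = false, is_root = false;  // root = the marked load/store
    std::optional<int> op1_pos, op2_pos;    // graph positions of producers, or NULL
    int src1 = -1, src2 = -1;               // architected source registers
};

// Hypothetical hardware hooks, stubbed so the sketch compiles.
static uint64_t read_committed_or_rob(int) { return 0; }            // regfile/ROB port
static std::optional<uint64_t> cache_lookup(uint64_t) { return 0; } // L1 probe
static void issue_prefetch(uint64_t) {}                             // prefetch into L1
static uint64_t exec_alu(const GraphInst&, uint64_t a, uint64_t b) { return a + b; }

// Execute one dependence graph. DGG stored the root (marked load/store) first,
// so PE walks the graph from the last instruction added toward the root.
void precompute(const std::vector<GraphInst>& graph) {
    std::vector<std::optional<uint64_t>> srf(graph.size());  // scratch register file
    for (int i = static_cast<int>(graph.size()) - 1; i >= 0; --i) {
        const GraphInst& gi = graph[i];
        // Non-NULL OP pointer: the producer already ran and left its value in
        // the SRF. NULL: read the main processor's register file / ROB instead.
        uint64_t a = gi.op1_pos ? *srf[*gi.op1_pos] : read_committed_or_rob(gi.src1);
        uint64_t b = gi.op2_pos ? *srf[*gi.op2_pos] : read_committed_or_rob(gi.src2);
        uint64_t result = exec_alu(gi, a, b);
        if (gi.is_root) {                 // marked load/store: prefetch only; no
            issue_prefetch(result);       // register write (load) or data store
            return;
        }
        if (gi.is_load) {                 // intermediate load: probe the L1 cache
            auto v = cache_lookup(result);
            if (!v) return;               // L1 miss: squash, no prefetch issued
            result = *v;
        }
        srf[i] = result;                  // live result for later consumers
    }
}
```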
2.3 Graph Generation and Precomputation Costs

Since DGG takes one cycle to generate each instruction in the dependence graph, one way to reduce this cost is to cache recently generated dependence graphs so as to avoid regenerating a graph every time its root load/store instruction is encountered. Caching would save the time needed to regenerate dependence graphs and might thereby increase prefetch timeliness. One simple caching strategy is to store the most recently generated dependence graph of each marked load/store instruction in a dependence graph cache. DGG could then access the graph cache using the PC of a marked load/store instruction as an index. On a hit, the graph is sent to the dependence graph buffer to await precomputation. But on a graph cache miss, DGG must build the dependence graph as before. Whenever a new dependence graph is built, it is sent to the buffer to await precomputation and is also stored in the dependence graph cache, replacing some previously stored entry. We implemented this simple graph caching scheme, but our results (not presented here) show that it actually degrades performance. Caching dependence graphs gains better timeliness at the cost of using a cached graph that may be inaccurate, because the predicted control path that leads up to the current instance of the marked load/store may differ from the path used to generate the cached graph. The performance loss due to the reduced accuracy of cached graphs outweighed the performance gains due to the more timely prefetching.
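For concreteness, this is roughly the lookup/install structure such a graph cache implies; the container and indexing are our assumptions, since the paper specifies only that the most recently generated graph per marked PC is kept.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Graph = std::vector<uint32_t>;   // encoded dependence-graph instructions

struct GraphCache {
    std::unordered_map<uint64_t, Graph> by_pc;  // marked load/store PC -> last graph

    // Hit: the cached graph can go straight to the DGB, skipping the
    // one-cycle-per-instruction regeneration. Miss: DGG rebuilds as before.
    const Graph* lookup(uint64_t pc) const {
        auto it = by_pc.find(pc);
        return it == by_pc.end() ? nullptr : &it->second;
    }
    // Every newly built graph replaces the previous one for that PC.
    void install(uint64_t pc, Graph g) { by_pc[pc] = std::move(g); }
};
```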

In our implementation we did not explore the possibility of reducing the instruction overhead due to precomputation. One way to reduce this overhead is to let PE update the main processor's architected state after executing the instructions in the dependence graphs, thereby eliminating the need to re-execute them in the main processor. In our implementation, instructions executed by PE never update any architected state of the main processor. We chose this approach because it avoids complicating the main processor by almost entirely decoupling the DGG and PE from the main processor pipeline execution. Constraining the PE to update the architected state would significantly increase the control logic and add performance-degrading interlocks between the PE and the main processor pipeline to ensure correct updating of the processor state. This would also increase the complexity of the testing and verification process of the design. Moreover, the increased communication between the PE and the main processor would add new wire delays which may reduce the clock speed. Thus in our implementation we chose the simplicity of a free running PE without interlocks, despite the often redundant reexecution of precomputed instructions in the main processor. Our evaluations show that on average PE executes about 20% of the main processor's instructions. Furthermore, as 98% of the precomputed addresses generated by PE match the addresses generated by the main processor, we infer that the vast majority of the instructions executed by the PE are in fact eventually reexecuted by the main processor without change, as otherwise the addresses would be highly unlikely to match. For future processors with further increased transistor counts [9], the 20% reexecution overhead will be even less of a concern relative to the savings in cache miss stall overhead achieved without slowing down the main processor pipeline [1]. The additional hardware needed for implementing DGP is the Dependence Graph Generator, Dependence Graph Buffer, and the Precomputation Engine, in addition to augmenting the IFQ and adding the RDV to the predecode stage. We assumed a simple Dependence Graph Generator that reads one entry from the IFQ every cycle. The Dependence Graph Buffer uses only 128 bytes of storage. The Precomputation Engine has one general purpose integer functional unit and a small scratch register file with two read ports and one write port. Even with such simple hardware components, DGP reduces the data cache miss stalls quite significantly. Thus we believe that the extra computation and the transistor budget required for DGP is well justified.

3 Simulation Environment and Benchmarks

To evaluate DGP performance we selected seven benchmarks from the CPU2000 integer benchmark suite (gzip, gcc, crafty, parser, gap, bzip2 and twolf) and two sets of database queries generated from eight queries (1 through 7 and 9) from the Wisconsin benchmark [2] and five queries (1, 2, 3, 5, and 6) from the TPC-H benchmark [7]. These benchmarks were selected because they exhibit poor data cache behavior and our existing simulation infrastructure allows us to run them without modifying the benchmark source codes. The wisc workload consists of all eight Wisconsin queries running on a 10MB database; wisc+tpch is all eight Wisconsin queries plus the five TPC-H queries running concurrently on a database of size 40MB. The database queries are implemented on top of SHORE [4], a fully functional storage manager which has been used extensively in the database research community and is also used in some commercial database systems.
All the benchmarks were compiled on an Alpha processor running OSF Version 4.0F using the Compaq C++ compiler, version 6.2, with the -O5 -ifo -inline and speed optimization flags turned on. To the O5-optimized binary, we applied the OM [12] tool, which reduces I-cache misses by performing feedback-directed code layout. Although OM is targeted toward improving the spatial locality of the code, its ability to analyze object level code at link time also opens up new opportunities for redoing some traditional compiler optimizations, such as inter-procedural dead code elimination and loop-invariant code motion, which even the O5 optimizations could not perform effectively at compile time. Thus, in addition to reducing I-cache misses, OM also reduces the number of instructions executed in a program. Although the CPU2000 benchmarks do not benefit much from the OM optimizations, OM does provide an 11% performance improvement over O5 code for the database benchmarks. We used this highly optimized binary as a baseline to study the performance impact of DGP. The CPU2000 benchmarks were first run on the test input set provided by SPEC. The set of load/store instructions was sorted according to the number of misses they generated, then those that missed most often were marked in the code down to the level where 90% of the cache misses were covered. Similarly, a set of three queries from the Wisconsin benchmark, query 1 (sequential scan), query 5 (nonclustered index select) and query 9 (two-way join), was run on a small database of 2100 tuples, and the load/store instructions that caused 90% of the cache misses were marked in the wisc and wisc+tpch workloads. Recall that during actual execution of the full workload on the large datasets, DGG will generate dependence graphs only for the marked load/store instructions. The CPU2000 benchmarks were then run on the train input set, wisc was run on a 10MB database, and wisc+tpch on a 40MB database. The results presented in this paper were generated by terminating the execution of each benchmark after 2 billion committed instructions.
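As a worked illustration of this selection rule, the sketch below sorts profiled static load/stores by miss count and marks them until 90% of all misses are covered; the (PC, miss count) input format is our assumption, since the paper specifies only the rule itself.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// prof holds one (static load/store PC, profiled D-cache miss count) pair per
// instruction; returns the PCs to mark for dependence graph generation.
std::vector<uint64_t> select_marked(std::vector<std::pair<uint64_t, uint64_t>> prof,
                                    double coverage = 0.90) {
    uint64_t total = 0;
    for (const auto& p : prof) total += p.second;
    std::sort(prof.begin(), prof.end(),            // most misses first
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::vector<uint64_t> marked;
    uint64_t covered = 0;
    for (const auto& [pc, misses] : prof) {
        if (covered >= coverage * total) break;    // 90% of misses covered
        marked.push_back(pc);
        covered += misses;
    }
    return marked;
}
```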

The SimpleScalar out-of-order processor simulator [3] was modified by adding a predecode stage between the fetch and dispatch stages of the processor pipeline. We also added the DGG and the PE, which execute concurrently with the main processor simulator. This modified simulator was used for detailed cycle-level processor simulation. The microarchitecture parameters were set as shown in Table 1.

4 Simulation Results

In this section we first present the results from our initial feasibility studies. We measure the instruction delays in the processor pipeline of the target microarchitecture (Table 1) and show that instructions spend 23 cycles on average in the IFQ and ROB before entering the execution stage. Our measurements also show that most marked load/store instructions have dependence graphs of fewer than 8 instructions. The 23 cycle delay between fetch and the beginning of execution, coupled with the small dependence graph sizes, encouraged us to consider prefetching by dependence graph precomputation. We then show that DGP did in fact reduce the D-cache miss stall time by 47%, resulting in a 13% average overall performance improvement.

Fetch, Decode & Issue Width     4
Inst Fetch Queue Size           32, 64
Reservation stations            64
Functional Units                4 general purpose / 2 multiply
Memory system ports to CPU      4
L1 I and D cache (each)         32KB, 2-way, 32-byte lines
Unified L2 cache                1MB, 4-way, 32-byte lines
L1 hit latency (cycles)         1
L2 hit latency (cycles)         16
Mem latency (cycles)            80
Branch Predictor                Hybrid (2-level + 2-bit)

Table 1. Microarchitecture Parameter Values

4.1 Delays in the IFQ and Reservation Stations

In our processor model the fetch stage of the pipeline can fetch 4 instructions every cycle and the processor can commit 4 instructions every cycle. But the IPC column of Table 2 shows that on average only 1.4 instructions are actually committed per clock cycle by the baseline system without prefetching. (The average figures in Tables 2-4 and Figure 2 are computed as if all 9 benchmarks were combined into one long run, dividing by 9 for the counts in Tables 3 and 4.) IPC is much less than the maximum fetch rate because of pipeline stalls due primarily to branch mispredictions, cache misses, and resource conflicts. Because of this disparity between the instruction fetch and completion rates, the IFQ fills quickly and tends to remain near full, as can be seen from the IFQ full% column: the IFQ is full on average for 69% of the execution time. In fact, a careful analysis of the IFQ occupancy showed that the only time the IFQ is not full is immediately after a branch misprediction, during which time the fetched entries are all squashed and instructions from the correct path are being fetched. Wisc and wisc+tpch suffer from much higher branch misprediction rates than the CPU2000 benchmarks, resulting in their IFQs being full only 39% and 44% of the time, respectively. An instruction in the processor pipeline spends some time (IFQ Delay) in the IFQ waiting for both an available decoder and an empty reservation station before it can leave the IFQ; Table 2 shows a 13 cycle IFQ wait for the average instruction. After leaving the IFQ, instructions wait in a reservation station for dependence resolution and functional unit availability (Exec Delay) before beginning execution; this takes the average instruction 10 additional cycles. Thus an average instruction spends 23 cycles in the pipeline before entering the execution stage. After beginning execution, it spends another 8 cycles before it is committed. Assuming the average 23 cycle delay before beginning execution, to completely mask a 16 cycle L1 D-cache miss latency, a prefetch for the corresponding load/store instruction would have to be issued within 7 cycles after that instruction enters the IFQ.
Thus the dependence graph generation plus the precomputation of the address, which both depend on the size of the dependence graph, would have to be completed within 7 cycles from the time the marked load/store instruction enters the IFQ. Of course individual cases will vary from this average, and an out-of-order processor should be capable of masking some portion of the miss latency itself. Furthermore, although masking the full miss latency is most desirable, masking a portion of the latency is still helpful.

[Table 2. Instruction Flow and IFQ occupancy (baseline system): per-benchmark IPC, IFQ full%, and pipeline delays in cycles (IFQ, Exec, Commit); the individual numeric values were not recoverable from this transcription, beyond those cited in the text.]

4.2 Dependence Graph Size

Figure 2 shows the cumulative distribution of dependence graph sizes for each benchmark and the average. On average 92% of the dependence graphs have 8 instructions or fewer. Since DGG takes one cycle to generate each instruction in the dependence graph, 92% of the dependence graphs can be generated within 8 cycles. Likewise PE can execute these dependence graphs and generate the prefetch address in at most 8 cycles. Thus a prefetch can almost always be issued within 16 cycles after a marked load/store enters the IFQ. Since the average instruction spends 23 cycles before entering the execution stage, we estimate that at least 7 cycles of the D-cache miss latency would be masked by these prefetches. Thus a 16 cycle miss stall might be reduced to 9 cycles, a 44% reduction, for an 8 instruction graph. Moreover, since 56% of all the dependence graphs have 3 or fewer instructions, 56% of the prefetches can be generated within 6 cycles, leaving 17 cycles to complete the prefetch. Thus over half the prefetches should completely mask the 16 cycle cache miss latency. This observation encouraged us to explore precomputation as a viable technique for prefetching.

[Figure 2. Cumulative Distribution of Dependence Graph Sizes (in instructions), shown per benchmark (gcc, parser, twolf, gap, crafty, bzip2, gzip, wisc, wisc+tpch) and on average.]
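The ballpark arithmetic above fits in a few lines; this little model, ours rather than the paper's, assumes one DGG cycle plus roughly one PE cycle per graph instruction, the paper's 23-cycle average fetch-to-execute delay, and the 16-cycle L2 hit latency.

```cpp
#include <algorithm>
#include <cstdio>

// Residual stall (cycles) seen by the demand reference for a dependence graph
// of g instructions: the prefetch issues about 2g cycles after the marked
// load/store enters the IFQ, and the demand reference begins executing about
// fetch_to_exec cycles after entry.
int residual_stall(int g, int fetch_to_exec = 23, int miss_lat = 16) {
    int head_start = fetch_to_exec - 2 * g;   // cycles the prefetch runs ahead
    return std::max(0, miss_lat - std::max(0, head_start));
}

int main() {
    std::printf("g = 8: %d of 16 stall cycles remain\n", residual_stall(8)); // 9: 44% less
    std::printf("g = 3: %d of 16 stall cycles remain\n", residual_stall(3)); // 0: fully masked
}
```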

4.3 Marked Load/Store Characteristics

Table 3 shows the basic characteristics of the load/stores that generated 90% of the cache misses in the profile run and were marked for precomputation. Out of 57,493 load/stores in the average benchmark, only 166 were marked for precomputation. Thus only a very few load/stores in a program cause the vast majority of D-cache misses. Precomputation can therefore focus on only this small set of load/stores, which in turn greatly reduces the dependence graph generation and precomputation overhead.

           Mark L/S   Mark Ref%   Over%
gcc        (0.24%)    13.5%       13.1%
parser     (1.33%)    25.8%       34.2%
twolf      (0.50%)    23.5%       24.7%
gap        (0.20%)    17.4%       23.0%
crafty     (0.78%)    17.5%       22.5%
bzip2      (0.62%)    46.2%       45.3%
gzip       (0.43%)    4.6%        6.8%
wisc       (0.11%)    4.4%        3.9%
wisc+tpch  (0.20%)    10.0%       8.9%
Avg        (0.28%)    10.8%       19.9%

Table 3. Marked Load/Store Characteristics (Mark L/S is shown as a percentage of total load/stores; the absolute Tot L/S and Mark L/S counts were not recoverable from this transcription)

The total references generated by the marked load/stores (Mark Ref%) on average account for 11% of all the references during the actual execution. The average instruction overhead (Over%) of 20% shows that on average, 1 instruction is executed by PE per 5 instructions executed by the main processor. In bzip2 the marked load/stores from the profile run accounted for 46% of the total references generated in the actual execution, and 45% of all its instructions were also precomputed by PE. Since DGP generates a dependence graph for every dynamic instance of a marked load/store execution, bzip2 has significantly higher instruction overhead than the other benchmarks.

4.4 Precomputation Characteristics

Table 4 shows the characteristics of precomputed addresses and the resulting prefetches that were issued by the PE. Of the 110,008,552 precomputed addresses (Prec) of the marked load/stores in the average benchmark, 98% matched the address generated by the main processor, indicating that PE is highly accurate in generating prefetch addresses. Thus neither the potential register hazards due to the lack of interlocks between the main processor pipeline and the PE, nor the inability of DGP to detect memory aliases, significantly hinders DGP from generating accurate prefetch addresses. Recall that a prefetch is issued to L2 only if the precomputed address is not already in L1; Pref Iss shows that 19% of the total precomputed addresses resulted in a prefetch being issued. Pref Used indicates that on average 18% of the total precomputations (94% of the issued prefetches) resulted in a useful prefetch, i.e. the prefetched line was used by the main processor before being replaced.

[Table 4. Precomputations and Prefetches (in millions): per-benchmark Prec, Prec Match, Pref Iss, and Pref Used; only the averages survive this transcription: Prec Match 98%, Pref Iss 21.4 (19%), Pref Used 20.2 (18%).]

4.5 Performance Improvement with DGP

Figure 3 shows the reduction in execution cycles due to DGP. There are six bars for each benchmark.
The leftmost bar shows the base performance with a data cache size of 32KB. The second bar shows performance with tagged next line (NL) prefetching, where the main processor issues a prefetch to the next cache line on every cache miss and also on a first hit to a previously prefetched cache line. The performance of DGP with and without NL, using an IFQ size of 32, is shown in the third and fourth bars, respectively. The fifth shows the execution cycles that would be required with a perfect data cache, wherein every access to the data cache is completed in 1 cycle.

[Figure 3. Performance with a 32 entry IFQ: execution cycles (x10^9) for the baseline dcache, NL, DGP, NL+DGP, Perf-dcache, and 8wide configurations, per benchmark.]

The rightmost bar shows the performance when the baseline processor's issue width and general purpose functional units are doubled (from 4 to 8), with other parameters remaining as in Table 1. These results show that NL prefetching improves performance by 7%. When DGP is used, with or without NL, it reduces the average cache miss stall time by 47%. This stall time reduction results in a 7% further improvement over NL, and 13% over the baseline processor with no prefetching. Furthermore, doubling the issue width and functional units of the main processor gives only a 2% performance improvement over DGP. The extra transistors required to implement DGP are approximately equivalent to adding only one additional issue width and one functional unit to the 4-wide baseline processor. DGP is also within 15% of the perfect data cache performance.

4.6 Prefetch Effectiveness and Bus Traffic

Figure 4 shows the prefetch effectiveness of DGP without NL by categorizing the issued prefetches into three categories. The bottom component, Pref Hits, shows the number of times that the next reference to a prefetched line found the referenced data already in the L1 cache. The center component, Delayed Hits, shows the number of times that the next reference to a prefetched line found the referenced data still en route to the L1 cache from lower levels of memory. Finally the upper component, Useless prefetches, shows the number of cache lines that were prefetched, but replaced before their next reference. On average, for IFQ size 32 (left bars) only 6% of the prefetches issued by DGP are useless. Thus the extra traffic generated due to DGP is minimal. Of the 94% useful prefetches (Pref Hits + Delayed Hits), 40% are Delayed Hits. Further analysis of the delayed hits showed that 35% of them incurred less than 8 cycles of miss penalty. Another 35% incurred more than 16 cycles of miss penalty because the prefetched line was also missing in the L2 cache. Since DGP is most effective in covering L1 miss penalties, it could benefit from techniques that improve L2 cache performance. However, note that 57% of the prefetches fully masked the corresponding miss penalty, which is consistent with the ballpark estimate in Section 4.2.

[Figure 4. Distribution of DGP prefetches issued with IFQs of 32 and 64 entries: precomputations (x10^6) per benchmark, split into Pref-Hits, Delayed-Hits, and Useless.]

4.7 Increasing Timeliness with a Larger IFQ

One possible way to increase prefetch timeliness is to increase the IFQ size, allowing instructions to be buffered in the IFQ for a longer time, and thereby giving more time for the precomputations to complete. However, to fill a longer IFQ the fetch stage has to run further ahead of the execution stage, which it does by predicting several more branches ahead, which in turn decreases the branch prediction accuracy. Decreased branch prediction accuracy may negate the gains due to increased prefetch timeliness. Moreover, increasing the IFQ size may also increase dependence graph sizes, thereby increasing the time required to generate and precompute the dependence graphs.
To evaluate these tradeoffs we simulated IFQ sizes of 64, 128, and 256, and found that the queue size has a negligible effect on performance; results for IFQ size 64 are shown in Figure 5.

[Figure 5. Performance with a 64 entry IFQ: execution cycles (x10^9) for the baseline dcache, NL, DGP, NL+DGP, Perf-dcache, and 8wide configurations, per benchmark.]

The right bars in Figure 4 show the distribution of precomputations for an IFQ of size 64. As can be seen, the useful precomputations increase and the delayed precomputations decrease when the fetch queue size is increased from 32 to 64.

However, the number of useless precomputations also increases because of the increase in branch mispredictions. Although Figure 4 shows that a longer IFQ allows more of the precomputations to be useful, Figure 5 shows that the actual performance of DGP with a 64 entry IFQ is nearly unchanged from the 32 entry IFQ (third bar in Figure 3), because a 2% reduction in branch prediction accuracy negates the gains from better prefetch timeliness. However, the two database benchmarks, wisc and wisc+tpch, despite having lower branch prediction accuracy than the CPU2000 benchmarks, do show a modest net gain with the larger IFQ.

5 Comparison of DGP with Previous Work

Program slicing [13] is a more general technique used for focusing on parts of a program that have a direct or indirect effect on the values computed at some points of interest. The concept of a dependence graph, used in this paper, is one way of generating a program slice. Dependence graphs are used extensively by compilers for performing several traditional compiler optimizations, and by debuggers in order to focus in on instructions that cause a program bug. Typically these dependence graphs are generated at compile time, but their size is too large to be useful for precomputation. DGP uses hardware to generate its dependence graphs dynamically, which significantly reduces the graph size and makes precomputation quite feasible, as the domain of interest includes only the predicted control path and extends back only as far as the unexecuted instructions that still reside in the IFQ. Zilles and Sohi [14] studied the characteristics of dependence graphs called backward slices for performance degrading instructions such as cache miss generating instructions and hard to predict branches. Their results show that in a window of 512 instructions leading up to a performance degrading instruction, fewer than 10% of the instructions contribute to the data used by the performance degrading instruction. Roth and Sohi [10] proposed speculative data driven multithreading (DDMT), in which backward slices are first generated from a program trace and are then preexecuted on a separate thread in a simultaneous multithreaded (SMT) processor. Speculative threads are spawned when a designated trigger instruction from the non-speculative thread starts execution. In their scheme, instructions in the speculative threads are not reexecuted by the main thread, and hence speculative threads do update the processor's architected state. To guarantee correct updating of the architected state by the speculative thread, their approach requires complex instruction issue and register renaming logic in the processor core. Collins, et al. [6] use the DDMT approach to generate backward slices for cache miss causing loads from program traces. These slices are preexecuted on a separate thread in an Itanium processor for early generation of load addresses. However, as in DGP, the speculative slices are used only for prefetching data, and in particular do not update the architected state of the processor. This approach has the advantage that it avoids complicating the processor core by almost entirely decoupling speculative thread execution from main thread execution. Collins proposes a chaining trigger mechanism whereby speculative threads are also allowed to spawn other speculative threads.
This trigger mechanism differs from Roth's approach, wherein only non-speculative threads are allowed to spawn speculative threads. While all the approaches above use trace driven analysis to generate their dependence graphs, in this paper we have presented an easily implementable technique to generate these graphs dynamically at runtime, together with a mechanism to precompute the dependence graphs and generate accurate and timely prefetches. Farcy, et al. [8] use precomputation to generate branch outcomes earlier. To precompute the branch outcome of a loop branch they skip several iterations of the loop and get ahead of the normal execution to determine future branch outcomes of the loop branch. This technique relies on the regularity of the program to gain the required distance for precomputation. But in the case of a cache miss, the miss address needs to be generated further ahead of the demand in order to completely mask today's increasingly large cache miss latencies. The technique presented in [8] is not suitable for masking large L1 cache miss latencies for irregular applications.

6 Conclusions and Future Work

Today's dominant application domains exhibit irregular access patterns resulting in a significant number of cache misses. Furthermore, processor clock speeds are continuing to outpace memory access speeds, resulting in longer cache miss latencies. Thus even aggressive out-of-order processors suffer significant performance degradation when executing these applications, due to their frequent cache misses and increasingly long cache miss latencies. Existing data prefetch techniques do not accurately predict the prefetch addresses for irregular applications in a timely fashion. In this paper we have explored an alternative to predicting prefetch addresses, namely precomputing them. Dependence Graph Precomputation (DGP) is a novel approach for accurately and dynamically identifying and precomputing the instructions that determine the addresses accessed by those load/store instructions deemed to be responsible for most data cache misses. The Dependence Graph Generator described in this paper efficiently generates the required dependence graphs at run time. A separate Precomputation Engine executes these dependence graphs to generate the data addresses of the designated load/store instructions early enough for timely prefetching.

The results generated by PE are used only for prefetching data; in particular they do not update the architected state of the main processor. Since PE is prevented from modifying the architected state of the main processor, it can speculatively run ahead of the actual computation without having to guarantee the correctness of the precomputation; nevertheless, it is nearly always correct. We have used a very aggressive out-of-order processor with well tuned application codes as a baseline system for studying the performance impact of DGP. Even with this formidable baseline, our results show that 94% of the prefetches issued by DGP are useful, reducing the D-cache miss stall time by 47%. Thus DGP takes us about half way from an already highly tuned baseline system toward perfect D-cache performance. DGP improves the overall performance of a wide range of applications by 7% over tagged next line prefetching, by 13% over a baseline processor with no prefetching, and is within 15% of the perfect D-cache performance. The prefetch timeliness of DGP can be improved if the IFQ size is increased. But increasing the IFQ size decreases branch prediction accuracy, thereby reducing the net performance. Hence branch prediction accuracy will eventually limit the size of the IFQ that should be used for the purpose of generating the dependence graphs. Although this paper focuses on load/store address precomputation, this technique could also be used for early generation of branch outcomes, thus enabling it to ameliorate both the instruction and data supply problems of current processors. Another alternative for implementing the precomputation engine is to execute the dependence graphs on special threads in a simultaneous multithreaded processor, as suggested by Chappell, et al. [5]. We are currently considering this alternative to implementing the precomputation engine. The insights gained from this work can also be used to change the scheduling priority among the reservation stations. By scheduling and executing the instructions in the dependence graphs of marked load/store instructions ahead of other ready instructions, results similar to those shown in this paper might be achieved without a special precomputation engine. One of the main advantages of out-of-order processors is their ability to tolerate cache miss latencies by executing other ready instructions in the reservation stations. If a simple in-order main processor were used with several autonomous precomputation engines to simultaneously reduce cache misses and branch mispredictions, it would be interesting to see whether complex out-of-order processors could still be justified.

7 Acknowledgements

This research was supported by a gift from IBM. The simulation facility was provided through an Intel Technology for Education 2000 grant. We would like to thank Josef Burger for providing us a version of SHORE that runs on Alpha machines, and Steve Reinhardt for his suggestions and for graciously allowing us to use his Alpha machines. We would also like to thank Joel Emer for suggesting several future directions for this work.

References

[1] M. Annavaram, G. Tyson, and E. Davidson. Instruction Overhead and Data Locality Effects in Superscalar Processors. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, April 2000.
[2] D. Bitton, D. DeWitt, and C. Turbyfill. Benchmarking Database Systems: A Systematic Approach. In 9th International Conference on Very Large Data Bases, pages 8-19, October 1983.
[3] D. Burger and T. Austin. The SimpleScalar Tool Set. Technical Report #1342, Computer Science Department, University of Wisconsin-Madison, June 1997.
[4] M. Carey, D. DeWitt, M. Franklin, N. Hall, M. McAuliffe, J. Naughton, D. Schuh, M. Solomon, C. Tan, O. Tsatalos, S. White, and M. Zwilling. Shoring Up Persistent Applications. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, May 1994.
[5] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). In Proceedings of the 26th International Symposium on Computer Architecture, June 1999.
[6] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In Proceedings of the 28th Annual International Symposium on Computer Architecture, July 2001.
[7] Transaction Processing Performance Council. TPC Benchmark H Standard Specification (Decision Support), Revision 1.1.0, June 1999.
[8] A. Farcy, O. Temam, R. Espasa, and T. Juan. Dataflow Analysis of Branch Mispredictions and Its Applications to Early Resolution of Branch Outcomes. In Proceedings of the 31st International Symposium on Microarchitecture, pages 59-68, December 1998.
[9] Y. Patt, S. Patel, M. Evers, D. Friendly, and J. Stark. One Billion Transistors, One Uniprocessor, One Chip. IEEE Computer, 30(9):51-57, September 1997.
[10] A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture, pages 37-48, January 2001.
[11] C. Selvidge. Compilation-Based Prefetching for Memory Latency Tolerance. PhD thesis, MIT, May 1992.
[12] A. Srivastava and D. Wall. A Practical System for Intermodule Code Optimization at Link-Time. Technical Report 92/6, Digital Western Research Laboratory, June 1992.
[13] M. Weiser. Program Slicing. IEEE Transactions on Software Engineering, 10(4):352-357, July 1984.
[14] C. Zilles and G. Sohi. Understanding the Backward Slices of Performance Degrading Instructions. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.


More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11

15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11 15-740/18-740 Computer Architecture Lecture 28: Prefetching III and Control Flow Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11 Announcements for This Week December 2: Midterm II Comprehensive

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

Use-Based Register Caching with Decoupled Indexing

Use-Based Register Caching with Decoupled Indexing Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Loose Loops Sink Chips

Loose Loops Sink Chips Loose Loops Sink Chips Eric Borch Intel Corporation, VSSAD eric.borch@intel.com Eric Tune University of California, San Diego Department of Computer Science etune@cs.ucsd.edu Srilatha Manne Joel Emer Intel

More information

Pre-Computational Thread Paradigm: A Survey

Pre-Computational Thread Paradigm: A Survey Pre-Computational Thread Paradigm: A Survey Alok Garg Abstract The straight forward solution to exploit high instruction level parallelism is to increase the size of instruction window. Large instruction

More information

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture Motivation Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA BARC2004 Increasing demand on

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

EECS 470 Final Project Report

EECS 470 Final Project Report EECS 470 Final Project Report Group No: 11 (Team: Minion) Animesh Jain, Akanksha Jain, Ryan Mammina, Jasjit Singh, Zhuoran Fan Department of Computer Science and Engineering University of Michigan, Ann

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

The Predictability of Computations that Produce Unpredictable Outcomes

The Predictability of Computations that Produce Unpredictable Outcomes The Predictability of Computations that Produce Unpredictable Outcomes Tor Aamodt Andreas Moshovos Paul Chow Department of Electrical and Computer Engineering University of Toronto {aamodt,moshovos,pc}@eecg.toronto.edu

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines Assigned April 7 Problem Set #5 Due April 21 http://inst.eecs.berkeley.edu/~cs152/sp09 The problem sets are intended

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

Low-Complexity Reorder Buffer Architecture*

Low-Complexity Reorder Buffer Architecture* Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 Reminder: Lab Assignments Lab Assignment 6 Implementing a more

More information

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. T seng and Krste Asanoviü MIT Laboratory for Computer Science, Cambridge, MA 02139, USA ISCA2003 1 Motivation

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Fetch Directed Instruction Prefetching

Fetch Directed Instruction Prefetching In Proceedings of the 32nd Annual International Symposium on Microarchitecture (MICRO-32), November 1999. Fetch Directed Instruction Prefetching Glenn Reinman y Brad Calder y Todd Austin z y Department

More information

Superscalar SMIPS Processor

Superscalar SMIPS Processor Superscalar SMIPS Processor Group 2 Qian (Vicky) Liu Cliff Frey 1 Introduction Our project is the implementation of a superscalar processor that implements the SMIPS specification. Our primary goal is

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Dynamic Speculative Precomputation

Dynamic Speculative Precomputation In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department

More information

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors : Optimizations for Improving Fetch Bandwidth of Future Itanium Processors Marsha Eng, Hong Wang, Perry Wang Alex Ramirez, Jim Fung, and John Shen Overview Applications of for Itanium Improving fetch bandwidth

More information

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Precise Exceptions and Out-of-Order Execution. Samira Khan

Precise Exceptions and Out-of-Order Execution. Samira Khan Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all instructions take the same amount of time for execution Idea: Have multiple different functional units that take

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Multithreaded Architectural Support for Speculative Trace Scheduling in VLIW Processors

Multithreaded Architectural Support for Speculative Trace Scheduling in VLIW Processors Multithreaded Architectural Support for Speculative Trace Scheduling in VLIW Processors Manvi Agarwal and S.K. Nandy CADL, SERC, Indian Institute of Science, Bangalore, INDIA {manvi@rishi.,nandy@}serc.iisc.ernet.in

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Data Speculation. Architecture. Carnegie Mellon School of Computer Science

Data Speculation. Architecture. Carnegie Mellon School of Computer Science Data Speculation Adam Wierman Daniel Neill Lipasti and Shen. Exceeding the dataflow limit, 1996. Sodani and Sohi. Understanding the differences between value prediction and instruction reuse, 1998. 1 A

More information

Hardware Speculation Support

Hardware Speculation Support Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification

More information

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 16: Prefetching Wrap-up Prof. Onur Mutlu Carnegie Mellon University Announcements Exam solutions online Pick up your exams Feedback forms 2 Feedback Survey Results

More information

PowerPC 620 Case Study

PowerPC 620 Case Study Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as

More information

Lec 11 How to improve cache performance

Lec 11 How to improve cache performance Lec 11 How to improve cache performance How to Improve Cache Performance? AMAT = HitTime + MissRate MissPenalty 1. Reduce the time to hit in the cache.--4 small and simple caches, avoiding address translation,

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Compiler Support for Dynamic Speculative Pre-Execution

Compiler Support for Dynamic Speculative Pre-Execution Compiler Support for Dynamic Speculative Pre-Execution Department of Electrical Engineering University of Southern California wro@usc.edu Won W. Ro and Jean-Luc Gaudiot G Department of Electrical Engineering

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L11: Speculative Execution I Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab3 due today 2 1 Overview Branch penalties limit performance

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) 1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into

More information

Speculation and Future-Generation Computer Architecture

Speculation and Future-Generation Computer Architecture Speculation and Future-Generation Computer Architecture University of Wisconsin Madison URL: http://www.cs.wisc.edu/~sohi Outline Computer architecture and speculation control, dependence, value speculation

More information