Cluster Assignment Strategies for a Clustered Trace Cache Processor


Cluster Assignment Strategies for a Clustered Trace Cache Processor

Ravi Bhargava and Lizy K. John
Technical Report TR
Laboratory for Computer Architecture
The University of Texas at Austin
Austin, Texas
{ravib,ljohn}@ece.utexas.edu

March 31, 2003

Abstract

This report examines dynamic cluster assignment for a clustered trace cache processor (CTCP). Previously proposed clustering techniques run into unique problems as issue width and cluster count increase. Realistic design conditions, such as variable data forwarding latencies between clusters and a heavily partitioned instruction window, also increase the degree of difficulty for effective cluster assignment. In this report, the trace cache and fill unit are used to perform effective dynamic cluster assignment. The retire-time fill unit analysis is aided by a dynamic profiling mechanism embedded within the trace cache. This mechanism provides information on inter-trace dependencies and critical inputs, elements absent in previous retire-time CTCP cluster assignment work. The strategy proposed in this report leads to more intra-cluster data forwarding and shorter data forwarding distances. In addition, performing this strategy at retire-time reduces issue-time complexity and eliminates early pipeline stages. This increases overall performance for the SPEC CPU2000 integer programs by 8.4% over our base CTCP architecture. This speedup is significantly higher than that of a previously proposed retire-time CTCP assignment strategy (1.9%). Dynamic cluster assignment is also evaluated for several alternate cluster designs as well as media benchmarks.

1 Introduction

A clustered microarchitecture design allows for wide instruction execution while reducing the amount of complexity and long-latency communication [2, 3, 5, 7, 11, 21]. The execution resources and register file are partitioned into smaller and simpler units. Within a cluster, communication is fast, while inter-cluster communication is more costly. Therefore, the key to high performance on a clustered microarchitecture is assigning instructions to clusters in a way that limits inter-cluster data communication. During cluster assignment, an instruction is designated for execution on a particular cluster. This assignment process can be accomplished statically, dynamically at issue-time, or dynamically at retire-time.

Static cluster assignment is traditionally done by a compiler or assembly programmer and may require ISA modification and intimate knowledge of the underlying cluster hardware. Studies that have compared static and dynamic assignment conclude that dynamic assignment results in higher performance [2, 15].

Dynamic issue-time cluster assignment occurs after instructions are fetched and decoded. In recent literature, the prevailing philosophy is to assign instructions to a cluster based on data dependencies and workload balance [2, 11, 15, 21]. The precise method varies based on the underlying architecture and execution cluster characteristics. Typical issue-time cluster assignment strategies do not scale well. Dependency analysis is an inherently serial process that must be performed in parallel on all fetched instructions. Therefore, increasing the width of the microarchitecture further delays and frustrates this dependency analysis (also noted by Zyuban et al. [21]). Accomplishing even a simple steering algorithm requires additional pipeline stages early in the instruction pipeline.

In this report, the clustered execution architecture is combined with an instruction trace cache, resulting in a clustered trace cache processor (CTCP). A CTCP achieves a very wide instruction fetch bandwidth using the trace cache to fetch past multiple branches in a low-latency and high-bandwidth manner [13, 14, 17]. The CTCP environment enables the use of retire-time cluster assignment, which addresses many of the problems associated with issue-time cluster assignment. In a CTCP, the issue-time dynamic cluster assignment logic and steering network can be removed entirely. Instead, instructions are issued directly to clusters based on their physical instruction order in a trace cache line or instruction cache block. This eliminates critical latency from the front-end of the pipeline. Cluster assignment is instead accomplished at retire-time by physically (but not logically) reordering instructions so that they are issued directly to the desired cluster.

Friendly et al. present a retire-time cluster assignment strategy for a CTCP based on intra-trace data dependencies [6]. The trace cache fill unit is capable of performing advanced analysis since the latency at retire-time is more tolerable and less critical to performance [6, 8]. The shortcoming of this strategy is the loss of dynamic information. Inter-trace dependencies and workload balance information are not available at instruction retirement and are ignored.

In this report, we increase the performance of a wide-issue CTCP using a feedback-directed, retire-time (FDRT) cluster assignment strategy. Extra fields are added to the trace cache to accumulate inter-trace dependency history, as well as the criticality of instruction inputs. The fill unit combines this information with intra-trace dependency analysis to determine cluster assignments.

This novel strategy increases the amount of critical intra-cluster data forwarding by 44% while decreasing the average data forwarding distance by 35% over our baseline four-cluster, 16-way CTCP. This leads to an 8.4% improvement in performance over our base architecture, compared to a 1.9% improvement for Friendly's method.

2 Clustered Microarchitecture

A clustered microarchitecture is designed to reduce the performance bottlenecks that result from wide-issue complexity [11]. Structures within a cluster are small, and data forwarding delays are reduced as long as communication takes place within the cluster. The target microarchitecture in this report is composed of four four-way clusters. Four-wide, out-of-order execution engines have proven manageable in the past and are the building blocks of previously proposed two-cluster microarchitectures. Similarly configured 16-wide CTCPs have been studied [6, 21], but not with respect to the performance of dynamic cluster assignment options.

An example of the instruction and data routing for the baseline CTCP is shown in Figure 1. Notice that the cluster assignment for a particular instruction is dependent on its placement in the instruction buffer. The details of a single cluster are explored later in Figure 3.

Figure 1: Overview of a Clustered Trace Cache Processor. C2 and C3 are clusters identical to Cluster 1 and Cluster 4.

2.1 Shared Components

The front-end of the processor (i.e., fetch and decode) is shared by all of the cluster resources. Instructions fetched from the trace cache (or from the instruction cache on a trace cache miss) are decoded and renamed in parallel before finally being distributed to their respective clusters. The memory subsystem components, including the store buffer, load queue, and data cache, are also shared.

Pipeline: The baseline pipeline for our microarchitecture is shown in Figure 2. Three pipeline stages are assigned for instruction fetch (illustrated as one box). After the instructions are fetched, there are additional pipeline stages for decode, rename, issue, dispatch, and execute. Register file accesses are initiated during the rename stage. Memory instructions incur extra stages to access the TLB and data cache. Floating point instructions and complex instructions (not shown) also endure extra pipeline stages for execution.

Figure 2: The Pipeline of the Baseline CTCP (Fetch x3, Decode, Rename/RF Access, RS Issue, Steer, FU Dispatch, Execute, Writeback, then the Fill Unit; memory instructions add D-TLB, DC Access, and DC Return stages).

Trace Cache: The trace cache allows multiple basic blocks to be fetched with just one request [13, 14, 17]. The retired instruction stream is fed to the fill unit, which constructs the traces. These traces consist of up to three basic blocks of instructions. When the traces are constructed, the intra-trace and intra-block dependencies are analyzed. This allows the fill unit to add bits to the trace cache line that accelerate register renaming and instruction steering [13]. This is the mechanism exploited here to improve instruction reordering and cluster assignment.
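Since the later sections hinge on what the fill unit stores alongside each trace, a minimal C sketch of a trace line carrying fill-unit-added metadata may help. The field names, widths, and the logical-order map are illustrative assumptions, not the actual layout used in our simulator.

```c
#include <stdint.h>

#define TRACE_MAX_INSTRS 16   /* matches the 16-wide fetch/issue width */
#define TRACE_MAX_BLOCKS 3    /* a trace holds up to three basic blocks */

typedef struct {
    uint32_t raw;         /* decoded instruction bits                      */
    uint8_t  dep_rs1;     /* intra-trace producer slot for RS1, 0xFF if none */
    uint8_t  dep_rs2;     /* intra-trace producer slot for RS2, 0xFF if none */
    uint8_t  rename_hint; /* precomputed register renaming information     */
} trace_instr_t;

typedef struct {
    uint64_t      start_pc;   /* fetch address used as the tag             */
    uint8_t       num_instrs; /* up to 16                                  */
    uint8_t       num_blocks; /* up to 3                                   */
    /* physical-to-logical map written by the fill unit after reordering,
       so issue order can differ from program order (Section 4)            */
    uint8_t       logical_order[TRACE_MAX_INSTRS];
    trace_instr_t instrs[TRACE_MAX_INSTRS];
} trace_line_t;
```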

2.2 Cluster Design

The execution resources modeled in this report are heavily partitioned. As shown in Figure 3, each cluster consists of five reservation stations that feed a total of eight special-purpose functional units. The reservation stations hold eight instructions each and permit out-of-order instruction selection. The economical size reduces the complexity of the wake-up and instruction select logic while maintaining a large overall instruction window size [11].

Figure 3: Details of One Cluster. There are eight special-purpose functional units per cluster: two simple integer units, one integer memory unit, one branch unit, one complex integer unit, one basic floating point (FP) unit, one complex FP unit, and one FP memory unit. There are five 8-entry reservation stations: one for the memory operations (integer and FP), one for branches, one for complex arithmetic, and two for the simple operations. FP is not shown.

Intra-cluster communication (i.e., forwarding results from the execution units to the reservation stations within the same cluster) is done in the same cycle as instruction dispatch. However, forwarding data to a neighboring cluster takes two cycles, and each cluster beyond that adds another two cycles. This latency includes all of the communication and routing overhead associated with sharing inter-cluster data [12, 21]. The end clusters do not communicate directly. There are no data bandwidth limitations between clusters in our work. Parcerisa et al. show that a point-to-point interconnect network can be built efficiently and is preferable to bus-based interconnects [12].
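On this linear four-cluster layout, the forwarding latencies just described reduce to a simple distance function. The sketch below numbers the clusters 0-3 and charges two cycles per cluster hop; it is a restatement of the latency model above, not simulator code.

```c
/* Inter-cluster forwarding latency for the linear four-cluster layout:
 * 0 cycles within a cluster, 2 cycles per cluster hop beyond that.
 * With no direct end-to-end link, cluster 0 -> cluster 3 costs 6 cycles. */
static inline int forward_latency(int from_cluster, int to_cluster)
{
    int dist = from_cluster - to_cluster;
    if (dist < 0)
        dist = -dist;
    return 2 * dist;   /* 0, 2, 4, or 6 cycles */
}
```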

2.3 Cluster Assignment

The challenge to high performance in clustered microarchitectures is assigning instructions to the proper cluster. This includes identifying which instruction should go to which cluster and then routing the instructions accordingly. With 16 instructions to analyze and four clusters from which to choose, picking the best execution resource is not straightforward. Accurate dependency analysis is a serial process and is difficult to accomplish in a timely fashion. For example, approximately half of all result-producing instructions have data consumed by an instruction in the same cache line. Some of this information is preprocessed by the fill unit, but issue-time processing is also required. Properly analyzing the relationships is critical but costly in terms of pipe stages. Any extra pipeline stages hurt performance when the pipeline refills after branch mispredictions and instruction cache misses.

Totally flexible routing is also a high-latency process. So instead, our baseline architecture steers instructions to a cluster based on their physical placement in the instruction buffer. Instructions are sent in groups of four to their corresponding cluster, where they are routed on a smaller crossbar to their proper reservation station. This style of partitioning results in less complexity and fewer potential pipeline stages, but is restrictive in terms of issue-time flexibility and steering power. A large crossbar would permit instruction movement from any position in the instruction buffer to any of the clusters. In addition to its latency and complexity drawbacks, this option mandates providing enough reservation station write ports to accommodate up to 16 new instructions per cycle. Therefore, we concentrate on simpler, low-latency instruction steering.
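The slot-based steering of the base model amounts to a fixed mapping from buffer position to cluster, as in the sketch below. dispatch_to_cluster() is a placeholder for the per-cluster routing crossbar; the mapping itself is exactly the groups-of-four scheme described above.

```c
#define FETCH_WIDTH   16
#define CLUSTER_WIDTH  4

void dispatch_to_cluster(int cluster, int slot);  /* per-cluster crossbar */

/* Slot-based steering: the instruction's physical slot in the 16-wide
 * buffer fixes its cluster, so no issue-time dependency analysis or
 * full 16-way crossbar is needed. */
void steer_buffer(const int valid[FETCH_WIDTH])
{
    for (int slot = 0; slot < FETCH_WIDTH; slot++)
        if (valid[slot])
            dispatch_to_cluster(slot / CLUSTER_WIDTH, slot);
}
```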

Assignment Options: For comparison purposes, we look at the following dynamic cluster assignment options:

Issue-Time: Instructions are distributed to the cluster where one or more of their inputs are known to be generated. Inter-trace and intra-trace dependencies are visible. At most four instructions are assigned to each cluster every cycle. Besides simplifying the hardware, this also balances the cluster workloads. This option is examined with zero latency and with four cycles of latency for dependency analysis, instruction steering, and routing.

Friendly Retire-Time: This is the only previously proposed fill unit cluster assignment policy. Friendly et al. propose a fill unit reordering and assignment scheme based on intra-trace dependency analysis [6]. Their scheme assumes a front-end scheduler restricted to simple slot-based issue, as in our base model. For each issue slot, each instruction is checked for an intra-trace input dependency for the respective cluster. Based on these data dependencies, instructions are physically reordered within the trace.

3 CTCP Characteristics

The following characterization serves to highlight the cluster assignment optimization opportunities.

3.1 Trace-level Analysis

Table 1 presents some run-time trace line characteristics for our benchmarks. The first metric (% TC Instr) is the percentage of all retired instructions fetched from the trace cache. Benchmarks with a large percentage of trace cache instructions benefit more from fill unit optimizations, since instructions from the instruction cache are unoptimized for the CTCP. Trace Size is the average number of instructions per trace line. When the fill unit does the intra-trace dependency analysis for a trace, this is the available optimization scope.

Table 1: Trace Characteristics (% TC Instr and Trace Size for each benchmark and the average).

Figure 4 breaks down the sources of the most critical input. The input data that arrives last is the most critical. This figure presents the percentage of instructions that had their most critical input arrive from the register file, the producer for RS1, and the producer for RS2. On average, 44% of the instructions obtain their critical input data from the register file, 30% receive their final data input from the producer for register RS1, and 26% from the producer for RS2.

Table 2 looks at instruction data dependencies from a different perspective. The total number of unique consumers is tracked for each result-producing instruction (i.e., not stores or branches). This only includes consumers that receive their input via data forwarding. On average, each instruction has slightly more than one consumer. An inter-trace consumer is an instruction that has a data producer in a different trace line, shown in the second column. Only 28% of consumers are inter-trace consumers.

Table 3 further analyzes the critical dependencies seen in data forwarding. The first column indicates the percentage of all data dependencies that are critical. On average, about 85% of dependencies are critical, i.e., the last (and possibly only) data input received by an instruction. Of those, 27% are inter-trace dependencies. Table 3 shows a large percentage of intra-trace dependencies.

Figure 4: Source of Critical Input Dependency (y-axis: % of dynamic instructions with inputs). From RS2: critical input provided by the producer for input RS2. From RS1: critical input provided by the producer for input RS1. From RF: critical input provided by the register file.

Table 2: Dynamic Consumers Per Instruction (Total and Inter-Trace consumers for each benchmark and the average).

Despite this, the speedup gained by removing the inter-trace forwarding latency is similar to that of removing intra-trace forwarding latency. In the case of bzip2, it is more beneficial to remove inter-trace dependencies.

3.2 Impact of Dependency Latencies

Figure 5 looks at the effects on performance of eliminating certain dependency-related latencies. The bars labeled No Fwd Lat correspond to our base model but with no data forwarding latency. If inter-cluster forwarding could take place in the same cycle, performance improves by 26.8%. Next, only the last forwarded value to an instruction is given a latency of zero cycles (No Crit Fwd Lat).

Table 3: Critical Data Forwarding Dependencies

              % of all dep.'s     % of critical dep.'s
              that are critical   that are inter-trace
  bzip2             --                  29.69%
  crafty           81.04%               24.32%
  eon              86.58%               35.40%
  gap              86.15%               22.72%
  gcc              87.59%               20.35%
  gzip             80.94%               24.38%
  mcf              89.62%               25.92%
  parser           88.90%               38.16%
  perlbmk          86.11%               27.76%
  twolf            78.58%               23.95%
  vortex           85.51%               27.63%
  vpr              82.32%               25.84%
  Avg              84.91%               27.18%

This improves performance by 24.4% (see footnote 1). This result means that most of the improvement provided by eliminating all forwarding latencies is achieved just by eliminating the latency for the last-arriving data input. This observation is exploited in the proposed cluster assignment scheme.

Figure 5: Speedup When Removing Forwarding Latencies (No Fwd Lat, No Crit Fwd Lat, No Intra-Trace Lat, No Inter-Trace Lat, No RF Lat; speedup over base for each benchmark plus the harmonic mean).

The speedups due to only removing intra-trace and only removing inter-trace data forwarding latencies are also presented in Figure 5. Although inter-trace dependencies are less frequent, removing them has almost the same impact as removing intra-trace dependencies. Finally, the speedup due to eliminating the register file read latency is presented and has almost no effect on overall performance.

Footnote 1: For bzip2, the branch predictor accuracy is sensitive to the rate at which instructions retire, and the better case with no data forwarding latency actually leads to an increase in branch mispredictions and worse performance.

In fact, register file latencies between zero and 10 cycles have no impact on performance. This is due to the abundance and critical nature of in-flight instruction data forwarding.

3.3 Resolving Inter-Trace Dependencies

The fill unit accurately determines intra-trace dependencies. Since a trace is an atomic trace cache unit, the same intra-trace instruction data dependencies will exist when the trace is later fetched. However, incorporating inter-trace dependencies at retire-time is essentially a prediction of issue-time dependencies, some of which may occur thousands or millions of cycles in the future. This problem presents an opportunity for an execution-history-based mechanism to predict the source clusters or producers for instructions with inter-trace dependencies.

Table 4 examines how often an instruction's forwarded data comes from the same producer instruction. For each static instruction, the program counter of the last producer is tracked for each source register (RS1 and RS2). The table shows that an instruction's data forwarding producer is the same for RS1 96.3% of the time and the same for RS2 94.3% of the time.

Table 4: Frequency of Repeated Forwarding Producers

                     All                  Critical Inter-trace
            Input RS1   Input RS2       Input RS1   Input RS2
  bzip2        --         97.66%          89.30%      91.17%
  crafty     96.95%       97.82%          88.64%      93.55%
  eon        93.83%       89.84%          85.79%      73.34%
  gap        96.82%       93.65%          80.26%      77.88%
  gcc        96.39%       95.15%          85.36%      83.33%
  gzip       98.14%       99.02%          92.93%      96.04%
  mcf        96.61%       95.40%          92.36%      87.06%
  parser     96.13%       87.66%          83.93%      78.76%
  perlbmk    97.78%       93.79%          90.83%      79.27%
  twolf      96.69%       90.78%          87.09%      76.40%
  vortex     89.67%       95.04%          70.87%      83.64%
  vpr        98.53%       96.06%          95.64%      91.67%
  Average    96.25%       94.32%          86.92%      84.34%

The last two columns isolate the critical producers of inter-trace consumers. These percentages are expectedly lower, but producers are still the same for 86.9% of the critical inter-trace RS1 inputs and for 84.3% of the critical inter-trace RS2 inputs. Therefore, a mechanism to predict the forwarding producer could work with a high degree of success. However, what we are really interested in is whether inter-trace dependencies are arriving from the same location in the same trace. Just because the same instruction is repeatedly producing data for the inputs does not mean that the inputs are coming from the same cluster or the same trace.
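The measurement behind Table 4 can be sketched as follows: for each static instruction, remember the program counter of the last producer that forwarded each source operand, and count how often it repeats. The direct-mapped table and its replacement policy below are simplifying assumptions, not the profiling structure actually used.

```c
#include <stdint.h>

#define PROF_ENTRIES 4096

typedef struct {
    uint64_t instr_pc;          /* static consumer instruction          */
    uint64_t last_prod_pc[2];   /* last forwarding producer, [0]=RS1 [1]=RS2 */
    uint64_t same[2], total[2]; /* repeat counts per source operand     */
} prod_hist_t;

static prod_hist_t prof[PROF_ENTRIES];

/* Called once per data-forwarding event observed at retirement. */
void record_forward(uint64_t consumer_pc, int src /* 0=RS1, 1=RS2 */,
                    uint64_t producer_pc)
{
    prod_hist_t *e = &prof[(consumer_pc >> 2) % PROF_ENTRIES];
    if (e->instr_pc != consumer_pc) {          /* cold or conflicting entry */
        e->instr_pc = consumer_pc;
        e->last_prod_pc[0] = e->last_prod_pc[1] = 0;
        e->same[0] = e->same[1] = e->total[0] = e->total[1] = 0;
    }
    if (e->last_prod_pc[src] != 0) {
        e->total[src]++;
        if (e->last_prod_pc[src] == producer_pc)
            e->same[src]++;                    /* same producer as last time */
    }
    e->last_prod_pc[src] = producer_pc;
}
```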

Inter-trace dependencies do not necessarily arrive from the previous trace; they could arrive from any trace in the past. In addition, static instructions are sometimes incorporated into several different dynamic traces.

Table 5 analyzes the distance between an instruction and its critical inter-trace producer. The values are the percentage of such instructions that encounter the same distance in consecutive executions. This percentage correlates very well with the percentages in the last two columns of Table 4. On average, 85.9% of critical inter-trace forwarding is the same distance from a producer as the previous dynamic instance of the instruction.

Table 5: Frequency of Repeated Critical Inter-Trace Forwarding Distances

  bzip2       --
  crafty    92.41%
  eon       80.15%
  gap       87.16%
  gcc       85.51%
  gzip      93.19%
  mcf       87.89%
  parser    84.46%
  perlbmk   81.55%
  twolf     76.80%
  vortex    84.56%
  vpr       88.93%
  Average   85.93%

Distance is measured in number of retired instructions.

4 Retire-Time Cluster Assignment

The enhancements in this section take advantage of the dynamic optimization opportunities of a long-latency fill unit. The remainder of this section presents the background, motivation, and details of our feedback-directed, retire-time (FDRT) instruction reordering and cluster assignment scheme.

The proposed cluster assignment strategy improves retire-time cluster assignment by assigning instructions to a cluster based on both intra-trace and inter-trace dependencies. In addition, the critical instruction input is identified and considered in the assignment strategy. The fill unit is available at instruction retirement to analyze the instruction dependencies. Instructions are physically reordered to match the desired cluster assignments. When the trace is later fetched from the trace cache, the instructions issue to the proper cluster based on their physical position. Logically, the instructions remain in the same order. The logical ordering is marked in the trace cache line by the fill unit. Though this adds some complexity, techniques such as precomputing register renaming information and predecoding compensate and lead to an overall complexity decrease.

The important aspect is that physical reordering reduces inter-cluster communication while maintaining low-latency, complexity-effective issue logic.

4.1 Pinning Instructions

Physically reordering instructions at retire-time based on execution history can cause more inter-cluster data forwarding than it eliminates. When speculated inter-trace dependencies guide the reordering strategy, the same trace of instructions can be reordered in a different manner each time the trace is constructed. Producers shift from one cluster to another, never allowing the consumers to accurately gauge the cluster from which their input data will be produced. A retire-time instruction reordering heuristic must be chosen carefully to avoid unstable cluster assignments and self-perpetuating performance problems.

To combat this problem, we pin key producers and their subsequent inter-trace consumers to the same cluster to force intra-cluster data forwarding between inter-trace dependencies. This creates a pinned chain of instructions with inter-trace dependencies. The first instruction of the pinned chain is called a leader. The subsequent links of the chain are referred to as followers. The criteria for selecting pin leaders and followers are presented in Table 6.

Table 6: Leader and Follower Criteria

Conditions to become a pin leader:
  1. Cannot already be a leader or a follower
  2. Forwards data to inter-trace consumers
  3. Repeatedly meets conditions 1 and 2 above

Conditions to become a pin follower:
  1. Not already a leader or follower
  2. Producer provides last input data
  3. Producer is a leader or follower
  4. Producer is from a different trace

There are two key aspects to these guidelines. The first is that only inter-trace dependencies are considered. Placing instructions with intra-trace dependencies on the same cluster is easy and reliable; therefore, these instructions do not require the pinned chain method to establish dependencies. Second, once an instruction is assigned to a pinned chain as a leader or a follower, its status should not change. The idea is to pin an instruction to one cluster and force the other instructions in the inter-trace dependency chain to follow it to that same cluster. If the pinned cluster were allowed to change, it could lead to the performance-limiting race conditions discussed earlier.
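Read together with the Table 6 criteria, the retire-time pin-status update might look like the sketch below. The two-bit confidence counter is introduced in Section 4.2; all names and the exact argument set are illustrative, not the simulator's interface.

```c
typedef enum { PIN_NONE, PIN_LEADER, PIN_FOLLOWER } pin_t;

typedef struct {
    pin_t pin;            /* leader/follower value                   */
    int   confidence;     /* saturating 0..3; leader eligible at >= 2 */
    int   pinned_cluster; /* cluster this chain is pinned to          */
} pin_state_t;

/* Applied by the fill unit at retirement. 'last_producer' is the pin
 * state of the instruction that forwarded the last-arriving input, or
 * NULL if that input came from the register file. */
void update_pin_status(pin_state_t *self,
                       const pin_state_t *last_producer,
                       int producer_in_other_trace,
                       int forwards_to_other_trace)
{
    if (self->pin != PIN_NONE)
        return;                       /* pin status never changes (Table 6) */

    /* Follower: last input forwarded by a pinned producer in another trace. */
    if (last_producer && last_producer->pin != PIN_NONE &&
        producer_in_other_trace) {
        self->pin = PIN_FOLLOWER;
        self->pinned_cluster = last_producer->pinned_cluster;
        return;
    }

    /* Leader: repeatedly forwards data to inter-trace consumers. */
    if (forwards_to_other_trace) {
        if (self->confidence < 3)
            self->confidence++;
        if (self->confidence >= 2)
            self->pin = PIN_LEADER;   /* cluster fixed at first assignment */
    } else if (self->confidence > 0) {
        self->confidence--;
    }
}
```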

Table 7: FDRT Cluster Assignment Strategy

  Dependency type                   Option A     Option B     Option C       Option D    Option E
  Intra-trace producer:             yes          no           yes            no          no
  Inter-trace pin leader/follower:  no           yes          yes            no          no
  Intra-trace consumer:             --           --           --             yes         no

  Resulting Cluster                 1. producer  1. pin       1. pin         1. middle   1. skip
  Assignment Priority:              2. neighbor  2. neighbor  2. producer's  2. skip
                                    3. skip      3. skip      3. neighbor
                                                              4. skip

4.2 Dynamic Instruction Feedback

The trace cache framework provides a unique instruction feedback loop. Instructions are fetched from the trace cache, executed, and then compiled into new traces. This cycle enables run-time, instruction-level profiling. By associating a few bits with each instruction in the trace cache (see footnote 2), a dynamic execution history is built for each instruction. This execution data remains associated with the instruction as it travels through the pipeline. This method of run-time execution profiling is very effective as long as the trace lines are not frequently evicted from the trace cache. Similarly, Zyuban et al. suggested placing static cluster assignments in the trace cache, but do not provide details or results [21].

In our enhanced clustered trace cache processor microarchitecture, run-time per-instruction information is used to provide inter-trace dependency information to the fill unit. There are four fields added to the trace cache storage:

Leader/Follower Value: This two-bit field indicates whether the instruction is a leader, a follower, or neither.

Confidence Counter: This two-bit counter must reach a value of two before an instruction is deemed worthy to be a leader. It is incremented when the criteria in Table 6 are met and decremented otherwise. This prevents instructions with variable behavior from becoming pinned chain leaders.

Pinned Cluster: This field holds the pinned cluster number. Producers forward this two-bit value along with their data to consumers.

Last Input Reg: This field tracks which input register is satisfied last. This is used to focus the cluster assignment policies. Figure 5 illustrated the usefulness of this characteristic.

Footnote 2: Using Cacti 2.0 [16], an additional byte per instruction in a trace cache line is determined not to change the fetch latency of the trace cache.
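These four two-bit fields fit exactly in the one extra byte per instruction mentioned in footnote 2. One plausible packing, offered purely as an assumption about the encoding:

```c
#include <stdint.h>

/* Per-instruction trace cache profile: four 2-bit fields in one byte. */
typedef struct {
    uint8_t leader_follower : 2;  /* 0 = neither, 1 = leader, 2 = follower */
    uint8_t confidence      : 2;  /* leader candidacy; pinned at >= 2      */
    uint8_t pinned_cluster  : 2;  /* cluster 0..3 the chain is pinned to   */
    uint8_t last_input_reg  : 2;  /* which source register arrived last    */
} tc_profile_t;
```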

4.3 Cluster Assignment Strategy

The fill unit must weigh intra-trace information along with the inter-trace feedback from the trace cache execution histories. Table 7 summarizes our proposed cluster assignment policies. The inputs to the strategy are: 1) the presence of an intra-trace dependency for the instruction's most critical source register (i.e., the input that was satisfied last during execution), 2) the pin chain status, and 3) the presence of intra-trace consumers. The fill unit starts with the oldest instruction and progresses in logical order to the youngest instruction.

For option A in Table 7, the fill unit attempts to place an instruction that has only an intra-trace dependency on the same cluster as its producer. If there are no instruction slots available for the producer's cluster, an attempt is made to assign the instruction to a neighboring cluster. For an instruction with just an inter-trace pin dependency (option B), the fill unit attempts to place the instruction on the pinned cluster (which is found in the Pinned Cluster trace profile field) or a neighboring cluster.

An instruction can have both an intra-trace dependency and a pinned inter-trace dependency (option C). Recall that pinning an instruction is irreversible. Therefore, if the critical input changes or an instruction is built into a new trace, an intra-trace dependency could exist along with a pinned inter-trace dependency. When an instruction has both a pinned cluster and an intra-trace producer, the pinned cluster takes precedence (although our simulations show that it does not matter which gets precedence). If four instructions have already been assigned to this cluster, the intra-trace producer's cluster is the next target. Finally, there is an attempt to assign the instruction to a pinned cluster neighbor.

If an instruction has no dynamically forwarded input data but does have an intra-trace output dependency (option D), it is assigned to a middle cluster (to reduce potential forwarding distances). Instructions are skipped if they have no input or output dependencies (option E), or if they cannot be assigned to a cluster near their producer (the lowest-priority assignment for options A-D). These instructions are later assigned to the remaining slots using Friendly's method.
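Putting Table 7 together, a sketch of the fill unit's per-trace FDRT pass follows. The input flags, the toward-the-middle neighbor order, and the four-slots-per-cluster bookkeeping are illustrative readings of the text, not the simulator's exact logic; instructions left unassigned fall through to Friendly-style slot filling.

```c
#define NUM_CLUSTERS  4   /* clusters numbered 0..3 here */
#define CLUSTER_WIDTH 4   /* issue slots per cluster per trace */

typedef struct {
    int has_critical_intra_producer; /* critical input produced in-trace */
    int producer_cluster;            /* cluster chosen for that producer */
    int pinned;                      /* pin leader or follower           */
    int pinned_cluster;
    int has_intra_consumer;          /* feeds a younger in-trace instr   */
    int assigned_cluster;            /* -1 = skipped / unassigned        */
} fdrt_instr_t;

static int try_assign(int slots[NUM_CLUSTERS], int c)
{
    if (c >= 0 && c < NUM_CLUSTERS && slots[c] < CLUSTER_WIDTH) {
        slots[c]++;
        return c;
    }
    return -1;
}

/* Try cluster c, then its neighbors, preferring the one nearer the middle. */
static int try_near(int slots[NUM_CLUSTERS], int c)
{
    int first = (c < NUM_CLUSTERS / 2) ? c + 1 : c - 1;
    int a = try_assign(slots, c);
    if (a < 0) a = try_assign(slots, first);
    if (a < 0) a = try_assign(slots, 2 * c - first);  /* other neighbor */
    return a;
}

void fdrt_assign(fdrt_instr_t *t, int n)   /* oldest-first, logical order */
{
    int slots[NUM_CLUSTERS] = {0};
    for (int i = 0; i < n; i++) {
        fdrt_instr_t *in = &t[i];
        in->assigned_cluster = -1;
        if (in->pinned) {                            /* options B and C */
            in->assigned_cluster = try_assign(slots, in->pinned_cluster);
            if (in->assigned_cluster < 0 && in->has_critical_intra_producer)
                in->assigned_cluster =               /* option C fallback */
                    try_assign(slots, in->producer_cluster);
            if (in->assigned_cluster < 0)
                in->assigned_cluster = try_near(slots, in->pinned_cluster);
        } else if (in->has_critical_intra_producer) { /* option A */
            in->assigned_cluster = try_near(slots, in->producer_cluster);
        } else if (in->has_intra_consumer) {          /* option D: middle */
            in->assigned_cluster = try_assign(slots, 1);
            if (in->assigned_cluster < 0)
                in->assigned_cluster = try_assign(slots, 2);
        }
        /* option E, or any failed attempt above: left for Friendly pass */
    }
}
```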

4.4 Example

A simple illustration of the FDRT cluster assignment strategy is shown in Figure 6. There are two traces, T1 and T2. Trace T1 is the older of the two traces. Four instructions (out of at least 10) of each trace are shown. Instruction I7 is older than instruction I10. The arrows indicate dependencies. The arrows originate at the producer and go to the consumer. A solid black arrowhead represents an intra-trace dependence, and a white arrowhead represents an inter-trace dependence. A solid arrow line represents a critical input, and a dashed line represents a non-critical input. Arrows with numbers adjacent to them are chain dependencies, where the number represents the chain cluster to which the instructions should be pinned.

Figure 6: FDRT Cluster Assignment Example. The upper portion of this example examines four instructions from the middle of two different traces. The instruction numberings are in logical order. The arrows indicate dependencies, traveling from the producer to the consumer. The instructions are assigned to Clusters 2 and 3 based on their dependencies. The letterings in parentheses match those in Table 7.

Instruction T1_I7 is the first to be assigned to a cluster. Since it has no detected producers but has an intra-trace consumer (Option D), it is assigned to a middle cluster (Cluster 2 in this case). Instruction T1_I8 has one intra-trace producer (Option A) and is assigned to the same cluster as its producer T1_I7. The next two instructions have critical inputs from chain inter-trace dependencies (Option B).

The chain cluster value is 3, so these instructions are assigned to that cluster. Note that the intra-trace input to T1_I9 is a non-critical input and is therefore ignored.

After (or while) the next trace, trace T2, is constructed, the FDRT assignment strategy is applied to it. Instruction T2_I7 follows Option B and is assigned to Cluster 3 based on its chain inter-trace dependency. Instruction T2_I8 is assigned to the same cluster due to its intra-trace dependency (Option A). The instructions T2_I9 and T2_I10 are assigned to Cluster 2 in the exact same way, the first through Option B and the second through Option A.

In the baseline case, instructions I7 and I8 from each trace would execute on Cluster 2, while instructions I9 and I10 would execute on Cluster 3. In this case, the inter-trace consumers T2_I7 and T2_I9 would see a large inter-cluster latency. Using the FDRT assignment strategy, these two dependencies now benefit from intra-cluster forwarding, while no other forwarded data encounters additional latency.

4.5 Improvements Over Prior Method

This method of instruction reordering and cluster assignment has several advantages over Friendly's previously proposed CTCP instruction reordering strategy. The most obvious improvement is the inclusion of inter-trace information via the pin information gathered in the trace cache instruction profiles. In addition, the variable forwarding latencies are taken into account only by the FDRT instruction reordering: producer-only instructions are funneled to the middle clusters, which reduces the amount of three-cluster forwarding and increases the probability of zero- or one-cluster forwarding. Finally, Friendly's strategy examined each instruction slot and looked for a suitable instruction, while the proposed method looks at each instruction and finds a suitable slot. This subtle difference accounts for some performance improvement as well.

5 Performance Evaluation

5.1 Simulation Methodology

The clustered trace cache processor is evaluated using integer benchmarks from the SPEC CPU2000 suite [18]. The benchmarks and their respective inputs are presented in Table 8. The MinneSPEC reduced input set [9] is used when applicable; otherwise, the SPEC test input is used. The benchmark executables are the precompiled Alpha binaries available with the SimpleScalar 3.0 simulator [1]. The C benchmarks were compiled using the Digital C compiler (V ). The C++ benchmark (eon) was compiled using the Digital C++ compiler (V ).

Table 8: SPEC CINT2000 Benchmarks

  Benchmark   Input Source   Inputs
  bzip2       MinneSPEC      lgred.source 1
  crafty      SPEC test      crafty.in
  eon         SPEC test      chair.control.kajiya chair.camera chair.surfaces ppm
  gap         SPEC test      -q -m 64M test.in
  gcc         MinneSPEC      mdred.rtlanal.i
  gzip        MinneSPEC      smred.log 1
  mcf         MinneSPEC      lgred.in
  parser      MinneSPEC      2.1.dict -batch mdred.in
  perlbmk     MinneSPEC      mdred.makerand.pl
  twolf       MinneSPEC      mdred
  vortex      MinneSPEC      mdred.raw
  vpr         MinneSPEC      mdred.net small.arch.in -nodisp -place_only -init_t 5 -exit_t -alpha_t -inner_num 2

The target architecture for the compiler was a four-way Alpha 21264, which in many ways is convenient for a clustered architecture with four-wide clusters. All benchmarks are run for 100 million instructions. Selected results for longer runs can be found in the Appendix.

To perform the simulations, a detailed, cycle-accurate microprocessor simulator is interfaced to the functional simulator from the SimpleScalar 3.0 simulator suite (sim-fast) [1]. The architectural parameters for the simulated base microarchitecture are found in Table 9. The trace cache latency is determined for our target technology (100nm) and frequency (3.5 GHz) using the analytical cache model, CACTI 2.0 [16, 20].

Table 9: Baseline Architecture Configuration

  Data memory
    L1 Data Cache:      4-way, 32KB, 2-cycle access
    L2 Unified cache:   4-way, 1MB, +8 cycles
    Non-blocking:       16 MSHRs and 4 ports
    D-TLB:              128-entry, 4-way, 1-cyc hit, 30-cyc miss
    Store buffer:       32-entry w/load forwarding
    Load queue:         32-entry, no speculative disambiguation
    Main Memory:        Infinite, +65 cycles

  Fetch Engine
    Trace cache:        2-way, 1K-entry, 3-cycle access
    L1 Instr cache:     4-way, 4KB, 2-cycle access
    Branch Predictor:   16k-entry gshare/bimodal hybrid
    BTB:                512 entries, 4-way

  Execution Cluster
    Functional unit    #    Exec. lat.   Issue lat.
    Simple Integer     2    1 cycle      1 cycle
    Simple FP          --   --           --
    Memory             --   --           --
    Int. Mul/Div       1    3/20         1/19
    FP Mul/Div/Sqrt    1    3/12/24      1/12/24
    Int Branch         --   --           --
    FP Branch          --   --           --
    Inter-Cluster Forwarding Latency: 2 cycles per forward
    Register File Latency: 2 cycles
    5 reservation stations, 8 entries per reservation station,
    2 write ports per reservation station; 192-entry ROB
    Fetch width: 16   Decode width: 16   Issue width: 16
    Execute width: 16   Retire width: --

5.2 Performance Analysis

Figure 7 presents the speedups over our base architecture for different dynamic cluster assignment strategies. The proposed feedback-directed, retire-time (FDRT) cluster assignment strategy provides a 7.3% improvement.

Friendly's method improves performance by 1.9% (see footnote 3). This improvement in performance is due to enhancements in both the intra-trace and inter-trace aspects of cluster assignment. Additional simulations (not shown) show that isolating the intra-trace heuristics from the FDRT strategy results in a 3.4% improvement by itself. The remaining performance improvement generated by FDRT assignment comes from the inter-trace analysis.

Figure 7: Speedup Due to Different Cluster Assignment Strategies (No-lat Issue-time, Issue-time, FDRT, Friendly; per benchmark plus the harmonic mean).

The performance boost is due to an increase in intra-cluster forwarding and a reduction in average data forwarding distance. Table 10 presents the changes in intra-cluster forwarding. On average, both CTCP retire-time cluster assignment schemes increase the amount of same-cluster forwarding to above 50%, with FDRT assignment doing better. Inter-cluster distance is the primary performance-related factor in cluster assignment (Table 11). For every benchmark, the retire-time instruction reordering schemes are able to improve upon the average forwarding distance. In addition, the FDRT scheme always provides shorter overall data forwarding distances than the Friendly method. This is a result of funneling producers with no input dependencies to the middle clusters and placing consumers as close as possible to their producers.

For the program eon, the Friendly strategy provides a higher intra-cluster forwarding percentage than FDRT without resulting in higher performance. The reasons for this are two-fold.

Footnote 3: All speedup averages are computed using the harmonic mean.

Table 10: Percentage of Intra-Cluster Forwarding For Critical Inputs

             Base       Friendly    FDRT
  bzip2       --         60.84%    79.54%
  crafty    38.35%       54.29%    57.02%
  eon       33.73%       52.83%    51.35%
  gap       46.80%       58.77%    63.06%
  gcc       47.27%       58.14%    60.13%
  gzip      32.94%       53.91%    58.25%
  mcf       50.35%       64.69%    65.34%
  parser    44.96%       57.67%    57.26%
  perlbmk   44.95%       58.36%    62.01%
  twolf     47.83%       56.91%    58.92%
  vortex    42.41%       54.00%    58.81%
  vpr       38.67%       58.70%    59.58%
  Average   42.34%       57.43%    60.94%

Table 11: Average Data Forwarding Distance For Critical Inputs (Base, Friendly, and FDRT for each benchmark and the average; distance is the number of clusters traversed by forwarded data).

Most importantly, the average data forwarding distance is reduced compared to the Friendly method despite the extra inter-cluster forwarding. There are also secondary effects that result from improving overall forwarding latency, such as a change in the update rate for the branch predictor and BTB. In this case, our simulations show that the FDRT scheme led to improved branch prediction as well.

The two retire-time instruction reordering strategies are also compared to issue-time instruction steering in Figure 7. In one case, instruction steering and routing are modeled with no latency (labeled as No-lat Issue-time), and in the other case, four cycles are modeled (Issue-time). The results show that latency-free issue-time steering is the best, with a 9.9% improvement over the base. However, when applying an aggressive four-cycle latency, issue-time steering is only preferable for three of the 12 benchmarks, and the average performance improvement (3.8%) is almost half that of FDRT cluster assignment.

5.3 FDRT Assignment Analysis

Figure 8 is a breakdown of instructions based on their FDRT assignment strategy option. On average, 32% have only an intra-trace dependency, while 16% of the instructions have just an inter-trace pinned dependency. Only 7% of the instructions have both a pin inter-trace dependency and a critical intra-trace dependency. Therefore, 55% of the instructions are considered to be consumers and are placed near their producers.

Figure 8: FDRT Critical Input Distribution (A: just intra-trace, B: just pin, C: pin & intra, D: just consumer, E: none; plus skipped). The letters A-E correspond to the lettered options in Table 7.

Around 10% of the instructions had no input dependencies but did have an intra-trace consumer. These producer instructions are assigned to a middle cluster, where their consumers will later be placed on the same cluster. Only a very small percentage (less than 1%) of instructions with identified input dependencies are initially skipped because there is no suitable neighbor cluster for assignment. Finally, a large percentage of instructions (around 34%) are determined not to have a critical intra-trace dependency or pinned inter-trace dependency. Most of these instructions do have data dependencies, but they did not require data forwarding or did not meet the pin criteria.

Table 12 presents pinned chain characteristics, including the average number of leaders per trace and the average number of followers per trace. Because pin dependencies are limited to inter-trace dependencies, the combined number of leaders and followers is only 2.90 per trace. This is about one quarter of the instructions in a trace.

For some of the options in Table 7, the fill unit cluster assignment mechanism attempts to place consumers with intra-trace dependencies and/or chain dependencies on the same cluster as their producers.

Table 12: Pinned Chain Characteristics Per Trace (average # of Leaders and # of Followers for each benchmark and overall).

If this does not happen, a neighboring cluster is tried. Figure 9 shows the breakdown of how often these instructions are placed on the same cluster as their producer (Same Cluster), one cluster above their producer (Cluster + 1), and one cluster below their producer (Cluster - 1). Around 85% of detected consumers are placed on the same cluster as their identified producer. The FDRT strategy tries the neighbor cluster closest to the middle first. This results in nearly the same number of neighbors being moved up one cluster as being moved down one cluster.

Figure 9: Breakdown of Cluster Distance Between Detected Producers and Consumers.

5.4 Cluster Migration

This section examines the effects of permanently pinning chain leaders to a cluster. Recall that without this pinning, both chain leaders and followers could become part of any chain, including their own. This could result in constantly changing chain cluster assignments, where chain members are essentially chasing a moving target. We call this effect cluster migration.

Instruction cluster migration is quantified in Table 13 for the presented version of FDRT (Pinning) and for a version that does not pin chains to a cluster (No Pinning). On average, without pinning the chain to a cluster, 5.39% of all dynamically encountered instructions are physically reordered such that their assigned cluster is different from that of their previous dynamic invocation. The largest percentage is 8.9% for twolf. However, once the leaders are pinned, the overall percentage of instructions experiencing cluster migration is reduced to an average of 3.84%. More importantly, the percentage of chain instructions that migrate is reduced by 44%.

Table 13: Instruction Cluster Migration

            All Instr.   All Instr.    All Instr.   Chain Instr.
            Pinning      No Pinning    Reduction    Reduction
  bzip2       0.35%        0.98%        63.86%        62.73%
  crafty      4.08%        5.85%        30.23%        57.84%
  eon         5.94%        8.27%        28.20%        52.97%
  gap         3.51%        3.74%         6.11%        24.81%
  gcc         4.29%        6.63%        35.21%        60.00%
  gzip        5.97%        8.26%        27.68%        30.65%
  mcf         0.53%        0.78%        32.08%        43.02%
  parser      3.16%        5.67%        44.20%        50.51%
  perlbmk     3.77%        3.59%        -4.98%        19.30%
  twolf       5.08%        8.92%        42.99%        56.29%
  vortex      5.03%        7.25%        30.67%        50.32%
  vpr         4.36%        4.77%         8.48%        23.94%
  Average     3.84%        5.39%        28.73%        44.37%

Another interesting observation from this table is the increase in cluster migration for perlbmk when pinning the chains. In this case, the migration is reduced for chain instructions, which is the intended effect. However, for this benchmark, the scheduling of the non-chain instructions proves to be less stable when pinning the chain leaders.

The purpose of pinning the instructions is to keep instructions from oscillating between clusters and to increase the amount of intra-cluster forwarding for critical data dependencies. The change in intra-cluster forwarding can be seen in Table 14. Overall, pinning instructions increases the amount of same-cluster critical data forwarding for eight out of 12 benchmarks and by a small percentage overall (from 59.48% to 60.04%). Only bzip2 sees a significant change in intra-cluster data forwarding with pinning.

Table 14: Intra-Cluster Critical Data Forwarding During Cluster Migration

            With Pinning   No Pinning
  bzip2         --           66.69%
  crafty      49.72%         56.50%
  eon         49.72%         50.88%
  gap         63.50%         63.39%
  gcc         60.42%         59.81%
  gzip        56.03%         55.03%
  mcf         68.05%         66.73%
  parser      56.83%         56.99%
  perlbmk     65.32%         65.36%
  twolf       57.51%         57.13%
  vortex      58.93%         58.86%
  vpr         57.01%         56.34%
  Average     60.04%         59.48%

Figure 10 illustrates how the values from Tables 13 and 14 translate into performance. Other than the dramatic performance drop for bzip2, removing pinning has little noticeable effect on the individual benchmarks. This matches the observations from the tables. In addition, there is not a strong correlation between the reduction in cluster migration and the performance numbers. However, the overall speedup decreases from 7.3% to 6.4% when pinning is removed.

Figure 10: Performance Effect of Allowing Leaders To Migrate.

5.5 FDRT Modifications

Emphasizing Instructions That Stall the ROB: In the FDRT strategy described in Section 4, instructions are prioritized and assigned based on characteristics of the input (e.g., inter-trace, critical, pinned). In this section, we include an additional type of priority based on the importance of the instructions [19]. To do this, an additional piece of dynamic information is retained to help the cluster assignment mechanism. When an instruction eligible for retirement cannot be written back because one of its inputs has not arrived, it is marked as a ROB stall instruction. These instructions pass through the FDRT cluster assignment logic first, as sketched below. The idea is that instructions that stall the ROB are the most important instructions, so they should be assigned to their ideal clusters first. This prevents less important instructions from stealing the limited issue slots available to each cluster. The drawback is that an additional pass exists, requiring more complex fill unit assignment logic.
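One illustrative reading of this refinement is a two-pass version of the earlier FDRT sketch: the first pass assigns the marked ROB-stall instructions so they claim their preferred clusters, and the second pass handles the rest. Here rob_stall stands for one extra profile bit fed back through the trace cache, and fdrt_assign_one() is assumed to be the per-instruction body of the earlier fdrt_assign() sketch, factored out; none of these names come from the paper's simulator.

```c
#define NUM_CLUSTERS 4

typedef struct fdrt_instr fdrt_instr_t;               /* as sketched earlier */
void fdrt_assign_one(fdrt_instr_t *in, int slots[NUM_CLUSTERS]);

void fdrt_assign_rob_first(fdrt_instr_t **trace, const int rob_stall[], int n)
{
    int slots[NUM_CLUSTERS] = {0};

    /* Pass 1: instructions that stalled at the ROB head go first. */
    for (int i = 0; i < n; i++)
        if (rob_stall[i])
            fdrt_assign_one(trace[i], slots);

    /* Pass 2: the remaining instructions fill whatever slots are left. */
    for (int i = 0; i < n; i++)
        if (!rob_stall[i])
            fdrt_assign_one(trace[i], slots);
}
```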

Figure 11 shows the change in performance when the new ROB stall pass is added to the existing FDRT strategy. The overall performance boost from FDRT improves from 7.3% to 8.4%. All benchmarks except for gap and twolf improve with the new information, but only bzip2, gzip, and parser have noticeable improvements.

Figure 11: Performance Effect of Incorporating ROB Stall Information.

Overall, this new modification did not provide the desired result. Further work needs to be done to recognize critical chains of instructions instead of just the instructions at the end of the chain [19, 4]. This way, all the instructions in the dependency chain can be assigned to the same cluster.

Chaining Intra-Trace Dependencies: In Figure 12, FDRT + Chain Intra-Trace Deps represents allowing instructions that have an intra-trace dependency with their producer to be part of the pinned chains. This increases the number of leaders and followers, which in turn increases the number of instructions handled by options B and C in Table 7.

Figure 12: Performance Impact of Chaining Instructions With Intra-Trace Dependencies.

This results in a significant reduction in performance improvement. The inclusion of intra-trace dependencies in chaining causes congestion during the early steps of FDRT scheduling, and the less important intra-trace dependencies prevent some inter-trace dependencies from forwarding their data within the same cluster.

6 Strategy Robustness

This section presents the speedup for our main cluster assignment strategies on slightly different base models. We believe that the presented base model incorporates realistic parameters and implementation characteristics for a future clustered trace cache processor.

However, it is fair to ask how the presented cluster assignment strategy will perform in environments that do not match our estimates. We choose to present more aggressive base configurations, i.e., machine models that will not see as much benefit from modifications to cluster assignment.

6.1 Mesh Network

The first modification is a slight change in the cluster communication network. In this mesh network, the only change is that clusters 1 and 4 communicate directly. This is significant in that it eliminates the three-cluster communications. This approach is advocated by Parcerisa et al. [12]. The feasibility of this approach may depend on the layout of the clusters as well as the instruction path.

The performance improvements for a mesh network are shown in Figure 13. These improvements are relative to a base machine with a mesh network. The results with the mesh communication scheme are similar to the baseline results. However, the speedups are lower because the mesh network eliminates the longest latencies, leaving less room for optimization. Using FDRT assignment, performance improves by 6.3% on average. The Friendly method improves performance by 2.1%. FDRT assignment is superior to Friendly assignment for all benchmarks except eon.

Figure 13: Performance With A Mesh Cluster Communication Network.

Issue-time assignment improves performance by 2.3% and is the best choice for eon as well as


More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. T seng and Krste Asanoviü MIT Laboratory for Computer Science, Cambridge, MA 02139, USA ISCA2003 1 Motivation

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical

More information

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N.

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N. Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses Onur Mutlu Hyesoon Kim Yale N. Patt High Performance Systems Group Department of Electrical

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Dynamic Memory Dependence Predication

Dynamic Memory Dependence Predication Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths yesoon Kim José. Joao Onur Mutlu Yale N. Patt igh Performance Systems Group

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of

More information

Copyright by Mary Douglass Brown 2005

Copyright by Mary Douglass Brown 2005 Copyright by Mary Douglass Brown 2005 The Dissertation Committee for Mary Douglass Brown certifies that this is the approved version of the following dissertation: Reducing Critical Path Execution Time

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture Motivation Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA BARC2004 Increasing demand on

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

The Predictability of Computations that Produce Unpredictable Outcomes

The Predictability of Computations that Produce Unpredictable Outcomes This is an update of the paper that appears in the Proceedings of the 5th Workshop on Multithreaded Execution, Architecture, and Compilation, pages 23-34, Austin TX, December, 2001. It includes minor text

More information

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

PMPM: Prediction by Combining Multiple Partial Matches

PMPM: Prediction by Combining Multiple Partial Matches 1 PMPM: Prediction by Combining Multiple Partial Matches Hongliang Gao Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {hgao, zhou}@cs.ucf.edu Abstract

More information

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false. CS 2410 Mid term (fall 2015) Name: Question 1 (10 points) Indicate which of the following statements is true and which is false. (1) SMT architectures reduces the thread context switch time by saving in

More information

Macro-op Scheduling: Relaxing Scheduling Loop Constraints

Macro-op Scheduling: Relaxing Scheduling Loop Constraints Appears in Proceedings of th International Symposium on Microarchitecture (MICRO-), December 00. Macro-op Scheduling: Relaxing Scheduling Loop Constraints Ilhyun Kim and Mikko H. Lipasti Department of

More information

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information

Dynamic Scheduling with Narrow Operand Values

Dynamic Scheduling with Narrow Operand Values Dynamic Scheduling with Narrow Operand Values Erika Gunadi University of Wisconsin-Madison Department of Electrical and Computer Engineering 1415 Engineering Drive Madison, WI 53706 Submitted in partial

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

CS 152, Spring 2011 Section 8

CS 152, Spring 2011 Section 8 CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

High Performance SMIPS Processor

High Performance SMIPS Processor High Performance SMIPS Processor Jonathan Eastep 6.884 Final Project Report May 11, 2005 1 Introduction 1.1 Description This project will focus on producing a high-performance, single-issue, in-order,

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Superscalar Organization

Superscalar Organization Superscalar Organization Nima Honarmand Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery Wrong Path Events and Their Application to Early Misprediction Detection and Recovery David N. Armstrong Hyesoon Kim Onur Mutlu Yale N. Patt University of Texas at Austin Motivation Branch predictors are

More information

Dynamic Branch Prediction

Dynamic Branch Prediction #1 lec # 6 Fall 2002 9-25-2002 Dynamic Branch Prediction Dynamic branch prediction schemes are different from static mechanisms because they use the run-time behavior of branches to make predictions. Usually

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

C152 Laboratory Exercise 3

C152 Laboratory Exercise 3 C152 Laboratory Exercise 3 Professor: Krste Asanovic TA: Christopher Celio Department of Electrical Engineering & Computer Science University of California, Berkeley March 7, 2011 1 Introduction and goals

More information

Achieving Out-of-Order Performance with Almost In-Order Complexity

Achieving Out-of-Order Performance with Almost In-Order Complexity Achieving Out-of-Order Performance with Almost In-Order Complexity Comprehensive Examination Part II By Raj Parihar Background Info: About the Paper Title Achieving Out-of-Order Performance with Almost

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set

2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set 2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set Hyesoon Kim M. Aater Suleman Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering University of Texas

More information

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722 Dynamic Branch Prediction Dynamic branch prediction schemes run-time behavior of branches to make predictions. Usually information about outcomes of previous occurrences of branches are used to predict

More information

Static Branch Prediction

Static Branch Prediction Static Branch Prediction Branch prediction schemes can be classified into static and dynamic schemes. Static methods are usually carried out by the compiler. They are static because the prediction is already

More information

Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores

Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores Mary D. Brown Yale N. Patt Electrical and Computer Engineering The University of Texas at Austin {mbrown,patt}@ece.utexas.edu

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information