Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations

James Laudon, Anoop Gupta, and Mark Horowitz
Computer Systems Laboratory, Stanford University, Stanford, CA
(James Laudon is currently at Silicon Graphics, 2011 N. Shoreline Blvd., Mountain View, CA.)

Abstract

There is an increasing trend to use commodity microprocessors as the compute engines in large-scale multiprocessors. However, given that the majority of the microprocessors are sold in the workstation market, not in the multiprocessor market, it is only natural that architectural features that benefit only multiprocessors are less likely to be adopted in commodity microprocessors. In this paper, we explore multiple-context processors, an architectural technique proposed to hide the large memory latency in multiprocessors. We show that while current multiple-context designs work reasonably well for multiprocessors, they are ineffective in hiding the much shorter uniprocessor latencies using the limited parallelism found in workstation environments. We propose an alternative design that combines the best features of two existing approaches, and present simulation results that show it yields better performance for both multiprogrammed workloads on a workstation and parallel applications on a multiprocessor. By addressing the needs of the workstation environment, our proposal makes multiple contexts more attractive for commodity microprocessors.

1 Introduction

Large-scale multiprocessors, such as the one shown in Figure 1, are increasingly built using commodity microprocessors [2, 16]. While these commodity microprocessors provide a relatively low-cost compute node, their performance depends heavily on employing a sophisticated cache hierarchy to insulate the processor from the long remote memory latency. Providing the ability to cache shared data [16] can greatly increase the amount of computation that can be done before requiring a long-latency operation; however, it cannot remove the long-latency operations completely. To address the performance loss associated with remote cache misses, several latency tolerating schemes have been proposed, including relaxed memory consistency models [7], prefetching [17], and multiple-context processors [13, 22, 26]. Recent studies [8, 13] have shown multiple contexts to be a promising way to address the problem; this paper focuses on the multiple-context solution.

Figure 1: General structure of a scalable shared-memory multiprocessor.

Multiple contexts tolerate latency by overlapping the long-latency operations of one context with the execution of other contexts. Multiple contexts are a universal latency tolerance mechanism: any latency can be tolerated as long as (a) enough parallelism is available and (b) the cost to switch between contexts is substantially lower than the latency to be tolerated. Points (a) and (b) both must be addressed for optimal multiple-context performance. By reducing the number of contexts required for latency tolerance, the number of workloads that can benefit from multiple contexts is increased. By lowering the switch cost, the class of stalls the multiple-context processor can tolerate is broadened. This second point is important because processors encounter a wide range of stall latencies, from the short latencies caused by pipeline dependencies between instructions and primary cache misses which hit in the secondary cache, to the much longer latencies of local and remote memory accesses and interprocess synchronization.
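As a rough illustration of points (a) and (b), the sketch below uses a simple deterministic model that is not taken from the paper: every context runs for R useful cycles, stalls for L cycles, and a switch costs C cycles. Utilization grows with the number of contexts until it saturates at R / (R + C), so lowering the switch cost both raises that ceiling and reduces how many contexts are needed to reach it.

    # Back-of-the-envelope model of multiple-context latency tolerance
    # (an illustrative simplification, not a model used in the paper).

    def utilization(n_contexts, run_length, latency, switch_cost):
        unsaturated = n_contexts * run_length / (run_length + latency)
        saturated = run_length / (run_length + switch_cost)
        return min(unsaturated, saturated)

    # Point (a): a long remote-memory latency needs enough contexts to cover it.
    print(utilization(4, run_length=20, latency=100, switch_cost=7))  # about 0.67
    # Point (b): for short stalls (a secondary-cache hit, say), a 7-cycle switch
    # cost caps the benefit, while a 1-cycle switch cost leaves little to lose.
    print(utilization(4, run_length=20, latency=9, switch_cost=7))    # 20/27, about 0.74
    print(utilization(4, run_length=20, latency=9, switch_cost=1))    # 20/21, about 0.95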
Most existing multiple-context designs have targeted large-scale multiprocessors. Many applications running in this environment have substantial parallelism, and application performance is often dominated by the large remote memory latency [8, 13]. In contrast, high-performance commodity microprocessors primarily target the workstation environment. Parallelism is less abundant in workstation workloads, and may consist of running a large application in the background while editing, reading mail, video-conferencing, or stressing sophisticated graphical user interfaces in the foreground. In addition, uniprocessor memory latencies tend to be shorter than in a multiprocessor, because memory can be more tightly coupled to the processor. Workstation cache hierarchies are also more effective because there are no misses caused by interprocessor communication. Because of their emphasis on multiprocessors, existing multiple-context architectures either require too much parallelism or are simply unable to handle the shorter latencies prevalent in workstations. We believe that until multiple contexts address the needs of workstations, they simply will not be incorporated in mainstream microprocessors.

Therefore, this paper focuses on developing a multiple-context architecture suitable for the workstation environment. Such a processor must satisfy several constraints. First, because the amount of additional parallelism available in the workload is modest, effective latency tolerance must be possible with a small number of contexts. Second, because certain jobs are higher priority and require the shortest time to completion, the single-threaded performance of the multiple-context processor must be the same as that of a comparable single-context processor. Third, the cost to switch contexts must be low enough to tolerate short stalls such as those caused by pipeline dependencies and primary cache misses that hit in the secondary cache. Finally, the implementation cost of adding multiple contexts to the processor must be reasonable, or the implementation complexity budget of the processor will instead be spent on competing features.

This paper is organized as follows. In Section 2, we describe how previous multiple-context proposals do not address all these constraints. Section 3 presents our proposal for a new multiple-context processor. Section 4 describes the experimental methodology, and Section 5 shows that our proposed scheme substantially outperforms the existing approaches for both workstations and multiprocessors. In Section 6, we outline the implementation costs of our proposal, and in Section 7 we discuss future trends and recent related work. Finally, we conclude in Section 8.

2 Previous Work

Multiple-context processors employed in commercial and research machines have used one of two approaches: either fine-grained or blocked. In this section we briefly examine each of these approaches.

2.1 Fine-grained Multiple-Context Processors

Fine-grained processors, exemplified by the Denelcor HEP [22], switch contexts each cycle, effectively making the switch cost zero. Because of this low switch cost the processor is able to tolerate the latency of both pipeline dependencies and memory references. Unfortunately, machines based on these fine-grained processors (a) have not supported data caches and (b) have limited each context to only a single instruction active in the pipeline (i.e., the processor has no pipeline interlocks). While these constraints allow the pipeline design to be simplified, they place two onerous burdens on applications. First, a large number of threads is necessary to fully utilize the processor, enough to both fill the pipeline and hide the memory latency. Second, the performance of a single thread is extremely poor, as each thread can issue a new instruction at best every pipeline-depth cycles. Therefore, any serial portion of an application can greatly impact the overall application performance. The limitations of fine-grained processors are quite severe (especially for workstation workloads), and most recent designs have instead focused on the blocked approach.

2.2 Blocked Multiple-Context Processors

Blocked multiple-context processors, exemplified by Weber and Gupta [26] and the MIT APRIL [13], share the processor between a number of contexts; however, a single context utilizes all of the processor resources until it reaches a long-latency operation, such as a cache miss, at which point the processor switches to another context. Blocked multiple-context processors address the poor single-thread performance and the need for a large number of contexts of the fine-grained schemes, but they do so at the expense of increasing the context switch cost.
This is because the decision to switch contexts depends on determining whether a cache miss occurred, and this determination is made late in the pipeline. Without extensive modifications to the processor, the cost to switch contexts will be close to the depth of the pipeline, as the partially-executed instructions from the switching context will need to be flushed from the pipeline. In an attempt to reduce this switch cost, a few blocked architectures have been proposed which replicate the pipeline registers [18, 19]. With this pipeline register replication, the context-switch cost could be as low as a single cycle (at least one cycle is needed to broadcast the switch decision to the entire chip for use by the TLB, pipeline forwarding logic, etc.). Unfortunately, replicating the pipeline registers results in a substantial increase in pipeline size, as latches which hold pipeline state are a significant fraction of the total pipeline area. In addition, the outputs of these replicated latches need to be multiplexed before being sent to the combinational portion of the pipeline. When these multiplexor delays are combined with the higher fanout and longer wire delays resulting from the area increase, it is difficult to imagine that the cycle time of the processor will not be significantly impacted. Instead of trying to reduce the switch cost by brute force, a better architecture can be developed by reexamining the fine-grained approach.

3 Interleaved Multiple-Context Proposal

The primary problems with fine-grained processors, the need for large numbers of active contexts and the poor single-thread performance, are not intrinsic limitations of cycle-by-cycle context switching. Instead, two decisions incorporated into fine-grained processors are the culprits. The first is the decision not to support data caching, which makes every memory reference a long-latency operation. The second is the decision not to provide hardware interlocks, which prevents a context from having more than one instruction in the pipeline, thereby increasing the minimum latency of each instruction to the pipeline depth.

The basic idea behind our proposal, the interleaved multiple-context processor, is simple. By adding both caching and full pipeline interlocks to the fine-grained scheme, it becomes possible to design a multiple-context processor which interleaves contexts on a cycle-by-cycle basis, yet effectively supports a single context. More specifically, issuing of instructions is switched each cycle between available contexts in a round-robin fashion. Contexts become unavailable when they encounter a long-latency operation, and are made available again when the long-latency operation completes. When a context becomes unavailable (an operation analogous to the context switch of the blocked scheme), the processor only squashes those instructions in the pipeline from the context becoming unavailable.

The advantage of this selective instruction squashing is illustrated in Figure 2 for a processor with four active contexts (labeled A-D) and a MIPS R4000-like pipeline (the pipeline we will be using in our simulations).

Figure 2: Example illustrating the lower context switch cost of the interleaved scheme.

In this figure, context A has encountered a cache miss, and will need to be made unavailable. Unfortunately, the cache access occurs late in the pipeline and the context switch determination cannot be made until the WB stage. Because the blocked scheme will need to squash all instructions in the pipeline (including the instruction which caused the cache miss) before it can start the next context, a context switch costs seven cycles on the pipeline shown. However, for the interleaved scheme only the instructions from context A need to be squashed, reducing the overhead to handle the cache miss to two cycles. This reduction in switch cost, from roughly the full pipeline depth for the blocked scheme to only the number of in-flight instructions from the unavailable context for the interleaved processor, allows the interleaved scheme to effectively tolerate much shorter stall latencies.

In addition to lowering the switch cost, the cycle-by-cycle interleaving spaces out instructions from the same context. Like the pipeline of a single-context processor, the interleaved processor stalls when an instruction requires a result from a previous instruction which has not yet been computed. However, because there are no dependencies between instructions from different contexts, if sufficient instructions from other contexts are interleaved between dependent instructions from the same context, the stall due to the pipeline dependency can be avoided. (Of course, the compiler will still need to schedule code to minimize stalls, as it is possible that only a single context may be executing. The instruction-set architecture best suited for the interleaved processor also has no compiler-filled branch or load delay slots, which is the trend for modern architectures [6].)

To illustrate these two advantages of the interleaved scheme more concretely, we show the execution of four threads for both the blocked and interleaved schemes in Figure 3. The four threads all end with their final instruction causing a cache miss and are:

A: two instructions long.
B: three instructions long, with a two-cycle pipeline dependency between the first and second instructions.
C: four instructions long.
D: six instructions long.

Figure 3: Comparison of the blocked and interleaved multiple-context schemes for a set of four threads.

The blocked scheme is shown on the left timeline. Context A starts executing, issuing its two instructions, the second of which causes a cache miss. The pipeline must be flushed at this point before context B can execute, as shown below the timeline. Context B then executes one instruction, stalls due to the pipeline dependency, and then executes until it encounters its cache miss, at which point the pipeline is flushed and C starts executing, and so on. The interleaved scheme executing the same set of threads is shown on the right timeline. The processor starts with all four contexts being interleaved. As we can see, this interleaving is enough to separate the dependent instructions from context B, completely hiding the pipeline dependency. The lower switch cost of the interleaved scheme is also illustrated in Figure 3; the switch cost associated with a cache miss is reduced from the seven cycles of the blocked scheme to two cycles for context A and three cycles for contexts B and C. Note that as contexts are made unavailable, the number of contexts being interleaved on the pipeline decreases, until we reach the point where only context D is being interleaved (and the processor is now behaving like a single-context processor). As a result of the lower switch cost and pipeline dependency tolerance, the interleaved scheme was able to complete all four threads well before the blocked scheme.
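The sketch below replays this four-thread example with a simplified cycle-level model written only for illustration; it is not the simulator described in Section 4. The thread lengths, the seven-stage pipeline, and the fact that a miss is recognized at WB are taken from the example above, so the absolute cycle counts are indicative only.

    # Illustrative sketch of the Figure 3 example (not the paper's simulator).
    # Each thread is a list of (dependency_stall, causes_cache_miss) instructions;
    # every thread's last instruction misses in the cache.

    PIPE_DEPTH = 7   # IF1 IF2 RF EX DF1 DF2 WB; a miss is recognized at WB

    THREADS = {
        "A": [(0, False), (0, True)],
        "B": [(0, False), (2, False), (0, True)],   # 2-cycle dependency, 1st -> 2nd
        "C": [(0, False), (0, False), (0, False), (0, True)],
        "D": [(0, False)] * 5 + [(0, True)],
    }

    def blocked():
        """One context owns the pipeline; a miss flushes it before the next starts."""
        cycle, finish = 0, 0
        for instrs in THREADS.values():
            last = None
            for stall, _ in instrs:
                if last is not None:
                    cycle = max(cycle, last + 1 + stall)   # pipeline dependency bubble
                last, cycle = cycle, cycle + 1
            finish = last + 1
            cycle = last + PIPE_DEPTH                      # wait for the miss/flush
        return finish

    def interleaved():
        """Round-robin issue, one instruction per cycle, from contexts with work left.
        Because each thread ends with its miss, a missed context simply has nothing
        left to issue while the other contexts keep going."""
        names = list(THREADS)
        pc = {n: 0 for n in names}
        last = {n: None for n in names}
        cycle, rr = 0, 0
        while any(pc[n] < len(THREADS[n]) for n in names):
            for k in range(len(names)):
                n = names[(rr + k) % len(names)]
                if pc[n] < len(THREADS[n]):
                    rr = (names.index(n) + 1) % len(names)
                    stall, _ = THREADS[n][pc[n]]
                    if last[n] is None or cycle >= last[n] + 1 + stall:
                        last[n], pc[n] = cycle, pc[n] + 1  # issue; otherwise a bubble
                    break
            cycle += 1
        return cycle

    print("blocked:", blocked(), "cycles   interleaved:", interleaved(), "cycles")

Run as written, the interleaved version finishes all four threads in well under half the cycles of the blocked one, matching the qualitative picture in Figure 3.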
We have just qualitatively argued that the interleaved scheme should outperform the blocked, and we now need to quantify that gain for both the uniprocessor and multiprocessor environments. Evaluating the performance of the interleaved processor requires a simulator that supports accurate pipeline and memory system modeling. We also need the code to be correctly scheduled for our pipeline. The next section briefly discusses this simulation environment.

4 Evaluation Methodology

We begin by describing our base workstation architecture. We then outline the simulation environment and multiprogramming workloads studied. Both the uniprocessor and multiprocessor evaluations use the same simulation environment; however, the multiprocessor study examines a different system architecture and application suite. These differences for the multiprocessor study will be discussed in Section 5.2 before presenting the multiprocessor performance results.

4.1 Base Architecture

Figure 4: Base architecture.

The system architecture models a high-end workstation, as shown in Figure 4. The cache parameters are given in Table 1; all caches are direct-mapped.

The memory system is four-way interleaved, connected to the processor across a high-speed, split-transaction bus. The unloaded memory access times are given in Table 2. Cache and memory contention are modeled, and can add to these latencies. While the data cache is lockup-free [12], the instruction cache is blocking and no context switching will be done for instruction cache misses. We have deliberately modeled an aggressive memory system, because its short memory latencies are difficult for multiple contexts to tolerate.

Table 1: Cache parameters.
  Parameter              Primary Data   Primary Inst   Secondary
  Size                   64 Kbytes      64 Kbytes      1 Mbyte
  Line Size              32 bytes       32 bytes       32 bytes
  Fetch Size             1 line         2 lines        1-2 lines
  Read Occupancy         1 cycle        1 cycle        2 cycles
  Write Occupancy        1 cycle        NA             2 cycles
  Invalidate Occupancy   2 cycles       NA             4 cycles
  Cache Fill Occupancy   1 cycle        8 cycles       2 cycles

Table 2: Memory latencies.
  Hit in Primary Cache     1 cycle
  Hit in Secondary Cache   9 cycles
  Reply from Memory        34 cycles

The processor architecture was selected to be representative of current, high-end RISC microprocessors. It executes the MIPS II instruction set [9], except that the delayed branches of the MIPS architecture have been removed. Delayed branches are an artifact of the first-generation RISC processors that do not extend well into future generations. The integer pipeline of the processor is based on the MIPS R4000 [9], but is slightly more aggressive. As Figure 5 shows, the integer pipeline modeled is seven stages deep, one less than the R4000. The R4000 has a separate Tag Check stage between DF2 and WB, which has been folded into the DF2 stage for our processor. The floating-point pipeline is based on the DEC Alpha [6], and is nine stages deep. Both pipelines forward results whenever possible to reduce operation latency. The arrows above the pipelines in Figure 5 denote possible result forwarding paths.

Figure 5: Processor pipeline (integer: IF1 IF2 RF EX DF1 DF2 WB; floating point: IF1 IF2 RF EX1 EX2 EX3 EX4 EX5 WB).

Because the ALU provides result bypassing, most operations have an effective latency of a single cycle. However, a small number of operations require longer to complete. Load operations are followed by two delay slots; the load result is not available until the end of DF2 for forwarding to the following EX phase. Branches have an even longer latency. Since the branch condition is evaluated in the EX phase, taken branches could potentially cost four cycles; however, a 2048-entry direct-mapped branch target buffer (BTB) is used to reduce the branch penalty to zero for a correctly predicted branch (mispredicted branches still pay a three-cycle penalty).

The latency and issue rate of operations which take greater than a single cycle are given in Table 3. For floating-point division, operation on single-precision numbers is faster than for double-precision, and the single-precision numbers are shown in parentheses in the table.

Table 3: Long-latency operations.
  Operation                          Issue     Latency
  Integer Divide
  Integer Multiply
  Shift                              1         2
  Load                               1         3
  Floating-point Add/Sub/Conv/Mult   1         5
  Floating-point Divide              61 (31)   61 (31)

4.2 Compiler and Simulation Environment

In order to accurately model the effects of latency on processor performance, a number of issues must be addressed. First, the code needs to be highly optimized and scheduled for the target system pipeline. Second, the simulator needs to ensure that a correct interleaving of processes is used. Finally, the pipeline and memory system simulator must accurately model the real processor and system architecture.
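As a concrete reading of Tables 1 and 2, the following sketch (our own simplification, ignoring occupancy, contention, and everything about fills beyond simple tag allocation) returns the unloaded latency of a load through the two-level, direct-mapped hierarchy:

    # Minimal model of the direct-mapped hierarchy of Tables 1 and 2
    # (an illustrative simplification, not the simulator described below).

    LINE_SIZE = 32

    class DirectMappedCache:
        def __init__(self, size_bytes):
            self.num_lines = size_bytes // LINE_SIZE
            self.tags = [None] * self.num_lines

        def access(self, addr):
            index = (addr // LINE_SIZE) % self.num_lines
            tag = addr // (LINE_SIZE * self.num_lines)
            hit = self.tags[index] == tag
            self.tags[index] = tag            # allocate the line on a miss
            return hit

    primary = DirectMappedCache(64 * 1024)    # primary data cache
    secondary = DirectMappedCache(1024 * 1024)

    def load_latency(addr):
        if primary.access(addr):
            return 1                          # hit in primary cache
        if secondary.access(addr):
            return 9                          # hit in secondary cache
        return 34                             # reply from memory

    print(load_latency(0x1000), load_latency(0x1000))   # 34 then 1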
Our compilation and simulation environment addresses all these issues. To ensure that application performance was not affected by suboptimal compilation, we compiled all applications using recent MIPS CC and F77 compilers (version 2.1) with the -O2 level of optimization. To schedule the code properly for our target pipeline, the MIPS compilers produced assembly files which were then run through the Twine scheduler [23], an aggressive global scheduler that is part of the Stanford SUIF compiler. Twine takes in an instruction set parameter file, which allows the latency, functional unit usage, and ability to cause exceptions to be specified for each instruction. Twine scheduled the optimized code for our pipeline, leaving the final two issues.

We have developed a detailed processor and system simulator that interfaces to Tango-Lite [5]. Tango-Lite is a simulation package that allows execution-driven simulation of parallel programs on uniprocessors. Our simulator accepts basic block and memory reference addresses from Tango-Lite. The basic block addresses index into the properly scheduled object file and are used to generate register and functional unit usage information on the instructions in that basic block. This usage information is then passed to the pipeline simulator. The pipeline simulator models all major pipeline dependencies, including load, execution result, execution issue, and control-transfer hazards. These hazards are tracked in the simulator through a scoreboard which maintains information on the functional unit and register usage of all operations in progress. Instructions are not allowed to progress to the execute stage until the desired functional unit is available and all register dependencies (true, anti-, and output) are satisfied. Finally, if the instruction is a load or store, the proper address and reference type is sent to the memory system simulator.
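A minimal sketch of that scoreboard check follows (field and method names are ours; the real simulator also tracks result timing and functional-unit occupancy):

    # Sketch of the hazard check described above: an instruction may enter EX
    # only when its functional unit is free and no in-flight operation conflicts
    # with its registers.

    class Scoreboard:
        def __init__(self):
            self.in_flight = []          # (functional_unit, reads, writes) of active ops

        def can_issue_to_ex(self, unit, reads, writes):
            for busy_unit, busy_reads, busy_writes in self.in_flight:
                if unit == busy_unit:                    # structural hazard
                    return False
                if busy_writes & set(reads):             # true dependency (RAW)
                    return False
                if busy_reads & set(writes):             # anti-dependency (WAR)
                    return False
                if busy_writes & set(writes):            # output dependency (WAW)
                    return False
            return True

        def issue(self, unit, reads, writes):
            self.in_flight.append((unit, set(reads), set(writes)))

        def retire(self, unit, reads, writes):
            self.in_flight.remove((unit, set(reads), set(writes)))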

In addition to modeling pipeline dependencies, the processor simulator also handles the multiplexing of the multiple contexts on the processor. For the blocked scheme, contexts are switched whenever a cache miss occurs. In addition, the blocked processor is assumed to have an explicit context switch instruction, for use in tolerating latencies other than cache misses. The interleaved scheme switches contexts each cycle, so its context switch cost is zero. However, making a context unavailable does cost cycles (equal to the number of instructions in the pipeline from that context), and this will be referred to as the switch cost of the interleaved scheme. The interleaved processor has the ability to issue a backoff instruction to tolerate latencies other than those due to cache misses. The backoff instruction causes a context to become unavailable for a number of cycles specified by the instruction and is analogous to the explicit switch of the blocked scheme. The context switch costs for both schemes are listed in Table 4, with the exact cost for the interleaved scheme depending on the dynamic context interleaving. The cost for the explicit switch and backoff instructions is smaller than for cache misses, as a switch can be triggered immediately upon decoding the instruction.

Table 4: Context switch costs (cycles).
  Switch Cause                  Blocked   Interleaved
  Cache Miss
  Explicit switch instruction   3         NA
  Backoff instruction           NA        1-3

4.3 Application Workloads

To model workstation usage, we generated six workloads using members of the Spec89 suite. The programs selected include Doduc, Eqntott, Li, Matrix300, Tomcatv, and NASA7, with NASA7 being broken into its seven kernels: Btrix, Cholsky, Cfft2d, Emit, Gmtry, Mxm, and Vpenta. (These applications were selected from the Spec89 suite solely on the basis of being able to pass through our compilation system with no or only minor adjustments to the application itself.) The six workloads were constructed to have the following characteristics: IC stresses the instruction cache, DC stresses the data cache, DT stresses the data TLB, FP is floating-point intensive, and R0 and R1 are random workloads. A seventh workload (SP) consists of uniprocessor versions of four SPLASH [21] applications (using the input data sets given in Section 5.2). These seven workloads are listed in Table 5.

Table 5: Uniprocessor workloads.
  IC: Doduc, Li, Eqntott, Mxm
  DC: Cfft2d, Gmtry, Tomcatv, Vpenta
  DT: Btrix, Cholsky, Gmtry, Vpenta
  FP: Emit, Cholsky, Doduc, Matrix300
  R0: Emit, Btrix, Cfft2d, Eqntott
  R1: Mxm, Li, Matrix300, Tomcatv
  SP: MP3D, Water, Locus, Barnes

Since many of the applications take several minutes of CPU time without simulation, and our simulator slows down the applications over a thousandfold, we were not able to run the workloads to completion. Instead, each workload was run for 36 time-slices (roughly 1 second of CPU time). Because we only simulated a fraction of the complete application, it was important that we were simulating the section of the application responsible for most of the execution time for the complete run. This was ensured by not generating references to the simulator until the initialization phase of the applications in the workloads had been completed. In addition, to remove cold-start effects, each application in the workload was run for a time slice before simulation statistics were gathered. Thus, when simulation statistics were being gathered, the applications had completed initialization and the caches were loaded.

A simple model of the operating system was employed for this study. The time-slice used by the operating system is 30 ms, and assuming a 200 MHz processor this translates to a scheduler interrupt every six million processor cycles. The scheduler uses a simple affinity mechanism which keeps the same application scheduled on the processor for the equivalent of three time slices (i.e., for the single-context processor each application runs for three time slices before switching, for the two-context processor a pair of applications is run for six time slices before switching, etc.). Because of this affinity mechanism, the number of processes switched on each scheduler call will either be zero or the number of hardware contexts supported.
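A sketch of that affinity policy, in our own rendering (the paper gives only the description above), looks like this:

    # The scheduler model described above: a 30 ms time slice (six million cycles
    # at 200 MHz), with each group of applications kept resident for the
    # equivalent of three time slices per hardware context.

    TIME_SLICE_CYCLES = int(30e-3 * 200e6)    # 6,000,000 cycles
    AFFINITY_SLICES = 3

    def resident_applications(run_queue, contexts, slice_number):
        """Applications loaded on the hardware contexts during this time slice."""
        groups = len(run_queue) // contexts
        group = (slice_number // (AFFINITY_SLICES * contexts)) % groups
        return run_queue[group * contexts:(group + 1) * contexts]

    workload = ["Doduc", "Li", "Eqntott", "Mxm"]          # the IC workload
    for s in (0, 3, 6, 9):
        print(s, resident_applications(workload, 2, s))
    # slices 0-5 run Doduc and Li, slices 6-11 run Eqntott and Mxm, and so on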
The simulator does not actually run the operating system scheduler at each interrupt; rather, the scheduler is modeled as a routine with negligible latency which displaces some number of cache lines. The amount of cache interference caused by the scheduler is based on a study by Torrellas [25] of IRIX, a UNIX System V variant, running on a Silicon Graphics 4D/340. The scheduler interference depends on the number of contexts which must be swapped, as shown in Table 6, and is modeled by issuing the number of memory requests given in the table to random addresses.

Table 6: Operating system costs: instruction and data cache interference as a function of the number of processes switched.

5 Performance Results

We start this section by presenting results for the workstation workloads, and then turn to the multiprocessor evaluation. In both environments, we show that the interleaved approach significantly outperforms the blocked approach.

5.1 Uniprocessor Results

Before we examine the effects of multiple contexts on multiprogramming throughput, we first need to discuss the impact of multiple contexts on process scheduling. As was observed in [3], applications with lower miss rates tend to get more cycles under blocked multiple contexts than applications with higher miss rates. This is because lower miss rates usually translate into longer runlengths, and assuming a strict round-robin scheduling, the fraction of the total processor cycles allocated to each application will depend on the size of its runlength relative to the other runlengths. A similar effect also occurs for the interleaved scheme. Assuming the processor supports C contexts, an application receives 1/C of the processor cycles as long as it is not unavailable due to an outstanding memory request. Since applications with longer runlengths spend less of their time waiting for memory, they will get a larger fraction of the processor. Because we would like to compare the blocked and interleaved schemes based on how well they improve the throughput of all applications in the workload, not on whether they devote more processor time to applications with better memory behavior, we will assume that the hardware provides context-usage feedback to the operating system, and the operating system schedules the workload to even out the amount of processor cycles devoted to each application. Therefore, we will normalize our results (which do not include the effects of this feedback to the operating system) to the case where each of the C applications is given 1/C of the processor.

Table 7: Increase in application throughput with multiple contexts, for the interleaved and blocked schemes with two and four contexts, across the IC, DC, DT, FP, R0, R1, and SP workloads and their mean.

The importance of a low switch cost shows up as we look at the processor utilization breakdown for the blocked processor in Figure 6. Processor utilization is broken into five categories: busy (time spent doing useful work), instruction (time stalled due to pipeline dependencies), inst cache/TLB (time stalled on memory due to instruction references), data cache/TLB (time stalled on memory due to data references), and context switch (time spent context switching). The numbers on top of the bars show the percent of time spent busy. The seven workloads are listed along the bottom of the graphs; results are given for one, two, and four contexts per processor.

Figure 6: Blocked scheme processor utilization.

In general, the processor utilization of the blocked scheme does not increase much with additional contexts. This is because many of the workloads simply do not contain that much memory or long-instruction latency for the multiple contexts to tolerate. Even for workloads where there is a fair amount of memory latency, such as DC and DT, most of the memory stall time is due to secondary cache hits, and the gains from tolerating the memory latency are consumed by the switch overhead. Consequently, the throughput of these workloads with four contexts only increases by 23% and 9%, respectively.

In contrast, the lower switch cost of the interleaved scheme allows it to tolerate both pipeline dependencies and memory latency, as shown in Figure 7. Processor utilization increases significantly under the interleaved scheme for workloads with large amounts of instruction latency, as the cycle-by-cycle interleaving tolerates shorter instruction latencies, while the backoff instruction is used to tolerate long instruction latency. In addition, the memory latency of workloads such as DC and DT can be effectively tolerated because of the lower switch cost of the interleaved scheme, resulting in a 65% and 46% increase in throughput for the two workloads with a four-context processor.

Figure 7: Interleaved scheme processor utilization.

Table 7 summarizes the performance of the blocked and interleaved schemes. The interleaved scheme is able to increase throughput by 22% (geometric mean across the workloads) with only two contexts per processor. With four contexts the improvement is 50%. In contrast, the blocked scheme shows very small improvements in throughput, 3% with two contexts and 11% with four.

We have just shown the interleaved scheme to substantially outperform the blocked when several large jobs are multiprogrammed on a single workstation. While this is the situation found on the workstations in our research lab, many workstations run with one large job in the background which is timesharing the processor with the operating system, windowing system, and a number of smaller foreground jobs. Even though we have not explicitly modeled this workload here, interleaved multiple contexts obviously also benefit this environment. The smaller foreground jobs can be loaded and run on the processor without requiring the larger job to be switched out.
The response time of the windowing system can be improved if it does not require other jobs to be swapped before it can run. The operating system can also take advantage of the multiple contexts, especially with the trend towards microkernel operating systems, where much of the operating system functionality is encapsulated in separate processes. In addition to providing these advantages, multiple contexts allow background applications which suffer from significant memory latency to be written as parallel programs to take advantage of the latency tolerance. There are also a large number of applications which are designed to run on workstation clusters or small-scale multiprocessors that are already multithreaded and can take advantage of the multiple contexts on the processor. By providing a multiple-context processor that performs just as well with a single thread as the single-context processor and is able to show significant performance improvements with as few as two loaded hardware contexts, the interleaved scheme allows a workstation to be built that will appear significantly faster to the user.

5.2 Multiprocessor Results

Our simulated multiprocessor consists of a number of nodes connected together by a high-bandwidth, low-latency interconnect. Each node consists of a single processor, instruction and data cache, and a portion of the global memory. The caches are kept coherent using a distributed, directory-based protocol similar to that of the Stanford DASH multiprocessor [16]. Because the same primary cache sizes and parameters are used for the multiprocessor as for the uniprocessor, shared data communication will be the major contributor to the cache miss rate, and therefore the instruction cache was modeled as ideal (100% hit rate) and only a single level of data cache was simulated, as multi-level hierarchies do not help reduce the communication miss rate.
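For readers unfamiliar with directory-based coherence, the toy sketch below shows the basic invalidation mechanism; it is a textbook simplification of our own, not the DASH protocol itself:

    # Toy invalidation-based directory (a simplification, not DASH): each memory
    # line records which processors hold a cached copy, and a write must
    # invalidate every other sharer before ownership is granted.

    class DirectoryLine:
        def __init__(self):
            self.sharers = set()              # processor ids caching this line

        def read(self, proc):
            self.sharers.add(proc)            # the requester becomes a sharer

        def write(self, proc):
            to_invalidate = self.sharers - {proc}
            self.sharers = {proc}             # the requester is now the sole owner
            return to_invalidate              # caches that must drop their copies

    line = DirectoryLine()
    line.read(0); line.read(2)
    print(line.write(1))    # {0, 2}: the interprocessor communication misses
                            # that dominate the multiprocessor miss rate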

Table 8: Default memory latencies.
  Hit in Primary Cache       1 cycle
  Reply from Local Memory
  Reply from Remote Memory
  Reply from Remote Cache

Unloaded memory latencies are selected from a uniform distribution spanning the ranges given in Table 8 and are based on Stanford DASH latencies. Contention for the caches is modeled, which can increase these base latencies. While cache contention is modeled, the network and memories are modeled as contentionless to speed up simulation. Simplifying the network and memory system allows us to simulate larger problems, while still providing a sufficient model of the memory system behavior, as cache contention is likely to dominate network and memory contention [1].

The SPLASH suite of applications was used for our multiprocessor study. An overview of the seven SPLASH applications and their input sets is presented in Table 9. More information on the computational behavior and important data structures for each application can be found in [21]. For all SPLASH applications simulated for multiple time steps, we only gathered performance statistics after the first step was finished, since the behavior of the first step is often different from all other steps. For Cholesky and LocusRoute we only gathered statistics for the parallel sections of the code.

Table 9: SPLASH suite summary.
  Application   Language   Lines   Description                                  Input           Iterations
  Barnes-Hut    C          2700    hierarchical N-body gravitation simulation   4K particles    3 steps
  Cholesky      C          2000    Cholesky factorization of sparse matrices    BCSSTK23        NA
  LocusRoute    C          6400    routes wires in VLSI standard cell designs   Primary2.grin   NA
  MP3D          C          1500    simulates rarefied hypersonic flow           150K particles  4 steps
  Ocean         Fortran    3300    simulates eddy currents in an ocean basin    258x258 grid    3 steps
  PTHOR         C          9200    simulates digital logic circuits             NTT             20 cycles
  Water         C          1500    simulates water molecule interaction         256 molecules   2 steps

Speedups from adding multiple contexts to our base processor are shown in Table 10. On occasion, the best performance was encountered with fewer than the maximum number of hardware contexts, and the numbers presented in the table are for the application running on the optimum number of contexts.

Table 10: Application speedup due to multiple contexts, for the interleaved and blocked schemes with two, four, and eight contexts, across MP3D, Barnes, Water, Ocean, Locus, PTHOR, and Cholesky and their mean.

As expected, the performance gains due to multiple contexts are in general much larger in the multiprocessor environment, and only Cholesky shows no gains from multiple contexts. In addition, the interleaved scheme outperformed the blocked scheme for all applications when using four and eight contexts per processor, and for nearly all applications with two contexts per processor. The largest performance differences between the two schemes are exhibited for Barnes and Water (both applications have large amounts of instruction latency, mainly due to a large number of floating-point divides). The performance difference between the two schemes for all applications is substantial; in fact, with four contexts per processor, the interleaved scheme outperforms the eight-context blocked scheme for all applications except MP3D.

We show a breakdown of the multiple-context execution time for the blocked scheme in Figure 8 and for the interleaved scheme in Figure 9. In these graphs, execution time of the measured portion of the application is shown for one, two, four, and eight contexts per processor, normalized to the single-context time. This execution time is divided into six categories: busy (time spent active), instruction stall, split into short and long (time stalled due to pipeline dependencies), memory (time stalled on data cache misses), synchronization (time spent on interprocess synchronization), and context switch (time spent in switching overhead).
Pipeline dependencies of four or fewer cycles (four being the maximum stall due to a floating-point add/subtract/multiply result hazard) are labeled short, while all longer pipeline dependencies are labeled long. Again, the blocked scheme squanders more cycles in context switching than the interleaved scheme.

Figure 8: Application execution time breakdown for the blocked scheme.

Figure 9: Application execution time breakdown for the interleaved scheme.

Because memory latencies are so much larger in a multiprocessor, the effects of the higher switch cost are not as serious as they were in the uniprocessor environment, and the blocked scheme is still able to show reasonable application speedups. While the effect of the high switch cost is less serious for the multiprocessor, it still has a negative impact, allowing the interleaved scheme to outperform the blocked. In addition, while the blocked scheme is somewhat effective in tolerating longer pipeline dependencies, it cannot tolerate short pipeline dependencies, which accounted for 12% of the total single-context execution time when averaged across the SPLASH applications. In contrast, the interleaved scheme is able to tolerate both long and short pipeline dependencies, and as a result achieves much better processor utilizations for applications like Water that contain large amounts of pipeline dependencies.

6 Implementation Issues

We have just shown the interleaved scheme to outperform the blocked, and now need to compare the implementation costs of both approaches. Due to space limitations, the discussion must be brief; a more complete examination of the hardware requirements can be found in [14]. The first requirement of all multiple-context processors is that the cache be capable of handling multiple outstanding memory requests. These lockup-free caches [12] are more expensive than standard blocking caches, and represent a substantial portion of the extra complexity involved in building any multiple-context processor [14].
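One common way to build such a lockup-free cache, sketched below with our own structure and names, is a small file of miss-status holding registers (MSHRs); the paper does not prescribe a particular organization, and reference [12] discusses the design space in detail.

    # Sketch of MSHR-based miss tracking (an illustrative organization only).
    # Cache hits proceed normally while the listed misses are outstanding.

    LINE_SIZE = 32

    class MissStatusHoldingRegisters:
        def __init__(self, entries=4):
            self.entries = entries
            self.pending = {}                 # line address -> contexts waiting on it

        def record_miss(self, addr, context):
            line = addr - addr % LINE_SIZE
            if line in self.pending:          # secondary miss: merge with the first
                self.pending[line].append(context)
                return "merged"
            if len(self.pending) == self.entries:
                return "stall"                # no free MSHR: the cache must block
            self.pending[line] = [context]    # primary miss: send the request out
            return "issued"

        def fill(self, line):
            return self.pending.pop(line, []) # contexts to reactivate on data return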

Because of the widening gulf between processor and memory speeds, off-the-shelf microprocessors have started to provide lockup-free caches [24]. Future generations of processors are likely to include even more extensive support for multiple outstanding requests. Thus, the component which causes the largest complexity increase for multiple-context processors will exist in future microprocessors, and we can focus on the remaining implementation issues. These issues include instruction issue from multiple streams, control for the multiple contexts, and replication of key per-process state.

The blocked scheme issues instructions from only a single stream at any given time, and its issue logic is fairly similar to that of a single-context processor. The major costs of the blocked processor are in replicating the per-process state (the program counter, the process-specific portion of the processor status word, and the register file) and providing the control to switch between that state at the point of the context switch. In contrast, the interleaved multiple-context processor needs to issue instructions from multiple independent streams concurrently. This requires that the instruction issue logic also be replicated, and that each instruction in the pipeline be tagged with the context that it issued from. This context identifier (CID) is then used by the pipeline to determine which state to access (e.g., for the register file access, in determining TLB hits, etc.). This additional complexity in the interleaved instruction issuing logic is the largest difference between the two schemes, and we end this section by exploring possible implementations for both the blocked and interleaved program counter (PC) units for our simulated pipeline. We first present the single-context PC unit, and then examine the additional complexity of the blocked and interleaved PC units.

6.1 Single-context Program Counter Unit

The PC unit for our single-context processor is shown in Figure 10. The rectangles in the diagram represent registers; all registers have clock enable capability. The clock enable and tristate control are not shown. On any given cycle, one of several sources drives the PC bus. The possible PC sources are: (a) the old PC value plus the instruction size (normal sequential flow), (b) the Branch Target Buffer (predicted branch), (c) the computed branch target (mis- or unpredicted branch), (d) the exception vector, or (e) the EPC register (restore from an exception).

Figure 10: Single-context processor PC unit.
The exception vector and EPC register provide the ability to take and recover from exceptions. During normal execution, as each instruction retires the address of that instruction is loaded into the EPC register from the PC chain. When an exception occurs, the loading of the EPC register is stopped with the guilty instruction, and it and the following instructions in the pipeline are squashed (marked to not update any state). The exception vector is then placed on the PC bus, and the handler starts executing. The EPC is connected to the result bus to allow the exception handler to save and restore the EPC manually. When the exception has been handled, the EPC is forced onto the PC bus via an ERET (exception return) instruction, and execution continues from the point at which it left off. Note that because we have removed the MIPS branch delay slot, only a single EPC is needed. Supporting multiple contexts on a processor with branch delay slots is discussed in [15].

6.2 Blocked Program Counter Unit

The ability of the EPC to save an instruction address for later repeat is exactly the same functionality needed to correctly save the program counter after a context switch on the blocked processor. To be able to support multiple contexts, we simply need to replicate this functionality for each context. A PC unit capable of supporting two contexts is shown in Figure 11. This PC unit is very similar to that of a single-context processor, with the only difference being a modification to the EPC register in order to support the two contexts. This modification adds an EPC register per context, which doubles as both the exception PC register and the context restart register (which contains the saved PC for that context).

Figure 11: Two-context blocked processor PC unit.

Exceptions continue to use the EPC register in the same manner as the single-context processor. The EPC register for the active context is continually being updated during normal operation, while the EPC for the idle context remains unchanged. When an exception occurs, the EPC stops being updated and the exception vector is driven onto the bus. The behavior of the PC unit for a context switch is very similar to that for an exception. When a context needs to be switched, the context switch is delayed until the normal exception point, at which time the EPC register for the blocked context stops loading as if an exception had occurred. The partially completed instructions in the pipeline are squashed and the next context is selected. The EPC register of the next context is driven onto the PC bus, and the new context starts execution with the instruction that caused its previous context switch. (Note that this sharing of the EPC between exception handling and context switching requires context switching to be disabled upon entry into the exception handler until the EPC can be saved.)

6.3 Interleaved Program Counter Unit

We have just shown that the changes to the blocked PC unit to handle multiple contexts are fairly straightforward. We now examine the changes required for the interleaved PC unit.

Figure 12: Two-context interleaved processor PC unit.

The PC unit must be able to determine the next instruction to be issued from each context. Determining this next instruction has become somewhat complicated under the interleaved scheme because the new PC value for a context becomes available a specific number of pipeline stages after issue and must be held until the context becomes active again. The delay between when an instruction issues and when the new PC value becomes available depends on the type of instruction that was issued. The new PC value is not known until the end of the EX cycle for a branch which was not predicted correctly, while for nonbranch instructions and correctly predicted branches the new PC value is available at the end of IF1. Because of the context interleaving, the context may not be able to drive its new PC value onto the PC bus immediately when it becomes available, and holding registers must be provided until the context is selected to drive the PC bus. The three PC sources (sequential, predicted branch, computed branch) can be multiplexed into a single holding register per context, as shown in Figure 12. A bit is associated with each holding register to signify whether the register holds a computed branch which was loaded as a result of a previous branch being mispredicted. This bit is used to signal that the BTB needs to be updated to reflect the new branch prediction when the holding register is driving the PC bus. This next PC register (NPC) is loaded by one of the following sources (in order of decreasing priority):

- The computed branch target, if the instruction in EX is a mispredicted branch from this context.
- The predicted branch target, if the current PC is from this context and hit in the BTB.
- The sequential address, if the current PC is from this context.
- The NPC itself, causing the register to maintain its current value.

The computed branch has priority over all other sources if a previous branch was mispredicted, because this is guaranteed to be the correct path of instruction flow. Note that the determination of the mispredicted branch can actually occur before the predicted branch address has been issued to the PC bus, due to the context interleaving. If this occurs, the branch will only cost a single cycle, even though it was mispredicted. However, if a mispredicted branch was issued, all incorrectly fetched instructions in the pipeline from the context must be squashed when the misprediction is detected. This is accomplished by sending a branch squash signal coupled with a branch squash CID. Instructions following the branch in the pipeline with the same CID as the squash CID are then marked to not update any state. In addition to squashing the incorrectly fetched instructions, the computed branch is loaded into the NPC and the mispredicted status bit is set. On the next instruction issue from this context, the NPC will be driven onto the PC bus and the BTB updated.

The interleaved PC unit must also be able to handle context availability changes. This can be accomplished in a manner similar to that of the blocked scheme. For example, a change in context availability due to a cache miss requires that once the miss is detected, the issuing of instructions from that context is stopped and all partially completed instructions in the pipeline are squashed. Squashing of already issued instructions can be accomplished by providing a conditional squash signal coupled with a squash CID. All instructions in the pipeline that match the CID will be marked to not affect any state. The PC value of the instruction causing the miss is loaded into the EPC and the EPC valid bit is set. When the context eventually becomes available again, the EPC of that context will be driven onto the PC bus when the context issues its first instruction. The context will then start reexecuting from the instruction which caused the context to become unavailable. Handling exceptions on the interleaved processor is more complicated than on the blocked processor due to the execution of the multiple threads, and is discussed in detail in [14].

While the interleaved PC unit is more complex than the blocked, this complexity is not overwhelming, especially when compared with the instruction issue complexity of recent dynamic superscalar processors [24]. For a manageable increase in complexity, the interleaved processor provides a significant increase in performance.
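The NPC priority just listed can be written out directly. The sketch below is our own straight-line rendering of the selection implemented by the mux and priority logic of Figure 12; the argument names are hypothetical, not signal names from the paper.

    # Per-context next-PC (NPC) selection, following the priority list above.

    INSTRUCTION_SIZE = 4

    def next_npc(npc, this_context, ex_context, ex_mispredicted, computed_target,
                 pc_context, current_pc, btb_hit, btb_target):
        if ex_context == this_context and ex_mispredicted:
            return computed_target                 # highest priority: corrected path
        if pc_context == this_context:
            if btb_hit:
                return btb_target                  # correctly predicted taken branch
            return current_pc + INSTRUCTION_SIZE   # normal sequential flow
        return npc                                 # not our cycle: hold the value

    # Example: context 0's branch, now in EX, was mispredicted, so its NPC captures
    # the computed target even though context 1 currently owns the PC bus.
    print(hex(next_npc(0x400, 0, 0, True, 0x1a0, 1, 0x500, False, 0)))   # 0x1a0

When this context next wins the PC bus, the NPC value is driven onto it and, if the mispredicted bit is set, the BTB is updated, exactly as described above.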


Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Professor Alvin R. Lebeck Computer Science 220 Fall 1999 Exam Average 76 90-100 4 80-89 3 70-79 3 60-69 5 < 60 1 Admin

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline?

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline? 1. Imagine we have a non-pipelined processor running at 1MHz and want to run a program with 1000 instructions. a) How much time would it take to execute the program? 1 instruction per cycle. 1MHz clock

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

A Fine-Grain Multithreading Superscalar Architecture

A Fine-Grain Multithreading Superscalar Architecture A Fine-Grain Multithreading Superscalar Architecture Mat Loikkanen and Nader Bagherzadeh Department of Electrical and Computer Engineering University of California, Irvine loik, nader@ece.uci.edu Abstract

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

EE382 Processor Design. Processor Issues for MP

EE382 Processor Design. Processor Issues for MP EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Multiprocessor Support

Multiprocessor Support CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch branch taken Revisiting Branch Hazard Solutions Stall Predict Not Taken Predict Taken Branch Delay Slot Branch I+1 I+2 I+3 Predict Not Taken branch not taken Branch I+1 IF (bubble) (bubble) (bubble) (bubble)

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

EECS 322 Computer Architecture Superpipline and the Cache

EECS 322 Computer Architecture Superpipline and the Cache EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:

More information

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43 The CPU Pipeline 3 This chapter describes the basic operation of the CPU pipeline, which includes descriptions of the delay instructions (instructions that follow a branch or load instruction in the pipeline),

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory

Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory Daniel J. Scales and Kourosh Gharachorloo Western Research Laboratory Digital Equipment Corporation Chandramohan Thekkath

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic

An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic To appear in Parallel Architectures and Languages Europe (PARLE), July 1994 An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic Håkan Nilsson and Per Stenström Department

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level

More information

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu A Chip Multiprocessor Implementation

More information

Static Branch Prediction

Static Branch Prediction Static Branch Prediction Branch prediction schemes can be classified into static and dynamic schemes. Static methods are usually carried out by the compiler. They are static because the prediction is already

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance:

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance: #1 Lec # 9 Winter 2003 1-21-2004 Classification Steady-State Cache Misses: The Three C s of cache Misses: Compulsory Misses Capacity Misses Conflict Misses Techniques To Improve Cache Performance: Reduce

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Module 4c: Pipelining

Module 4c: Pipelining Module 4c: Pipelining R E F E R E N C E S : S T A L L I N G S, C O M P U T E R O R G A N I Z A T I O N A N D A R C H I T E C T U R E M O R R I S M A N O, C O M P U T E R O R G A N I Z A T I O N A N D A

More information

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş Evolution of Computers & Microprocessors Dr. Cahit Karakuş Evolution of Computers First generation (1939-1954) - vacuum tube IBM 650, 1954 Evolution of Computers Second generation (1954-1959) - transistor

More information

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Quantitative study of data caches on a multistreamed architecture. Abstract

Quantitative study of data caches on a multistreamed architecture. Abstract Quantitative study of data caches on a multistreamed architecture Mario Nemirovsky University of California, Santa Barbara mario@ece.ucsb.edu Abstract Wayne Yamamoto Sun Microsystems, Inc. wayne.yamamoto@sun.com

More information