Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations

James Laudon, Anoop Gupta, and Mark Horowitz
Computer Systems Laboratory, Stanford University, Stanford, CA
(James Laudon is currently at Silicon Graphics, 2011 N. Shoreline Blvd., Mountain View, CA.)

Abstract

There is an increasing trend to use commodity microprocessors as the compute engines in large-scale multiprocessors. However, given that the majority of the microprocessors are sold in the workstation market, not in the multiprocessor market, it is only natural that architectural features that benefit only multiprocessors are less likely to be adopted in commodity microprocessors. In this paper, we explore multiple-context processors, an architectural technique proposed to hide the large memory latency in multiprocessors. We show that while current multiple-context designs work reasonably well for multiprocessors, they are ineffective in hiding the much shorter uniprocessor latencies using the limited parallelism found in workstation environments. We propose an alternative design that combines the best features of two existing approaches, and present simulation results that show it yields better performance for both multiprogrammed workloads on a workstation and parallel applications on a multiprocessor. By addressing the needs of the workstation environment, our proposal makes multiple contexts more attractive for commodity microprocessors.

1 Introduction

Large-scale multiprocessors, such as the one shown in Figure 1, are increasingly built using commodity microprocessors [2, 16]. While these commodity microprocessors provide a relatively low-cost compute node, their performance depends heavily on employing a sophisticated cache hierarchy to insulate the processor from the long remote memory latency. Providing the ability to cache shared data [16] can greatly increase the amount of computation that can be done before requiring a long-latency operation; however, it cannot remove the long-latency operations completely. To address the performance loss associated with remote cache misses, several latency tolerating schemes have been proposed, including relaxed memory consistency models [7], prefetching [17], and multiple-context processors [13, 22, 26]. Recent studies [8, 13] have shown multiple contexts to be a promising way to address the problem; this paper focuses on the multiple-context solution.

Figure 1: General structure of a scalable shared-memory multiprocessor.

Multiple contexts tolerate latency by overlapping the long-latency operations of one context with the execution of other contexts. Multiple contexts are a universal latency tolerance mechanism: any latency can be tolerated as long as (a) enough parallelism is available and (b) the cost to switch between contexts is substantially lower than the latency to be tolerated. Points (a) and (b) both must be addressed for optimal multiple-context performance. By reducing the number of contexts required for latency tolerance, the number of workloads that can benefit from multiple contexts is increased. By lowering the switch cost, the class of stalls the multiple-context processor can tolerate is broadened. This second point is important because processors encounter a wide range of stall latencies, from the short latencies caused by pipeline dependencies between instructions and primary cache misses which hit in the secondary cache, to the much longer latencies of local and remote memory accesses and interprocess synchronization.
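As a rough illustration of points (a) and (b), the sketch below uses a simple deterministic model that is not taken from the paper: every context runs for R useful cycles, stalls for L cycles, and a switch costs C cycles. Utilization grows with the number of contexts until it saturates at R / (R + C), so lowering the switch cost both raises that ceiling and reduces how many contexts are needed to reach it.

    # Back-of-the-envelope model of multiple-context latency tolerance
    # (an illustrative simplification, not a model used in the paper).

    def utilization(n_contexts, run_length, latency, switch_cost):
        unsaturated = n_contexts * run_length / (run_length + latency)
        saturated = run_length / (run_length + switch_cost)
        return min(unsaturated, saturated)

    # Point (a): a long remote-memory latency needs enough contexts to cover it.
    print(utilization(4, run_length=20, latency=100, switch_cost=7))  # about 0.67
    # Point (b): for short stalls (a secondary-cache hit, say), a 7-cycle switch
    # cost caps the benefit, while a 1-cycle switch cost leaves little to lose.
    print(utilization(4, run_length=20, latency=9, switch_cost=7))    # 20/27, about 0.74
    print(utilization(4, run_length=20, latency=9, switch_cost=1))    # 20/21, about 0.95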
Most existing multiple-context designs have targeted large-scale multiprocessors. Many applications running in this environment have substantial parallelism, and application performance is often dominated by the large remote memory latency [8, 13]. In contrast, high-performance commodity microprocessors primarily target the workstation environment. Parallelism is less abundant in workstation workloads, and may consist of running a large application in the background while editing, reading mail, video-conferencing, or stressing sophisticated graphical user interfaces in the foreground. In addition, uniprocessor memory latencies tend to be shorter than in a multiprocessor, because memory can be more tightly coupled to the processor. Workstation cache hierarchies are also more effective because there are no misses caused by interprocessor communication. Because of their emphasis on multiprocessors, existing multiple-context architectures either require too much parallelism or are simply unable to handle the shorter latencies prevalent in workstations. We believe that until multiple contexts address the needs of workstations, they simply will not be incorporated in mainstream microprocessors.

Therefore, this paper focuses on developing a multiple-context architecture suitable for the workstation environment. Such a processor must satisfy several constraints. First, because the amount of additional parallelism available in the workload is modest, effective latency tolerance must be possible with a small number of contexts. Second, because certain jobs are higher priority and require the shortest time to completion, the single-threaded performance of the multiple-context processor must be the same as that of a comparable single-context processor. Third, the cost to switch contexts must be low enough to tolerate short stalls such as those caused by pipeline dependencies and primary cache misses that hit in the secondary cache. Finally, the implementation cost of adding multiple contexts to the processor must be reasonable, or the implementation complexity budget of the processor will instead be spent on competing features.

This paper is organized as follows. In Section 2, we describe how previous multiple-context proposals do not address all these constraints. Section 3 presents our proposal for a new multiple-context processor. Section 4 describes the experimental methodology, and Section 5 shows that our proposed scheme substantially outperforms the existing approaches for both workstations and multiprocessors. In Section 6, we outline the implementation costs of our proposal, and in Section 7 we discuss future trends and recent related work. Finally, we conclude in Section 8.

2 Previous Work

Multiple-context processors employed in commercial and research machines have used one of two approaches: either fine-grained or blocked. In this section we briefly examine each of these approaches.

2.1 Fine-grained Multiple-Context Processors

Fine-grained processors, exemplified by the Denelcor HEP [22], switch contexts each cycle, effectively making the switch cost zero. Because of this low switch cost the processor is able to tolerate the latency of both pipeline dependencies and memory references. Unfortunately, machines based on these fine-grained processors (a) have not supported data caches and (b) have limited each context to only a single instruction active in the pipeline (i.e., the processor has no pipeline interlocks). While these constraints allow the pipeline design to be simplified, they place two onerous burdens on applications. First, a large number of threads is necessary to fully utilize the processor, enough to both fill the pipeline and hide the memory latency. Second, the performance of a single thread is extremely poor, as each thread can issue a new instruction at best every pipeline-depth cycles. Therefore, any serial portion of an application can greatly impact the overall application performance. The limitations of fine-grained processors are quite severe (especially for workstation workloads), and most recent designs have instead focused on the blocked approach.

2.2 Blocked Multiple-Context Processors

Blocked multiple-context processors, exemplified by Weber and Gupta [26] and the MIT APRIL [13], share the processor between a number of contexts; however, a single context utilizes all of the processor resources until it reaches a long-latency operation, such as a cache miss, at which point the processor switches to another context. Blocked multiple-context processors address the poor single-thread performance and the need for a large number of contexts of the fine-grained schemes, but they do so at the expense of increasing the context switch cost.
This is because the decision to switch contexts depends on determining whether a cache miss occurred, and this determination is made late in the pipeline. Without extensive modifications to the processor, the cost to switch contexts will be close to the depth of the pipeline, as the partially-executed instructions from the switching context will need to be flushed from the pipeline. In an attempt to reduce this switch cost, a few blocked architectures have been proposed which replicate the pipeline registers [18, 19]. With this pipeline register replication, the context-switch cost could be as low as a single cycle (at least one cycle is needed to broadcast the switch decision to the entire chip for use by the TLB, pipeline forwarding logic, etc.). Unfortunately, replicating the pipeline registers results in a substantial increase in pipeline size, as latches which hold pipeline state are a significant fraction of the total pipeline area. In addition, the outputs of these replicated latches need to be multiplexed before being sent to the combinational portion of the pipeline. When these multiplexor delays are combined with the higher fanout and longer wire delays resulting from the area increase, it is difficult to imagine that the cycle time of the processor will not be significantly impacted. Instead of trying to reduce the switch cost by brute force, a better architecture can be developed by reexamining the fine-grained approach.

3 Interleaved Multiple-Context Proposal

The primary problems with fine-grained processors, the need for large numbers of active contexts and the poor single-thread performance, are not intrinsic limitations of cycle-by-cycle context switching. Instead, two decisions incorporated into fine-grained processors are the culprits. The first is the decision not to support data caching, which makes every memory reference a long-latency operation. The second is the decision not to provide hardware interlocks, which prevents a context from having more than one instruction in the pipeline, thereby increasing the minimum latency of each instruction to the pipeline depth.

The basic idea behind our proposal, the interleaved multiple-context processor, is simple. By adding both caching and full pipeline interlocks to the fine-grained scheme, it becomes possible to design a multiple-context processor which interleaves contexts on a cycle-by-cycle basis, yet effectively supports a single context. More specifically, issuing of instructions is switched each cycle between available contexts in a round-robin fashion. Contexts become unavailable when they encounter a long-latency operation, and are made available again when the long-latency operation completes. When a context becomes unavailable (an operation analogous to the context switch of the blocked scheme), the processor only squashes those instructions in the pipeline from the context becoming unavailable.

The advantage of this selective instruction squashing is illustrated in Figure 2 for a processor with four active contexts (labeled A-D) and a MIPS R4000-like pipeline (the pipeline we will be using in our simulations).

Figure 2: Example illustrating the lower context switch cost of the interleaved scheme.

In this figure, context A has encountered a cache miss, and will need to be made unavailable. Unfortunately, the cache access occurs late in the pipeline and the context switch determination cannot be made until the WB stage. Because the blocked scheme will need to squash all instructions in the pipeline (including the instruction which caused the cache miss) before it can start the next context, a context switch costs seven cycles on the pipeline shown. However, for the interleaved scheme only the instructions from context A need to be squashed, reducing the overhead to handle the cache miss to two cycles. This reduction in switch cost, from roughly the full pipeline depth for the blocked scheme to only the number of in-flight instructions from the unavailable context for the interleaved processor, allows the interleaved scheme to effectively tolerate much shorter stall latencies.

In addition to lowering the switch cost, the cycle-by-cycle interleaving spaces out instructions from the same context. Like the pipeline of a single-context processor, the interleaved processor stalls when an instruction requires a result from a previous instruction which has not yet been computed. However, because there are no dependencies between instructions from different contexts, if sufficient instructions from other contexts are interleaved between dependent instructions from the same context, the stall due to the pipeline dependency can be avoided. (Of course, the compiler will still need to schedule code to minimize stalls, as it is possible that only a single context may be executing. The instruction-set architecture best suited for the interleaved processor also has no compiler-filled branch or load delay slots, which is the trend for modern architectures [6].)

To illustrate these two advantages of the interleaved scheme more concretely, we show the execution of four threads for both the blocked and interleaved schemes in Figure 3. The four threads all end with their final instruction causing a cache miss and are:

A: two instructions long.
B: three instructions long, with a two-cycle pipeline dependency between the first and second instructions.
C: four instructions long.
D: six instructions long.

Figure 3: Comparison of the blocked and interleaved multiple-context schemes for a set of four threads.

The blocked scheme is shown on the left timeline. Context A starts executing, issuing its two instructions, the second of which causes a cache miss. The pipeline must be flushed at this point before context B can execute, as shown below the timeline. Context B then executes one instruction, stalls due to the pipeline dependency, and then executes until it encounters its cache miss, at which point the pipeline is flushed and C starts executing, and so on. The interleaved scheme executing the same set of threads is shown on the right timeline. The processor starts with all four contexts being interleaved. As we can see, this interleaving is enough to separate the dependent instructions from context B, completely hiding the pipeline dependency. The lower switch cost of the interleaved scheme is also illustrated in Figure 3; the switch cost associated with a cache miss is reduced from the seven cycles of the blocked scheme to two cycles for context A and three cycles for contexts B and C. Note that as contexts are made unavailable, the number of contexts being interleaved on the pipeline decreases, until we reach the point where only context D is being interleaved (and the processor is now behaving like a single-context processor). As a result of the lower switch cost and pipeline dependency tolerance, the interleaved scheme was able to complete all four threads well before the blocked scheme.
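The sketch below replays this four-thread example with a simplified cycle-level model written only for illustration; it is not the simulator described in Section 4. The thread lengths, the seven-stage pipeline, and the fact that a miss is recognized at WB are taken from the example above, so the absolute cycle counts are indicative only.

    # Illustrative sketch of the Figure 3 example (not the paper's simulator).
    # Each thread is a list of (dependency_stall, causes_cache_miss) instructions;
    # every thread's last instruction misses in the cache.

    PIPE_DEPTH = 7   # IF1 IF2 RF EX DF1 DF2 WB; a miss is recognized at WB

    THREADS = {
        "A": [(0, False), (0, True)],
        "B": [(0, False), (2, False), (0, True)],   # 2-cycle dependency, 1st -> 2nd
        "C": [(0, False), (0, False), (0, False), (0, True)],
        "D": [(0, False)] * 5 + [(0, True)],
    }

    def blocked():
        """One context owns the pipeline; a miss flushes it before the next starts."""
        cycle, finish = 0, 0
        for instrs in THREADS.values():
            last = None
            for stall, _ in instrs:
                if last is not None:
                    cycle = max(cycle, last + 1 + stall)   # pipeline dependency bubble
                last, cycle = cycle, cycle + 1
            finish = last + 1
            cycle = last + PIPE_DEPTH                      # wait for the miss/flush
        return finish

    def interleaved():
        """Round-robin issue, one instruction per cycle, from contexts with work left.
        Because each thread ends with its miss, a missed context simply has nothing
        left to issue while the other contexts keep going."""
        names = list(THREADS)
        pc = {n: 0 for n in names}
        last = {n: None for n in names}
        cycle, rr = 0, 0
        while any(pc[n] < len(THREADS[n]) for n in names):
            for k in range(len(names)):
                n = names[(rr + k) % len(names)]
                if pc[n] < len(THREADS[n]):
                    rr = (names.index(n) + 1) % len(names)
                    stall, _ = THREADS[n][pc[n]]
                    if last[n] is None or cycle >= last[n] + 1 + stall:
                        last[n], pc[n] = cycle, pc[n] + 1  # issue; otherwise a bubble
                    break
            cycle += 1
        return cycle

    print("blocked:", blocked(), "cycles   interleaved:", interleaved(), "cycles")

Run as written, the interleaved version finishes all four threads in well under half the cycles of the blocked one, matching the qualitative picture in Figure 3.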
We have just qualitatively argued that the interleaved scheme should outperform the blocked, and we now need to quantify that gain for both the uniprocessor and multiprocessor environments. Evaluating the performance of the interleaved processor requires a simulator that supports accurate pipeline and memory system modeling. We also need the code to be correctly scheduled for our pipeline. The next section briefly discusses this simulation environment.

4 Evaluation Methodology

We begin by describing our base workstation architecture. We then outline the simulation environment and multiprogramming workloads studied. Both the uniprocessor and multiprocessor evaluations use the same simulation environment; however, the multiprocessor study examines a different system architecture and application suite. These differences for the multiprocessor study will be discussed in Section 5.2 before presenting the multiprocessor performance results.

4.1 Base Architecture

Figure 4: Base architecture.

The system architecture models a high-end workstation, as shown in Figure 4. The cache parameters are given in Table 1; all caches are direct-mapped.

The memory system is four-way interleaved, connected to the processor across a high-speed, split-transaction bus. The unloaded memory access times are given in Table 2. Cache and memory contention are modeled, and can add to these latencies. While the data cache is lockup-free [12], the instruction cache is blocking and no context switching will be done for instruction cache misses. We have deliberately modeled an aggressive memory system, because its short memory latencies are difficult for multiple contexts to tolerate.

Table 1: Cache parameters.
  Parameter              Primary Data   Primary Inst   Secondary
  Size                   64 Kbytes      64 Kbytes      1 Mbyte
  Line Size              32 bytes       32 bytes       32 bytes
  Fetch Size             1 line         2 lines        1-2 lines
  Read Occupancy         1 cycle        1 cycle        2 cycles
  Write Occupancy        1 cycle        NA             2 cycles
  Invalidate Occupancy   2 cycles       NA             4 cycles
  Cache Fill Occupancy   1 cycle        8 cycles       2 cycles

Table 2: Memory latencies.
  Hit in Primary Cache     1 cycle
  Hit in Secondary Cache   9 cycles
  Reply from Memory        34 cycles

The processor architecture was selected to be representative of current, high-end RISC microprocessors. It executes the MIPS II instruction set [9], except that the delayed branches of the MIPS architecture have been removed. Delayed branches are an artifact of the first-generation RISC processors that do not extend well into future generations. The integer pipeline of the processor is based on the MIPS R4000 [9], but is slightly more aggressive. As Figure 5 shows, the integer pipeline modeled is seven stages deep, one less than the R4000. The R4000 has a separate Tag Check stage between DF2 and WB, which has been folded into the DF2 stage for our processor. The floating-point pipeline is based on the DEC Alpha [6], and is nine stages deep. Both pipelines forward results whenever possible to reduce operation latency. The arrows above the pipelines in Figure 5 denote possible result forwarding paths.

Figure 5: Processor pipeline (integer: IF1 IF2 RF EX DF1 DF2 WB; floating point: IF1 IF2 RF EX1 EX2 EX3 EX4 EX5 WB).

Because the ALU provides result bypassing, most operations have an effective latency of a single cycle. However, a small number of operations require longer to complete. Load operations are followed by two delay slots; the load result is not available until the end of DF2 for forwarding to the following EX phase. Branches have an even longer latency. Since the branch condition is evaluated in the EX phase, taken branches could potentially cost four cycles; however, a 2048-entry direct-mapped branch target buffer (BTB) is used to reduce the branch penalty to zero for a correctly predicted branch (mispredicted branches still pay a three-cycle penalty).

The latency and issue rate of operations which take greater than a single cycle are given in Table 3. For floating-point division, operation on single-precision numbers is faster than for double-precision, and the single-precision numbers are shown in parentheses in the table.

Table 3: Long-latency operations.
  Operation                          Issue     Latency
  Integer Divide
  Integer Multiply
  Shift                              1         2
  Load                               1         3
  Floating-point Add/Sub/Conv/Mult   1         5
  Floating-point Divide              61 (31)   61 (31)

4.2 Compiler and Simulation Environment

In order to accurately model the effects of latency on processor performance, a number of issues must be addressed. First, the code needs to be highly optimized and scheduled for the target system pipeline. Second, the simulator needs to ensure that a correct interleaving of processes is used. Finally, the pipeline and memory system simulator must accurately model the real processor and system architecture.
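As a concrete reading of Tables 1 and 2, the following sketch (our own simplification, ignoring occupancy, contention, and everything about fills beyond simple tag allocation) returns the unloaded latency of a load through the two-level, direct-mapped hierarchy:

    # Minimal model of the direct-mapped hierarchy of Tables 1 and 2
    # (an illustrative simplification, not the simulator described below).

    LINE_SIZE = 32

    class DirectMappedCache:
        def __init__(self, size_bytes):
            self.num_lines = size_bytes // LINE_SIZE
            self.tags = [None] * self.num_lines

        def access(self, addr):
            index = (addr // LINE_SIZE) % self.num_lines
            tag = addr // (LINE_SIZE * self.num_lines)
            hit = self.tags[index] == tag
            self.tags[index] = tag            # allocate the line on a miss
            return hit

    primary = DirectMappedCache(64 * 1024)    # primary data cache
    secondary = DirectMappedCache(1024 * 1024)

    def load_latency(addr):
        if primary.access(addr):
            return 1                          # hit in primary cache
        if secondary.access(addr):
            return 9                          # hit in secondary cache
        return 34                             # reply from memory

    print(load_latency(0x1000), load_latency(0x1000))   # 34 then 1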
Our compilation and simulation environment addresses all these issues. To ensure that application performance was not affected by suboptimal compilation, we compiled all applications using recent MIPS CC and F77 compilers (version 2.1) with the -O2 level of optimization. To schedule the code properly for our target pipeline, the MIPS compilers produced assembly files which were then run through the Twine scheduler [23], an aggressive global scheduler that is part of the Stanford SUIF compiler. Twine takes in an instruction set parameter file, which allows the latency, functional unit usage, and ability to cause exceptions to be specified for each instruction. Twine scheduled the optimized code for our pipeline, leaving the final two issues.

We have developed a detailed processor and system simulator that interfaces to Tango-Lite [5]. Tango-Lite is a simulation package that allows execution-driven simulation of parallel programs on uniprocessors. Our simulator accepts basic block and memory reference addresses from Tango-Lite. The basic block addresses index into the properly scheduled object file and are used to generate register and functional unit usage information on the instructions in that basic block. This usage information is then passed to the pipeline simulator. The pipeline simulator models all major pipeline dependencies, including load, execution result, execution issue, and control-transfer hazards. These hazards are tracked in the simulator through a scoreboard which maintains information on the functional unit and register usage of all operations in progress. Instructions are not allowed to progress to the execute stage until the desired functional unit is available and all register dependencies (true, anti-, and output) are satisfied. Finally, if the instruction is a load or store, the proper address and reference type is sent to the memory system simulator.
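A minimal sketch of that scoreboard check follows (field and method names are ours; the real simulator also tracks result timing and functional-unit occupancy):

    # Sketch of the hazard check described above: an instruction may enter EX
    # only when its functional unit is free and no in-flight operation conflicts
    # with its registers.

    class Scoreboard:
        def __init__(self):
            self.in_flight = []          # (functional_unit, reads, writes) of active ops

        def can_issue_to_ex(self, unit, reads, writes):
            for busy_unit, busy_reads, busy_writes in self.in_flight:
                if unit == busy_unit:                    # structural hazard
                    return False
                if busy_writes & set(reads):             # true dependency (RAW)
                    return False
                if busy_reads & set(writes):             # anti-dependency (WAR)
                    return False
                if busy_writes & set(writes):            # output dependency (WAW)
                    return False
            return True

        def issue(self, unit, reads, writes):
            self.in_flight.append((unit, set(reads), set(writes)))

        def retire(self, unit, reads, writes):
            self.in_flight.remove((unit, set(reads), set(writes)))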

In addition to modeling pipeline dependencies, the processor simulator also handles the multiplexing of the multiple contexts on the processor. For the blocked scheme, contexts are switched whenever a cache miss occurs. In addition, the blocked processor is assumed to have an explicit context switch instruction, for use in tolerating latencies other than cache misses. The interleaved scheme switches contexts each cycle, so its context switch cost is zero. However, making a context unavailable does cost cycles (equal to the number of instructions in the pipeline from that context), and this will be referred to as the switch cost of the interleaved scheme. The interleaved processor has the ability to issue a backoff instruction to tolerate latencies other than those due to cache misses. The backoff instruction causes a context to become unavailable for a number of cycles specified by the instruction and is analogous to the explicit switch of the blocked scheme. The context switch costs for both schemes are listed in Table 4, with the exact cost for the interleaved scheme depending on the dynamic context interleaving. The cost for the explicit switch and backoff instructions is smaller than for cache misses, as a switch can be triggered immediately upon decoding the instruction.

Table 4: Context switch costs (cycles).
  Switch Cause                  Blocked   Interleaved
  Cache Miss
  Explicit switch instruction   3         NA
  Backoff instruction           NA        1-3

4.3 Application Workloads

To model workstation usage, we generated six workloads using members of the Spec89 suite. The programs selected include Doduc, Eqntott, Li, Matrix300, Tomcatv, and NASA7, with NASA7 being broken into its seven kernels: Btrix, Cholsky, Cfft2d, Emit, Gmtry, Mxm, and Vpenta. (These applications were selected from the Spec89 suite solely on the basis of being able to pass through our compilation system with no or only minor adjustments to the application itself.) The six workloads were constructed to have the following characteristics: IC stresses the instruction cache, DC stresses the data cache, DT stresses the data TLB, FP is floating-point intensive, and R0 and R1 are random workloads. A seventh workload (SP) consists of uniprocessor versions of four SPLASH [21] applications (using the input data sets given in Section 5.2). These seven workloads are listed in Table 5.

Table 5: Uniprocessor workloads.
  IC: Doduc, Li, Eqntott, Mxm
  DC: Cfft2d, Gmtry, Tomcatv, Vpenta
  DT: Btrix, Cholsky, Gmtry, Vpenta
  FP: Emit, Cholsky, Doduc, Matrix300
  R0: Emit, Btrix, Cfft2d, Eqntott
  R1: Mxm, Li, Matrix300, Tomcatv
  SP: MP3D, Water, Locus, Barnes

Since many of the applications take several minutes of CPU time without simulation, and our simulator slows down the applications over a thousandfold, we were not able to run the workloads to completion. Instead, each workload was run for 36 time-slices (roughly 1 second of CPU time). Because we only simulated a fraction of the complete application, it was important that we were simulating the section of the application responsible for most of the execution time for the complete run. This was ensured by not generating references to the simulator until the initialization phase of the applications in the workloads had been completed. In addition, to remove cold-start effects, each application in the workload was run for a time slice before simulation statistics were gathered. Thus, when simulation statistics were being gathered, the applications had completed initialization and the caches were loaded.

A simple model of the operating system was employed for this study. The time-slice used by the operating system is 30 ms, and assuming a 200 MHz processor this translates to a scheduler interrupt every six million processor cycles. The scheduler uses a simple affinity mechanism which keeps the same application scheduled on the processor for the equivalent of three time slices (i.e., for the single-context processor each application runs for three time slices before switching, for the two-context processor a pair of applications is run for six time slices before switching, etc.). Because of this affinity mechanism, the number of processes switched on each scheduler call will either be zero or the number of hardware contexts supported.
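A sketch of that affinity policy, in our own rendering (the paper gives only the description above), looks like this:

    # The scheduler model described above: a 30 ms time slice (six million cycles
    # at 200 MHz), with each group of applications kept resident for the
    # equivalent of three time slices per hardware context.

    TIME_SLICE_CYCLES = int(30e-3 * 200e6)    # 6,000,000 cycles
    AFFINITY_SLICES = 3

    def resident_applications(run_queue, contexts, slice_number):
        """Applications loaded on the hardware contexts during this time slice."""
        groups = len(run_queue) // contexts
        group = (slice_number // (AFFINITY_SLICES * contexts)) % groups
        return run_queue[group * contexts:(group + 1) * contexts]

    workload = ["Doduc", "Li", "Eqntott", "Mxm"]          # the IC workload
    for s in (0, 3, 6, 9):
        print(s, resident_applications(workload, 2, s))
    # slices 0-5 run Doduc and Li, slices 6-11 run Eqntott and Mxm, and so on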
The simulator does not actually run the operating system scheduler at each interrupt; rather, the scheduler is modeled as a routine with negligible latency which displaces some number of cache lines. The amount of cache interference caused by the scheduler is based on a study by Torrellas [25] of IRIX, a UNIX System V variant, running on a Silicon Graphics 4D/340. The scheduler interference depends on the number of contexts which must be swapped, as shown in Table 6, and is modeled by issuing the number of memory requests given in the table to random addresses.

Table 6: Operating system costs: instruction and data cache interference as a function of the number of processes switched.

5 Performance Results

We start this section by presenting results for the workstation workloads, and then turn to the multiprocessor evaluation. In both environments, we show that the interleaved approach significantly outperforms the blocked approach.

5.1 Uniprocessor Results

Before we examine the effects of multiple contexts on multiprogramming throughput, we first need to discuss the impact of multiple contexts on process scheduling. As was observed in [3], applications with lower miss rates tend to get more cycles under blocked multiple contexts than applications with higher miss rates. This is because lower miss rates usually translate into longer runlengths, and assuming a strict round-robin scheduling, the fraction of the total processor cycles allocated to each application will depend on the size of its runlength relative to the other runlengths. A similar effect also occurs for the interleaved scheme. Assuming the processor supports C contexts, an application receives 1/C of the processor cycles as long as it is not unavailable due to an outstanding memory request. Since applications with longer runlengths spend less of their time waiting for memory, they will get a larger fraction of the processor. Because we would like to compare the blocked and interleaved schemes based on how well they improve the throughput of all applications in the workload, not on whether they devote more processor time to applications with better memory behavior, we will assume that the hardware provides context-usage feedback to the operating system, and the operating system schedules the workload to even out the amount of processor cycles devoted to each application. Therefore, we will normalize our results (which do not include the effects of this feedback to the operating system) to the case where each of the C applications is given 1/C of the processor.

Table 7: Increase in application throughput with multiple contexts, for the interleaved and blocked schemes with two and four contexts, across the IC, DC, DT, FP, R0, R1, and SP workloads and their mean.

The importance of a low switch cost shows up as we look at the processor utilization breakdown for the blocked processor in Figure 6. Processor utilization is broken into five categories: busy (time spent doing useful work), instruction (time stalled due to pipeline dependencies), inst cache/TLB (time stalled on memory due to instruction references), data cache/TLB (time stalled on memory due to data references), and context switch (time spent context switching). The numbers on top of the bars show the percent of time spent busy. The seven workloads are listed along the bottom of the graphs; results are given for one, two, and four contexts per processor.

Figure 6: Blocked scheme processor utilization.

In general, the processor utilization of the blocked scheme does not increase much with additional contexts. This is because many of the workloads simply do not contain that much memory or long-instruction latency for the multiple contexts to tolerate. Even for workloads where there is a fair amount of memory latency, such as DC and DT, most of the memory stall time is due to secondary cache hits, and the gains from tolerating the memory latency are consumed by the switch overhead. Consequently, the throughput of these workloads with four contexts only increases by 23% and 9%, respectively.

In contrast, the lower switch cost of the interleaved scheme allows it to tolerate both pipeline dependencies and memory latency, as shown in Figure 7. Processor utilization increases significantly under the interleaved scheme for workloads with large amounts of instruction latency, as the cycle-by-cycle interleaving tolerates shorter instruction latencies, while the backoff instruction is used to tolerate long instruction latency. In addition, the memory latency of workloads such as DC and DT can be effectively tolerated because of the lower switch cost of the interleaved scheme, resulting in a 65% and 46% increase in throughput for the two workloads with a four-context processor.

Figure 7: Interleaved scheme processor utilization.

Table 7 summarizes the performance of the blocked and interleaved schemes. The interleaved scheme is able to increase throughput by 22% (geometric mean across the workloads) with only two contexts per processor. With four contexts the improvement is 50%. In contrast, the blocked scheme shows very small improvements in throughput, 3% with two contexts and 11% with four.

We have just shown the interleaved scheme to substantially outperform the blocked when several large jobs are multiprogrammed on a single workstation. While this is the situation found on the workstations in our research lab, many workstations run with one large job in the background which is timesharing the processor with the operating system, windowing system, and a number of smaller foreground jobs. Even though we have not explicitly modeled this workload here, interleaved multiple contexts obviously also benefit this environment. The smaller foreground jobs can be loaded and run on the processor without requiring the larger job to be switched out.
The response time of the windowing system can be improved if it does not require other jobs to be swapped before it can run. The operating system can also take advantage of the multiple contexts, especially with the trend towards microkernel operating systems, where much of the operating system functionality is encapsulated in separate processes. In addition to providing these advantages, multiple contexts allow background applications which suffer from significant memory latency to be written as parallel programs to take advantage of the latency tolerance. There are also a large number of applications which are designed to run on workstation clusters or small-scale multiprocessors that are already multithreaded and can take advantage of the multiple contexts on the processor. By providing a multiple-context processor that performs just as well with a single thread as the single-context processor and is able to show significant performance improvements with as few as two loaded hardware contexts, the interleaved scheme allows a workstation to be built that will appear significantly faster to the user.

5.2 Multiprocessor Results

Our simulated multiprocessor consists of a number of nodes connected together by a high-bandwidth, low-latency interconnect. Each node consists of a single processor, instruction and data cache, and a portion of the global memory. The caches are kept coherent using a distributed, directory-based protocol similar to that of the Stanford DASH multiprocessor [16]. Because the same primary cache sizes and parameters are used for the multiprocessor as for the uniprocessor, shared data communication will be the major contributor to the cache miss rate, and therefore the instruction cache was modeled as ideal (100% hit rate) and only a single level of data cache was simulated, as multi-level hierarchies do not help reduce the communication miss rate.
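For readers unfamiliar with directory-based coherence, the toy sketch below shows the basic invalidation mechanism; it is a textbook simplification of our own, not the DASH protocol itself:

    # Toy invalidation-based directory (a simplification, not DASH): each memory
    # line records which processors hold a cached copy, and a write must
    # invalidate every other sharer before ownership is granted.

    class DirectoryLine:
        def __init__(self):
            self.sharers = set()              # processor ids caching this line

        def read(self, proc):
            self.sharers.add(proc)            # the requester becomes a sharer

        def write(self, proc):
            to_invalidate = self.sharers - {proc}
            self.sharers = {proc}             # the requester is now the sole owner
            return to_invalidate              # caches that must drop their copies

    line = DirectoryLine()
    line.read(0); line.read(2)
    print(line.write(1))    # {0, 2}: the interprocessor communication misses
                            # that dominate the multiprocessor miss rate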

Table 8: Default memory latencies.
  Hit in Primary Cache       1 cycle
  Reply from Local Memory
  Reply from Remote Memory
  Reply from Remote Cache

Unloaded memory latencies are selected from a uniform distribution spanning the ranges given in Table 8 and are based on Stanford DASH latencies. Contention for the caches is modeled, which can increase these base latencies. While cache contention is modeled, the network and memories are modeled as contentionless to speed up simulation. Simplifying the network and memory system allows us to simulate larger problems, while still providing a sufficient model of the memory system behavior, as cache contention is likely to dominate network and memory contention [1].

The SPLASH suite of applications was used for our multiprocessor study. An overview of the seven SPLASH applications and their input sets is presented in Table 9. More information on the computational behavior and important data structures for each application can be found in [21]. For all SPLASH applications simulated for multiple time steps, we only gathered performance statistics after the first step was finished, since the behavior of the first step is often different from all other steps. For Cholesky and LocusRoute we only gathered statistics for the parallel sections of the code.

Table 9: SPLASH suite summary.
  Application   Language   Lines   Description                                  Input           Iterations
  Barnes-Hut    C          2700    hierarchical N-body gravitation simulation   4K particles    3 steps
  Cholesky      C          2000    Cholesky factorization of sparse matrices    BCSSTK23        NA
  LocusRoute    C          6400    routes wires in VLSI standard cell designs   Primary2.grin   NA
  MP3D          C          1500    simulates rarefied hypersonic flow           150K particles  4 steps
  Ocean         Fortran    3300    simulates eddy currents in an ocean basin    258x258 grid    3 steps
  PTHOR         C          9200    simulates digital logic circuits             NTT             20 cycles
  Water         C          1500    simulates water molecule interaction         256 molecules   2 steps

Speedups from adding multiple contexts to our base processor are shown in Table 10. On occasion, the best performance was encountered with fewer than the maximum number of hardware contexts, and the numbers presented in the table are for the application running on the optimum number of contexts.

Table 10: Application speedup due to multiple contexts, for the interleaved and blocked schemes with two, four, and eight contexts, across MP3D, Barnes, Water, Ocean, Locus, PTHOR, and Cholesky and their mean.

As expected, the performance gains due to multiple contexts are in general much larger in the multiprocessor environment, and only Cholesky shows no gains from multiple contexts. In addition, the interleaved scheme outperformed the blocked scheme for all applications when using four and eight contexts per processor, and for nearly all applications with two contexts per processor. The largest performance differences between the two schemes are exhibited for Barnes and Water (both applications have large amounts of instruction latency, mainly due to a large number of floating-point divides). The performance difference between the two schemes for all applications is substantial; in fact, with four contexts per processor, the interleaved scheme outperforms the eight-context blocked scheme for all applications except MP3D.

We show a breakdown of the multiple-context execution time for the blocked scheme in Figure 8 and for the interleaved scheme in Figure 9. In these graphs, execution time of the measured portion of the application is shown for one, two, four, and eight contexts per processor, normalized to the single-context time. This execution time is divided into six categories: busy (time spent active), instruction stall, split into short and long (time stalled due to pipeline dependencies), memory (time stalled on data cache misses), synchronization (time spent on interprocess synchronization), and context switch (time spent in switching overhead).
Pipeline dependencies of four or fewer cycles (four being the maximum stall due to a floating-point add/subtract/multiply result hazard) are labeled short, while all longer pipeline dependencies are labeled long. Again, the blocked scheme squanders more cycles in context switching than the interleaved scheme.

Figure 8: Application execution time breakdown for the blocked scheme.

Figure 9: Application execution time breakdown for the interleaved scheme.

Because memory latencies are so much larger in a multiprocessor, the effects of the higher switch cost are not as serious as they were in the uniprocessor environment, and the blocked scheme is still able to show reasonable application speedups. While the effect of the high switch cost is less serious for the multiprocessor, it still has a negative impact, allowing the interleaved scheme to outperform the blocked. In addition, while the blocked scheme is somewhat effective in tolerating longer pipeline dependencies, it cannot tolerate short pipeline dependencies, which accounted for 12% of the total single-context execution time when averaged across the SPLASH applications. In contrast, the interleaved scheme is able to tolerate both long and short pipeline dependencies, and as a result achieves much better processor utilizations for applications like Water that contain large amounts of pipeline dependencies.

6 Implementation Issues

We have just shown the interleaved scheme to outperform the blocked, and now need to compare the implementation costs of both approaches. Due to space limitations, the discussion must be brief; a more complete examination of the hardware requirements can be found in [14]. The first requirement of all multiple-context processors is that the cache be capable of handling multiple outstanding memory requests. These lockup-free caches [12] are more expensive than standard blocking caches, and represent a substantial portion of the extra complexity involved in building any multiple-context processor [14].
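One common way to build such a lockup-free cache, sketched below with our own structure and names, is a small file of miss-status holding registers (MSHRs); the paper does not prescribe a particular organization, and reference [12] discusses the design space in detail.

    # Sketch of MSHR-based miss tracking (an illustrative organization only).
    # Cache hits proceed normally while the listed misses are outstanding.

    LINE_SIZE = 32

    class MissStatusHoldingRegisters:
        def __init__(self, entries=4):
            self.entries = entries
            self.pending = {}                 # line address -> contexts waiting on it

        def record_miss(self, addr, context):
            line = addr - addr % LINE_SIZE
            if line in self.pending:          # secondary miss: merge with the first
                self.pending[line].append(context)
                return "merged"
            if len(self.pending) == self.entries:
                return "stall"                # no free MSHR: the cache must block
            self.pending[line] = [context]    # primary miss: send the request out
            return "issued"

        def fill(self, line):
            return self.pending.pop(line, []) # contexts to reactivate on data return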

Because of the widening gulf between processor and memory speeds, off-the-shelf microprocessors have started to provide lockup-free caches [24]. Future generations of processors are likely to include even more extensive support for multiple outstanding requests. Thus, the component which causes the largest complexity increase for multiple-context processors will exist in future microprocessors, and we can focus on the remaining implementation issues. These issues include instruction issue from multiple streams, control for the multiple contexts, and replication of key per-process state.

The blocked scheme issues instructions from only a single stream at any given time, and its issue logic is fairly similar to that of a single-context processor. The major costs of the blocked processor are in replicating the per-process state (the program counter, the process-specific portion of the processor status word, and the register file) and providing the control to switch between that state at the point of the context switch. In contrast, the interleaved multiple-context processor needs to issue instructions from multiple independent streams concurrently. This requires that the instruction issue logic also be replicated, and that each instruction in the pipeline be tagged with the context that it issued from. This context identifier (CID) is then used by the pipeline to determine which state to access (e.g., for the register file access, in determining TLB hits, etc.). This additional complexity in the interleaved instruction issuing logic is the largest difference between the two schemes, and we end this section by exploring possible implementations for both the blocked and interleaved program counter (PC) units for our simulated pipeline. We first present the single-context PC unit, and then examine the additional complexity of the blocked and interleaved PC units.

6.1 Single-context Program Counter Unit

The PC unit for our single-context processor is shown in Figure 10. The rectangles in the diagram represent registers; all registers have clock enable capability. The clock enable and tristate control are not shown. On any given cycle, one of several sources drives the PC bus. The possible PC sources are: (a) the old PC value plus the instruction size (normal sequential flow), (b) the Branch Target Buffer (predicted branch), (c) the computed branch target (mis- or unpredicted branch), (d) the exception vector, or (e) the EPC register (restore from an exception).

Figure 10: Single-context processor PC unit.
The exception vector and EPC register provide the ability to take and recover from exceptions. During normal execution, as each instruction retires the address of that instruction is loaded into the EPC register from the PC chain. When an exception occurs, the loading of the EPC register is stopped with the guilty instruction, and it and the following instructions in the pipeline are squashed (marked to not update any state). The exception vector is then placed on the PC bus, and the handler starts executing. The EPC is connected to the result bus to allow the exception handler to save and restore the EPC manually. When the exception has been handled, the EPC is forced onto the PC bus via an ERET (exception return) instruction, and execution continues from the point at which it left off. Note that because we have removed the MIPS branch delay slot, only a single EPC is needed. Supporting multiple contexts on a processor with branch delay slots is discussed in [15].

6.2 Blocked Program Counter Unit

The ability of the EPC to save an instruction address for later repeat is exactly the same functionality needed to correctly save the program counter after a context switch on the blocked processor. To be able to support multiple contexts, we simply need to replicate this functionality for each context. A PC unit capable of supporting two contexts is shown in Figure 11. This PC unit is very similar to that of a single-context processor, with the only difference being a modification to the EPC register in order to support the two contexts. This modification adds an EPC register per context, which doubles as both the exception PC register and the context restart register (which contains the saved PC for that context).

Figure 11: Two-context blocked processor PC unit.

Exceptions continue to use the EPC register in the same manner as the single-context processor. The EPC register for the active context is continually being updated during normal operation, while the EPC for the idle context remains unchanged. When an exception occurs, the EPC stops being updated and the exception vector is driven onto the bus. The behavior of the PC unit for a context switch is very similar to that for an exception. When a context needs to be switched, the context switch is delayed until the normal exception point, at which time the EPC register for the blocked context stops loading as if an exception had occurred. The partially completed instructions in the pipeline are squashed and the next context is selected. The EPC register of the next context is driven onto the PC bus, and the new context starts execution with the instruction that caused its previous context switch. (Note that this sharing of the EPC between exception handling and context switching requires context switching to be disabled upon entry into the exception handler until the EPC can be saved.)

6.3 Interleaved Program Counter Unit

We have just shown that the changes to the blocked PC unit to handle multiple contexts are fairly straightforward. We now examine the changes required for the interleaved PC unit.

Figure 12: Two-context interleaved processor PC unit.

The PC unit must be able to determine the next instruction to be issued from each context. Determining this next instruction has become somewhat complicated under the interleaved scheme because the new PC value for a context becomes available a specific number of pipeline stages after issue and must be held until the context becomes active again. The delay between when an instruction issues and when the new PC value becomes available depends on the type of instruction that was issued. The new PC value is not known until the end of the EX cycle for a branch which was not predicted correctly, while for nonbranch instructions and correctly predicted branches the new PC value is available at the end of IF1. Because of the context interleaving, the context may not be able to drive its new PC value onto the PC bus immediately when it becomes available, and holding registers must be provided until the context is selected to drive the PC bus. The three PC sources (sequential, predicted branch, computed branch) can be multiplexed into a single holding register per context, as shown in Figure 12. A bit is associated with each holding register to signify whether the register holds a computed branch which was loaded as a result of a previous branch being mispredicted. This bit is used to signal that the BTB needs to be updated to reflect the new branch prediction when the holding register is driving the PC bus. This next PC register (NPC) is loaded by one of the following sources (in order of decreasing priority):

- The computed branch target, if the instruction in EX is a mispredicted branch from this context.
- The predicted branch target, if the current PC is from this context and hit in the BTB.
- The sequential address, if the current PC is from this context.
- The NPC itself, causing the register to maintain its current value.

The computed branch has priority over all other sources if a previous branch was mispredicted, because this is guaranteed to be the correct path of instruction flow. Note that the determination of the mispredicted branch can actually occur before the predicted branch address has been issued to the PC bus, due to the context interleaving. If this occurs, the branch will only cost a single cycle, even though it was mispredicted. However, if a mispredicted branch was issued, all incorrectly fetched instructions in the pipeline from the context must be squashed when the misprediction is detected. This is accomplished by sending a branch squash signal coupled with a branch squash CID. Instructions following the branch in the pipeline with the same CID as the squash CID are then marked to not update any state. In addition to squashing the incorrectly fetched instructions, the computed branch is loaded into the NPC and the mispredicted status bit is set. On the next instruction issue from this context, the NPC will be driven onto the PC bus and the BTB updated.

The interleaved PC unit must also be able to handle context availability changes. This can be accomplished in a manner similar to that of the blocked scheme. For example, a change in context availability due to a cache miss requires that once the miss is detected, the issuing of instructions from that context is stopped and all partially completed instructions in the pipeline are squashed. Squashing of already issued instructions can be accomplished by providing a conditional squash signal coupled with a squash CID. All instructions in the pipeline that match the CID will be marked to not affect any state. The PC value of the instruction causing the miss is loaded into the EPC and the EPC valid bit is set. When the context eventually becomes available again, the EPC of that context will be driven onto the PC bus when the context issues its first instruction. The context will then start reexecuting from the instruction which caused the context to become unavailable. Handling exceptions on the interleaved processor is more complicated than on the blocked processor due to the execution of the multiple threads, and is discussed in detail in [14].

While the interleaved PC unit is more complex than the blocked, this complexity is not overwhelming, especially when compared with the instruction issue complexity of recent dynamic superscalar processors [24]. For a manageable increase in complexity, the interleaved processor provides a significant increase in performance.
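The NPC priority just listed can be written out directly. The sketch below is our own straight-line rendering of the selection implemented by the mux and priority logic of Figure 12; the argument names are hypothetical, not signal names from the paper.

    # Per-context next-PC (NPC) selection, following the priority list above.

    INSTRUCTION_SIZE = 4

    def next_npc(npc, this_context, ex_context, ex_mispredicted, computed_target,
                 pc_context, current_pc, btb_hit, btb_target):
        if ex_context == this_context and ex_mispredicted:
            return computed_target                 # highest priority: corrected path
        if pc_context == this_context:
            if btb_hit:
                return btb_target                  # correctly predicted taken branch
            return current_pc + INSTRUCTION_SIZE   # normal sequential flow
        return npc                                 # not our cycle: hold the value

    # Example: context 0's branch, now in EX, was mispredicted, so its NPC captures
    # the computed target even though context 1 currently owns the PC bus.
    print(hex(next_npc(0x400, 0, 0, True, 0x1a0, 1, 0x500, False, 0)))   # 0x1a0

When this context next wins the PC bus, the NPC value is driven onto it and, if the mispredicted bit is set, the BTB is updated, exactly as described above.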


Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Professor Alvin R. Lebeck Computer Science 220 Fall 1999 Exam Average 76 90-100 4 80-89 3 70-79 3 60-69 5 < 60 1 Admin

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline?

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline? 1. Imagine we have a non-pipelined processor running at 1MHz and want to run a program with 1000 instructions. a) How much time would it take to execute the program? 1 instruction per cycle. 1MHz clock

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

A Fine-Grain Multithreading Superscalar Architecture

A Fine-Grain Multithreading Superscalar Architecture A Fine-Grain Multithreading Superscalar Architecture Mat Loikkanen and Nader Bagherzadeh Department of Electrical and Computer Engineering University of California, Irvine loik, nader@ece.uci.edu Abstract

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

EE382 Processor Design. Processor Issues for MP

EE382 Processor Design. Processor Issues for MP EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Multiprocessor Support

Multiprocessor Support CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch branch taken Revisiting Branch Hazard Solutions Stall Predict Not Taken Predict Taken Branch Delay Slot Branch I+1 I+2 I+3 Predict Not Taken branch not taken Branch I+1 IF (bubble) (bubble) (bubble) (bubble)

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

EECS 322 Computer Architecture Superpipline and the Cache

EECS 322 Computer Architecture Superpipline and the Cache EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:

More information

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43

The CPU Pipeline. MIPS R4000 Microprocessor User's Manual 43 The CPU Pipeline 3 This chapter describes the basic operation of the CPU pipeline, which includes descriptions of the delay instructions (instructions that follow a branch or load instruction in the pipeline),

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory

Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory Daniel J. Scales and Kourosh Gharachorloo Western Research Laboratory Digital Equipment Corporation Chandramohan Thekkath

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic

An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic To appear in Parallel Architectures and Languages Europe (PARLE), July 1994 An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic Håkan Nilsson and Per Stenström Department

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level

More information

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu A Chip Multiprocessor Implementation

More information

Static Branch Prediction

Static Branch Prediction Static Branch Prediction Branch prediction schemes can be classified into static and dynamic schemes. Static methods are usually carried out by the compiler. They are static because the prediction is already

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance:

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance: #1 Lec # 9 Winter 2003 1-21-2004 Classification Steady-State Cache Misses: The Three C s of cache Misses: Compulsory Misses Capacity Misses Conflict Misses Techniques To Improve Cache Performance: Reduce

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Module 4c: Pipelining

Module 4c: Pipelining Module 4c: Pipelining R E F E R E N C E S : S T A L L I N G S, C O M P U T E R O R G A N I Z A T I O N A N D A R C H I T E C T U R E M O R R I S M A N O, C O M P U T E R O R G A N I Z A T I O N A N D A

More information

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş Evolution of Computers & Microprocessors Dr. Cahit Karakuş Evolution of Computers First generation (1939-1954) - vacuum tube IBM 650, 1954 Evolution of Computers Second generation (1954-1959) - transistor

More information

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Quantitative study of data caches on a multistreamed architecture. Abstract

Quantitative study of data caches on a multistreamed architecture. Abstract Quantitative study of data caches on a multistreamed architecture Mario Nemirovsky University of California, Santa Barbara mario@ece.ucsb.edu Abstract Wayne Yamamoto Sun Microsystems, Inc. wayne.yamamoto@sun.com

More information