A comparison of three architectures: Superscalar, Simultaneous Multithreading, and Single-Chip Multiprocessor.


Recent years have seen a great deal of interest in multiple-issue machines, or superscalar processors: processors that can issue several mutually independent instructions in the same cycle. These machines exploit the parallelism that programs exhibit at the instruction level. Superscalar designs dynamically extract parallelism by executing many instructions within a single, sequential program in parallel. To find independent instructions within a sequential stream of instructions, or thread of control, today's processors increasingly rely on sophisticated architectural features such as out-of-order instruction execution and speculative execution of instructions beyond branches predicted with dynamic hardware branch prediction techniques. Speculative execution means issuing an instruction whose data dependences are satisfied but whose control dependences are not: a potential future instruction is issued early even though an intervening branch may send execution in another direction entirely.

However, it is important to know how much parallelism is available in typical applications. Machines providing a high degree of multiple issue would be of little use if applications did not exhibit that much parallelism, and the available parallelism depends strongly on how hard we are willing to work to find it. Future performance improvements will require processors to execute more instructions per clock cycle, yet reliance on a single thread of control limits the parallelism available to many applications, and the cost of extracting parallelism from a single thread is becoming prohibitive. This cost manifests itself in numerous ways, including increased die area and longer design and verification times. In general, we see diminishing returns when trying to extract parallelism from a single thread: continuing this trend will trade only incremental performance increases for large increases in overall complexity.

Exploiting Parallelism: Parallelism exists at multiple levels in modern systems. Parallelism between individual, independent instructions in a single application is instruction-level parallelism (ILP). Loop-level parallelism results when the instruction-level parallelism comes from data-independent loop iterations. The finite number of instructions that the hardware can examine at once when looking for instruction-level parallelism to exploit is called the instruction window size.
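
To illustrate the idea, the two loops below (a hypothetical example, not taken from the referenced papers) expose very different amounts of ILP to the hardware: the first forms a single dependence chain, while the second keeps four mutually independent chains that a multiple-issue machine can overlap.

```cpp
#include <cstddef>

// Little ILP: each iteration depends on the previous one, so a
// superscalar core cannot overlap the additions no matter how many
// issue slots it has.
double serial_sum(const double* a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];                 // sum -> sum dependence chain
    return sum;
}

// More ILP: the four partial sums are mutually independent, so a
// 4-issue machine can in principle execute all four additions of an
// iteration in the same cycle (loop-level parallelism made explicit).
double parallel_sum(const double* a, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];                  // independent of s1, s2, s3
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];   // remainder loop
    return s0 + s1 + s2 + s3;
}
```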

Compilers, which have an essentially unlimited virtual instruction window as they generate code, can help increase usable parallelism by reordering instructions so that instructions that can issue in parallel are close to each other in the executable, allowing the hardware's finite window to detect the resulting instruction-level parallelism. Some compilers can also divide a program into multiple threads of control, exposing thread-level parallelism (TLP). This form of parallelism simulates a single, large hardware instruction window by allowing multiple, smaller instruction windows, one per thread, to work together on one application. A third, very coarse form of parallelism, process-level parallelism, involves completely independent applications running in independent processes controlled by the operating system.

In the future, we expect thread and process parallelism to become widespread, for two reasons: the nature of applications and the nature of the operating system. Researchers have therefore proposed two alternative microarchitectures that exploit multiple threads of control: simultaneous multithreading (SMT) [1, 3] and chip multiprocessors (CMP) [4, 7, 9]. Simultaneous multithreading is a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle; it is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle. Chip multiprocessors (CMPs) use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores [5]. CMPs resemble today's multichip multiprocessors, but placing multiple CPUs on a single chip speeds up data transactions among processors. This speedup makes a CMP faster than a conventional multichip multiprocessor at running parallel programs, especially when threads communicate frequently.

Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel-processing styles statically partition processor resources, preventing them from adapting to dynamically changing levels of ILP and TLP in a program: with insufficient TLP, processors in an MP sit idle; with insufficient ILP, the multiple-issue hardware of a superscalar is wasted. Simultaneous multithreading (SMT) [Tullsen et al. 1995; 1996; Gulati et al. 1996] allows multiple threads to compete for and share the available processor resources every cycle.
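
As a minimal sketch of thread-level parallelism, the following divides a data-independent loop among operating-system threads; the function names and the four-thread split are illustrative assumptions, not part of the original text.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each worker handles a disjoint slice, so the iterations it runs are
// independent of every other worker: this is TLP made explicit.
void scale_range(std::vector<double>& v, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo; i < hi; ++i)
        v[i] *= 2.0;                       // iterations are independent
}

void scale_parallel(std::vector<double>& v, unsigned nthreads = 4) {
    std::vector<std::thread> workers;
    std::size_t chunk = v.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = (t + 1 == nthreads) ? v.size() : lo + chunk;
        workers.emplace_back(scale_range, std::ref(v), lo, hi);
    }
    for (auto& w : workers) w.join();      // wait for all threads
}
```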

One of its key advantages when executing parallel applications is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. Software trends favor multithreaded programming for its various benefits: multiprocessor systems provide multiple simultaneous points of execution, and, with the help of the operating system, independent threads can run on independent processors simultaneously. Meanwhile, the need to limit the effects of interconnect delays, which are becoming much slower than transistor gate delays, and the opportunity to exploit the increasing transistor count on a chip, favor CMPs [5].

1. Trends in Multiprocessor Architecture: The major trend in commercial microprocessor architecture is the use of complex architectures to exploit ILP. Two approaches are used to exploit ILP: superscalar and Very Long Instruction Word (VLIW). Both attempt to issue multiple instructions to independent functional units at every clock cycle. A superscalar uses hardware to dynamically find data-independent instructions in an instruction window and issue them to independent functional units. VLIW, on the other hand, relies on the compiler to find ILP and schedule the execution of independent instructions statically. Superscalar is more appealing for commercial microprocessors because it can improve the performance of existing application binaries [7]. However, a superscalar is complex to design and difficult to implement: looking for parallelism in a large instruction window requires a significant amount of hardware and usually does not improve performance as much as one might expect. Because of this complexity, it is difficult not only to get the architecture correct but also to optimize the pipeline and circuits to achieve a high clock frequency. VLIW instead relies on the compiler to find groups of independent instructions. Since VLIW does not require hardware for dynamic scheduling, it can be much simpler to design and implement; however, it requires significant compiler support, such as trace scheduling, to find the ILP in an application program. VLIW is preferred over superscalar when the issue width is so large that the dynamic scheduling hardware of a superscalar becomes too complex and expensive to implement. However, such a wide-issue VLIW machine has a centralized register file that must have many ports to supply operands to the independent functional units; the access time of the register file and the complexity of the buses connecting it to the functional units may limit the clock frequency.
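
The distinction between dynamic and static scheduling can be sketched in ordinary code. Below, independent operations are grouped into "bundles" the way a VLIW compiler might; the grouping, names, and two-wide machine are hypothetical, chosen only to illustrate the idea.

```cpp
// A conceptual sketch of static (VLIW-style) scheduling, written as
// plain C++.  Each "bundle" groups operations the compiler has proven
// independent, so a 2-wide VLIW machine could issue them together.
struct Inputs { double a, b, c, d; };

double vliw_style(const Inputs& in) {
    // Bundle 1: two independent multiplies; both may issue in one cycle.
    double x = in.a * in.b;
    double y = in.c * in.d;
    // Bundle 2: depends on bundle 1, so it occupies a later issue slot.
    // If either operation above stalls (e.g., on a cache miss), the
    // whole bundle stalls with it -- the rigidity noted in the text.
    return x + y;
}
```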

Another disadvantage of VLIW is that it cannot use precompiled binaries, which is a problem when the source code is not available. VLIW also forces a bundle of instructions to execute together: if one instruction in the bundle stalls, the other instructions in the bundle must stall too. This limits VLIW's ability to deal with unpredictable events such as data accesses that miss in the cache. Currently, most commercial microprocessors, such as the Intel Pentium, Compaq Alpha 21264, IBM PowerPC 620, Sun UltraSPARC, HP PA-8000, and MIPS R10000, use superscalar design techniques. The performance of these microprocessors has been improving at a phenomenal rate for decades, driven by (1) innovation in compilers, (2) improvements in architecture, and (3) tremendous improvements in VLSI technology. The latest superscalar microprocessors can execute four to six instructions concurrently using many nontrivial techniques, including dynamic branch prediction, out-of-order execution, and speculative execution. However, these techniques may not deliver the expected speedup because of the limits of the instruction window size and of the ILP in a typical program. Moreover, considerable design effort is required to develop such a high-performance microprocessor. Developing an even more complex, wide-issue superscalar microprocessor as the next-generation microprocessor may therefore not be an efficient way to reach the required performance.

Superscalar Bottlenecks: Where Have All the Cycles Gone? Figure 1 gives the issue utilization, i.e., the percentage of issue slots that are filled each cycle, for most of the SPEC benchmarks, and records the cause of each empty issue slot. For example, if the next instruction cannot be scheduled in the same cycle as the current instruction, then the remaining issue slots this cycle, as well as all issue slots of the idle cycles between the execution of the current instruction and the next (delayed) instruction, are assigned to the cause of the delay. When causes overlap, all cycles are assigned to the cause that delays the instruction the most; if the delays are additive, such as an ITLB miss and an I-cache miss, the wasted cycles are divided up appropriately [1]. It can be seen that the functional units of the wide superscalar studied are highly underutilized. The results also indicate that there is no dominant source of wasted issue bandwidth. Although individual applications have dominant items (e.g., mdljsp2, swm, fpppp), the dominant cause differs in each case. In the composite results, the largest cause (short FP dependences) is responsible for 37% of the wasted issue bandwidth, but six other causes each account for at least 4.5% of the wasted cycles. Even completely eliminating any one factor will not necessarily improve performance to the degree this graph might imply, because many of the causes overlap. Not only is there no dominant cause of wasted cycles; there appears to be no dominant solution.
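
A minimal sketch of the bookkeeping behind such a figure might look like the following; the cause list is abbreviated and the per-cycle interface is an assumption made for illustration, not the simulator actually used in [1].

```cpp
#include <array>
#include <cstdint>

// Each of the processor's issue slots in a cycle is either used or
// charged to a waste cause, per the "charge the longest delay" rule.
enum class Cause { None, ICacheMiss, DCacheMiss, BranchMispredict,
                   ShortFpDependence, Other, kCount };

struct IssueStats {
    std::uint64_t used = 0;
    std::array<std::uint64_t, static_cast<int>(Cause::kCount)> wasted{};

    // Called once per cycle with the number of instructions actually
    // issued and the dominant cause that left the remaining slots empty.
    void record_cycle(int issued, Cause dominant, int issue_width = 8) {
        used += issued;
        wasted[static_cast<int>(dominant)] += issue_width - issued;
    }

    // Fraction of all issue slots that were filled (Figure 1's y-axis).
    double utilization(std::uint64_t cycles, int issue_width = 8) const {
        return static_cast<double>(used) / (cycles * issue_width);
    }
};
```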

If specific latency-hiding techniques are limited, then any dramatic increase in parallelism must come from a general latency-hiding solution, of which multithreading and multiprocessing are examples. Table 1 gives an overview of the possible causes of wasted issue slots and, for each cause, the latency-reducing techniques that can reduce the number of cycles it wastes.

2. Hardware Multithreading: Increasing cache-miss rates and the increasing latency of cache misses have a compounding effect on the portion of execution time that is wasted on cache misses. One solution to this problem is coarse-grained multithreading, which enables the processor to perform useful instructions during cache misses. Why are miss rates and miss latencies increasing?

Workload Characteristics: Consider, for instance, server workloads, which represent market segments such as on-line transaction processing (OLTP), business intelligence, enterprise resource planning (ERP), web serving, and collaborative groupware. These applications are often large and function-rich; they use a large number of operating system services and access large databases. These characteristics make the instruction and data working sets large. The workloads are also inherently multi-user and multitasking, and the large working sets and high frequency of task switches cause high cache-miss rates. In addition, research in this area points out that such applications can also have data that is frequently read-write shared; in multiprocessors, this can make the miss rates significantly higher. Also, because of the large instruction working set, branch-prediction rates can be poor. All of these characteristics are detrimental to processor performance.

Application Characteristics: Current trends in application characteristics and languages are likely to make this worse.

Object-oriented programming with languages such as C++ and Java has been popular for several years and is increasing in popularity. Virtual-function pointers are a feature of these languages that did not exist in the languages used in older applications, and they lead to branches with very poor branch-prediction rates. The frequency of dynamic memory allocation in these languages is also higher than in older languages, which leads to more allocation of memory from the heap. Heap memory is more scattered than stack memory, which can cause higher cache-miss rates. Java also performs garbage collection, whose access patterns lead to poor cache behavior because it references many objects and uses each only a small number of times. All of these factors are pushing the already high miss rates even higher.

Faster clock rates: A large portion of execution time can already be spent on cache misses and branch mispredictions. The trend in processor microarchitecture is toward decreasing the cycle time at a faster rate than memory access time decreases, so the number of processor cycles in a cache-miss latency is increasing. For a given miss rate, this makes the portion of execution time due to cache misses larger. Combined with the trend toward higher miss rates in workloads that already have high miss rates, this compounds the increase in cycles per instruction (CPI) due to cache misses.
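
A back-of-the-envelope calculation makes the compounding concrete. The miss rate and penalties below are assumed values chosen for illustration, combined with the standard relation: effective CPI = base CPI + misses per instruction x miss penalty in cycles.

```cpp
#include <cstdio>

int main() {
    double base_cpi     = 1.0;
    double misses_per_i = 0.02;   // assumed: 2 misses per 100 instructions

    // Slower clock: the same DRAM latency costs 50 processor cycles.
    double cpi_slow_clock = base_cpi + misses_per_i * 50.0;   // = 2.0

    // Faster clock, same miss rate: the identical DRAM latency now
    // costs 100 processor cycles, and memory stalls dominate CPI.
    double cpi_fast_clock = base_cpi + misses_per_i * 100.0;  // = 3.0

    std::printf("CPI: %.1f -> %.1f as the miss penalty doubles\n",
                cpi_slow_clock, cpi_fast_clock);
    return 0;
}
```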

Multithreading: In a multithreaded processor, the processor holds the state of several tasks or threads. The several threads provide additional parallelism, enabling the processor to better utilize all of its resources: when one thread would normally be stalled, instructions from the other threads can use the processor's resources. The observation that cache misses were becoming a very large portion of execution time led to the investigation of multithreaded hardware as a way to execute useful instructions during cache misses. In fine-grained multithreading, a different thread executes every cycle. While fine-grained multithreading covers control and data dependences quite well (although this may require more than two threads), the impact of cycle interleaving on single-task performance was deemed too large. In coarse-grained multithreading, a single thread, called the foreground thread, executes until some long-latency event such as a cache miss occurs, causing execution to switch to the background thread. If there are no such events, a single thread can consume all execution cycles. This minimizes the impact on single-task execution speed, making the design performance-competitive with non-multithreaded processors. Because such a processor executes instructions in order, coarse-grained multithreading is used.
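
A minimal sketch of this switch-on-miss policy, under the assumption of a two-thread core and a simple per-cycle miss signal, might look like this; the interface is hypothetical, not a real hardware design.

```cpp
#include <utility>

struct ThreadContext { int id; /* PC, registers, ... */ };

class CoarseGrainedCore {
public:
    CoarseGrainedCore(ThreadContext fg, ThreadContext bg)
        : active_(fg), other_(bg) {}

    // Called by the pipeline once per cycle.
    void on_cycle(bool cache_miss_detected) {
        if (cache_miss_detected) {
            // Long-latency event: swap foreground and background.
            std::swap(active_, other_);
        }
        // Otherwise the active thread keeps all issue slots this
        // cycle, which is what preserves single-thread performance.
    }

    int active_thread() const { return active_.id; }

private:
    ThreadContext active_, other_;
};
```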

In a deeply pipelined, out-of-order execution processor, simultaneous multithreading is chosen instead. Because simultaneous multithreading combines hardware multithreading with superscalar processor technology, it is straightforward to compare the performance of a simultaneous multithreaded processor with that of a superscalar processor; for that reason I chose the simultaneous multithreaded processor for my study.

Simultaneous Multithreading (SMT): Multiple instruction issue has the potential to increase performance, but it is ultimately limited by instruction dependences (i.e., the available parallelism) and by long-latency operations within the single executing thread. The effects of these appear as horizontal waste and vertical waste in Figure 2. Multithreaded architectures, such as HEP [28], Tera [3], MASA [15], and Alewife [2], instead employ multiple threads with fast context switching between threads. Traditional multithreading hides memory and functional-unit latencies, attacking vertical waste; in any one cycle, though, these architectures issue instructions from only one thread. The technique is thus limited by the amount of parallelism that can be found in a single thread in a single cycle, and as issue width increases, the ability of traditional multithreading to utilize processor resources will decrease. Simultaneous multithreading, in contrast, attacks both horizontal and vertical waste: it allows multiple threads to compete for and share all of the processor's resources every cycle. By permitting multiple threads to share the processor's functional units simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When a program has only a single thread, i.e., it lacks TLP, all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP. An SMT processor can exploit whichever type of parallelism is available, utilizing the functional units more effectively to achieve the goals of greater throughput and significant program speedups.

Performance of Simultaneous Multithreading: (The results in this section are based on the observations in [1].) This section presents performance results for simultaneous multithreaded processors. Several machine models have been defined for simultaneous multithreading, spanning a range of hardware complexities. Simultaneous multithreading is shown to provide significant performance improvement over both single-threaded superscalar and fine-grained multithreaded processors, both in the limit and under less ambitious hardware assumptions.

Table 3: Details of the cache hierarchy

                        I-cache    D-cache    L2 cache   L3 cache
    Size                64 KB      64 KB      256 KB     2 MB
    Associativity       DM         DM         4-way      4-way
    Transfer time/bank  1 cycle    1 cycle    2 cycles   2 cycles

Table 4: Simulated instruction latencies

    Instruction class                         Latency (cycles)
    integer multiply                          8, 16
    conditional move                          2
    compare                                   0
    all other integer                         1
    FP divide                                 17, 30
    all other FP                              4
    load (L1 cache hit, no bank conflicts)    2
    load (L2 cache hit)                       8
    load (L3 cache hit)                       14
    load (memory)                             50
    control hazard (br or jmp predicted)      1
    control hazard (br or jmp mispredicted)   6

The Machine Models: The following models reflect several possible design choices for a combined multithreaded, superscalar processor. The models differ in how threads can use issue slots and functional units each cycle; in all cases, however, the basic machine is a wide superscalar with 10 functional units capable of issuing 8 instructions per cycle (the same core machine as Section 3). The models are:

Fine-Grain Multithreading. Only one thread issues instructions each cycle, but it can use the entire issue width of the processor. This hides all sources of vertical waste but does not hide horizontal waste. It is the only model that does not feature simultaneous multithreading. Among existing or proposed architectures, it is most similar to the Tera processor [3], which issues one 3-operation LIW instruction per cycle.

SM: Full Simultaneous Issue. This is a completely flexible simultaneous multithreaded superscalar: all eight threads compete for each of the issue slots each cycle. It is the least realistic model in terms of hardware complexity, but it provides insight into the potential of simultaneous multithreading. The following models each restrict this scheme in ways that decrease hardware complexity.

SM: Single Issue, SM: Dual Issue, and SM: Four Issue. These three models limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle. For example, in an SM: Dual Issue processor each thread can issue a maximum of 2 instructions per cycle; therefore, a minimum of 4 threads is required to fill the 8 issue slots in one cycle.

SM: Limited Connection. Each hardware context is directly connected to exactly one of each type of functional unit. For example, if the hardware supports eight threads and there are four integer units, each integer unit receives instructions from exactly two threads. The partitioning of functional units among threads is thus less dynamic than in the other models, but each functional unit is still shared (the critical factor in achieving high utilization). Since the choice of functional units available to a single thread differs from the original target machine, code for this model is recompiled for a 4-issue (one of each type of functional unit) processor. Table 2 shows the important differences in hardware implementation complexity.
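
The issue-limiting of the restricted SM models can be sketched as follows; the ready-list interface is an assumption made for illustration.

```cpp
#include <algorithm>
#include <vector>

// Each cycle, up to `per_thread_cap` instructions are taken from each
// thread's ready list until the issue slots are full.  With a cap of 2
// (SM: Dual Issue), at least 4 threads are needed to fill an 8-wide
// machine, matching the arithmetic in the text.
int issue_one_cycle(const std::vector<int>& ready_per_thread,
                    int per_thread_cap, int issue_width = 8) {
    int issued = 0;
    for (int ready : ready_per_thread) {
        int take = std::min({ready, per_thread_cap, issue_width - issued});
        issued += take;
        if (issued == issue_width) break;   // all slots filled
    }
    return issued;   // instructions issued this cycle
}
```

For example, with a cap of 2 and eight threads that each hold at least two ready instructions, the function fills all 8 slots; with only two such threads, it issues just 4, leaving half the bandwidth as horizontal waste.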

The simulator models the execution pipelines, the memory hierarchy (both hit rates and bandwidths), the TLBs, and the branch prediction logic of a wide superscalar processor. It is based on the Alpha AXP 21164, augmented first for wider superscalar execution and then for multithreaded execution. The typical simulated configuration contains 10 functional units of four types (four integer, two floating-point, three load/store, and one branch) and a maximum issue rate of 8 instructions per cycle. All functional units are assumed to be completely pipelined. Tables 3 and 4 show the details of the cache hierarchy and the simulated instruction latencies, respectively. Figure 3 shows the performance of the various models as a function of the number of threads.
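
For reference, the configuration just described can be restated as a small record; the numbers come directly from the text, while the struct itself is only an illustrative way to organize them.

```cpp
// The simulated machine, as given in the text: 10 functional units of
// four types and an 8-wide issue stage, all units fully pipelined.
struct SimConfig {
    int issue_width      = 8;    // max instructions issued per cycle
    int int_units        = 4;
    int fp_units         = 2;
    int load_store_units = 3;
    int branch_units     = 1;    // 4 + 2 + 3 + 1 = 10 functional units
    bool fully_pipelined = true; // each unit accepts a new op every cycle
};
```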

Observations: Each of these models becomes increasingly competitive with full simultaneous issue as the ratio of threads to issue slots increases. The increase in processor utilization is a direct result of threads dynamically sharing processor resources that would otherwise remain idle much of the time. The lowest-priority thread (at 8 threads) runs at 55% of the speed of the highest-priority thread. Competition for non-execution resources plays nearly as significant a role in this performance region as competition for execution resources. The caches are more strained by a multithreaded workload than by a single-threaded workload, due to the decrease in locality: sharing of the caches is the dominant contributor to wasted issue cycles, and data TLB waste also increases. Total speedups remain relatively constant across a wide range of cache sizes, but the instruction throughput of the various SM models is somewhat hampered by the sharing of the caches and TLBs.

Cache Design for a Simultaneous Multithreaded Processor: The measurements show a performance degradation due to cache sharing in simultaneous multithreaded processors; in this section, the cache problem is explored further. The study focuses on the organization of the first-level (L1) caches, comparing private per-thread caches to shared caches for both instructions and data. (L2 and L3 caches are assumed to be shared among all threads.) All experiments use the 4-issue model with up to 8 threads; not all of the private caches are utilized when fewer than eight threads are running. Figure 4 exposes several interesting properties of multithreaded caches. Shared caches optimize for a small number of threads (where the few threads can use all of the available cache), while private caches perform better with a large number of threads. For example, the 64s.64s cache ranks first among all models at 1 thread and last at 8 threads, while the 64p.64p cache gives nearly the opposite result. However, the tradeoffs are not the same for instructions and data: a shared data cache outperforms a private data cache over all numbers of threads (e.g., compare 64p.64s with 64p.64p), while instruction caches benefit from private caches at 8 threads. One reason for this is the differing access patterns of instructions and data: private I-caches eliminate conflicts between different threads in the I-cache, while a shared D-cache allows a single thread to issue multiple memory instructions to different banks. Two configurations appear to be good choices. Because there is little performance difference at 8 threads, the cost of optimizing for a small number of threads is small, making 64s.64s an attractive option.

However, since the processor will typically operate with all or most thread slots full, the 64p.64s configuration gives the best performance in that region and is never worse than the second-best performer with fewer threads. (64p.64s has eight private 8 KB I-caches and a shared 64 KB data cache.) The shared data cache in this scheme allows it to take advantage of more flexible cache partitioning, while the private instruction caches make each thread less sensitive to the presence of other threads. Shared data caches also have a significant advantage in a data-sharing environment: they allow sharing at the lowest level of the data cache hierarchy without any special hardware for cache coherence.

For SMT processors, potential bottlenecks may occur in the fetch stages, particularly when instructions from different blocks are fetched simultaneously, causing contention at the instruction cache. Furthermore, the cache size becomes more critical as the threads share the same cache [1]. In addition to memory I/O, the pipeline is lengthened by two stages for reading and writing the registers. The longer pipeline places potential strain on the branch prediction unit; however, single-thread performance degraded by only 2% with the insertion of these two stages [1, 13, 14]. SMT provides an option by which a processor can exploit TLP: threads are executed in parallel by scheduling instructions from multiple threads simultaneously, increasing usage of the functional units already present in multiple-issue processors. Logically, SMT is a chip multiprocessor (CMP) in which all of the functional units are combined to allow very flexible scheduling; unlike in a CMP, however, threads on an SMT system share the same caches.

Presently, SMT technologies are scheduled to appear in the upcoming Pentium 4 and in future Alpha processors. While SMT is transparent to the user, taking advantage of SMT features requires the application to be multithreaded so that TLP can be exploited. In particular, such hardware thread support facilitates improved performance through fine-grained threading of programs, which attempts to expose TLP wherever possible by threading every independent unit of work. With the imminent arrival of SMT support in commercial microprocessors, multithreaded programs will be needed to take advantage of these enhancements.

3. Chip Multiprocessor: CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores.

Implementation technology concerns that favor CMPs: Today, most microprocessor designers use their increased transistor budgets to build larger and more complex uniprocessors, but several problems are beginning to make this approach difficult to continue. To address these problems, processor design methodology is shifting from simply building progressively larger uniprocessors to implementing more than one processor on each chip. The following are the key reasons why single-chip multiprocessors are a good idea.

Parallelism: Superscalar processors can extract greater amounts of instruction-level parallelism, or ILP, by finding nondependent instructions that occur near each other in the original program code. Designers primarily use additional transistors to extract more parallelism from programs in order to perform more work per clock cycle. Unfortunately, there is only a finite amount of ILP present in any particular sequence of instructions that the processor executes, because instructions from the same sequence are typically highly interdependent. As a result, processors that use this technique are seeing diminishing returns as they attempt to execute more instructions per clock cycle, even as the logic required to process multiple instructions per clock cycle increases quadratically. A CMP avoids this limitation by primarily using a completely different type of parallelism: thread-level parallelism. A CMP may also exploit small amounts of ILP within each of its individual processors, since ILP and TLP are orthogonal to each other.

Wire delay: As CMOS gates become faster and chips become physically larger, the delay caused by interconnects between gates is becoming more significant. With rapid process technology improvement, within the next few years wires will only be able to transmit signals over a small portion of a large processor chip in each clock cycle.

A CMP, however, can be designed so that each of its small processors takes up a relatively small area on a large processor chip, minimizing the length of its wires and simplifying the design of critical paths. Only the more infrequently used, and therefore less critical, wires connecting the processors need to be long.

Design time: Processors are already difficult to design. Larger numbers of transistors, increasingly complex methods of extracting ILP, and wire-delay considerations will only make this worse. A CMP can help reduce design time, however, because it allows a single, proven processor design to be replicated multiple times over a die. Each processor core on a CMP can be much smaller than a competitive uniprocessor, minimizing the core design time, and a core design can be reused over more chip generations simply by scaling the number of cores present on a chip. Only the processor interconnection logic is not entirely replicated on a CMP.

Why aren't CMPs used now? Although a CMP addresses all of these potential problems in a straightforward, scalable manner, there are reasons why CMPs are not yet common. Integration densities are only just reaching levels where these problems become significant enough to consider a paradigm shift in processor design. The primary reason, however, is that it is very difficult to convert today's important uniprocessor programs into multiprocessor ones. Conventional multiprocessor programming techniques typically require careful data layout in memory to avoid conflicts between processors, minimization of data communication between processors, and explicit synchronization at any point in a program where processors may actively share data. A CMP is much less sensitive to poor data layout and poor communication management, since its interprocessor communication latencies are lower and its bandwidths higher; however, sequential programs must still be explicitly broken into threads and synchronized properly. Parallelizing compilers have been only partially successful at automatically handling these tasks for programmers, so acceptance of multiprocessors has been slowed because only a limited number of programmers have mastered these techniques.

The following discusses the architectures' major design considerations in a qualitative manner.

CPU cores: To keep the processors' execution units busy, the superscalar and SMT processors described above are assumed to feature advanced branch prediction, register renaming, out-of-order instruction issue, and nonblocking data caches. As a result, these processors need numerous multiported rename buffers, issue queues, and register files. The inherent complexity of these architectures results in three major hardware design problems.

First, their area increases quadratically with the core's complexity. The number of registers in each structure must increase proportionally to the instruction window size, and the number of ports on each register must increase proportionally to the processor's issue width. The CMP approach minimizes this problem because it attempts to exploit higher levels of parallelism using more processors instead of larger issue widths within a single processor. This results in an approximately linear area-to-issue-width relationship, since the area of each additional processor is essentially constant and it adds a constant number of issue slots. Using this relationship, an 8 x 2-issue CMP (16 total issue slots) has an area similar to that of a single 12-issue processor.

Second, they can require longer cycle times. Long, high-capacitance I/O wires span the large buffers, queues, and register files, and the extensive use of multiplexers and crossbars to interconnect these units adds more capacitance. Delays associated with these wires will probably dominate the delay along the CPU's critical path. The cycle-time impact of these structures can be mitigated by careful design using deep pipelining, by breaking the structures up into small, fast clusters of closely related components connected by short wires, or both. But deeper pipelining increases branch misprediction penalties, and clustering tends to reduce the processor's ability to find and exploit instruction-level parallelism. The CMP approach allows a fairly short cycle time to be targeted with relatively little design effort, since its hardware is naturally clustered: each of the small CPUs is already a very small, fast cluster of components. Since the operating system allocates a single software thread of control to each processor, the partitioning of work among the clusters is natural and requires no hardware to dynamically allocate instructions to different component clusters. This heavy reliance on software to direct instructions to clusters limits the amount of instruction-level parallelism that the entire CMP can dynamically exploit, but it allows the structures within each CPU to be small and fast.
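
A toy model captures the area argument made above. Quadratic growth for a monolithic core and linear growth with core count are the stated trends; the constants in the sketch are arbitrary assumptions, so only the scaling comparison is meaningful, not the absolute numbers.

```cpp
#include <cstdio>

// Assumed model: issue-dependent area grows as the square of issue
// width, plus a fixed per-core overhead.  Both constants are made up.
double core_area(int issue_width, double k = 1.0, double fixed = 10.0) {
    return fixed + k * issue_width * issue_width;
}

int main() {
    double superscalar12 = core_area(12);      // one 12-issue core
    double cmp8x2        = 8 * core_area(2);   // eight 2-issue cores
    std::printf("12-issue core: %.0f units; 8 x 2-issue CMP: %.0f units\n",
                superscalar12, cmp8x2);
    // Under these assumptions the 16-slot CMP fits in less area than
    // the 12-issue core, echoing the comparison made in the text.
    return 0;
}
```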

Since the cycle-time factors above are difficult to quantify, the evaluated superscalar and SMT architectures represent how these systems would perform if it were possible to build an optimal implementation with a fairly shallow pipeline and no clustering, a combination that would result in an unacceptably low clock rate in reality. This probably gives the CMP a handicap in the simulations.

Third, the CPU cores are complicated and composed of many closely interconnected components. As a result, design and verification costs will increase, since the cores must be designed and verified as single, large units. The CMP architecture uses a group of small, identical processors, which allows the design and verification costs for a single CPU core to be lower and amortizes those costs over a larger number of processor cores. It may also be possible to use the same core design across a family of processors simply by including more or fewer cores. With even more advanced IC technologies, the logic, wire, and design-complexity advantages will increasingly favor a multiprocessor implementation over a superscalar or SMT implementation.

Memory: A 12-issue superscalar or SMT processor can place large demands on the memory system. For example, to handle load and store instructions quickly enough, the processors would require a large primary data cache with four to six independent ports. The SMT processor requires more bandwidth from the primary cache than the superscalar processor, because its multiple independent threads typically allow the core to issue more loads and stores in each cycle, some from each thread. To accommodate these accesses, the superscalar and SMT architectures have 128-Kbyte, multibanked primary caches with a two-cycle latency, due to the size of the primary caches and the complexity of the bank interconnect. The CMP architecture instead features sixteen 16-Kbyte caches. The eight cores are completely independent and tightly integrated with their individual pairs of caches, another form of clustering, which leads to a simple, high-frequency design for the primary cache system. The small cache size and the tight connection to these caches allow single-cycle access. The rest of the memory system remains essentially unchanged, except that the secondary cache controller must add two extra cycles of secondary-cache latency to handle requests from multiple processors. To make a shared-memory multiprocessor, the data caches could be made write-through, or a MESI (modified, exclusive, shared, and invalid) cache-coherence protocol could be established between the primary data caches.
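
As a rough sketch of the first, simpler option, the fragment below models write-through primary caches with write-invalidation: every write is propagated to the shared secondary cache, and the written line is invalidated in the other cores' L1s, so no per-line MESI state machine is needed. The set-of-lines cache model and its interface are simplifications assumed for illustration.

```cpp
#include <array>
#include <cstddef>
#include <unordered_set>

constexpr int kCores = 8;   // matches the CMP described in the text

struct L1Cache {
    std::unordered_set<std::size_t> lines;   // line addresses present
    bool holds(std::size_t line) const { return lines.count(line) != 0; }
    void invalidate(std::size_t line)  { lines.erase(line); }
    void fill(std::size_t line)        { lines.insert(line); }
};

struct CmpMemorySystem {
    std::array<L1Cache, kCores> l1;

    void write(int core, std::size_t line) {
        // Write-through: the shared L2 always receives the new value
        // (L2 data itself is not modeled in this sketch).
        for (int c = 0; c < kCores; ++c)
            if (c != core) l1[c].invalidate(line);   // keep one copy
        l1[core].fill(line);   // the writer keeps (or gets) the line
    }

    bool read(int core, std::size_t line) {
        bool hit = l1[core].holds(line);
        if (!hit) l1[core].fill(line);   // miss: fetch from shared L2
        return hit;
    }
};
```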

Because the bandwidth to an on-chip cache can easily be made high enough to handle the write-through traffic, the simpler coherence scheme is chosen for the CMP. In this way, designers can implement a small-scale multiprocessor with very low interprocessor communication latency. To provide enough off-chip memory bandwidth for these high-performance processors, all simulations were made with main memory composed of multiple banks of Rambus DRAMs (RDRAMs), attached via multiple Rambus channels to each processor.

Compiler support: The main challenge for a compiler targeting the superscalar processor is finding enough instruction-level parallelism in applications to use a 12-issue processor effectively. Code reordering is fundamentally limited by true data dependences and control dependences within a thread of instructions. It is likely that most integer applications will be unable to use a 12-issue processor effectively, even with very aggressive branch prediction and advanced compiler support for exposing instruction-level parallelism. Limit studies with large instruction windows and perfect branch prediction have shown that only a limited number of instructions per cycle is achievable for general-purpose integer applications [9], and branch mispredictions will reduce this number further in a real processor. On the other hand, programmers must find thread-level parallelism in order to maximize CMP performance. The SMT also requires programmers to explicitly divide code into threads to get maximum performance, but, unlike the CMP, it can dynamically find more instruction-level parallelism if thread-level parallelism is limited. With current trends in parallelizing compilers, multithreaded operating systems, and programmer awareness of how to program parallel computers, however, these problems should prove less daunting in the future. Additionally, having all eight CPUs on a single chip allows designers to exploit thread-level parallelism even when threads communicate frequently. Frequent communication has been a limiting factor on today's multichip multiprocessors, preventing some parallel programs from attaining speedups, but the low communication latencies inherent in a single-chip microarchitecture allow speedups across a wide range of parallelism [4].

Hardware Performance and Comparison: In this section I compare the three architectures based on the simulations and experiments conducted by the various research groups; I present the results from refs [??] to draw conclusions for my study.

CMP versus Superscalar: There are two main concerns: (1) area and (2) time. An instruction window that enables the dynamic issue of instructions requires a large die area; the PA-8000, a 4-issue superscalar, devotes 20% of its die area solely to the instruction window, and in general the area requirement increases quadratically with issue width. An increase in issue width typically requires an increase in the number of ports in the register file, or alternatively the replication of the register file, as in the Alpha 21264. The number of datapaths between the functional units and the register files also increases quadratically with the issue width. A CMP requires extra hardware for speculation support, but its overhead for register communication is quite modest. The register bypass network (which forwards values directly from the outputs of functional units to their inputs, permitting back-to-back issue of data-dependent instructions) may be an important factor in determining the cycle time of future high-issue processors.

Other concerns: The inability to extract a significant amount of parallelism from an application leads to an uneven distribution of work among the different processors of a CMP. A CMP exploits parallelism better than a 12-issue superscalar in applications that are mostly loop-based and whose loops have few or no loop-carried dependences: each processor in the CMP executes an iteration and can usually issue instructions independently, unaffected by dependences on other threads. In the 12-issue superscalar, by contrast, the centralized instruction window is often clogged by instructions that are either data-dependent on long-latency FP operations or waiting on cache misses. On average, the IPC of a 4 x 4-issue CMP is nearly twice that of a 12-issue superscalar.

Superscalar: The superscalar is the norm among today's high-performance microprocessors, and the issue rate of these microprocessors has continued to increase over the past few years: the Compaq Alpha 21264, IBM PowerPC, Intel Pentium Pro, and MIPS R10000 all issue four instructions per cycle.

These processors use special hardware to dynamically identify independent instructions:
- maintaining a large pool of instructions in a large associative window;
- register renaming to eliminate false dependences;
- out-of-order issue (an instruction is issued as soon as its operands and a functional unit are available).

Thus parallelism is extracted only from the ILP of the program at run time. This requires centralized hardware structures that lengthen the critical path of the processor pipeline: the register renaming logic, the instruction window wake-up and select mechanism, the register bypass logic, and the long-latency interconnects of a centralized approach.

Chip multiprocessor: A CMP exploits thread-level parallelism and the increasing transistor count on a chip; fast on-chip communication at the register level will soon make CMPs popular. It is a decentralized architecture: the application is divided into multiple threads, ILP is exploited across them, and the multiple threads run on multiple simple processing units on a single chip. The CMP architecture offers:
- design simplicity: each of the processing units can be clocked fast, and the time-consuming design-validation phase is eased;
- fast communication among processing units: interconnects are localized, avoiding the long-latency interconnects of a centralized approach;
- better utilization of silicon space: the extra logic devoted to a centralized architecture is avoided, yielding higher overall issue bandwidth.

Olukotun et al. show how a CMP with eight 2-issue superscalar processing units would occupy the same area as a conventional 12-issue superscalar processor. A CMP is ideal for running multithreaded applications, but it may not give good performance when running sequential applications, since parallelizing compilers are successful only on a restricted class of applications, typically numeric ones, and so cannot handle a large class of sequential applications.

Speculation can help here, since a compiler must otherwise assume that inter-thread dependences exist whenever it cannot guarantee data independence among threads. Speculative execution improves performance but requires true memory-dependence violations to be handled; a technique for solving this problem is discussed in [21].

The three architectures compared are:

1. Superscalar: The superscalar processor, shown in Figure 1a, can dynamically issue up to 12 instructions per cycle.

2. Simultaneous Multithreading: The SMT processor, shown in Figure 1b, is identical to the superscalar except that it has eight separate program counters and executes instructions from up to eight different threads of control concurrently. The processor core dynamically allocates instruction fetch and execution resources among the different threads on a cycle-by-cycle basis to find as much thread-level and instruction-level parallelism as possible.

3. Chip Multiprocessor: The CMP, shown in Figure 1c, is composed of eight small 2-issue superscalar processors. It depends on thread-level parallelism, since its ability to find instruction-level parallelism is limited by the small size of each processor.

Characteristics of the superscalar, simultaneous multithreading, and chip multiprocessor architectures:

    Characteristic                              Superscalar    SMT             CMP
    Number of CPUs                              1              1               8
    CPU issue width                             12             12              2 per CPU
    Number of threads                           1              8               1 per CPU
    Architectural registers (integer and FP)    32             32 per thread   32 per CPU
    Branch predictor table size (entries)       32,768         32,768          8 x 4,096
    Return stack size                           64 entries     64 entries      8 x 8 entries
    I and D cache organization                  1 x 8 banks    1 x 8 banks     1 bank
    I and D cache sizes                         128 Kbytes     128 Kbytes      16 Kbytes per CPU
    I and D cache associativity                 4-way          4-way           4-way
    I and D cache access times (cycles)         2              2               1
    Secondary cache organization                1 x 8 banks    1 x 8 banks     1 x 8 banks
    Secondary cache associativity               4-way          4-way           4-way

Figure 2. Relative performance of superscalar, simultaneous multithreading, and chip multiprocessor architectures compared to a baseline, 2-issue superscalar architecture.

Performance results: Figure 2 shows the performance of the superscalar, SMT, and CMP architectures on the four benchmarks relative to a baseline architecture: a single 2-issue processor attached to the superscalar/SMT memory system.

The first two benchmarks show performance on applications with moderate memory behavior and either no thread-level parallelism (compress) or large amounts of thread-level parallelism (mpeg). The CMP achieved a nearly eight-times performance improvement over the single 2-issue processor. The separate primary caches are beneficial because all processors can access them in parallel; in a separate test with eight processors sharing a single cache, bank contention between accesses from different processors degraded performance significantly. The average access time to the primary cache alone went up from 1.1 to 5.7 cycles, mostly due to extra queuing delays at the contended banks, and overall performance dropped 24 percent. In contrast, the shared secondary cache is not a bottleneck in the CMP because it receives an order of magnitude fewer accesses. The SMT results showed similar trends: the speedups tracked the CMP results closely when modeling similar degrees of data-cache contention. Its nominal performance was similar to that of the CMP with a single primary cache, and performance improved 17 percent when primary-cache contention was temporarily deactivated. The multiple threads of control in the SMT allowed it to exploit thread-level parallelism, and its dynamic resource allocation made it competitive with the CMP even though it had fewer total issue slots. However, tomcatv's memory behavior highlighted a fundamental problem with the SMT architecture: the unified data cache is a bandwidth limitation. Making a data cache with enough banks or ports to keep up with the memory requirements of eight threads requires a more sophisticated crossbar network, which adds latency to every cache access and may not help if one particular bank is heavily accessed. The CMP's independent data caches avoid this problem but are not possible in an SMT. As with compress, the multiprogramming workload has limited amounts of instruction-level parallelism, so the superscalar architecture's speedup was only 35 percent over the baseline processor. Unlike compress, however, the multiprogramming workload has large amounts of process-level parallelism, which both the SMT and the CMP exploited effectively. This resulted in a linear eight-times speedup for the CMP. The SMT achieved nearly a seven-times speedup over the 2-issue baseline processor, more than the increase in the number of issue slots would indicate possible, because it efficiently utilized processor resources by interleaving threads cycle by cycle. These results show that the CMP achieves superior performance using relatively simple hardware.

Fine Comparison of Simultaneous Multithreading versus Single-Chip Multiprocessing: (These are the results as shown in ref [??].) This section compares the performance of simultaneous multithreading to small-scale, single-chip multiprocessing (MP). At the organizational level, the two approaches are extremely similar: both have multiple register sets, multiple functional units, and high issue bandwidth on a single chip. The key difference is in the way those resources are partitioned and scheduled: the multiprocessor statically partitions resources, devoting a fixed number of functional units to each thread, while the SM processor allows the partitioning to change every cycle. Clearly, scheduling is more complex for an SM processor; however, it is shown that in other areas the SM model requires fewer resources than multiprocessing to achieve a desired level of performance.

For these experiments, reasonably equivalent SM and MP configurations are used. For most of the comparisons, all or most of the following are kept equal: the number of register sets (i.e., the number of threads for SM and the number of processors for MP), the total issue bandwidth, and the specific functional unit configuration. A consequence of the last item is that the functional unit configuration is often optimized for the multiprocessor and represents an inefficient configuration for simultaneous multithreading. All experiments use 8 KB private instruction and data caches (per thread for SM, per processor for MP), a 256 KB 4-way set-associative shared second-level cache, and a 2 MB direct-mapped third-level cache. It is desirable to keep the caches constant across the comparisons, and private I and D caches are the most natural configuration for the multiprocessor. MPs are evaluated with 1, 2, and 4 issues per cycle on each processor; SM processors are evaluated with 4 and 8 issues per cycle. However, the SM: Four Issue model (defined above) is used for all of the SM measurements (i.e., each thread is limited to four issues per cycle). Using this model minimizes some of the inherent complexity differences between the SM and MP architectures: for example, an SM: Four Issue processor is similar to a single-threaded processor with 4 issues per cycle in terms of both the number of ports on each register file and the amount of inter-instruction dependence checking. In each experiment, the same version of the benchmarks is run on both the MP and SM models (compiled for a 4-issue, 4-functional-unit processor, which most closely matches the MP configuration); this typically favors the MP. It must be noted that, while in general the bias is in favor of the MP, the SM results may be optimistic in two respects: the amount of time required to schedule instructions onto functional units, and the shared cache access time.

The distance between the load/store units and the data cache can have a large impact on cache access time. The multiprocessor, with private caches and private load/store units, can minimize the distances between them; the SM processor cannot do so, even with private caches, because its load/store units are shared. However, two alternate configurations could eliminate this difference. Having eight load/store units (one private unit per thread, associated with a private cache) would still allow SM to match MP performance with fewer than half the total number of MP functional units (32 vs. 15). Alternatively, with 4 load/store units and 8 threads, a single cache/load-store combination could be statically shared among each set of 2 threads: threads 0 and 1 might share one load/store unit, with all accesses through that load/store unit going to the same cache, minimizing the distance between cache and load/store unit while still allowing resource sharing.

Figure ?? shows the results of the SM/MP comparison for various configurations. Tests A, B, and C compare the performance of the two schemes with an essentially unlimited number of functional units (FUs), i.e., a functional unit of each type is available to every issue slot. The number of register sets and the total issue bandwidth are constant within each experiment. In these models, the ratio of functional units (and threads) to issue bandwidth is high, so both configurations should be able to utilize most of their issue bandwidth; simultaneous multithreading, however, does so more effectively.

Test D repeats test A but limits the SM processor to a more reasonable configuration (the same 10-functional-unit configuration used throughout this paper). This configuration outperforms the multiprocessor by nearly as much as in test A, even though the SM configuration has 22 fewer functional units and requires fewer forwarding connections. In tests E and F, the MP is allowed a much larger total issue bandwidth. In test E, each MP processor can issue 4 instructions per cycle, for a total issue bandwidth of 32 across the 8 processors; each SM thread can also issue 4 instructions per cycle, but the 8 threads share only 8 issue slots. The results are similar despite the disparity in issue slots. In test F, the 4-thread, 8-issue SM slightly outperforms a 4-processor, 4-issue-per-processor MP, which has twice the total issue bandwidth. Simultaneous multithreading performs well in these tests, despite its handicap, because the MP is constrained with respect to which 4 instructions a single processor can issue in a single cycle. Test G shows the greater ability of SM to utilize a fixed number of functional units. Here both SM and MP have 8 functional units and 8 issues per cycle; however, while the SM is allowed 8 contexts (8 register sets), the MP is limited to two processors (2 register sets), because each processor must have at least one of each of the 4 functional unit types. Simultaneous multithreading's ability to drive up the utilization of a fixed number of functional units through the addition of thread contexts achieves more than 2.5 times the throughput.

These comparisons show that simultaneous multithreading outperforms single-chip multiprocessing in a variety of configurations because of its dynamic partitioning of functional units. More important, SM requires many fewer resources (functional units and instruction issue slots) to achieve a given performance level. For example, a single 8-thread, 8-issue SM processor with 10 functional units is 24% faster than the 8-processor, single-issue MP (test D), which has identical issue bandwidth but requires 32 functional units; to equal the throughput of that 8-thread, 8-issue SM, an MP system requires eight 4-issue processors (test E), which consume 32 functional units and 32 issue slots per cycle.

Finally, there are further advantages of SM over MP that are not shown by these experiments.

Performance with few threads: These results show only the performance at maximum utilization. The advantage of SM over MP grows as some of the contexts (processors) become unutilized: an idle processor leaves 1/p of an MP idle, while with SM the other threads can expand to use the available resources.

In the last case, a processor and all of its resources are lost whenever a thread experiences a latency orders of magnitude larger than those simulated (e.g., I/O). Granularity and flexibility of design: the configuration options are much richer with SM, because the units of design have finer granularity. With a multiprocessor, computing power would typically be added in units of entire processors. With simultaneous multithreading, it is possible to benefit from the addition of a single resource, such as a functional unit, a register context, or an instruction issue slot; furthermore, all threads would be able to share in using that resource. The comparisons above did not take advantage of this flexibility. Processor designers, taking full advantage of the configurability of simultaneous multithreading, should be able to construct configurations that outdistance multiprocessing even further.

Performance Comparison of SMT and CMP Using Parallel Workloads. Why parallel applications? SMT is most effective when threads have complementary hardware resource requirements. Multiprogrammed workloads and workloads consisting of parallel applications both provide TLP via independent streams of control, but they compete for hardware resources differently. Because a multiprogrammed workload (used in our previous work [Tullsen et al. 1995; 1996]) does not share memory references across threads, it places more stress on the caches. Furthermore, its threads have different instruction execution patterns, causing interference in the branch prediction hardware. On the other hand, multiprogrammed workloads are less likely to compete for identical functional units. Although parallel applications have the benefit of sharing the caches and branch prediction hardware, they are an interesting and different test of SMT for several reasons. First, unlike the multiprogrammed workload, all threads in a parallel application execute the same code and, therefore, have similar execution resource requirements, memory reference patterns, and levels of ILP.

Table V. Throughput comparison of MP2, MP4, and SMT for varying numbers of threads, measured in instructions per cycle.

Because all threads tend to have the same resource needs at the same time, there is potentially more contention for these resources than under a multiprogrammed workload. For example, a particular loop may have a large degree of instruction-level parallelism, so each thread will require a large number of renaming registers and functional units. Because all threads have the same resource needs, they may exacerbate or create bottlenecks in these resources. Parallel applications are therefore particularly appropriate for this study, which focuses on these execution resources. Second, parallel applications illustrate the promise of SMT as an architecture for improving the performance of single applications. By using threads to parallelize programs, SMT can improve processor utilization and, more important, achieve program speedups. Finally, parallel applications are a natural workload for traditional parallel architectures and therefore serve as a fair basis for comparing SMT and multiprocessors. For the sake of comparison, in Section 7 we also briefly contrast our parallel results with the multiprogrammed results from Tullsen et al. [1996]. Another set of experiments is shown in ref. [??]; the processor instruction latencies and memory hierarchy details are as shown in Figure ??.

Contributions regarding design tradeoffs for future high-end processors. First, the performance costs of resource partitioning for various multiprocessor configurations have been identified. By partitioning execution resources between processors, multiprocessors enforce the distinction between instruction- and thread-level parallelism. In this study, we examined two MP design choices with similar hardware cost in terms of execution resources: one design with more resources per processor (MP2) and one with twice as many processors but fewer resources on each (MP4). Our results showed that both alternatives frequently suffered from inefficient use of their resources and that improvements could be obtained only with costly upgrades in processor resources. The MP designs were unable to adapt to varying levels of ILP and TLP, so their performance depended heavily on the parallelism characteristics of the applications. For programs with more ILP, MP2 outperformed MP4; for programs with less ILP, MP4 was superior because it exploited more thread-level parallelism. To maximize performance on an MP, compilers and parallel programmers are therefore faced with the difficult task of partitioning program parallelism (ILP and TLP) in a manner that matches the physical partitioning of resources.
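A toy throughput model makes the partitioning argument concrete. The sketch below is our own simplification, not the study's simulator; it assumes MP2 denotes two wider (4-issue) processors and MP4 four narrower (2-issue) processors, matching the description above, and it caps each thread at its available ILP and each processor at its issue width:

def mp_throughput(n_cpus, issue_width, n_threads, ilp_per_thread):
    # Static partitioning: each CPU runs at most one thread, and a
    # thread can never use more issue slots than its own CPU provides.
    busy_cpus = min(n_cpus, n_threads)
    return busy_cpus * min(issue_width, ilp_per_thread)

def smt_throughput(total_issue, n_contexts, n_threads, ilp_per_thread):
    # Dynamic sharing: all resident threads pool their parallelism
    # into one wide core, limited only by the shared issue bandwidth.
    active = min(n_contexts, n_threads)
    return min(total_issue, active * ilp_per_thread)

# High ILP, few threads: MP2 beats MP4 (8 vs. 4 instructions per cycle).
print(mp_throughput(2, 4, 2, 4), mp_throughput(4, 2, 2, 4))
# Low ILP, many threads: MP4 beats MP2 (4 vs. 8).
print(mp_throughput(2, 4, 8, 2), mp_throughput(4, 2, 8, 2))
# An 8-issue, 8-context SMT reaches 8 in both cases by trading the
# two forms of parallelism against each other.
print(smt_throughput(8, 8, 2, 4), smt_throughput(8, 8, 8, 2))

Each static design has a workload regime in which half its issue bandwidth sits idle, while the pooled design saturates in both regimes; this is the adaptation the paragraph above describes.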

Second, it has been illustrated that, in contrast, simultaneous multithreading allows compilers and programmers to focus on extracting whatever parallelism exists, by treating instruction- and thread-level parallelism equally. ILP and TLP are fundamentally identical: both represent independent instructions that can be used to increase processor utilization and improve performance. SMT has the flexibility to use both forms of parallelism interchangeably, because threads can share resources dynamically. Rather than adding more resources to improve performance further, existing resources are used more effectively. By using more hardware contexts, SMT can take advantage of TLP to expose more parallelism and attain an average throughput of 4.88 instructions per cycle, while increasing its performance edge over MP2 and MP4 to 64% and 52%, respectively. Third, our results demonstrate that SMT can achieve large program speedups on parallel applications. Even though these parallel threads have greater potential for interference because of similar resource usage patterns (including memory references and demands for renaming registers and functional units), simultaneous multithreading can compensate for these potential conflicts. We found that interthread cache interference, bank contention, and branch prediction interference on an SMT processor had only minimal effects on performance. The latency-hiding characteristics of simultaneous multithreading allow it to achieve a 2.68 average speedup over a single MP2 processor, whereas MP2 and MP4 speedups are limited to 1.63 and 1.76, respectively. The bottom line is that simultaneous multithreading makes better use of on-chip resources to run parallel applications effectively. For these reasons, as well as the performance and complexity results shown, we believe that when component densities permit us to put multiple hardware contexts and wide issue bandwidth on a single chip, simultaneous multithreading represents the most efficient organization of those resources.
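As a rough consistency check on the quoted figures (our own arithmetic, assuming the 64% and 52% edges are relative throughput over the same workloads), the implied MP baselines follow directly:

smt_ipc = 4.88
mp2_ipc = smt_ipc / 1.64  # SMT 64% faster than MP2 -> about 2.98 IPC
mp4_ipc = smt_ipc / 1.52  # SMT 52% faster than MP4 -> about 3.21 IPC
print(f"MP2 ~ {mp2_ipc:.2f} IPC, MP4 ~ {mp4_ipc:.2f} IPC")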

1. Superscalar: The superscalar processor, shown in Figure 1a, can dynamically issue up to 12 instructions per cycle.
2. Simultaneous Multithreading: The SMT processor, shown in Figure 1b, is identical to the superscalar except that it has eight separate program counters and executes instructions from up to eight different threads of control concurrently. The processor core dynamically allocates instruction fetch and execution resources among the different threads on a cycle-by-cycle basis, to find as much thread-level and instruction-level parallelism as possible.
3. Chip Multiprocessor: The CMP, shown in Figure 1c, is composed of eight small 2-issue superscalar processors. This design depends on thread-level parallelism, since its ability to find instruction-level parallelism is limited by the small size of each processor.

Characteristics of the superscalar, simultaneous multithreading, and chip multiprocessor architectures (cells whose values were lost are left blank):

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and FP) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and FP), per CPU | | |
Instruction window size, per CPU | | |
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | | |
I and D cache access times (cycles) | | |
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks

Secondary cache size (Mbytes) | | |
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | | |
Secondary cache access time (cycles) | | |
Secondary cache occupancy per access (cycles) | | |
Memory organization (no. of banks) | | |
Memory access time (cycles) | | |
Memory occupancy per access (cycles) | | |

Figure 2. Relative performance of superscalar, simultaneous multithreading, and chip multiprocessor architectures compared to a baseline 2-issue superscalar architecture.

The CMP achieved nearly an eight-times performance improvement over the single 2-issue processor. The separate primary caches are beneficial because they can be accessed by all processors in parallel. In a separate test with eight processors sharing a single cache, bank contention between accesses from different processors degraded performance significantly: the average memory access time to the primary cache alone went up from 1.1 to 5.7 cycles, mostly because of extra queuing delays at the contended banks, and overall performance dropped 24 percent. In contrast, the shared secondary cache was not a bottleneck in the CMP, because it received an order of magnitude fewer accesses. The SMT results showed similar trends: the speedups tracked the CMP results closely when modeling similar degrees of data cache contention.
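The bank-contention effect reported above can be illustrated with a small queuing sketch (a toy model of our own, not the simulator used in the study): each cycle, every processor sends one access to a randomly chosen bank, and accesses that collide on the same bank serialize, one per cycle.

import random

def avg_primary_access_time(n_cpus, n_banks, base_latency=1, n_cycles=10000):
    # Average access latency when same-bank accesses queue up:
    # the k-th access queued at a bank waits k extra cycles.
    total_latency = 0
    total_accesses = 0
    for _ in range(n_cycles):
        per_bank = [0] * n_banks
        for _ in range(n_cpus):
            per_bank[random.randrange(n_banks)] += 1
        for hits in per_bank:
            total_latency += sum(base_latency + k for k in range(hits))
            total_accesses += hits
    return total_latency / total_accesses

print(avg_primary_access_time(1, 8))  # one requester: 1.0 cycles, no collisions
print(avg_primary_access_time(8, 8))  # eight sharers: average climbs above 1.0

Because this toy drains every queue within a cycle, it understates the effect; in the measured configuration, where queues persist across cycles, contention drove the average primary-cache access time from 1.1 to 5.7 cycles.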
