
CARNEGIE MELLON
Department of Electrical and Computer Engineering

Simultaneous Multithreading's Real Effect on Cache and Branch Prediction Performance

John A. Miller
1996
Advisor: Prof. Nagle

Simultaneous Multithreading's Real Effect on Cache and Branch Prediction Performance

John Alan Miller

1.0 Abstract

The lack of single-application instruction-level parallelism significantly limits the performance of superscalar microprocessors, which often find up to 80% of their hardware functional units idle. Simultaneous Multithreading (SMT) is an architectural technique that attempts to utilize the idle functional units by enabling multiple threads to run concurrently on a single superscalar processor. Unfortunately, the increased number of threads also increases the load on on-chip memory structures such as caches, Translation Lookaside Buffers (TLBs), and branch target buffers (BTBs). Unlike functional units, these hardware resources often run at near-peak utilization on a single-threaded machine, making them the limiting factor in single-thread performance and potentially preventing SMT from achieving significant performance improvement.

This study examines the impact SMT has on the on-chip caches and branch target buffers. Unlike previous SMT studies, which have used application-only programs with small working sets, this study measures SMT performance under large applications, including operating system code. The results show that SMT can significantly improve throughput, even under memory-intensive workloads. SMT increased IPC from 1.03 at one thread to 1.14 and 1.87 at two and four threads respectively, although these numbers are about half of what was obtained with the SPEC92 benchmark suite. There is significant sharing of operating system code between different threads; without this sharing, SMT with two threads performs worse than a single thread. Finally, the choice of threads can have a significant impact on the performance of the SMT system, which implies that choosing which threads are executed together is important.

2.0 Introduction

Current architectures attempt to maximize Instructions Per Cycle (IPC) by finding all possible Instruction Level Parallelism (ILP) within a single program. Unfortunately, it is becoming increasingly difficult to find enough single-thread parallelism to continue increasing IPC beyond its current level of two [Tullsen96]. One symptom of this is the fact that the utilization of functional units in today's processors has dropped dramatically: most processors today use less than 40% of their functional unit bandwidth [Tullsen95]. Multithreading tries to offset this by ignoring single-thread ILP and concentrating on increasing overall IPC. Various forms of multithreading have been proposed [Smith78, Papadopoulos90, Alverson90, Agarwal90, Tullsen96], enabling significant advances in increasing parallelism. The increases are achieved by allowing multiple programs (threads¹) to execute concurrently on a processor, filling the stall cycles and unused issue bandwidth of one thread with useful work from another thread. Multithreading has the added benefit of increasing processor functional unit utilization [Agarwal90]. One of the more advanced methods of multithreading, Simultaneous Multithreading (SMT), claims to achieve an IPC as high as 5.4 for an eight-threaded machine [Tullsen96].

Despite these benefits, multithreading exhibits several inherent drawbacks, such as a large demand for memory bandwidth, increased cache conflicts between threads, and increased strain on the branch prediction units. These hardware resources are already bottlenecks in many of today's systems [Uhlig95], and SMT's larger working sets simply increase the load on the system's precious on-chip memory resources. However, SMT's ability to switch between different threads provides it with some degree of latency tolerance, possibly enough to overcome the increased demand on on-chip memory structures.

1. Note that "thread" is used here to represent a hardware context; simultaneous multithreading requires that threads not share data, to avoid coherency problems. This is different from operating system "threads", which often share the same memory space.

Unfortunately, previous SMT studies have relied on simple workloads that do not tax the memory system, leaving unanswered two fundamental questions:

1. How much does SMT increase the demand on on-chip memory structures?
2. Can SMT's tolerance for latency hide any performance loss due to larger working sets?

These questions are important because if SMT increases the load on on-chip cache structures beyond what it can hide, it is possible that SMT's overall performance would be no greater, or even less, than that of single-threaded systems. This work addresses these questions by focusing on the impact SMT has on on-chip memory structures under realistic workloads. The workloads, from the Instruction Benchmark Suite (IBS), are used to drive a custom SMT memory-system simulator that measures the impact various degrees of multithreading have on caches, branch address buffers, and overall system performance.

The rest of the paper is organized as follows. Section 3 gives a brief methodology for the simulations in this study. Section 4 shows the results and analysis. Section 5 examines previous work related to multithreading and SMT and compares it to the areas studied in this paper. Section 6 presents the conclusions. Appendix A discusses the SMACS simulator and its traces. Section 8 contains references.

3.0 Methodology

To examine the cache conflicts and branch prediction performance in an SMT machine, a trace-driven simulator, the Simultaneous Multithreaded Adjustable Cache Simulator (SMACS), was developed. SMACS is a cycle-by-cycle simulator that fully models the on-chip memory structures and issue logic of an SMT architecture. SMACS includes instruction and data fetching, caching, and CPU data dependencies, accurately modeling the characteristics of an SMT memory system. SMACS does not model functional unit behavior; [Tullsen96] showed that modeling functional units was not needed to accurately measure performance, so this drawback has little impact on the trends found in this paper.

Trace Name    Description
Mpeg_play     mpeg_play version 2.0, 85 frames from a compressed file
Jpeg_play     xloadimage version 3.0, displays two JPEG images
real_gcc      GNU C compiler, version 2.6
Verilog       Verilog-XL version 1.6b, simulating a microprocessor
Groff         GNU C++ implementation of nroff, version 1.09
Kenbus        SPEC SDM benchmark suite, simultaneous multi-user code development
Ousterhout    John Ousterhout's benchmark suite from [Ousterhout89]
Sdet          Multiprocess benchmark from the SPEC SDM suite

Table 1: IBS Description

SMACS is also not an out-of-order machine. Modeling an out-of-order machine would alter the numbers presented in this paper, increasing all relevant IPCs. This might even increase single-thread IPC more than in the multiple-thread cases, though this is thought to be a second-order effect that would not affect the trends found in this paper. The one exception is stores: stores, which complete from a write buffer, do complete out of order with respect to the other instructions.

The workloads used in this study are from the IBS (Instruction Benchmark Suite) [Nagle92, Uhlig95] and contain both user and operating system references from the Mach operating system. The workloads cover a range of widely-used general-purpose applications, including file system and graphics software (see Table 1). A complete description of how the workloads were used is included in Section 7.

In order for SMT simulations to make sense, the hardware models that simulate multiple threads must be comparable to the hardware available to a single thread. Therefore, all of the simulations were run with identical machine resources. For this reason a single-thread simulation might have more resources than it can use, artificially making single-thread performance relatively low compared to the other simulations.

The level-one cache size was varied throughout the simulations in order to determine the effect of cache size on the parameters studied. No other variables were changed with respect to the cache. Table 2 summarizes the values of the cache structure used in all of the simulations in this paper. See Appendix A for a complete description of the SMACS simulator and a list of all of the other variables in the simulator.

                Instruction Level 1   Data Level 1   Level 2      Level 3
Size            var.                  var.           128K bytes   2M bytes
Associativity   1                     1              1            1
Miss delay      6 cycles              6 cycles       15 cycles    50 cycles

Table 2: Cache Characteristics

To examine branch prediction performance, a standard two-level scheme was added to SMACS. The scheme was based on the [Sechrest96] scheme, which in turn came from [Yeh92]. The scheme's values were set to resemble [Tullsen96]. The simulator models only the branch prediction itself, so SMACS does not simulate penalties associated with branch prediction, and no wrong-path execution occurs. Including these two factors would increase the accuracy of the IPC study but was beyond this study's scope. Excluding them does not greatly affect the branch prediction results, because doing so does not change the pattern of the branches in the study.

4.0 Results and Analysis

4.1 Increases in On-Chip Cache Conflicts

SMT's ability to switch between different instruction streams automatically provides a higher degree of latency tolerance. When one thread misses in the cache, an SMT processor will try to fill the stalled thread's unused issue slots with another thread's instructions. In essence, SMT decouples overall machine throughput from any single thread's performance. However, SMT's execution of multiple threads forces multiple working sets to reside in the same physical cache. This increases the overall working set size, which is now the sum of every single thread's working set, potentially reducing the effectiveness of the cache to the point where every thread thrashes. In this case, no thread can make forward progress and SMT's performance may actually drop below single-thread performance.
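To make this contention concrete, the following minimal sketch (in Python; it is an illustration, not the actual SMACS source) models the direct-mapped, 32-byte-line cache of Table 2 and shows how two threads' working sets, each of which fits in the cache alone, can thrash when interleaved. The address streams are hypothetical.

```python
# A minimal sketch (not SMACS source) of a direct-mapped, 32-byte-line cache,
# as in Table 2, illustrating how two threads' working sets contend for one
# physical cache.

LINE_SIZE = 32  # bytes, as in all simulations in this paper

class DirectMappedCache:
    def __init__(self, size_bytes):
        self.num_lines = size_bytes // LINE_SIZE
        self.tags = [None] * self.num_lines      # one tag per line

    def access(self, addr):
        """Return True on a hit; on a miss, install the new line."""
        line_addr = addr // LINE_SIZE
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag                   # direct-mapped: evict and fill
        return False

def miss_rate(cache, trace):
    misses = sum(0 if cache.access(a) else 1 for a in trace)
    return misses / len(trace)

# Two hypothetical threads whose 8K working sets each fit in an 8K cache alone,
# but map to the same lines when interleaved, so each access evicts the other
# thread's line.
thread_a = [i * LINE_SIZE for i in range(256)] * 4
thread_b = [i * LINE_SIZE + 8192 for i in range(256)] * 4
interleaved = [a for pair in zip(thread_a, thread_b) for a in pair]
print(miss_rate(DirectMappedCache(8192), thread_a))     # 0.25: cold misses only
print(miss_rate(DirectMappedCache(8192), interleaved))  # 1.0: constant thrashing
```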

Figure 1: Data cache miss rates. Run of SMACS with the operating system removed from the IBS traces. Line Size = 32 bytes, Associativity = 1.

Figure 2: Instruction cache miss rates. Run of SMACS with the operating system removed from the IBS traces. Line Size = 32 bytes, Associativity = 1.

To determine the impact real workloads have on an SMT cache system and on overall performance, I used the SMACS simulator to model data and instruction cache performance on the IBS benchmark suite. The first set of experiments, shown in Figures 1 and 2, measures the cache miss rates for user-only references (i.e., only the application, not the operating system). As expected, both the instruction and data cache miss rates increase with the number of concurrent threads and confirm the trends found in [Tullsen96]. In the data cache, the miss rate almost doubles between one and four threads, while the instruction cache miss rate doubles between just one and two threads. These are significant increases, especially in the instruction cache, where a thread cannot continue execution until its instruction cache miss is resolved.

More importantly, the raw cache miss rates are very high. While Tullsen's SPEC workload showed instruction cache miss rates as low as 0.6% for a 32K cache and data cache miss rates as low as 1.2%, our miss rates are 2.0% (instruction cache) and 5.7% (data cache) for a single thread. Worse, our 2-thread rates are 4.5% (instruction cache) and 7.8% (data cache) while Tullsen's are less than 1.7% (instruction

cache) and 2.5% (data cache). These high cache miss rates suggest that SMT may have some problems hiding all of the misses generated by these user-only references.

4.2 Impact of the Operating System on SMT

Numerous previous works have shown that operating system references significantly increase the cache miss rates for applications that rely on operating system services [Nagle92, Uhlig95]. Figure 3 shows the difference in cache miss rates for the IBS workloads with and without operating system references. In both the instruction and data caches, the miss rates increase by at least 20% when the operating system is included. It is almost never the case that including operating system activity reduces the cache miss rate in a single-threaded processor.

Figure 3: Operating system effect on single threads. Run of SMACS with only single threads from IBS. OS contains operating system references; no OS does not contain operating system references. Line Size = 32 bytes, Associativity = 1.

In SMT, however, the impact of operating system references is less clear. Operating system references do increase the size of the working set, but multiple threads can share all code segments and possibly some of the operating system data. For example, if one thread calls the operating system to perform some service,

it will warm the cache with the operating system code. Another thread also making requests for operating system services will share a (potentially significant) portion of the operating system code brought into the cache by the first thread. Given a four-threaded machine and applications that each spend 25% of their time in the operating system [Nagle92], it is probable that at any given time at least one thread will be executing operating system code. This cross-thread sharing creates very different behavior from conventional single-threaded systems, which typically execute infrequent operating system requests that often are not in the cache and, when executed, flush significant portions of the cache. This makes the operating system's influence on SMT very important, and more important than in single-threaded systems.

Figure 4: Data Cache Miss Rates. Run of SMACS with the operating system from the IBS traces compared with the operating system removed. OS contains operating system references; no OS does not contain operating system references. Line Size = 32 bytes, Associativity = 1. Tullsen's values taken from [Tullsen96] Table 3, and extrapolated from misses/instruction up to misses/data.

Figure 5: Instruction Cache Miss Rates. Run of SMACS with the operating system from the IBS traces compared with the operating system removed. OS contains operating system references; no OS does not contain operating system references; OS no sharing contains operating system references but does not allow sharing between threads. Line Size = 32 bytes, Associativity = 1. Tullsen's values taken from [Tullsen96] Table 3.

To measure the impact of operating system references on SMT's cache performance, I reran the experiments in Figures 1 and 2 using the IBS workloads and included the operating system references. Figure 4 shows the results for data references. In our model, there was no sharing between operating system data

references and hence the data cache miss rates increased significantly when operating system data references were included. Therefore, Figure 4 represents a worst-case upper bound on data cache miss rates.

For the instruction cache, sharing of operating system references between threads was supported by SMACS. The single-thread case shows the well-known increase between user-only and user+operating-system references. For the two- and four-thread cases, the results are much more surprising. For small caches, including operating system references does increase the cache miss rate a small amount, but once the caches become large enough to retain some of the operating system's working set, operating system sharing actually improves the miss rate beyond the user-only trace. For the 2-thread case, the cross-over point is at 32K. For the 4-thread case, it appears that the cross-over would occur just above 64K (just beyond the range of our experiments). This inter-task sharing represents a significant performance win for SMT, potentially mitigating some of the performance problems caused by the increased working set.

4.3 The Effect of SMT on IPC and Overall Performance

SMT's ability to switch between different instruction streams allows it to better tolerate cache misses, possibly mitigating any increased cache load caused by the larger working sets of multiple threads. This poses the question of which is the more dominant factor: latency tolerance or increased working set size? This is an important question because if the increased strain on on-chip caches is larger than SMT's tolerance for increased cache miss rates, then SMT may not see a significant performance improvement.

To understand the system-level impact of SMT's memory system, we used SMACS to measure the Instructions Per Cycle (IPC) for one, two, and four threads. The results in Figure 6 show¹ that the increase in IPC between one and two threads (with the operating system) is about 15%. However, any increase in IPC means that SMT has improved the performance of the system. For the four-thread case, the performance improvements are much more significant, with IPC almost doubling over the single-threaded case.

1. The simulations did neglect out-of-order execution, the latency of functional units, and the TLB. However, adding these factors should impact the IPC performance of each thread about equally.

Figure 6: IPC. Run of SMACS with the operating system from the IBS traces compared with the operating system removed. OS contains operating system references; no OS does not contain operating system references. Line Size = 32 bytes, Associativity = 1.

I believe that the significant increase is due to: 1) the increased tolerance to cache misses; and 2) the increased sharing of operating system code. The four-threaded machine has more threads to run when a cache miss occurs, and the probability of the cache containing operating system code can actually increase with more threads (it is also dependent upon the cache size).

Using the data from Figure 6 and separate simulations, I computed the total run time for each of the systems (Figure 7). The results show that a two-threaded SMT architecture would reduce the total run time of two applications by about 20%, while the four-threaded SMT machine reduces the total run time by up to 55%.

Figure 7: Run Time. Comparison of the run times of four programs (Kenbus, Verilog, Sdet, Mpeg_play), all run with the operating system, under different degrees of multithreading. (one thread) The run times of the four programs added together. (two threads) Two threads are started together; when one finishes another is added, and when only one is left the processor switches into single-thread mode. (four threads) All four threads are started at the same time; when two threads finish the processor switches to two-thread mode, and so on. Switching modes (from two threads to a single thread, etc.) can be justified by looking at [Tullsen96], in which the processor can switch between modes. (Line Size = 32 bytes, Associativity = 1, Cache Size = 32K)

4.4 Does the Choice of Threads Matter?

Most of the simulations in this study ran the same two or four threads together. The worst-case single threads were chosen to try to get a lower bound on the performance numbers. However, the two worst-case single threads did not turn out to provide the worst-case two-threaded simulation.

This can be explained because threads with a larger operating system component do better due to increased sharing, which examining single-thread performance alone does not take into account.

Figures 8 and 9 contain the results of a study in which half of the combinations of two threads (out of eight) were simulated together, to evaluate the variation in instruction and data cache performance due to thread choice. The results showed significant variation over the spectrum of combinations: a 30% variation in the data cache and a 40% variation in the instruction cache. This points to the importance of thread execution choice in an SMT environment. Two threads with large amounts of operating system references (sharing in the caches) might do very well together, but run either of them with a thread with very little operating system activity and performance might degrade. A sketch of this experiment's design appears below.
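The following sketch shows the shape of the experiment. Here run_smacs is a hypothetical driver standing in for an actual SMACS invocation, assumed to return a cache miss rate for a given pair of IBS traces, and taking every other pair is one plausible way to pick half of the 28 possible pairs; the paper does not say how its 14 pairs were chosen.

```python
# A sketch of the Section 4.4 experiment design; run_smacs() is a hypothetical
# stand-in for the real simulator driver.
from itertools import combinations

TRACES = ["Mpeg_play", "Jpeg_play", "real_gcc", "Verilog",
          "Groff", "Kenbus", "Ousterhout", "Sdet"]

def pair_variation(run_smacs, share_os=True):
    pairs = list(combinations(TRACES, 2))[::2]     # 14 of the 28 possible pairs
    rates = [run_smacs(pair, share_os=share_os) for pair in pairs]
    mean = sum(rates) / len(rates)
    # Maximum deviation from the average, as a fraction of the average.
    return max(abs(r - mean) for r in rates) / mean
```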

Figure 8: Thread Variation (data cache). 14 different sets of traces were run through a two-threaded machine. The lines are the averages of the miss rates. x: points with the operating system (solid line). o: points without operating system sharing (dotted line). Note that the operating system was not removed for this simulation. (Line Size = 32 bytes, Associativity = 1, Cache Size = 16K)

Figure 9: Thread Variation (instruction cache). 14 different sets of traces were run through a two-threaded machine. The lines are the averages of the miss rates. x: points with the operating system (solid line). o: points without operating system sharing (dotted line). Note that the operating system was not removed for this simulation. (Line Size = 32 bytes, Associativity = 1, Cache Size = 16K)

The first column in both figures shows the traces used in the other simulations in this paper. These traces do appear to be close to the worst-case threads, so they constitute a general lower bound on performance. This implies that other thread choices could improve two-thread performance over single-thread performance.

4.5 Branch Prediction

Branch prediction is a very serious problem in today's processors. Branch penalties are increasing, and prediction schemes are getting more complex in an effort to reduce the effect of the long penalties [Sechrest96]. For an SMT machine branch prediction is even more important. In order to incorporate a bigger register file, the pipeline of an SMT machine may have to get longer [Tullsen96]. This increases the

branch penalty and puts even more strain on making the correct prediction and finding the target address to compensate for the increased penalties. The effects of branch prediction could limit the effectiveness of SMT if its latency-tolerating characteristics are not greater than the additional strain on the system from an increased number of threads.

Few previous works have addressed the issue of large working sets and operating system effects on branch prediction in an SMT environment. It was not until [Gloy96, Sechrest96] that the impact of the operating system with large working sets was understood in the context of branch prediction. This work extends the SMT work on branch prediction by evaluating it with traces from the IBS benchmark suite, which contain large working sets and operating system references.

A two-level scheme was introduced into the SMACS simulator. In the first level, a BTB (Branch Target Buffer) lookup is performed on the branch address, indexed by the low-order bits of the address (with the possibility of associativity). The BTB entry contains the last target address of the branch or jump and a small amount of history bits recording whether the last branches from that BTB entry were taken or not taken. The second level uses this history and the lower part of the address as an index into an array of two-bit counters. The counter counts up for a taken branch and down for a not-taken branch, and the branch is then predicted based on the value of the counter [Yeh92]. Refer to Section 7.11 for details about the branch prediction scheme.

The study of branch prediction in this paper was separated into two parts, roughly corresponding to the two levels of the two-level scheme. The first part is the misprediction rate, that is, how many times the branch direction was predicted incorrectly. The second part is the significant BTB miss rate: assuming the branch direction was predicted correctly, does the BTB contain the correct target address for the branch? See Section 7.11 for more details on branch prediction. Since it is clear from the above sections that including operating system references is the desired way to evaluate SMT, the study of branch prediction always included the operating system and did not remove it.
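To make the scheme and the two metrics concrete, the sketch below implements a direct-mapped version in Python with the paper's parameters (2048 two-bit counters, history length 3). It is a minimal reconstruction from the description above, not SMACS source; the index-combining function and tag handling are assumptions.

```python
HISTORY_LEN = 3
NUM_COUNTERS = 2048

class TwoLevelPredictor:
    def __init__(self, btb_entries=2048):
        self.btb_entries = btb_entries
        # Each BTB entry: last target, tag, and per-entry branch history bits.
        self.btb = [{"tag": None, "target": None, "history": 0}
                    for _ in range(btb_entries)]
        self.counters = [1] * NUM_COUNTERS   # two-bit counters, weakly not-taken

    def _entry(self, pc):
        # First level: index by the low-order bits of the branch address.
        return self.btb[(pc >> 2) % self.btb_entries]

    def _counter_index(self, pc, history):
        # Second level: history plus low address bits index the counter array
        # (this particular combining function is an assumption).
        return ((history << 8) ^ (pc >> 2)) % NUM_COUNTERS

    def predict(self, pc):
        e = self._entry(pc)
        taken = self.counters[self._counter_index(pc, e["history"])] >= 2
        target = e["target"] if e["tag"] == pc else None  # BTB miss: no target
        return taken, target

    def update(self, pc, taken, target):
        e = self._entry(pc)
        i = self._counter_index(pc, e["history"])
        # Count up for a taken branch, down for not taken, saturating at 0 and 3.
        self.counters[i] = min(3, self.counters[i] + 1) if taken \
            else max(0, self.counters[i] - 1)
        e["history"] = ((e["history"] << 1) | int(taken)) & ((1 << HISTORY_LEN) - 1)
        e["tag"], e["target"] = pc, target

def classify(pred_taken, pred_target, taken, target):
    """Separate the paper's two metrics for one resolved branch."""
    if pred_taken != taken:
        return "misprediction"
    if taken and pred_target != target:
        return "significant BTB miss"   # direction right, target not in BTB
    return "correct"
```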

Figure 10: Branch Misprediction Rate. Misprediction rates over all branches for a shared operating system, with respect to BTB size. (a): direct mapped; (b): four-way associative. (Number of Counters = 2048, History Length = 3) Tullsen's values taken from [Tullsen96] Table 3.

4.5.1 Branch Misprediction Rate

Many papers today evaluate branch prediction by examining the branch misprediction rate; for this reason the branch misprediction rate was separated from the BTB miss rate and evaluated on its own. Figure 10 shows the results of a study of the branch misprediction rate when operating system sharing is used, over a range of BTB sizes. The results show a large increase in the miss rates due to SMT with a direct-mapped BTB. The increase is about 4% for two threads and 5% for four. Four-way associativity had little effect on the misprediction rate, with less than a 2% change even at the largest BTB sizes. Figure 10b also contains results from [Tullsen96] and shows that Tullsen achieved a significantly lower misprediction rate with his use of SPEC92 and its small working set size.

The accuracy of the prediction does depend on the number of BTB entries. This is because the first-level lookup determines which counter in the second-level table is used. Increasing the number of BTB entries means less aliasing on the history in the first level, improving the accuracy of the second level.

Figure 11: BTB Miss Rate. Significant BTB miss rate (misses in the BTB given that the branch was predicted correctly and was a taken branch) over all branches, for a shared operating system, with respect to BTB size. (a): direct mapped; (b): four-way associative. (Number of Counters = 2048, History Length = 3)

4.5.2 Significant BTB Misses

BTB misses are important because a miss implies that even though the branch direction was predicted correctly, the target instruction could not be fetched because its address was not known. SMT can have a large effect on BTB misses because the BTB is just a cache, and the additional threads might overload it.

The study performed in Section 4.5.1 was extended to the BTB miss rate¹; the results appear in Figure 11. At larger BTB sizes, the BTB miss rate increased from 2.2% (one thread) to 4.8% (two threads) and (four threads) with a direct-mapped BTB. However, moving to a four-way associative BTB fully compensates for the increased miss rates due to SMT, lowering the miss rate to 1.3% (two threads) and 1.8% (four threads).

1. A BTB miss is only counted here if the branch was predicted correctly.

Figure 12: Effective Overall Branch Misprediction. Misprediction rates plus significant BTB misses over all branches, for a shared operating system, with respect to BTB size. (a): direct mapped; (b): four-way associative. (Number of Counters = 2048, History Length = 3)

Combining both the branch misprediction values and the significant BTB misses yields the true number of times the correct instruction after the branch could not be executed in the cycle after the branch (Figure 12). The figure shows that at larger BTB sizes the overall branch prediction miss rates increased to 13.4% for two threads and 14.1% for four threads, over the single-thread rate of 8.5%. However, moving from a direct-mapped to a four-way associative BTB makes up for most of the additional miss rate due to SMT, with miss rates of 8.6% for two threads and 9.4% for four threads.

4.5.3 History Length

Different values of history length were used in this study. The results support the findings of [Gloy96], which showed that including the operating system makes lower history-length values more cost effective, especially at smaller table sizes. Also, lower table sizes performed better with smaller histories; when the table size increased, larger history lengths performed better.

Figure 13: Thread Variation (branch prediction). 14 different sets of traces were run through a two-threaded machine. The lines are the averages of the numbers. Significant BTB miss: misses in the BTB, given that the branch was predicted correctly and was a taken branch, over all branches. x: points with the operating system (solid line). o: points without operating system sharing (dotted line). (Number of Counters = 2048, History Length = 3, 2048 BTB entries, Associativity = 4)

Preliminary results pointed to the possibility that SMT might even strengthen the above findings. Other studies should be performed to determine the effects of SMT on different prediction schemes.

4.5.4 Does the Choice of Threads Affect Branch Prediction?

Branch prediction is very thread oriented; some threads can be predicted very well while others cannot [Gloy96]. This study tries to show the effect of the choice of threads on the trends of this paper. Figure 13 contains results from a branch prediction simulation similar to the cache simulation in Section 4.4. The figure shows that the variance is rather high: a 45% variation in the misprediction rate and a 75% variation in the BTB miss rate. This points to the importance of thread execution choice in an SMT environment, and suggests that operating system sharing is important to branch prediction.

Fine-Grained Multithreading: A new thread executes on the processor every cycle [Smith78, Alverson90].

Coarse-Grained Multithreading: When one thread stalls, a super-fast context switch occurs and a new thread starts executing [Weber89, Agarwal90].

Simultaneous Multithreading: Multiple threads run at the same time, sharing processor resources [Daddis91, Tullsen95].

Figure 14: Types of Multithreading

5.0 Previous Work

This section looks at previous work in multithreading and compares other studies in SMT to the results found in this study.

5.1 Work in ILP

Many of the latency-hiding techniques in this paper originally came from papers that focus on ILP. [Smith81] focuses on the effects of fetch, decoding, dependence-checking, and branch prediction. [Butler91] added scheduling window size, scheduling policy, and functional unit configuration. [Lam92] focuses on the interaction of branches and ILP, while [Wall91] examines scheduling window size, branch prediction, register renaming, and aliasing. [Gloy96] examines branch prediction and the effect of IBS's operating system content on ILP.

5.2 What is Multithreading?

Multithreading is an architectural technique in which more than one concurrent thread is supported on the same processor. This enables a processor to continue executing useful work from non-stalled threads when one thread is stalled by a long-latency operation or hazard. The first machine containing multithreading was one of the CDC 6600's peripheral processors [Thornton64], while the rationale for multithreading was

first presented in [Flynn72]. Today there are typically three types of multithreading, which are listed in Figure 14.

5.2.1 Fine-Grained Multithreading

In traditional fine-grained multithreading each thread is allocated the processor once per N cycles, where N is the number of threads. If N is large enough, the stall cycles from any hazard will be masked by the execution of other threads. The HEP system presented in [Smith78, Smith81] is a classic example of fine-grained multithreading. A drawback inherent to the approach in HEP is that single-thread latency will be on the order of N times greater than if the thread were executing alone on the processor, because each thread is only allocated a fraction of the cycles. The Monsoon architecture counters this problem by forming a pool of available instructions from all the threads to issue from [Papadopoulos90, Papadopoulos91]. The TERA system extends this by allowing out-of-order execution within a thread [Alverson90].

5.2.2 Coarse-Grained Multithreading

Coarse-grained multithreading can achieve better single-thread latency than fine-grained by swapping to a new thread only when the current one hits a long-latency hazard. This idea was first presented in [Weber89], and it was implemented in APRIL [Agarwal90, Kurihara91].

5.2.3 Simultaneous Multithreading

A fundamental problem with fine- and coarse-grained multithreading is the fact that only one thread can issue instructions during any one cycle; simultaneous multithreading does not have this drawback. An SMT machine tries to dispatch one (or more) instructions per thread per cycle (though variations on this rule exist). The instructions then execute and complete much as they do in a superscalar processor. For a very good example and a definition of Simultaneous Multithreading see [Tullsen95, Tullsen96].

Tullsen points out that the other multithreading methods allow the single-cycle issue bandwidth to be under-utilized, creating horizontal waste, and asserts that SMT avoids this problem. The main benefit of SMT compared to the other techniques is that it can attain higher functional unit utilization by allowing concurrent threads to dynamically fill as many functional units as possible during every cycle. If one thread can only issue a single instruction due to a data or control hazard, other unstalled threads will fill in and issue instructions to the unused functional units. As long as enough threads are unstalled, the processor's resources will be used to their full potential. Large amounts of memory bandwidth are necessary because multiple threads need to be able to access the caches during the same cycle, every cycle. On-chip caches and branch prediction become more of a problem because of the increased working set size.

[Daddis91] simulated the first SMT machine. This paper modelled an architecture with an instruction window, basically an issue buffer filled with a mixture of instructions from multiple threads. Instruction cache requests are controlled such that if one thread is stalled, instructions from other threads are fetched until the window is full. An interesting effect of this fetch policy is that it stalls a thread when it encounters a branch and fetches instructions from other streams until the branch is resolved. This eliminates any penalty due to branches, as long as enough other threads are unstalled, and thus eliminates the need for branch prediction hardware. The results showed a 100% speedup when moving from one thread to two threads, but these results are only an upper bound, as their model did not include cache effects.

[Hirata92] extended their multithreading model to include SMT. Hirata obtained a speedup of 5.79 over a regular superscalar when using 8 threads. This speedup did not account for any cache effects or branch prediction; however, it showed the potential of SMT.

[Tullsen95] introduces Simultaneous Multithreading, examines a number of SMT configurations, and compares them to single-threaded and finely multithreaded architectures. To do this they create a model of

a multithreaded processor based on the Alpha 21164 (with extremely relaxed issue restrictions and multiple register files). This model allows up to 8 threads to issue at once into 10 functional units. One of the driving forces behind my research was the fact that [Tullsen95] virtually ignored cache problems in his SMT research. He designed an SMT architecture model based on a DEC Alpha 21164 processor and quoted numbers of 6.3 IPC using multiple copies of the SPEC92 benchmark suite as threads. Unfortunately, his numbers are overly optimistic due to the fact that the SPEC92 benchmark suite runs amazingly well in small caches [Gee93].

His later work [Tullsen96] does a good job closing the gap between the theoretical and the implementable. This research provides a simulator with much more detail than SMACS. Their simulator includes functional units, instruction queues, register file analysis, TLBs, partial memory disambiguation, full out-of-order and speculative execution, branch penalties, wrong-path execution, and extensive work on the fetch policies used in their machine, all of which SMACS does poorly or not at all. Their simulator has problems, however, because it is based on the SPEC92 benchmark suite (even though one other real application trace was used). In addition, [Tullsen96] does not have operating system references in his traces and has a mixture of floating point and integer benchmarks running together. The major contributions of [Tullsen96] are the detail they put into compensating for the increased size of the register file and the extensive work they put toward deciding which thread to fetch from. These factors are major contributions to the SMT machine.

Table 3 shows a comparison of my work with [Tullsen96]'s work on machines with similar characteristics. The numbers for this chart were extrapolated from Figures 4 and 6 of [Tullsen96]. These comparisons indicate that Tullsen's IPC values are about twice mine. The differences probably come from his use of the SPEC92 benchmark suite: my benchmarks are much more memory intensive than SPEC92, adding greatly to the number of possible cache conflicts. In addition, his processor does not have operating system sharing; however, SPEC92's operating system component is small and would not matter to performance.

                                   1 Thread   2 Threads   4 Threads
Tullsen's results (rr)             2.1 IPC    2.8 IPC     3.6 IPC
Tullsen's results (bigq, icount)   2.1 IPC    3.3 IPC     4.2 IPC
My results (without OS)            1.2 IPC    1.0 IPC     2.2 IPC
My results (with OS)               1.0 IPC    1.1 IPC     1.9 IPC

Table 3: Comparison of Results

Tullsen's examination of the location of the bottleneck in SMT is not complete because he used non-memory-intensive traces. His analysis should be reexamined with real full-system traces to verify that the bottleneck is not in the memory system.

[Keckler92, Prasadh91] simulate an extension of a VLIW that has some SMT features. [Gunther93, Li95, Govindarajan95] all simulate SMT machines but do not issue more than one instruction per cycle per thread. [Yamamoto95, Gulati96] also simulate SMT machines; however, none of the above machines have looked at cache effects with large working sets. My work differs from these and the other authors above in that attention was paid to the memory system of the machine. In addition, I used system traces containing operating system references.

6.0 Conclusions and Future Work

This paper demonstrates that SMT can successfully increase throughput, even for memory-intensive workloads. The large working sets of multiple applications do increase the cache and BTB miss rates, but SMT's tolerance for latency can successfully overcome these performance hits. The result is an increase in IPC from 1.03 for one thread to 1.14 and 1.87 at two and four threads respectively.

Much of SMT's performance improvement can be attributed to significant sharing between different threads' operating system references, although the exact impact is hard to measure. Without this sharing,

SMT with two threads performs worse than a single thread. The instruction cache and BTB results show that miss rates increased dramatically from one to two threads, but only slightly from two to four threads, most likely due to the increased sharing of operating system references. The effects of increased sharing, plus a larger set of threads to execute, give a four-threaded machine a significant performance improvement over a two-threaded system.

The choice of threads can have a significant impact on the performance of the SMT system. Two threads' instruction cache, data cache, and BTB miss rates vary by 40%, 30%, and 75% respectively from the average, depending on which threads were used. This implies that choosing which threads are run together will greatly influence performance.

This work did not consider the impact of full out-of-order execution, investigate the effect of the increased branch misprediction rate on IPC, or evaluate the effects of operating system sharing in the data cache. This work should be expanded to include these areas to better understand SMT in a large-working-set environment. Other work on SMT machines should encompass optimizing the BTB for multiple threads and investigating SMT's effect on different types of caches and branch prediction schemes; these are the areas hardest hit by SMT, and it is unlikely that the single- and multiple-thread cases share the same design solution. Other problems should not be overlooked as well, such as the costs of greater memory bandwidth, more complex instruction dispatch, and larger register files.

7.0 Appendix A: The Simultaneous Multi-Threaded Adjustable Cache Simulator (SMACS)

This appendix fully explains the SMACS cache simulator.

7.1 Traces

The IBS (Instruction Benchmark Suite) traces used in the simulations were created using Monster, a hardware logic analyzer described in [Nagle92]. These traces represent real programs that ran on a DECStation 3100 under the Mach operating system. They contain both application and operating system memory references as well as PID, opcode, and operand information [Uhlig95]. Each trace originally consisted of about 2400 files (about 2 gigabytes of data). The files were grouped in sets of three: an address trace file, an instruction opcode file, and a file containing the PIDs of the processes running. One of these groups of three files takes up such a large amount of memory that it was impossible to run four different traces at a time in SMACS. To combat this problem, a tool was created that splits one set of trace files into 10 new sets, each about 1/10 of the size of the original. This keeps the amount of memory SMACS uses small enough to avoid excessive paging, thus increasing simulation speed.

To differentiate between multiple threads' cache accesses, a unique THID (Thread ID) was added to each address when it was first accessed. This models a multithreaded operating system assigning different memory spaces to each thread. If operating system sharing is turned on, then SMACS assigns a special OSID (Operating System ID) to every operating system instruction reference, modelling the fact that the operating system code is shared among threads and exists in only one place in physical memory. If operating system sharing is turned off, then the operating system instructions are skipped in the traces. Note that all accesses are in terms of their physical addresses and that SMACS does not model a TLB.
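A minimal sketch of this tagging rule follows. The names THID and OSID follow the text; the packed-ID encoding and the function name are assumptions for illustration, and thread IDs are assumed to start at 1 so they never collide with the shared OSID.

```python
OSID = 0   # shared address-space ID for OS instruction references

def tag_reference(addr, thid, is_os_instruction, share_os=True):
    """Map a raw trace address into a per-thread physical address space.

    Returns None when the reference should be skipped (an OS reference with
    sharing turned off). In SMACS only OS instruction references share the
    OSID; OS data references remained per-thread (see Section 4.2)."""
    if is_os_instruction:
        if not share_os:
            return None              # OS instructions are skipped in the traces
        space = OSID                 # OS code exists once in physical memory
    else:
        space = thid                 # private memory space per hardware context
    return (space << 32) | addr      # disjoint physical-address namespaces
```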

7.2 Main Cycle Loop

The main cycle-by-cycle loop in SMACS executes the following steps (a condensed sketch of the loop appears at the end of this section):

1. Loop through threads until the maximum number of instructions have been issued (round robin). Break if all threads are stalled on cache misses or data dependencies.
2. For each unstalled thread, check for the next instruction in the buffer:
   - If not in the buffer but in the level-one cache, stall for one cycle.
   - If not in the level-one instruction cache, queue the request to the cache hierarchy.
   - If not stalled on the instruction cache, mark destination operands as in use.
   - Check that all operands are ready; if not, stall on the data dependency.
   - If a write, queue it into the write buffer; stall if full.
   - If a read, check the level-one data cache and the write buffer. If not in the level-one data cache or the buffer, queue the request to the cache hierarchy.
   - If the instruction is not stalled, mark its destination as ready and call the branch routine if the instruction is a branch.
3. If a data port is not in use, write back the head of the write buffer.
4. Check the instruction and data level-2, level-3, and main memory request queues and process any ready requests.

On any cache or main memory hit (when dequeuing), all higher cache levels (and the instruction buffer, if it is an instruction cache access) are updated to hold the new information.

There is a potential problem caused by the memory hierarchy update method. If two requests that map to the same level-one data cache line are issued by two different threads, they can both be filled during the same cycle, and the second one to be filled will write over the first one's data. This problem was circumvented with the addition of a time-out counter: a request is reissued when it has been waiting for longer than the maximum delay required to access main memory. This might not accurately model real hardware, but it happened during less than 0.05% of the cache misses when running the benchmarks, so this hack did not affect the results.

Another problem occurs for branch prediction. If a branch occurs at the end of a trace file, the branch must be skipped and not included in the branch prediction analysis. This happens for only about one out of 2500 branches, so it does not disrupt the branch prediction numbers.
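The following condensed sketch restates steps 1 through 4 in Python. The thread, cache-hierarchy, and write-buffer objects are assumed interfaces (the real SMACS source is not reproduced in this paper), so the sketch shows control flow only.

```python
# A hypothetical restatement of the main SMACS loop; all object interfaces
# below are assumptions made for illustration.
MAX_ISSUE_SLOTS = 8   # per-cycle issue limit (see Section 7.3)

def try_issue(t, caches, write_buffer):
    """Step 2: attempt to issue one instruction from an unstalled thread."""
    inst = t.next_instruction()
    if not t.in_buffer(inst):
        if caches.l1i.access(inst.pc):
            t.stall_one_cycle()              # in the L1 I-cache but not the buffer
        else:
            caches.queue_request(t, inst.pc, is_instruction=True)
        return False
    if not inst.operands_ready():
        t.stall_on_dependency()              # true data dependency
        return False
    if inst.is_store() and not write_buffer.enqueue(inst):
        t.stall_one_cycle()                  # write buffer full
        return False
    if inst.is_load() and not (caches.l1d.access(inst.addr)
                               or write_buffer.contains(inst.addr)):
        caches.queue_request(t, inst.addr, is_instruction=False)
        return False
    inst.mark_destination_ready()
    if inst.is_branch():
        t.branch_routine(inst)               # prediction bookkeeping only
    return True

def simulate_cycle(threads, caches, write_buffer):
    issued, progress = 0, True
    # Step 1: round-robin across threads until eight instructions have issued
    # or every thread is stalled on a cache miss or data dependency.
    while issued < MAX_ISSUE_SLOTS and progress:
        progress = False
        for t in threads:
            if issued < MAX_ISSUE_SLOTS and not t.stalled() \
                    and try_issue(t, caches, write_buffer):
                issued += 1
                progress = True
    write_buffer.drain_if_port_free()   # Step 3: write back head of write buffer
    caches.process_queues()             # Step 4: advance L2/L3/memory queues
```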

7.3 Number of Issue Slots

The number of total issue slots per cycle is limited to eight. After cache and data hazards, this is the next culprit limiting thread execution.

7.4 Functional Units

The simulator does not model functional units. [Tullsen96] showed that he obtained only a 0.5% increase in IPC with infinite functional units over his normal nine functional units. This fact justifies omitting functional units in this study.

7.5 Data Dependency and Instruction Latency

SMACS has true data dependencies (and register renaming to eliminate false data dependencies) built into the simulator. However, it treats all non-read and non-write operations as taking one cycle. This can be justified because other processors do not have long latencies for most of this study's instructions, mainly because no floating point benchmarks were used. However, this could still be a source of error in this study.

7.6 Non Out-of-Order Machine

SMACS is not an out-of-order machine. Making it one would increase the accuracy of the results in this paper, and could even raise the IPC at low thread counts more than at high thread counts.

7.7 Instruction Buffers and Prefetching

The SMACS machine contains four prefetch buffers, each one line size in length; these buffers are divided among the processes, i.e., with two threads each thread gets two buffers, and so on. If fewer than four processes are in use, then prefetching is performed on the inactive buffers. If no other instruction fetch is going on, then the machine loads one of these prefetch buffers with the next address of the active buffer. If only a single thread is in use, there is still only one prefetch that occurs, i.e., only one buffer per thread is prefetched at any one time. Prefetching does not fill the cache on a miss.

7.8 Cache Characteristics

Cache characteristic values are summarized in Table 2, repeated below. Note that a level-one cache hit takes one cycle, the caches are fully pipelined, and only the ports limit their use.

                      Size         Associativity   Miss delay
Instruction level 1   var.         1               6 cycles
Data level 1          var.         1               6 cycles
Level 2               128K bytes   1               15 cycles
Level 3               2M bytes     1               50 cycles

Table 2: Cache Characteristics

7.8.1 Non-Banked Caches

The caches in SMACS are not banked, as they should be to support multiple level-one cache accesses per cycle [Sohi91]. However, this study was done with the belief that this does not alter the findings much, and that if the caches were banked performance would only decrease slightly.

7.8.2 Write-Back Policy

SMACS simulates a write-through cache, storing to memory after an instruction leaves the write queue. The rest of the written line is then loaded into the cache.

7.8.3 Split Instruction and Data Caches

Most current processors split their level-one caches into separate instruction and data caches and use unified level-two and higher caches. In fact, [Smith82] shows that splitting a cache is more effective for