CARNEGIE MELLON
Department of Electrical and Computer Engineering

Simultaneous Multithreading's Real Effect on Cache and Branch Prediction Performance

John A. Miller
1996
Advisor: Prof. Nagle

Simultaneous Multithreading's Real Effect on Cache and Branch Prediction Performance

John Alan Miller

1.0 Abstract

The lack of single-application instruction-level parallelism significantly limits the performance of superscalar microprocessors, which often find up to 80% of their hardware functional units idle. Simultaneous Multithreading (SMT) is an architectural technique that attempts to utilize the idle functional units by enabling multiple threads to run concurrently on a single superscalar processor. Unfortunately, the increased number of threads also increases the load on on-chip memory structures such as caches, Translation Lookaside Buffers (TLBs), and branch target buffers (BTBs). Unlike functional units, these hardware resources often run at near-peak utilization on a single-threaded machine, making them the limiting factor in single-thread performance and potentially preventing SMT from achieving significant performance improvement. This study examines the impact SMT has on the on-chip caches and branch target buffers. Unlike previous SMT studies, which have used application-only programs with small working sets, this study measures SMT performance under large applications, including operating system code. Results: SMT can significantly improve throughput, even under memory-intensive workloads. SMT increased IPC from 1.03 at one thread to 1.14 and 1.87 at two and four threads respectively. However, these numbers are about half of what was obtained with the SPEC92 benchmark suite. There is significant sharing of operating system code between different threads; without this sharing, SMT with two threads performs worse than a single thread. The choice of threads can have a significant impact on the performance of the SMT system. This implies that choosing which threads are executed together is important.

2.0 Introduction

Current architectures attempt to maximize Instructions Per Cycle (IPC) by finding all possible Instruction Level Parallelism (ILP) within a single program. Unfortunately, it is becoming increasingly difficult to find enough single-thread parallelism to continue increasing IPC beyond its current level of two [Tullsen96]. One symptom of this is the fact that the utilization of functional units in today's processors has dropped dramatically. Most processors today use less than 40% of their functional unit bandwidth [Tullsen95]. Multithreading tries to offset this by ignoring single-thread ILP and concentrating on increasing overall IPC. Various forms of multithreading have been proposed [Smith78, Papadopoulos90, Alverson90, Agarwal90, Tullsen96], enabling significant advances in increasing parallelism. The increases are achieved by allowing multiple programs ("threads"*) to execute concurrently on a processor, filling the stall cycles and unused issue bandwidth of one thread with useful work from another thread. Multithreading has the added benefit of increasing processor functional unit utilization [Agarwal90]. One of the more advanced methods of multithreading, Simultaneous Multithreading (SMT), claims to achieve an IPC as high as 5.4 for an eight-threaded machine [Tullsen96]. Despite these benefits, multithreading exhibits several inherent drawbacks, such as a large demand for memory bandwidth, increased cache conflicts between threads, and increased strain on the branch prediction units. These hardware resources are already bottlenecks in many of today's systems [Uhlig95], and SMT's larger working sets simply increase the load on the system's precious on-chip memory resources. However, SMT's ability to switch between different threads provides it with some degree of latency tolerance, possibly enough to overcome the increased demand on on-chip memory structures.

* Note that "thread" is used here to represent a hardware context; simultaneous multithreading requires that threads not share data, to avoid coherency problems. This is different from operating system "threads," which often share the same memory space.

Unfortunately, previous SMT studies have relied on simple workloads that do not tax the memory system, leaving unanswered two fundamental questions:

1. How much does SMT increase the demand on on-chip memory structures?
2. Can SMT's tolerance for latency hide any performance loss due to larger working sets?

These questions are important because if SMT increases the load on on-chip cache structures beyond what it can hide, it is possible that SMT's overall performance would be no greater than, or even less than, that of single-threaded systems. This work addresses these questions by focusing on the impact SMT has on on-chip memory structures under realistic workloads. The workloads, from the Instruction Benchmark Suite (IBS), are used to drive a custom SMT memory-system simulator that measures the impact various degrees of multithreading have on caches, branch target buffers, and overall system performance.

The rest of the paper is organized as follows. Section 3 gives a brief methodology for the simulations in this study. Section 4 shows the results and analysis. Section 5 examines previous work related to multithreading and SMT and compares it to the areas studied in this paper. Section 6 presents the conclusions. Appendix A discusses the SMACS simulator and its traces. Section 8 contains references.

3.0 Methodology

To examine the cache conflicts and branch prediction performance of an SMT machine, a trace-driven simulator, the Simultaneous Multithreaded Adjustable Cache Simulator (SMACS), was developed. SMACS is a cycle-by-cycle simulator that fully models the on-chip memory structures and issue logic of an SMT architecture. SMACS includes instruction and data fetching, caching, and CPU data dependencies, accurately modeling the characteristics of an SMT memory system. SMACS does not model functional unit behavior. [Tullsen96] showed that modeling functional units was not needed to accurately measure performance, so this drawback has little impact on the trends found in this paper.
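To make the memory-system modeling concrete, here is a minimal sketch (in Python, which the paper itself does not use) of how a multi-level hierarchy like the one in Table 2 below can be modeled, with a request paying each level's miss delay before probing the next level down. All names and the exact latency accounting are illustrative assumptions, not SMACS internals.

# Illustrative multi-level cache model with Table 2's miss delays.
# A hedged sketch, not SMACS source: each level is direct mapped and a
# request pays a level's miss delay before probing the next level down.

class CacheLevel:
    def __init__(self, size_bytes, line_size, miss_delay):
        self.n_lines = size_bytes // line_size
        self.line_size = line_size
        self.miss_delay = miss_delay
        self.tags = [None] * self.n_lines

    def access(self, addr):
        block = addr // self.line_size
        idx = block % self.n_lines
        hit = self.tags[idx] == block
        self.tags[idx] = block          # fill the line on a miss
        return hit

def load_latency(levels, addr):
    """Cycles a load waits: 1 on an L1 hit, plus each missed level's delay."""
    cycles = 1                          # a level-one hit takes one cycle
    for level in levels:
        if level.access(addr):
            return cycles
        cycles += level.miss_delay      # 6, then 15, then 50 (main memory)
    return cycles

hierarchy = [CacheLevel(8 * 1024, 32, 6),          # L1 (size varied in study)
             CacheLevel(128 * 1024, 32, 15),       # L2: 128K bytes
             CacheLevel(2 * 1024 * 1024, 32, 50)]  # L3: 2M bytes
print(load_latency(hierarchy, 0x1000))  # cold miss everywhere: 72 cycles
print(load_latency(hierarchy, 0x1000))  # now resident in L1: 1 cycle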

Trace Name   Description
Mpeg_play    mpeg_play version 2.0, 85 frames from a compressed file
Jpeg_play    xloadimage version 3.0, displays two JPEG images
real_gcc     GNU C compiler, version 2.6
Verilog      Verilog-XL version 1.6b, simulating a microprocessor
Groff        GNU C++ implementation of nroff, version 1.09
Kenbus       SPEC SDM benchmark suite, simultaneous multi-user code development
Ousterhout   John Ousterhout's benchmark suite from [Ousterhout89]
Sdet         Multiprocess benchmark from the SPEC SDM suite

Table 1: IBS Description

SMACS is also not an out-of-order machine. Modeling an out-of-order machine would alter the numbers presented in this paper, increasing all relevant IPCs. This might even increase single-thread IPC more than in the multiple-thread cases, though this is thought to be a second-order effect that would not affect the trends found in this paper. The one exception is stores: stores, which complete from a write buffer, do complete out of order with respect to the other instructions.

The workloads used in this study are from the IBS (Instruction Benchmark Suite) [Nagle92, Uhlig95] and contain both user and operating system references from the Mach operating system. The workloads cover a range of widely-used general-purpose applications, including file system and graphics software (see Table 1). A complete description of how the workloads were used is included in Appendix A.

In order for SMT simulations to make sense, the hardware models that simulate multiple threads must be comparable to the hardware available to a single thread. Therefore, all of the simulations were run with identical machine resources. For this reason a single-thread simulation might have more resources than it can use, artificially making single-thread performance relatively low compared to the other simulations. The level-one cache size was varied throughout the simulations in order to determine the effect of cache size on the parameters studied. No other variables were changed with respect to the cache.

               Instruction Level 1   Data Level 1   Level 2      Level 3
Size           var.                  var.           128K bytes   2M bytes
Associativity  1                     1              1            1
Miss delay     6 cycles              6 cycles       15 cycles    50 cycles

Table 2: Cache Characteristics

Table 2 summarizes the values of the cache structure used in all of the simulations in this paper. See Appendix A for a complete description of the SMACS simulator and a list of all of the other variables in the simulator.

To examine branch prediction performance, a standard two-level scheme was added to SMACS. The scheme was based on the [Sechrest96] scheme, which in turn came from [Yeh92]. I set the scheme's values to resemble [Tullsen96]. The branch prediction simulator only simulates the prediction itself, so SMACS does not simulate penalties associated with branch prediction and no wrong-path execution occurs. Including these two factors would increase the accuracy of the IPC study but was beyond this study's scope. Excluding them does not greatly affect the branch prediction results, because they would not change the pattern of the branches in the study.

4.0 Results and Analysis

4.1 Increases in On-Chip Cache Conflicts

SMT's ability to switch between different instruction streams automatically provides a higher degree of latency tolerance. When one thread misses in the cache, an SMT processor will try to fill the stalled thread's unused issue slots with another thread's instructions. In essence, SMT decouples overall machine throughput from any single thread's performance. However, SMT's execution of multiple threads forces multiple working sets to reside in the same physical cache. This increases the overall working set size, which is now the sum of every single thread's working set, potentially reducing the effectiveness of the cache to the point where every thread thrashes. In this case, no thread can make forward progress and SMT's performance may actually drop below single-thread performance.

Figure 1: Data cache miss rates. Run of SMACS with the operating system removed from the IBS traces. Line Size = 32 bytes, Associativity = 1. (Cache sizes 8K-64K.)

Figure 2: Instruction cache miss rates. Run of SMACS with the operating system removed from the IBS traces. Line Size = 32 bytes, Associativity = 1. (Cache sizes 8K-64K.)

To determine the impact real workloads have on an SMT cache system and on overall performance, I used the SMACS simulator to model data and instruction cache performance on the IBS benchmark suite. The first set of experiments, shown in Figures 1 and 2, measures the cache miss rates for user-only references (i.e., only the application, not the operating system). As expected, both the instruction and data cache miss rates increase with the number of concurrent threads and confirm the trends found in [Tullsen96]. In the data cache, the miss rate almost doubles between one and four threads, while the instruction cache miss rate doubles between just one and two threads. These are significant increases, especially in the instruction cache, where a thread cannot continue execution until its instruction cache miss is resolved. More importantly, the raw cache miss rates are very high. While Tullsen's SPEC workload showed instruction cache miss rates as low as 0.6% for a 32K cache and data cache miss rates as low as 1.2%, our miss rates are 2.0% (instruction cache) and 5.7% (data cache) for a single thread. Worse, our 2-thread rates are 4.5% (instruction cache) and 7.8% (data cache) while Tullsen's are less than 1.7% (instruction cache) and 2.5% (data cache).

Figure 3: Operating system effect on single threads. Run of SMACS with only single threads from IBS. OS: contains operating system references; no OS: does not contain operating system references. Line Size = 32 bytes, Associativity = 1. (Cache sizes 8K-64K.)

These high cache miss rates suggest that SMT may have some problems hiding all of the misses generated by these user-only references.

4.2 Impact of the Operating System on SMT

Numerous previous works have shown that operating system references significantly increase the cache miss rates for applications that rely on operating system services [Nagle92, Uhlig95]. Figure 3 shows the difference in cache miss rates for the IBS workloads with and without operating system references. In both the instruction and data caches, the miss rates increase by at least 20% when the operating system is included. It is almost never the case that including operating system activity reduces the cache miss rate in a single-threaded processor.

In SMT, however, the impact of operating system references is less clear. Operating system references do increase the size of the working set, but multiple threads can share all code segments and possibly some of the operating system data.

Figure 4: Data Cache Miss Rates. Run of SMACS with the operating system from the IBS traces compared with the operating system removed. OS: contains operating system references; no OS: does not contain operating system references. Line Size = 32 bytes, Associativity = 1. Tullsen's values taken from [Tullsen96] Table 3, extrapolated from misses/instruction up to misses/data. (Curves for T=1, T=2, and T=4, with and without OS; cache sizes 8K-64K.)

Figure 5: Instruction Cache Miss Rates. Run of SMACS with the operating system from the IBS traces compared with the operating system removed. OS: contains operating system references; no OS: does not contain operating system references. OS no sharing: contains operating system references but does not allow sharing between threads. Line Size = 32 bytes, Associativity = 1. Tullsen's values taken from [Tullsen96] Table 3.

For example, if one thread calls the operating system to perform some service, it will warm the cache with the operating system code. Another thread also making requests for operating system services will share a (potentially significant) portion of the operating system code brought into the cache by the first thread. Given a four-threaded machine and applications that each spend 25% of their time in the operating system [Nagle92], it is probable that at any given time at least one thread will be executing operating system code. This cross-thread sharing creates very different behavior from conventional single-threaded systems, which typically execute infrequent operating system requests that often are not in the cache and, when executed, flush significant portions of the cache. This makes the operating system's influence on SMT very important, and more important than in single-threaded systems.
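The "at least one thread in the operating system" claim can be made concrete with a back-of-the-envelope calculation (mine, under an independence assumption the paper does not state): if each of four threads independently spends 25% of its time in the operating system, then

    P(at least one thread in the OS) = 1 - (1 - 0.25)^4 = 1 - 0.75^4 ~ 0.68,

i.e., on roughly two cycles out of three, some thread is executing (and keeping warm) shared operating system code.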

To measure the impact of operating system references on SMT's cache performance, I reran the experiments in Figures 1 and 2 using the IBS workloads and included the operating system references. Figure 4 shows the results for data references. In our model, there was no sharing of operating system data references, and hence the data cache miss rates increased significantly when operating system data references were included. Figure 4 therefore represents a worst-case upper bound on data cache miss rates.

For the instruction cache, sharing of operating system references between threads was supported by SMACS. The single-thread case shows the well-known increase between user-only and user+operating system references. For the two- and four-thread cases, the results are much more surprising. For small caches, including operating system references does increase the cache miss rate a small amount, but once the caches become large enough to retain some of the operating system's working set, operating system sharing actually improves the miss rate beyond the user-only trace. For the 2-thread case, the cross-over point is at 32K. For the 4-thread case, it appears that the cross-over would occur just above 64K (just beyond the range of our experiments). This intertask sharing represents a significant performance win for SMT, potentially mitigating some of the performance problems caused by the increased working set.

4.3 The Effect of SMT on IPC and Overall Performance

SMT's ability to switch between different instruction streams allows it to better tolerate cache misses, possibly mitigating any increased cache load caused by the larger working sets of multiple threads. This poses the question of which is the more dominant factor: latency tolerance or increased working set size? This is an important question because if the increased strain on on-chip caches is larger than SMT's tolerance for increased cache miss rates, then SMT may not see a significant performance improvement. To understand the system-level impact of SMT's memory system, we used SMACS to measure the Instructions Per Cycle (IPC) for one, two and four threads. The results in Figure 6 show that the increase in IPC between one and two threads (with the operating system) is about 15%. However, any increase in IPC means that SMT has improved the performance of the system. For the four-thread case, the performance improvements are much more significant, with IPC almost doubling over the single-threaded case.

1. The simulations did neglect out-of-order execution, the latency of functional units, and the TLB. However, adding these factors should impact the IPC performance of each thread about equally.
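As a rough link between these IPC numbers and the run times reported below (my arithmetic, not a measurement from the paper): for a fixed instruction count I and clock rate f, run time is T = I / (IPC x f), so run time scales inversely with IPC. Raising IPC from 1.03 to 1.87, for example, cuts run time to about 1.03/1.87 ~ 55% of the single-thread figure; the measured reductions in Figure 7 differ somewhat because they come from one specific four-program mix.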

Figure 6: IPC. Run of SMACS with the operating system from the IBS traces compared with the operating system removed. OS: contains operating system references; no OS: does not contain operating system references. Line Size = 32 bytes, Associativity = 1. (Curves for T=1, T=2, and T=4, with and without OS; cache sizes 8K-64K.)

I believe that the significant increase is due to: 1) the increased tolerance of cache misses; and 2) the increased sharing of operating system code. The four-threaded machine has more threads to run when a cache miss occurs, and the probability of the cache containing operating system code can actually increase with more threads (it is also dependent upon the cache size). Using the data from Figure 6 and separate simulations, I computed the total run time for each of the systems (Figure 7). The results show that a two-threaded SMT architecture would reduce the total run time of two applications by about 20%, while the 4-threaded SMT machine reduces the total run time by up to 55%.

4.4 Does the Choice of Threads Matter?

Most of the simulations in this study ran the same two or four threads together. The worst-case single-thread programs were chosen to try to get a lower bound on the performance numbers. However, the two worst-case single threads did not turn out to provide the worst-case two-threaded simulation.

Figure 7: Run Time. Comparison of the run time of four programs (Kenbus, Verilog, Sdet, Mpeg_play), all run with the operating system, with different degrees of multithreading. (one thread): the run times of the four programs added together. (two threads): two threads are started together; when one finishes another is added, and when only one is left the processor switches into single-thread mode. (four threads): all four threads are started at the same time; when two threads finish the processor switches to two-thread mode, and so on. Switching modes (from two threads to a single thread, etc.) can be justified by looking at [Tullsen96], in which the processor can switch between modes. (Line Size = 32 bytes, Associativity = 1, Cache Size = 32K)

This can be explained because threads with a larger amount of operating system activity do better because of increased sharing, while examining single-thread performance does not take this into account. Figures 8 and 9 contain the results of a study where half of the combinations of two threads (out of eight) were simulated together, to evaluate the variation in instruction and data cache performance due to thread choice. The results showed that there was a significant variation over the spectrum of combinations, with a 30% variation in the data cache and a 40% variation in the instruction cache. This points to the importance of thread execution choice in an SMT environment. Two threads with large amounts of operating system references (sharing in the caches) might do really well together, but run either of them with a thread with very little operating system activity and performance might degrade.

Figure 8: Thread Variation (data cache). 14 different sets of traces were run through a two-threaded machine. The lines are the averages of the miss rates. x: points with the operating system (solid line); o: points without operating system sharing (dotted line). Note the operating system was not removed for this simulation. (Line Size = 32 bytes, Associativity = 1, Cache Size = 16K)

Figure 9: Thread Variation (instruction cache). 14 different sets of traces were run through a two-threaded machine. The lines are the averages of the miss rates. x: points with the operating system (solid line); o: points without operating system sharing (dotted line). Note the operating system was not removed for this simulation. (Line Size = 32 bytes, Associativity = 1, Cache Size = 16K)

The first column in both figures shows the traces used in the other simulations in this paper. These traces do appear to be close to the worst-case threads, so they do constitute a general lower bound on performance. This implies that other thread choices could improve two-thread performance over single-thread performance.

4.5 Branch Prediction

Branch prediction is a very serious problem in today's processors. Branch penalties are increasing, and prediction schemes are getting more complex in an effort to reduce the effect of the long penalties [Sechrest96]. For an SMT machine branch prediction is even more important. In order to incorporate a bigger register file, the pipeline of an SMT machine may have to get longer [Tullsen96].

This increases the branch penalty and puts even more strain on making the correct prediction and finding the target address to compensate for the increased penalties. The effects of branch prediction could limit the effectiveness of SMT if its latency-tolerating characteristics are not greater than the additional strain placed on the system by an increased number of threads.

Few previous works have addressed the issue of large working sets and operating system effects on branch prediction in an SMT environment. It was not until [Gloy96, Sechrest96] that the impact of the operating system with large working sets was understood in the context of branch prediction. This work extends the SMT work on branch prediction by evaluating it with traces from the IBS benchmark suite, which contain large working sets and operating system references.

A two-level scheme was introduced into the SMACS simulator. In the first level, a BTB (Branch Target Buffer) lookup is performed on the branch address; the low-order bits of the address form the index (with the possibility of associativity). The BTB contains the last target address of the branch or jump and a small amount of bit history of the last branches (taken or not taken) from the current BTB entry. The second level of the two-level scheme uses this history and the lower part of the address as an index into an array of two-bit counters. The machine counts up for a taken branch and down for a not-taken branch. The branch is then predicted based on the value of the counter [Yeh92]. Refer to Section 7.11 for details about the branch prediction scheme.

The study of branch prediction in this paper was separated into two parts, roughly corresponding to the two levels of the two-level scheme. The first part is the misprediction rate, or how many times the branch direction was predicted incorrectly. The second part is the significant BTB miss rate: assuming the branch was predicted correctly, does the BTB contain the correct address for the branch? See Section 7.11 for more details on branch prediction. Since it is clear from the above sections that including operating system references is the desired way to evaluate SMT, the study of branch prediction always included the operating system and did not remove it.
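The following sketch shows the shape of the scheme just described: a BTB whose entries carry a target and a few history bits, feeding a table of two-bit saturating counters. It is a hedged illustration; the index hashing, tag handling, and entry format here are my assumptions, not SMACS's exact implementation.

# Illustrative two-level predictor in the style described above:
# level 1 is a direct-mapped BTB (last target + 3 history bits),
# level 2 is a table of two-bit saturating counters indexed by the
# history combined with low address bits. Details are assumptions.

class TwoLevelPredictor:
    def __init__(self, btb_entries=2048, n_counters=2048, hist_len=3):
        self.btb_entries = btb_entries
        self.hist_len = hist_len
        self.btb = {}                      # index -> (pc tag, target, history)
        self.counters = [1] * n_counters   # two-bit counters, weakly not-taken

    def _counter_index(self, pc, hist):
        # One plausible combination of history and low address bits.
        return ((hist << 8) ^ (pc >> 2)) % len(self.counters)

    def predict(self, pc):
        idx = (pc >> 2) % self.btb_entries
        tag, target, hist = self.btb.get(idx, (None, None, 0))
        taken = self.counters[self._counter_index(pc, hist)] >= 2
        return taken, (target if tag == pc else None)

    def update(self, pc, taken, target):
        idx = (pc >> 2) % self.btb_entries
        tag, old_target, hist = self.btb.get(idx, (None, None, 0))
        c = self._counter_index(pc, hist)
        # Count up for taken, down for not taken (saturating at 0 and 3).
        self.counters[c] = min(3, self.counters[c] + 1) if taken \
                           else max(0, self.counters[c] - 1)
        new_hist = ((hist << 1) | int(taken)) & ((1 << self.hist_len) - 1)
        self.btb[idx] = (pc, target, new_hist)

# Per-branch accounting as in this study: a misprediction is a wrong
# direction; a "significant BTB miss" is a correctly predicted taken
# branch whose target was absent (or stale) in the BTB.
def classify(pred, pc, taken, target):
    pred_taken, pred_target = pred.predict(pc)
    mispredict = pred_taken != taken
    sig_btb_miss = (not mispredict) and taken and pred_target != target
    pred.update(pc, taken, target)
    return mispredict, sig_btb_miss

Summing the two flags over a trace and dividing by the branch count gives the misprediction rate, the significant BTB miss rate, and (added together) the effective overall rate of Figures 10 through 12.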

Figure 10: Branch Misprediction Rate. Misprediction rates over all branches for a shared operating system, with respect to BTB size, for (a) a direct-mapped BTB and (b) a four-way associative BTB. (Number of Counters = 2048, History Length = 3.) Tullsen's values taken from [Tullsen96] Table 3. (Curves for one, two, and four threads; BTB sizes 1K-4K entries.)

Branch Misprediction Rate

Many papers today evaluate branch prediction by examining the branch misprediction rate; for this reason the branch misprediction rate was separated from the BTB miss rate and evaluated separately. Figure 10 shows the results of a study of the branch misprediction rate when operating system sharing is used, over a range of BTB sizes. The results show a large increase in the miss rates due to SMT with a direct-mapped BTB. The increase is about 4% for two threads and 5% for four. Four-way associativity had little effect on the misprediction rate, with less than a 2% change even at the largest BTB sizes. Figure 10b also contains results from [Tullsen96] and shows that Tullsen achieved a significantly lower misprediction rate with his use of SPEC92 and its small working set size. The accuracy of the prediction does depend on the number of BTB entries. This is because the first-level lookup determines which counter in the second-level table is used. Increasing the number of BTB entries means less aliasing on the history in the first level, improving the accuracy of the second level.

Figure 11: BTB Miss Rate. Significant BTB miss rate (misses in the BTB, given that the branch was predicted correctly and was a taken branch) over all branches, for a shared operating system, with respect to BTB size, for (a) a direct-mapped BTB and (b) a four-way associative BTB. (Number of Counters = 2048, History Length = 3; curves for one, two, and four threads; BTB sizes 1K-4K entries.)

Significant BTB Misses

BTB misses are important because a miss implies that the instruction at the predicted-taken target could not be fetched, because its address was not known. SMT can have a large effect on BTB misses because the BTB is just a cache, and the additional threads might overload it. The study performed in the previous section was extended to the BTB miss rate (1); the results appear in Figure 11. At larger BTB sizes, the BTB miss rate increased from 2.2% (one thread) to 4.8% (two threads) and higher still (four threads) with a direct-mapped BTB. However, moving to a four-way associative BTB fully compensates for the increased miss rates due to SMT, lowering the miss rate to 1.3% (two threads) and 1.8% (four threads).

1. A BTB miss is only counted here if the branch was predicted correctly.
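Because a significant BTB miss is, by the definition above, only counted when the direction was predicted correctly, the two event classes are disjoint, and their per-branch rates simply add:

    effective overall misprediction rate = misprediction rate + significant BTB miss rate.

Figure 12 (next) plots exactly this sum.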

Figure 12: Effective Overall Branch Misprediction. Misprediction rates plus significant BTB misses over all branches, for a shared operating system, with respect to BTB size, for (a) a direct-mapped BTB and (b) a four-way associative BTB. (Number of Counters = 2048, History Length = 3; curves for one, two, and four threads; BTB sizes 1K-4K entries.)

Combining the branch misprediction values and the significant BTB misses yields the true number of times the correct instruction after the branch could not be executed in the cycle after the branch (Figure 12). The figure shows that at larger BTB sizes the overall branch prediction miss rate increased to 13.4% for two threads and 14.1% for four threads, over the single-thread rate of 8.5%. However, moving from a direct-mapped to a four-way associative BTB makes up for most of the additional misses due to SMT, with miss rates of 8.6% for two threads and 9.4% for four threads.

History Length

Different values of history length were used in this study. The results support the findings of [Gloy96], which showed that including the operating system makes smaller history lengths more cost-effective, especially at smaller table sizes. Smaller tables performed better with shorter histories; as table size increased, larger history lengths performed better.

Figure 13: Thread variation (branch prediction). 14 different sets of traces were run through a two-threaded machine. The lines are the averages of the numbers. Significant BTB miss: misses in the BTB, given that the branch was predicted correctly and was a taken branch, over all branches. x: points with the operating system (solid line); o: points without operating system sharing (dotted line). (Number of Counters = 2048, History Length = 3, 2048 BTB entries, Associativity = 4)

Preliminary results pointed to the possibility that SMT might even strengthen the above findings. Other studies should be performed to determine the effects of SMT on different prediction schemes.

Does the Choice of Threads Affect Branch Prediction?

Branch prediction is very thread oriented. Some threads can be predicted very well while others cannot [Gloy96]. This study tries to show the effect of my choice of threads on the trends of this paper. Figure 13 contains results from a branch prediction simulation similar to the cache simulation in Section 4.4. The figure shows that the variance is rather high: a 45% variation in the misprediction rate and a 75% variation in the BTB miss rate. This points to the importance of thread execution choice in an SMT environment, and suggests that operating system sharing is important to branch prediction.

Fine-Grained Multithreading: A new thread executes on the processor every cycle [Smith78, Alverson90].

Coarse-Grained Multithreading: When one thread stalls, a super-fast context switch occurs and a new thread starts executing [Weber89, Agarwal90].

Simultaneous Multithreading: Multiple threads run at the same time, sharing processor resources [Daddis91, Tullsen95].

Figure 14: Types of Multithreading

5.0 Previous Work

This section looks at previous work in multithreading and compares other studies of SMT to the results found in this study.

5.1 Work In ILP

Many of the latency-hiding techniques in this paper originally came from papers that focus on ILP. [Smith81] focuses on the effects of fetch, decoding, dependence-checking, and branch prediction. [Butler91] added scheduling window size, scheduling policy, and functional unit configuration. [Lam92] focuses on the interaction of branches and ILP, while [Wall91] examines scheduling window size, branch prediction, register renaming, and aliasing. [Gloy96] examines branch prediction and the effect of IBS's operating system references on ILP.

5.2 What is Multithreading?

Multithreading is an architectural technique where more than one concurrent thread is supported on the same processor. This enables a processor to continue executing useful work from non-stalled threads when one thread is stalled by a long-latency operation or hazard. The first machine containing multithreading was one of the CDC 6600's peripheral processors [Thornton64], while the rationale for multithreading was first presented in [Flynn72].

Today there are typically three types of multithreading, which are listed in Figure 14.

Fine-Grained Multithreading

In traditional fine-grained multithreading, each thread is allocated once per N cycles, where N is the number of threads. If N is large enough, the stall cycles from any hazard will be masked by the execution of other threads. The HEP system presented in [Smith78, Smith81] is a classic example of fine-grained multithreading. A drawback inherent to the approach in HEP is that single-thread latency will be on the order of N times greater than if the thread were executing alone on the processor, because each thread is only allocated a fraction of the cycles. The Monsoon architecture counters this problem by forming a pool of available instructions, drawn from all the threads, to issue from [Papadopoulos90, Papadopoulos91]. The TERA system extends this by allowing out-of-order execution within a thread [Alverson90].

Coarse-Grained Multithreading

Coarse-grained multithreading can achieve better single-thread latency than fine-grained by swapping to a new thread only when the current one hits a long-latency hazard. This idea was first presented in [Weber89], and it was implemented in APRIL [Agarwal90, Kurihara91].

Simultaneous Multithreading

A fundamental problem with fine- and coarse-grained multithreading is the fact that only one thread can issue instructions during any one cycle; simultaneous multithreading does not have this drawback. An SMT machine tries to dispatch one (or more) instructions per thread per cycle (though variations on this rule exist). The instructions then execute and complete much as they do in a superscalar processor. For a very good example and a definition of Simultaneous Multithreading see [Tullsen95, Tullsen96].

Tullsen points out that the other multithreading methods allow the single-cycle issue bandwidth to be under-utilized, creating horizontal waste, and asserts that SMT avoids this problem. The main benefit of SMT compared to the other techniques is that it can attain higher functional unit utilization by allowing concurrent threads to dynamically fill as many functional units as possible during every cycle. If one thread can only issue a single instruction due to a data or control hazard, other unstalled threads will fill in and issue instructions to the unused functional units. As long as enough threads are unstalled, the processor's resources will be used to their full potential. Large amounts of memory bandwidth are necessary because multiple threads need to be able to access the caches during the same cycle, and to continue doing so every cycle. On-chip caches and branch prediction become more of a problem because of the increased working set size.

[Daddis91] simulated the first SMT machine. This paper modelled an architecture with an Instruction Window, basically an issue buffer filled with a mixture of instructions from multiple threads. Instruction cache requests are controlled such that if one thread is stalled, instructions from other threads are fetched until the window is full. An interesting effect of this fetch policy is that it stalls a thread when it encounters a branch and fetches instructions from other streams until the branch is resolved. This eliminates any penalty due to branches, as long as enough other threads are unstalled, and thus eliminates the need for branch prediction hardware. The results showed a 100% speedup when moving from one thread to two threads, but these results are only an upper bound, as their model did not include cache effects. [Hirata92] extended this multithreading model to include SMT. Hirata obtained a speedup of 5.79 over a regular superscalar when using 8 threads. This speedup did not account for any cache effects or branch prediction; however, it showed the potential of SMT.

[Tullsen95] introduces Simultaneous Multithreading, examines a number of SMT configurations, and compares them to single-threaded and fine-grained multithreaded architectures.

To do this they create a model of a multithreaded processor based on the Alpha (with extremely relaxed issue restrictions and multiple register files). This model allows up to 8 threads to issue at once into 10 functional units. One of the driving forces behind my research was the fact that [Tullsen95] virtually ignored cache problems in his SMT research. He designed an SMT architecture model based on a DEC Alpha processor and quoted numbers of 6.3 IPC using multiple copies of the SPEC92 benchmark suite as threads. Unfortunately, his numbers are overly optimistic due to the fact that the SPEC92 benchmark suite runs amazingly well in small caches [Gee93].

His later work [Tullsen96] does a good job closing the gap between the theoretical and the implementable. This research provides a simulator with much more detail than SMACS. Their simulator includes functional units, instruction queues, register file analysis, TLBs, partial memory disambiguation, full out-of-order and speculative execution, branch penalties, wrong-path execution, and extensive work on the fetch policies used in their machine, all of which SMACS does poorly or not at all. Their simulation has problems, however, because it is based on the SPEC92 benchmark suite (even though one other real application trace was used). In addition, [Tullsen96] does not have operating system references in his traces and has a mixture of floating point and integer benchmarks running together. The major contributions of [Tullsen96] are the detail they put into compensating for the increased size of the register file and the extensive work they put towards deciding which thread to fetch from. These factors are major contributions to the SMT machine.

Table 3 shows a comparison of my work with [Tullsen96]'s work under similar machine characteristics. The numbers for this chart were extrapolated from Figures 4 and 6 of [Tullsen96]. These comparisons indicate that Tullsen's IPC values are about twice what mine are. The differences probably come from his use of the SPEC92 benchmark suite. My benchmarks are much more memory intensive than SPEC92, adding greatly to the number of possible cache conflicts. In addition, his processor does not have operating system sharing; however, SPEC92's operating system component is small and would not matter to performance.

                                   1 Thread   2 Threads   4 Threads
Tullsen's results (rr)             2.1 IPC    2.8 IPC     3.6 IPC
Tullsen's results (bigq, icount)   2.1 IPC    3.3 IPC     4.2 IPC
My results (without OS)            1.2 IPC    1.0 IPC     2.2 IPC
My results (with OS)               1.0 IPC    1.1 IPC     1.9 IPC

Table 3: Comparison of Results

Tullsen's examination of the location of the bottleneck in SMT is not complete because he used non-memory-intensive traces. His analysis should be reexamined with real full-system traces to verify that the bottleneck is not in the memory system. [Keckler92, Prasadh91] simulate an extension of a VLIW that has some SMT features. [Gunther93, Li95, Govindarajan95] all simulate SMT machines but do not issue more than one instruction per cycle per thread. [Yamamoto95, Gulati96] also simulate SMT machines; however, none of the above machines have looked at cache effects with large working sets. My work differs from these and the other authors above in the attention paid to the memory system of the machine. In addition, I used system traces containing operating system references.

6.0 Conclusions and Future Work

This paper demonstrates that SMT can successfully increase throughput, even for memory-intensive workloads. The large working sets of multiple applications do increase the cache and BTB miss rates, but SMT's tolerance for latency can successfully overcome these performance hits. The result is an increase in IPC from 1.03 for one thread to 1.14 and 1.87 at two and four threads respectively. Much of SMT's performance improvement can be attributed to significant sharing between different threads' operating system references, although the exact impact is hard to measure.

Without this sharing, SMT with two threads performs worse than a single thread. The instruction cache and BTB results show that miss rates increased dramatically from one to two threads, but only slightly from two to four threads, most likely due to the increased sharing of operating system references. The effects of increased sharing, plus a larger set of threads to execute, give a four-threaded machine a significant performance improvement over a two-threaded system.

The choice of threads can have a significant impact on the performance of the SMT system. Two-thread instruction cache, data cache, and BTB miss rates vary 40%, 30%, and 75% respectively from the average, depending on which threads were used. This implies that choosing which threads are run together will greatly influence performance.

This work did not consider the impact of full out-of-order execution, investigate the effect of the increased branch misprediction rate on IPC, or evaluate the effects of operating system sharing in the data cache. This work should be expanded to include these areas to better understand SMT in a large-working-set environment. Other work on SMT machines should encompass optimizing the BTB for multiple threads and investigating SMT's effect on different types of caches and branch prediction schemes. These are the areas hardest hit by SMT, and it is unlikely that single- and multiple-thread designs share the same design solution. Other problems should not be overlooked as well, such as the costs of greater memory bandwidth, more complex instruction dispatch, and larger register files.

7.0 Appendix A: The Simultaneous Multi-Threaded Adjustable Cache Simulator (SMACS)

This appendix fully explains the SMACS cache simulator.

7.1 Traces

The IBS (Instruction Benchmark Suite) traces used in the simulations were created using Monster, a hardware logic analyzer described in [Nagle92]. These traces represent real programs that ran on a DECStation 3100 using the Mach operating system. They contain both application and operating system memory references as well as PID, opcode and operand information [Uhlig95]. Each trace originally consisted of about 2400 files (about 2 gigabytes worth of data). The files were grouped in sets of three: an address trace file, an instruction opcode file, and a file containing the PIDs of the processes running. One of these groups of three files takes up such a large amount of memory that it was impossible to run four different traces at a time in SMACS. To combat this problem a tool was created that can split one set of trace files into 10 new sets, each about 1/10 of the size of the original. This keeps the amount of memory SMACS uses small enough to avoid excessive paging, thus increasing simulation speed.

To differentiate between multiple threads' cache accesses, a unique THID (Thread ID) was added to each address when it was first accessed. This models a multithreaded operating system assigning different memory spaces to each thread. If operating system sharing is turned on, then SMACS assigns a special OSID (Operating System ID) to every operating system instruction reference, modelling the fact that the operating system code is shared among threads and exists in only one place in physical memory. If operating system sharing is turned off, then the operating system instructions are skipped in the traces. Note that all accesses are in terms of their physical addresses and that SMACS does not model a TLB.
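A small sketch of the tagging rule just described, assuming the ID is simply placed in high-order bits above the physical address; the encoding, bit positions, and constant names are illustrative, not SMACS's trace format.

# Illustrative THID/OSID tagging, per Section 7.1. User references get a
# per-thread THID, so threads never alias in the cache; with sharing on,
# operating system *instruction* references all get one shared OSID.
# The 40-bit shift is an arbitrary choice for the sketch.

OSID = 0          # shared ID for operating system code; THIDs start at 1

def tag_address(paddr, thid, is_os_ref, is_instr, os_sharing=True):
    """Return the tagged address, or None if the reference is skipped."""
    if is_os_ref and is_instr:
        if not os_sharing:
            return None                # OS instructions skipped when sharing off
        return (OSID << 40) | paddr    # one shared copy of OS code
    return (thid << 40) | paddr        # private: user refs and OS data

# Two threads fetching the same OS code address map to the same cache
# block, so the second thread hits on the line the first brought in;
# their user (and OS data) references remain disjoint.
assert tag_address(0x4000, 1, True, True) == tag_address(0x4000, 2, True, True)
assert tag_address(0x4000, 1, False, True) != tag_address(0x4000, 2, False, True)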

7.2 Main Cycle Loop

The main cycle-by-cycle loop in SMACS executes the following steps:

1. Loop through threads until the maximum number of instructions has been issued (round robin). Break if all threads are stalled on cache misses or data dependencies.
2. For each unstalled thread, check for the next instruction in the buffer. If it is not in the buffer but is in the level-one cache, stall for one cycle. If it is not in the level-one instruction cache, queue the request to the cache hierarchy. If not stalled on the instruction cache, mark destination operands as in use. Check that all operands are ready; if not, stall on the data dependency. If the instruction is a write, queue it into the write buffer, stalling if the buffer is full. If it is a read, check the level-one data cache and the write buffer; if it is in neither, queue the request to the cache hierarchy. If the instruction is not stalled, mark its destination as ready, and call the branch routine if the instruction is a branch.
3. If a data port is not in use, write back the head of the write buffer.
4. Check the instruction and data level-one, level-two, level-three, and main memory request queues and process any ready requests. On any cache or main memory hit (when dequeueing), all higher cache levels (and the instruction buffer, if it is an instruction cache access) are updated to hold the new information.

There is a potential problem caused by the memory hierarchy update method. If two requests that map to the same level-one data cache line are issued by two different threads, they can both be filled during the same cycle, and the second one to be filled will write over the first one's data. This problem was circumvented with the addition of a time-out counter: a request is reissued when it has been waiting for longer than the maximum delay required to access main memory. This might not accurately model real hardware, but it happened during less than 0.05% of the cache misses when running the benchmarks, so this hack did not affect the results.

Another problem occurs for branch prediction. If a branch occurs at the end of a trace file, the branch must be skipped and not included in the branch prediction analysis. This happens for only one out of 2500 branches, so it does not disrupt the branch prediction numbers.
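To show the control structure (not the detail) of the loop above, here is a deliberately tiny runnable rendition: round-robin issue over threads, a single direct-mapped data cache with a fixed miss penalty, and stores retiring through a write buffer. Trace format, sizes, and penalties are all illustrative stand-ins, not SMACS's.

# Miniature rendition of the Section 7.2 cycle loop. Round-robin issue,
# one direct-mapped data cache, a write buffer that retires one store
# per cycle. All parameters are illustrative, not SMACS's.

from collections import deque

N_LINES, LINE, MISS_PENALTY, MAX_ISSUE, WBUF_DEPTH = 64, 32, 6, 8, 8

def simulate(traces):
    """traces: per thread, a list of ('r'|'w'|'op', addr). Returns IPC."""
    tags = [None] * N_LINES
    wbuf = deque()
    pc = [0] * len(traces)               # next instruction per thread
    stall = [0] * len(traces)            # cycles left stalled per thread
    cycles = issued_total = 0
    while any(pc[t] < len(traces[t]) for t in range(len(traces))):
        cycles += 1
        issued = 0
        for t in range(len(traces)):     # step 1: round-robin over threads
            if issued >= MAX_ISSUE:
                break
            if pc[t] >= len(traces[t]):
                continue                 # thread finished
            if stall[t] > 0:
                stall[t] -= 1            # still waiting on a miss
                continue
            op, addr = traces[t][pc[t]]
            if op == 'r':                # step 2: loads probe the cache
                block = addr // LINE
                if tags[block % N_LINES] != block:
                    tags[block % N_LINES] = block   # fill the line
                    stall[t] = MISS_PENALTY         # this thread stalls;
                    continue                        # others keep issuing
            elif op == 'w':              # stores go to the write buffer
                if len(wbuf) >= WBUF_DEPTH:
                    continue             # buffer full: thread waits a cycle
                wbuf.append(addr)
            pc[t] += 1
            issued += 1
        if wbuf:                         # step 3: retire one write per cycle
            wbuf.popleft()
        issued_total += issued           # (step 4, the request queues that
                                         # feed lower levels, is elided)
    return issued_total / cycles

# One thread alone stalls on every miss; a second thread fills the gaps.
a = [('r', i * 4096) for i in range(50)]   # pathological: misses every load
b = [('op', 0)] * 200
print(simulate([a]), simulate([a, b]))     # throughput rises with two threads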

7.3 Number of Issue Slots

The number of total issue slots per cycle is limited to eight. After cache and data hazards, this is the next culprit in limiting threads' execution.

7.4 Functional Units

This simulator does not simulate functional units. [Tullsen96] showed that he obtained only a 0.5% increase in his IPC with infinite functional units over his normal nine functional units. This fact justifies omitting functional units in this study.

7.5 Data Dependency and Instruction Latency

SMACS has true data dependency checking (and register renaming to eliminate false data dependencies) built into the simulator. However, it treats all non-read and non-write operations as taking one cycle. This can be justified because other processors do not have long latencies for most of this study's instructions, mainly because no floating point benchmarks were used. However, this could still be a source of error in my study.

7.6 Non Out-of-Order Machine

SMACS is not an out-of-order machine. Making it one would increase the accuracy of the results in this paper. This could even raise the IPC at low thread counts more than at high thread counts.

7.7 Instruction Buffers and Prefetching

The SMACS machine contains four prefetch buffers, each one line in length; these buffers are divided among the threads, i.e., with two threads, each thread gets two buffers, and so on. If fewer than four threads are in use, then prefetching is performed on the inactive buffers. If no other instruction fetch is in progress, the machine loads one of these prefetch buffers with the next address of the active buffer. If only a single thread is in use there is still only one prefetch that occurs, i.e., only one buffer per thread is prefetched at any one time. Prefetching does not fill the cache on a miss.

7.8 Cache Characteristics

Cache characteristic values are summarized in Table 2. Note that a level-one cache hit takes one cycle, the caches are fully pipelined, and only the ports limit their use.

Non-Banked Caches

The caches in SMACS are not banked, as they should be for multiple level-one cache accesses per cycle [Sohi91]. However, this study was done with the belief that this will not alter the findings much, and that if the cache were banked it would only decrease performance slightly.

Write Back Policy

SMACS simulates a write-through cache, storing to memory after an instruction leaves the write queue. The rest of the written line is then loaded into the cache.

Split Instruction and Data Caches

Most current processors split their level-one caches into separate instruction and data caches and use unified level-two and higher caches. In fact, [Smith82] shows that splitting a cache is more effective for


More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Simultaneous Multithreading Architecture

Simultaneous Multithreading Architecture Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

EECS 452 Lecture 9 TLP Thread-Level Parallelism

EECS 452 Lecture 9 TLP Thread-Level Parallelism EECS 452 Lecture 9 TLP Thread-Level Parallelism Instructor: Gokhan Memik EECS Dept., Northwestern University The lecture is adapted from slides by Iris Bahar (Brown), James Hoe (CMU), and John Shen (CMU

More information

Quantitative study of data caches on a multistreamed architecture. Abstract

Quantitative study of data caches on a multistreamed architecture. Abstract Quantitative study of data caches on a multistreamed architecture Mario Nemirovsky University of California, Santa Barbara mario@ece.ucsb.edu Abstract Wayne Yamamoto Sun Microsystems, Inc. wayne.yamamoto@sun.com

More information

The Use of Multithreading for Exception Handling

The Use of Multithreading for Exception Handling The Use of Multithreading for Exception Handling Craig Zilles, Joel Emer*, Guri Sohi University of Wisconsin - Madison *Compaq - Alpha Development Group International Symposium on Microarchitecture - 32

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Static Branch Prediction

Static Branch Prediction Static Branch Prediction Branch prediction schemes can be classified into static and dynamic schemes. Static methods are usually carried out by the compiler. They are static because the prediction is already

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722 Dynamic Branch Prediction Dynamic branch prediction schemes run-time behavior of branches to make predictions. Usually information about outcomes of previous occurrences of branches are used to predict

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 12 Mahadevan Gomathisankaran March 4, 2010 03/04/2010 Lecture 12 CSCE 4610/5610 1 Discussion: Assignment 2 03/04/2010 Lecture 12 CSCE 4610/5610 2 Increasing Fetch

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real

More information

Pipelining to Superscalar

Pipelining to Superscalar Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel

More information

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

Simultaneous Multithreading Processor

Simultaneous Multithreading Processor Simultaneous Multithreading Processor Paper presented: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor James Lue Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Dynamic Branch Prediction

Dynamic Branch Prediction #1 lec # 6 Fall 2002 9-25-2002 Dynamic Branch Prediction Dynamic branch prediction schemes are different from static mechanisms because they use the run-time behavior of branches to make predictions. Usually

More information

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor

Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor Dean M. Tullsen, Susan J. Eggers, Joel S. Emer y, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm

More information

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications Byung In Moon, Hongil Yoon, Ilgu Yun, and Sungho Kang Yonsei University, 134 Shinchon-dong, Seodaemoon-gu, Seoul

More information

floating point instruction queue integer instruction queue

floating point instruction queue integer instruction queue Submitted to the 23rd Annual International Symposium on Computer Architecture Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor Dean M. Tullsen 3,

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows

More information

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) #1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing

More information

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level

More information

A comparison of three architectures: Superscalar, Simultaneous Multithreading CPUs and Single-Chip Multiprocessor.

A comparison of three architectures: Superscalar, Simultaneous Multithreading CPUs and Single-Chip Multiprocessor. A comparison of three architectures: Superscalar, Simultaneous Multithreading CPUs and Single-Chip Multiprocessor. Recent years have seen a great deal of interest in multiple-issue machines or superscalar

More information

Computer System Architecture Final Examination Spring 2002

Computer System Architecture Final Examination Spring 2002 Computer System Architecture 6.823 Final Examination Spring 2002 Name: This is an open book, open notes exam. 180 Minutes 22 Pages Notes: Not all questions are of equal difficulty, so look over the entire

More information

" # " $ % & ' ( ) * + $ " % '* + * ' "

 #  $ % & ' ( ) * + $  % '* + * ' ! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 4

ECE 571 Advanced Microprocessor-Based Design Lecture 4 ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted

More information