Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures


Nagi N. Mekhiel
Department of Electrical and Computer Engineering, Ryerson University, Toronto, Ontario M5B 2K3

Abstract

Neither simulation results nor real system results explain the behavior of advanced computer systems across the full design spectrum. In this paper, we present simple models that explain the behavior of simultaneous multithreaded (SMT), multiprocessor (SMP), and multiprocessor-with-SMT architectures. The results of these models show that these architectures have limitations and problems when fast processors are used, and that the current single processor architecture will continue to be the target of system design for many years to come.

1 Introduction

The demand for more computing power is growing as a result of new applications in multimedia, the Internet, and telecommunications. The microarchitecture features that have been used to improve processor performance, such as superpipelining, superscalar execution, branch prediction, and out-of-order execution, cannot offer continuous performance improvement, because the cost of hiding cache misses and branch mispredictions increases with processor speed and complexity [1]. The advances in technology that have produced faster and larger chips cannot be fully utilized to extract more performance from a single processor. Meanwhile, the software used for server and desktop applications is becoming more parallel: these applications consist of multiple threads or processes that can run in parallel [1,3] on a multiprocessor system.

Multiprocessor systems have been used for many years to improve the performance of applications by executing multiple instructions in parallel [5,6]. More recently, single processors have been developed that execute multiple threads simultaneously (SMT) to make full use of the processor's resources [1,7,8]. Intel Hyper-Threading, for example, makes a single physical processor appear as two logical processors [1]. Threads can be scheduled onto the logical processors and run in parallel, as in a multiprocessor system, to improve system performance.

In this paper, we analyze the behavior of four architectures: the single processor, SMT, SMP, and SMP with SMT. We introduce simple models capable of explaining the behavior of these complicated architectures and predicting their performance. The rest of this paper is organized as follows: Section 2 explains the motivation; Section 3 presents simple models for the different architectures; Section 4 discusses the results; Section 5 concludes the paper.

2 Motivation

The computer industry depends on CAD tools to design, test, and build advanced computer systems. The accuracy and speed of these tools have made us more and more dependent on them, not only for the design itself but also for evaluating the design, and major design decisions are made according to the results of simulation. Simulation results can contain errors [4] and do not give an overall view of the behavior of the system: a simulation gives the result for only one point in the design space, the point defined by the parameters used in that run. If we need to make an optimal design decision, we must view the results of the system over the full design spectrum. Optimization based on the result at a single point is misleading and can be counterproductive.
Furthermore, we cannot optimize the performance of a system by optimizing each parameter individually. We can only optimize system performance if we understand the behavior of the system with respect to its parameters across the full design spectrum. New architectures must be evaluated not only to find their expected performance but also to understand their behavior, especially given fast-changing technology and many new applications with different characteristics. If we understand the behavior of these architectures, we can answer important questions about which architecture should be adopted and why.

We cannot answer these questions from the results of simulation, or even from the results of a real system. Simple laws of physics, like Newton's laws, are very useful because they clearly explain the behavior of objects in complicated environments. Fundamental laws that explain the behavior of systems are very valuable in evaluating design decisions and in developing future systems, and such laws cannot be obtained using CAD tools. The only way to develop them is to understand the system behavior without focusing on the details of its components. Therefore, our objective in this paper is to find some simple rules that govern the behavior of complicated architectures without getting into the details of each system. These rules are used to develop simple models that help explain the behavior of the different architectures.

3 Simple Models For The Different Architectures

To understand the behavior of important architectures, namely multithreaded, multiprocessor, and multiprocessor with multithreading, we constructed simple performance models. These models evaluate the behavior of each architecture relative to the well-known single processor architecture, which hides complicated system details. In these models, we assume that there are always enough threads to run in parallel on each architecture; this evaluates the full potential of each architecture without being limited by the application. For the single processor architecture, the threads are executed one after the other; the single processor can still use its superscalar capabilities to run multiple instructions in parallel. The SMT architecture assumes that some of the processor resources are duplicated so that threads can run in parallel, while some internal shared resources cannot support parallelism and force threads to share them, similar to Intel Hyper-Threading [1]; each thread in SMT can still use the superscalar resources to run multiple instructions in parallel. The SMP architecture assumes multiple processors running in parallel and sharing the outside resources (main memory); each processor in SMP is a single superscalar processor. The SMP with SMT architecture assumes multiple processors running in parallel and sharing the external resources, where each processor can run multiple threads simultaneously, with some internal resources duplicated and other internal resources shared.

[Figure 1: Block Diagrams of the Different Architectures: single processor; SMP with 4 processors; SMT with 4 threads; SMP with SMT (2 processors, 2 threads each).]

Figure 1 shows the block diagrams of the different architectures, which can share internal or external resources. The execution resources represent the parallel superscalar resources that use multiple function units to execute instructions of each thread in parallel. The internal shared resources represent resources shared inside the processor, such as the L2 cache; SMT and SMP with SMT share these resources among the multiple threads, and threads must wait for each other to access them. The external shared resource represents the main memory, which is shared by the threads in all architectures.

3.1 The Rules of Behavior for Different Architectures

We define the following rules for the behavior of the SMT, SMP, and SMP with SMT architectures relative to the single processor.
1. Each architecture, including the single processor, can execute multiple instructions in parallel without stalls for each thread, in an execution time T_EX, using its superscalar resources. T_EX depends on the number of execution units, the speed of the processor, and the ILP available in each thread.

2. All architectures must wait a time T_WI to access the shared resources inside the processor, such as the L1 and L2 caches. Only a portion of the instructions is subject to this waiting time.

3. All architectures must wait a time T_WO to access the shared resources outside the processor. T_WO depends on the speed of the main memory, the speed of the memory-processor bus, and the portion of instructions that must use the outside resources.

4. Parallel threads can improve the total time of an application by overlapping either the no-stall execution time or the waiting time among threads. The performance gain for parallel threads (in SMT, SMP, and SMP with SMT) comes not only from overlapping the no-stall execution time, but also from overlapping some of the waiting time of the threads with one another.

5. The total waiting time for parallel threads accessing a shared resource equals the sum of the per-thread waiting times for that resource. If we assume that each thread takes the same time to access the shared resources, the total waiting time is the number of threads multiplied by the per-thread waiting time, as in queuing theory.

3.2 The Model for the Single Processor

The performance of a single processor is given by:

T_ST = T_EX + T_WI + T_WO

where:

T_ST = the total time for running a single thread on a single processor.
T_EX = the no-stall execution time for a single thread on a single processor with superscalar execution.
T_WI = the time for a thread to use the shared resources inside the processor: T_WI = P_i x T_i, where P_i is the probability that an instruction uses the shared resources inside the processor and T_i is the time for an instruction to access them.
T_WO = the time for a thread to access the shared resources outside the processor: T_WO = P_o x T_o, where P_o is the probability that an instruction uses the shared resources outside the processor and T_o is the time for an instruction to access them.

Therefore:

T_ST = T_EX + P_i x T_i + P_o x T_o

The total time to execute N threads on a single processor is N x T_ST.

3.3 The Model for SMT

The total time for executing multiple threads on the SMT architecture is obtained relative to the single processor model by overlapping the no-stall execution time T_EX and then adding the waiting time of the N threads for the shared resources inside and outside the processor, as shown in the SMT diagram of Figure 1:

T_SMT = T_EX + N x (P_i x T_i + P_o x T_o)

Note that T_EX is not multiplied by N, because the no-stall execution time is overlapped among the threads in the SMT model.

3.4 The Model for SMP

The total time for executing multiple threads on the SMP architecture is obtained relative to the single processor model by overlapping both the no-stall execution time T_EX and the waiting time for the shared resources inside the processors (P_i x T_i), then adding the waiting time of the N threads for the shared resources outside the processors, as shown in the SMP diagram of Figure 1:

T_SMP = T_EX + P_i x T_i + N x (P_o x T_o)

Note that neither T_EX nor P_i x T_i is multiplied by N, because both are overlapped across the multiple processors in the SMP model.

3.5 The Model for SMP with SMT

The total time for executing multiple threads on the SMP with SMT architecture is obtained by overlapping the no-stall execution time inside each processor, then adding the waiting time of the M simultaneous threads for the shared resources inside each processor, and also adding the waiting time of all N threads for the shared resources outside the processors, as shown in the SMP with SMT diagram of Figure 1. M is the number of simultaneous threads inside each processor, and N is the total number of threads running on N/M SMT processors in parallel.
T_SMT.SMP = T_EX + M x (P_i x T_i) + N x (P_o x T_o)

3.6 The Specifications of the Systems

We selected the Pentium 4 processor to evaluate the models [1,9]. The SMT model assumes simultaneous multithreading with N threads (Pentium 4 Hyper-Threading supports only two threads). The SMP model assumes N processors in parallel sharing the main memory, each working in single processor mode (no Hyper-Threading). The SMP with SMT model assumes N threads running on N/M processors, where each processor supports M simultaneous threads. The specifications of these systems are:

Processor speed = 2.5 GHz.
L1 data cache = 8 Kbytes, 4-way, 64-byte blocks.
Unified L2 cache = 512 Kbytes, 8-way, 128-byte blocks, write back, access time = 7 cycles.
Memory bus = 8 bytes wide, 333 MHz.
DDR DRAM = 333 MHz, with precharge = 2 cycles, row access = 3 cycles, column access = 3 cycles.
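To make the equations concrete, the following short Python sketch encodes the four models. This is our own illustration, not code from the paper; the parameter values are those of Section 3.7 as reconstructed below, and the function names are ours. The relative performance reported in Section 4 is the single processor time divided by the architecture's total time for the same N threads.

# Minimal sketch of the four analytical models (illustrative, not from the paper).
# Times are in nanoseconds; parameter values follow Section 3.7.
T_EX = 0.1                # no-stall execution time (2.5 GHz core, ILP = 4)
P_i, T_i = 0.0375, 2.8    # internal shared resources (L2 cache)
P_o, T_o = 0.0008, 66.0   # external shared resource (main memory)

def t_single(n):
    """N threads run one after the other on a single processor."""
    return n * (T_EX + P_i * T_i + P_o * T_o)

def t_smt(n):
    """SMT: T_EX overlaps among threads; all waiting time serializes."""
    return T_EX + n * (P_i * T_i + P_o * T_o)

def t_smp(n):
    """SMP: T_EX and internal waits overlap; external waits serialize."""
    return T_EX + P_i * T_i + n * (P_o * T_o)

def t_smp_smt(n, m=2):
    """SMP with SMT: n threads on n/m processors, m threads per processor."""
    return T_EX + m * (P_i * T_i) + n * (P_o * T_o)

for n in (2, 4, 10):
    base = t_single(n)
    print(f"N={n:2d}  SMT x{base/t_smt(n):.2f}  SMP x{base/t_smp(n):.2f}  "
          f"SMP+SMT x{base/t_smp_smt(n):.2f}")
# N=2 prints SMT x1.24, i.e., the ~24% SMT gain discussed in Section 4.1.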

3.7 The Parameters of The Models

We assume the following values for the parameters of these models:

T_EX: depends on the speed of the processor (2.5 GHz, a 0.4 ns cycle) and the average ILP (we assume ILP = 4). T_EX = 0.4/4 = 0.1 ns.

T_i: we assume that most of the waiting time inside the processor occurs when instructions and data miss the L1 cache and access the L2 cache. The L2 cache has an access time of 7 cycles, so T_i = 7 x 0.4 = 2.8 ns.

P_i: depends on the miss rates of the L1 data and instruction caches. We assume the L1 caches have an average miss rate for data and instructions of approximately 3.75%, so P_i = 0.0375.

T_o: depends on the DRAM latency and the data transfer time; it is assumed to be 66 ns (DRAM access time plus bus transfer).

P_o: depends on the miss rates of the L1 and L2 caches. We assume a global miss rate (the product of the L1 and L2 miss rates, M1 x M2) of approximately 0.08%, so P_o = 0.0008.

All models use the same parameters, which makes the results less sensitive to the exact values of these parameters: the models evaluate performance relative to a single processor that uses the same parameters. The parameters were selected based on typical values of a real system (Pentium 4).

4 The Results of The Models

The results below are obtained using the models and the above parameters.

4.1 Comparing the Results of the Models to the Results of a Real System

Figure 2 shows the performance improvement of the SMT model next to the results of an Intel Xeon processor running online transaction processing [1]. Online transaction applications have enough parallel threads that system performance is not limited by the number of threads. For N = 2 threads, both the Intel system and the SMT model give the same performance improvement of about 24% over running the 2 threads on a single processor.

[Figure 2: Comparing Intel Xeon Results (Hyper-Threading, 2 threads) and the Results of the SMT Model.]

Figure 3 shows the performance gain of the SMP model next to the results of Intel SMP. Intel SMP with 2 processors (2 threads in total) and the SMP model with 2 processors (2 threads) show the same performance gain of about 55% over a single processor running the 2 threads one after the other. For 4 processors, the Intel SMP system improves performance by 160%, while the SMP model shows an improvement of 150%.

[Figure 3: Comparing Intel Xeon Multiprocessor Results and the Results of the SMP Model (2 and 4 processors).]

Figure 4 shows the performance gain of the SMP with SMT model next to the results of Intel SMP with Hyper-Threading. Combining SMP with SMT using 2 processors (4 threads) improves the performance of the Intel system by 100%, and the SMP with SMT model shows the same improvement of 100%; these results are relative to single processor performance. For 4 processors (8 threads), the SMP with SMT model gives the same performance improvement (about 170%) as Intel SMP with Hyper-Threading.

[Figure 4: Comparing Intel Xeon Multiprocessor with Hyper-Threading Results and the Results of the SMP with SMT Model (2 processors/4 threads and 4 processors/8 threads).]

The models (single, SMT, SMP, and SMP with SMT) accurately predicted real system performance under all of these conditions even though they are very simple. The models are accurate because they are developed relative to the single processor.

4.2 Predicting Performance Improvements of the Different Architectures

Figure 5 shows the performance improvements of the three architectures predicted by our models relative to the single processor model.

[Figure 5: Performance Improvements of the Different Architectures versus N (processors/threads).]

The performance improvement of the SMT model is the lowest, reaching a maximum value of about 1.5 for 10 threads. This is a modest gain for SMT and indicates that we cannot rely on SMT to improve the performance of a large number of threads. The waiting time to access the shared resources inside and outside the processor limits the performance gain of SMT; the sketch below sweeps the models over N to show these trends.
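The curves in Figure 5 follow directly from the three model equations. The following sketch, again our own illustration built on the reconstructed parameters (with two threads per processor for the SMP with SMT case), sweeps N from 1 to 10 and prints the predicted speedup of each architecture over the single processor.

# Sweep the models over N, in the spirit of Figure 5 (illustrative).
T_EX, W_I, W_O = 0.1, 0.105, 0.0528   # ns: no-stall time, internal wait, external wait

t_single = lambda k: k * (T_EX + W_I + W_O)   # k threads back to back
for n in range(1, 11):
    smt = t_single(n) / (T_EX + n * (W_I + W_O))
    smp = t_single(n) / (T_EX + W_I + n * W_O)
    # N processors with 2 SMT threads each, so 2N threads in total:
    smp_smt = t_single(2 * n) / (T_EX + 2 * W_I + 2 * n * W_O)
    print(f"N={n:2d}  SMT {smt:4.2f}  SMP {smp:4.2f}  SMP+SMT {smp_smt:4.2f}")
# At N = 10 this yields about 1.5 (SMT), 3.5 (SMP), and 3.8 (SMP with SMT),
# and all three curves flatten as N grows, matching the saturation
# described in the text.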
The performance improvement of the SMP model increases with the number of processors, then reaches diminishing returns at N = 10. SMP gives a performance improvement of about 3.5 for 10 processors; this limit is caused by the waiting time for accessing the external shared resource (main memory). The SMP with SMT model gives the best performance improvement, 3.7 for N = 10 processors running 20 threads, but it too reaches diminishing returns at N = 10. Its behavior is exactly the same as that of SMP except that it provides an extra, fixed performance gain from using SMT (two threads) inside each processor, so we do not include this model in the results that follow.

4.3 The Performance Improvements of SMT and SMP with Faster Processors

[Figure 6: Performance Improvements of SMT and SMP with Faster Processors.]

Figure 6 shows the performance improvements of the SMT and SMP models at 2.5 GHz and at 5 GHz, each relative to a single processor model of the same speed. The performance improvement of both SMT and SMP decreases as processor speed increases: the 5 GHz SMT and SMP architectures give lower improvements relative to a 5 GHz single processor than the 2.5 GHz architectures give relative to a 2.5 GHz single processor. This means that neither SMT nor SMP will be effective in improving the performance of parallel threads on future high-speed processors. It should be noted that although the relative improvement of the faster SMT and SMP systems decreases, the absolute time to finish the application still improves with faster processors, if only slightly. For example, SMT takes 1.67 ns to finish 10 parallel threads on the 2.5 GHz processor compared to 1.62 ns on the 5 GHz processor, clearly showing that increasing processor speed does not help the performance improvement of SMT. Likewise, SMP takes 0.73 ns to finish 10 parallel threads on the 2.5 GHz processors and 0.68 ns on the 5 GHz processors, indicating that SMP will also not be effective in exploiting faster processors.
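The asymmetry is easy to see in the model: speeding up the core shrinks only the no-stall time T_EX, while the waiting times are unchanged. The snippet below, our own illustration under that assumption (which matches the times quoted above), compares the two clock rates at N = 10.

# Why faster cores barely help SMT and SMP (illustrative): halve T_EX
# (2.5 GHz -> 5 GHz) while the internal and external waits stay fixed.
W_I, W_O, N = 0.105, 0.0528, 10   # ns, reconstructed Section 3.7 values

for label, t_ex in (("2.5 GHz", 0.1), ("5 GHz", 0.05)):
    single = N * (t_ex + W_I + W_O)
    smt = t_ex + N * (W_I + W_O)
    smp = t_ex + W_I + N * W_O
    print(f"{label}: SMT {smt:.2f} ns (x{single/smt:.2f})  "
          f"SMP {smp:.2f} ns (x{single/smp:.2f})")
# Prints SMT 1.68 ns / SMP 0.73 ns at 2.5 GHz and SMT 1.63 ns / SMP 0.68 ns
# at 5 GHz: absolute times barely improve, and the speedup over the
# (equally faster) single processor shrinks.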

4.4 The Performance Improvements of SMT and SMP with Faster L2 Cache

[Figure 7: Performance Improvements of SMT and SMP with Faster L2 Cache.]

Figure 7 shows the performance improvements of the SMT and SMP models when using a faster L2 cache (L2 access time of 1.4 ns instead of 2.8 ns). The performance improvement of SMT increases with the faster L2 cache: for N = 2 threads, the total time for SMT is 0.41 ns, compared to 0.31 ns with the faster L2, a 32% improvement in performance. The performance improvement of SMP relative to the single processor, however, decreases as L2 speed increases. This means that neither a faster processor core nor a faster L2 helps the SMP system: the total time for SMP at N = 10 is 0.73 ns, compared to 0.68 ns with the faster L2, an improvement of only 7% for a 100% improvement in L2 speed. We must look for other ways to improve the performance of SMP.

4.5 The Performance Improvements of SMT and SMP with Faster DRAM

[Figure 8: Performance Improvements of SMT and SMP with Faster DRAM.]

Figure 8 shows the performance improvements of SMT and SMP with a 66 ns DRAM access time and with a 33 ns DRAM access time. The SMT system gains only modestly from improving the DRAM access time, while the SMP system gains significantly: its performance improvement increases by about 40% for N = 10. This means that only faster DRAM benefits the performance improvement of SMP, whereas both a faster L2 cache and faster DRAM benefit SMT. For the single processor, any improvement in the speed of the core, the L2, or the DRAM helps its performance. This means that the single processor architecture will gain the most from any improvement in technology and is the most likely to remain the target architecture for many years to come.
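A small sensitivity sketch makes this split concrete. It is our own illustration under the reconstructed Section 3.7 values: halving the internal wait models the faster L2 of Section 4.4, and halving the external wait models the faster DRAM of Section 4.5.

# Which memory level matters for which architecture (illustrative).
T_EX, N = 0.1, 10
cases = [("baseline",    0.105,  0.0528),
         ("faster L2",   0.0525, 0.0528),   # T_i: 2.8 ns -> 1.4 ns
         ("faster DRAM", 0.105,  0.0264)]   # T_o: 66 ns -> 33 ns

for name, w_i, w_o in cases:
    single = N * (T_EX + w_i + w_o)
    smt = T_EX + N * (w_i + w_o)
    smp = T_EX + w_i + N * w_o
    print(f"{name:12s} SMT x{single/smt:.2f}   SMP x{single/smp:.2f}")
# Halving the L2 time raises the SMT speedup (1.54 -> 1.78) but lowers
# SMP's relative gain (3.52 -> 3.02); halving the DRAM time lifts SMP
# from 3.52 to about 4.9, matching Sections 4.4 and 4.5.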

4.6 The Performance Improvements of SMT and SMP with Slower Processors

[Figure 9: Performance Improvements of SMT and SMP with Slower Processors.]

Figure 9 shows the performance improvements of the SMT and SMP architectures using 2.5 GHz processors and using 1.25 GHz processors, each compared to the performance of a single processor running at the same speed. The performance improvement of both SMT and SMP increases significantly with the slower processors, because slower processors have more execution time that can be overlapped among threads to benefit from the parallelism in SMT or SMP. For example, SMP using 10 processors running at 1.25 GHz has a performance gain of 5 over the single processor, but only 3.5 times the single processor performance when the 10 SMP processors run at 2.5 GHz. The time to finish 10 threads in SMP with 1.25 GHz processors is 0.83 ns, compared to 0.73 ns with 2.5 GHz processors: only a 13% loss in performance for processors running at half the speed. The time to finish 10 threads in SMT with 1.25 GHz processors is 1.77 ns, compared to 1.67 ns with 2.5 GHz processors. SMT and SMP therefore scale better with slower processors, which also cost less to implement and have lower power consumption.

5 Conclusions

We have developed simple models that explain the behavior of modern architectures such as SMT, SMP, and SMP with SMT without the need to include complicated system details. These models predict the performance gain relative to a single processor and can be used to make design decisions regarding the future of these architectures. The results of the models show that the performance gains of these architectures are limited when fast processors are used, and that the architectures are better suited to slower processors. SMP gives the best performance of all the architectures, especially when it uses fast main memory. The results also show that the performance of a single processor benefits the most from any improvement in technology, so the single processor will likely remain the target architecture for many years to come.

References

[1] Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, and Michael Upton, Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, Q1 2002.

[2] D. Burger, J. R. Goodman, and A. Kagi, Memory Bandwidth Limitations of Future Microprocessors, Proc. 23rd Annual International Symposium on Computer Architecture, ACM Press, New York, 1996.

[3] J. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Francisco, CA, 2002.

[4] R. Desikan, D. Burger, and S. W. Keckler, Measuring Experimental Error in Microprocessor Simulation, Proc. 28th Annual International Symposium on Computer Architecture, June 30 to July 4, 2001, Gothenburg, Sweden.

[5] A. Agarwal, B. H. Lim, D. Kranz, and J. Kubiatowicz, APRIL: A Processor Architecture for Multiprocessing, Proc. 17th Annual International Symposium on Computer Architecture, May 1990.

[6] L. Hammond, B. Nayfeh, and K. Olukotun, A Single-Chip Multiprocessor, IEEE Computer, 30(9), pages 79-85, September 1997.

[7] D. Tullsen, S. Eggers, and H. Levy, Simultaneous Multithreading: Maximizing On-Chip Parallelism, Proc. 22nd Annual International Symposium on Computer Architecture, June 1995.

[8] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen, Simultaneous Multithreading: A Platform for Next-Generation Processors, IEEE Micro, 17(5), pages 12-19, October 1997.

[9] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, The Microarchitecture of the Pentium 4 Processor, Intel Technology Journal, 5(1), February 2001.
