Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures
Nagi N. Mekhiel
Department of Electrical and Computer Engineering
Ryerson University, Toronto, Ontario M5B 2K3

Abstract

Neither simulation results nor real system results explain the behavior of advanced computer systems across the full design spectrum. In this paper, we present simple models that explain the behavior of simultaneous multithreaded (SMT), multiprocessor (SMP), and multiprocessor with simultaneous multithreaded architectures. The results of these models show that there are limitations and problems with these architectures when using fast processors, and that the current single processor architecture will continue to be the target of system design for many years to come.

1 Introduction

The demand for more computing power is growing as a result of new applications in multimedia, the Internet, and telecommunications. The microarchitecture features that have been used to improve processor performance, such as superpipelining, superscalar execution, branch prediction, and out-of-order execution, cannot offer continuous performance improvement, because the cost of hiding cache misses and branch mispredictions increases with processor speed and complexity [1]. The advancements in technology that have produced faster and larger chips cannot be fully utilized to extract more performance from a single processor. Meanwhile, the software used for server and desktop applications is becoming more parallel. These applications consist of multiple threads or processes that can run in parallel [1,3] on a multiprocessor system. Multiprocessor systems have been used for many years to improve the performance of applications by executing multiple instructions in parallel [5,6]. Recently, single processors have been developed that execute multiple threads simultaneously (SMT) to make full use of the processor's resources [1,7,8].
Intel Hyper-Threading makes a single physical processor appear as two logical processors [1]. Threads can be scheduled onto the logical processors and run in parallel, as in a multiprocessor system, to improve system performance. In this paper, we analyze the behavior of the following architectures: single processor, SMT, SMP, and SMP with SMT. We introduce simple models capable of explaining the behavior of these complicated architectures and predicting their performance. The rest of this paper is organized as follows: Section 2 explains the motivation; Section 3 presents simple models for the different architectures; Section 4 discusses the results; Section 5 concludes this paper.

2 Motivation

The computer industry has depended on CAD tools to design, test, and build advanced computer systems. The accuracy and speed of these tools have made us more and more dependent on them, not only for the design but also for evaluating the design. Major design decisions are made according to the results of simulation. Simulation results might contain errors [4] and do not give an overall view of the behavior of the system. Simulation gives results for only one point in the design space, determined by the parameters used in that run. If we need to make an optimal design decision, we must view the behavior of the system across the full design spectrum. Optimization based on the results at a single point is misleading and can be counterproductive. Furthermore, we cannot optimize the performance of a system by optimizing the performance of each individual parameter. We can only optimize system performance if we understand the behavior of the system with respect to its parameters across the full design spectrum. New architectures must be evaluated not only to find their expected performance but also to understand their behavior, especially given fast-changing technology and many new applications with different characteristics.
If we understand the behavior of these architectures, we can answer important questions regarding which architecture should be adopted and why. We cannot answer these questions from the results of simulation or even from the results of a real system. Simple laws of physics, like Newton's laws, are very useful because they clearly explain the behavior of objects in complicated environments. Fundamental laws that explain the behavior of systems are very valuable in evaluating design decisions and in developing future systems. These laws cannot be obtained using CAD tools. The only way to develop them is to understand the system's behavior without focusing on the details of its components. Therefore, our objective in this paper is to find simple rules that govern the behavior of complicated architectures without getting into the details of each system. These rules will be used to develop simple models that help explain the behavior of the different architectures.

3 Simple Models for the Different Architectures

To understand the behavior of important architectures like simultaneous multithreaded, multiprocessor, and multiprocessor with simultaneous multithreading, we constructed simple performance models. These models evaluate the behavior of each architecture relative to the well-known single processor architecture, hiding complicated system details. In these models, we assume that there are enough threads to run in parallel on each architecture. This lets us evaluate the full potential of each architecture without being limited by the application. For the single processor architecture, these threads are executed one after the other. The single processor can benefit from superscalar capabilities to run multiple instructions in parallel. The SMT architecture assumes that some of the processor resources are duplicated so that threads can run in parallel; however, there are some internal shared resources which cannot support parallelism and force threads to share them, similar to Intel Hyper-Threading [1].
Each thread in SMT can still use superscalar execution to run multiple instructions in parallel. The SMP architecture assumes multiple processors are used in parallel, sharing the outside resources (main memory). Each processor in SMP is a single processor using superscalar execution to run instructions in parallel. The SMP with SMT architecture assumes that multiple processors are used in parallel sharing the external resources, but each processor is capable of running multiple threads simultaneously, with some internal resources duplicated and other internal resources shared.

Figure 1: Block Diagrams of the Different Architectures (single processor; SMP with 4 processors; SMT with 4 threads; SMP with SMT, 2 processors with 2 threads each)

Figure 1 shows the block diagrams of the different architectures. These architectures can share internal or external resources. The execution block represents parallel superscalar resources that use multiple function units to execute parallel instructions in each thread. The internal shared resource represents resources such as the L2 cache inside the processor; SMT and SMP with SMT share these resources among the multiple threads, and threads must wait for each other to access them. The external shared resource represents the main memory, which is shared by the threads in all architectures.

3.1 The Rules of Behavior for the Different Architectures

We define the following rules for the behavior of the SMT, SMP, and SMP with SMT architectures relative to the single processor.

1. Each architecture, including the single processor, can execute multiple instructions in parallel without stalls; using superscalar execution, each thread has a no-stall execution time T_EX. T_EX depends on the number of execution units, the speed of the processor, and the ILP in each thread.

2. All architectures must wait a time T_WI to access the shared resources inside the processor, such as the L1 and L2 caches. Only a portion of the instructions are subject to this waiting time.
3. All architectures must wait a time T_WO to access the external shared resources outside the processor. T_WO depends on the speed of the main memory, the memory-processor bus, and the portion of instructions that must use the outside resources.

4. Parallel threads can improve the total time of an application by overlapping either the no-stall execution time or the waiting time among threads. The performance gain for parallel threads (in SMT, SMP, and SMP with SMT) comes not only from overlapping the no-stall execution time, but also from overlapping some of the waiting time of each thread with one another.

5. The total waiting time for parallel threads accessing a shared resource is equal to the number of threads multiplied by the waiting time for each thread to access that resource. If we assume that each thread takes the same time in accessing the shared resources, then the total waiting time is the number of threads multiplied by the waiting time to access the shared resources (queueing theory).

3.2 The Model for the Single Processor

The performance of a single processor is given by the following model:

T_ST = T_EX + T_WI + T_WO

T_ST = the total time for running a single thread on a single processor.
T_EX = the no-stall execution time for a single thread on a single processor with superscalar execution.
T_WI = the time for a thread to use the shared resources inside the processor. T_WI = P_i * T_i, where P_i is the probability that an instruction in a thread uses the shared resources inside the processor, and T_i is the time for an instruction to access those resources.
T_WO = the time for a thread to access the shared resources outside the processor. T_WO = P_o * T_o, where P_o is the probability that an instruction in a thread uses the shared resources outside the processor, and T_o is the time for an instruction to access those resources.
T_ST = T_EX + P_i * T_i + P_o * T_o

The total time to execute N threads on a single processor is N * T_ST.

3.3 The Model for SMT

The total time for executing multiple threads on the SMT architecture is obtained relative to the single processor model by overlapping the no-stall execution time T_EX, then adding the waiting time of N threads for the shared resources inside and outside the processor, as shown in the SMT model of Figure 1.

T_SMT = T_EX + N * (P_i * T_i + P_o * T_o)

Note that T_EX is not multiplied by N, due to the overlapping of the no-stall execution time among threads in the SMT model.

3.4 The Model for SMP

The total time for executing multiple threads on the SMP architecture is obtained relative to the single processor model by overlapping both the no-stall execution time T_EX and the waiting time for the shared resources inside the processors (P_i * T_i), then adding the waiting time of N threads for the shared resources outside the processor, as shown in the SMP model of Figure 1.

T_SMP = T_EX + P_i * T_i + N * (P_o * T_o)

Note that both T_EX and P_i * T_i are not multiplied by N, because they are overlapped among the multiple processors in the SMP model.

3.5 The Model for SMP with SMT

The total time for executing multiple threads on the SMP with SMT architecture is obtained by overlapping the no-stall execution time inside each processor, then adding the waiting time of M simultaneous threads for the shared resources inside each processor, and also adding the waiting time of all N threads for the shared resources outside the processors, as shown in the SMP with SMT model of Figure 1. M is the number of simultaneous threads inside each processor, and N is the number of threads that use N/M SMT processors in parallel.

T_SMT.SMP = T_EX + M * (P_i * T_i) + N * (P_o * T_o)

3.6 The Specifications of the Systems

We selected the Pentium 4 processor to evaluate the models [1,9]. The SMT model assumes simultaneous multithreading with N threads (Pentium 4 Hyper-Threading uses only two threads).
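The four models above are simple enough to express directly in code. The sketch below is a minimal rendering of the formulas in plain Python; the function and parameter names are chosen here for readability and are not from the paper's notation:

```python
# Simple analytical models from Section 3, all relative to a single processor.
# t_ex  : no-stall execution time per instruction (superscalar)
# p_i, t_i : probability and latency of an internal shared-resource access (L2)
# p_o, t_o : probability and latency of an external shared-resource access (DRAM)

def t_single(t_ex, p_i, t_i, p_o, t_o):
    """Average time per instruction for one thread on a single processor."""
    return t_ex + p_i * t_i + p_o * t_o

def t_smt(n, t_ex, p_i, t_i, p_o, t_o):
    """N simultaneous threads: execution overlaps, all waiting serializes."""
    return t_ex + n * (p_i * t_i + p_o * t_o)

def t_smp(n, t_ex, p_i, t_i, p_o, t_o):
    """N processors: execution and internal waits overlap, memory waits serialize."""
    return t_ex + p_i * t_i + n * (p_o * t_o)

def t_smp_smt(n, m, t_ex, p_i, t_i, p_o, t_o):
    """N threads on N/M SMT processors, M simultaneous threads per processor."""
    return t_ex + m * (p_i * t_i) + n * (p_o * t_o)
```

The relative performance of an architecture for N threads is then N * t_single(...) divided by the corresponding total time, since the single processor runs the N threads one after the other.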
The SMP model assumes N processors in parallel sharing the main memory, with each processor working in single processor mode (no Hyper-Threading). The SMP with SMT model assumes N threads using N/M processors, where each processor supports M simultaneous threads. The specifications of these systems are: processor speed = 2.5 GHz; L1 data cache = 8 Kbytes, 4-way, 64-byte block; unified L2 cache = 512 Kbytes, 8-way, 128-byte block, write back, access time = 7 cycles; memory bus = 8 bytes wide at 333 MHz; DDR DRAM speed = 333 MHz, with precharge = 2 cycles, row access = 3 cycles, column access = 3 cycles.
3.7 The Parameters of the Models

We assume the following values for the parameters of these models:

T_EX: depends on the speed of the processor (2.5 GHz, a 0.4 ns cycle) and the average ILP (we assume ILP = 4). T_EX = 0.4/4 = 0.1 ns.

T_i: we assume that most of the waiting time inside the processor occurs when instructions and data miss the L1 cache and access the L2 cache. The L2 cache has an access time of 7 cycles, so T_i = 7 * 0.4 = 2.8 ns.

P_i: depends on the miss rates of the L1 data and instruction caches. We assume the L1 caches have an average miss rate for data and instructions of approximately 3.75%, so P_i = 0.0375.

T_o: depends on the DRAM latency and data transfer time. T_o is assumed to be 66 ns (DRAM access time plus bus transfer).

P_o: depends on the miss rates of the L1 and L2 caches. We assume the global miss rate (M1 * M2) is 0.08%, so P_o = 0.0008.

All models use the same parameters, making the results less sensitive to the values of these parameters. The models evaluate performance relative to a single processor with the same parameters. The parameters were selected based on typical values of a real system (Pentium 4).

4 The Results of the Models

The results are obtained using the models and the above parameters.

4.1 Comparing the Results of the Models to the Results of a Real System

Figure 2 shows the performance improvements of the SMT model and the results of an Intel Xeon processor running online transaction processing [1]. Online transaction processing applications have enough parallel threads that system performance is not limited by the number of threads. The results show that with N=2 threads for SMT, both the Intel system and the SMT model give the same performance improvement of about 24% compared to running 2 threads on a single processor. Figure 3 shows the performance gain of the SMP model and the results of Intel SMP.
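As a sanity check on the arithmetic in this section, the parameter values can be derived in a few lines of Python, using the values assumed above:

```python
# Deriving the model parameters for the assumed 2.5 GHz Pentium 4-class system.
CYCLE_NS = 1.0 / 2.5   # 0.4 ns cycle time at 2.5 GHz
ILP = 4                # assumed average instructions issued per cycle

t_ex = CYCLE_NS / ILP  # no-stall time per instruction: 0.4 / 4 = 0.1 ns
t_i = 7 * CYCLE_NS     # L2 access, 7 cycles: 2.8 ns
p_i = 0.0375           # ~3.75% average L1 miss rate
t_o = 66.0             # DRAM access time plus bus transfer, in ns
p_o = 0.0008           # global miss rate M1 * M2, ~0.08%

# Average time per instruction for one thread on a single processor:
t_single = t_ex + p_i * t_i + p_o * t_o
print(round(t_single, 4))  # about 0.2578 ns
```

This single-thread figure of roughly 0.26 ns per instruction is the baseline against which all of the results below are normalized.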
Intel SMP with 2 processors (2 threads total) and the SMP model with 2 processors (2 threads) have the same performance gain of about 55% compared to a single processor running 2 threads one after the other.

Figure 2: Comparing Intel Xeon Results and the Results of the SMT Model

For 4 processors, the Intel SMP system improves performance by 160%; the SMP model shows a performance improvement of 150%. Figure 4 shows the performance gain of the SMP with SMT model and the results of Intel SMP with Hyper-Threading. Combining SMP with SMT using 2 processors (4 threads) improves the performance of the Intel system by 100%, and the SMP with SMT model shows the same performance improvement of 100%. These results are obtained relative to single processor performance. For 4 processors (8 threads), the SMP with SMT model gives the same performance improvement (about 170%) as Intel SMP with Hyper-Threading. The different models (single, SMT, SMP, and SMP with SMT) accurately predicted real system performance under all these conditions, although they are very simple. These models are accurate because they are developed relative to the single processor.

4.2 Predicting Performance Improvements of the Different Architectures

Figure 5 shows the performance improvements of the three architectures using our models, compared to the single processor model. The performance improvement of the SMT model is the lowest, reaching a maximum of about 1.5 at 10 threads. This is a modest gain for SMT and indicates that we cannot rely on SMT to improve the performance of large numbers of threads. The waiting time to access the shared resources inside and outside the processor limits the performance gain of SMT. The performance improvement of the SMP model increases with the number of processors.
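The comparisons in this subsection follow directly from the models. A small script, assuming the Section 3.7 parameter values, reproduces the roughly 24% SMT gain at 2 threads and the roughly 150% SMP gain at 4 processors:

```python
# Speedup of N threads on each architecture over N threads run
# back-to-back on one processor, using the Section 3.7 parameters.
t_ex, p_i, t_i, p_o, t_o = 0.1, 0.0375, 2.8, 0.0008, 66.0

t_st = t_ex + p_i * t_i + p_o * t_o   # single-thread time, ~0.2578 ns

def smt_gain(n):
    return n * t_st / (t_ex + n * (p_i * t_i + p_o * t_o))

def smp_gain(n):
    return n * t_st / (t_ex + p_i * t_i + n * p_o * t_o)

print(f"SMT, 2 threads:    {smt_gain(2):.2f}x")   # ~1.24x, i.e. ~24% improvement
print(f"SMP, 4 processors: {smp_gain(4):.2f}x")   # ~2.48x, i.e. ~150% improvement
```

Both values match the model results quoted in the text against the Intel measurements.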
Figure 3: Comparing Intel Xeon Results and the Results of the SMP Model

Figure 4: Comparing Intel Xeon Results and the Results of the SMP with SMT Model

Figure 5: Performance Improvements of the Different Architectures

It then reaches diminishing returns at N = 10. SMP gives a performance improvement of about 3.5 for 10 processors. This limitation is caused by the waiting time for accessing the external resource (main memory). The SMP with SMT architecture model gives the best performance improvement, 3.7 for N=10 processors using 20 threads, but it also reaches diminishing returns at N = 10. The behavior of this model is exactly the same as the behavior of SMP, except that it provides an extra fixed performance gain obtained by using SMT (two threads) inside each processor. We do not include this model in the following results.

4.3 The Performance Improvements of SMT and SMP with Faster Processors

Figure 6 shows the performance improvements of the SMT and SMP models at 2.5 GHz and at 5 GHz, each compared to a single processor model at the same speed. The performance improvement of SMT and SMP decreases as the processor speed increases. The 5 GHz SMT and SMP architectures give lower performance improvements relative to a 5 GHz single processor than the 2.5 GHz architectures give relative to a 2.5 GHz single processor. This means that neither SMT nor SMP will be effective in improving the performance of parallel threads on future high speed processors. It should be noted that although the relative performance improvement of faster SMT and SMP systems decreases, the absolute time to finish the applications still improves with faster processors.
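The saturation behavior described above can be reproduced by sweeping N. The sketch below uses the same assumed parameter values; for the hybrid, each of the N processors runs M=2 simultaneous threads, so N processors serve 2N threads:

```python
# Relative gain over a single processor as N grows,
# using the Section 3.7 parameter values.
t_ex, p_i, t_i, p_o, t_o = 0.1, 0.0375, 2.8, 0.0008, 66.0
t_st = t_ex + p_i * t_i + p_o * t_o   # single-thread time

for n in (2, 4, 10):
    smt = n * t_st / (t_ex + n * (p_i * t_i + p_o * t_o))
    smp = n * t_st / (t_ex + p_i * t_i + n * p_o * t_o)
    # SMP with SMT: n processors, 2 threads each (2n threads total)
    hybrid = 2 * n * t_st / (t_ex + 2 * p_i * t_i + 2 * n * p_o * t_o)
    print(f"N={n:2d}: SMT {smt:.2f}x, SMP {smp:.2f}x, SMP+SMT {hybrid:.2f}x")
```

The N=10 row shows the flattening the text describes: SMT saturates early because every internal and external wait is multiplied by N, while SMP and the hybrid saturate once the serialized memory waits dominate.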
For example, SMT takes 1.67 ns to finish 10 parallel threads on the 2.5 GHz processor compared to 1.62 ns on the 5 GHz processor, clearly showing that increasing processor speed does not help the performance improvement of SMT. Likewise, SMP takes 0.73 ns to finish 10 parallel threads on the 2.5 GHz processors and 0.68 ns on the 5 GHz processors, indicating that SMP will also not be effective in improving the performance of faster processors.

Figure 6: Performance Improvements of SMT and SMP with Faster Processors

4.4 The Performance Improvements of SMT and SMP with Faster L2 Cache

Figure 7 shows the performance improvements of the SMT and SMP models when using a faster L2 cache (access time = 1.4 ns compared to 2.8 ns). The performance improvement of SMT increases with the faster L2 cache. For N=2 threads, the total time for SMT is 0.41 ns, compared to 0.31 ns with the faster L2. This is a 32% improvement in performance. The performance improvement of SMP decreases as L2 speed increases. This means that neither a faster processor core nor a faster L2 helps the SMP system. The total time for SMP at N=10 is 0.73 ns, compared to 0.68 ns with the faster L2: only a 7% improvement in total time for a 100% improvement in L2 speed. We must look for other ways to improve the performance of SMP.

Figure 7: Performance Improvements of SMT and SMP with Faster L2 Cache

4.5 The Performance Improvements of SMT and SMP with Faster DRAM

Figure 8 shows the performance improvements of SMT and SMP with a 66 ns DRAM access time and with a 33 ns DRAM access time. The SMT system gains modestly from improving the DRAM access time. The SMP system gains significantly from reducing the DRAM access time: the performance improvement of SMP increases by about 40% for N=10.

Figure 8: Performance Improvements of SMT and SMP with Faster DRAM
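The two sensitivity experiments in Sections 4.4 and 4.5 amount to re-evaluating the N=10 totals with one latency halved. A short sketch, again assuming the Section 3.7 parameter values, makes the asymmetry visible:

```python
# Total time for 10 threads under SMT and SMP as L2 and DRAM latencies vary.
t_ex, p_i, p_o = 0.1, 0.0375, 0.0008

def totals(t_i, t_o, n=10):
    t_smt = t_ex + n * (p_i * t_i + p_o * t_o)
    t_smp = t_ex + p_i * t_i + n * p_o * t_o
    return t_smt, t_smp

print(totals(t_i=2.8, t_o=66.0))  # baseline: SMT ~1.68 ns, SMP ~0.73 ns
print(totals(t_i=1.4, t_o=66.0))  # faster L2: SMT drops to ~1.15 ns, SMP barely moves
print(totals(t_i=2.8, t_o=33.0))  # faster DRAM: SMP drops to ~0.47 ns, SMT changes little
```

Halving the L2 latency mainly helps SMT, because internal waits are multiplied by N in the SMT model; halving the DRAM latency mainly helps SMP, because only external waits are multiplied by N there. This mirrors the conclusions drawn above.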
This means that only a faster DRAM can improve SMP performance, while both a faster L2 cache and a faster DRAM improve SMT performance. For the single processor, any improvement in the speed of the core, the L2, or the DRAM helps its performance. This means that the single processor architecture will gain the most from any improvement in technology, and is the most likely target architecture for many years to come.

4.6 The Performance Improvements of SMT and SMP with Slower Processors

Figure 9 shows the performance improvements of the SMT and SMP architectures using a 2.5 GHz processor and a 1.25 GHz processor, each compared to a single processor at the same speed.

Figure 9: Performance Improvements of SMT and SMP with Slower Processors

The performance improvement of both SMT and SMP increases significantly with the slower processors. This is because slower processors have more execution time that can be overlapped among threads, benefiting from the parallelism in SMT or SMP. For example, SMP with 10 processors running at 1.25 GHz has a performance gain of about 4.3 over the single processor, compared to only about 3.5 when SMP uses 10 processors at 2.5 GHz. The time to finish 10 threads on SMP with 1.25 GHz processors is 0.83 ns, compared to 0.73 ns with 2.5 GHz processors (only a 13% reduction in performance for a 50% reduction in processor speed). The time to finish 10 threads on SMT with 1.25 GHz processors is 1.77 ns, compared to 1.67 ns with 2.5 GHz processors. SMT and SMP scale better with slower processors, which cost less to implement and have lower power consumption.

5 Conclusions

We have developed simple models that explain the behavior of modern architectures like SMT, SMP, and SMP with SMT without the need to include complicated system details. These models predict the performance gain relative to a single processor and can be used to make design decisions regarding the future of these architectures. The results show that the performance gains of these architectures are limited when using faster processors and that they are more suitable for slower processors. SMP gives the best performance of all the architectures, especially when it uses a fast main memory. The results also show that the performance of a single processor benefits the most from any improvement in technology, and that the single processor is likely to remain the target architecture for many years to come.

References

[1] Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J.
Alan Miller, and Michael Upton, "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Q1 2002.

[2] D. Burger, J. R. Goodman, and A. Kagi, "Memory Bandwidth Limitations of Future Microprocessors," Proc. 23rd Annual International Symposium on Computer Architecture, ACM Press, New York, 1996.

[3] J. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Francisco, CA, 2002.

[4] R. Desikan, D. Burger, and S. W. Keckler, "Measuring Experimental Error in Microprocessor Simulation," 28th Annual International Symposium on Computer Architecture, June-July 2001, Göteborg, Sweden.

[5] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: A Processor Architecture for Multiprocessing," Proc. 17th Annual International Symposium on Computer Architecture, May 1990.

[6] L. Hammond, B. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor," IEEE Computer, 30(9), pages 79-85, September 1997.

[7] D. Tullsen, S. Eggers, and H. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Annual International Symposium on Computer Architecture, June 1995.

[8] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, 17(5), pages 12-19, September/October 1997.

[9] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 5(1), February 2001.
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationSimultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More information45-year CPU Evolution: 1 Law -2 Equations
4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there
More informationExploring Efficient SMT Branch Predictor Design
Exploring Efficient SMT Branch Predictor Design Matt Ramsay, Chris Feucht & Mikko H. Lipasti ramsay@ece.wisc.edu, feuchtc@cae.wisc.edu, mikko@engr.wisc.edu Department of Electrical & Computer Engineering
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationApproximate Performance Evaluation of Multi Threaded Distributed Memory Architectures
5-th Annual Performance Engineering Workshop; Bristol, UK, July 22 23, 999. c 999 by W.M. Zuberek. All rights reserved. Approximate Performance Evaluation of Multi Threaded Distributed Architectures W.M.
More informationBeyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji
Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of
More informationComputer Architecture!
Informatics 3 Computer Architecture! Dr. Vijay Nagarajan and Prof. Nigel Topham! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationComputer Architecture
Computer Architecture Mehran Rezaei m.rezaei@eng.ui.ac.ir Welcome Office Hours: TBA Office: Eng-Building, Last Floor, Room 344 Tel: 0313 793 4533 Course Web Site: eng.ui.ac.ir/~m.rezaei/architecture/index.html
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationPull based Migration of Real-Time Tasks in Multi-Core Processors
Pull based Migration of Real-Time Tasks in Multi-Core Processors 1. Problem Description The complexity of uniprocessor design attempting to extract instruction level parallelism has motivated the computer
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More information15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationMain Memory Supporting Caches
Main Memory Supporting Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width clocked bus Bus clock is typically slower than CPU clock Cache Issues 1 Example cache block read
More informationFundamentals of Computer Systems
Fundamentals of Computer Systems Caches Stephen A. Edwards Columbia University Summer 217 Illustrations Copyright 27 Elsevier Computer Systems Performance depends on which is slowest: the processor or
More informationCS377P Programming for Performance Multicore Performance Multithreading
CS377P Programming for Performance Multicore Performance Multithreading Sreepathi Pai UTCS October 14, 2015 Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationWritten Exam / Tentamen
Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of
More informationSimultaneous Multithreading (SMT)
#1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing
More informationScalability of the RAMpage Memory Hierarchy
Scalability of the RAMpage Memory Hierarchy Philip Machanick Department of Computer Science, University of the Witwatersrand, philip@cs.wits.ac.za Abstract The RAMpage hierarchy is an alternative to the
More informationFundamentals of Computer Systems
Fundamentals of Computer Systems Caches Martha A. Kim Columbia University Fall 215 Illustrations Copyright 27 Elsevier 1 / 23 Computer Systems Performance depends on which is slowest: the processor or
More informationExploring the Effects of Hyperthreading on Scientific Applications
Exploring the Effects of Hyperthreading on Scientific Applications by Kent Milfeld milfeld@tacc.utexas.edu edu Kent Milfeld, Chona Guiang, Avijit Purkayastha, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER
More informationMeasurement-based Analysis of TCP/IP Processing Requirements
Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the
More informationLecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform
More informationToday. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture
More informationComputer Architecture
Informatics 3 Computer Architecture Dr. Vijay Nagarajan Institute for Computing Systems Architecture, School of Informatics University of Edinburgh (thanks to Prof. Nigel Topham) General Information Instructor
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationComputer Architecture. Introduction. Lynn Choi Korea University
Computer Architecture Introduction Lynn Choi Korea University Class Information Lecturer Prof. Lynn Choi, School of Electrical Eng. Phone: 3290-3249, 공학관 411, lchoi@korea.ac.kr, TA: 윤창현 / 신동욱, 3290-3896,
More informationCourse II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationObjective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.
CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationApplications of Thread Prioritization in SMT Processors
Applications of Thread Prioritization in SMT Processors Steven E. Raasch & Steven K. Reinhardt Electrical Engineering and Computer Science Department The University of Michigan 1301 Beal Avenue Ann Arbor,
More informationParallelism via Multithreaded and Multicore CPUs. Bradley Dutton March 29, 2010 ELEC 6200
Parallelism via Multithreaded and Multicore CPUs Bradley Dutton March 29, 2010 ELEC 6200 Outline Multithreading Hardware vs. Software definition Hardware multithreading Simple multithreading Interleaved
More informationBeyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy
EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery
More informationMicro-threading: A New Approach to Future RISC
Micro-threading: A New Approach to Future RISC Chris Jesshope C.R.Jesshope@massey.ac.nz Bing Luo R.Luo@massey.ac.nz Institute of Information Sciences and Technology, Massey University, Palmerston North,
More informationThreaded Multiple Path Execution
Threaded Multiple Path Execution Steven Wallace Brad Calder Dean M. Tullsen Department of Computer Science and Engineering University of California, San Diego fswallace,calder,tullseng@cs.ucsd.edu Abstract
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationEECS 452 Lecture 9 TLP Thread-Level Parallelism
EECS 452 Lecture 9 TLP Thread-Level Parallelism Instructor: Gokhan Memik EECS Dept., Northwestern University The lecture is adapted from slides by Iris Bahar (Brown), James Hoe (CMU), and John Shen (CMU
More informationComputer Architecture!
Informatics 3 Computer Architecture! Dr. Boris Grot and Dr. Vijay Nagarajan!! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors:!
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationComputer Architecture!
Informatics 3 Computer Architecture! Dr. Boris Grot and Dr. Vijay Nagarajan!! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationMultithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University
Multithreaded Architectures and The Sort Benchmark Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University About our Sort Benchmark Based on the benchmark proposed in A measure
More informationEmerging DRAM Technologies
1 Emerging DRAM Technologies Michael Thiems amt051@email.mot.com DigitalDNA Systems Architecture Laboratory Motorola Labs 2 Motivation DRAM and the memory subsystem significantly impacts the performance
More informationCS 152 Computer Architecture and Engineering. Lecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationChip Multiprocessors A Cost-effective Alternative to Simultaneous Multithreading
Chip Multiprocessors A Cost-effective Alternative to Simultaneous Multithreading BORUT ROBIČ JURIJ ŠILC THEO UNGERER Faculty of Computer and Information Sc. Computer Systems Department Dept. of Computer
More informationFundamentals of Computer Design
Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationOrganizational issues (I)
COSC 6385 Computer Architecture Introduction and Organizational Issues Fall 2008 Organizational issues (I) Classes: Monday, 1.00pm 2.30pm, PGH 232 Wednesday, 1.00pm 2.30pm, PGH 232 Evaluation 25% homework
More informationCO403 Advanced Microprocessors IS860 - High Performance Computing for Security. Basavaraj Talawar,
CO403 Advanced Microprocessors IS860 - High Performance Computing for Security Basavaraj Talawar, basavaraj@nitk.edu.in Course Syllabus Technology Trends: Transistor Theory. Moore's Law. Delay, Power,
More informationCPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor
Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction
More informationChapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in
More informationINTERACTION COST: FOR WHEN EVENT COUNTS JUST DON T ADD UP
INTERACTION COST: FOR WHEN EVENT COUNTS JUST DON T ADD UP INTERACTION COST HELPS IMPROVE PROCESSOR PERFORMANCE AND DECREASE POWER CONSUMPTION BY IDENTIFYING WHEN DESIGNERS CAN CHOOSE AMONG A SET OF OPTIMIZATIONS
More informationFundamentals of Computers Design
Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2
More informationSpeculation Control for Simultaneous Multithreading
Speculation Control for Simultaneous Multithreading Dongsoo Kang Dept. of Electrical Engineering University of Southern California dkang@usc.edu Jean-Luc Gaudiot Dept. of Electrical Engineering and Computer
More informationMulti-threaded processors. Hung-Wei Tseng x Dean Tullsen
Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to
More informationExploring High Bandwidth Pipelined Cache Architecture for Scaled Technology
Exploring High Bandwidth Pipelined Cache Architecture for Scaled Technology Amit Agarwal, Kaushik Roy, and T. N. Vijaykumar Electrical & Computer Engineering Purdue University, West Lafayette, IN 4797,
More information