Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures


Nagi N. Mekhiel
Department of Electrical and Computer Engineering
Ryerson University, Toronto, Ontario M5B 2K3
Email: nmekhiel@ee.ryerson.ca

Abstract

Neither simulation results nor real-system results explain the behavior of advanced computer systems across the full design spectrum. In this paper, we present simple models that explain the behavior of simultaneous multithreaded (SMT), multiprocessor (SMP), and multiprocessor-with-SMT architectures. The results of these models show that these architectures face limitations and problems when fast processors are used, and that the current single processor architecture will continue to be the target of system design for many years to come.

1 Introduction

The demand for more computing power is growing as a result of new applications in multimedia, the Internet, and telecommunications. The microarchitectural features that have been used to improve processor performance, such as superpipelining, superscalar execution, branch prediction, and out-of-order execution, cannot offer continuous performance improvement, because the cost of hiding cache misses and branch mispredictions increases with processor speed and complexity [1]. The advancements in technology that have produced faster and larger chips cannot be fully exploited to extract more performance from a single processor. Meanwhile, the software used for server and desktop applications is becoming more parallel: these applications consist of multiple threads or processes that can run in parallel [1,3] on a multiprocessor system.

Multiprocessor systems have been used for many years to improve application performance by executing multiple instructions in parallel [5,6]. Recently, single processors have been developed that execute multiple threads simultaneously (SMT) to make full use of the processor's resources [1,7,8].
Intel Hyper-Threading makes a single physical processor appear as two logical processors [1]. Threads can be scheduled onto the logical processors and run in parallel, as in a multiprocessor system, to improve system performance.

In this paper, we analyze the behavior of four architectures: the single processor, SMT, SMP, and SMP with SMT. We introduce simple models capable of explaining the behavior of these complicated architectures and predicting their performance. The rest of this paper is organized as follows: Section 2 explains the motivation; Section 3 presents simple models for the different architectures; Section 4 discusses the results; Section 5 concludes the paper.

2 Motivation

The computer industry depends on CAD tools to design, test, and build advanced computer systems. The accuracy and speed of these tools have made us increasingly dependent on them, not only for design but also for design evaluation: major design decisions are made according to simulation results. Simulation results may contain errors [4] and do not give an overall view of system behavior. A simulation yields the result at only one point in the design space, for the particular parameters used. To make an optimal design decision, we must view the system's results across the full design spectrum; optimization based on the result at a single point is misleading and can be counterproductive. Furthermore, we cannot optimize the performance of a system by optimizing each parameter individually. We can only optimize system performance if we understand the system's behavior with respect to its parameters across the full design spectrum. New architectures must be evaluated not only to find their expected performance but also to understand their behavior, especially given rapidly changing technology and many new applications with different characteristics.
If we understand the behavior of these architectures, we can answer important questions about which architecture should be adopted and why. We cannot answer these questions from simulation results or even from the results of a real system. Simple laws of physics, like Newton's laws, are useful because they clearly explain the behavior of objects in complicated environments. Fundamental laws that explain system behavior are valuable both for evaluating design decisions and for developing future systems. Such laws cannot be obtained from CAD tools; the only way to develop them is to understand system behavior without focusing on the details of its components. Our objective in this paper is therefore to find simple rules that govern the behavior of complicated architectures without getting into the details of each system. These rules are used to develop simple models that help explain the behavior of the different architectures.

3 Simple Models for the Different Architectures

To understand the behavior of important architectures such as multithreaded, multiprocessor, and multiprocessor-with-multithreaded systems, we constructed simple performance models. These models evaluate the behavior of each architecture relative to the well-known single processor architecture, hiding complicated system details. In these models, we assume that enough threads are available to run in parallel on each architecture; this evaluates the full potential of each architecture without being limited by the application. On the single processor architecture, these threads execute one after the other; the single processor can still benefit from superscalar capabilities to run multiple instructions in parallel. The SMT architecture assumes that some processor resources are duplicated so that threads can run in parallel, but that some internal shared resources cannot support parallelism and force threads to share them, similar to Intel Hyper-Threading [1].
Each thread in SMT can still use superscalar execution to run multiple instructions in parallel. The SMP architecture assumes multiple processors running in parallel and sharing the outside resources (main memory); each processor in SMP is a single processor that uses superscalar execution to run instructions in parallel. The SMP-with-SMT architecture assumes multiple processors running in parallel and sharing the external resources, with each processor capable of running multiple threads simultaneously; some internal resources are duplicated and others are shared.

Figure 1 shows the block diagrams of the different architectures (a single processor; SMP with 4 processors; SMT with 4 threads; and SMP with SMT using 2 processors of 2 threads each). These architectures may share internal or external resources. The superscalar resources use multiple function units to execute parallel instructions within each thread. The shared internal resources, such as the L2 cache inside the processor, are shared among the multiple threads in SMT and SMP with SMT, and threads must wait for one another to access them. The external shared resource is the main memory, which is shared by the threads in all architectures.

3.1 The Rules of Behavior for the Different Architectures

We define the following rules for the behavior of the SMT, SMP, and SMP-with-SMT architectures relative to the single processor.

1. Each architecture, including the single processor, can execute multiple instructions in parallel without stalls for each thread in an execution time T_EX, using superscalar execution. T_EX depends on the number of execution units, the speed of the processor, and the ILP in each thread.

2. All architectures must wait for a time T_WI to access the shared resources inside the processor, such as the L1 and L2 caches. Only a portion of the instructions are subject to this waiting time.
3. All architectures must wait for a time T_WO to access the external shared resources outside the processor. T_WO depends on the speed of the main memory, the memory-processor bus, and the portion of instructions that must use the outside resources.

4. Parallel threads can improve the total time of an application by overlapping either the no-stall execution time or the waiting time among threads. The performance gain for parallel threads (in SMT, SMP, and SMP with SMT) comes not only from overlapping the no-stall execution time, but also from overlapping some of the waiting time of each thread with the others.

5. The total waiting time for parallel threads accessing the shared resources equals the number of threads multiplied by the waiting time of each thread for those resources. If we assume that each thread takes the same time to access the shared resources, then the total waiting time is the number of threads multiplied by the per-thread waiting time (queuing theory).

3.2 The Model for a Single Processor

The performance of a single processor is given by the following model:

T_ST = T_EX + T_WI + T_WO

where:
T_ST = the total time to run a single thread on a single processor.
T_EX = the no-stall execution time of a single thread on a single superscalar processor.
T_WI = the time a thread spends using the shared resources inside the processor: T_WI = P_i * T_i, where P_i is the probability that an instruction in a thread uses the shared resources inside the processor and T_i is the time for an instruction to access them.
T_WO = the time a thread spends accessing the shared resources outside the processor: T_WO = P_o * T_o, where P_o is the probability that an instruction in a thread uses the shared resources outside the processor and T_o is the time for an instruction to access them.
Thus

T_ST = T_EX + P_i * T_i + P_o * T_o

and the total time to execute N threads on a single processor is N * T_ST.

3.3 The Model for SMT

The total time for executing multiple threads on the SMT architecture is obtained relative to the single processor model by overlapping the no-stall execution time T_EX, then adding the waiting time of the N threads for the shared resources inside and outside the processor, as shown in the SMT model of Figure 1:

T_SMT = T_EX + N * (P_i * T_i + P_o * T_o)

Note that T_EX is not multiplied by N, because the no-stall execution time is overlapped among threads in the SMT model.

3.4 The Model for SMP

The total time for executing multiple threads on the SMP architecture is obtained relative to the single processor model by overlapping both the no-stall execution time T_EX and the waiting time for the shared resources inside the processors (P_i * T_i), then adding the waiting time of the N threads for the shared resources outside the processor, as shown in the SMP model of Figure 1:

T_SMP = T_EX + P_i * T_i + N * (P_o * T_o)

Note that neither T_EX nor P_i * T_i is multiplied by N, because both are overlapped among the multiple processors in the SMP model.

3.5 The Model for SMP with SMT

The total time for executing multiple threads on the SMP-with-SMT architecture is obtained by overlapping the no-stall execution time inside each processor, then adding the waiting time of the M simultaneous threads for the shared resources inside each processor, and also adding the waiting time of all N threads for the shared resources outside the processors, as shown in the SMP-with-SMT model of Figure 1. Here M is the number of simultaneous threads inside each processor, and the N threads run on N/M SMT processors in parallel:

T_SMT.SMP = T_EX + M * (P_i * T_i) + N * (P_o * T_o)

3.6 The Specifications of the Systems

We selected the Pentium 4 processor to evaluate the models [1,9]. The SMT model assumes simultaneous multithreading with N threads (Pentium 4 Hyper-Threading uses only two threads).
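The four timing models can be written as short functions. The following is a minimal sketch under the paper's assumptions (the function and parameter names are ours): t_ex is the overlapped no-stall time, and t_wi = P_i * T_i and t_wo = P_o * T_o are the per-thread internal and external waiting times.

```python
def t_single(n, t_ex, t_wi, t_wo):
    # N threads run one after the other on a single processor.
    return n * (t_ex + t_wi + t_wo)

def t_smt(n, t_ex, t_wi, t_wo):
    # No-stall execution overlaps among threads; all waiting for the
    # shared internal and external resources serializes across N threads.
    return t_ex + n * (t_wi + t_wo)

def t_smp(n, t_ex, t_wi, t_wo):
    # Execution and internal waiting overlap across processors; only
    # the external (main memory) waiting serializes.
    return t_ex + t_wi + n * t_wo

def t_smp_smt(n, m, t_ex, t_wi, t_wo):
    # M simultaneous threads share each processor's internal resources;
    # all N threads share the external main memory.
    return t_ex + m * t_wi + n * t_wo
```

Each function returns the total time for N threads; dividing t_single by any of the others gives the predicted improvement over the single processor.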
The SMP model assumes N processors in parallel sharing the main memory, each working in single processor mode (no Hyper-Threading). The SMP-with-SMT model assumes N threads using N/M processors, each processor supporting M simultaneous threads. The specifications of these systems are:

Processor speed: 2 GHz.
L1 data cache: 8 Kbytes, 4-way, 64-byte block.
Unified L2 cache: 512 Kbytes, 8-way, 128-byte block, write back, access time 7 cycles.
Memory bus: 8 bytes wide, 333 MHz.
DDR DRAM: 333 MHz, with precharge = 2 cycles, row access = 3 cycles, column access = 3 cycles.

3.7 The Parameters of the Models

We assume the following values for the model parameters:

T_EX: depends on the processor cycle time (0.4 ns) and the average ILP (we assume ILP = 4). T_EX = 0.4/4 = 0.1 ns.
T_i: we assume that most of the waiting time inside the processor occurs when instructions and data miss the L1 cache and access the L2 cache. The L2 cache access time is 7 cycles, so T_i = 7 * 0.4 = 2.8 ns.
P_i: depends on the miss rates of the L1 data and instruction caches. We assume an average L1 miss rate for data and instructions of approximately 3.75%, so P_i = 0.0375.
T_o: depends on the DRAM latency and the data transfer time. T_o is assumed to be 66 ns (DRAM access time plus bus transfer).
P_o: depends on the miss rates of the L1 and L2 caches. We assume a global miss rate (M1 * M2) of 0.08%, so P_o = 0.0008.

All models use the same parameters, making the results less sensitive to the parameter values: the models evaluate performance relative to a single processor with the same parameters. The parameters were selected from typical values of a real system (Pentium 4).

4 The Results of the Models

The results are obtained using the models and the parameters above.

4.1 Comparing the Results of the Models to the Results of a Real System

Figure 2 shows the performance improvement of the SMT model alongside the results of an Intel Xeon processor running online transaction processing [1]. Online transaction applications have enough parallel threads that system performance is not limited by the number of threads. The results show that with N = 2 threads, both the Intel system and the SMT model give the same performance improvement of about 24% compared to running the two threads on a single processor. Figure 3 shows the performance gain of the SMP model and the results of an Intel SMP system.
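Plugging the parameter values into the single processor and SMT models reproduces the roughly 24% two-thread gain quoted in Section 4.1. A minimal sketch (variable names are ours; we take the global miss rate as 0.08%, i.e. P_o = 0.0008, the value consistent with the totals reported in Section 4):

```python
# Parameter values from Section 3.7 (Pentium 4-like system).
ILP = 4
CYCLE = 0.4                    # ns, processor cycle time
T_EX = CYCLE / ILP             # 0.1 ns no-stall time per instruction
P_I, T_I = 0.0375, 7 * CYCLE   # L1 miss rate; L2 latency = 2.8 ns
P_O, T_O = 0.0008, 66.0        # global miss rate; DRAM + bus time

T_WI = P_I * T_I               # internal waiting per instruction
T_WO = P_O * T_O               # external waiting per instruction
T_ST = T_EX + T_WI + T_WO      # single-thread time on one processor

# SMT with N = 2 threads versus one processor running both threads:
N = 2
t_single = N * T_ST
t_smt = T_EX + N * (T_WI + T_WO)
gain = t_single / t_smt - 1    # fractional improvement of SMT
```

The computed gain comes out near 0.24, matching the improvement reported for both the model and the Intel Xeon measurement.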
Intel SMP with 2 processors (2 threads total) and the SMP model with 2 processors (2 threads) give the same performance gain of about 55% compared to a single processor running the 2 threads one after the other.

Figure 2: Comparing Intel Xeon results and the results of the SMT model.

For 4 processors, the Intel SMP system improves performance by 160%, while the SMP model shows a performance improvement of 150%. Figure 4 shows the performance gain of the SMP-with-SMT model and the results of Intel SMP with Hyper-Threading. Combining SMP with SMT using 2 processors (4 threads) improves the performance of the Intel system by about 100%, and the SMP-with-SMT model shows the same performance improvement. These results are relative to single processor performance. For 4 processors (8 threads), the SMP-with-SMT model gives the same performance improvement (about 170%) as Intel SMP with Hyper-Threading. The different models (single, SMT, SMP, and SMP with SMT) accurately predicted real system performance under all these conditions, despite their simplicity. These models are accurate because they are developed relative to the single processor.

Figure 3: Comparing Intel Xeon results and the results of the SMP model.

Figure 4: Comparing Intel Xeon results and the results of the SMP-with-SMT model.

4.2 Predicting Performance Improvements of the Different Architectures

Figure 5 shows the performance improvements of the three architectures predicted by our models, relative to the single processor model. The performance improvement of the SMT model is the lowest, reaching a maximum of about 1.5 for 10 threads. This modest gain indicates that we cannot rely on SMT to improve the performance of large numbers of threads: the waiting time to access the shared resources inside and outside the processor limits the SMT performance gain. The performance improvement of the SMP model increases with the number of processors, but reaches diminishing returns at N = 10. SMP gives a performance improvement of about 3.5 for 10 processors; this limitation is caused by the waiting time for the external resource (main memory). The SMP-with-SMT model gives the best performance improvement, 3.7 for N = 10 processors running 20 threads, but it also reaches diminishing returns at N = 10. Its behavior is exactly that of SMP, except that it provides an extra fixed performance gain from using SMT (two threads) inside each processor; we do not include this model in the following results.

Figure 5: Performance improvements of the different architectures.

4.3 The Performance Improvements of SMT and SMP with a Faster Processor

Figure 6 shows the performance improvements of the SMT and SMP models at 2 GHz and at 5 GHz, each compared to a single processor model at the same speed. The performance improvements of SMT and SMP decrease as processor speed increases: the 5 GHz SMT and SMP architectures give lower improvements relative to a 5 GHz single processor than the 2 GHz architectures give relative to a 2 GHz single processor. This means that neither SMT nor SMP will be effective at improving the performance of parallel threads on future high-speed processors.
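The faster-processor effect can be checked directly from the models: a faster core shrinks only the overlapped no-stall term T_EX, while the waiting terms it cannot hide stay fixed, so the relative gain over an equally fast single processor drops. A sketch for N = 10 threads, using waiting-time values consistent with the totals quoted in the text (1.67 ns for SMT and 0.73 ns for SMP at the slower speed); the variable names are ours:

```python
T_WI, T_WO = 0.105, 0.0528   # ns, per-thread internal/external waiting

def times(t_ex, n=10):
    # Returns (single, SMT, SMP) total times for n threads.
    single = n * (t_ex + T_WI + T_WO)
    smt = t_ex + n * (T_WI + T_WO)
    smp = t_ex + T_WI + n * T_WO
    return single, smt, smp

# T_EX = 0.1 ns for the baseline core, 0.05 ns for a core twice as fast.
for t_ex in (0.1, 0.05):
    single, smt, smp = times(t_ex)
    print(f"T_EX={t_ex}: SMT speedup {single / smt:.2f}, "
          f"SMP speedup {single / smp:.2f}")
```

The absolute times still drop slightly with the faster core, but both speedups fall, matching the trend in Figure 6.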
It should be noted that although the performance improvement of the faster SMT and SMP systems decreases, the average time to finish the application still improves with faster processors. For example, SMT takes 1.67 ns to finish 10 parallel threads on the 2 GHz processor compared to 1.62 ns on the 5 GHz processor, clearly showing that increasing processor speed does not help the performance improvement of SMT. Likewise, SMP takes 0.73 ns to finish 10 parallel threads on the 2 GHz processors and 0.68 ns on the 5 GHz processors, indicating that SMP will not be effective in improving the performance of faster processors.

Figure 6: Performance improvements of SMT and SMP with faster processors.

4.4 The Performance Improvements of SMT and SMP with a Faster L2 Cache

Figure 7 shows the performance improvements of the SMT and SMP models with a faster L2 cache (access time 1.4 ns instead of 2.8 ns). The performance improvement of SMT increases with the faster L2 cache: for N = 2 threads, the total time for SMT is 0.41 ns, compared to 0.31 ns with the faster L2, a 32% improvement in performance. The performance improvement of SMP, however, decreases as L2 speed increases. This means that neither a faster processor core nor a faster L2 helps the SMP system: the total time for SMP at N = 10 is 0.73 ns, compared to 0.68 ns with the faster L2, an improvement of only 7% for a 100% improvement in L2 speed. We must look for other ways to improve the performance of SMP.

Figure 7: Performance improvements of SMT and SMP with a faster L2 cache.

4.5 The Performance Improvements of SMT and SMP with Faster DRAM

Figure 8 shows the performance improvements of SMT and SMP with a 66 ns DRAM access time and with a 33 ns DRAM access time. The SMT system gains only modestly from improving the DRAM access time, while the SMP system gains significantly: the performance improvement of SMP increases by about 40% for N = 10. This means that only faster DRAM benefits the SMP performance improvement, whereas both a faster L2 cache and faster DRAM benefit SMT. For the single processor, any improvement in the speed of the core, L2, or DRAM helps its performance. This means that the single processor architecture gains the most from any improvement in technology, and is the most likely target architecture for many years to come.

Figure 8: Performance improvements of SMT and SMP with faster DRAM.

4.6 The Performance Improvements of SMT and SMP with Slower Processors

Figure 9 shows the performance improvements of the SMT and SMP architectures using a 2 GHz processor and a 1.25 GHz processor, each compared to a single processor running at the same speed. The performance improvements of both SMT and SMP increase significantly with the slower processors, because slower processors have more execution time that can be overlapped among threads to exploit parallelism in SMT or SMP. For example, SMP using 10 processors at 1.25 GHz has a performance gain of 5 over the single processor, but only 3 when the 10 processors run at 2 GHz. The time to finish 10 threads on SMP with 1.25 GHz processors is 0.83 ns, compared to 0.73 ns with 2 GHz processors: only a 13% reduction in performance for a 37.5% reduction in processor speed. The time to finish 10 threads on SMT with 1.25 GHz processors is 1.77 ns, compared to 1.67 ns with 2 GHz processors. SMT and SMP therefore scale better with slower processors, which also cost less to implement and consume less power.

Figure 9: Performance improvements of SMT and SMP with slower processors.

5 Conclusions

We have developed simple models that explain the behavior of modern architectures such as SMT, SMP, and SMP with SMT without including complicated system details. These models predict the performance gain relative to a single processor and can be used to make design decisions regarding the future of these architectures. The results of these models show that these architectures have limited performance gains with faster processors and are better suited to slower processors. SMP gives the best performance of all the architectures, especially when it uses fast main memory. The results also show that the performance of a single processor benefits the most from any improvement in technology, and that the single processor will likely remain the target architecture for many years to come.

References

[1] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton, "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Q1 2002.
[2] D. Burger, J. R. Goodman, and A. Kagi, "Memory Bandwidth Limitations of Future Microprocessors," Proc. 23rd Ann. Int'l Symp. on Computer Architecture, ACM Press, New York, 1996, pp. 78-89.
[3] J. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Francisco, CA, 2002.
[4] R. Desikan, D. Burger, and S. W. Keckler, "Measuring Experimental Error in Microprocessor Simulation," Proc. 28th Ann. Int'l Symp. on Computer Architecture, pp. 266-277, June 30-July 3, 2001, Göteborg, Sweden.
[5] A. Agarwal, B. H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: A Processor Architecture for Multiprocessing," Proc. 17th Ann. Int'l Symp. on Computer Architecture, pp. 104-114, May 1990.
[6] L. Hammond, B. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor," IEEE Computer, 30(9), pp. 79-85, September 1997.
[7] D. Tullsen, S. Eggers, and H. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. on Computer Architecture, June 1995.
[8] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, 17(5), pp. 12-19, Oct. 1997.
[9] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 5(1), Feb. 2001.