DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor


Kyriakos Stavrou, Paraskevas Evripidou, and Pedro Trancoso
Department of Computer Science, University of Cyprus, 75 Kallipoleos Ave., P.O. Box 20537, 1678 Nicosia, Cyprus
{tsik,skevos,pedro}@cs.ucy.ac.cy

Abstract. High-end microprocessors achieve their performance by adding more features and therefore increasing their complexity. In this paper we present DDM-CMP, a chip multiprocessor using the Data-Driven Multithreading execution model. As a proof of concept we present a DDM-CMP configuration with the same hardware budget as a high-end processor. Within that budget we implement four simpler CPUs, the TSUs, and the interconnection network. An estimation of DDM-CMP performance for the execution of SPLASH-2 kernels shows that, at the same clock frequency, DDM-CMP achieves a speedup of 2.6 to 7.6 compared to the high-end processor. A lower-frequency configuration, which is more power-efficient, still achieves a high speedup (1.1 to 3.3). These encouraging results lead us to believe that the proposed architecture has a significant benefit over traditional designs.

1 Introduction

Current state-of-the-art microprocessor designs aim at achieving higher performance by exploiting more ILP through complex hardware structures. Nevertheless, such increases in complexity introduce several problems and yield only marginal performance gains. Palacharla et al. [1] explain that the window wakeup, selection, and operand bypass logic are likely to be the most limiting factors for improving performance in future designs. The analysis presented by Olukotun et al. [2] shows that the complexity of a large number of structures increases quadratically with processor parameters such as the issue width and the number of pipeline stages. Increasing design complexity not only limits the performance improvement but also makes validation and testing a difficult task [3]. Agarwal et al.
[4] derive that the doubling of microprocessor performance every 18 months has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Their results show that, due to both diminishing improvements in clock rate and poor wire scaling as semiconductor devices shrink, the achievable performance growth of conventional microarchitectures will slow down substantially. An alternative design that achieves parallelism but avoids the complexity is the Chip Multiprocessor (CMP) [2]. Several research projects have proposed CMP architectures [2, 5, 6, 7]. In addition, commercial products have also been proposed (e.g. IBM's POWER5 [8] and Sun's Niagara [9]).

T.D. Hämäläinen et al. (Eds.): SAMOS 2005, LNCS 3553, © Springer-Verlag Berlin Heidelberg 2005

Parallel architectures often suffer from large synchronization and communication latencies. Data-Driven Multithreading (DDM) [10, 11] is an execution model that aims at tolerating these latencies by allowing the computation processor to produce useful work while a long-latency event is in progress. In this model, the synchronization part of the program is separated from the computation part, allowing it to hide the synchronization and communication delays [10]. While such computation models usually require the design of dedicated microprocessors, Kyriacou et al. [10] showed that the DDM benefits may be achieved using commodity microprocessors. The only additional requirement is a small hardware structure, the Thread Synchronization Unit (TSU).

The contribution of this paper is to explore the DDM concept with the new CMP type of architectures. The proposed architecture, DDM-CMP, is a chip multiprocessor architecture where the cores are simple embedded processors operating under the DDM execution model. Along with the cores, the chip also includes the TSUs and an interconnection network. The use of embedded processors is justified by Olukotun et al. [2], who showed that the simpler the cores of the multiprocessor, the higher their frequency can be. In addition, embedded processors are smaller and therefore we are able to include more cores in the same chip. A prototype will be built using the Xilinx Virtex-II Pro [12] chip.

Our experiments use kernels from the SPLASH-2 benchmark suite and compare the estimated performance of a DDM-CMP system composed of four simple cores to that of an equal-hardware-budget high-end processor. For this analysis we use the Pentium III and the Pentium 4 as representatives of simple and high-end processors, respectively.
The results show that a DDM-CMP configuration clocked at the same frequency as the high-end processor achieves a speedup of 2.6 to 7.6. DDM-CMP's benefits may also be explored for low-power configurations, as the results show that even when clocked at less than half the high-end processor's frequency, it achieves a speedup of 1.1 to 3.3.

The rest of this paper is organized as follows. Section 2 describes the DDM execution model, the proposed DDM-CMP architecture, and its prototype implementation. Section 3 describes the case study used as the proof of our concept. Finally, we present our conclusions in Section 4.

2 DDM-CMP Architecture

The proposed DDM-CMP architecture is the evolution of the DDM architecture presented in [10, 11]. In this section, we present the DDM model of execution and describe the DDM-CMP architecture, its prototype implementation, and its target applications.

2.1 DDM Model

Data-Driven Multithreading (DDM) provides effective latency tolerance by allowing the computation processor to produce useful work while a long-latency event is in progress. This model of execution evolved from the dataflow model of computation. In particular, it originates from the dynamic dataflow Decoupled Data-Driven (D³) graphs [13, 14], where the synchronization part of a program is separated from the computation part. The computation part represents the actual instructions of the program executed by the computation processor, whereas the synchronization part contains information about data dependencies among threads and is used for thread scheduling.

A program in DDM is a collection of re-entrant code blocks. A code block is equivalent to a function or a loop body in the high-level program text. Each code block comprises several threads. A thread is a sequence of instructions equivalent to a basic block. A producer-consumer relationship exists among threads: in a typical program, a set of threads called the producers create data used by other threads called the consumers. Scheduling of code blocks, as well as scheduling of threads within a code block, is done dynamically at run time according to data availability. While the instructions within a thread are fetched by the CPU sequentially in control-flow order, the CPU may apply any optimization to increase ILP (e.g. out-of-order execution). As we are still in the process of developing a DDM-CMP compiler, the procedure of partitioning the program into a data-driven synchronization graph and code threads (as presented in [15]) is currently done by hand.

Fig. 1. Thread Synchronization Unit (TSU) internal structure

TSU - Hardware Support for DDM. The purpose of the Thread Synchronization Unit (TSU) is to provide hardware support for data-driven thread synchronization on conventional microprocessors. The TSU is made out of three units: the Thread Issue Unit (TIU), the Post-Processing Unit (PPU), and the Network Interface Unit (NIU).
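The scheduling rule described above, where a thread becomes runnable only once all of its producers have completed, can be sketched in a few lines of Python. The thread graph and names here are illustrative, not taken from the paper; each thread's ready count mirrors what the compiler would emit (the number of producers it waits for):

```python
from collections import deque

# Hypothetical thread graph: consumers[t] lists the threads that use t's data.
consumers = {"A": ["C"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": []}
# Ready count = number of producers each thread waits for (set by the compiler).
ready_count = {"A": 0, "B": 0, "C": 2, "D": 1, "E": 2}

ready = deque(t for t, rc in ready_count.items() if rc == 0)
order = []

while ready:                      # TIU role: issue threads deemed executable
    t = ready.popleft()
    order.append(t)               # "execute" the thread
    for c in consumers[t]:        # PPU role: post-process thread completion
        ready_count[c] -= 1
        if ready_count[c] == 0:   # all inputs produced: thread becomes ready
            ready.append(c)

print(order)                      # -> ['A', 'B', 'C', 'D', 'E']
```

The queue plays the part of the TIU's pool of executable threads, while the loop body that decrements consumer counts plays the part of the PPU described next.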
When a thread completes its execution, the PPU updates the Ready Count of its consumer threads (the Ready Count is set by the compiler and corresponds to the number of input values, or producers, of the thread), determines whether any of those threads became ready for execution and, if so, forwards them to the TIU. The function of the TIU is to schedule and prefetch threads deemed executable by the PPU. The NIU is responsible for the communication between the TSU and the interconnection network. The internal structure of the TSU is depicted in Figure 1. A detailed description of the operation of the TSU can be found in [15].

CacheFlow. Although DDM can tolerate communication and synchronization latency, scheduling based on data availability may have a negative effect on locality. To overcome this problem, the scheduling information, together with software-triggered data prefetching, is used to implement efficient cache management policies. These policies are named CacheFlow [16]. The most effective CacheFlow policy contains two optimizations: False Conflict Avoidance and Thread Reordering. False conflict avoidance prevents the prefetcher from replacing cache blocks required by the threads deemed executable, and so reduces cache misses. Thread reordering attempts to exploit both temporal and spatial locality by reordering the threads still waiting for their input data.

2.2 CMP Architecture

The proposed chip multiprocessor can be implemented using three hardware structures: the microprocessor cores, the TSUs, and the interconnection network.

Fig. 2. Several alternatives of the DDM-CMP architecture: (a) each microprocessor has its own TSU, (b) one TSU is shared among two microprocessors and the number of cores increases, (c) one TSU serves all the microprocessors of the chip, and (d) saved space is used to implement on-chip shared cache

Our first proposed CMP architecture (Figure 2-(a)) is one that simply performs the integration of the previously proposed D²NOW [11] into a single chip. While having one TSU per processor was required in a NOW system, when all processors are on the same chip it is possible to optimize the use of the TSU structure and share it among two or more processors (Figure 2-(b)). Ultimately, we may consider the extreme case where one TSU is shared among all CPUs on-chip (Figure 2-(c)). Notice that by saving hardware through the sharing of the TSUs it may be possible to increase the number of on-chip CPUs or, alternatively, add internal shared cache (Figure 2-(d)).

Although the impact of the interconnection network on the performance of an architecture that uses the DDM execution model is small [15], there is still potential for studying several alternatives. This is especially interesting as the number of on-chip CPUs increases. The tradeoff between the size and the performance of the interconnection network will be studied, as a larger, more complex interconnection network may result in a decrease of the number of CPUs that can be embedded in the chip.

2.3 Prototype Implementation

To prove that the proposed DDM-CMP architecture offers the expected benefits, a hardware prototype will be implemented. This prototype will use the Xilinx Virtex-II Pro chip [12]. This chip contains, among others, two embedded PowerPC 405 [17] processors and a large programmable FPGA fabric. We aim at implementing the TSU and the interconnection network on the FPGA portion and executing the application threads on the two processors.

2.4 Target Applications

The DDM-CMP architecture can be used to speed up the execution of parallelizable loop-based or pipeline-like applications. On the one hand, the proposed architecture is especially beneficial for parallelizable applications, as it provides multiple parallel execution processors. On the other hand, protocol stack applications, which are representative examples of pipeline-like applications, can benefit from DDM-CMP by mapping the code corresponding to each layer to a different DDM thread.
Each layer, or DDM thread, will run in parallel, providing a pipelined execution model with significant performance enhancement. Overall, we envision the DDM-CMP chip being used in a single-chip system as a substitute for a high-end microprocessor, or as a building block for larger multiprocessor systems like BlueGene/L [18].

3 DDM-CMP Performance Potential Analysis

3.1 Design

The objective of the proposed DDM-CMP architecture is to achieve better performance than a current high-end microprocessor, given the same hardware budget, i.e. the same die area. For our analysis we consider the Intel Pentium 4 as the baseline for the high-end microprocessor. As mentioned before, DDM-CMP is built out of simpler cores. For the purposes of our analysis we consider the Intel Pentium III as a representative of such a core. From the information reported in [19], the number of transistors used in implementing the Intel Pentium 4 (3.2GHz, 1MB L2 cache, 90nm technology) is approximately 125 million, while the number of transistors used in implementing the Intel Pentium III (800MHz, 256KB L2 cache, 180nm technology) is 22 million. Therefore, the Pentium 4 requires approximately 5.7 times more transistors than what is needed to build the Pentium III.

In addition to the processors, other hardware structures are needed to implement the DDM-CMP architecture: the TSUs and the interconnection network. As explained earlier, these structures can be implemented using a relatively small number of transistors. If we use the Pentium 4 transistor budget to implement four Pentium III processors, about 37 million transistors are left unused. This number of transistors is more than enough to implement the four TSUs and the appropriate interconnection network. Therefore, a DDM-CMP architecture with four Pentium III processors can be implemented with the same number of transistors needed to build a Pentium 4. This is the DDM-CMP configuration that will be used for our proof-of-concept experiments.

3.2 Experimental Setup

As we do not yet have a DDM-CMP simulator, its performance results are derived from the results obtained by Kyriacou et al. [10] for the D²NOW implementation. In this case, we use the results for the D²NOW architecture configured with four Pentium III 800MHz processors, including all architecture optimizations. Notice that the D²NOW results are conservative for the DDM-CMP architecture, as the on-chip interconnection network has both larger bandwidth and smaller latency than the D²NOW interconnect. As the baseline high-end processor we have selected the Pentium 4 3.2GHz. To obtain the results for this setup we measure the execution time of the application's native execution on that system. The execution time is determined by measuring the number of processor cycles consumed in the execution of the main function of the program, i.e. we ignore the initialization phase.
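The die-area budget of Section 3.1 reduces to a few lines of arithmetic. The Pentium III figure is the 22 million transistors quoted from [19]; the Pentium 4 figure of 125 million is inferred here from the 5.7x ratio and the 37-million surplus stated in the text, not quoted directly:

```python
PIII = 22_000_000        # Pentium III 800MHz, 256KB L2, 180nm (from [19])
P4 = 125_000_000         # Pentium 4 3.2GHz, 1MB L2, 90nm (inferred, see above)

ratio = P4 / PIII        # how many PIII-sized budgets fit in a P4 budget
surplus = P4 - 4 * PIII  # transistors left for the four TSUs + interconnect

print(f"P4/PIII ratio: {ratio:.1f}x")              # -> P4/PIII ratio: 5.7x
print(f"surplus: {surplus // 1_000_000} million")  # -> surplus: 37 million
```

The surplus is what the text argues is "more than enough" for the four TSUs and the interconnection network.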
The processor cycles are measured by reading the processor's hardware performance counters [20]. Notice that, as this is native execution, in order for the results to be representative we execute the same experiment ten times and exclude the largest and smallest measurements. For this proof-of-concept analysis, the workload considered to test the proposed architecture is composed of three kernels from the SPLASH-2 benchmark suite [21]: LU, FFT, and Radix.

3.3 Experimental Results

Table 1. Performance results with different implementation technology
(DDM-CMP, 4 x Pentium III 800MHz: Cycles [x1000], Time [ms]; Pentium 4 3.2GHz: Cycles [x1000], Time [ms]; Speedup; rows: FFT, LU, Radix)

The results collected from [10] and from the native execution on the Pentium 4 system are summarized in Table 1. It is possible to observe that the 4 x Pentium III DDM-CMP achieves better performance than the native Pentium 4 only for the Radix application. For both FFT and LU, the DDM-CMP performance is worse than that obtained for the Pentium 4. It is interesting to note that these results may be correlated with the fact that for both FFT and LU the execution time is much smaller than the one for the Radix application. From a brief analysis of the execution of the applications, we were able to determine that the main function performing the calculations of the algorithm accounts for more than 80% of the total execution for Radix, while it accounts for approximately only 50% for both FFT and LU. This is an indication that, in order to obtain more reliable results for both FFT and LU, we will need to use larger data set sizes. This issue will be covered in the near future as we complete the DDM-CMP simulator. Nevertheless, at this point we do not consider this result to be a problem, as we expect that there will always be applications that will not show better performance when executing on the DDM-CMP.

The results presented in Table 1 are significantly affected by technology scaling. It is important to notice that the Pentium III is implemented in 0.18µm technology with an 800MHz clock, whereas the Pentium 4 is implemented in 0.09µm technology with a 3.2GHz clock. If pipeline stalls due to off-chip operations are not taken into account, the number of clock cycles needed to execute a series of instructions is independent of the implementation technology. If we consider that off-chip operations are not dominant in the applications studied, the execution time of a given application on a given architecture will decrease at the same rate that the frequency increases.

In this analysis we consider two frequency-scaling scenarios. The first is a realistic scaling where, instead of the original Pentium III 800MHz, we consider the highest clock frequency at which the Pentium III was produced: from [22], the Pentium III code-named Tualatin had a clock frequency of 1.4GHz. The second is the upper-limit scenario, where we assume a Pentium III-equivalent processor able to scale up to the Pentium 4 frequency (3.2GHz). An additional optimization that will be considered in the future is that the TSU may be modified to be shared among more than one processor.
This will minimize the hardware overhead introduced by the DDM architecture. The extra space saved from sharing the TSU may be used to increase the number of cores on the chip. One more factor that will have an impact on the performance of DDM-CMP is the type of processor used as the core. In this analysis we are using the Pentium III, given the restrictions that originate from the use of the results obtained in the D²NOW study. In a real implementation, as discussed previously, we will use embedded processors as the cores of the DDM-CMP. As these embedded processors are simpler, they require fewer transistors for their implementation and consequently we will be able to fit more cores into the same chip. Given the above arguments, in addition to the frequency scaling, we also consider the case where we would be able to fit eight processors on the same chip. Notice that, as we use the results from a D²NOW configured with eight Pentium III processors, the results are not accurate, but they can be used as an indication of the upper limit that may be achieved. The results from these scaling scenarios, together with the original results, are depicted in Figure 3.
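The scaling argument above, namely that cycle counts are roughly technology-independent so execution time scales as cycles divided by frequency, yields a simple first-order speedup estimate. The helper below is a sketch; the base speedup of 1.9 is only illustrative (the text notes Radix achieves almost 2x on the original configuration):

```python
BASE_MHZ = 800.0  # clock of the original Pentium III cores

def scaled_speedup(base_speedup: float, clock_mhz: float) -> float:
    """First-order estimate: with the cycle count fixed, time = cycles / f,
    so speedup over the fixed Pentium 4 baseline grows linearly with f."""
    return base_speedup * clock_mhz / BASE_MHZ

# Original, Tualatin (1.4GHz), and upper-limit (3.2GHz) scaling scenarios:
for mhz in (800, 1400, 3200):
    print(f"{mhz} MHz -> {scaled_speedup(1.9, mhz):.2f}x")
```

At 3.2GHz this reproduces the paper's upper-limit 7.6x figure for Radix, assuming off-chip stalls stay negligible.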

Fig. 3. Speedup when frequency and core scaling is taken into account (y-axis: speedup compared to the Pentium 4 3.2GHz; left group: 4-core CMP, right group: 8-core CMP; bars: PIII 800MHz, PIII 1.4GHz, PIII 3.2GHz)

In Figure 3 we depict three bars for each application. The one on the left represents the original results, the one in the middle represents the Pentium III scaled to 1.4GHz, and the one on the right represents the upper-limit scaling with the 3.2GHz clock frequency. The group of results on the left represents the original chip design with four cores, while the group on the right represents the scenario where the optimizations used allowed scaling the number of cores to eight.

As already observed with the original results, both FFT and LU have a speedup smaller than 1 and therefore perform better on the Pentium 4 system. In contrast, Radix achieves almost a 2x speedup when executing on the original DDM-CMP. The speedup values increase as the scaling is applied to the Pentium III processor. It is worth noting that even with the first scaling, all three applications already show better performance on the DDM-CMP compared to the execution on the Pentium 4. This configuration also has the advantage of being more power-efficient than the original Pentium 4, as it is clocked at less than half of its frequency. When scaling the frequency to the same as the baseline, we observe larger speedup values, ranging from 2.6 to 7.6. It is also interesting to observe that Radix presents a superlinear speedup: with the upper-limit scaling it achieves a speedup of 7.6 with only four processors. This may be justified by the effectiveness of the CacheFlow policies. The results for the eight-core DDM-CMP present, for all applications and at every scaling scenario, a speedup larger than one. Overall, the results show very good speedup for both the high-performance and low-power DDM-CMP configurations.
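As a footnote to the methodology of Section 3.2, running each native experiment ten times and discarding the largest and smallest measurements amounts to taking a trimmed mean. A minimal sketch, with invented cycle counts for illustration:

```python
def trimmed_mean(samples):
    """Mean after discarding the single largest and smallest measurement."""
    if len(samples) < 3:
        raise ValueError("need at least three samples to trim both extremes")
    s = sorted(samples)
    trimmed = s[1:-1]          # drop smallest and largest
    return sum(trimmed) / len(trimmed)

# Ten hypothetical cycle counts; 2500 is an outlier (e.g. an OS interruption).
runs = [1010, 998, 1005, 2500, 1002, 999, 1001, 1003, 950, 1004]
print(trimmed_mean(runs))      # -> 1002.75 (950 and 2500 excluded)
```

Trimming the extremes keeps a single perturbed native run from skewing the cycle counts that feed the speedup comparison.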
4 Conclusions

In this paper we have presented DDM-CMP, a chip-multiprocessor implementation of the Data-Driven Multithreading execution model. The DDM-CMP architecture turns away from the complexity path taken by recent high-end microprocessors. Its performance is achieved by combining several simple commodity microprocessors with a small hardware overhead, an extra hardware structure, the Thread Synchronization Unit (TSU). As a proof of concept we present a DDM-CMP implementation that utilizes the same hardware budget as a current high-end processor, the Pentium 4, to implement four Pentium III processors together with the necessary TSUs and interconnection network. The results obtained are very encouraging, as the DDM-CMP configuration clocked at the same frequency as the Pentium 4 achieves a speedup of 2.6 to 7.6. DDM-CMP can alternatively be configured for power-efficiency and still achieve high speedup: a configuration clocked at less than half of the Pentium 4 frequency achieves speedup values ranging from 1.1 to 3.3. We are currently evaluating the different architecture alternatives for DDM-CMP and a larger set of applications, and are starting to implement a prototype of this architecture on a Virtex-II Pro chip.

Acknowledgments

We would like to thank Costas Kyriacou for his contribution in the discussions and preparation of the results. Also, we would like to thank the anonymous reviewers for their valuable comments.

References

1. Palacharla, S., Jouppi, N., Smith, J.: Complexity-Effective Superscalar Processors. In: Proc. of the 24th ISCA (1997)
2. Olukotun, K., et al.: The Case for a Single-Chip Multiprocessor. In: Proc. of the 7th ASPLOS (1996)
3. Silas, I., et al.: System-Level Validation of the Intel(R) Pentium(R) M Processor. Intel Technology Journal 7 (2003)
4. Agarwal, V., et al.: Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures. In: Proc. of the 27th ISCA (2000)
5. Hammond, L., et al.: The Stanford Hydra CMP. IEEE Micro 20 (2000)
6. Barroso, L., et al.: Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In: Proc. of the 27th ISCA (2000)
7. Taylor, M., et al.: Evaluation of the Raw Microprocessor: An Exposed Wire Delay Architecture for ILP and Streams. In: Proc. of the 31st ISCA (2004)
8. Kalla, R., Sinharoy, B., Tendler, M.: IBM POWER5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro 24 (2004)
9. Kongetira, P.: A 32-way Multithreaded SPARC Processor. In: Proc. of Hot Chips (2004)
10. Kyriacou, C., Evripidou, P., Trancoso, P.: Data Driven Multithreading Using Conventional Microprocessors. Technical Report TR-05-4, University of Cyprus (2005)
11. Evripidou, P., Kyriacou, C.: Data Driven Network of Workstations (D²NOW). J. UCS 6 (2000)
12. XILINX: Virtex-II Pro and Virtex-II Pro X FPGA User Guide. Version 3.0 (2004)
13. Evripidou, P.: D³-machine: A Decoupled Data-Driven Multithreaded Architecture with Variable Resolution Support. Parallel Computing 27 (2001)
14. Evripidou, P., Gaudiot, J.: A Decoupled Graph/Computation Data-Driven Architecture with Variable-Resolution Actors. In: Proc. of ICPP (1990)
15. Kyriacou, C.: Data Driven Multithreading Using Conventional Control Flow Microprocessors. PhD dissertation, University of Cyprus (2005)
16. Kyriacou, C., Evripidou, P., Trancoso, P.: CacheFlow: A Short-Term Optimal Cache Management Policy for Data Driven Multithreading. In: Proc. of the 10th Euro-Par, Pisa, Italy (2004)
17. IBM Microelectronics Division: The PowerPC 405(TM) Core (1998)
18. The BlueGene/L Team: An Overview of the BlueGene/L Supercomputer. In: Proc. of the 2002 ACM/IEEE Supercomputing Conference (2002)
19. Intel: Intel Microprocessor Quick Reference Guide. pressroom/kits/quickreffam.htm (2004)
20. PCL: The Performance Counter Library, Version 2.2 (2003)
21. Woo, S., et al.: The SPLASH-2 Programs: Characterization and Methodological Considerations. In: Proc. of the 22nd ISCA (1995)
22. Topelt, B., Schuhmann, D., Volkel, F.: The Mother of All CPU Charts, Part 2 (2004)


Multi-Core Microprocessor Chips: Motivation & Challenges

Multi-Core Microprocessor Chips: Motivation & Challenges Multi-Core Microprocessor Chips: Motivation & Challenges Dileep Bhandarkar, Ph. D. Architect at Large DEG Architecture & Planning Digital Enterprise Group Intel Corporation October 2005 Copyright 2005

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture

Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture Sivakumar Harinath 1, Robert L. Grossman 1, K. Bernhard Schiefer 2, Xun Xue 2, and Sadique Syed 2 1 Laboratory of

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Microprocessor Trends and Implications for the Future

Microprocessor Trends and Implications for the Future Microprocessor Trends and Implications for the Future John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 4 1 September 2016 Context Last two classes: from

More information

Simultaneous Multithreading and the Case for Chip Multiprocessing

Simultaneous Multithreading and the Case for Chip Multiprocessing Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

TRIPS: Extending the Range of Programmable Processors

TRIPS: Extending the Range of Programmable Processors TRIPS: Extending the Range of Programmable Processors Stephen W. Keckler Doug Burger and Chuck oore Computer Architecture and Technology Laboratory Department of Computer Sciences www.cs.utexas.edu/users/cart

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş Evolution of Computers & Microprocessors Dr. Cahit Karakuş Evolution of Computers First generation (1939-1954) - vacuum tube IBM 650, 1954 Evolution of Computers Second generation (1954-1959) - transistor

More information

New Advances in Micro-Processors and computer architectures

New Advances in Micro-Processors and computer architectures New Advances in Micro-Processors and computer architectures Prof. (Dr.) K.R. Chowdhary, Director SETG Email: kr.chowdhary@jietjodhpur.com Jodhpur Institute of Engineering and Technology, SETG August 27,

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007 Thread-level Parallelism for the Masses Kunle Olukotun Computer Systems Lab Stanford University 2007 The World has Changed Process Technology Stops Improving! Moore s law but! Transistors don t get faster

More information

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

March 17, :11 WSPC/INSTRUCTION FILE ddm cmp ppl mar08. Rapid Prototyping of the Data-Driven Chip-Multiprocessor (D 2 -CMP) using FPGAs

March 17, :11 WSPC/INSTRUCTION FILE ddm cmp ppl mar08. Rapid Prototyping of the Data-Driven Chip-Multiprocessor (D 2 -CMP) using FPGAs Parallel Processing Letters c World Scientific Publishing Company Rapid Prototyping of the Data-Driven Chip-Multiprocessor (D -CMP) using FPGAs Konstantinos Tatas, Costas Kyriacou Computer Engineering

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

Chapter 1: Fundamentals of Quantitative Design and Analysis

Chapter 1: Fundamentals of Quantitative Design and Analysis 1 / 12 Chapter 1: Fundamentals of Quantitative Design and Analysis Be careful in this chapter. It contains a tremendous amount of information and data about the changes in computer architecture since the

More information

18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013

18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013 18-447: Computer Architecture Lecture 30B: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013 Readings: Multiprocessing Required Amdahl, Validity of the single processor

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination 1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

DDMCPP: The Data-Driven Multithreading C Pre-Processor

DDMCPP: The Data-Driven Multithreading C Pre-Processor DDMCPP: The Data-Driven Multithreading C Pre-Processor Pedro Trancoso, Kyriakos Stavrou, Paraskevas Evripidou Department of Computer Science University of Cyprus 75 Kallipoleos Ave., P.O.Box 20537, 1678

More information

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Lec 25: Parallel Processors. Announcements

Lec 25: Parallel Processors. Announcements Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Performance Impact of Resource Conflicts on Chip Multi-processor Servers

Performance Impact of Resource Conflicts on Chip Multi-processor Servers Performance Impact of Resource Conflicts on Chip Multi-processor Servers Myungho Lee, Yeonseung Ryu, Sugwon Hong, and Chungki Lee Department of Computer Software, MyongJi University, Yong-In, Gyung Gi

More information

Computer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015

Computer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Computer Architecture Today (I)

Computer Architecture Today (I) Fundamental Concepts and ISA Computer Architecture Today (I) Today is a very exciting time to study computer architecture Industry is in a large paradigm shift (to multi-core and beyond) many different

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Multithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University

Multithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University Multithreaded Architectures and The Sort Benchmark Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University About our Sort Benchmark Based on the benchmark proposed in A measure

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Why do I need four computing cores on my phone?! Why do I need eight computing

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors.

Multiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors. Architectural and Implementation Tradeoffs for Multiple-Context Processors Coping with Latency Two-step approach to managing latency First, reduce latency coherent caches locality optimizations pipeline

More information

Computer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Parallel Processing Basics Prof. Onur Mutlu Carnegie Mellon University Readings Required Hill, Jouppi, Sohi, Multiprocessors and Multicomputers, pp. 551-560 in Readings in Computer

More information

A comparison of three architectures: Superscalar, Simultaneous Multithreading CPUs and Single-Chip Multiprocessor.

A comparison of three architectures: Superscalar, Simultaneous Multithreading CPUs and Single-Chip Multiprocessor. A comparison of three architectures: Superscalar, Simultaneous Multithreading CPUs and Single-Chip Multiprocessor. Recent years have seen a great deal of interest in multiple-issue machines or superscalar

More information

ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Fall 2009 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2009 1 When and Where? When:

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

instruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals

instruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals Performance Evaluations of a Multithreaded Java Microcontroller J. Kreuzinger, M. Pfeer A. Schulz, Th. Ungerer Institute for Computer Design and Fault Tolerance University of Karlsruhe, Germany U. Brinkschulte,

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Chip Multithreading: Opportunities and Challenges

Chip Multithreading: Opportunities and Challenges Chip Multithreading: Opportunities and Challenges Lawrence Spracklen & Santosh G. Abraham Scalable Systems Group Sun Microsystems Inc., Sunnyvale, CA {lawrence.spracklen,santosh.abraham}@sun.com Abstract

More information

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101 18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

CS377P Programming for Performance Multicore Performance Multithreading

CS377P Programming for Performance Multicore Performance Multithreading CS377P Programming for Performance Multicore Performance Multithreading Sreepathi Pai UTCS October 14, 2015 Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Architectural Support for Data-Driven Execution

Architectural Support for Data-Driven Execution Architectural Support for Data-Driven Execution GEORGE MATHEOU and PARASKEVAS EVRIPIDOU, University of Cyprus The exponential growth of sequential processors has come to an end, and thus, parallel processing

More information

Verilog-based simulation of hardware support for Data-flow concurrency on Multicore systems

Verilog-based simulation of hardware support for Data-flow concurrency on Multicore systems Verilog-based simulation of hardware support for Data-flow concurrency on Multicore systems George Matheou Department of Computer Science University of Cyprus Nicosia, Cyprus Email: geomat@cs.ucy.ac.cy

More information