DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor
Kyriakos Stavrou, Paraskevas Evripidou, and Pedro Trancoso
Department of Computer Science, University of Cyprus, 75 Kallipoleos Ave., P.O. Box 20537, 1678 Nicosia, Cyprus
{tsik,skevos,pedro}@cs.ucy.ac.cy

Abstract. High-end microprocessors achieve their performance as a result of adding more features and therefore increasing their complexity. In this paper we present DDM-CMP, a Chip Multiprocessor using the Data-Driven Multithreading execution model. As a proof-of-concept we present a DDM-CMP configuration with the same hardware budget as a high-end processor. Within that budget we implement four simpler CPUs, the TSUs, and the interconnection network. An estimation of DDM-CMP performance for the execution of SPLASH-2 kernels shows that, for the same clock frequency, DDM-CMP achieves a speedup of 2.6 to 7.6 compared to the high-end processor. A lower-frequency configuration, which is more power-efficient, still achieves high speedup (1.1 to 3.3). These encouraging results lead us to believe that the proposed architecture has a significant benefit over traditional designs.

1 Introduction

Current state-of-the-art microprocessor designs aim at achieving higher performance by exploiting more ILP through the use of complex hardware structures. Nevertheless, such an increase in complexity causes several problems and consequently yields only marginal performance gains. Palacharla et al. [1] explain that the window wakeup, selection, and operand bypass logic are likely to be the most limiting factors for improving performance in future designs. The analysis presented by Olukotun et al. [2] shows that the complexity of a large number of structures increases quadratically with processor parameters such as the issue width and the number of pipeline stages. Increasing design complexity not only limits the performance improvement but also makes validation and testing a difficult task [3]. Agarwal et al.
[4] show that the doubling of microprocessor performance every 18 months has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Their results show that, due to both diminishing improvements in clock rate and poor wire scaling as semiconductor devices shrink, the achievable performance growth of conventional microarchitectures will slow down substantially. An alternative design that achieves parallelism but avoids the complexity is the Chip Multiprocessor (CMP) [2]. Several research projects have proposed CMP architectures [2, 5, 6, 7]. In addition, commercial products have also been announced (e.g. IBM's POWER5 [8] and Sun's Niagara [9]).

T.D. Hämäläinen et al. (Eds.): SAMOS 2005, LNCS 3553, © Springer-Verlag Berlin Heidelberg 2005

Parallel architectures often suffer from large synchronization and communication latencies. Data-Driven Multithreading (DDM) [10, 11] is an execution model that aims at tolerating these latencies by allowing the computation processor to produce useful work while a long-latency event is in progress. In this model, the synchronization part of the program is separated from the computation part, allowing it to hide the synchronization and communication delays [10]. While such computation models usually require the design of dedicated microprocessors, Kyriacou et al. [10] showed that the DDM benefits may be achieved using commodity microprocessors. The only additional requirement is a small hardware structure, the Thread Synchronization Unit (TSU).

The contribution of this paper is to explore the DDM concept on the new CMP type of architecture. The proposed architecture, DDM-CMP, is a chip multiprocessor whose cores are simple embedded processors operating under the DDM execution model. Along with the cores, the chip also includes the TSUs and an interconnection network. The use of embedded processors is justified by Olukotun et al. [2], who showed that the simpler the cores of the multiprocessor, the higher their frequency can be. In addition, embedded processors are smaller and therefore we are able to include more cores in the same chip. A prototype will be built using the Xilinx Virtex II Pro [12] chip.

Our experiments use kernels from the SPLASH-2 benchmark suite and compare the estimated performance of a DDM-CMP system composed of four simple cores to that of a high-end processor with an equal hardware budget. For this analysis we use the Pentium III and Pentium 4 as representatives of simple and high-end processors, respectively.
The results show that a DDM-CMP configuration clocked at the same frequency as the high-end processor achieves a speedup of 2.6 to 7.6. DDM-CMP's benefits may also be exploited in low-power configurations, as the results show that even when clocked at less than half of the high-end processor's frequency, it achieves a speedup of 1.1 to 3.3.

The rest of this paper is organized as follows. Section 2 describes the DDM execution model, the proposed DDM-CMP architecture and its prototype implementation. Section 3 describes the case study used as the proof of our concept. Finally, we present our conclusions in Section 4.

2 DDM-CMP Architecture

The proposed DDM-CMP architecture is the evolution of the DDM architecture presented in [10, 11]. In this section, we present the DDM model of execution and describe the DDM-CMP architecture, its prototype implementation and its target applications.

2.1 DDM Model

Data-Driven Multithreading (DDM) provides effective latency tolerance by allowing the computation processor to produce useful work while a long-latency event is in progress. This model of execution evolved from the dataflow model of computation. In particular, it originates from the dynamic dataflow Decoupled Data-Driven (D³)
graphs [13, 14], where the synchronization part of a program is separated from the computation part. The computation part represents the actual instructions of the program executed by the computation processor, whereas the synchronization part contains information about data dependencies among threads and is used for thread scheduling.

A program in DDM is a collection of re-entrant code blocks. A code block is equivalent to a function or a loop body in the high-level program text. Each code block comprises several threads. A thread is a sequence of instructions equivalent to a basic block. A producer-consumer relationship exists among threads: in a typical program, a set of threads called the producers create data used by other threads called the consumers. Scheduling of code blocks, as well as scheduling of threads within a code block, is done dynamically at run time according to data availability. While the instructions within a thread are fetched by the CPU sequentially in control-flow order, the CPU may apply any optimization to increase ILP (e.g. out-of-order execution). As we are still in the process of developing a DDM-CMP compiler, the procedure of partitioning the program into a data-driven synchronization graph and code threads (as presented in [15]) is currently done by hand.

Fig. 1. Thread Synchronization Unit (TSU) internal structure

TSU - Hardware Support for DDM. The purpose of the Thread Synchronization Unit (TSU) is to provide hardware support for data-driven thread synchronization on conventional microprocessors. The TSU consists of three units: the Thread Issue Unit (TIU), the Post Processing Unit (PPU) and the Network Interface Unit (NIU).
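The data-availability firing rule that drives this scheduling can be sketched in a few lines of Python. The thread graph, the ready counts (the number of producers each thread waits for), and the queue standing in for the hardware units are all illustrative, not part of the TSU specification:

```python
from collections import deque

# Sketch of data-driven thread scheduling. Each thread carries a Ready
# Count: the number of producers whose data it still waits for. A thread
# fires only when its Ready Count reaches zero. All names are illustrative.

# consumers[t] lists the threads that consume data produced by thread t.
consumers = {"A": ["C"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": []}
# Initial ready counts (set "by the compiler"): number of producers per thread.
ready_count = {"A": 0, "B": 0, "C": 2, "D": 1, "E": 2}

# Threads with no pending producers are ready from the start.
ready_queue = deque(t for t, rc in ready_count.items() if rc == 0)
order = []

while ready_queue:
    t = ready_queue.popleft()      # issue a ready thread to the CPU
    order.append(t)                # (the CPU runs its instructions here)
    for c in consumers[t]:         # one more input of c is now available
        ready_count[c] -= 1
        if ready_count[c] == 0:    # all inputs produced: c becomes ready
            ready_queue.append(c)

print(order)  # scheduling order dictated purely by data availability
```

Threads A and B have no producers, so they fire immediately; C, D and E become ready only once all of their inputs have been produced, which is the data-driven scheduling order the TSU enforces in hardware.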
When a thread completes its execution, the PPU updates the Ready Count of its consumer threads (the Ready Count is set by the compiler and corresponds to the number of input values, or producers, of the thread), determines whether any of those threads has become ready for execution and, if so, forwards them to the TIU. The function of the TIU is to
schedule and prefetch the threads deemed executable by the PPU. The NIU is responsible for the communication between the TSU and the interconnection network. The internal structure of the TSU is depicted in Figure 1. A detailed description of the operation of the TSU can be found in [15].

CacheFlow. Although DDM can tolerate communication and synchronization latency, scheduling based on data availability may have a negative effect on locality. To overcome this problem, the scheduling information, together with software-triggered data prefetching, is used to implement efficient cache management policies. These policies are named CacheFlow [16]. The most effective CacheFlow policy contains two optimizations: False Conflict Avoidance and Thread Reordering. False conflict avoidance prevents the prefetcher from replacing cache blocks required by the threads deemed executable, and so reduces cache misses. Thread reordering attempts to exploit both temporal and spatial locality by reordering the threads still waiting for their input data.

2.2 CMP Architecture

The proposed chip multiprocessor can be implemented using three hardware structures: the microprocessor cores, the TSUs and the interconnection network.

Fig. 2. Several alternatives of the DDM-CMP architecture: (a) Each microprocessor has its own TSU, (b) One TSU is shared among two microprocessors and the number of cores increases, (c) One TSU serves all the microprocessors of the chip, and (d) Saved space is used to implement on-chip shared cache

Our first proposed CMP architecture (Figure 2-(a)) is one that simply performs the integration of the previously proposed D²NOW [11] into a single chip. While having one TSU per processor was required in a NOW system, when all processors are on the
same chip it is possible to optimize the use of the TSU structure and share it among two or more processors (Figure 2-(b)). Ultimately, we may consider the extreme case where one TSU is shared among all CPUs on-chip (Figure 2-(c)). Notice that by saving hardware through the sharing of the TSUs it may be possible to increase the number of on-chip CPUs or, alternatively, to add internal shared cache (Figure 2-(d)).

Although the impact of the interconnection network on the performance of an architecture that uses the DDM execution model is small [15], there is still potential for studying several alternatives. This is especially interesting as the number of on-chip CPUs increases. The tradeoff between the size and the performance of the interconnection network will be studied, as a larger, more complex interconnection network may result in a decrease of the number of CPUs that can be embedded in the chip.

2.3 Prototype Implementation

To prove that the proposed DDM-CMP architecture offers the expected benefits, a hardware prototype will be implemented. This prototype will use the Xilinx Virtex II Pro chip [12]. This chip contains, among others, two embedded PowerPC 405 [17] processors and a programmable FPGA with more than logic cells. We aim at implementing the TSU and the interconnection network on the FPGA portion and executing the application threads on the two processors.

2.4 Target Applications

The DDM-CMP architecture can be used to speed up the execution of parallelizable loop-based or pipeline-like applications. On the one hand, the proposed architecture is clearly beneficial for parallelizable applications as it provides multiple parallel execution processors. On the other hand, protocol stack applications, which are representative examples of pipeline-like applications, can benefit from DDM-CMP by mapping the code corresponding to each layer to a different DDM thread.
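This layer-per-thread mapping can be illustrated with a small software sketch. The three layers, their transformations, and the packet strings below are invented for the example; in DDM-CMP each stage would be a DDM thread on its own core rather than an OS thread:

```python
import queue
import threading

# Sketch of a pipeline-like application mapped to parallel threads, one per
# "layer". The layers and their transformations are invented for this example.

def make_stage(transform, inbox, outbox):
    """A pipeline stage: repeatedly take an item, transform it, pass it on."""
    def run():
        while (item := inbox.get()) is not None:
            outbox.put(transform(item))
        outbox.put(None)              # propagate the shutdown sentinel
    return threading.Thread(target=run)

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
stages = [
    make_stage(lambda p: p + "|framed",    q0, q1),   # hypothetical link layer
    make_stage(lambda p: p + "|routed",    q1, q2),   # hypothetical network layer
    make_stage(lambda p: p + "|delivered", q2, q3),   # hypothetical transport layer
]
for s in stages:
    s.start()

for pkt in ["pkt0", "pkt1", "pkt2"]:   # packets stream through the pipeline
    q0.put(pkt)
q0.put(None)                           # end of input

results = []
while (out := q3.get()) is not None:
    results.append(out)
for s in stages:
    s.join()

print(results)
```

While one packet is being routed, the next can already be framed, so the layers overlap in time exactly as in a hardware pipeline.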
Each layer, or DDM thread, will run in parallel, providing a pipelined execution model with a significant performance enhancement. Overall, we envision the DDM-CMP chip being used in a single-chip system as a substitute for a high-end microprocessor, or as a building block for larger multiprocessor systems like BlueGene/L [18].

3 DDM-CMP Performance Potential Analysis

3.1 Design

The objective of the proposed DDM-CMP architecture is to achieve better performance than a current high-end microprocessor, given the same hardware budget, i.e. the same die area. For our analysis we consider the Intel Pentium 4 as the baseline high-end microprocessor. As mentioned before, DDM-CMP is built out of simpler cores. For the purposes of our analysis we consider the Intel Pentium III as a representative of such a core. From the information reported in [19], the number of transistors used in implementing the Intel Pentium 4 (3.2GHz, 1MB L2 cache, 90nm technology) is approximately
125 million, while the number of transistors used in implementing the Intel Pentium III (800MHz, 256KB L2 cache, 180nm technology) is 22 million. Therefore, the Pentium 4 requires approximately 5.7 times more transistors than what is needed to build the Pentium III. In addition to the processors, other hardware structures are needed to implement the DDM-CMP architecture: the TSUs and the interconnection network. As explained earlier, these structures can be implemented using a relatively small number of transistors. If we use the Pentium 4 transistor budget to implement four Pentium III processors, about 37 million transistors will be left unused. This number of transistors is more than enough to implement the four TSUs and the appropriate interconnection network. Therefore, a DDM-CMP architecture with four Pentium III processors can be implemented with the same number of transistors needed to build a Pentium 4. This is the DDM-CMP configuration that will be used for our proof-of-concept experiments.

3.2 Experimental Setup

As we do not yet have a DDM-CMP simulator, its performance results are derived from the results obtained by Kyriacou et al. [10] for the D²NOW implementation. In this case, we use the results for the D²NOW architecture configured with four Pentium III 800MHz processors, including all architecture optimizations. Notice that the D²NOW results are conservative for the DDM-CMP architecture, as the on-chip interconnection network has both larger bandwidth and smaller latency than the D²NOW interconnect. As the baseline high-end processor we have selected the Pentium 4 3.2GHz. To obtain the results for this setup we measure the execution time of the application's native execution on that system. The execution time is determined by measuring the number of processor cycles consumed in the execution of the main function of the program, i.e. we ignore the initialization phase.
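The hardware-budget arithmetic of Section 3.1 can be summarized with a short sanity check. The Pentium 4 total below is derived from the 22-million figure and the 5.7x ratio rather than quoted directly from [19]:

```python
# Sanity check of the hardware-budget arithmetic. The Pentium III count and
# the 5.7x ratio come from the text; the Pentium 4 total is derived from them.

PIII_TRANSISTORS = 22_000_000           # Pentium III 800MHz, 256KB L2, 180nm
P4_OVER_PIII = 5.7                      # ratio reported in the text
P4_TRANSISTORS = round(PIII_TRANSISTORS * P4_OVER_PIII)   # ~125 million

CORES = 4
used_by_cores = CORES * PIII_TRANSISTORS                  # 88 million
left_for_tsus_and_network = P4_TRANSISTORS - used_by_cores

print(f"Pentium 4 budget:          {P4_TRANSISTORS / 1e6:.0f}M transistors")
print(f"4 x Pentium III cores:     {used_by_cores / 1e6:.0f}M transistors")
print(f"Left for TSUs and network: {left_for_tsus_and_network / 1e6:.0f}M transistors")
```

The leftover budget comes out at roughly 37 million transistors, matching the figure quoted above for the TSUs and the interconnection network.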
The processor cycles are measured by reading the processor's hardware performance counters [20]. Notice that, as this is native execution, in order for the results to be statistically meaningful we run the same experiment ten times and exclude the largest and smallest measurements. For this proof-of-concept analysis the workload used to test the proposed architecture is composed of three kernels from the SPLASH-2 benchmark suite [21]: LU, FFT, and Radix.

3.3 Experimental Results

The results collected from [10] and from the native execution on the Pentium 4 system are summarized in Table 1. It is possible to observe that the 4 x Pentium III DDM-CMP achieves better performance than the native Pentium 4 only for the Radix application. For both FFT and LU the DDM-CMP performance is worse than the one obtained for the Pentium 4. It is interesting to note that these results may be correlated with the fact that for both FFT and LU the execution time is much smaller than the one for the Radix application. From a brief analysis of the execution of the applications we were able to determine that the main function that performs the calculations of the algorithm accounts for more than 80% of the total execution time for Radix, while it accounts for only approximately 50% for both FFT and LU. This is an indication that in order to
obtain more reliable results for both FFT and LU we will need to use larger data set sizes. This issue will be addressed in the near future as we complete the DDM-CMP simulator. Nevertheless, at this point we do not consider this result to be a problem, as we expect that there will always be applications that do not show better performance when executing on the DDM-CMP.

Table 1. Performance results with different implementation technology

         DDM-CMP (4 x Pentium III 800MHz)    Pentium 4 3.2GHz              Speedup
         Cycles [x1000]    Time [ms]         Cycles [x1000]    Time [ms]
  FFT
  LU
  Radix

The results presented in Table 1 are significantly affected by technology scaling. It is important to notice that the Pentium III is implemented using 0.18µm technology and its clock speed is 800MHz, whereas the Pentium 4 is implemented using 0.09µm technology and clocked at 3.2GHz. If pipeline stalls due to off-chip operations are not taken into account, the number of clock cycles needed to execute a series of instructions is independent of the implementation technology. If we consider that off-chip operations are not dominant in the applications studied, the execution time of a specific application on a specific architecture will decrease at the same rate that the frequency increases.

In this analysis we consider two frequency scaling scenarios. The first one is a realistic scaling where, instead of the original Pentium III 800MHz, we consider the highest clock frequency at which the Pentium III was produced: from [22], the Pentium III code-named Tualatin had a clock frequency of 1.4GHz. The second scaling is the upper-limit scenario, where we consider a Pentium III-equivalent processor that would be able to scale up to the Pentium 4 frequency (3.2GHz). An additional optimization that will be considered in the future is that the TSU may be modified to be shared among more than one processor.
This will minimize the hardware overhead that is introduced by the DDM architecture. The extra space that is saved by sharing the TSU may be used to increase the number of cores on the chip. One more factor that will have an impact on the performance of DDM-CMP is the type of processor used as the core. In this analysis we are using the Pentium III, given the restrictions that originate from the use of the results obtained with the D²NOW study. In a real implementation, as discussed previously, we will use embedded processors as the cores of the DDM-CMP. As these embedded processors are simpler, they require fewer transistors for their implementation and consequently we will be able to fit more cores on the same chip. Given the above arguments, in addition to the frequency scaling, we also consider the case where we would be able to fit eight processors on the same chip. Notice that, as we use the results from a D²NOW configured with eight Pentium III processors, the results are not accurate, but they can be used as an indication of the upper limit that may be achieved. The results from these scaling scenarios, together with the original results, are depicted in Figure 3.
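The frequency-scaling reasoning above (execution time falls at the same rate the clock rises, as long as cycle counts stay fixed) can be made concrete with a short estimate. The cycle counts used here are placeholders, not the measured values of Table 1:

```python
# Estimating the effect of frequency scaling on speedup, under the text's
# assumption that cycle counts are independent of implementation technology.
# The cycle counts below are placeholders, not the measured values of Table 1.

def exec_time_ms(cycles, freq_hz):
    """Execution time in milliseconds: cycles divided by clock frequency."""
    return cycles / freq_hz * 1e3

ddm_cycles = 4.0e9                  # hypothetical DDM-CMP cycle count
p4_cycles = 8.0e9                   # hypothetical Pentium 4 cycle count
p4_time = exec_time_ms(p4_cycles, 3.2e9)

# The three scaling scenarios: original PIII, Tualatin, and the upper limit.
for freq in (800e6, 1.4e9, 3.2e9):
    speedup = p4_time / exec_time_ms(ddm_cycles, freq)
    print(f"PIII cores @ {freq / 1e9:.1f} GHz: speedup {speedup:.2f}x")
```

With these placeholder numbers, doubling the core clock doubles the estimated speedup, which is exactly the linear relationship the two scaling scenarios rely on.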
Fig. 3. Speedup over the Pentium 4 3.2GHz for FFT, LU, and Radix on the 4-core and 8-core CMP, when frequency and core scaling is taken into account (bars: PIII 800MHz, PIII 1.4GHz, PIII 3.2GHz)

In Figure 3 we depict three bars for each application. The one on the left represents the original results, the one in the middle represents the Pentium III scaled to 1.4GHz, and the one on the right represents the upper-limit scaling with the 3.2GHz clock frequency. The group of results on the left represents the original chip design with four cores, while the group on the right represents the scenario where the optimizations used allow the number of cores to scale to eight.

As already observed with the original results, both FFT and LU have a speedup smaller than 1 and therefore perform better on the Pentium 4 system. In contrast, Radix achieves almost a 2x speedup when executing on the original DDM-CMP. The speedup values increase as the scaling is applied to the Pentium III processor. It is relevant to notice that even with the first scaling all three applications already show better performance on the DDM-CMP compared to the execution on the Pentium 4. This configuration also has the advantage of being more power-efficient than the original Pentium 4, as it is clocked at less than half of its frequency. When scaling the frequency to that of the baseline we observe larger speedup values, ranging from 2.6 to 7.6. It is also interesting to observe that Radix presents a superlinear speedup, as with the upper-limit scaling it achieves a speedup of 7.6 with only four processors. This may be justified by the effectiveness of the CacheFlow policies. The results for the eight-core DDM-CMP present, for all applications at every scaling scenario, a speedup larger than one. Overall, the results show very good speedup for both the high-performance and low-power DDM-CMP configurations.
4 Conclusions

In this paper we have presented DDM-CMP, a Chip-Multiprocessor implementation using the Data-Driven Multithreading execution model. The DDM-CMP architecture
turns away from the complexity path taken by recent high-end microprocessors. Its performance is achieved by combining several simple commodity microprocessors together with a small hardware overhead, the Thread Synchronization Unit (TSU). As a proof-of-concept we presented a DDM-CMP implementation that utilizes the same hardware budget as a current high-end processor, the Pentium 4, to implement four Pentium III processors together with the necessary TSUs and interconnection network. The results obtained are very encouraging, as the DDM-CMP configuration clocked at the same frequency as the Pentium 4 achieves a speedup of 2.6 to 7.6. DDM-CMP can alternatively be configured for power efficiency and still achieve high speedup: a configuration clocked at less than half of the Pentium 4 frequency achieves speedup values ranging from 1.1 to 3.3. We are currently evaluating the different architecture alternatives for DDM-CMP and a larger set of applications, and are starting to implement a prototype of this architecture on a Virtex II Pro chip.

Acknowledgments

We would like to thank Costas Kyriacou for his contribution to the discussions and the preparation of the results. We would also like to thank the anonymous reviewers for their valuable comments.

References

1. Palacharla, S., Jouppi, N., Smith, J.: Complexity-Effective Superscalar Processors. In: Proc. of the 24th ISCA (1997)
2. Olukotun, K., et al.: The Case for a Single-Chip Multiprocessor. In: Proc. of the 7th ASPLOS (1996)
3. Silas, I., et al.: System-Level Validation of the Intel(R) Pentium(R) M Processor. Intel Technology Journal 7 (2003)
4. Agarwal, V., et al.: Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures. In: Proc. of the 27th ISCA (2000)
5. Hammond, L., et al.: The Stanford Hydra CMP. IEEE Micro 20 (2000)
6. Barroso, L., et al.: Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In: Proc. of the 27th ISCA (2000)
7. Taylor, M., et al.: Evaluation of the Raw Microprocessor: An Exposed Wire Delay Architecture for ILP and Streams. In: Proc. of the 31st ISCA (2004)
8. Kalla, R., Sinharoy, B., Tendler, M.: IBM POWER5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro 24 (2004)
9. Kongetira, P.: A 32-way Multithreaded SPARC Processor. In: Proc. of Hot Chips (2004)
10. Kyriacou, C., Evripidou, P., Trancoso, P.: Data Driven Multithreading Using Conventional Microprocessors. Technical Report TR-05-4, University of Cyprus (2005)
11. Evripidou, P., Kyriacou, C.: Data Driven Network of Workstations (D²NOW). J. UCS 6 (2000)
12. XILINX: Virtex-II Pro and Virtex-II Pro X FPGA User Guide. Version 3.0 (2004)
13. Evripidou, P.: D³-machine: A Decoupled Data-Driven Multithreaded Architecture with Variable Resolution Support. Parallel Computing 27 (2001)
14. Evripidou, P., Gaudiot, J.: A Decoupled Graph/Computation Data-Driven Architecture with Variable-Resolution Actors. In: Proc. of ICPP (1990)
15. Kyriacou, C.: Data Driven Multithreading Using Conventional Control Flow Microprocessors. PhD dissertation, University of Cyprus (2005)
16. Kyriacou, C., Evripidou, P., Trancoso, P.: CacheFlow: A Short-Term Optimal Cache Management Policy for Data Driven Multithreading. In: Proc. of the 10th Euro-Par, Pisa, Italy (2004)
17. IBM Microelectronics Division: The PowerPC 405(tm) Core (1998)
18. The BlueGene/L Team: An Overview of the BlueGene/L Supercomputer. In: Proc. of the 2002 ACM/IEEE Supercomputing Conference (2002)
19. Intel: Intel Microprocessor Quick Reference Guide. pressroom/kits/quickreffam.htm (2004)
20. PCL: The Performance Counter Library Version 2.2 (2003)
21. Woo, S., et al.: The SPLASH-2 Programs: Characterization and Methodological Considerations. In: Proc. of the 22nd ISCA (1995)
22. Topelt, B., Schuhmann, D., Volkel, F.: The Mother of All CPU Charts Part 2 (2004)
More informationPerformance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture
Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture Sivakumar Harinath 1, Robert L. Grossman 1, K. Bernhard Schiefer 2, Xun Xue 2, and Sadique Syed 2 1 Laboratory of
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationMicroprocessor Trends and Implications for the Future
Microprocessor Trends and Implications for the Future John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 4 1 September 2016 Context Last two classes: from
More informationSimultaneous Multithreading and the Case for Chip Multiprocessing
Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture
More informationMemory. From Chapter 3 of High Performance Computing. c R. Leduc
Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor
More informationTRIPS: Extending the Range of Programmable Processors
TRIPS: Extending the Range of Programmable Processors Stephen W. Keckler Doug Burger and Chuck oore Computer Architecture and Technology Laboratory Department of Computer Sciences www.cs.utexas.edu/users/cart
More informationA Lost Cycles Analysis for Performance Prediction using High-Level Synthesis
A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,
More informationAdvanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University
Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationFundamentals of Computer Design
Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationEvolution of Computers & Microprocessors. Dr. Cahit Karakuş
Evolution of Computers & Microprocessors Dr. Cahit Karakuş Evolution of Computers First generation (1939-1954) - vacuum tube IBM 650, 1954 Evolution of Computers Second generation (1954-1959) - transistor
More informationNew Advances in Micro-Processors and computer architectures
New Advances in Micro-Processors and computer architectures Prof. (Dr.) K.R. Chowdhary, Director SETG Email: kr.chowdhary@jietjodhpur.com Jodhpur Institute of Engineering and Technology, SETG August 27,
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationThread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007
Thread-level Parallelism for the Masses Kunle Olukotun Computer Systems Lab Stanford University 2007 The World has Changed Process Technology Stops Improving! Moore s law but! Transistors don t get faster
More information6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models
Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationFundamentals of Computers Design
Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More
More informationSimultaneous Multithreading: a Platform for Next Generation Processors
Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationMarch 17, :11 WSPC/INSTRUCTION FILE ddm cmp ppl mar08. Rapid Prototyping of the Data-Driven Chip-Multiprocessor (D 2 -CMP) using FPGAs
Parallel Processing Letters c World Scientific Publishing Company Rapid Prototyping of the Data-Driven Chip-Multiprocessor (D -CMP) using FPGAs Konstantinos Tatas, Costas Kyriacou Computer Engineering
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More information45-year CPU Evolution: 1 Law -2 Equations
4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there
More informationChapter 1: Fundamentals of Quantitative Design and Analysis
1 / 12 Chapter 1: Fundamentals of Quantitative Design and Analysis Be careful in this chapter. It contains a tremendous amount of information and data about the changes in computer architecture since the
More information18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013
18-447: Computer Architecture Lecture 30B: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013 Readings: Multiprocessing Required Amdahl, Validity of the single processor
More informationProcessors. Young W. Lim. May 12, 2016
Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version
More information15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationUnderstanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures
Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3
More informationA Multiprocessor Memory Processor for Efficient Sharing And Access Coordination
1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu
More informationMore on Conjunctive Selection Condition and Branch Prediction
More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused
More informationDDMCPP: The Data-Driven Multithreading C Pre-Processor
DDMCPP: The Data-Driven Multithreading C Pre-Processor Pedro Trancoso, Kyriakos Stavrou, Paraskevas Evripidou Department of Computer Science University of Cyprus 75 Kallipoleos Ave., P.O.Box 20537, 1678
More informationCS 152 Computer Architecture and Engineering. Lecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationLec 25: Parallel Processors. Announcements
Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More information2 TEST: A Tracer for Extracting Speculative Threads
EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationPerformance Impact of Resource Conflicts on Chip Multi-processor Servers
Performance Impact of Resource Conflicts on Chip Multi-processor Servers Myungho Lee, Yeonseung Ryu, Sugwon Hong, and Chungki Lee Department of Computer Software, MyongJi University, Yong-In, Gyung Gi
More informationComputer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015
18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationComputer Architecture Today (I)
Fundamental Concepts and ISA Computer Architecture Today (I) Today is a very exciting time to study computer architecture Industry is in a large paradigm shift (to multi-core and beyond) many different
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationMultithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University
Multithreaded Architectures and The Sort Benchmark Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University About our Sort Benchmark Based on the benchmark proposed in A measure
More informationEN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design
EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown
More informationProf. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6
Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Why do I need four computing cores on my phone?! Why do I need eight computing
More informationKeywords and Review Questions
Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain
More informationMultiple Context Processors. Motivation. Coping with Latency. Architectural and Implementation. Multiple-Context Processors.
Architectural and Implementation Tradeoffs for Multiple-Context Processors Coping with Latency Two-step approach to managing latency First, reduce latency coherent caches locality optimizations pipeline
More informationComputer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Parallel Processing Basics Prof. Onur Mutlu Carnegie Mellon University Readings Required Hill, Jouppi, Sohi, Multiprocessors and Multicomputers, pp. 551-560 in Readings in Computer
More informationA comparison of three architectures: Superscalar, Simultaneous Multithreading CPUs and Single-Chip Multiprocessor.
A comparison of three architectures: Superscalar, Simultaneous Multithreading CPUs and Single-Chip Multiprocessor. Recent years have seen a great deal of interest in multiple-issue machines or superscalar
More informationECE 588/688 Advanced Computer Architecture II
ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Fall 2009 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2009 1 When and Where? When:
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationinstruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals
Performance Evaluations of a Multithreaded Java Microcontroller J. Kreuzinger, M. Pfeer A. Schulz, Th. Ungerer Institute for Computer Design and Fault Tolerance University of Karlsruhe, Germany U. Brinkschulte,
More informationComputer Architecture Lecture 24: Memory Scheduling
18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationChip Multithreading: Opportunities and Challenges
Chip Multithreading: Opportunities and Challenges Lawrence Spracklen & Santosh G. Abraham Scalable Systems Group Sun Microsystems Inc., Sunnyvale, CA {lawrence.spracklen,santosh.abraham}@sun.com Abstract
More informationLecture 21: Parallelism ILP to Multicores. Parallel Processing 101
18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationCS377P Programming for Performance Multicore Performance Multithreading
CS377P Programming for Performance Multicore Performance Multithreading Sreepathi Pai UTCS October 14, 2015 Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationArchitectural Support for Data-Driven Execution
Architectural Support for Data-Driven Execution GEORGE MATHEOU and PARASKEVAS EVRIPIDOU, University of Cyprus The exponential growth of sequential processors has come to an end, and thus, parallel processing
More informationVerilog-based simulation of hardware support for Data-flow concurrency on Multicore systems
Verilog-based simulation of hardware support for Data-flow concurrency on Multicore systems George Matheou Department of Computer Science University of Cyprus Nicosia, Cyprus Email: geomat@cs.ucy.ac.cy
More information