Performance Impact of Resource Conflicts on Chip Multi-processor Servers


Myungho Lee, Yeonseung Ryu, Sugwon Hong, and Chungki Lee
Department of Computer Software, MyongJi University, Yong-In, Gyung Gi Do, Korea
{myunghol, ysryu, swhong,

Abstract. Chip Multi-Processors (CMPs) are becoming mainstream microprocessors for High Performance Computing as well as for commercial business applications. Multiple CPU cores on a CMP allow multiple software threads to execute on the same chip at the same time, so CMPs promise to deliver a higher capacity of computations performed per chip in a given time interval. However, resource sharing among the threads executing on the same chip can cause conflicts and lead to performance degradation. Thus, in order to obtain high performance and scalability on CMP servers, it is crucial to first understand the performance impact that resource conflicts have on the target applications. In this paper, we evaluate the performance impact of resource conflicts on an example high-end CMP server, the Sun Fire E25K, using a standard OpenMP benchmark suite, SPEC OMPL.

1 Introduction

Recently, microprocessor designers have been considering many design choices to efficiently utilize the ever-increasing effective chip area made available by growing transistor density. Instead of employing a complicated processor pipeline on a chip with an emphasis on improving a single thread's performance, incorporating multiple processor cores on a single chip (or Chip Multi-Processor) has become a mainstream microprocessor design trend. As a Chip Multi-Processor (CMP), it can execute multiple software threads on a single chip at the same time. Thus a CMP provides a larger capacity of computations performed per chip in a given time interval (or throughput). Examples include the Dual-Core Intel Xeon [3], AMD Opteron [1], the UltraSPARC IV, IV+, and T1 microprocessors from Sun Microsystems [12], [14], and the IBM POWER5 [5], among others.
B. Kågström et al. (Eds.): PARA 2006, LNCS 4699, pp. , © Springer-Verlag Berlin Heidelberg 2007

Shared-Memory Multiprocessor (SMP) servers based on CMPs have already been introduced in the market, e.g., the Sun Fire E25K [12] from Sun Microsystems, based on dual-core UltraSPARC IV processors. They are being rapidly adopted in High Performance Computing (HPC) as well as in commercial business applications. Although CMP servers promise to deliver higher chip-level throughput performance than servers based on traditional single-core processors, resources on the CMPs such as cache(s), the cache/memory bus, functional units, etc., are

shared among the cores on the same processor chip. Software threads running on the cores of the same chip compete for these shared resources, which can cause conflicts and hurt performance. Thus exploiting the full performance potential of CMP servers is a challenging task.

In this paper, we evaluate the performance impact of resource conflicts among the processor cores of CMPs on a high-end SMP server, the Sun Fire E25K. For our performance evaluation, we use HPC applications parallelized for SMP using the OpenMP standard [9]: the SPEC OMPL benchmark suite [11]. Using the Sun Studio 10 compiler suite [13], we generate highly optimized executables for the SPEC OMPL programs and run them on the E25K server. In order to evaluate the performance impact of conflicts on the shared resources (the level-2 cache bus and the main memory bus), 64-thread (and 32-thread) runs were conducted both using both cores of 32 CMPs (16 CMPs for the 32-thread run) and using only one core of 64 CMPs (32 CMPs for the 32-thread run). The experimental results show 17–18% average slowdowns (geometric mean over the nine benchmark programs) for the runs with resource conflicts compared with the runs without. Benchmarks which intensively utilize the memory bandwidth or allocate large amounts of memory suffer more from the resource conflicts.

The rest of the paper is organized as follows: Section 2 describes the architecture of an example CMP server, the Sun Fire E25K. Section 3 describes the OpenMP programming model and our test benchmark suite, SPEC OMPL; it also describes how we generate optimized executables for SPEC OMPL. Section 4 first shows the settings for utilizing Solaris 10 Operating System features useful for achieving high performance on SPEC OMPL, and then presents the experimental results on the E25K. Section 5 wraps up the paper with conclusions.
2 Chip Multi-processor Server

In this section, we describe the architecture of the example high-end CMP server used for our performance experiments. The Sun Fire E25K server is the first-generation throughput computing server from Sun Microsystems, which aims to dramatically increase application throughput by employing dual-core CMPs. The server is based on the dual-core UltraSPARC IV processor and can scale up to 72 processors executing 144 threads (two threads per UltraSPARC IV processor) simultaneously. The system offers up to twice the compute power of high-end systems based on the UltraSPARC III Cu (the predecessor of the UltraSPARC IV). The UltraSPARC IV contains two enhanced UltraSPARC III Cu cores (or Thread Execution Engines: TEEs), a memory controller, and the necessary cache tags for 8 MB of external L2 cache per core (see Fig. 1). The off-chip L2 cache is 16 MB in size (8 MB per core). The two cores share the Fireplane System Interconnect as well as the L2 cache bus; these thus become potential sources of performance bottlenecks. The basic computational component of the Sun Fire E25K server is the UniBoard [12]. Each UniBoard consists of up to four UltraSPARC IV processors,

Fig. 1. UltraSPARC IV processor

their L2 caches, and associated main memory. A Sun Fire E25K can contain up to 18 UniBoards, thus at maximum 72 UltraSPARC IV processors. In order to maintain cache coherency system-wide, a snoopy cache coherency protocol is used within a UniBoard and a directory-based cache coherency protocol is used among different UniBoards. The memory latency, measured using the lat_mem_rd() routine of lmbench, is 240 ns to memory within the same UniBoard and 455 ns to memory on a different UniBoard (remote memory).

3 SPEC OMPL Benchmarks

SPEC OMPL is a standard benchmark suite for evaluating the performance of OpenMP applications. It consists of application programs written in C and Fortran and parallelized using the OpenMP API [11]. The underlying execution model for OpenMP programs is fork-join (see Fig. 2) [9]. A master thread executes sequentially until a parallel region of code is encountered. At that point, the master thread forks a team of worker threads, and all threads participate in executing the parallel region concurrently. At the end of the parallel region (the join point), the team of worker threads and the master synchronize; the master thread then continues sequential execution alone. OpenMP parallelization incurs overhead costs that do not exist in sequential programs: creating threads, synchronizing threads, accessing shared data, allocating copies of private data, bookkeeping of thread-related information, and so on. The SPEC OMPL benchmark suite consists of nine application programs representative of HPC applications from the areas of chemistry, mechanical engineering, climate modeling, and physics. Each benchmark requires a memory size of up to 6.4 GB when running on a single processor. Thus

Fig. 2. OpenMP execution model

the benchmarks target large-scale systems with a 64-bit address space. Table 1 lists the benchmarks and their application areas.

Table 1. SPEC OMPL Benchmarks

Using the Sun Studio 10 compiler suite [13], we've generated executables for the benchmarks in the SPEC OMPL suite. By using combinations of compiler options provided by Sun Studio 10, a fairly high level of compiler optimization is applied to the benchmarks. Commonly used compiler flags are -fast -openmp -xipo=2 -autopar -xprofile -xarch=v9a. Further optimization flags are also applied to individual benchmarks. These options enable many common and advanced optimizations such as scalar optimizations, loop transformations, data prefetching, memory hierarchy optimizations, interprocedural optimizations, and profile feedback optimizations, among others. (Please see [13] for more details on the compiler options.) The -openmp option processes OpenMP directives and generates parallel code for execution on multiprocessors. The -autopar option provides automatic parallelization by the compiler beyond the user-specified parallelization, which can further improve performance.

4 Performance Results

Using the compiler options described in Section 3, we've generated highly optimized executables for SPEC OMPL. In this section, we first describe the system environment in which the optimized executables are executed. We then show the performance results, i.e., the impact of resource conflicts, on the Sun Fire E25K. We also show one example compiler technique which can reduce the impact of resource conflicts, along with experimental results.

4.1 System Environments

The Solaris 10 Operating System provides features which help improve the performance of OpenMP applications: Memory Placement Optimization (MPO) and Multiple Page Size Support (MPSS). The MPO feature can be useful in improving the performance of programs with intensive data accesses to localized regions of memory. With the default MPO policy, called first-touch, memory accesses can be kept on the local board most of the time; without MPO, those accesses would be distributed over all the boards (both local and remote), which can become very expensive. MPSS can improve the performance of programs which use a large amount of memory. Using large pages (supported by MPSS), the number of TLB entries needed by a program and the number of TLB misses can be significantly reduced, and thus performance can be significantly improved [10]. We enable both MPO and MPSS for our runs of the SPEC OMPL executables. OpenMP threads can be bound to processors using the environment variable SUNW_MP_PROCBIND, which is supported by the thread library in Solaris 10.
Processor binding, when used along with static scheduling, benefits applications that exhibit a certain data reuse pattern, where data accessed by a thread in a parallel region will either be in the local cache from a previous invocation of a parallel region, or in local memory due to the OS's first-touch memory allocation policy.

4.2 Impact of Resource Conflicts on CMP

As mentioned in Section 2, the two cores on one UltraSPARC IV CMP share the L2 cache bus and the memory bus, which are potential sources of performance bottlenecks. In order to measure the performance impact of these resource conflicts on SPEC OMPL, we've measured the performance of 64-thread (and 32-thread) runs in two ways:

1. Using 64 (32) UltraSPARC IV processors, thus using only one core per processor.
2. Using 32 (16) UltraSPARC IV processors, thus using both cores of each processor. In this case, there are possible resource conflicts between the two cores.

Table 2. 64-thread cases: 64x1 vs. 32x2

Table 2 (and Table 3) shows the run times for both 1 and 2 using 64 threads (32 threads), measured on a Sun Fire E25K with 1050 MHz UltraSPARC IV processors. They also show the speed-ups of 1 over 2. Overall, 1 performs 1.18x (1.17x) better than 2 in the 64-thread (32-thread) run. The benchmarks with greater performance gains from 1 show the following characteristics:

313.swim_l: This is a memory-bandwidth-intensive benchmark. For example, there are 14 common arrays accessed all over the program. All the arrays are

Table 3. 32-thread cases: 32x1 vs. 16x2

of the same size (7702 x 7702), and each array element is 8 bytes long; thus the total array size is 6,342 Mbytes. The arrays are seldom reused in the same loop iteration, and the accesses stride through the arrays continuously. When only one core is used per processor, it can fully utilize the L2 cache and the main memory bandwidth available on the processor chip, whereas when two cores are used the bandwidth is effectively halved between them. This led to a 1.39x gain of 1 over 2. Unless aggressive compiler optimizations are performed to increase data reuse, the benchmark will suffer from the continuous feeding of data to the processor cores, which burns all the available memory bandwidth.

315.mgrid_l: This benchmark, like 313.swim_l, requires high memory bandwidth. Although it shows some data reuse (group reuse) of the intensively accessed three-dimensional arrays, the same data is reused at most three times; therefore, the accesses stride through the arrays. Using only one core provides much higher memory bandwidth, as in 313.swim_l's case, which leads to a 1.20x gain.

325.apsi_l and 331.art_l: These benchmarks allocate large amounts of memory per thread at run time. For example, 325.apsi_l allocates a 6,771 Mbyte array at run time, besides many other smaller arrays. The dynamic memory allocation can be parallelized, but it still requires a large memory space per processor core. Thus, instead of allowing 8 threads to allocate large memory on the same UniBoard's memory, allowing only 4 threads, by using only one core per UltraSPARC IV, can yield a significant performance benefit. 331.art_l shows similar characteristics.

327.gafort_l: In this benchmark, the two hottest subroutines have critical sections inside their main loops. They both also suffer from intensive memory loads and stores generated from the critical-section loops. These take up large portions of the total run time.
Placing 2 threads on one UltraSPARC IV (by using both cores) can reduce the overhead involved in the locks and unlocks. However, allocating 8 threads across two different UniBoards (by using only one core in each UltraSPARC IV) reduces the pressure on the memory bandwidth significantly compared with allocating 8 threads on the same UniBoard, and the benefit of the latter dominates that of the former.

The remaining benchmarks (311.wupwise_l, 317.applu_l, 321.equake_l, 329.fma3d_l) put relatively less pressure on the memory bandwidth and/or consume smaller amounts of memory. Thus the performance gap between 1 and 2 is smaller; these benchmarks are not heavily affected by the resource conflicts and are more suitable for execution on CMP servers.

In order to show the performance impact of resource conflicts from a different perspective, we've calculated the speed-ups from 32-thread runs to 64-thread runs in two ways:

1. Calculating scalability from the 32x1 run to the 64x1 run, i.e., when only one core is used per processor.
2. Calculating scalability from the 32x1 run to the 32x2 run; here the 64-thread run is performed with resource conflicts.

Fig. 3. Scalabilities from 32-thread runs to 64-thread runs

Fig. 3 shows the scalabilities in both cases. For the benchmarks which are affected more by the resource conflicts, the two scalability bars show bigger gaps.

4.3 Algorithmic/Compiler Techniques to Reduce Resource Conflicts on CMP

For benchmarks which suffer heavily from resource conflicts, algorithmic and/or compiler techniques are needed to reduce the penalties. For example, aggressive procedure inlining and the skewed tiling technique [7] can be used for 313.swim_l. Skewed tiling, when applied to 313.swim_l, can convert a major portion of the memory accesses into cache accesses by increasing data reuse; it can thus significantly cut down the traffic to main memory and yield a large performance gain. Using the compiler flags -Qoption iropt -Atile:skewp provided by the Sun Studio 10 Fortran compiler, we've generated a new executable for 313.swim_l. We've run both the original and the new executables on a smaller Sun SMP server (a Sun Fire E2900 employing 12 UltraSPARC IV processors) using both cores of each UltraSPARC IV. For these runs we've reduced the array sizes to 1/4 of the original sizes. (There are fourteen two-dimensional arrays of size 7702 x 7702 in 313.swim_l; we reduced them to 3802 x 3802.) We've also reduced the number of loop iterations from 2400. Then we've conducted the following two runs:

Using 8 threads, the original executable runs in 1431 sec and the new one runs in 624 sec, resulting in a 2.29x speed-up.

Using 16 threads, the original executable runs in 1067 sec and the new one runs in 428 sec, resulting in a 2.49x speed-up.

The above results show the effectiveness of skewed tiling for 313.swim_l. Other algorithmic/compiler techniques are being sought for the benchmarks which are affected more by the resource conflicts.

5 Conclusion

In this paper, we first described the architecture of an example CMP server, the Sun Fire E25K, in detail. Then we introduced the OpenMP execution model along with the SPEC OMPL benchmark suite used for our performance study. We also showed how to generate highly optimized executables for SPEC OMPL using the Sun Studio 10 compiler. We then described the system settings under which we ran the optimized executables of SPEC OMPL, including features of the Solaris 10 OS (MPO, MPSS) which help improve HPC application performance, and the binding of threads to processors. Using these features, we measured the performance impact of resource conflicts on CMPs for SPEC OMPL, using either one core or both cores of the UltraSPARC IV CMPs in the system. It turned out that the benchmarks which have high memory bandwidth requirements and/or use large amounts of memory suffer in the presence of resource conflicts. Algorithmic and compiler techniques are needed to reduce the conflicts on the limited resources shared among the cores.

Acknowledgments. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (KRF D00785). The authors would like to thank the Center for Computing and Communications of RWTH Aachen University for providing access to the Sun Fire E25K and E2900 servers.

References

1. AMD Multi-Core: Introducing x86 Multi-Core Technology & Dual-Core Processors (2005)
2. Chaudhry, S., Caprioli, P., Yip, S., Tremblay, M.: High-Performance Throughput Computing. IEEE Micro (May-June 2005)
3. Intel Dual-Core Server Processor
4. Intel Hyperthreading Technology
5. Kalla, R., Sinharoy, B., Tendler, J.: IBM POWER5 chip: a dual-core multithreaded processor. IEEE Micro (March-April 2004)
6. Li, Y., Brooks, D., Hu, Z., Skadron, K.: Performance, Energy, and Thermal Considerations for SMT and CMP Architectures. In: 11th International Symposium on High-Performance Computer Architecture (2005)
7. Li, Z.: Optimal Skewed Tiling for Cache Locality Enhancement. In: International Parallel and Distributed Processing Symposium (IPDPS 2003) (2003)

8. Olukotun, K., et al.: The Case for a Single-Chip Multiprocessor. In: International Conference on Architectural Support for Programming Languages and Operating Systems (1996)
9. OpenMP Architecture Review Board
10. Solaris 10 Operating System
11. The SPEC OMP benchmark suite
12. Sun Fire E25K server, e25k/index.xml
13. Sun Studio 10 Software
14. Sun UltraSPARC T1 microprocessor
15. Tullsen, D., Eggers, S., Levy, H.: Simultaneous Multithreading: Maximizing On-Chip Parallelism. In: International Symposium on Computer Architecture (1995)


More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture

More information

Hardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.

Hardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

Shared Symmetric Memory Systems

Shared Symmetric Memory Systems Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1) Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

PCS - Part Two: Multiprocessor Architectures

PCS - Part Two: Multiprocessor Architectures PCS - Part Two: Multiprocessor Architectures Institute of Computer Engineering University of Lübeck, Germany Baltic Summer School, Tartu 2008 Part 2 - Contents Multiprocessor Systems Symmetrical Multiprocessors

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Multicore Workshop. Cache Coherency. Mark Bull David Henty. EPCC, University of Edinburgh

Multicore Workshop. Cache Coherency. Mark Bull David Henty. EPCC, University of Edinburgh Multicore Workshop Cache Coherency Mark Bull David Henty EPCC, University of Edinburgh Symmetric MultiProcessing 2 Each processor in an SMP has equal access to all parts of memory same latency and bandwidth

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

Workloads, Scalability and QoS Considerations in CMP Platforms

Workloads, Scalability and QoS Considerations in CMP Platforms Workloads, Scalability and QoS Considerations in CMP Platforms Presenter Don Newell Sr. Principal Engineer Intel Corporation 2007 Intel Corporation Agenda Trends and research context Evolving Workload

More information

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as

More information

Understanding Cache Interference

Understanding Cache Interference Understanding Cache Interference by M.W.A. Settle B.A., University of Colorado, 1996 M.S., University of Colorado, 2001 A thesis submitted to the Faculty of the Graduate School of the University of Colorado

More information

Parallel Architecture. Hwansoo Han

Parallel Architecture. Hwansoo Han Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors

An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors ACM IEEE 37 th International Symposium on Computer Architecture Elastic Cooperative Caching: An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors Enric Herrero¹, José González²,

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Memory Subsystem Profiling with the Sun Studio Performance Analyzer

Memory Subsystem Profiling with the Sun Studio Performance Analyzer Memory Subsystem Profiling with the Sun Studio Performance Analyzer CScADS, July 20, 2009 Marty Itzkowitz, Analyzer Project Lead Sun Microsystems Inc. marty.itzkowitz@sun.com Outline Memory performance

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message

More information

Lecture 2. Memory locality optimizations Address space organization

Lecture 2. Memory locality optimizations Address space organization Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it!

CSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Memory Computer Technology

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

CS377P Programming for Performance Multicore Performance Multithreading

CS377P Programming for Performance Multicore Performance Multithreading CS377P Programming for Performance Multicore Performance Multithreading Sreepathi Pai UTCS October 14, 2015 Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

JBus Architecture Overview

JBus Architecture Overview JBus Architecture Overview Technical Whitepaper Version 1.0, April 2003 This paper provides an overview of the JBus architecture. The Jbus interconnect features a 128-bit packet-switched, split-transaction

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

Comp. Org II, Spring

Comp. Org II, Spring Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel

More information

Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture

Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture Performance of DB2 Enterprise-Extended Edition on NT with Virtual Interface Architecture Sivakumar Harinath 1, Robert L. Grossman 1, K. Bernhard Schiefer 2, Xun Xue 2, and Sadique Syed 2 1 Laboratory of

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Arquitecturas y Modelos de. Multicore

Arquitecturas y Modelos de. Multicore Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have

More information

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Dan Doucette School of Computing Science Simon Fraser University Email: ddoucett@cs.sfu.ca Alexandra Fedorova

More information

Quantitative study of data caches on a multistreamed architecture. Abstract

Quantitative study of data caches on a multistreamed architecture. Abstract Quantitative study of data caches on a multistreamed architecture Mario Nemirovsky University of California, Santa Barbara mario@ece.ucsb.edu Abstract Wayne Yamamoto Sun Microsystems, Inc. wayne.yamamoto@sun.com

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache Is the cache indexed with virtual or physical address? To index with a physical address, we

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information