Low-Level Monitoring and High-Level Tuning of UPC on CC-NUMA Architectures


Low-Level Monitoring and High-Level Tuning of UPC on CC-NUMA Architectures

Ahmed S. Mohamed
Department of Electrical and Computer Engineering
The George Washington University
Washington, DC

Abstract: We experiment with various techniques for monitoring and tuning UPC programs while porting the NAS NPB benchmark, using the recently developed GCC-SGI UPC compiler on the Origin O3800 NUMA machine. The performance of the NAS NPB in the SGI NUMA environment is compared to previous NAS NPB statistics on a Compaq multiprocessor. The SGI NUMA environment has provided new opportunities for UPC. For example, the spectrum of performance analysis and profiling tools within the SGI NUMA environment made it possible to develop new monitoring and tuning strategies that aim at improving the efficiency of parallel UPC applications. Our objective is to be able to project the physically monitored parameters back to the data structures and high-level program constructs within the source code. This increases a programmer's ability to effectively understand, develop, and optimize programs, enabling an exact analysis of a program's data and code layouts. Using this visualized information, programmers are able to further optimize UPC programs with better data and thread layouts, potentially resulting in significant performance improvements. Furthermore, the SGI CC-NUMA environment provided memory consistency optimizations to mask the latency of remote accesses, convert aggregate accesses into more efficient bulk operations, and cache data locally. UPC allows programmers to specify memory accesses with "relaxed" consistency semantics. These explicit consistency "hints" are exploited very effectively by the CC-NUMA environment to hide latency and further reduce coherence overheads by allowing, for example, two or more processors to modify their local copies of shared data concurrently and merging the modifications at synchronization operations. This characteristic alleviates the effect of false sharing.

Key Words: UPC, NAS, Latency, Privatization

1- Introduction

Unified Parallel C (UPC) is an explicit parallel extension of the ANSI C programming language designed for high performance computing on large-scale parallel machines [1]. Its two primary advantages are that programmers can get very close to the hardware, so that they can easily attain optimized performance, and that there is a large body of existing C code that can be parallelized. The language provides a uniform programming model for both shared and distributed memory hardware, since it is based on a distributed shared memory programming model. Its communication model is based on the idea of a shared, partitioned address space, where variables may be directly read and written by multiple processors, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time, typically with a single thread of execution per processor. The philosophy behind UPC is to start with C and keep all of its powerful concepts and features, then add parallelism, capitalizing on the experience gained from previous and current parallel C dialects such as Split-C [2], Cilk [3], and AC [4]. The performance of a UPC parallel program depends on five main factors: (i) whether the parallelism within a program is explored/exposed efficiently to its potential.
It is always important that a parallel developer explores/exposes parallelism in a program to the extreme so that a multiprocessor system can be highly utilized. Parallelism can be indicated explicitly by the UPC programmer, recognized by a parallel UPC compiler, or discovered automatically by the UPC runtime system. (ii) whether the tasks to be performed are evenly allocated among processors. Load balancing is one of the key factors that affect system performance. The motivation for load balancing is to reduce the average completion time of processes and improve the utilization of processors. (iii) whether the UPC shared data used by all parallel processes/threads is rationally distributed among the memories of the processors. If a process/thread accesses remote memory very frequently, the proportion of communication overhead to execution time will be relatively high and parallel performance suffers. (iv) whether the running tasks are not cooperating as a team and are thus creating much synchronization overhead. If processes/threads spend too much time waiting for each other (barriers) or waiting on data locks, we need to restructure our computation/data

layout. (v) whether UPC I/O and the file system present a performance bottleneck. Many parallel applications require high-performance I/O to avoid negating some or all of the benefit derived from parallelizing their computation. The UPC I/O interface should enable efficient use of an underlying parallel file system by allowing parallel applications to describe complex I/O requests at a high level. UPC programs therefore need extensive tuning with respect to the above five factors. This requires, however, that monitoring information about a UPC application's behavior be collected. While the performance of message-passing codes can easily be assessed and optimized using standard instrumentation tools (e.g. MPI-trace), the same task is much more difficult for the distributed shared memory UPC environment. This stems from the fact that distributed shared memory communication is performed at runtime through transparently issued store and load operations to remote data locations and is handled completely by the NUMA hardware. In addition, distributed shared memory communication is very fine grain, making code instrumentation that records each global memory operation practically infeasible, as it would slow down the execution significantly and thereby distort the final monitoring to the point where it is unusable for an accurate performance analysis. In this paper we propose a monitoring and tuning strategy that aims at improving the efficiency of parallel UPC applications. Our objective is to be able to project the physically monitored parameters back to the data structures and high-level program constructs within the source code. This increases a programmer's ability to effectively understand, develop, and optimize programs, enabling an exact analysis of a program's data and code layouts. Using this visualized information, programmers are able to detect the communication bottlenecks and further optimize programs with better data and code layouts, potentially resulting in significant performance improvements. The problem is that all five factors affecting the performance of parallel programs interact in complex ways, which makes tuning at the high level difficult. Analyzing low-level hardware counters to diagnose high-level parallel code is also not easy. Efficient distributed shared memory programming is one of the main motivations of the UPC project. In order to achieve this goal, both the low-level hardware and the high-level mechanisms have to be taken into consideration. Figure 1 shows the various layers involved in both UPC program development and execution. Our objective is to collect low-level monitoring information from the lower layers and, after analyzing this information, be able to tune UPC programs at the higher layers. In this work, we make use of SGI NUMA monitoring facilities (e.g. program profilers, IRIX system calls or services) to obtain performance counters such as the number of cache misses; process and thread management statistics; lock and synchronization information such as mean and peak lock time and peak barrier wait time; low-level data transfer statistics; detailed access counts for pages and memory regions; communication hot-spot statistics; and access-behavior histograms on virtual pages. Inappropriate data allocation can, for example, be easily detected via the different heights of the paging histogram's columns. For example, page 300 (virtual number) is located in processor 1's memory but accessed only by processor 2.
It is therefore incorrectly allocated and should be placed in processor 2's memory. Monitoring is a direct and efficient way to show where the program spends most of its execution time: time spent resolving cache misses, time spent waiting at barriers, time spent waiting on remote accesses, time spent waiting for I/O, time spent waiting for locks, etc. In UPC, we assume that the programmer can specify the correct code/data layout for his application. A UPC programmer is always given a list of possible optimization hints such as:

a- Space privatization: use private pointers instead of shared pointers when dealing with local shared data (through casting and assignments).
b- Block moves: use block copies instead of copying elements one by one with a loop, through string operations or structures.
c- Latency hiding: overlap remote accesses with local processing using split-phase barriers.

In this work we would like to verify such optimization hints and come up with new, more justifiable ones. We do this while reporting on our experience in porting the NAS NPB benchmark to the recently developed GCC-SGI UPC compiler. The performance of the NAS NPB in the SGI NUMA environment is also compared to previous NAS NPB statistics on a Compaq multiprocessor.

2- Experimental work

The test-bed machine on which we ran the experimental work is an O3800 NUMA system with 32 MIPS R14000 processors at 500 MHz, each with 2-way set-associative 32 KB instruction and 32 KB data caches and an 8 MB secondary cache, 8 GB of SDRAM, and 3.2 GB/s total memory bandwidth, running IRIX 6.5. For the Compaq comparison we used a Compaq AlphaServer SC with 32 Alpha EV68 processors at 833 MHz, each with a 64 KB instruction cache, a 64 KB data cache, and 8 MB of secondary cache per processor, 24 GB of SDRAM, and 5.2 GB/s total memory bandwidth, running Tru64 Unix.
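Before turning to the ported benchmarks, the following minimal UPC sketch illustrates the kind of data-layout decision that the paging histograms discussed above are meant to expose; the array names and the block size BLK are hypothetical, and a UPC 1.x compiler is assumed. The block size declared for a shared array determines which thread's memory each element has affinity to, so a histogram showing one thread repeatedly touching pages that live in another thread's memory points at a declaration whose layout does not match the access pattern.

#include <upc_relaxed.h>          /* relaxed consistency for all shared accesses */

#define BLK 128                   /* hypothetical block size (elements per block) */

/* Default (cyclic) layout: element i has affinity to thread i % THREADS, so a
   thread sweeping a contiguous index range touches mostly remote memory. */
shared double cyclic_a[BLK * THREADS];

/* Blocked layout: each thread owns contiguous chunks of BLK elements, so the
   loop below touches mostly local memory. */
shared [BLK] double blocked_a[BLK * THREADS];

int main(void)
{
    int i;
    /* The affinity expression &blocked_a[i] makes iteration i execute on the
       thread in whose memory blocked_a[i] resides. */
    upc_forall (i = 0; i < BLK * THREADS; i++; &blocked_a[i])
        blocked_a[i] = 2.0 * i;
    upc_barrier;
    return 0;
}

With the blocked declaration and the matching affinity expression, each thread's page-access histogram is dominated by pages in its own memory; with the default cyclic declaration, most of a thread's accesses in the same loop would be remote.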

2.1 Porting NAS NPB-2.3 on the GCC-SGI UPC V1.10 Compiler

The GCC-SGI UPC V1.10 compiler implementation effort has been undertaken recently [3]. UPC had two previous compilers: the Compaq UPC compiler and the Cray UPC compiler. The Origin UPC compiler is based on a modified version of the gcc C compiler that calls specific functions for each basic UPC operation. The gcc UPC toolset provides a compilation and execution environment for UPC programs on the Origin NUMA machine. The current version extends the capabilities of the GNU gcc compiler. The gcc UPC compiler is implemented as a C language dialect translator, in a fashion similar to the implementation of the GNU Objective C compiler. This release was developed exclusively for use on SGI workstations and servers running the MIPS instruction set, the IRIX (release 6.5 or higher) operating system, and the MIPS2 32-bit ABI. Supported systems include Origin 2000 super-servers and Octane workstations. By default, this release of the gcc UPC compiler supports systems with as many as 256 independent processing units. We made use of this compiler in the experimental section.

Figure 2 in the appendix shows the execution time of the NAS NPB-2.3 benchmark on the Origin 3800 test-bed using the GCC-SGI UPC compiler. The benchmark was tested on 1, 4, 9, 16, 25, and 32 processors. The first number in each experiment represents the execution time of the given program, and the second number represents the time spent in the communication routines. For the kernels (CG, EP, FT, IS, and MG) only collective functions take place in this communication layer. For the applications (BT, LU, and SP), the second number is the time spent in the upc_memget() and upc_memput() routines. Figure 3 shows the same benchmark on the Compaq machine. The same two numbers computed in each Origin experiment were also computed here. The execution times on one processor using plain C code on both the Origin and Compaq machines are used to measure the speedups drawn in figures 2 and 3. Speedups in figure 2 are scaled with respect to the single-processor execution times. For NP processors, speedup is computed as T(ref)/T(NP), where T(ref) is taken from the first-column values. The speedups in figure 3 are likewise computed as T(ref)/T(NP) for an NP-processor run, with T(ref) again the single-processor execution time. The SGI UPC shows better performance than the Compaq UPC.

2.2 Low-level Monitoring and High-level Tuning of UPC programs

Figure 4 shows the execution time of three micro-kernel benchmarks, Matrix Multiplication, Sobel Edge Detection, and N-Queens, on 1, 2, 4, 8, and 16 processors of the NUMA O3800 test-bed with and without the following performance tuning adjustments (a UPC sketch of these hints is given below):

a- Space privatization: use private pointers instead of shared pointers when dealing with local shared data (through casting and assignments). Here we run two experiments, one before the optimization and another after.
b- Block moves: use block copies instead of copying elements one by one with a loop, through string operations or structures. Two experiments are demonstrated, one before and another after the optimization.
c- Latency hiding: overlap remote accesses with local processing using split-phase barriers. Two experiments are demonstrated, one before and another after the optimization.
d- Block pre-fetching: use block gets and puts. Two experiments are demonstrated, one before and another after the optimization.

Figure 4 plots the total runtime as our measure of the performance enhancement due to the various optimizations in the above benchmarking.
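As a concrete illustration of hints (a) through (d), the sketch below shows what they look like in UPC source. It is a minimal example rather than the benchmark code itself: the array a, its per-thread block size N, and the neighbor exchange are hypothetical, and a UPC 1.x compiler is assumed.

#include <upc_relaxed.h>   /* relaxed consistency: lets the runtime reorder and merge remote accesses */

#define N 256
shared [N] double a[N * THREADS];   /* one contiguous block of N elements per thread */
double local_buf[N];                /* private (per-thread) buffer for bulk transfers */

void tuned_step(void)
{
    int i, neighbor;
    double local_sum = 0.0;

    /* (a) Space privatization: a pointer-to-shared whose target has affinity to
       this thread can be cast to an ordinary C pointer, avoiding the overhead
       of shared-pointer arithmetic on every access. */
    double *mine = (double *)&a[MYTHREAD * N];
    for (i = 0; i < N; i++)
        mine[i] = (double)i;

    /* (c) Latency hiding with a split-phase barrier: signal arrival, overlap
       independent local work with other threads' progress, then wait. */
    upc_notify;
    for (i = 0; i < N; i++)
        local_sum += mine[i];
    upc_wait;

    /* (b)+(d) Block move / block prefetch: fetch a neighbor's whole block with
       a single bulk upc_memget instead of N fine-grain remote reads. */
    neighbor = (MYTHREAD + 1) % THREADS;
    upc_memget(local_buf, &a[neighbor * N], N * sizeof(double));

    (void)local_sum;   /* silence unused-variable warnings in this sketch */
}

The unoptimized counterpart of the same step would use fine-grain accesses through pointers-to-shared and a single upc_barrier, which roughly corresponds to the "before optimization" runs described above.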
The figure shows that both the fully optimized and the pointer-optimized versions outperform the unoptimized versions. Table 3 breaks these execution times into contributions from user time (which includes all local consistency time), time spent waiting at barriers, non-overlapped time spent waiting on faults, and non-overlapped time spent waiting for locks. Table 3 displays hardware counters that were collected from a number of system profiling tools found on the Origin machine. These profiling tools are:

Prof: a profiling tool for the time spent in both user and runtime system routines.
Speedshop (formerly SGI Pixie): collects data from the hardware counters with very little overhead for every processor; it can be used to detect possible load imbalance or extreme losses due to parallel overhead.
Perfex: reports the number of events that occur during the execution of a program, such as TLB misses, cache misses, etc.
Co-Pilot: a graphical interface that displays the activity in the various parts of the system during the execution of a parallel program.

The NAS parallel benchmarks (NPB) were developed by the Numerical Aerodynamic Simulation (NAS) program at NASA Ames Research Center for the performance evaluation of parallel supercomputers. The NPB mimics the computation and data movement characteristics of large-scale computational fluid dynamics (CFD) applications. The NPB comes in two flavors, NPB 1 and NPB 2. NPB 1 comprises the original "pencil and paper" benchmarks: vendors and others implement the detailed specifications in the NPB 1 report, using algorithms and programming models appropriate to their different machines. NPB 2, on the other hand, consists of MPI-based source-code implementations written and distributed by NAS; they are intended to be run with little or no tuning. Another implementation of NPB 2 is NPB 2-serial: single-processor (serial) source-code implementations derived from NPB 2 by removing all parallelism [NPB]. We have therefore used NPB 2 in our MPI execution time measurements. NPB 2-serial was used to provide the uniprocessor performance when

reporting on the scalability of MPI. The NPB suite consists of five kernels (EP, MG, FT, CG, IS) and three pseudo-application (LU, SP, BT) programs. The bulk of the computation in IS is integer arithmetic; the other benchmarks are floating-point intensive. A brief description of each workload is presented in this section.

BT (Block Tri-diagonal) is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional (3-D) compressible Navier-Stokes equations. The finite-difference solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal with 5x5 blocks and are solved sequentially along each dimension. BT uses coarse-grain communications.

SP (Scalar Penta-diagonal) is a simulated CFD application with a structure similar to BT. The finite-difference solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions. The resulting system has scalar penta-diagonal bands of linear equations that are solved sequentially along each dimension. SP uses coarse-grain communications.

LU (Block Lower Triangular) is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system, resulting from finite-difference discretization of the Navier-Stokes equations in 3-D, by splitting it into block lower and upper triangular systems. LU performs a large number of small communications of five words each.

FT (Fast Fourier Transform): this benchmark solves a 3-D partial differential equation using an FFT-based spectral method, also requiring long-range communication. FT performs three one-dimensional (1-D) FFTs, one for each dimension.

MG (MultiGrid): the MG benchmark uses a V-cycle multigrid method to compute the solution of the 3-D scalar Poisson equation. It performs both short- and long-range communications that are highly structured.

CG (Conjugate Gradient): this benchmark computes an approximation to the smallest eigenvalue of a symmetric positive definite matrix. This kernel features unstructured grid computations requiring irregular long-range communications.

EP (Embarrassingly Parallel): this benchmark can run on any number of processors with little communication. It estimates the upper achievable limits for the floating-point performance of a parallel computer. It generates pairs of Gaussian random deviates according to a specific scheme and tabulates the number of pairs in successive annuli.

IS (Integer Sorting): this benchmark is a parallel sorting program based on bucket sort. It requires a lot of total-exchange communication.

There are different versions/classes of the NPB, such as Sample, Class A, Class B, and Class C. These classes differ mainly in the size of the problem. Tables 1 and 2 give the problem sizes and performance rates (measured in Mflop/s) for each of the eight benchmarks, for the Class A and Class B problem sets, on a single-processor Cray YMP:

(a) Table 1: Class A workloads (smaller version): benchmark name, problem size, operation count, and Mflop/s rate for EP, MG, CG, FT, IS, LU, SP, and BT.
(b) Table 2: Class B workloads (bigger version): benchmark name, problem size, operation count, and Mflop/s rate for EP, MG, CG, FT, IS, LU, SP, and BT.

Table 3 contains a number of results. First, the hardware counters of the unoptimized version are compared against the counters of the fully optimized version.
We show the case of matrix multiplication for NP=1 and NP=4. The statistics for NP=4 represent the counters of one of the 4 processors. Careful analysis of these counters reveals a number of important observations. For the NP=1 matrix multiplication, for example, the numbers of loads and stores are two orders of magnitude lower in the optimized version, which means the optimization hints have led to fewer memory references. This is also reflected in the cache behavior: although the numbers of cache misses are of the same order of magnitude, the number of quadwords written back from the cache is one order of magnitude lower in the optimized version, and the number of prefetch primary data cache misses is four orders of magnitude lower. The total number of decoded instructions is one order of magnitude lower in the optimized version, which means the processor had fewer instructions to execute. The number of executed prefetch instructions is four orders of magnitude larger in the optimized version, which means far more memory accesses were prefetched and therefore stalled the processor for less time. The number of mispredicted branches is one order of magnitude lower in the optimized version, which means the pipelines were less often exposed to pipeline hazards. For the NP=4 matrix multiplication, the numbers of loads and stores are lower by

an order of magnitude in the four-processor case. This is also reflected in the cache misses at both the level 1 and level 2 caches.

3- Conclusion

In this paper we have reported on our experience in porting the NAS NPB benchmark using the recently developed GCC-SGI UPC compiler on an Origin O3800 NUMA machine. The performance of the NAS NPB in the SGI NUMA environment is compared to previous NAS NPB statistics on a Compaq multiprocessor. We are currently working on a UPC-I/O reference implementation on top of ROMIO (the MPI-I/O implementation on the SGI NUMA system) and a UPC parallel I/O test suite for testing the reference implementation. The test suite is a set of I/O performance tests written using UPC-I/O routines; it includes a low-level class of tests and the matrix tests in the kernel class.

4- References

[1] T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC Language Specifications V1.0.
[2] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. Von Eicken, and K. Yelick. Introduction to Split-C. University of California, Berkeley.
[3] R. C. Miller. A Type-Checking Preprocessor for Cilk 2, a Multi-threaded C Language. Master's Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
[4] W. W. Carlson and J. M. Draper. Distributed Data Access in AC. Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), Santa Barbara, CA, July 19-21, 1995.

Appendix

Figures 2 and 3 (SGI GCC-UPC v1.10 vs. Compaq UPC v1.7): per-benchmark plots of the SGI and Compaq runs (EP-A, BT-A, CG-B, SP-A, FT-A, IS-A), each shown against the SEQ-LINEAR reference.

Figure 4: execution time (sec) of the Matrix Multiplication and Sobel Edge Detection micro-kernels with no optimization (NO OPT.), pointer optimization (PTR OPT.), and full optimization (FULL OPT.).

Table 3: Low-level monitoring for the Matrix Multiplication micro-kernel, No-Optimization vs. All-Optimization, for NP=1 and NP=4 (based on a 600 MHz IP0 MIPS R12000/R14000 CPU). Columns: event counter name, No Opt. NP=1, All Opt. NP=1, No Opt. NP=4, All Opt. NP=4. The counters reported are: cycles; executed prefetch instructions; decoded instructions, loads, and stores; miss handling table occupancy; failed store conditionals; resolved conditional branches; quadwords written back from the scache; correctable scache data array ECC errors; primary and secondary instruction cache misses; instruction misprediction from the scache way-prediction table; external interventions and invalidations; ALU/FPU progress cycles; graduated instructions, loads, stores, store conditionals, and floating-point instructions; prefetch primary data cache misses; quadwords written back from the primary data cache; TLB misses; mispredicted branches; primary and secondary data cache misses; data misprediction from the scache way-prediction table; state of intervention and invalidation hits in the scache; and store/prefetch exclusive to clean and to shared blocks in the scache. Derived statistics for the same four configurations: graduated instructions per cycle; graduated floating-point instructions per cycle; graduated loads and stores per cycle; graduated loads and stores per floating-point instruction; mispredicted branches per resolved conditional branch; graduated loads per decoded load (and prefetches); graduated stores per decoded store; data and instruction mispredictions per scache hit; L1 and L2 cache line reuse; L1 and L2 data cache hit rates; L1-L2 bandwidth used (MB/s, average per process); memory bandwidth used (MB/s, average per process); MFLOPS (average per process); cache misses in flight per cycle (average); and prefetch cache miss rate.
