Parallel Code Optimisation


1 April 8, 2008

2 Terms and terminology; Identifying bottlenecks; Optimising communications; Optimising IO; Optimising the core code

3 Theoretical performance The theoretical floating-point performance of a processor is the clock speed of the processor multiplied by the maximum number of floating-point operations per cycle. As an example, consider a single core of a modern 3 GHz Intel processor: the clock frequency is 3×10⁹ Hz, with a maximum of 4 floating-point operations per clock, giving a theoretical performance of 1.2×10¹⁰ FLOPS (floating point operations per second). This will never be reached in practice, and the efficiency of a code is measured as the fraction of this theoretical rate that is obtained. Even this fraction will probably vary wildly on different architectures.

4 Latency and Bandwidth The bottlenecks which cause this slowdown below the theoretical maximum can be broken down into two classes of problem in various parts of the system. Latency problems - the processor is idle because the data that it requires isn't available yet. Bandwidth problems - the processor is idle because it can operate on the data faster than the data can be provided to it. Almost all parts of a computer system except the actual processor cores (ALUs and FPUs) are concerned with moving data around, and so can be described in terms of latency and bandwidth. Latency can be hidden, masked and optimised around, either in hardware or to an extent in software. Bandwidth is to a great extent a feature of the hardware used and must be accepted as a fundamental limit of the system.

5 Latency and Bandwidth Latency is the time after a request has been made before the data begins to become available. Bandwidth is the rate at which the data becomes available after the latency time has expired.

6 Optimisation strategy Before you can optimise code, you need to know which parts are the cause of the problem. Heavily optimising parts of the code which take little time to execute is a poor use of time. Use profilers, parallel profilers or equivalents, and hardware performance counters to identify the locations of the bottlenecks.

7 Parallel profilers for MPI Most MPI implementations can be compiled with a standard, free MPI profiler called MPE / Jumpshot. There are also commercial alternatives. We will consider MPE as an example, since it is fairly typical of MPI profilers.

8 MPE MPE provides wrapper scripts for compilers just as MPI does, so codes to be tested just have to be compiled using mpecc and mpefc. You can either simply compile the code using these compilers, or you can custom instrument the code with additional commands to produce more information in the log files. Instrumenting the code with additional data allows you to identify locations within your code explicitly, making it easier to identify where problems occur. Although it depends on exactly how you use the tool, the output will normally be a log file which is then viewed with the supplied (also free) Jumpshot tool.

9 MPE The results are presented as a Gantt chart, with time on the x axis and processor number on the y axis. Colours represent the MPI commands being executed; white lines represent the path between the start of a send operation and the completion of the matching receive.

10 MPE Long blocks for MPI_Send and MPI_Recv may well mean that you have a problem in that part of the code. The problem can sometimes be relieved by the use of non-blocking sends and receives, but sometimes it genuinely is impossible to proceed any further with the compute work until the communication is finished. If that is the case, you may have a load balancing problem where some processors have to work harder than others.

11 Profiling without a profiler Much of what MPE does is simply to call MPI_Wtime before and after MPI commands and record the results in an output file. The same technique can be copied manually, by calling MPI_Wtime yourself and printing the results to output files on a per-processor basis. Since this is a LOT of work, it requires some level of intuition about where to instrument, and it also takes more work to read back the results. Not to be recommended, but it is useful if you want to profile on a machine which doesn't have MPE or a similar program.
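
As a rough sketch of this manual approach (the per-rank log file name and the exchange being timed are made up for illustration):

```c
#include <stdio.h>
#include <mpi.h>

/* Sketch of hand-rolled profiling: time one exchange with MPI_Wtime and
   append the result to a per-processor log file (file name is made up). */
int main(int argc, char **argv)
{
    int rank, nprocs;
    double buf[1024] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double t0 = MPI_Wtime();
    if (rank == 0) {
        for (int src = 1; src < nprocs; src++)
            MPI_Recv(buf, 1024, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else {
        MPI_Send(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    double elapsed = MPI_Wtime() - t0;

    /* One output file per rank makes it easy to compare processors */
    char fname[64];
    snprintf(fname, sizeof(fname), "timing.%03d.log", rank);
    FILE *fp = fopen(fname, "a");
    if (fp) {
        fprintf(fp, "gather on rank %d took %g s\n", rank, elapsed);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}
```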

12 Code profilers If your code isn't limited by communication then you will have to optimise the core algorithm. You want to identify which parts of the core algorithm are causing the slowdown, i.e. how long the code spends in each subroutine. Once again there is a free example, gprof, which works with the gcc compilers. Many compiler vendors also have a profiler which works with their compiler. There are other types of profiler, such as Valgrind, which offer many more options.

13 gprof To use gprof the code must be compiled using a gcc compiler with the -pg compile-time option. The code is then run as normal. Note that the code must exit normally or the profiling output will not be written. The output is in a file called gmon.out unless otherwise specified. The gprof report contains two subsections, the flat profile and the call graph.

14 gprof [Flat profile excerpt: columns are % time, cumulative seconds, self seconds, calls, self ms/call, total ms/call and name, with entries for MAIN, shared_data_boundary_conditions and shared_data_set_dt.] The flat profile shows you how long was spent in different subroutines, and how many times each subroutine was called. In this trivial Fortran 90 code, almost all of the time was spent in the main program (referred to as MAIN after compiling). Note that the two other subroutines are called many times and yet still take almost no time. Due to the way gfortran compiles Fortran 90 codes, the module name is prepended to subroutine names.

15 gprof [Call graph excerpt: shows MAIN called from main, and shared_data_boundary_conditions and shared_data_set_dt called from MAIN.] The call graph shows you the calling stacks for all the calls in the program. From this we can see that boundary_conditions was called from MAIN, as was set_dt. Normally you can see all the information that you need to identify the bottleneck from the flat profile: just put the most work into the subroutines that take the longest fraction of the runtime.

16 Optimising in general Remember, getting the right answer is the key point. Always remember that in parallel you may get better results by using a different algorithm which scales better than by trying to optimise your first choice. Man hours are expensive, compute hours are relatively cheap; make sure that the optimisation is worth the effort. If you're in a hole, stop digging. Some algorithms will never be fast and highly scalable. If you can't change the algorithm and can't get it to scale then just learn to live with it. The harder you optimise your code, the quicker the optimisations are outdated by changes in compilers and hardware. A code optimised for a Cray 1 is not even close to optimal for a modern computer.

17 Optimising communications Problems with communications tend to manifest themselves as poor scaling performance. The ultimate limit to scaling performance is described by Amdahl's law: S = 1 / ((1 - P) + P/N), where S is the maximum speedup possible on N processors if a fraction P of the work is parallel (and 1 - P is not). If even 10% of the work done by a code is not parallelisable then the maximum speedup, even on an infinite number of processors, is only 10 times!
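
As a quick illustration of the formula, the following standalone snippet (not from the lecture) evaluates the speedup for a 90% parallel code on increasing processor counts:

```c
#include <stdio.h>

/* Amdahl's law: maximum speedup S on N processors when a fraction P
   of the work is parallel. */
static double amdahl(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / (double)N);
}

int main(void)
{
    double P = 0.9;                 /* 90% parallel, 10% serial */
    int sizes[] = {2, 8, 64, 1024};

    for (int i = 0; i < 4; i++)
        printf("N = %4d  ->  speedup %.2f\n", sizes[i], amdahl(P, sizes[i]));

    /* As N grows without bound the speedup approaches 1/(1-P) = 10 */
    return 0;
}
```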

18 Optimising communications Non-parallel work includes any time when the code is doing exactly the same thing on multiple processors, even if that means all processors doing nothing (waiting). This means that any time spent waiting for communication is non-parallel work and limits the scaling according to Amdahl's law. Therefore, you want to spend as small a fraction of the runtime waiting for communication as possible. Note that for many types of code this isn't actually a real concern, because as the problem size increases communication automatically becomes an ever smaller fraction of the total runtime. In these cases scaling performance is recovered by looking at larger problems, and the limit becomes one of maximum speedup for a given problem size.

19 Optimising communications For large domain-decomposed codes, non-parallel work fractions can be much smaller than 1%, allowing scaling to thousands of processors. It is generally easier to optimise for SMP machines, so this section will describe how to optimise MPI codes on clusters. Doing the same thing on an SMP machine will improve performance there too, but usually by a smaller amount.

20 Optimising for communication latency On modern cluster hardware, latency is usually only the limiting factor when sending many small messages (100s of bytes or smaller). If that is the case, try to coalesce the many small messages into a single larger one and then send that. If that is not possible, then the only option is to try to perform computation while the communication is underway by using non-blocking MPI commands. Normally you start all communications at the start of the timestep, and then put in MPI_Wait commands at the point where a particular piece of information is needed.
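
A minimal sketch of this pattern for a halo exchange is shown below; the neighbour ranks, buffers and compute routines (do_interior_work, do_boundary_work) are placeholders, not part of any real code:

```c
#include <mpi.h>

/* Placeholders for the parts of the timestep that do not (and do) need
   the incoming halo data; both are assumed to exist elsewhere. */
void do_interior_work(void);
void do_boundary_work(double *recv_l, double *recv_r);

/* Sketch of latency hiding: start the halo exchange first, compute on the
   interior while it is in flight, and only wait when the data is needed. */
void timestep(double *send_l, double *send_r,
              double *recv_l, double *recv_r, int n,
              int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Start all communications at the beginning of the timestep */
    MPI_Irecv(recv_l, n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recv_r, n, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(send_l, n, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(send_r, n, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* Work that does not depend on the halos overlaps the communication */
    do_interior_work();

    /* Wait only at the point where the received data is actually required */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    do_boundary_work(recv_l, recv_r);
}
```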

21 Optimising for communication latency BE CAREFUL! There is both a latency associated with non-blocking sends and receives and a compute overhead associated with the monitoring of the in-flight communications. This can mean that attempting to mask latency using this approach can actually make a code slower. MPI_Isend and MPI_Irecv are particularly bad in this sense; it is much better to set up persistent communication handles using MPI_Send_init and MPI_Recv_init if possible. If you have really large numbers of in-flight messages, then even this may be a poor choice due to the overhead of managing many open communications.

22 MPI_Send_init int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) CALL MPI_SEND_INIT(buf, count, datatype, dest, tag, comm, request, ierr) Description: creates a persistent communication handle, request, for a send operation. The routine only needs to be called once at the start of the program, and the handle is then used every time the communication goes ahead. This saves a lot of the overhead associated with MPI_Isend. If you wish to send part of an array, this is a good reason to use MPI custom types.

23 MPI_Recv_init int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) CALL MPI_RECV_INIT(buf, count, datatype, source, tag, comm, request, ierr) Description: creates a persistent communication handle, request, for a receive operation. The routine only needs to be called once at the start of the program, and the handle is then used every time the communication goes ahead. This saves a lot of the overhead associated with MPI_Irecv. If you wish to receive into part of an array, this is a good reason to use MPI custom types.

24 MPI_Start int MPI_Start(MPI_Request *request) CALL MPI_START(request, ierr) Description: starts an instance of the persistent communication referenced by request. In most MPI implementations this is a low-latency command. The communication is started in a non-blocking manner; the code must explicitly wait for the command to complete using MPI_Wait.
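
Putting the three routines together, a persistent halo exchange might look roughly like the following sketch (halo size, neighbour ranks and buffers are illustrative):

```c
#include <mpi.h>

#define NHALO 128   /* halo size: illustrative only */

/* Sketch: set up persistent requests once, then restart them every step. */
void persistent_exchange(int left, int right, int nsteps, MPI_Comm comm)
{
    double send_l[NHALO], recv_l[NHALO];
    MPI_Request req[2];

    /* Create the handles once, at start-up */
    MPI_Send_init(send_l, NHALO, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Recv_init(recv_l, NHALO, MPI_DOUBLE, right, 0, comm, &req[1]);

    for (int step = 0; step < nsteps; step++) {
        /* ... fill send_l with this step's halo data here ... */
        MPI_Startall(2, req);                       /* start both transfers   */
        /* ... overlap computation here if possible ... */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* complete the transfers */
    }

    /* Free the persistent requests when they are no longer needed */
    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
}
```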

25 Optimising for communication bandwidth It is unusual for communication bandwidth to be the limiting factor in point-to-point communications in parallel codes, because usually a code which sends large quantities of data then performs many operations on the data that it has received. It is more common for collective communications to become bandwidth limited (covered in the next section). If a code is bandwidth limited, there are really only two things that can be done: change the algorithm to require less data to be communicated, or use the non-blocking commands to send and receive data while still computing. Since there are normally only a small number of send and receive pairs, you can use MPI_Isend and MPI_Irecv without too much of a penalty.

26 Optimising collective communication Optimising collective communication is much harder, since the user has less control over the operation of the command. Try to perform as much reduction as possible on local nodes before calling MPI_Reduce or MPI_Allreduce. If you have really specific requirements you can try to write your own algorithm for the collective operation using point-to-point commands, but this is only worth doing in very specific circumstances. Try to avoid collective communications if at all possible.
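
A minimal sketch of reducing locally before the collective call (the distributed-sum example is invented for illustration):

```c
#include <mpi.h>

/* Sketch: sum a distributed array. Each rank reduces its own data to a
   single partial sum first, so only one double goes into MPI_Reduce. */
double global_sum(const double *local, int n, MPI_Comm comm)
{
    double partial = 0.0, total = 0.0;

    for (int i = 0; i < n; i++)     /* local reduction: no communication */
        partial += local[i];

    /* One scalar per rank is all the collective has to move */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    return total;   /* only meaningful on rank 0 */
}
```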

27 Load balancing If you have a code where the workload on different processors isn't guaranteed to be equal, you could have problems with load balancing. Load imbalance causes problems because the scaling of the whole system is no better than the scaling of its worst element. So if one processor has a workload which doesn't scale at all with the number of processors, then the parallel code will be no faster than the serial code. More usually, load imbalance appears as the code scaling suboptimally because some processors are less used than others. Addressing this problem is specific to the exact algorithm being implemented, but in many off-grid problems it can be the dominant limiting factor to performance, and you have to design some kind of dynamic load balancing algorithm.

28 Optimising IO Any code is only as useful as the data that it outputs; however, you want to minimise the time taken to write data to disk. Both latency and bandwidth to disk storage are very poor compared with either compute or communication. The best thing you can do with IO is to try to minimise it.

29 Optimising IO Using MPI-IO is about the best thing that you can do to improve performance. MPI-IO can be further improved by passing hints to the MPI layer during calls to open and write statements using the MPI_Info object. The exact form of the hints is not portable, but they are fairly easy to work with using MPI_Info_create and MPI_Info_set, and should be documented on a machine-by-machine basis. Turning off MPI-IO atomic mode using MPI_File_set_atomicity increases speed massively if you do not write to the same area of the file from several nodes.
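
A rough sketch of such a call sequence is shown below; the hint keys used (striping_factor, collective_buffering) are reserved hints from the MPI standard chosen purely as examples and, as noted above, their effect is not portable:

```c
#include <mpi.h>

/* Sketch: open a file with MPI-IO, passing hints and disabling atomic mode.
   The hint keys and values are illustrative; check the documentation for
   the machine you are running on. */
void open_output(const char *fname, MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");        /* example hint only */
    MPI_Info_set(info, "collective_buffering", "true"); /* example hint only */

    MPI_File_open(comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);

    /* Safe to turn off atomicity if ranks never write to overlapping regions */
    MPI_File_set_atomicity(*fh, 0);

    MPI_Info_free(&info);
}
```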

30 Optimising IO Another possibility is to look at doing data reduction during the compute time. Depending on what you want to do with the data, it may be possible to reduce the amount of data to be dumped from a large 3D grid of several variables down to a single line, or possibly even a single number Moving to more modern machines with parallel storage systems will massively improve performance

31 Optimising core code performance In many classes of parallel code, the limiting factor in execution speed is the speed of computation on each node. In this case, the code must be optimised in the same manner as a serial code: identify the bottleneck subroutines as already suggested and then optimise them. There are a few standard tricks which should always be considered when optimising for speed, although note that all of them can sometimes impair performance and should be tested. These will be introduced as a checklist of things to do and check, and then explained.

32 Optimising core code performance READ YOUR COMPILER MANUAL. Many of the classical tricks to optimise code are now done automatically by compilers. Check different compiler optimisations; some will help, some will hinder. Don't expect that always using the most aggressive compiler options will lead to the fastest code. Always try compiler options before hand-optimising code, it's much quicker. Often highly hand-optimised code is much harder to read, so always ask whether readability is more important than speed. Note that in Fortran, because of the language structures, compilers can safely be more aggressive, and so improvements from hand tuning are usually smaller for Fortran codes than for C codes.

33 Loop optimisation The most common error when using loops is to loop over the elements of a multidimensional array in the wrong order. In Fortran, loops should be ordered so that the left-most index of an array changes fastest. In C, loops should normally be ordered so that the right-most index of an array changes fastest. For scientific codes this tends to be a very robust optimisation and improves speed in most cases. Compilers should perform this operation (loop interchange) automatically; however, in complex codes they are often unable to confirm that the interchange is safe and so don't perform the optimisation. Depending on hardware architecture and code structure, this optimisation can reduce runtimes by up to 50%!
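
In C this means the inner loop should run over the right-most index, as in this small sketch (array sizes are arbitrary):

```c
#define NX 1024
#define NY 1024

/* In C, arr[i][j] is stored row by row, so j (the right-most index)
   should be the inner, fastest-changing loop variable. */
void scale_good(double arr[NX][NY], double factor)
{
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)      /* contiguous, cache-friendly */
            arr[i][j] *= factor;
}

/* The same work with the loops interchanged walks memory with a large
   stride and is typically much slower. */
void scale_bad(double arr[NX][NY], double factor)
{
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)      /* stride of NY doubles each step */
            arr[i][j] *= factor;
}
```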

34 Other loop optimisations Most other loop optimisations are now efficiently dealt with by compilers, but some that you may want to look at are: Loop fusion - there is an overhead associated with starting loops, so two operations in one loop are usually faster than one operation in each of two loops. Loop peeling - if the first element of a loop must be dealt with in a special way, break that code out of the loop rather than using an IF statement in the loop. Both are sketched below. The key point in all loop optimisation is locality of reference, which will be explained in the next section.
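
Hypothetical sketches of the two transformations (the array operations are invented for illustration):

```c
#define N 100000

/* Loop fusion: one pass over i instead of two separate loops, so the
   loop start-up overhead is paid once. */
void fused(double *a, const double *b, double *c, const double *d)
{
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * 2.0;     /* was loop 1 */
        c[i] = d[i] + a[i];    /* was loop 2 */
    }
}

/* Loop peeling: handle the special first element outside the loop,
   so the loop body itself needs no IF statement. */
void peeled(double *a, const double *b)
{
    a[0] = b[0];                            /* special case, peeled off */
    for (int i = 1; i < N; i++)
        a[i] = 0.5 * (b[i] + b[i - 1]);
}
```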

35 Branch optimisations IF statements within loops are always bad. If you can remove the IF statements completely, either by handling special cases differently by hand or by moving the IF statements outside the loop, do so. Otherwise, you may find that you are better off doing additional compute work rather than using an IF: try computing both branches and then combining the two results with a 0/1 Boolean flag rather than branching, as in the sketch below.
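
An illustrative sketch (not taken from any real code):

```c
#define N 100000

/* Branch inside the loop: the processor must predict it every iteration. */
void with_branch(double *a, const double *b, const double *c)
{
    for (int i = 0; i < N; i++) {
        if (b[i] > 0.0)
            a[i] = b[i] * c[i];
        else
            a[i] = c[i];
    }
}

/* Branch-free version: compute both candidates and select with a 0/1 flag.
   More arithmetic per iteration, but no hard-to-predict branch. */
void branch_free(double *a, const double *b, const double *c)
{
    for (int i = 0; i < N; i++) {
        double flag = (b[i] > 0.0);          /* 1.0 or 0.0 */
        a[i] = flag * (b[i] * c[i]) + (1.0 - flag) * c[i];
    }
}
```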

36 Optimising in theory That fairly short list contains most of the optimisations which are worth trying to perform by hand with modern compilers. There are other things which can be done, but they normally lead to small performance improvements. However, for less common types of code, or for really heavy optimisation, you have to understand why the optimisation works. Once again this comes back to latency and bandwidth concerns, but now at the level of a single computer system. Normally, codes which can be optimised further are limited by memory latency; bandwidth limitation is also possible, but much harder to work around.

37 Cache Modern computers use cache memory to improve memory latency. Cache provides faster access to recently used memory by copying it into a smaller, faster, more expensive area of RAM which is usually situated on the CPU die. In this form, cache only speeds access to data that has been used recently. Scientific codes generally have working sets which are many megabytes in size, so simple caching provides only limited improvements. Further performance improvement is given by the use of cache prefetchers.

38 Cache prefetchers Historically, cache prefetching was a programming technique where data wanted on the next iteration was requested on the current iteration, bringing the data into cache in advance. In modern systems this duty is normally performed either by the compiler or by enabling hardware prefetch units on the CPU. In either case, the prefetcher normally just assumes that the data wanted next is local in memory to the data just accessed. This means that improving locality of reference in the code structure improves performance.

39 Locality of reference The easiest way to prefetch is to assume that the program is simply going to ask for the next piece of data in memory directly after the piece it just asked for. This means that the best way to optimise the performance of a code is to ensure that it accesses memory in as close to a linear fashion as possible. This explains why the order in which multidimensional arrays are accessed in loops is important. When the prefetcher is working well, the effect of the cache is to significantly reduce the latency of the main memory. Unless the entire dataset fits in the cache, or the code performs large amounts of calculation on a small block of data before moving on to the next block, the higher bandwidth of the cache memory has little effect on the effective bandwidth of the entire memory subsystem.

40 Bandwidth limitation The classical example of bandwidth limitation is video processing: relatively little work is done to each element of data, which is then never used again. The net effect of this is to almost remove the benefit of the cache memory, since the limiting factor is the rate at which the data can be transferred from the main memory. Although the number is highly workload dependent, a good rule of thumb is that you need 1 bit/s of memory bandwidth for every FLOP, although scientific codes often need more. Prefetchers can make the situation worse if they are not performing perfectly, because incorrectly prefetched data is simply wasted memory bandwidth.

41 Bandwidth limitation If your code is bandwidth limited then you have few options: buy a better computer with more memory bandwidth, or make your code do more work with the existing data before moving on to the next part of the data.

42 SIMD optimisation Most modern processors can operate in a Single Instruction Multiple Data (SIMD) mode, in which one instruction operates on several data elements at once; this is often called vector operation. SIMD allows processors to achieve a higher average IPC. Also, the older scalar FPUs are often legacy features of the architecture and are much slower than their SIMD counterparts (normal x86 processors are a very good example of this). Compilers will use the SIMD paths automatically unless they are prevented from doing so, which is usually because the compiler is unable to determine whether there is a dependency in the program flow.

43 SIMD optimisation The most common cause of a compiler being unable to vectorise is that data in a given iteration really does depend on data from a previous iteration. If that is the case, remember that compilers only try to vectorise the inner loop of a set of nested loops, so reordering your loops may allow vectorisation (see the sketch below). The second most common cause is a branch in the code which the compiler cannot prove is free of dependencies. Some compilers allow you to put in hints which force them to ignore the safety checks; a better approach is to try to remove the IF statements, as already mentioned.
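
An illustrative sketch of such a reordering (the recurrence is invented for the example):

```c
#define N 1024

/* Written with i as the inner loop, the inner loop carries a dependency
   (each a[i][j] needs a[i-1][j]) and cannot be vectorised. */
void hard_to_vectorise(double a[N][N], const double b[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 1; i < N; i++)
            a[i][j] = a[i - 1][j] + b[i][j];
}

/* Interchanging the loops makes j the inner loop: its iterations are
   independent (and contiguous in memory), so the compiler can use SIMD. */
void easier_to_vectorise(double a[N][N], const double b[N][N])
{
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = a[i - 1][j] + b[i][j];
}
```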
