Parallel Code Optimisation
1 April 8, 2008
2 Terms and terminology. Identifying bottlenecks. Optimising communications. Optimising IO. Optimising the core code.
3 Theoretical performance. The theoretical floating point performance of a processor is the clock speed of the processor multiplied by the maximum number of floating point operations per cycle. As an example, consider a single core of a modern 3 GHz Intel processor: the clock frequency is 3x10^9 Hz, with a maximum of 4 floating point operations per clock, giving a theoretical performance of 1.2x10^10 FLOPS (floating point operations per second). This will never be reached in practice, and the efficiency of a code is measured as the fraction of this theoretical rate that is obtained. Even this fraction will probably vary wildly on different architectures.
4 Latency and Bandwidth. The bottlenecks which cause this slowdown below the theoretical maximum can be broken down into two classes of problem in various parts of the system. Latency problems - the processor is idle because the data that it requires isn't available yet. Bandwidth problems - the processor is idle because it can operate on the data faster than the data can be provided to it. Almost all parts of a computer system except the actual processor cores (ALUs and FPUs) are concerned with moving data around, and so can be described in terms of latency and bandwidth. Latency can be hidden, masked and optimised around, either in hardware or, to an extent, in software. Bandwidth is to a great extent a feature of the hardware used and must be accepted as a fundamental limit of the system.
5 Latency and Bandwidth. Latency is the time after a request has been made before the data begins to become available. Bandwidth is the rate at which the data becomes available after the latency time has expired.
6 Optimisation strategy. Before you can optimise code, you need to know which parts are the cause of the problem. Heavily optimising parts of the code which take little time to execute is a poor use of time. Use profilers, parallel profilers or equivalents, and hardware performance counters to identify the locations of the bottlenecks.
7 Parallel profilers for MPI. Most MPI implementations can be compiled with a standard, free MPI profiler called MPE / Jumpshot. There are also commercial alternatives. We will consider MPE as an example, since it is fairly typical of MPI profilers.
8 MPE. MPE provides wrapper scripts for compilers just as MPI does, so codes to be tested just have to be compiled using mpecc and mpefc. You can either just compile the code using these compilers, or you can custom instrument the code with additional commands to produce more information in the log files. Instrumenting the code with additional data allows you to identify locations within your code explicitly, making it easier to identify where problems occur. Although it depends on exactly how you use the code, the output will normally be a log file which is then viewed with the supplied (also free) Jumpshot tool.
9 MPE. The results are presented as a Gantt chart, with time on the x axis and processor number on the y axis. Colours represent the MPI commands being executed; white lines represent the path between the start of a send operation and the completion of the matching receive.
10 MPE. Long blocks for MPI_Send and MPI_Recv may well mean that you have a problem in that part of the code. The problem can sometimes be relieved by the use of non-blocking sends and receives, but sometimes it genuinely is impossible to proceed any further with the compute work until the communication is finished. If that is the case, you may have a load balancing problem where some processors have to work harder than others.
11 Profiling without a profiler. Much of what MPE does is just to put calls to MPI_Wtime in before and after MPI commands and record the results to an output file. The same technique can be copied manually, by calling MPI_Wtime yourself and printing the results to output files on a per-processor basis. Since this is a LOT of work, it requires some level of intuition about where to instrument. It also takes more work to read back the results. Not to be recommended, but it is useful if you want to profile on a machine which doesn't have MPE or a similar program.
12 Code profilers. If your code isn't limited by communication then you will have to optimise the core algorithm. You want to identify which parts of the core algorithm are causing the slowdown - in particular, how long the code spends in each subroutine. Once again, there is a free example, gprof, which works with the gcc compilers. Many compiler vendors also have a profiler which works with their compiler. There are other types of profiler, such as Valgrind, which offer many more options.
13 gprof. To use gprof the code must be compiled using a gcc compiler with the -pg compile-time option. The code is then run as normal. Note that the code must exit normally or the profiling output will not be written. The output is in a file called gmon.out unless otherwise specified. The report contains two subsections: the flat profile and the call graph.
14 gprof. The flat profile shows you how long the code spent in different subroutines, and how many times each subroutine was called. [Flat profile output: columns are % time, cumulative seconds, self seconds, calls, self ms/call, total ms/call and name; rows for MAIN, shared_data_boundary_conditions and shared_data_set_dt, with almost all time in MAIN.] In this trivial Fortran 90 code, almost all of the time was spent in the main program (referred to as MAIN after compiling). Note that the two other subroutines are called many times and yet still take almost no time. Due to the way gfortran compiles Fortran 90 codes, the module name is prepended to subroutine names.
15 gprof. The call graph shows you the calling stacks for all the calls in the program. [Call graph output: MAIN, called once from main, makes 4417 calls to shared_data_boundary_conditions and 2209 calls to shared_data_set_dt.] From this, we can see that boundary_conditions was called from MAIN, as was set_dt. Normally you can see all the information that you need to identify the bottleneck from the flat profile. Just put the most work into the subroutines that take the longest fraction of the runtime. (Slide footer: Chris Brady, Parallel Code Optimisation.)
16 Optimising in general. Remember, getting the right answer is the key point. Always remember that in parallel you may get better results by using a different algorithm which scales better than by trying to optimise your first choice. Man hours are expensive, compute hours are relatively cheap, so make sure that the optimisation is worth the effort. If you're in a hole, stop digging: some algorithms will never be fast and highly scalable. If you can't change the algorithm and can't get it to scale, then just learn to live with it. The harder you optimise your code, the quicker the optimisations are outdated by changes in compilers and hardware. A code optimised for a Cray 1 is not even close to optimal for a modern computer.
17 Optimising communications. Problems with communications tend to manifest themselves as poor scaling performance. The ultimate limit to scaling performance is described by Amdahl's law: S = 1 / ((1 - P) + P/N), where S is the maximum speedup possible on N processors if a fraction P of the work is parallel. If 10% of the work done by a code is not parallelisable, then the maximum speedup even on an infinite number of processors is 10 times!
18 Optimising communications. Non-parallel work includes any time when the code is doing exactly the same thing on multiple processors, even if that means all processors doing nothing (waiting). This means that any time taken waiting for communication is non-parallel work and limits the scaling according to Amdahl's law. Therefore, you want to spend as small a fraction of the runtime waiting for communication as possible. Note that for many types of code this actually isn't a real concern, because as the problem size increases communication becomes an ever smaller fraction of the total runtime automatically. In these cases, scaling performance is recovered by looking at larger problems, and the limit becomes one of maximum speedup for a given problem size.
19 Optimising communications. For large domain decomposed codes, non-parallel work fractions can be much smaller than 1%, allowing scaling to thousands of processors. It is generally easier to optimise for SMP machines, so this section will describe how to optimise MPI codes on clusters. Doing the same thing on an SMP machine will improve performance there, but usually by a smaller amount.
20 Optimising for communication latency. On modern cluster hardware, latency is usually only the limiting factor when sending many small messages (100s of bytes or less). If that is the case, try to coalesce the many small messages into a single larger one and then send that. If that is not possible, then the only option is to try to perform computation while the communication is underway, by using non-blocking MPI commands. Normally you start all communications at the start of the timestep, and then put in MPI_Wait commands at the point when a particular piece of information is needed.
21 Optimising for communication latency. BE CAREFUL! There is both a latency associated with non-blocking sends and receives and also a compute overhead associated with the monitoring threads for the in-flight communications. This can mean that attempting to mask latency using this approach can actually make a code slower. MPI_Isend and MPI_Irecv are particularly bad in this sense; it is much better to set up persistent communication handles using MPI_Send_init and MPI_Recv_init if possible. If you have really large numbers of in-flight messages, then even this may be a poor choice due to the overhead of managing many open communications.
22 MPI_Send_init. int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) / CALL MPI_SEND_INIT(buf, count, datatype, dest, tag, comm, request, ierr). Description: creates a persistent communication handle, request, for a send operation. The routine only needs to be called once at the start of the program, and the handle is then used every time the communication goes ahead. This saves a lot of the overhead associated with MPI_Isend. If you wish to send part of an array, this is a good reason to use MPI custom types.
23 MPI_Recv_init. int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) / CALL MPI_RECV_INIT(buf, count, datatype, source, tag, comm, request, ierr). Description: creates a persistent communication handle, request, for a receive operation. The routine only needs to be called once at the start of the program, and the handle is then used every time the communication goes ahead. This saves a lot of the overhead associated with MPI_Irecv. If you wish to receive part of an array, this is a good reason to use MPI custom types.
24 MPI_Start. int MPI_Start(MPI_Request *request) / CALL MPI_START(request, ierr). Description: starts an instance of the persistent communication referenced by request. In most MPI implementations this is a low latency command. The communication is started in a non-blocking manner; the code must explicitly wait for the command to complete using MPI_Wait.
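A sketch of how MPI_Send_init, MPI_Recv_init and MPI_Start fit together in a timestep loop. The ring-neighbour pattern, buffer size and step count are illustrative only, and the code needs an MPI library and launcher (e.g. mpicc/mpirun) to build and run:

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: persistent send/receive handles are set up once, then
   re-used every timestep with MPI_Startall + MPI_Waitall (the batched
   forms of MPI_Start and MPI_Wait). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* illustrative ring neighbours */
    int left  = (rank - 1 + size) % size;

    double send_buf[128], recv_buf[128];
    MPI_Request reqs[2];

    /* One-off setup: no communication happens yet. */
    MPI_Send_init(send_buf, 128, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recv_buf, 128, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int step = 0; step < 100; step++) {
        /* ... fill send_buf with this step's boundary data ... */
        MPI_Startall(2, reqs);   /* begin both transfers, non-blocking */
        /* ... compute work that does not touch the buffers ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* wait where data is needed */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}
```

After MPI_Waitall the requests become inactive but remain valid, so the next MPI_Startall re-uses them without paying the setup cost again.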
25 Optimising for communication bandwidth. It is unusual for communication bandwidth to be the limiting factor in point-to-point communications in parallel codes, because a code which sends large quantities of data usually then performs many operations on the data that it has received. It is more common for collective communications to become bandwidth limited (covered in the next section). If a code is bandwidth limited, there are only really two things that can be done: change the algorithm to require less data to be communicated, or use the non-blocking commands to send and receive data while still computing. Since there are normally only a small number of send and receive pairs, you can use MPI_Isend and MPI_Irecv without too much of a penalty.
26 Optimising collective communication. Optimising collective communication is much harder, since the user has less control over the operation of the command. Try to perform as much reduction as possible on local nodes before calling MPI_Reduce or MPI_Allreduce. If you have very specific requirements, you can try to write your own algorithm for the collective operation using point-to-point commands, but this only pays off in such specialised cases. Try to avoid collective communications if at all possible.
27 Load balancing. If you have a code where the workload on different processors isn't guaranteed to be equal, you could have problems with load balancing. Load imbalance causes problems because the scaling of the whole system is no better than the scaling of the worst element of the system. So, if one processor has a workload which doesn't scale at all with the number of processors, then the parallel code will be no faster than the serial code. More usually, load imbalance appears as the code scaling suboptimally because some processors are less used than others. Addressing this problem is specific to the exact algorithm being implemented, but in many off-grid problems it can be the dominant limiting factor to performance, and you have to design some kind of dynamic load balancing algorithm.
28 Optimising IO. Any code is only as useful as the data that it outputs; however, you want to minimise the time taken to write data to disk. Both latency and bandwidth to disk storage are very poor compared with either compute or communication. The best thing you can do with IO is to try to minimise it.
29 Optimising IO. Using MPI-IO is about the best thing that you can do to improve performance. MPI-IO can be further improved by passing hints to the MPI layer during calls to open and write statements using the MPI_Info object. The exact form of the hints is not portable, but they are fairly easy to work with using MPI_Info_create and MPI_Info_set, and should be documented on a machine-by-machine basis. Turning off MPI-IO atomic mode using MPI_File_set_atomicity increases speed massively if you do not write to the same area of the file from several nodes.
30 Optimising IO. Another possibility is to look at doing data reduction during the compute time. Depending on what you want to do with the data, it may be possible to reduce the amount of data to be dumped from a large 3D grid of several variables down to a single line, or possibly even a single number. Moving to more modern machines with parallel storage systems will massively improve performance.
31 Optimising core code performance. In many classes of parallel code, the limiting factor in execution speed is the speed of computation on each node. In this case, the code must be optimised in the same manner as a serial code. Identify the bottleneck subroutines as already suggested and then optimise them. There are a few standard tricks which should always be used to optimise speed, although note that all of them can sometimes impair performance and should be tested. These will be introduced as a checklist of things to do and check, and then explained.
32 Optimising core code performance. READ YOUR COMPILER MANUAL. Many of the classical tricks to optimise code are now done automatically by compilers. Check different compiler optimisations; some will help, some will hinder. Don't expect that always using the most aggressive compiler options will lead to the fastest code. Always try compiler options before hand-optimising code, it's much quicker. Highly hand-optimised code is often much harder to read; always ask whether readability is more important than speed. Note that in Fortran, because of the language structures, compilers can safely be more aggressive, and so improvements from hand tuning are usually smaller for Fortran codes than for C codes.
33 Loop optimisation. The most common error when using loops is to loop over the elements of a multidimensional array in the wrong order. In Fortran, loops should be ordered so that the left-most index of an array is changing fastest. In C, loops should normally be ordered so that the right-most index of an array is changing fastest. For scientific codes, this tends to be a very robust optimisation and improves speed in most cases. Compilers should perform this operation (loop interchange) automatically; however, in complex codes they are often unable to confirm that the interchange is safe and so don't perform the optimisation. Depending on hardware architecture and code structure, this optimisation can reduce runtimes by up to 50%!
34 Other loop optimisations. Most other loop optimisations are now efficiently dealt with by compilers, but some that you may want to look at are: Loop fusion - there is an overhead associated with starting loops, so two operations in one loop are usually faster than one operation in each of two loops. Loop peeling - if the first element of a loop must be dealt with in a special way, break that code out of the loop rather than using an IF statement in the loop. The key point in all loop optimisation is locality of reference, which will be explained in the next section.
35 Branch optimisations. IF statements within loops are always bad. If you can remove the IF statements completely, by handling special cases differently by hand or by moving the IF statements outside the loop, do so. Otherwise, you may find that you are better off doing additional compute work rather than using an IF: try computing both branches and then combining the two results with a 0/1 Boolean flag rather than branching.
36 Optimising in theory. That fairly short list contains most of the optimisations which are worth trying to perform with modern compilers. There are other things which can be done, but they normally lead to small performance improvements. However, for less common types of code, or for really heavy optimisation, you have to understand why the optimisation works. Once again it comes back to latency and bandwidth concerns, but now at the level of a single computer system. Normally, codes which can be optimised are limited by memory latency; bandwidth limitation is also possible, but much harder to work around.
37 Cache. Modern computers use cache memory to improve memory latency. Cache provides faster access to recently used memory by copying it into a smaller, faster, more expensive area of RAM which is usually situated on the CPU die. In this form, cache only speeds up access to data that has been used recently. Scientific codes generally have working sets which are many megabytes in size, so simple caching provides only limited improvements. Further performance improvement is given by the use of cache prefetchers.
38 Cache prefetchers. Historically, cache prefetching was a programming technique where data wanted on the next iteration was requested on the current iteration, so bringing the data into cache. In modern systems, this duty is normally performed either by the compiler or by enabling hardware prefetch units on the CPU. In either case, the prefetcher normally just assumes that the data wanted next is local to the memory just accessed. This means that improving locality of reference in the code structure improves performance.
39 Locality of reference. The easiest way to prefetch is to assume that the program is simply going to ask for the next piece of data in memory directly after the piece that it just asked for. This means that the best way to optimise the performance of a code is to ensure that it accesses memory in as close to a linear fashion as possible. This explains why the order in which multidimensional arrays are accessed in loops is important. When the prefetcher is working well, the effect of the cache is to significantly reduce the latency of the main memory. Unless the entire dataset fits in the cache, or the code performs large amounts of calculation on a small block of data before moving onto the next block, the higher bandwidth of the cache memory has little effect on the effective bandwidth of the entire memory subsystem.
40 Bandwidth limitation. The classical example of bandwidth limitation is video processing: relatively little work is done to each element of data, which is then never used again. The net effect of this is to almost remove the benefit of the cache memory, since the limiting factor is the rate at which the data can be transferred from the main memory. Although the number is highly workload dependent, a good rule of thumb is that you need 1 bit/s of memory bandwidth for every FLOP, although scientific codes often need more. Prefetchers can make the situation worse if they are not performing perfectly, because incorrectly prefetched data is simply wasted memory bandwidth.
41 Bandwidth limitation. If your code is bandwidth limited then you have few options: buy a better computer with more memory bandwidth, or make your code do more work with the existing data before moving onto the next part of the data.
42 SIMD optimisation. Most modern processors have vector units, meaning that they can operate in a Single Instruction Multiple Data (SIMD) mode; this is often called vector operation. SIMD allows processors to achieve a higher average IPC. Also, the older scalar FPUs are often legacy features of the architecture and are much slower than their SIMD counterparts (normal x86 processors are a very good example of this). Compilers will use the SIMD paths automatically unless they are stopped from doing so, usually because the compiler is unable to determine whether there is a dependency in the program flow.
43 SIMD optimisation. The most common cause of a compiler failing to vectorise is that data in a given iteration really does depend on data in a previous iteration. If that is the case, remember that compilers only try to vectorise the inner loop of a set of nested loops, so reordering your loops may allow vectorisation. The second most common cause is a branch in the code which the compiler is unable to prove will not create a dependency. Some compilers allow you to put in hints which force them to ignore safety checks; a better approach is to try to remove the IF statements as already mentioned.
More informationProgramming with MPI
Programming with MPI p. 1/?? Programming with MPI Debugging, Performance and Tuning Nick Maclaren Computing Service nmm1@cam.ac.uk, ext. 34761 March 2008 Programming with MPI p. 2/?? Available Implementations
More informationLatches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter
IT 3123 Hardware and Software Concepts Notice: This session is being recorded. CPU and Memory June 11 Copyright 2005 by Bob Brown Latches Can store one bit of data Can be ganged together to store more
More informationAdvanced Message-Passing Interface (MPI)
Outline of the workshop 2 Advanced Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Morning: Advanced MPI Revision More on Collectives More on Point-to-Point
More informationNAMD Serial and Parallel Performance
NAMD Serial and Parallel Performance Jim Phillips Theoretical Biophysics Group Serial performance basics Main factors affecting serial performance: Molecular system size and composition. Cutoff distance
More informationThe Need for Speed: Understanding design factors that make multicore parallel simulations efficient
The Need for Speed: Understanding design factors that make multicore parallel simulations efficient Shobana Sudhakar Design & Verification Technology Mentor Graphics Wilsonville, OR shobana_sudhakar@mentor.com
More informationIntroduction to OpenMP
Introduction to OpenMP p. 1/?? Introduction to OpenMP Simple SPMD etc. Nick Maclaren nmm1@cam.ac.uk September 2017 Introduction to OpenMP p. 2/?? Terminology I am badly abusing the term SPMD tough The
More informationOptimisation p.1/22. Optimisation
Performance Tuning Optimisation p.1/22 Optimisation Optimisation p.2/22 Constant Elimination do i=1,n a(i) = 2*b*c(i) enddo What is wrong with this loop? Compilers can move simple instances of constant
More informationLecture 16. Today: Start looking into memory hierarchy Cache$! Yay!
Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1
More informationOpenMP: Open Multiprocessing
OpenMP: Open Multiprocessing Erik Schnetter June 7, 2012, IHPC 2012, Iowa City Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to parallelise an existing code 4. Advanced
More informationTools and techniques for optimization and debugging. Fabio Affinito October 2015
Tools and techniques for optimization and debugging Fabio Affinito October 2015 Profiling Why? Parallel or serial codes are usually quite complex and it is difficult to understand what is the most time
More informationIdentifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011
Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for
More informationOpenMP: Open Multiprocessing
OpenMP: Open Multiprocessing Erik Schnetter May 20-22, 2013, IHPC 2013, Iowa City 2,500 BC: Military Invents Parallelism Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationHow Not to Measure Performance: Lessons from Parallel Computing or Only Make New Mistakes William Gropp
How Not to Measure Performance: Lessons from Parallel Computing or Only Make New Mistakes William Gropp www.mcs.anl.gov/~gropp Why Measure Performance? Publish papers or sell product Engineer a solution
More informationProblem 1. (15 points):
CMU 15-418/618: Parallel Computer Architecture and Programming Practice Exercise 1 A Task Queue on a Multi-Core, Multi-Threaded CPU Problem 1. (15 points): The figure below shows a simple single-core CPU
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More informationProgramming with MPI
Programming with MPI p. 1/?? Programming with MPI One-sided Communication Nick Maclaren nmm1@cam.ac.uk October 2010 Programming with MPI p. 2/?? What Is It? This corresponds to what is often called RDMA
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationReview of previous examinations TMA4280 Introduction to Supercomputing
Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with
More informationMPI Performance Tools
Physics 244 31 May 2012 Outline 1 Introduction 2 Timing functions: MPI Wtime,etime,gettimeofday 3 Profiling tools time: gprof,tau hardware counters: PAPI,PerfSuite,TAU MPI communication: IPM,TAU 4 MPI
More informationLinux multi-core scalability
Linux multi-core scalability Oct 2009 Andi Kleen Intel Corporation andi@firstfloor.org Overview Scalability theory Linux history Some common scalability trouble-spots Application workarounds Motivation
More informationWhat are Clusters? Why Clusters? - a Short History
What are Clusters? Our definition : A parallel machine built of commodity components and running commodity software Cluster consists of nodes with one or more processors (CPUs), memory that is shared by
More informationIssues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM
Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.
More informationSpectre and Meltdown. Clifford Wolf q/talk
Spectre and Meltdown Clifford Wolf q/talk 2018-01-30 Spectre and Meltdown Spectre (CVE-2017-5753 and CVE-2017-5715) Is an architectural security bug that effects most modern processors with speculative
More informationWhite Paper. How the Meltdown and Spectre bugs work and what you can do to prevent a performance plummet. Contents
White Paper How the Meltdown and Spectre bugs work and what you can do to prevent a performance plummet Programs that do a lot of I/O are likely to be the worst hit by the patches designed to fix the Meltdown
More informationMPI Casestudy: Parallel Image Processing
MPI Casestudy: Parallel Image Processing David Henty 1 Introduction The aim of this exercise is to write a complete MPI parallel program that does a very basic form of image processing. We will start by
More informationIntroduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014
Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational
More informationWhy C? Because we can t in good conscience espouse Fortran.
C Tutorial Why C? Because we can t in good conscience espouse Fortran. C Hello World Code: Output: C For Loop Code: Output: C Functions Code: Output: Unlike Fortran, there is no distinction in C between
More informationGrand Central Dispatch
A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model
More informationInside the PostgreSQL Shared Buffer Cache
Truviso 07/07/2008 About this presentation The master source for these slides is http://www.westnet.com/ gsmith/content/postgresql You can also find a machine-usable version of the source code to the later
More informationAnnouncements. Homework 4 out today Dec 7 th is the last day you can turn in Lab 4 and HW4, so plan ahead.
Announcements Homework 4 out today Dec 7 th is the last day you can turn in Lab 4 and HW4, so plan ahead. Thread level parallelism: Multi-Core Processors Two (or more) complete processors, fabricated on
More informationIMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign
SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory
More informationInput and Output = Communication. What is computation? Hardware Thread (CPU core) Transforming state
What is computation? Input and Output = Communication Input State Output i s F(s,i) (s,o) o s There are many different types of IO (Input/Output) What constitutes IO is context dependent Obvious forms
More informationDomain Decomposition: Computational Fluid Dynamics
Domain Decomposition: Computational Fluid Dynamics July 11, 2016 1 Introduction and Aims This exercise takes an example from one of the most common applications of HPC resources: Fluid Dynamics. We will
More informationCPU Architecture. HPCE / dt10 / 2013 / 10.1
Architecture HPCE / dt10 / 2013 / 10.1 What is computation? Input i o State s F(s,i) (s,o) s Output HPCE / dt10 / 2013 / 10.2 Input and Output = Communication There are many different types of IO (Input/Output)
More informationQuiz for Chapter 1 Computer Abstractions and Technology
Date: Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationExt3/4 file systems. Don Porter CSE 506
Ext3/4 file systems Don Porter CSE 506 Logical Diagram Binary Formats Memory Allocators System Calls Threads User Today s Lecture Kernel RCU File System Networking Sync Memory Management Device Drivers
More informationMultiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking
More informationPerformance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]
Performance CS 3410 Computer System Organization & Programming [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Performance Complex question How fast is the processor? How fast your application runs?
More informationEnterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions
Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions Chapter 1: Solving Integration Problems Using Patterns 2 Introduction The Need for Integration Integration Challenges
More informationOptimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing for DirectX Graphics Richard Huddy European Developer Relations Manager Also on today from ATI... Start & End Time: 12:00pm 1:00pm Title: Precomputed Radiance Transfer and Spherical Harmonic
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationCS161 Design and Architecture of Computer Systems. Cache $$$$$
CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks
More informationECE Spring 2017 Exam 2
ECE 56300 Spring 2017 Exam 2 All questions are worth 5 points. For isoefficiency questions, do not worry about breaking costs down to t c, t w and t s. Question 1. Innovative Big Machines has developed
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationSupercomputing in Plain English Part IV: Henry Neeman, Director
Supercomputing in Plain English Part IV: Henry Neeman, Director OU Supercomputing Center for Education & Research University of Oklahoma Wednesday September 19 2007 Outline! Dependency Analysis! What is
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationNetwork performance. slide 1 gaius. Network performance
slide 1 historically much network performance research was based on the assumption that network traffic was random apoisson distribution of traffic Paxson and Floyd 1994, Willinger 1995 found this assumption
More informationIntroduction to tuning on many core platforms. Gilles Gouaillardet RIST
Introduction to tuning on many core platforms Gilles Gouaillardet RIST gilles@rist.or.jp Agenda Why do we need many core platforms? Single-thread optimization Parallelization Conclusions Why do we need
More informationIntroduction to OpenMP
1.1 Minimal SPMD Introduction to OpenMP Simple SPMD etc. N.M. Maclaren Computing Service nmm1@cam.ac.uk ext. 34761 August 2011 SPMD proper is a superset of SIMD, and we are now going to cover some of the
More informationCPU Pipelining Issues
CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput
More informationShared memory programming model OpenMP TMA4280 Introduction to Supercomputing
Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started
More informationSoftware Analysis. Asymptotic Performance Analysis
Software Analysis Performance Analysis Presenter: Jonathan Aldrich Performance Analysis Software Analysis 1 Asymptotic Performance Analysis How do we compare algorithm performance? Abstract away low-level
More informationLectures Parallelism
Lectures 24-25 Parallelism 1 Pipelining vs. Parallel processing In both cases, multiple things processed by multiple functional units Pipelining: each thing is broken into a sequence of pieces, where each
More informationCS533 Modeling and Performance Evaluation of Network and Computer Systems
CS533 Modeling and Performance Evaluation of Network and Computer Systems Monitors (Chapter 7) 1 Monitors That which is monitored improves. Source unknown A monitor is a tool used to observe system Observe
More informationA Review on Cache Memory with Multiprocessor System
A Review on Cache Memory with Multiprocessor System Chirag R. Patel 1, Rajesh H. Davda 2 1,2 Computer Engineering Department, C. U. Shah College of Engineering & Technology, Wadhwan (Gujarat) Abstract
More information