High Performance Computing - Programming Paradigms and Scalability. Part 1: Introduction


High Performance Computing - Programming Paradigms and Scalability, Part 1: Introduction. PD Dr. rer. nat. habil. Ralf-Peter Mundani, Computation in Engineering (CiE), Scientific Computing (SCCS), Summer Term 2015.

General Remarks. Ralf-Peter Mundani, mundani@tum.de, room 3181, consultation hour by appointment; lecture: Tuesday, 12:00-13:30; exercise (fortnightly): Christoph Riesinger, riesinge@in.tum.de, Wednesday, 10:15-11:45. Examination: written, 90 minutes, all printed/written materials allowed (no electronic devices).

Overview of contents: part 1: introduction; part 2: high-performance networks; part 3: foundations; part 4: shared-memory programming; part 5: distributed-memory programming; part 6: examples of parallel algorithms. Part 1 covers motivation, a hardware excursion, supercomputers, the classification of parallel computers, and quantitative performance evaluation.

"If one ox could not do the job they did not try to grow a bigger ox, but used two oxen." (Grace Murray Hopper)

Motivation. Numerical simulation: from phenomena to predictions. Starting from a physical phenomenon or technical process, the pipeline comprises (1) modelling: determination of parameters, expression of relations; (2) numerical treatment: model discretisation, algorithm development; (3) numerical treatment becomes (3) implementation: software development, parallelisation; (4) visualisation: illustration of abstract simulation results; (5) validation: comparison of results with reality; (6) embedding: insertion into the working process. The disciplines involved span mathematics, computer science, and the application domain.

Why numerical simulation? Because experiments are sometimes impossible (life cycle of galaxies, weather forecast, terror attacks such as the bomb attack on the WTC in 1993), sometimes not welcome (avalanches, nuclear tests, medicine), sometimes very costly and time consuming (protein folding, material sciences; cf. the Mississippi basin model in Jackson, MS), and sometimes simply more expensive than simulation (aerodynamics, crash tests).

Why parallel programming and HPC? Complex problems (especially the so-called grand challenges) demand more computing power: climate or geophysics simulation (tsunami, e.g.), structure or flow simulation (crash test, e.g.), development systems (CAD, e.g.), large data analysis (the Large Hadron Collider at CERN, e.g.), military applications (cryptanalysis, e.g.). Performance increases come from faster hardware and more memory ("work harder"), more efficient algorithms and optimisation ("work smarter"), and parallel computing ("get some help").

Motivation: objectives (in case all resources were available N times).
Throughput: compute N problems simultaneously by running N instances of a sequential program with different data sets ("embarrassing parallelism"); SETI@home, e.g. Drawback: limited resources of the single nodes.
Response time: compute one problem in a fraction (1/N) of the time by running one instance (i.e. N processes) of a parallel program that jointly solves the problem; finding prime numbers, e.g. Drawback: writing a parallel program; communication.
Problem size: compute one problem with N-times larger data by running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for larger problem sizes; iterative solution of systems of linear equations (SLE), e.g. Drawback: writing a parallel program; communication.

Motivation: levels of parallelism. Qualitative meaning: the level(s) on which work is done in parallel; ordered by granularity: sub-instruction level, instruction level, block level, process level, program level.
Program level: parallel processing of different programs; independent units without any shared data; organised by the OS.
Process level: a program is subdivided into processes to be executed in parallel; each process consists of a larger amount of sequential instructions and some private data; communication is necessary in most cases (data exchange, e.g.); such a process is often referred to as a heavy-weight process.
Block level: blocks of instructions are executed in parallel; each block consists of few instructions and shares data with others; communication via shared variables, plus synchronisation mechanisms; such a block is often referred to as a light-weight process (thread).
Instruction level: parallel execution of machine instructions; optimising compilers can increase this potential by reordering commands.
Sub-instruction level: instructions are further subdivided into units to be executed in parallel or via overlapping (vector operations, e.g.).

Overview: motivation, hardware excursion, supercomputers, classification of parallel computers, quantitative performance evaluation.

Definition of parallel computers: "A collection of processing elements that communicate and cooperate to solve large problems" (Almasi and Gottlieb, 1989). Possible appearances of such processing elements: specialised units (steps of a vector pipeline, e.g.); parallel features in modern monoprocessors (instruction pipelining, superscalar architectures, VLIW, multithreading, multicore, ...); several uniform arithmetical units (processing elements of array computers or GPGPUs, e.g.); complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers); parallel computers or clusters connected via WAN (so-called metacomputers).

Instruction pipelining: instruction execution involves several operations, (1) instruction fetch (IF), (2) decode (DE), (3) fetch operands (OF), (4) execute (EX), (5) write back (WB), which are executed successively; hence, only one part of the CPU works at a given moment. Observation: while one particular stage of an instruction is being processed, the other stages are idle. Hence, multiple instructions can be overlapped in execution, i.e. instruction pipelining (similar to assembly lines): while instruction N is in the DE stage, instruction N+1 already enters the IF stage, and so on. Advantage: no additional hardware necessary.
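To make the gain from overlapping concrete, the following short Python sketch (an illustration added here, not part of the original slides) counts cycles for N instructions on a k-stage pipeline: without pipelining every instruction occupies all k stages one after another, with pipelining a new instruction enters the pipe every cycle once it is filled.

```python
def cycles_sequential(n_instructions: int, n_stages: int) -> int:
    # Each instruction passes through all stages before the next one starts.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    # Fill the pipeline once (n_stages cycles), then retire one instruction per cycle.
    return n_stages + (n_instructions - 1)

if __name__ == "__main__":
    n, k = 1000, 5  # e.g. the 5 stages IF, DE, OF, EX, WB
    print(cycles_sequential(n, k))                            # 5000 cycles
    print(cycles_pipelined(n, k))                             # 1004 cycles
    print(cycles_sequential(n, k) / cycles_pipelined(n, k))   # speed-up close to k
```

For long instruction streams the speed-up approaches the number of pipeline stages, which is exactly the point of the assembly-line analogy.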

Superscalar: faster CPU throughput due to the simultaneous execution of instructions within one clock cycle via redundant functional units (ALUs, multipliers, ...); a dispatcher decides at runtime which instructions read from memory can be executed in parallel and dispatches them to different functional units; for instance, the PowerPC 970 (4 ALUs, 2 FPUs). However, the performance improvement is limited by the intrinsic parallelism of the instruction stream. Pipelining is also possible for superscalar architectures, so several pipelines proceed side by side.

Very long instruction word (VLIW): in contrast to superscalar architectures, the compiler groups parallel executable instructions during compilation (pipelining still possible). Advantage: no additional hardware logic necessary. Drawback: not always fully usable (dummy filling with NOPs).

Vector units: simultaneous execution of one instruction on a one-dimensional array of data (a vector), e.g. the component-wise operation (A_1, A_2, ..., A_N)^T op (B_1, B_2, ..., B_N)^T = (C_1, C_2, ..., C_N)^T in a single vector instruction. Vector units first appeared in the 1970s and were the basis of most supercomputers in the 1980s and 1990s. Drawbacks: specialised hardware and vector registers, very expensive, limited application areas (mostly CFD, CSD, ...).
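As a software-level analogy (a sketch assuming NumPy is available; this is not the vector hardware discussed on the slide), a whole-array operation expresses the same idea as one vector instruction applied to N elements, and NumPy typically maps it to the SIMD/vector instructions of the CPU:

```python
import numpy as np

n = 8
a = np.arange(n, dtype=np.float64)   # A_1 ... A_N
b = np.ones(n, dtype=np.float64)     # B_1 ... B_N

# One "vector operation" on all N elements instead of an explicit element loop.
c = a + b                            # C_i = A_i + B_i
print(c)
```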

Dual core, quad core, many core, and multicore. Observation: the frequency f (and thus the core voltage v) has been increased over the past years; problem: the thermal power dissipation grows proportionally to f*v^2. A 25% reduction in performance (i.e. in core voltage) leads to approximately a 50% reduction in dissipation (normal CPU vs. reduced CPU); a back-of-the-envelope sketch follows below. Idea: installation of two cores per die with the same dissipation as a single-core system. Single-core, dual-core, and quad-core dies mainly differ in their cache organisation: each core has a private L1 cache, pairs of cores share an L2 cache, and all cores connect to memory via the FSB (front side bus, i.e. the connection to memory via the north bridge).
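The voltage/frequency argument can be made concrete with a tiny calculation (a sketch under the simplifying assumption that the frequency scales with the core voltage, so the slide's f*v^2 behaves like v^3; the 0.75 scaling factor is the 25% reduction mentioned above):

```python
# Dissipation model from the slide: P ~ f * v**2; assume (simplification) f ~ v.
def relative_dissipation(voltage_scale: float) -> float:
    f = voltage_scale              # frequency assumed to scale with voltage
    v = voltage_scale
    return f * v**2

single_core = relative_dissipation(1.0)     # reference: 1.0
reduced_core = relative_dissipation(0.75)   # ~0.42, i.e. roughly half the dissipation
dual_core = 2 * reduced_core                # two reduced cores: ~0.84 of the original budget
dual_perf = 2 * 0.75                        # ~1.5x the original throughput in the ideal case
print(single_core, reduced_core, dual_core, dual_perf)
```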

Intel Nehalem (Core i7): four cores (core 0 to core 3), each with private L1 and L2 caches, plus a shared L3 cache. QPI (QuickPath Interconnect) replaces the FSB; QPI is a point-to-point interconnection with the memory controller now on-die in order to allow both reduced latency and higher bandwidth, up to (theoretically) 25.6 GB/s data transfer, i.e. roughly twice the FSB (source: G. Wellein, RRZE).

Intel Xeon E5 Sandy Bridge series: 2 CPUs connected by 2 QPI links (Intel QuickPath Interconnect). Each QPI link has 1 sending and 1 receiving port: 8 GT/s times 16 bit/transfer payload times 2 directions divided by 8 bit/byte = 32 GB/s maximum bandwidth per QPI link; with 2 QPI links this gives 2 x 32 GB/s = 64 GB/s maximum bandwidth (a short sketch of this arithmetic follows at the end of this passage).

Overview: motivation, hardware excursion, supercomputers, classification of parallel computers, quantitative performance evaluation.

Arrival of clusters: in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices, hence a growing attractiveness for parallel computers. 1994: Beowulf, the first parallel computer built completely out of commodity hardware (NASA Goddard Space Flight Centre): 16 Intel DX4 processors, multiple 10 Mbit Ethernet links, Linux with GNU compilers, MPI library. 1996: a Beowulf cluster performing more than 1 GFlops. 1997: a 140-node cluster performing more than 10 GFlops.
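Returning to the QPI bandwidth figure quoted above, it can be reproduced with a few lines (the numbers are the ones stated on the slide; the code only performs the unit conversion):

```python
transfer_rate_gt_s = 8          # 8 GT/s per QPI link
payload_bits_per_transfer = 16  # 16 bit payload per transfer
directions = 2                  # 1 sending and 1 receiving port
bits_per_byte = 8

gb_per_s_per_link = (transfer_rate_gt_s * payload_bits_per_transfer
                     * directions / bits_per_byte)
print(gb_per_s_per_link)        # 32.0 GB/s per QPI link
print(2 * gb_per_s_per_link)    # 64.0 GB/s for two QPI links
```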

Supercomputers. Supercomputing, or high-performance scientific computing, is the most important application of the big "number crunchers"; national initiatives exist due to the huge budget requirements. Accelerated Strategic Computing Initiative (ASCI) in the U.S., in the sequel of the nuclear testing moratorium; decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.; start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world's first TFlops computer); then ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, ...; meanwhile a new high-end computing memorandum (2004).

Federal "Bundeshöchstleistungsrechner" initiative in Germany, decided in the mid-nineties: three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich), one new installation every second year (i.e. a six-year upgrade cycle for each centre), the newest one to be among the top 10 of the world. Overview and state of the art: the Top500 list, updated every six months. Finally, a somewhat different definition: "Supercomputer: turns CPU-bound problems into I/O-bound problems." (Ken Batcher)

Moore's law: an observation of Intel co-founder Gordon E. Moore describing an important trend in the history of computer hardware (1965): the number of transistors that can be placed on an integrated circuit increases exponentially, doubling approximately every eighteen months.
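The doubling rule can be written down directly (a sketch; the 18-month period is the figure used on the slide, and the starting transistor count in the example is a made-up value):

```python
def transistors(years: float, start_count: float,
                doubling_period_years: float = 1.5) -> float:
    # Exponential growth: the count doubles every doubling_period_years.
    return start_count * 2 ** (years / doubling_period_years)

# Hypothetical example: starting from 1 million transistors,
# after 15 years the rule predicts 2**10 = 1024 times as many.
print(transistors(15, 1e6))   # ~1.024e9
```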

Some numbers from the Top500 (cont'd): Citius, altius, fortius! The 10 fastest supercomputers in the world (as of November 2014).

The Earth Simulator, world's #1 from 2002, installed in Yokohama, Japan (ES building approx. 50 m x 65 m x 17 m), based on the NEC SX-6 architecture and developed by three governmental agencies. A highly parallel vector supercomputer: 640 nodes (plus 2 control and 128 data-switching nodes), each with 8 vector processors (8 GFlops each) and 16 GB shared memory; in total 5120 processors (40.96 TFlops peak performance) and 10 TB memory, with a sustained Linpack performance likewise in the TFlops range. The nodes are connected by a single-stage crossbar (83,200 cables with a total length of 2400 km; 8 TB/s total bandwidth); furthermore 700 TB disc space and 1.60 PB mass storage.

BlueGene/L, installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM) and the world's #1 system for several years; a cooperation of the DoE, LLNL, and IBM. A massively parallel supercomputer: 65,536 nodes (plus 12 front-end and 1204 I/O nodes), each with 2 PowerPC 440d processors (2.8 GFlops each) and 512 MB memory; in total 131,072 processors, with peak and sustained (Linpack) performance in the hundreds of TFlops. The nodes are configured as a 3D torus; a global reduction tree allows fast global operations (max, sum, ...) within a few microseconds; a 1024 Gbps link connects to the global parallel file system; furthermore 806 TB disc space; operating system SuSE SLES 9.

Roadrunner, world's #1 from 2008, installed at LANL, NM, USA; installation costs about $120 million; the first hybrid supercomputer: dual-core Opterons combined with the Cell Broadband Engine; 129,600 cores and 98 TB memory, the first system to exceed a sustained PFlops on Linpack. Standard processing (file system I/O, e.g.) is handled by the Opterons, while mathematically and CPU-intensive tasks are handled by the Cells; 2.35 MW power consumption (about 437 MFlops per watt). Primary usage: ensuring the safety and reliability of the nation's nuclear weapons stockpile, plus real-time applications (cause and effect in capital markets, renderings of bone structures and tissues while patients are being examined, e.g.).

HLRB II, for a time the world's #6, installed in 2006 at LRZ, Garching; installation costs 38 million (EUR), monthly costs approx. 400,000 (EUR); upgrade finished in 2007; one of Germany's 3 federal supercomputers. An SGI Altix 4700 consisting of 19 nodes (SGI NUMAlink 2D torus) of 256 blades each (ccNUMA, NUMAlink fat tree within a partition), with Intel Itanium2 Montecito dual-core processors (12.80 GFlops) and 4 GB memory per core; 9728 cores (62.30 TFlops peak performance) and 39 TB memory, with a sustained Linpack performance likewise in the TFlops range; footprint 24 m x 12 m; total weight 103 metric tons.

SuperMUC, for a time the world's #4, installed in 2012 at LRZ, Garching; an IBM System x iDataPlex, (still) one of Germany's 3 federal supercomputers. It consists of 19 islands (InfiniBand FDR10 pruned tree with a 4:1 intra-island/inter-island ratio): 18 thin islands with 512 nodes each (288 TB memory in total), Sandy Bridge-E Xeon E5 (2 CPUs with 8 cores each per node), and 1 fat island with 205 nodes (52 TB memory in total), Westmere-EX Xeon E7 (4 CPUs with 10 cores each per node); 147,456 cores (3.185 PFlops peak performance, thin islands only) and a sustained Linpack performance likewise in the PFlops range; footprint 21 m x 26 m; warm water cooling.

Overview: motivation, hardware excursion, supercomputers, classification of parallel computers, quantitative performance evaluation.

Classification of Parallel Computers. Standard classification according to Flynn: global data and instruction streams as criterion; instruction stream: sequence of commands to be executed; data stream: sequence of data subject to instruction streams. Two-dimensional subdivision according to the number of instructions a computer can execute per time and the number of data elements it can process per time. Hence, Flynn distinguishes four classes of architectures: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data), MIMD (multiple instruction, multiple data). Drawback: very different computers may belong to the same class.

SISD: one processing unit that has access to one data memory and one program memory; the classical monoprocessor following von Neumann's principle.

SIMD: several processing units, each with separate access to a (shared or distributed) data memory, and one program memory; synchronous execution of instructions. Examples: array computers, vector computers. Advantage: easy programming model due to a control flow with strictly synchronous-parallel execution of all instructions. Drawbacks: specialised hardware is necessary and easily becomes outdated due to recent developments on the commodity market.

MISD: several processing units that have access to one data memory; several program memories. Not a very popular class (mainly for special applications such as digital signal processing); it operates on a single stream of data, forwarding results from one processing unit to the next. Example: systolic arrays (networks of primitive processing elements that "pump" data).

MIMD: several processing units, each with separate access to a (shared or distributed) data memory; several program memories. Classification according to the (physical) memory organisation: shared memory with a shared (global) address space, or distributed memory with a distributed (local) address space. Examples: multiprocessor systems, networks of computers.

Processor coupling: the cooperation of processors and computers as well as their shared use of various resources requires communication and synchronisation. The following types of processor coupling can be distinguished: memory-coupled multiprocessor systems (MemMS) and message-coupled multiprocessor systems (MesMS). A global memory with a shared address space gives MemMS (SMP); distributed memory with a shared address space gives a hybrid form (Mem-MesMS); distributed memory with a distributed address space gives MesMS.

Uniform memory access (UMA): each processor has direct access via the network to each memory module M, with the same access times to all data. The standard programming model can be used (i.e. no explicit send/receive of messages is necessary); communication and synchronisation take place via shared variables, and inconsistencies (write conflicts, e.g.) have in general to be prevented by the programmer, as illustrated by the sketch below.
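A minimal shared-memory sketch in Python (threads standing in for processors; the lock plays the role of the synchronisation mechanism the programmer has to provide, and is only an analogy, not the lecture's own example):

```python
import threading

counter = 0                    # shared variable in the common address space
lock = threading.Lock()

def work(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        with lock:             # prevents the write conflict on the shared variable
            counter += 1

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                 # 400000; without the lock, updates may get lost
```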

Symmetric multiprocessor (SMP): only a small number of processors, in most cases a central bus, one address space (UMA), but bad scalability; cache coherence implemented in hardware (i.e. a read always provides a variable's value from its last write). Examples: double or quad boards, SGI Challenge.

Non-uniform memory access (NUMA): the memory modules are physically distributed among the processors; shared address space, but access times depend on the location of the data (i.e. local addresses are faster than remote addresses), and these differences in access times are visible in the program. Examples: DSM/VSM, Cray T3E.

Cache-coherent non-uniform memory access (ccNUMA): caches for local and remote addresses; cache coherence implemented in hardware for the entire address space; scalability problems due to frequent cache updates. Example: SGI Origin 2000.

Cache-only memory access (COMA): each processor has only cache memory; the entirety of all cache memories forms the global shared memory; cache coherence implemented in hardware. Example: Kendall Square Research KSR-1.

No remote memory access (NORMA): each processor has direct access to its local memory only; access to remote memory only via explicit message exchange (due to the distributed address space); synchronisation is possible implicitly via the exchange of messages; performance improvements between memory and I/O are possible due to parallel data transfer (Direct Memory Access, e.g.). Examples: IBM SP2, ASCI Red/Blue/White.

Difference between processes and threads: in the thread model (UMA, NUMA), one program (*.exe, *.out, e.g.) is executed by several threads sharing a common address space; in the process model (NORMA), several instances of a program run as separate processes and communicate via messages.

Overview: motivation, hardware excursion, supercomputers, classification of parallel computers, quantitative performance evaluation.

Quantitative Performance Evaluation. Execution time: the time T of a parallel program between the start of the execution on one processor and the end of all computations on the last processor. During execution, every processor is in one of the following states: compute (T_COMP: time spent for computations), communicate (T_COMM: time spent for send and receive operations), idle (T_IDLE: time spent waiting for sending/receiving messages). Hence T = T_COMP + T_COMM + T_IDLE.

Comparison multiprocessor vs. monoprocessor: correlating the performance of multi- and monoprocessor systems; important: a program that can be executed on both systems. Definitions: P(1) is the number of unit operations of a program on the monoprocessor system; P(p) is the number of unit operations on the multiprocessor system with p processors; T(1) is the execution time on the monoprocessor system (measured in steps or clock cycles); T(p) is the execution time on the multiprocessor system with p processors (measured in steps or clock cycles).

Simplifying preconditions: T(1) = P(1), since one operation is executed per step on the monoprocessor system; T(p) <= P(p), since more than one operation can be executed per step (for p >= 2) on the multiprocessor system with p processors.

Speed-up: S(p) = T(1)/T(p) indicates the improvement in processing speed, with 1 <= S(p) <= p. Efficiency: E(p) = S(p)/p indicates the relative improvement in processing speed, normalised by the number of processors p, with 1/p <= E(p) <= 1 (see the sketch below).

Speed-up and efficiency can be seen in two different ways. Algorithm-independent: the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system (absolute speed-up, absolute efficiency). Algorithm-dependent: the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system, which is unfair due to the communication and synchronisation overhead (relative speed-up, relative efficiency).
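A direct transcription of the definitions of S(p) and E(p) (illustrative only; the timing values in the example are invented):

```python
def speedup(t1: float, tp: float) -> float:
    # S(p) = T(1) / T(p)
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    # E(p) = S(p) / p
    return speedup(t1, tp) / p

# Hypothetical measurements: 120 s sequentially, 20 s on 8 processors.
print(speedup(120.0, 20.0))         # 6.0
print(efficiency(120.0, 20.0, 8))   # 0.75
```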

Scalability. Objective: adding further processing elements to the system shall reduce the execution time without any program modifications, i.e. a linear performance increase with an efficiency close to 1. Important for scalability is a sufficient problem size: one porter may carry one suitcase in a minute; 60 porters won't do it in a second, but 60 porters may carry 60 suitcases in a minute. In case of a fixed problem size and an increasing number of processors, saturation occurs for a certain value of p, hence scalability is limited; when scaling the number of processors together with the problem size (so-called scaled problem analysis), this effect does not appear for well scalable hard- and software systems.

Amdahl's law: probably the most important and most famous estimate for the speed-up (even if quite pessimistic). Underlying model: each program has a sequential part s, 0 <= s <= 1, that can only be executed sequentially (synchronisation, data I/O, ...); furthermore, each program has a parallelisable part 1-s that can be executed in parallel by several processes (finding the maximum value within a set of numbers, e.g.). Hence, the execution time of the parallel program executed on p processors can be written as
T(p) = s*T(1) + ((1-s)/p)*T(1).
The speed-up can thus be computed as
S(p) = T(1)/T(p) = 1/(s + (1-s)/p),
and when increasing p we finally get Amdahl's law:
lim (p -> infinity) S(p) = 1/s.
The speed-up is bounded: S(p) <= 1/s. Example: for s = 0.1 the speed-up is bounded by 10, independent of p. Where's the error? The sequential part can have a dramatic impact on the speed-up; therefore a central effort of all (parallel) algorithms is to keep s small. Many parallel programs have a small sequential part (s < 0.1).
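Amdahl's bound in executable form (the same formula as above; the values of s and p are just examples):

```python
def amdahl_speedup(s: float, p: int) -> float:
    # S(p) = 1 / (s + (1 - s) / p)
    return 1.0 / (s + (1.0 - s) / p)

s = 0.1
for p in (1, 2, 4, 8, 16, 64, 1024):
    print(p, round(amdahl_speedup(s, p), 2))
# The values approach, but never exceed, 1/s = 10.
```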

Gustafson's law: addresses the shortcomings of Amdahl's law, as it states that any sufficiently large problem can be efficiently parallelised; instead of a fixed problem size it supposes a fixed-time concept. Underlying model: the execution time on the parallel machine is normalised to 1; this contains a non-parallelisable part sigma, 0 <= sigma <= 1. Hence, the execution time of the sequential program on the monoprocessor can be written as
T(1) = sigma + p*(1 - sigma),
and the speed-up can thus be computed as
S(p) = sigma + p*(1 - sigma) = p + sigma*(1 - p).
Difference to Amdahl: the sequential part s(p) is not constant, but gets smaller with increasing p,
s(p) = sigma/(sigma + p*(1 - sigma)), with s(p) -> 0 for p -> infinity.
This is often more realistic, because more processors are used for a larger problem size, and there the parallelisable parts typically increase (more computations, fewer declarations, ...); the speed-up is not bounded for increasing p.

Some more thoughts about speed-up: theory tells us that a superlinear speed-up does not exist, since each parallel algorithm can be simulated on a monoprocessor system by emulating, in a loop, always the next step of one processor of the multiprocessor system. But a superlinear speed-up can be observed when improving an inferior sequential algorithm, or when a parallel program that does not fit into the main memory of the monoprocessor system runs completely in the caches and main memories of the nodes of the multiprocessor system.

Communication-computation ratio (CCR): an important quantity measuring the success of a parallelisation; the relation of pure communication time to pure computing time; a small CCR is favourable; typically, the CCR decreases with increasing problem size. Example: an NxN matrix distributed among p processors (N/p rows each); an iterative method replaces, in each step, each matrix element by the average of its eight neighbour values, hence the two neighbouring rows are always needed from the adjacent processors. Computation time: 8*N*N/p; communication time: 2*N; CCR: (2*N)/(8*N*N/p) = p/(4*N). What does this mean? For a fixed p, the CCR shrinks as N grows, i.e. larger problems spend relatively less time communicating.
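The scaled speed-up and the CCR example can be checked in the same way (a sketch; sigma, p and N below are arbitrary example values):

```python
def gustafson_speedup(sigma: float, p: int) -> float:
    # S(p) = sigma + p * (1 - sigma): grows without bound as p increases.
    return sigma + p * (1.0 - sigma)

def ccr(n: int, p: int) -> float:
    # Iterative stencil on an N x N matrix, N/p rows per processor:
    # computation ~ 8*N*N/p, communication ~ 2*N  =>  CCR = p / (4*N).
    return (2.0 * n) / (8.0 * n * n / p)

print(gustafson_speedup(0.1, 100))   # 90.1
print(ccr(1000, 16))                 # 0.004 == 16 / (4 * 1000)
```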

"Twelve ways to fool the masses when giving performance results on parallel computers." (David H. Bailey, NASA Ames Research Centre)
1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimised codes on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilisation, parallel speed-ups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.

High Performance Computing Programming Paradigms and Scalability Part 1: Introduction

High Performance Computing Programming Paradigms and Scalability Part 1: Introduction High Performance Computing Programming Paradigms and Scalability Part 1: Introduction PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering (CiE) Scientific Computing (SCCS) Summer Term

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming ATHENS Course on Parallel Numerical Simulation Munich, March 19 23, 2007 Dr. Ralf-Peter Mundani Scientific Computing in Computer Science Technische Universität München

More information

Practical Scientific Computing

Practical Scientific Computing Practical Scientific Computing Performance-optimised Programming Preliminary discussion, 17.7.2007 Dr. Ralf-Peter Mundani, mundani@tum.de Dipl.-Ing. Ioan Lucian Muntean, muntean@in.tum.de Dipl.-Geophys.

More information

Practical Scientific Computing

Practical Scientific Computing Practical Scientific Computing Performance-optimized Programming Preliminary discussion: July 11, 2008 Dr. Ralf-Peter Mundani, mundani@tum.de Dipl.-Ing. Ioan Lucian Muntean, muntean@in.tum.de MSc. Csaba

More information

Parallel Computing. PD Dr. rer. nat. habil. Ralf-Peter Mundani. Computation in Engineering / BGU Scientific Computing in Computer Science / INF

Parallel Computing. PD Dr. rer. nat. habil. Ralf-Peter Mundani. Computation in Engineering / BGU Scientific Computing in Computer Science / INF Parallel Computing PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2018/19 General Remarks Ralf-Peter Mundani email

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

BİL 542 Parallel Computing

BİL 542 Parallel Computing BİL 542 Parallel Computing 1 Chapter 1 Parallel Programming 2 Why Use Parallel Computing? Main Reasons: Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion,

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

What is Parallel Computing?

What is Parallel Computing? What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing

More information

CCS HPC. Interconnection Network. PC MPP (Massively Parallel Processor) MPP IBM

CCS HPC. Interconnection Network. PC MPP (Massively Parallel Processor) MPP IBM CCS HC taisuke@cs.tsukuba.ac.jp 1 2 CU memoryi/o 2 2 4single chipmulti-core CU 10 C CM (Massively arallel rocessor) M IBM BlueGene/L 65536 Interconnection Network 3 4 (distributed memory system) (shared

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Computer Architecture

Computer Architecture Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors

More information

Parallel Computer Architectures. Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam

Parallel Computer Architectures. Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam Parallel Computer Architectures Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam Outline Flynn s Taxonomy Classification of Parallel Computers Based on Architectures Flynn s Taxonomy Based on notions of

More information

Lecture 9: MIMD Architecture

Lecture 9: MIMD Architecture Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

What are Clusters? Why Clusters? - a Short History

What are Clusters? Why Clusters? - a Short History What are Clusters? Our definition : A parallel machine built of commodity components and running commodity software Cluster consists of nodes with one or more processors (CPUs), memory that is shared by

More information

CS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2

CS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2 CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann

More information

Top500 Supercomputer list

Top500 Supercomputer list Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Lecture 8: RISC & Parallel Computers. Parallel computers

Lecture 8: RISC & Parallel Computers. Parallel computers Lecture 8: RISC & Parallel Computers RISC vs CISC computers Parallel computers Final remarks Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) is an important innovation in computer

More information

Multi-core Programming - Introduction

Multi-core Programming - Introduction Multi-core Programming - Introduction Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

CPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner

CPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner CS104 Computer Organization and rogramming Lecture 20: Superscalar processors, Multiprocessors Robert Wagner Faster and faster rocessors So much to do, so little time... How can we make computers that

More information

Dr. Joe Zhang PDC-3: Parallel Platforms

Dr. Joe Zhang PDC-3: Parallel Platforms CSC630/CSC730: arallel & Distributed Computing arallel Computing latforms Chapter 2 (2.3) 1 Content Communication models of Logical organization (a programmer s view) Control structure Communication model

More information

Parallel Computing Platforms

Parallel Computing Platforms Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0)

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

CS Parallel Algorithms in Scientific Computing

CS Parallel Algorithms in Scientific Computing CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan

More information

Chapter 1: Perspectives

Chapter 1: Perspectives Chapter 1: Perspectives Copyright @ 2005-2008 Yan Solihin Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical,

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow.

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Big problems and Very Big problems in Science How do we live Protein

More information

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes: BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General

More information

Intro to Multiprocessors

Intro to Multiprocessors The Big Picture: Where are We Now? Intro to Multiprocessors Output Output Datapath Input Input Datapath [dapted from Computer Organization and Design, Patterson & Hennessy, 2005] Multiprocessor multiple

More information

Performance Report Guidelines. Babak Behzad, Alex Brooks, Vu Dang 12/04/2013

Performance Report Guidelines. Babak Behzad, Alex Brooks, Vu Dang 12/04/2013 Performance Report Guidelines Babak Behzad, Alex Brooks, Vu Dang 12/04/2013 Motivation We need a common way of presenting performance results on Blue Waters! Different applications Different needs Different

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Thread and Data parallelism in CPUs - will GPUs become obsolete?

Thread and Data parallelism in CPUs - will GPUs become obsolete? Thread and Data parallelism in CPUs - will GPUs become obsolete? USP, Sao Paulo 25/03/11 Carsten Trinitis Carsten.Trinitis@tum.de Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) Institut für

More information

COSC 6374 Parallel Computation. Parallel Computer Architectures

COSC 6374 Parallel Computation. Parallel Computer Architectures OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:

More information

BlueGene/L (No. 4 in the Latest Top500 List)

BlueGene/L (No. 4 in the Latest Top500 List) BlueGene/L (No. 4 in the Latest Top500 List) first supercomputer in the Blue Gene project architecture. Individual PowerPC 440 processors at 700Mhz Two processors reside in a single chip. Two chips reside

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Dheeraj Bhardwaj May 12, 2003

Dheeraj Bhardwaj May 12, 2003 HPC Systems and Models Dheeraj Bhardwaj Department of Computer Science & Engineering Indian Institute of Technology, Delhi 110 016 India http://www.cse.iitd.ac.in/~dheerajb 1 Sequential Computers Traditional

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

Supercomputers. Alex Reid & James O'Donoghue

Supercomputers. Alex Reid & James O'Donoghue Supercomputers Alex Reid & James O'Donoghue The Need for Supercomputers Supercomputers allow large amounts of processing to be dedicated to calculation-heavy problems Supercomputers are centralized in

More information

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance

More information

COSC 6374 Parallel Computation. Parallel Computer Architectures

COSC 6374 Parallel Computation. Parallel Computer Architectures OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy

More information

HPC Issues for DFT Calculations. Adrian Jackson EPCC

HPC Issues for DFT Calculations. Adrian Jackson EPCC HC Issues for DFT Calculations Adrian Jackson ECC Scientific Simulation Simulation fast becoming 4 th pillar of science Observation, Theory, Experimentation, Simulation Explore universe through simulation

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Lecture notes for CS Chapter 4 11/27/18

Lecture notes for CS Chapter 4 11/27/18 Chapter 5: Thread-Level arallelism art 1 Introduction What is a parallel or multiprocessor system? Why parallel architecture? erformance potential Flynn classification Communication models Architectures

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Parallel Computer Architecture Concepts

Parallel Computer Architecture Concepts Outline This image cannot currently be displayed. arallel Computer Architecture Concepts TDDD93 Lecture 1 Christoph Kessler ELAB / IDA Linköping university Sweden 2015 Lecture 1: arallel Computer Architecture

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

Advances of parallel computing. Kirill Bogachev May 2016

Advances of parallel computing. Kirill Bogachev May 2016 Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being

More information

Fabio AFFINITO.

Fabio AFFINITO. Introduction to High Performance Computing Fabio AFFINITO What is the meaning of High Performance Computing? What does HIGH PERFORMANCE mean??? 1976... Cray-1 supercomputer First commercial successful

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Cluster Network Products

Cluster Network Products Cluster Network Products Cluster interconnects include, among others: Gigabit Ethernet Myrinet Quadrics InfiniBand 1 Interconnects in Top500 list 11/2009 2 Interconnects in Top500 list 11/2008 3 Cluster

More information

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Introduction to High-Performance Computing

Introduction to High-Performance Computing Introduction to High-Performance Computing Simon D. Levy BIOL 274 17 November 2010 Chapter 12 12.1: Concurrent Processing High-Performance Computing A fancy term for computers significantly faster than

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Convergence of Parallel Architecture

Convergence of Parallel Architecture Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty

More information

CMSC 611: Advanced. Parallel Systems

CMSC 611: Advanced. Parallel Systems CMSC 611: Advanced Computer Architecture Parallel Systems Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information

PARALLEL COMPUTER ARCHITECTURES

PARALLEL COMPUTER ARCHITECTURES 8 ARALLEL COMUTER ARCHITECTURES 1 CU Shared memory (a) (b) Figure 8-1. (a) A multiprocessor with 16 CUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different

More information

Chapter 2 Parallel Computer Architecture

Chapter 2 Parallel Computer Architecture Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it

More information

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines

More information

CSE 262 Spring Scott B. Baden. Lecture 1 Introduction

CSE 262 Spring Scott B. Baden. Lecture 1 Introduction CSE 262 Spring 2007 Scott B. Baden Lecture 1 Introduction Introduction Your instructor is Scott B. Baden, baden@cs.ucsd.edu Office: room 3244 in EBU3B Office hours: Tuesday after class (week 1) or by appointment

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Parallel Programming Concepts. Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04. Parallel Background. Why Bother?

Parallel Programming Concepts. Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04. Parallel Background. Why Bother? Parallel Programming Concepts Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04 Parallel Background Why Bother? 1 What is Parallel Programming? Simultaneous use of multiple

More information

CS420/CSE 402/ECE 492. Introduction to Parallel Programming for Scientists and Engineers. Spring 2006

CS420/CSE 402/ECE 492. Introduction to Parallel Programming for Scientists and Engineers. Spring 2006 CS420/CSE 402/ECE 492 Introduction to Parallel Programming for Scientists and Engineers Spring 2006 1 of 28 Additional Foils 0.i: Course organization 2 of 28 Instructor: David Padua. 4227 SC padua@uiuc.edu

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Parallel Programming

Parallel Programming Parallel Programming Introduction Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Acknowledgements Prof. Felix Wolf, TU Darmstadt Prof. Matthias

More information

Online Course Evaluation. What we will do in the last week?

Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011 CS4961 Parallel Programming Lecture 3: Introduction to Parallel Architectures Administrative UPDATE Nikhil office hours: - Monday, 2-3 PM, MEB 3115 Desk #12 - Lab hours on Tuesday afternoons during programming

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Introduction to parallel computing

Introduction to parallel computing Introduction to parallel computing 2. Parallel Hardware Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Motherboard Processor https://sites.google.com/

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

ARCHITECTURES FOR PARALLEL COMPUTATION

ARCHITECTURES FOR PARALLEL COMPUTATION Datorarkitektur Fö 11/12-1 Datorarkitektur Fö 11/12-2 Why Parallel Computation? ARCHITECTURES FOR PARALLEL COMTATION 1. Why Parallel Computation 2. Parallel Programs 3. A Classification of Computer Architectures

More information