Leistungsanalyse von Rechnersystemen

1 Center for Information Services and High Performance Computing (ZIH): Leistungsanalyse von Rechnersystemen, 10. November 2010. Nöthnitzer Straße 46, Raum 1026. Holger Brunst (holger.brunst@tu-dresden.de), Matthias S. Mueller (matthias.mueller@tu-dresden.de)

2 Center for Information Services and High Performance Computing (ZIH): Summary of Previous Lecture

3 Summary of Previous Lecture. Different workloads: test workload, real workload, synthetic workload. Historical examples for test workloads: addition instruction, instruction mixes, kernels, synthetic programs, application benchmarks.

4 Excursion on Speedup and Efficiency Metrics. Comparison of sequential and parallel algorithms. Speedup: S_n = T_1 / T_n, where n is the number of processors, T_1 is the execution time of the sequential algorithm, and T_n is the execution time of the parallel algorithm with n processors. Efficiency: E_n = S_n / n; its value estimates how well-utilized the n processors are when solving a given problem. It usually lies between zero and one. Exception: superlinear speedup (later).
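As a quick worked illustration (my own sketch, not part of the original slides), the following computes speedup and efficiency from made-up run times:

```c
#include <stdio.h>

int main(void) {
    double t1 = 120.0;  /* hypothetical sequential run time in seconds */
    double tn = 8.5;    /* hypothetical run time on n processors */
    int n = 16;         /* number of processors */

    double speedup = t1 / tn;          /* S_n = T_1 / T_n */
    double efficiency = speedup / n;   /* E_n = S_n / n */

    printf("speedup S_%d = %.2f, efficiency E_%d = %.2f\n",
           n, speedup, n, efficiency);
    return 0;
}
```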

5 Amdahl's Law. Find the maximum expected improvement of an overall system when only part of the system is improved. Serial execution time = s + p; parallel execution time = s + p/n. Normalizing with respect to the serial time (s + p = 1) results in: S_n = 1/(s + p/n). Speedup drops off rapidly as the serial fraction increases; the maximum speedup possible is 1/s, independent of n, the number of processors! Bad news: if an application has only 1% serial work (s = 0.01), then you will never see a speedup greater than 100. So why do we build systems with more than 100 processors? What is wrong with this argument?
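To make the plateau concrete, here is a small sketch (my own, not from the slides) that evaluates Amdahl's formula for a serial fraction of 1%; the speedup approaches 1/s = 100 no matter how many processors are added:

```c
#include <stdio.h>

/* Amdahl's law with serial fraction s and p = 1 - s: S_n = 1 / (s + (1 - s)/n) */
static double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double s = 0.01;  /* 1% serial work */
    int procs[] = {10, 100, 1000, 10000, 100000};

    for (int i = 0; i < 5; i++)
        printf("n = %6d  S_n = %6.2f\n", procs[i], amdahl_speedup(s, procs[i]));
    return 0;
}
```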

6 Popular and historic benchmarks: Eratosthenes' sieve algorithm, Ackermann's function, Whetstone, LINPACK, Dhrystone, Lawrence Livermore Loops, TPC-C, SPEC.

7 Workload description. Level of detail of the workload description, with examples: most frequent request (e.g. addition); frequency of request types (instruction mix); time-stamped sequence of requests; average resource demand (e.g. 20 I/O requests per second); distribution of resource demands (not only the average, but also the probability distribution).

8 Characterization of Benchmarks. There are many metrics; each one has its purpose. Ranging from raw computer hardware to full applications: raw machine performance (Tflop/s); microbenchmarks (Stream); algorithmic benchmarks (Linpack); compact apps/kernels (NAS benchmarks); application suites (SPEC); user-specific applications (custom benchmarks).

9 Comparison of different benchmark classes: the classes Micro, Algorithmic, Kernels, SPEC, and Apps are compared with respect to coverage, relevance, ability to identify problems, and time evolution.

10 SPEC Benchmarks: CPU 2006. Application benchmarks. Different metrics: integer and floating-point, standard and rate, base and peak. Run rules.

11 Center for Information Services and High Performance Computing (ZIH): Stream

12 Stream Benchmark. Author: John McCalpin ("Mr Bandwidth"). Reference: John McCalpin, Memory Bandwidth and Machine Balance in High Performance Computers, IEEE TCCA Newsletter, December 1995. STREAM measures memory bandwidth with the operations Copy: a(i) = b(i); Scale: a(i) = s*b(i); Add: a(i) = b(i) + c(i); Triad: a(i) = b(i) + s*c(i). STREAM2 measures memory hierarchy bandwidth with the operations Fill: a(i) = 0; Copy: a(i) = b(i); Daxpy: a(i) = a(i) + q*b(i); Sum: sum += a(i).
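The reference STREAM code is more careful (multiple repetitions, best-of timing, validation); the fragment below is only a simplified sketch of how the triad kernel and its bandwidth can be measured, assuming 24 bytes of memory traffic per iteration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L  /* array length; must be much larger than the caches */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double s = 3.0;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    double t0 = now();
    for (long i = 0; i < N; i++)   /* Triad: a(i) = b(i) + s*c(i) */
        a[i] = b[i] + s * c[i];
    double t1 = now();

    /* 24 bytes per iteration: read b(i) and c(i), write a(i) */
    printf("Triad: %.1f MB/s (check a[0] = %.1f)\n",
           24.0 * N / 1e6 / (t1 - t0), a[0]);

    free(a); free(b); free(c);
    return 0;
}
```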

13 Stream 2 properties

14 Stream Results: TOP 10. STREAM Memory Bandwidth, John D. McCalpin, revised to Tue Jul 25 10:10:14 CST 2006. All results are in MB/s (MB = 10^6 B, *not* 2^20 B). The table lists ncpus and the COPY, SCALE, ADD, and TRIAD bandwidths for ten machines: two SGI Altix systems, an HP AlphaServer GS, a Cray T932, an HP Integrity SuperDome, an IBM System p5, and several NEC SX systems (among them an SX-5-16A).

15 Stream 2 Results

16 Center for Information Services and High Performance Computing (ZIH): Linpack and TOP500 (slides courtesy of Jack Dongarra)

17 LINPACK Benchmark. The Linpack Benchmark is a measure of a computer's floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed a bit; in fact, there are three benchmarks included in the Linpack Benchmark report. Dense linear system solved with LU factorization using partial pivoting. Operation count: 2/3 n^3 + O(n^2). Benchmark measure: Mflop/s. The original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100.
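As an illustration of the metric (my own sketch with a made-up time, not from the slides), the Mflop/s rate follows from dividing the nominal operation count by the measured time; the O(n^2) lower-order terms are ignored here:

```c
#include <stdio.h>

int main(void) {
    double n = 100.0;        /* matrix order of the original Linpack benchmark */
    double seconds = 0.0015; /* hypothetical measured factor+solve time */

    double flops = (2.0 / 3.0) * n * n * n;  /* 2/3 n^3, lower-order terms dropped */
    printf("%.1f Mflop/s\n", flops / seconds / 1e6);
    return 0;
}
```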

18 When the Linpack Fortran n = 100 benchmark is run it produces the following kind of results: a note asking that the results of the run be sent to Jack J. Dongarra, Computer Science Department, University of Tennessee, Knoxville, Tennessee (Internet: dongarra@cs.utk.edu); the verification quantities norm. resid, resid, machep, x(1) and x(n); timing tables for matrices of order 100 giving dgefa, dgesl, total, mflops, unit and ratio for arrays with two different leading dimensions; and finally the factorization time, solve time, total time and Mflop/s rate.

19 In the beginning there was the Linpack 100 Benchmark (1977): n = 100 (80 KB), a size that would fit in all the machines; Fortran; 64-bit floating-point arithmetic; no hand optimization (only compiler options).

20 Linpack 100 results over the years (columns: Year, Computer, Number of Processors, Cycle time, Mflop/s). The entries run from the CRAY 1, CRAY X-MP, NEC SX, ETA 10-E, CRAY Y-MP, CRAY C and T series machines through the Fujitsu VPP5000, Intel Pentium 4 (3.06 GHz), HP Integrity Server rx2600 (1.5 GHz), Intel Pentium Nocona (3.6 GHz) and NEC SX-8/1 up to an Intel Pentium Woodcrest (3 GHz) in 2006; the oldest entry, a CRAY 1, achieved 3.4 Mflop/s.

21 In the beginning there was the Linpack 100 Benchmark (1977): n = 100 (80 KB), a size that would fit in all the machines; Fortran; 64-bit floating-point arithmetic; no hand optimization (only compiler options). Linpack 1000 (1986): n = 1000 (8 MB), to see higher performance levels; any language; 64-bit floating-point arithmetic; hand optimization OK. Linpack TPP (1991) (Top500: 1993): any size (n as large as you can; n = 10^6 means 8 TB and about 6 hours); any language; 64-bit floating-point arithmetic; hand optimization OK; Strassen's method is not allowed (it confuses the operation count and rate); a reference implementation is available. In all cases results are verified, and the operation count used for the factorization and solve is fixed.

22 LINPACK NxN benchmark. Solves a system of linear equations by some method; allows the vendors to choose the size of the problem for the benchmark; measures the execution time for each problem size. The LINPACK NxN report gives: N_max, the size of the chosen problem run on a machine; R_max, the performance in Gflop/s for the chosen size problem run on the machine; N_1/2, the size where half the R_max execution rate is achieved; and R_peak, the theoretical peak performance in Gflop/s for the machine. LINPACK NxN is used to rank the TOP500 fastest computers in the world. (Plot: rate versus problem size, with R_max, N_1/2 and N_max marked.)
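R_peak is not measured but computed from the hardware parameters. Below is a small sketch of the usual calculation, with made-up machine parameters (nodes, sockets, cores, clock, flops per cycle are all hypothetical):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical machine: 1024 nodes, 2 sockets per node, 4 cores per socket,
       2.5 GHz clock, 4 double-precision flops per cycle per core. */
    double nodes = 1024, sockets = 2, cores = 4;
    double clock_ghz = 2.5, flops_per_cycle = 4;

    double rpeak_gflops = nodes * sockets * cores * clock_ghz * flops_per_cycle;
    printf("R_peak = %.0f Gflop/s = %.2f Tflop/s\n",
           rpeak_gflops, rpeak_gflops / 1000.0);
    return 0;
}
```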

23 TPP performance: rate versus problem size (plot).

24 Performance over time on the Linpack NxN benchmark (entries for this table began in 1991). Columns: Year, Computer, # of Procs, Measured Gflop/s, Size of Problem, Size of 1/2 Perf, Theoretical Peak Gflop/s. The machines listed range from the Fujitsu VP2600/10, NEC SX-3/44 and Fujitsu NWT through the Intel Paragon XP/S MP, Hitachi CP-PACS, Intel ASCI Option Red (200 MHz Pentium Pro, 1997), ASCI Blue-Pacific SST (IBM SP 604E, 1998), ASCI Red (Intel Pentium II Xeon core), ASCI White-Pacific (IBM SP Power3, 2001) and the Earth Simulator (NEC) up to the IBM Blue Gene/L.

25 TOP500 top 10 (columns: Manufacturer, Computer, Rmax [TF/s], Installation Site, Country, Year, #Proc): 1. IBM BlueGene/L (eServer Blue Gene), DOE Lawrence Livermore Nat Lab, USA, 2005, custom; 2. IBM BGW (eServer Blue Gene), IBM Thomas Watson Research, USA, 2005, custom; 3. IBM ASC Purple (Power5), DOE Lawrence Livermore Nat Lab, USA, 2005, custom; 4. SGI Columbia (Altix, Itanium/Infiniband), NASA Ames, USA, 2004, hybrid; 5. Dell Thunderbird (Pentium/Infiniband), DOE Sandia Nat Lab, USA, 2005, commodity; 6. Cray Red Storm (Cray XT3, AMD), DOE Sandia Nat Lab, USA, 2005, hybrid; 7. NEC Earth-Simulator (SX), Earth Simulator Center, Japan, 2002, custom; 8. IBM MareNostrum (PPC 970/Myrinet), Barcelona Supercomputer Center, Spain, 2005, commodity; 9. IBM eServer Blue Gene, ASTRON / University Groningen, Netherlands, 2005, custom; 10. Cray Jaguar (Cray XT3, AMD), DOE Oak Ridge Nat Lab, USA, 2005, hybrid, 5200 processors.

26 My Laptop

27 Cluster: commodity processors & commodity interconnect. Constellation: the number of processors per node is greater than or equal to the number of nodes in the system.

28

29

30

31 Center for Information Services and High Performance Computing (ZIH): HPCC Benchmark (slides courtesy of Jack Dongarra). Matthias Müller

32 Linpack Benchmark. The good: one number; simple to define and easy to rank; allows the problem size to change with the machine and over time. The bad: emphasizes only peak CPU speed and number of CPUs; does not stress local bandwidth; does not stress the network; does not test gather/scatter; ignores Amdahl's Law (only does weak scaling). The ugly: benchmarketeering hype. From the Linpack Benchmark and Top500: no single number can reflect overall performance. Clearly something more than Linpack is needed: the HPC Challenge Benchmark. The test suite stresses not only the processors, but also the memory system and the interconnect. The real utility of the HPCC benchmarks is that architectures can be described with a wider range of metrics than just the Flop/s from Linpack.

33 Consists of basically 7 benchmarks; think of it as a framework or harness for adding benchmarks of interest. 1. HPL (LINPACK): MPI global (Ax = b). 2. STREAM: local, single CPU; *STREAM: embarrassingly parallel. 3. PTRANS (A = A + B^T): MPI global. 4. RandomAccess: local, single CPU; *RandomAccess: embarrassingly parallel; RandomAccess: MPI global. 5. Bandwidth and latency: MPI. 6. FFT: global, single CPU, and EP. 7. Matrix Multiply: single CPU and EP.

34 HPL: linear system solve, Ax = b. STREAM: vector operations, A = B + s * C. FFT: 1-D Fast Fourier Transform, Z = fft(X). RandomAccess: integer update, T[i] = XOR(T[i], rand). (Figure: the HPC Challenge tests mapped onto the memory hierarchy, from registers and caches through local memory, remote memory, disk and tape, together with the HPCS performance targets.) HPCC was developed by HPCS to assist in testing new HEC systems. Each benchmark focuses on a different part of the memory hierarchy. The HPCS performance targets attempt to flatten the memory hierarchy, improve real application performance, and make programming easier.

35 Local: only a single processor is performing computations. Embarrassingly parallel: each processor in the entire system is performing computations, but they do not communicate with each other explicitly. Global: all processors in the system are performing computations and they explicitly communicate with each other.

36 Computational resources (diagram): CPU computational speed, memory bandwidth, node interconnect bandwidth.

37 Computational resources and the benchmarks that stress them: CPU computational speed: HPL, Matrix Multiply; memory bandwidth: STREAM; node interconnect bandwidth: random and natural ring bandwidth & latency.

38 Memory Access Patterns

39 Memory Access Patterns

40 TPP Linpack Benchmark. Used for the Top500 ratings. Solves Ax = b for a dense problem; the matrix is random. Uses LU decomposition with partial pivoting. Based on the ScaLAPACK routines but optimized. The algorithm is scalable in the sense that the parallel efficiency is kept constant with respect to the per-processor memory usage. In double-precision (64-bit) arithmetic. Run on all processors. The problem size is set by the user, and these settings are used for the other tests. Requires an implementation of MPI and an implementation of the Basic Linear Algebra Subprograms (BLAS). Reports the total TFlop/s achieved for the set of processors. It takes the most time; there are considerations to stop the process after, say, 25%. (Plot: TPP performance, rate versus size.)
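A common rule of thumb (not stated on the slide) is to choose the HPL problem size so that the N x N double-precision matrix fills most of the aggregate memory; the sketch below computes such an N for made-up machine parameters:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double nodes = 64;              /* hypothetical node count */
    double mem_per_node_gb = 32.0;  /* hypothetical memory per node */
    double fraction = 0.8;          /* leave ~20% for OS, MPI buffers, etc. */

    double bytes = nodes * mem_per_node_gb * 1e9 * fraction;
    double n = floor(sqrt(bytes / 8.0));  /* 8 bytes per matrix element */

    printf("suggested HPL problem size N = %.0f\n", n);
    return 0;
}
```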

41 The STREAM Benchmark is a standard benchmark for the measurement of computer memory bandwidth. It measures the bandwidth sustainable from standard operations, not the theoretical "peak bandwidth" provided by most vendors. Kernels (name: kernel, bytes/iter, FLOPS/iter): COPY: a(i) = b(i), 16 bytes, 0 FLOPS; SCALE: a(i) = q*b(i), 16 bytes, 1 FLOP; ADD: a(i) = b(i) + c(i), 24 bytes, 1 FLOP; TRIAD: a(i) = b(i) + q*c(i), 24 bytes, 2 FLOPS. Four operations: COPY, SCALE, ADD, TRIAD. Measures machine balance, the relative cost of memory accesses vs. arithmetic. Vector lengths are chosen to fill local memory. Tested on a single processor and on all processors in the set in an embarrassingly parallel fashion. Reports the total GB/s achieved per processor.
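To illustrate the machine-balance idea with made-up numbers (my own example): balance is commonly expressed as the peak floating-point rate divided by the sustained memory bandwidth in 8-byte words per second:

```c
#include <stdio.h>

int main(void) {
    double peak_gflops = 40.0;        /* hypothetical peak floating-point rate */
    double triad_gbytes_per_s = 10.0; /* hypothetical sustained STREAM triad bandwidth */

    double words_per_s = triad_gbytes_per_s * 1e9 / 8.0;  /* 8-byte words */
    double balance = peak_gflops * 1e9 / words_per_s;

    printf("machine balance: %.1f flops per memory word\n", balance);
    return 0;
}
```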

42 PTRANS implements a parallel matrix transpose, A = A + B^T. The matrices A and B are distributed across the processors with two-dimensional block-cyclic storage (the same storage as for HPL). It exercises the communication pattern where pairs of processors communicate with each other simultaneously, with large (out-of-cache) data transfers across the network, and stresses the global bisection bandwidth. Reports the total GB/s achieved for the set of processors.

43 RandomAccess: integer read-modify-write to random addresses, with no spatial or temporal locality. Measures memory latency, or the ability to hide memory latency. Architectural stresses: latency to cache and main memory; architectures that can generate enough outstanding memory operations to tolerate the latency turn this into a main-memory-bandwidth-constrained benchmark. Three forms: tested on a single processor; tested on all processors in the set in an embarrassingly parallel fashion; and tested with an MPI version across the set of processors, where each processor caches updates and then all processors perform MPI all-to-all communication to apply the updates across processors. Reports Gup/s (giga updates per second) per processor.
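A much simplified sketch of the local update loop (not the official HPCC code): a pseudo-random stream indexes a large table and each entry is updated with XOR, so consecutive accesses have essentially no locality. The xorshift generator below merely stands in for the benchmark's own random sequence:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define LOG2_SIZE 24                     /* table of 2^24 64-bit words (128 MB) */
#define TABLE_SIZE (1ULL << LOG2_SIZE)
#define NUPDATE    (4ULL * TABLE_SIZE)   /* HPCC performs 4 updates per table entry */

static uint64_t xorshift64(uint64_t *s) {  /* stand-in random-number generator */
    uint64_t x = *s;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return *s = x;
}

int main(void) {
    uint64_t *table = malloc(TABLE_SIZE * sizeof(uint64_t));
    for (uint64_t i = 0; i < TABLE_SIZE; i++) table[i] = i;

    uint64_t state = 0x123456789abcdefULL;
    for (uint64_t u = 0; u < NUPDATE; u++) {
        uint64_t r = xorshift64(&state);
        table[r & (TABLE_SIZE - 1)] ^= r;  /* T[i] = XOR(T[i], rand) */
    }

    printf("table[0] = %llu\n", (unsigned long long)table[0]);
    free(table);
    return 0;
}
```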

44 Ping-pong test between pairs of processors. Send a message from proc i to proc k, then return the message from proc k to proc i: proc i MPI_Send(), proc k MPI_Recv(); then proc i MPI_Recv(), proc k MPI_Send(); the other processors are waiting in MPI_Waitall(). time += MPI_Wtime(); time /= 2. The test is performed between as many distinct pairs of processors as possible; there is an upper bound on the time for the test. It tries to find the weakest link amongst all pairs: the minimum bandwidth and the maximum latency; not necessarily the same link is the worst for both bandwidth and latency. An 8 B message is used for the latency test (take the maximum time); a 2 MB message is used for the bandwidth test (take the minimum GB/s).
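A minimal MPI sketch of a single ping-pong between ranks 0 and 1 (my own illustration, not the HPCC harness); the one-way time is half the round trip, and the bandwidth is the message size divided by the one-way time:

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {  /* this sketch needs at least two ranks */
        if (rank == 0) fprintf(stderr, "run with at least 2 MPI ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int bytes = 2 * 1024 * 1024;   /* 2 MB message for the bandwidth test */
    char *buf = calloc(bytes, 1);

    double t0 = MPI_Wtime();
    if (rank == 0) {
        MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
    double one_way = (MPI_Wtime() - t0) / 2.0;  /* half the round-trip time */

    if (rank == 0)
        printf("one-way time %.6f s, bandwidth %.2f GB/s\n",
               one_way, bytes / one_way / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}
```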

45 Two types of rings: naturally ordered (using MPI_COMM_WORLD): 0, 1, 2, ..., P-1; and randomly ordered (30 rings tested), e.g. 7, 2, 5, 0, 3, 1, 4, 6. Each node posts two sends (to its left and right neighbor) and two receives (from its left and right neighbor). Two types of communication routines are used: combined send/receive and non-blocking send/receive: MPI_Sendrecv(to: right_neighbor, from: left_neighbor); or MPI_Irecv(left_neighbor), MPI_Irecv(right_neighbor) and MPI_Isend(right_neighbor), MPI_Isend(left_neighbor). The smaller (better) time of the two is taken (which one is smaller depends on the MPI implementation). An 8 B message is used for the latency test; a 2 MB message is used for the bandwidth test.
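Below is a simplified MPI sketch of one step on the naturally ordered ring using the combined send/receive variant; unlike the full test it sends in one direction only and measures a single message size:

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;         /* natural ring ordering 0,1,...,P-1 */
    int left  = (rank - 1 + size) % size;

    const int bytes = 2 * 1024 * 1024;     /* 2 MB message for the bandwidth test */
    char *sendbuf = calloc(bytes, 1);
    char *recvbuf = calloc(bytes, 1);

    double t0 = MPI_Wtime();
    /* send to the right neighbor while receiving from the left neighbor */
    MPI_Sendrecv(sendbuf, bytes, MPI_CHAR, right, 0,
                 recvbuf, bytes, MPI_CHAR, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("ring step: %.6f s, per-link bandwidth %.2f GB/s\n",
               dt, bytes / dt / 1e9);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```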

46 Uses the FFTE software, Daisuke Takahashi's code from the University of Tsukuba: 64-bit complex 1-D FFT; uses 64-bit addressing; global transpose with MPI_Alltoall(); three transposes (the data is never scrambled).

47 A single program to download and run, with a simple input file similar to the HPL input. Base run and optimization run: the base run must be made, with the user supplying MPI and the BLAS; in the optimized run certain routines may be replaced, and the user specifies what was done. Results are uploaded via the website; an HTML table and an Excel spreadsheet are generated with the performance results. Intentionally, no single figure of merit is provided (no overall ranking). Goal: no more than 2x the time to execute HPL.

48 Steps: 1. download, 2. install, 3. run, 4. upload results, 5. confirm via e-mail; optionally 6. tune, 7. run, 8. upload results, 9. confirm via e-mail. Prerequisites: C compiler, BLAS, MPI. Provide a detailed description of the installation and execution environment. Only some routines can be replaced; the data layout needs to be preserved; multiple languages can be used. Results are immediately available on the web site: interactive HTML, XML, MS Excel, Kiviat charts (radar plots).

49

50

51

52

53

54

55

56
