TORSTEN HOEFLER

Size: px

Start display at page:

Download "TORSTEN HOEFLER"

Bathsheba Short
5 years ago
Views:

1 TORSTEN HOEFLER Performance Modeling for Future Computing Technologies Presentation at Tsinghua University, Beijing, China Part of the 60 th anniversary celebration of the Computer Science Department (#9) spcl.inf.ethz.ch

2 Changing hardware constraints and the physics of computing 130nm 0.9 V [1] 3-bit FP ADD: 0.9 pj 3-bit FP MUL: 3. pj How to address locality challenges on standard architectures and programming? D. Unat et al.: Trends in Data Locality Abstractions for HPC Systems x3 bit from L1 (8 kib): 10 pj x3 bit from L (1 MiB): 100 pj x3 bit from DRAM: 1.3 nj IEEE Transactions on Parallel and Distributed Systems (TPDS). Vol 8, Nr. 10, IEEE, Oct nm 65nm Three Ls of modern computing: 45nm 3nm [1]: Marc Horowitz, Computing s Energy Problem (and what we can do about it), ISSC 014, plenary []: Moore: Landauer Limit Demonstrated, IEEE Spectrum 01 nm 14nm 10nm

Control Locality? Load-store vs. Dataflow architectures spcl.inf.ethz.

changes in the store than pushing vast numbers of words back and forth through the von

" Load-store ( von Neumann ) x=a+b Static Dataflow ( non von Neumann ) y=(a+b)*(c+d) ALU

3 Control Locality? Load-store vs. Dataflow architectures spcl.inf.ethz.ch Turing Award 1977 (Backus): "Surely there must be a less primitive way of making big changes in the store than pushing vast numbers of words back and forth through the von Neumann bottleneck." Load-store ( von Neumann ) x=a+b Static Dataflow ( non von Neumann ) y=(a+b)*(c+d) ALU Energy per instruction: 70pJ x Energy per operation: 1-3pJ y x Control Registers Cache a b Source: Mark Horowitz, ISSC 14 a+b + c+d + ld st add r1, a, b, r1, r x r Memory a b a b Memory c d y 3

4 Single Instruction Multiple Data/Threads (SIMD - Vector CPU, SIMT - GPU) ALU ALU ALU ALU ALU Control ALU ALU ALU ALU (High Performance) Computing really ALU 1 MiB: 100 pj x Registers 45nm, 0.9 V [1] Random Access SRAM: 8 kib: 10 pj 3 kib: 0 pj 45nm, 0.9 V [1] Single R/W registers: 3 bit: 0.1 pj became a data management challenge Cache + + Memory a b Memory c d y [1]: Marc Horowitz, Computing s Energy Problem (and what we can do about it), ISSC 014, plenary 4

Crystal Ball into the Post-Moore Future (maybe already today?

isolation) Many different ideas how to build Efficient CMOS Future architectures will force us to manage memory/communication Rest of this (synapses) talk: how do we understand Accelerators which

datapaths Adapt architecture to problem Phrase your problem as Optical Dataflow + Control Flow inference! Spin-based Main challenge Even learning is hard Superconducting Programmability!

5 Crystal Ball into the Post-Moore Future (maybe already today?) Neuromorphic Asynchronous CMOS circuits 1000x energy benefit Integrates compute (neurons) and Quantum Completely different paradigm Concept of qubits Bases on quantum mechanics (which only works in isolation) Many different ideas how to build Efficient CMOS Future architectures will force us to manage memory/communication Rest of this (synapses) talk: how do we understand Accelerators which Custom architectures Very specialized parts Network and of storage programs to Ion accelerate trap (ions trapped in fields) on which Reconfigurable device? datapaths Adapt architecture to problem Phrase your problem as Optical Dataflow + Control Flow inference! Spin-based Main challenge Even learning is hard Superconducting Programmability! Comparatively little work Majorana qubits See our SC18 tutorial Suddenly much lower energy (none proven to scale) benefits accelerated heterogeneity FPGAs or CGRAs or GPUs Have been around for a while Use transistors more efficiently Obvious answer: the slow ones! Needs new algorithms to be useful So simply observe Algorithms their performance? are limited Not so fast. Productive Parallel Programming for FPGA image source: 1stcentury.com image source: intel.com image source: intel.com 5

6 What can we learn from High Performance Computing dgemm("n", "N", 50, 50, 50, 1.0, A, 50, B, 50, 1.0, C, 50); >x ~4x 1 ~10 3 ~10 4 ~10 6 ~10 8 ~10 10 ~

7 HPC is used to solve complex problems! Image credit: Serena Donnin, Sarah Rauscher, Ivo Kabashow 7

8 Scientific Performance Engineering 1) Observe ) Model 3) Understand 4) Build 8

9 Part I: Observe Measure systems Experimental design Examine documentation Collect data Gather statistics Document process Factorial design 9

10 usec spcl.inf.ethz.ch Trivial Example: Simple ping-pong latency benchmark The latency of Piz Daint is 1.77us! ~1.ms How did you get this number? I averaged 10 6 experiments, it must be right! Why do you think so? Can I see the data? ~1.77us sample TH, Belli: Scientific Benchmarking of Parallel Computing Systems, IEEE/ACM SC15 10

11 Dealing with variation The 99.9% confidence interval is 1.765us to 1.775us Did you assume normality? Ugs, What? the Isn t data that is not always normal at all. The the nonparametric case with many 99.9% CI is much measurements? wider: 1.6us to 1.9us! Can we test for normality? TH, Belli: Scientific Benchmarking of Parallel Computing Systems, IEEE/ACM SC15 11

12 Looking at the data in detail This CI makes me nervous. Let s zoom! Clearly, the mean/median are not sufficient! Try quantile regression! S D Image credit: nersc.gov 1

Scientific benchmarking of parallel computing systems ACM/IEEE Supercomputing 015 (SC15) + talk online on youtube!

reporting of the subsets base case. of standard benchmarks Rule 3: Use or the applications arithmetic mean or not only using for all summarizing system resources. costs.

Only if these are not available use the Rule 5: Report if the measurement values are deterministic. For geometric mean for summarizing ratios.

, based on Rule 7: Carefully investigate if measures of central tendency the number of samples) without diagnostic checking.

13 Scientific benchmarking of parallel computing systems ACM/IEEE Supercomputing 015 (SC15) + talk online on youtube! Rule 1: When publishing parallel speedup, report if the base case is a single parallel process or best serial execution, as well as Rule the : absolute Specify the execution reason performance only reporting of the subsets base case. of standard benchmarks Rule 3: Use or the applications arithmetic mean or not only using for all summarizing system resources. costs. Use the Rule 4: Avoid summarizing ratios; summarize the costs or rates that harmonic mean for summarizing rates. the ratios base on instead. Only if these are not available use the Rule 5: Report if the measurement values are deterministic. For geometric mean for summarizing ratios. nondeterministic data, report confidence intervals of the Rule 6: Do not assume measurement. normality of collected data (e.g., based on Rule 7: Carefully investigate if measures of central tendency the number of samples) without diagnostic checking. such as Rule mean 8: Carefully or median investigate are useful if to measures report. Some of central problems, tendency such such as worst-case mean latency, median are may useful require to other report. percentiles. Rule 9: Document all varying factors and their Some levels problems, as well as the complete such as worst-case experimental latency, setup may (e.g., require software, other hardware, percentiles. techniques) Rule 10: For parallel time measurements, report all measurement, to facilitate reproducibility and provide interpretability. (optional) Rule 11: If synchronization, possible, show upper and summarization performance bounds techniques. to facilitate Rule 1: Plot as much information as needed to interpret the interpretability of the measured results. experimental results. Only connect measurements by lines if they indicate trends and the interpolation is valid. 13

Simplifying Measuring and Reporting: LibSciBench Simple MPI-like C/C+ interface High-resolution timers Flexible data collection Controlled by environment variables

14 Simplifying Measuring and Reporting: LibSciBench Simple MPI-like C/C+ interface High-resolution timers Flexible data collection Controlled by environment variables Tested up to 51k ranks Parallel timer synchronization R scripts for data analysis and visualization S. Di Girolamo, TH: 14

15 We have the (statistically sound) data, now what? t(n=100)? t(n=1510)? Matrix Multiply t(n) = a*n 3 The 99% confidence interval is within 1% of the reported median. TH, W. Gropp, M. Snir, W. Kramer: Performance Modeling for Systematic Performance Tuning, IEEE/ACM SC11 15

16 We have the (statistically sound) data, now what? t(n=100)=0.667s t(n=1510)=0.48s The 99% confidence interval is within 1% of the reported median. The adjusted R of the model fit is 0.99 TH, W. Gropp, M. Snir, W. Kramer: Performance Modeling for Systematic Performance Tuning, IEEE/ACM SC11 16

While a model can never be truth, a model might be ranked from very useful, to useful, to somewhat useful to,

17 Part II: Model Model Burnham, Anderson: A model is a simplification or approximation of reality and hence will not reflect all of reality.... Box noted that all models are wrong, but some are useful. While a model can never be truth, a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless. This is generally true for all kinds of modeling. We focus on performance modeling in the following! 17

18 Performance Modeling Capability Model Performance Model Requirements Model TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC

Requirements modeling I: Six-step performance

parameters Describe application kernels Fit

computation overlap Communication pattern

19 Requirements modeling I: Six-step performance modeling Input parameters Communication parameters Describe application kernels Fit sequential baseline Communication / computation overlap Communication pattern 10-0% speedup [] [1] TH, W. Gropp, M. Snir and W. Kramer: Performance Modeling for Systematic Performance Tuning, SC11 [] TH and S. Gottlieb: Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient using MPI Datatypes, EuroMPI 10 19

Requirements modeling II: Automated best-fit modeling Manual kernel selection and hypothesis generation is time consuming (boring and tricky) Idea: Automatically select best (scalability) model from

20 Requirements modeling II: Automated best-fit modeling Manual kernel selection and hypothesis generation is time consuming (boring and tricky) Idea: Automatically select best (scalability) model from predefined search space number of terms n = i j e.g., number of f ( p) c p k log k ( p) processes k k = 1 (model) constant n Î i k Î I j k Î J I, J Ì n =1 I = { 0,1, } J = {0,1} p p log(p) p log(p) p log(p) [1]: A. Calotoiu, TH, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, IEEE/ACM SC13 0

21 Requirements modeling II: Automated best-fit modeling Manual kernel selection and hypothesis generation is time consuming (and boring) Idea: Automatically select best model from predefined space n f (p) = åc p i k log j k (p) k c1 log( p) + c k=1 c log( p) + c n = I = { 0,1, } J = {0,1} + c p + c p + c log(p) + c p log(p) + c p log(p) c c c c c c c c log( p) + c log( p) + c p + c p p + c p + c + c p p p log( p) p p p p log( p) + c p log( p) + c p log( p) [1]: A. Calotoiu, T. Hoefler, M. Poke, F. Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, IEEE/ACM SC13 p log( p) log( p) p p log( p) log( p) n Î i k Î I j k Î J I, J Ì 1

Requirements modeling III: Source-code analysis [1]

What if the data is not sufficient or too noisy?

loop iterations symbolically for (j = 1; j <= n; j

OperationInBody(j,k); Parallel program Loop

Kwasniewski: Automatic Complexity Analysis of

22 Requirements modeling III: Source-code analysis [1] Extra-P selects model based on best fit to the data What if the data is not sufficient or too noisy? Back to first principles The source code describes all possible executions Describing all possibilities is too expensive, focus on counting loop iterations symbolically for (j = 1; j <= n; j = j*) for (k = j; k <= n; k = k++) OperationInBody(j,k); Parallel program Loop extraction N = ( n + 1)log n n + [1]: TH, G. Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs, ACM SPAA 14 Requirements Models W = N p= 1 D = N p Number of iterations

23 Performance Modeling Capability Model Performance Model p p log(p) p log(p) p log(p) Requirements Input Model Commun ication paramet ers paramet ers Describe application kernels TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC Fit sequenti al baseline Communicati on / computation overlap Commun ication pattern

24 Performance Modeling Capability Model Performance Model p p log(p) p log(p) p log(p) Requirements Input Model Commun ication paramet ers paramet ers Describe application kernels TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC Fit sequenti al baseline Communicati on / computation overlap Commun ication pattern

25 Capability models for cache-to-cache communication write read X = Invalid read R I = 78 ns = Local read: R L = 8.6 ns Remote read R R = 35 ns S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC 13 6

Performance Modeling Capability Model Performance Model p p log(p) p log(p) p log(p) Requirements Input Model Commun ication paramet ers paramet ers Describe application

26 Performance Modeling Capability Model Performance Model p p log(p) p log(p) p log(p) Requirements Input Model Commun ication paramet ers paramet ers Describe application kernels TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC Fit sequenti al baseline Communicati on / computation overlap Commun ication pattern

Design optimal algorithms or systems in the model Can lead to non-intuitive designs

bottleneck t(n) = 76ps * n 3 Flop rate R = flop * n 3 /(76ps * n 3 ) = 7.

27 Part III: Understand Use models to 1. Proof optimality of real implementations Stop optimizing, step back to algorithm level. Design optimal algorithms or systems in the model Can lead to non-intuitive designs Understand Proof optimality of matrix multiplication Intuition: flop rate is the bottleneck t(n) = 76ps * n 3 Flop rate R = flop * n 3 /(76ps * n 3 ) = 7.78 Gflop/s Flop peak: GHz * 8 flops = Gflop/s Achieved ~90% of peak (IBM Power 7 Gets more complex quickly Imagine sparse matrix-vector 8

28 Design algorithms bcast in cache-to-cache model Multi-ary tree example 0 depth d = Tree depth Level size 1 k 1 = k = 3 Tree cost Reached threads S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC 13 30

29 Measured results small broadcast and reduction P=10 4.7x P=58 3.3x Intel Xeon Phi 5110P (60 cores at 105 MHz), Intel MPI v each operation timed separately, reporting maximum across processes S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC 13 31

kernels TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for

30 Performance Modeling Capability Model Performance Model p p log(p) p log(p) p log(p) Requirements Input Model Commun ication paramet ers paramet ers Describe application kernels TH: Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC Fit sequenti al baseline Communicati on / computation overlap Commun ication pattern

31 How to continue from here? DAPP Transformation System DAPP Parallel Language Data-centric, explicit requirements models User-supported, compile- and run-time memlets + operators = DCIR Performance-transparent Platforms HTM [1] RMA fompi-na [] NISA [3] [1]: M. Besta, TH: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, ACM HPDC 15 []: R. Belli, TH: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS 15 [3]: S. Di Girolamo, P. Jolivet, K. D. Underwood, TH: Exploiting Offload Enabled Network Interfaces, IEEE Micro 16 33

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL

SABELA RAMOS, TORSTEN HOEFLER Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL spcl.inf.ethz.ch Microarchitectures are becoming more and more complex CPU L1 CPU L1 CPU L1 CPU