Performance Analysis and Optimization of Scientific Applications on Extreme-Scale Computer Systems

1 Performance Analysis and Optimization of Scientific Applications on Extreme-Scale Computer Systems
  Bernd Mohr
  1st Intl. Workshop on Strategic Development of High Performance Computers (IWSDHPC), Tsukuba, March 18-19, 2013
  Member of the Helmholtz Association

2 Parallel Architectures: State of the Art
  [Figure: nodes N 0 ... N k connected by a network of routers or by a switch; each node is an SMP or NUMA system with memory, interconnects, processors P 0 ... P n and accelerators A 0 ... A m; each processor P i contains cores Core 0 ... Core r with L1, L2 and L3 caches]
  IWSDHPC, Tsukuba, 2013 JSC

3 Exascale Performance Challenges
  Exascale systems will consist of
    Complex configurations
    With a huge number of components
    Very likely heterogeneous
    With not enough memory
    Dynamically changing configuration due to fault recovery and power saving
  Deep software hierarchies of large, complex software components will be required to make use of such systems
  Sophisticated integrated performance measurement, analysis, and optimization capabilities will be required to efficiently operate an Exascale system

4 Cross-Cutting Considerations
  Performance-aware design, development and deployment of hardware and software necessary
  Integration with OS, compilers, middleware and runtime systems required
  Support for performance observability in HW and SW (runtime) needed
  Enable performance measurement and optimization in case of HW and SW changes due to faults or power adaptation

5 Technical Challenges
  Heterogeneity
  Extreme Concurrency
  Perturbation and data volume
  Drawing insight from measurements
  Quality information sources
  This requires tools to be
    Portable
    Insightful
    Scalable
    Integrated

6 Not many Tools match these Requirements
  TAU: University of Oregon, US
  HPCToolkit: Rice University, US
  Extrae/Paraver: Barcelona Supercomputing Centre, Spain
  Vampir toolset: Technical University of Dresden, Germany
  Scalasca: Jülich Supercomputing Centre, Germany

7 PORTABILITY: Run everywhere

8 Scalasca: Supported Platforms
  Instrumentation and measurement only (visual analysis on front-end or workstation)
    Cray XT3/XT4/XT5, XE6, XK6
    IBM BlueGene/L, BlueGene/P, BlueGene/Q
    NEC SX8 and SX9
    K Machine, Fujitsu FX10
    Intel MIC
  Full support (instrumentation, measurement, and automatic analysis)
    Linux IA32, IA64, x86_64, and PPC based clusters
    IBM AIX Power3/4/5/6/7 based clusters
    SGI Linux IA64 and x86_64 based clusters
    SUN/Oracle Solaris Sparc and x86/x86_64 based clusters

9 Now also on the K Computer
  Thanks to Tomotake Nakamura + colleagues at RIKEN!!

10 Known Installations of Scalasca
  Companies
    Bull (France), Dassault Aviation (France), EDF (France), GNS (Germany), MAGMA (Germany), RECOM (Germany), Shell (Netherlands), Sun Microsystems (USA), Qontix (UK)
  Research / HPC Centres
    ANL (USA), BSC (Spain), CEA (France), CERFACS (France), CINECA (Italy), CSC (Finland), CSCS (Switzerland), DLR (Germany), DKRZ (Germany), EPCC (UK), HLRN (Germany), HLRS (Germany), ICHEC (Ireland), IDRIS (France), JSCC (Russia), LLNL (USA), LRZ (Germany), MSU (Russia), NCAR (USA), NCSA (USA), NSCC (China), ORNL (USA), PSC (USA), RZG (Germany), RIKEN (Japan), SARA (Netherlands), SAITC (Bulgaria), TACC (USA)
  Universities
    RPI (USA), RWTH (Germany), TUD (Germany), UOregon (USA), UTK (USA)
  DoD Computing Centers (USA)
    AFRL DSRC, ARL DSRC, ARSC DSRC, ERDC DSRC, Navy DSRC, MHPCC DSRC, SSC-Pacific

11 INSIGHTFULNESS: More than numbers and diagrams

12 A picture is worth 1000 words
  [Figures: trace display of an MPI ring program; trace display of a real world example]

13 What about 1000s of pictures? (with 100s of menu options)

14 Example Automatic Analysis: Late Sender

15 Scalasca: Example MPI Patterns
  (a) Late Sender
  (b) Late Receiver
  (c) Late Sender / Wrong Order
  (d) Wait at N x N
  [Figure: one time-line diagram per pattern, with processes on the vertical axis and time on the horizontal axis; event legend: ENTER, EXIT, SEND, RECV, COLLEXIT]
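The Late Sender and Late Receiver patterns can be illustrated with a minimal sketch. It assumes that only the ENTER timestamps of one matching send/receive pair are known; the function names are hypothetical and this is not Scalasca's trace format or analysis code. In this simplified view, a receiver that enters a blocking receive before the send starts waits until the sender arrives, and a synchronous sender that arrives first waits for the receiver.

```python
def late_sender_wait(send_enter: float, recv_enter: float) -> float:
    """Time the receiver spends blocked because the send started late."""
    return max(0.0, send_enter - recv_enter)


def late_receiver_wait(send_enter: float, recv_enter: float) -> float:
    """Time a synchronous sender spends blocked because the receive started late."""
    return max(0.0, recv_enter - send_enter)


# Hypothetical ENTER timestamps (seconds) of one matching send/receive pair.
send_enter, recv_enter = 5.0, 2.0
print(late_sender_wait(send_enter, recv_enter))    # receiver waited: 3.0
print(late_receiver_wait(send_enter, recv_enter))  # sender did not wait: 0.0
```

Message transfer time is ignored here; a real analysis would also have to match sends to receives and handle non-blocking calls.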

16 Scalasca Example: CESM Sea Ice Module Late Sender Analysis

17 Scalasca Example: CESM Sea Ice Module Late Sender Analysis + Application Topology

18 Scalasca Root Cause Analysis
  Root-cause analysis
    Wait states typically caused by load or communication imbalances earlier in the program
    Waiting time can also propagate (e.g., indirect waiting time)
    Goal: Enhance performance analysis to find the root cause of wait states
  Approach
    Distinguish between direct and indirect waiting time
    Identify call path/process combinations delaying other processes and causing first order waiting time
    Identify original delay
  [Figure: time lines of processes A, B and C, each executing foo and bar; a DELAY on A before its Send causes direct wait in B's Recv, which in turn delays B's Send and causes indirect wait in C's Recv]
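The distinction between direct and indirect waiting time can be sketched with a toy model of the A, B, C chain. The attribution rule below is an assumption made for illustration, not Scalasca's algorithm: each process computes for a given time, all but the first then receive from their predecessor, and each forwards a message to its successor. Waiting time that the sender itself inherited is counted as indirect at the receiver; the remainder is direct. The name `chain_waits` is hypothetical and message transfer time is ignored.

```python
def chain_waits(compute):
    """Return (direct, indirect) waiting time per process in a send/recv chain.

    compute[i] is the local computation time of process i before it
    communicates. Process 0 only sends; every other process receives from
    its predecessor and then sends to its successor.
    """
    waits = []
    send_time = 0.0  # when the previous process posts its send
    prev_wait = 0.0  # waiting time the previous process suffered itself
    for rank, work in enumerate(compute):
        wait = max(0.0, send_time - work) if rank > 0 else 0.0
        indirect = min(wait, prev_wait)  # inherited from the sender's own wait
        direct = wait - indirect         # caused by the sender's excess work
        waits.append((direct, indirect))
        send_time = work + wait          # this process can send only now
        prev_wait = wait
    return waits


# Process A is delayed by 2 s relative to B and C.
print(chain_waits([3.0, 1.0, 1.0]))
# [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
```

In this run B's wait is direct (A's delay) and C's wait is indirect (it propagated through B), so the root cause for both is the original delay on A.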

19 Scalasca Example: CESM Sea Ice Module Direct Wait Time Analysis

20 Scalasca Example: CESM Sea Ice Module Indirect Wait Time Analysis

21 Scalasca Example: CESM Sea Ice Module Delay Costs Analysis

22 EXTREME CONCURRENCY: To infinity and beyond

23 Scaling already important TODAY!
  Number of Cores share for TOP 500 November 2012

  NCore range   Count   Count share   Rmax         Rmax share   Total NCore
  1025-2048     1       0.2%          122 TF       0.1%         1,280
  2049-4096     2       0.4%          155 TF       0.1%         7,...
  ...           ...     ...           8,579 TF     5.3%         551,...
  ...           ...     ...           24,543 TF    15.1%        2,617,986
  > ...         ...     ...           128,574 TF   79.4%        11,707,806
  Total         ...     ...           161,973 TF   100%         14,885,800

  Average system size: 29,772 cores
  Median system size: 15,360 cores
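The average system size quoted on this slide can be checked from the table totals. The sketch assumes that the list contains 500 systems, which follows from the name of the TOP 500 list rather than from a value on the slide.

```python
# Values taken from the slide.
total_cores = 14_885_800     # total cores over all listed systems
rmax_largest_tf = 128_574    # Rmax of the largest size class, in TF
rmax_total_tf = 161_973      # Rmax of all systems, in TF

# Assumption: the TOP 500 list has exactly 500 entries.
systems = 500

average_size = round(total_cores / systems)
largest_class_share = round(100 * rmax_largest_tf / rmax_total_tf, 1)

print(average_size)         # 29772 cores per system on average
print(largest_class_share)  # 79.4 percent of the total Rmax
```

The median (15,360 cores) is roughly half the average, which indicates that a few very large systems dominate the core count.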

24 Personal Motivation (I)
  07 / 2012
  Jugene
    72 rack IBM BlueGene/P
    294,912 cores
    Most parallel system in the world 06/2009 to 06/2011!!!

25 Personal Motivation (II)
  Juqueen
    28 rack IBM BlueGene/Q
    458,752 cores
    1,835,008 HW threads
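The core and thread counts of Jugene and Juqueen follow from the rack counts. The per-rack and per-node figures below are not on the slides; they are assumptions based on the published Blue Gene system layouts (1,024 compute nodes per rack, 4 cores per BlueGene/P node, 16 compute cores with 4 hardware threads each per BlueGene/Q node).

```python
# Assumptions (not stated on the slides): Blue Gene rack and node layout.
NODES_PER_RACK = 1024
BGP_CORES_PER_NODE = 4
BGQ_CORES_PER_NODE = 16
BGQ_THREADS_PER_CORE = 4

jugene_cores = 72 * NODES_PER_RACK * BGP_CORES_PER_NODE
juqueen_cores = 28 * NODES_PER_RACK * BGQ_CORES_PER_NODE
juqueen_threads = juqueen_cores * BGQ_THREADS_PER_CORE

print(jugene_cores)     # 294912
print(juqueen_cores)    # 458752
print(juqueen_threads)  # 1835008
```

A tool that keeps one measurement stream per hardware thread therefore has to handle more than 1.8 million streams on Juqueen.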

26 Roads to Scalability
  Scalable data collection and reduction
    Automatic detection of most important execution phases (Paraver)
    Parallel collection and reduction based on MPI and parallel I/O (All tools)
  Scalable parallel data analysis
    Parallel client/server processing and visualization (Vampir)
    Parallel pattern search, delay and critical-path analysis (Scalasca)
    Parallel analyzer and visualizer (Paraver)
  Scalable visualizations
    3D charts and topology displays (TAU, Scalasca)
    Hierarchical browsers (Scalasca)

27 TAU ParaProf: 3D Profile, Miranda, 16K PEs

28 TAU 3D Topology view / distribution histogram

29 VampirServer BETA: Trace Visualization
  OTF trace of 4.5 TB
  VampirServer running with 20,000 analysis processes

30 Paraver Data Reduction Features
  Accumulation of values using software counters
  Powerful filtering expression over time, processors, states, communications, events
  Automatic structure / phase detection
    Based on signal processing
    Using wavelets (Casas: ParCo 2007)
    Using autocorrelation functions (Casas: Euro-Par 2007)
    Also used for cleanup:
      Preemptions
      Clogged systems / instrumentation overhead
      Flushing
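The idea behind autocorrelation-based phase detection can be shown on a toy signal. This is a minimal sketch, not the method of the cited papers or Paraver's implementation: it assumes the application behavior has already been reduced to one number per time bin (for example, time spent in MPI), and it takes the lag with the highest autocorrelation as the iteration length. The function names and the sample signal are hypothetical.

```python
def autocorrelation(signal, lag):
    """Unnormalized autocorrelation of the mean-centered signal at one lag."""
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]
    return sum(a * b for a, b in zip(centered, centered[lag:]))


def detect_period(signal, max_lag=None):
    """Return the lag (in samples) at which the signal best matches itself."""
    if max_lag is None:
        max_lag = len(signal) // 2
    return max(range(1, max_lag + 1),
               key=lambda lag: autocorrelation(signal, lag))


# Hypothetical metric: one communication burst every fourth time bin.
samples = [4, 0, 0, 0] * 6
print(detect_period(samples))  # 4
```

Once the period is known, a tool can keep detailed data for a few representative iterations and discard or summarize the rest.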

31 Scalasca trace analysis sweep3d@294,912 BGP
  10 min sweep3d runtime
  11 sec replay
  4 min trace data write/read (576 files)
  7.6 TB buffered trace data
  510 billion events
  B. J. N. Wylie, M. Geimer, B. Mohr, D. Böhme, Z. Szebenyi, F. Wolf: Large-scale performance analysis of Sweep3D with the Scalasca toolset. Parallel Processing Letters, 20(4).
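The figures on this slide give a rough feel for the data volume per event and per process. The sketch assumes decimal units (1 TB = 10^12 bytes) and one trace stream per core, with the 294,912 cores taken from the run label; both are assumptions, and the derived values are approximate.

```python
# Values from the slide.
trace_bytes = 7.6e12   # 7.6 TB of buffered trace data (assuming 10**12 bytes per TB)
events = 510e9         # 510 billion events
processes = 294_912    # assumption: one MPI process per core of the run

bytes_per_event = trace_bytes / events
events_per_process = events / processes

print(f"{bytes_per_event:.1f} bytes per event")             # about 15 bytes
print(f"{events_per_process / 1e6:.2f} million events per process")
```

At this scale even a compact event record adds up to terabytes, which is why the analysis itself has to run in parallel.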

32 Scalasca trace analysis BGQ

33 Scalasca trace analysis BGQ

34 INTEGRATION: Together we are strong

35 Integration
  Need integrated tool (environment) for all levels of parallelization
    Inter-node (MPI)
    Intra-node (OpenMP, task-based programming)
    Accelerators (CUDA, OpenCL)
  Integration with performance modeling and prediction
  No tool fits all requirements
    Interoperability of tools
    Integration via open interfaces

36 Status End 2011: Scalasca, TAU, VAMPIR, Paraver, Extrae
  [Figure: diagram of measurement systems, data formats and analysis tools. Labels: PRV trace, Paraver; TAU VT, VampirTrace, OTF / VTF3 trace, VAMPIR; TAU TRACE, TAU trace; Scalasca, EPILOG trace, Trace Analyzer, CUBE3 profile, CUBE3 Presenter; TAU EPILOG; TAU PROFILE, TAU profile; gprof/mpiP profile, PerfDMF, ParaProf; several connections marked X]

37 Status Begin 2013: Scalasca, TAU, VAMPIR, Paraver, Extrae
  [Figure: the same kind of diagram for the beginning of 2013. Labels: PRV trace, Paraver; Score-P, OTF2 trace, VAMPIR, Scalasca Trace Analyzer; CUBE4 profile, CUBE4 Presenter; TAU SCOREP; gprof/mpiP profile, PerfDMF, ParaProf; one connection marked X]

38 Tool Integration: Score-P Objectives
  Mainly funded by SILC, LMAC (BMBF) + PRIMA (DOE) projects
  Make common part of Periscope, Scalasca, TAU, and Vampir a community effort: Score-P measurement system
  Functional requirements
    Performance data: profiles (CUBE4), traces (OTF2)
    Initially direct instrumentation, later also sampling
    Offline and online access
    Metrics: time, communication metrics and hardware counters
    Initially MPI 2 and OpenMP 3, later also CUDA and OpenCL
  Current release: V1.1.1 of Feb

39 Score-P Architecture
  Analysis tools: Vampir, Scalasca, TAU, Periscope
  Data interfaces: Event traces (OTF2), Call-path profiles (CUBE4), Online interface
  Score-P measurement infrastructure: Hardware counter (PAPI), Memory management, etc.
  Instrumentation: Compiler, TAU instrumentor, COBI (binary), MAQAO instrumentor, OPARI 2 (OpenMP), MPI wrappers
  Application (MPI, OpenMP, hybrid)
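How direct instrumentation leads to a call-path profile can be sketched in a few lines. This is only an analogy under stated assumptions, not Score-P's API or implementation: the hypothetical `Profiler` class wraps functions so that every call produces an enter and an exit action, and it accumulates visits and inclusive time per call path.

```python
import functools
import time
from collections import defaultdict


class Profiler:
    """Toy call-path profiler driven by enter/exit instrumentation."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.stack = []                      # current call path
        self.visits = defaultdict(int)       # call path -> number of visits
        self.inclusive = defaultdict(float)  # call path -> inclusive seconds

    def instrument(self, func):
        """Wrap func so that each call is recorded under its call path."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self.stack.append(func.__name__)  # enter
            path = "/".join(self.stack)
            start = self.clock()
            try:
                return func(*args, **kwargs)
            finally:                          # exit, also on exceptions
                self.inclusive[path] += self.clock() - start
                self.visits[path] += 1
                self.stack.pop()
        return wrapper


prof = Profiler()


@prof.instrument
def solve():
    return sum(range(1000))


@prof.instrument
def main():
    solve()
    solve()


main()
for path in sorted(prof.visits):
    print(path, prof.visits[path])
```

Writing each enter and exit with its timestamp to a buffer instead of accumulating it would give an event trace rather than a profile, which is the trade-off between the two data interfaces listed on the slide.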

40 Score-P Partners
  Forschungszentrum Jülich, Germany
  German Research School for Simulation Sciences, Aachen, Germany
  Gesellschaft für numerische Simulation mbH, Braunschweig, Germany
  RWTH Aachen, Germany
  Technische Universität Dresden, Germany
  Technische Universität München, Germany
  University of Oregon, Eugene, USA

41 Funded Integration Projects
  SILC (01/2009 to 12/2011): Unified measurement system (Score-P) for Vampir, Scalasca, Periscope
  PRIMA (08/2009 to 0?/2013): Integration of TAU and Scalasca
  LMAC (08/2011 to 07/2013): Evolution of Score-P; analysis of performance dynamics
  H4H (10/2010 to 09/2013): Hybrid programming for heterogeneous platforms
  HOPSA (02/2011 to 01/2013): Integration of system and application monitoring

42 Integration of Score-P based Tools: HOPSA Workflow
  [Figure: workflow diagram. Labels: application linked to Score-P; profile (CUBE-4); trace (OTF-2); Scalasca wait-state analysis; profile generation; worst-instance visualization; visual exploration with Cube and Vampir; OTF2 to PRV conversion; trace (PRV); visual exploration with Paraver; what-if scenarios with Dimemas; application measured with ThreadSpotter, memory profile, link from Cube to explore memory behavior; application measured with Extrae; system metrics from LAPTA; legend: done / to do]

43 Scalasca Vampir/Paraver integration

44 Scalasca Vampir/Paraver integration

45 Scalasca Vampir/Paraver integration

46 OPEN ISSUES: Future Work

47 Biggest Open Issues
  How to handle asynchronous non-deterministic executions?
    Currently favored programming model at node-level
    Breaks measure-analyze-optimize cycle
  Potential solution
    Use traditional tools only at inter-node level
    Use auto-tuning smart runtime systems inside node
  Further sources of non-determinism
    Failing components and recovery actions
    Components operating at varying speeds to save energy

48 Acknowledgements
  Scalasca team (JSC) (GRS)
    Markus Geimer, Jie Jiang, Michael Knobloch, Daniel Lorenz, Bernd Mohr, Peter Philippen, Christian Rössel, David Böhme, Marc-André Hermanns, Alexandru Calotoiu, Marc Schlütter, Pavel Saviankou, Alexandre Strube, Brian Wylie, Anke Visser, Ilja Zhukov, Monika Lücke, Aamer Shah, Felix Wolf
  Sponsors
