Performance-oriented development
- Performance is often regarded as a post-process applied after an initial version has been created
- Instead, performance must be a concern right from the beginning
- Elements of performance-oriented development:
  - Application divided into compute-intensive kernels
  - Mini-applications that resemble their behavior (analogy: lab mice)
  - Models that describe their behavior analytically
  - Performance studies that describe their behavior experimentally
- Exascale: software/hardware co-design
  - Performance tools will play a key role
- Need to embrace the idea of systematic performance-oriented development
Scalable performance-analysis toolset for parallel codes
- Focus on communication & synchronization
- Integrated performance-analysis process:
  - Performance overview via call-path profiling
  - In-depth study of application behavior via event tracing
- Programming models: MPI, OpenMP
  - Future: support for PGAS and accelerators
www.scalasca.org
Scalasca team
David Böhme, Alexandru Calotoiu, Dominic Eschweiler, Wolfgang Frings, Markus Geimer, Max Görtz, Youssef Hatem, Marc-André Hermanns, Monika Lücke, Michael Knobloch, Daniel Lorenz, Bernd Mohr, Peter Philippen, Christian Rössel, Pavel Saviankou, Christopher Schleiden, Marc Schlütter, Aamer Shah, Christian Siebert, Alexandre Strube, Zoltán Szebenyi, Felix Wolf, Ilya Zhukov
Installations and users
Companies: Bull (France), Dassault Aviation (France), Efield Solutions (Sweden), GNS (Germany), INTES (Germany), MAGMA (Germany), RECOM (Germany), SciLab (France), Shell (Netherlands), SiCortex (USA), Sun Microsystems (USA, Singapore, India), Qontix (UK)
Research/supercomputing centers: Argonne National Laboratory (USA), Barcelona Supercomputing Center (Spain), Bulgarian Supercomputing Centre (Bulgaria), CERFACS (France), CINECA (Italy), Centre Informatique National de l'Enseignement Supérieur (France), Commissariat à l'énergie atomique (France), CaSToRC (Cyprus), CASPUR (Italy), Deutsches Klimarechenzentrum (DKRZ), Deutsches Zentrum für Luft- und Raumfahrt (Germany), Edinburgh Parallel Computing Centre (UK), Federal Office of Meteorology and Climatology (Switzerland), Forschungszentrum Jülich (Germany), IT Center for Science (Finland), High Performance Computing Center Stuttgart (Germany), Irish Centre for High-End Computing (Ireland), IDRIS (France), Karlsruher Institut für Technologie (Germany), Lawrence Livermore National Laboratory (USA), Leibniz-Rechenzentrum (Germany), National Authority for Remote Sensing & Space Science (Egypt), National Center for Atmospheric Research (USA), National Center for Supercomputing Applications (USA), HLRN (Germany), Oak Ridge National Laboratory (USA), PDC Center for High Performance Computing (Sweden), Pittsburgh Supercomputing Center (USA), Potsdam-Institut für Klimafolgenforschung (Germany), Rechenzentrum Garching (Germany), SARA Computing and Networking Services (Netherlands), Shanghai Supercomputing Center (China), Swiss National Supercomputing Center (Switzerland), Texas Advanced Computing Center (USA), Very Large Computing Centre (France)
Universities: King Abdullah University of Science and Technology (Saudi Arabia), Lund University (Sweden), Lomonosov Moscow State University (Russia), Rensselaer Polytechnic Institute (USA), Rheinisch-Westfälische Technische Hochschule Aachen (Germany), Technische Universität Dresden (Germany), Universität Basel (Switzerland), University of Oregon (USA), University of Tennessee (USA), University of Tsukuba (Japan)
+ 9 defense computing centers
Scalasca architecture
[Diagram: source modules pass through the instrumenter (compiler/linker) to produce an instrumented executable; the measurement library, with hardware-counter (HWC) support and an optimized measurement configuration, runs the instrumented target application and produces a summary report plus local event traces; a parallel wait-state search over the traces yields a wait-state report; report manipulation then answers: Which problem? Where in the program? Which process?]
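The instrument-measure-examine pipeline above is typically driven in three stages from the command line. A minimal sketch, assuming a Scalasca 1.x installation with MPI compiler wrappers on the path; the program name `app`, the process count, and the experiment-directory name are illustrative placeholders:

```shell
# Stage 1: instrument - wrap the compile/link step so measurement
# hooks are inserted into the executable
scalasca -instrument mpicc -O2 -o app app.c

# Stage 2: analyze - run the instrumented executable under measurement
# control; -t additionally collects event traces and triggers the
# parallel wait-state search at the end of the run
scalasca -analyze -t mpiexec -np 64 ./app

# Stage 3: examine - open the resulting summary / wait-state report
# (experiment directory name depends on the run configuration)
scalasca -examine epik_app_64_trace
```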
Performance optimizations
- XNS fluid-dynamics code (RWTH Aachen): redundant messages detected, 4-5x faster
- MAGMAfill fluid-dynamics code (MAGMASOFT GmbH): communication bottleneck identified, 25% faster
- INDEED FEM code (GNS mbH): serialization bottleneck identified, 30-40% faster
- Illumination particle simulation (Queen's University): communication bottleneck uncovered, 2x faster
Scalability
- Application study of the ASCI Sweep3D benchmark
- Identified MPI waiting time correlating with computational imbalance
- [Figure: execution time [s] vs. number of processes (1,024 to 262,144) on Jaguar with MK = 10 (default), broken down into measured execution, computation, MPI processing, and MPI waiting]
- Measurements & analyses demonstrated on Jaguar with up to 192k cores and Jugene with up to 288k cores
Performance-oriented development
- Application decomposition
  - Kernel identification (static & dynamic analysis)
  - Kernel extraction (mini-apps)
  - Kernel optimization
- Modeling support
  - Model parameter identification
  - Code validation against model
- Management of performance design documents
  - Model representation
  - Cross-experiment analysis
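The modeling-support steps above can be sketched in miniature: fit the parameters of an assumed analytical model to measured runtimes, then validate a later run against the model's prediction. The strong-scaling model T(p) = t_s + t_p/p, the measurement values, and the 20% deviation threshold below are all illustrative assumptions, not Scalasca functionality or Sweep3D data:

```python
def fit_scaling_model(procs, times):
    """Least-squares fit of the assumed model T(p) = t_s + t_p / p.

    The model is linear in the basis (1, 1/p), so the normal
    equations have a closed-form 2x2 solution.
    """
    n = len(procs)
    xs = [1.0 / p for p in procs]
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(times)
    sxy = sum(x * t for x, t in zip(xs, times))
    t_p = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    t_s = (sy - t_p * sx) / n
    return t_s, t_p

def predict(t_s, t_p, p):
    return t_s + t_p / p

# Synthetic measurements: 2 s serial part + 6400 s parallelizable work
measurements = {64: 102.0, 128: 52.0, 256: 27.0, 512: 14.5}
t_s, t_p = fit_scaling_model(list(measurements), list(measurements.values()))

# Validation: flag a run whose runtime deviates >20% from the model
observed_1024 = 30.0  # hypothetical measurement at 1,024 processes
expected = predict(t_s, t_p, 1024)
deviates = abs(observed_1024 - expected) / expected > 0.20
print(f"t_s={t_s:.2f} s, t_p={t_p:.1f} s, "
      f"expected(1024)={expected:.2f} s, deviates={deviates}")
```

A large deviation signals either a performance bug in the new run or a missing effect in the model (e.g., communication cost growing with p), feeding the cross-experiment analysis mentioned above.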
Article on Scalasca (to be published in October)
Parallel Programming, August 12, 2011
Virtual Institute - High Productivity Supercomputing
- Partnership to develop advanced programming tools for complex simulation codes
- Goals:
  - Improve code quality
  - Speed up development
- Activities:
  - Tool development and integration
  - Training & support
  - PROPER workshop series (in conjunction with Euro-Par)
www.vi-hps.org
Conclusion
- Performance must become a first-class citizen in the development process
- Combination of experimental performance analysis and modeling
  - Requires managing performance design documents
- Paying staff to do performance optimization is worth the money
  - Performance tools will further improve their productivity
- Qualified staff are hard to find
  - We need more software engineering in computational science curricula
Thank you!