Characterizing Imbalance in Large-Scale Parallel Programs. David Böhme, September 26, 2013
Slide 2: Need for performance analysis tools. The amount of parallelism in supercomputers keeps growing, and efficient resource usage depends on software performance. Jugene (2008): … cores; Juqueen (2013): … cores.
Slide 3: Imbalance limits parallel efficiency. Imbalance leads to wait states at synchronization points, e.g., a late-sender wait state, where a Recv on one process must wait for the matching Send on another. Goal: locate inefficient parallelism and quantify its impact.
Slide 4: Two novel analysis methods. Root-cause analysis: determine the delays that cause wait states. Critical-path analysis: identify the activities that determine program runtime.
Slide 5: Outline. Parallel performance analysis; root-cause analysis (concepts, case study); critical-path analysis (the critical path, the critical-path imbalance indicator, analysis of MPMD programs); implementation (parallel trace replay, scalability evaluation).
Slide 6: Performance analysis tools. Tools help understand performance, but profiling provides limited insight and tracing produces too much data to analyze manually. (Shown: a time-profile display in TAU and an event-trace timeline visualization in Vampir.)
Slide 7: Imbalance analysis pitfalls. Profilers underestimate the impact of imbalance: data aggregation hides dynamic imbalance effects (static vs. dynamic imbalance). An analysis solution needs to retain the performance dynamics, so use automatic trace analysis.
Slide 8: Analysis workflow. Source modules are instrumented by the compiler/linker to produce an instrumented executable; the application run generates local event traces; a parallel analysis then produces a global analysis report that answers: Which problem? Where in the code? Which process?
Slide 9: Outline, resumed at Root-Cause Analysis (concepts, case study).
Slide 10: Delay as root cause of wait states. A delay is additional activity on one process that causes a wait state at a synchronization point. Example: a delay between R1 and S2 on process 1 causes a late-sender wait state on process 2.
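The late-sender situation on this slide can be sketched in a few lines. This is an illustrative example, not Scalasca code: when the receiver enters its Recv before the sender enters the matching Send, the difference is the waiting time that the sender's delay produced.

```python
# Illustrative sketch: classifying a late-sender wait state from
# timestamped trace events (timestamps in seconds).

def late_sender_wait(send_enter, recv_enter):
    """Waiting time on the receiver side.

    send_enter: time the sender entered the matching Send
    recv_enter: time the receiver entered the Recv
    """
    return max(0.0, send_enter - recv_enter)

# Process 1 is delayed and enters Send at t=5.0;
# process 2 enters Recv at t=2.0 and must wait.
wait = late_sender_wait(5.0, 2.0)
print(wait)  # 3.0 seconds of late-sender waiting caused by the delay
```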
Slide 11: Wait-state propagation. Wait states can propagate: a wait state may itself delay subsequent communication and cause further wait states on other processes. The analysis must account for propagation through an extended wait-state classification and by incorporating long-distance effects into the calculation of delay costs.
Slide 12: Wait-state classification. Distinguish propagating wait states, which delay other processes and cause further wait states downstream, from terminal wait states, which do not.
Slide 13: Assigning delay costs. Delay costs represent the amount of waiting time caused by a delay: short-term costs cover wait states caused directly, long-term costs cover wait states caused via propagation.
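The cost split above can be expressed as simple bookkeeping. The function and data layout below are illustrative assumptions, not Scalasca's internal API; they only show that a delay's total cost is the sum of its direct and propagated waiting time.

```python
# Hedged sketch of delay-cost bookkeeping: short-term costs are waiting
# times caused directly by a delay, long-term costs are waiting times
# caused via propagation.

def delay_costs(direct_waits, propagated_waits):
    short_term = sum(direct_waits)
    long_term = sum(propagated_waits)
    return {"short_term": short_term,
            "long_term": long_term,
            "total": short_term + long_term}

# A breakdown where most of the cost stems from propagation,
# as in the CESM sea ice case study on the following slides.
costs = delay_costs(direct_waits=[1.0], propagated_waits=[1.5, 1.5])
print(costs["total"])                        # 4.0
print(costs["long_term"] / costs["total"])   # 0.75, i.e. 75% long-term
```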
Slide 14: Case study: CESM sea ice model. Analysis of imbalance in the CESM sea ice model, with performance data mapped onto the application topology: the distribution of computation versus the distribution of late-sender waiting time. (CICE setup: 2048 processes on Blue Gene/P, 1° dipole grid, cartesian grid decomposition.)
Slide 15: CESM sea ice model: wait-state formation. Propagating versus terminal wait states; distribution of delay costs: 25% short-term, 75% long-term.
Slide 16: Outline, resumed at Critical-Path Analysis.
Slide 17: Critical-path analysis. Use automatic trace analysis to extract the critical path of a parallel program (shown in red on the slide). Performance indicators derived from it show bottlenecks at a single glance.
Slide 18: Critical-path profile and imbalance. From the timeline, the critical-path profile shows wall-clock time consumption per activity, in contrast to a summary profile over the CPU allocation time. The critical-path imbalance indicator finds inefficient parallelism: Imbalance = T_critical − T_avg.
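The indicator on this slide is easy to compute once a critical-path profile exists. A minimal sketch, assuming per-activity data is already extracted from the traces: T_critical is the activity's time on the critical path, T_avg its average time across all processes.

```python
# Sketch of the critical-path imbalance indicator from the slide:
# Imbalance = T_critical - T_avg, computed per activity.

def imbalance_indicator(critical_time, per_process_times):
    t_avg = sum(per_process_times) / len(per_process_times)
    return critical_time - t_avg

# A perfectly balanced activity contributes nothing...
print(imbalance_indicator(10.0, [10.0, 10.0, 10.0, 10.0]))  # 0.0
# ...while an imbalanced one (one overloaded process) shows up clearly,
# even though its maximum wall-clock time looks identical in a profile.
print(imbalance_indicator(10.0, [10.0, 4.0, 4.0, 4.0]))     # 4.5
```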
Slide 19: Example: PEPC. Analysis of the plasma-physics code PEPC using 512 processes on Blue Gene/P. Profile metrics (average and maximum wall-clock time for the application kernels tree_walk and sum_force) underestimate the performance impact of the tree-walk kernel due to dynamic load imbalance; the critical-path profile and critical-path imbalance indicator expose it.
Slide 20: Analysis of MPMD programs. Processes execute different activities, e.g., master-worker schemes or the heterogeneous decomposition in ddcMD. This raises complex imbalance-analysis issues not supported by existing tools: imbalance quantification needs to incorporate the partition sizes, and there are more knobs to tune. (Figure: task layout on the Blue Gene/P torus; cyan arrows indicate communication from MD tasks to collector tasks, blue arrows communication from collector tasks to FFT tasks. Image from Richards et al.: Beyond Homogeneous Decomposition, SC'10.)
Slide 21: Performance impact indicators. They denote the allocation-time costs of imbalance: waiting time is mapped onto the critical-path activities with excess load, distinguishing intra-partition and inter-partition imbalance costs (one case may show high intra-partition and low inter-partition costs, another very high inter-partition and low intra-partition costs).
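A rough sketch of the intra/inter distinction, under the simplifying assumption that imbalance is measured directly from per-process load rather than via the mapped waiting times of the dissertation; function and names are illustrative only.

```python
# Illustrative split of imbalance within partitions (intra) versus
# between partitions (inter), given per-rank load and a partition map.

from collections import defaultdict

def split_imbalance(loads, partition_of):
    groups = defaultdict(list)
    for rank, load in enumerate(loads):
        groups[partition_of[rank]].append(load)
    # Intra: excess over the partition mean, summed over all partitions.
    intra = sum(max(g) - sum(g) / len(g) for g in groups.values())
    # Inter: spread between the partition means themselves.
    means = [sum(g) / len(g) for g in groups.values()]
    inter = max(means) - sum(means) / len(means)
    return intra, inter

loads = [4.0, 4.0, 8.0, 8.0]                  # both partitions balanced,
part = {0: "A", 1: "A", 2: "B", 3: "B"}       # but B carries twice the load
intra, inter = split_imbalance(loads, part)
print(intra, inter)  # 0.0 2.0 -> purely inter-partition imbalance
```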
Slide 22: Example: ddcMD. The ddcMD molecular dynamics simulation on Blue Gene/P calculates particle and mesh forces in different partitions of fixed size; the mesh size is tuned to adjust the load balance. (Figure: task layout on the Blue Gene/P torus, from D. Richards et al.: Beyond Homogeneous Decomposition, SC'10.)
Slide 23: ddcMD: mesh-size tuning. A small mesh size increases the workload of the particle tasks; increasing the mesh size shifts the critical path to the mesh tasks. (Chart: runtime in seconds and resource consumption in CPU-hours versus mesh size, showing the critical path on particle tasks, the critical path on mesh tasks, total resource consumption, and inter-partition imbalance costs.)
Slide 24: Outline, resumed at Implementation.
Slide 25: Implementation. The methods are integrated in the Scalasca trace-analysis toolset, whose highly scalable parallel trace analysis had so far been used only for wait-state detection.
Slide 26: Parallel trace replay. The application records timestamped communication events, one trace file per process (e.g., rank i: ENTER, SEND, EXIT; rank j: ENTER, RECV, EXIT). The analysis processes traverse the traces in parallel and exchange information at the original synchronization points.
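The replay idea can be simulated in miniature. In reality Scalasca runs one analysis process per trace and re-enacts the original MPI communication; the single-process sketch below only mimics the data exchange at matching send/recv pairs (trace layout is an assumption for illustration).

```python
# Single-process simulation of forward trace replay: traverse two ranks'
# traces and "exchange" the sender's timestamp at every original
# send/recv pair so the receiver can classify its wait state.

trace_rank_i = [("ENTER", 0.0), ("SEND", 5.0), ("EXIT", 5.5)]  # delayed sender
trace_rank_j = [("ENTER", 0.0), ("RECV", 2.0), ("EXIT", 5.6)]

def replay(sender_trace, receiver_trace):
    waits = []
    send_times = [t for ev, t in sender_trace if ev == "SEND"]
    recv_times = [t for ev, t in receiver_trace if ev == "RECV"]
    for t_send, t_recv in zip(send_times, recv_times):
        if t_send > t_recv:              # late sender: receiver waited
            waits.append(t_send - t_recv)
    return waits

print(replay(trace_rank_i, trace_rank_j))  # [3.0]
```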
Slide 27: Trace-analysis extensions. Use multiple replay passes; a backward replay lets data travel from effect to cause, against the original message direction.
Slide 28: Delay detection via backward replay. 1. Identify the synchronization interval. 2. Determine the per-activity time vectors t_s (sender) and t_r (receiver). 3. Transfer the vector t_r to the sender. 4. Locate the delay. 5. Calculate the delay costs.
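The five steps above can be sketched as follows. This is a hedged simplification, with assumed data structures: t_s and t_r map activities to the time spent in the synchronization interval; after t_r travels to the sender, the delay is located as the excess of t_s over t_r and charged with the waiting time it produced.

```python
# Sketch of steps 4 and 5 on the sender side, after backward replay has
# delivered the receiver's time vector t_r (step 3).

def locate_delay(t_s, t_r, wait_time):
    """t_s, t_r: activity -> time spent in the synchronization interval."""
    excess = {a: t_s.get(a, 0.0) - t_r.get(a, 0.0)
              for a in set(t_s) | set(t_r)
              if t_s.get(a, 0.0) > t_r.get(a, 0.0)}
    total = sum(excess.values())
    # Step 5: apportion the waiting time over the delaying activities.
    return {a: wait_time * e / total for a, e in excess.items()}

t_s = {"work": 4.0, "extra_work": 3.0}   # sender's interval
t_r = {"work": 4.0}                      # receiver's interval
print(locate_delay(t_s, t_r, wait_time=3.0))  # {'extra_work': 3.0}
```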
Slide 29: Critical-path extraction in backward replay. A per-process flag c marks whether the critical path currently runs through that process. During the backward replay, the flag is handed off at each wait state (c=true moves to the process that caused the wait, c=false remains on the waiting process), so waiting time is never part of the critical path.
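The flag hand-off can be sketched with an illustrative event model (the tuple layout and activity names are assumptions, not the actual trace format): walking a merged event list backwards in time, the flag starts on the rank that finishes last and jumps to the delaying rank at every wait state.

```python
# Sketch of critical-path extraction: a boolean flag marks the rank
# currently on the critical path while events are visited in reverse
# time order; at a wait state the flag moves to the rank that caused it.

def extract_critical_path(events):
    """events: list of (rank, activity, waited_for) in reverse time order;
    waited_for is the rank that caused the wait, or None for computation."""
    on_path = events[0][0]            # start on the rank that finishes last
    path = []
    for rank, activity, waited_for in events:
        if rank != on_path:
            continue                  # flag (c=false) not on this rank
        if waited_for is not None:
            on_path = waited_for      # hand the flag to the delaying rank
        else:
            path.append((rank, activity))
    return list(reversed(path))       # report in forward time order

events = [(2, "finish", None), (2, "wait_recv", 1), (1, "long_work", None),
          (2, "setup", None), (1, "setup", None)]
print(extract_critical_path(events))
# [(1, 'setup'), (1, 'long_work'), (2, 'finish')]
```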
Slide 30: Scalability. Wall time (log scale, roughly 100 to 1000 s) versus process count for the uninstrumented program execution, wait-state analysis only, wait-state plus critical-path analysis, and combined wait-state, delay, and critical-path analysis (all via trace replay): scalability of root-cause and critical-path analysis for the Sweep3D benchmark on Blue Gene/P.
Slide 31: Summary. Two novel methods to locate and quantify imbalance: root-cause analysis identifies delays and explains the formation of wait states; critical-path analysis determines the impact of inefficiency on runtime. The implementation is highly scalable, handling O(100k) processes.
Slide 32: Thank you. Further reading: D. Böhme, B. R. de Supinski, M. Geimer, M. Schulz, and F. Wolf: Scalable Critical-Path Based Performance Analysis (IPDPS). D. Böhme, M. Geimer, and F. Wolf: Characterizing Load and Communication Imbalance in Large-Scale Parallel Applications (IPDPS PhD Forum). D. Böhme, M. Geimer, F. Wolf, and L. Arnold: Identifying the root causes of wait states in large-scale parallel applications (ICPP, best paper award). D. Böhme, M.-A. Hermanns, M. Geimer, and F. Wolf: Performance simulation of non-blocking communication in message-passing applications (PROPER).
Slide 33: SPEC MPI2007 case studies. (Charts: tracing runtime overhead, comparing uninstrumented execution, instrumented execution, and trace-replay times; and critical-path versus profile load-imbalance metrics. Benchmarks include RAxML, 122.tachyon, 121.pop2, 104.milc, 137.lu, 132.zeusmp2, 129.tera_tf, 128.GAPgeofem, 126.lammps, 147.l2wrf2, 145.lGemsFDTD, and 143.dleslie.)
More informationArchitecture of Parallel Computer Systems - Performance Benchmarking -
Architecture of Parallel Computer Systems - Performance Benchmarking - SoSe 18 L.079.05810 www.uni-paderborn.de/pc2 J. Simon - Architecture of Parallel Computer Systems SoSe 2018 < 1 > Definition of Benchmark
More informationIS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers
IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers Abhinav S Bhatele and Laxmikant V Kale ACM Research Competition, SC 08 Outline Why should we consider topology
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationHAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms
HAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms SIMUTOOLS 2015 Athens, Greece Mario Bielert, Florina M. Ciorba, Kim Feldhoff, Thomas Ilsche, Wolfgang E. Nagel
More informationCSE5351: Parallel Processing Part III
CSE5351: Parallel Processing Part III -1- Performance Metrics and Benchmarks How should one characterize the performance of applications and systems? What are user s requirements in performance and cost?
More informationMethod-Level Phase Behavior in Java Workloads
Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS
More informationPerformance engineering of GemsFDTD computational electromagnetics solver
Performance engineering of GemsFDTD computational electromagnetics solver Ulf Andersson 1 and Brian J. N. Wylie 2 1 PDC Centre for High Performance Computing, KTH Royal Institute of Technology, Stockholm,
More informationAdvanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED
Advanced Software for the Supercomputer PRIMEHPC FX10 System Configuration of PRIMEHPC FX10 nodes Login Compilation Job submission 6D mesh/torus Interconnect Local file system (Temporary area occupied
More informationThe Potential of Diffusive Load Balancing at Large Scale
Center for Information Services and High Performance Computing The Potential of Diffusive Load Balancing at Large Scale EuroMPI 2016, Edinburgh, 27 September 2016 Matthias Lieber, Kerstin Gößner, Wolfgang
More informationHiTune. Dataflow-Based Performance Analysis for Big Data Cloud
HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241
More informationNetwork-on-chip (NOC) Topologies
Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance
More informationDesigning Parallel Programs. This review was developed from Introduction to Parallel Computing
Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis
More informationPerformance of Multihop Communications Using Logical Topologies on Optical Torus Networks
Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,
More informationEE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 14 Parallelism in Software V
EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 14 Parallelism in Software V Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality, Fall 2011 --
More informationAssignment 5. Georgia Koloniari
Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationThe Flooding Time Synchronization Protocol
The Flooding Time Synchronization Protocol Miklos Maroti, Branislav Kusy, Gyula Simon and Akos Ledeczi Vanderbilt University Contributions Better understanding of the uncertainties of radio message delivery
More informationEliminate Threading Errors to Improve Program Stability
Eliminate Threading Errors to Improve Program Stability This guide will illustrate how the thread checking capabilities in Parallel Studio can be used to find crucial threading defects early in the development
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationSlack: A New Performance Metric for Parallel Programs
Slack: A New Performance Metric for Parallel Programs Jeffrey K. Hollingsworth Barton P. Miller hollings@cs.umd.edu bart@cs.wisc.edu Computer Science Department University of Maryland College Park, MD
More informationbtrecord and btreplay User Guide
btrecord and btreplay User Guide Alan D. Brunelle (Alan.Brunelle@hp.com) December 10, 2011 Abstract The btrecord and btreplay tools provide the ability to record and replay IOs captured by the blktrace
More informationPeta-Scale Simulations with the HPC Software Framework walberla:
Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,
More informationFast Dynamic Load Balancing for Extreme Scale Systems
Fast Dynamic Load Balancing for Extreme Scale Systems Cameron W. Smith, Gerrett Diamond, M.S. Shephard Computation Research Center (SCOREC) Rensselaer Polytechnic Institute Outline: n Some comments on
More informationObject Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ
45 Object Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ Department of Computer Science The Australian National University Canberra, ACT 2611 Email: fzhen.he, Jeffrey.X.Yu,
More informationCray XC Scalability and the Aries Network Tony Ford
Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?
More informationSome aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)
Some aspects of parallel program design R. Bader (LRZ) G. Hager (RRZE) Finding exploitable concurrency Problem analysis 1. Decompose into subproblems perhaps even hierarchy of subproblems that can simultaneously
More informationExperiences with Achieving Portability across Heterogeneous Architectures
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore
More informationA CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS
A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS Abhinav Bhatele, Eric Bohm, Laxmikant V. Kale Parallel Programming Laboratory Euro-Par 2009 University of Illinois at Urbana-Champaign
More informationPerformance without Pain = Productivity Data Layout and Collective Communication in UPC
Performance without Pain = Productivity Data Layout and Collective Communication in UPC By Rajesh Nishtala (UC Berkeley), George Almási (IBM Watson Research Center), Călin Caşcaval (IBM Watson Research
More information"Charting the Course to Your Success!" MOC A Developing High-performance Applications using Microsoft Windows HPC Server 2008
Description Course Summary This course provides students with the knowledge and skills to develop high-performance computing (HPC) applications for Microsoft. Students learn about the product Microsoft,
More informationTrade- Offs in Cloud Storage Architecture. Stefan Tai
Trade- Offs in Cloud Storage Architecture Stefan Tai Cloud computing is about providing and consuming resources as services There are five essential characteristics of cloud services [NIST] [NIST]: http://csrc.nist.gov/groups/sns/cloud-
More informationI/O at JSC. I/O Infrastructure Workloads, Use Case I/O System Usage and Performance SIONlib: Task-Local I/O. Wolfgang Frings
Mitglied der Helmholtz-Gemeinschaft I/O at JSC I/O Infrastructure Workloads, Use Case I/O System Usage and Performance SIONlib: Task-Local I/O Wolfgang Frings W.Frings@fz-juelich.de Jülich Supercomputing
More informationMessage Passing Interface (MPI) on Intel Xeon Phi coprocessor
Message Passing Interface (MPI) on Intel Xeon Phi coprocessor Special considerations for MPI on Intel Xeon Phi and using the Intel Trace Analyzer and Collector Gergana Slavova gergana.s.slavova@intel.com
More informationDESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID
I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort
More informationsimulation framework for piecewise regular grids
WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler
More informationLecture 4: Principles of Parallel Algorithm Design (part 4)
Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming
More informationOpenACC Support in Score-P and Vampir
Center for Information Services and High Performance Computing (ZIH) OpenACC Support in Score-P and Vampir Hands-On for the Taurus GPU Cluster February 2016 Robert Dietrich (robert.dietrich@tu-dresden.de)
More informationAuto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems
Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems Minesh I. Patel Ph.D. 1, Karl Jordan 1, Mattew Clark Ph.D. 1, and Devesh Bhatt
More informationIntroduction to Performance Engineering
Introduction to Performance Engineering Markus Geimer Jülich Supercomputing Centre (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray) Performance: an old problem
More informationPosition Paper: OpenMP scheduling on ARM big.little architecture
Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM
More informationPerformance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks
Performance Cloning: Technique for isseminating Proprietary pplications as enchmarks jay Joshi (University of Texas) Lieven Eeckhout (Ghent University, elgium) Robert H. ell Jr. (IM Corp.) Lizy John (University
More informationAutomatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC
Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis
More informationMark Sandstrom ThroughPuter, Inc.
Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom
More informationMulti-Domain Pattern. I. Problem. II. Driving Forces. III. Solution
Multi-Domain Pattern I. Problem The problem represents computations characterized by an underlying system of mathematical equations, often simulating behaviors of physical objects through discrete time
More informationADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING
ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING Keynote, ADVCOMP 2017 November, 2017, Barcelona, Spain Prepared by: Ahmad Qawasmeh Assistant Professor The Hashemite University,
More informationThe Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla
The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Venugopalan Ramasubramanian Emin Gün Sirer Presented By: Kamalakar Kambhatla * Slides adapted from the paper -
More informationIntroducing Scalileo A Java Based Scaling Framework
Introducing Scalileo A Java Based Scaling Framework Tilmann Rabl, Christian Dellwo, Harald Kosch Chair of Distributed Database Systems First International Conference on Energy-Efficient Computing and Networking
More information