Characterizing Imbalance in Large-Scale Parallel Programs. David Böhme, September 26, 2013



Need for Performance Analysis Tools: the amount of parallelism in supercomputers keeps growing, and efficient resource usage depends on software performance. Jugene (2008): cores; Juqueen (2013): cores.

Imbalance limits parallel efficiency: imbalance leads to wait states at synchronization points (diagram: processes 1 and 2, Send/Recv, late-sender wait state). Goal: locate inefficient parallelism and quantify its impact.

Two novel analysis methods (diagram: processes 1-3 with Send/Recv events): root-cause analysis determines the delays that cause wait states; critical-path analysis identifies the activities that determine program runtime.

Outline: Parallel Performance Analysis. Root-Cause Analysis: concepts, case study. Critical-Path Analysis: the critical path, critical-path imbalance indicator, analysis of MPMD programs. Implementation: parallel trace replay, scalability evaluation.

Performance analysis tools: tools help understand performance, but profiling provides limited insight and tracing produces too much data to analyze manually (screenshots: time profile display in TAU, event-trace timeline visualization in Vampir).

Imbalance analysis pitfalls: profilers underestimate the impact of imbalance, and data aggregation hides dynamic imbalance effects (static vs. dynamic imbalance). An analysis solution needs to retain performance dynamics: use automatic trace analysis.

Analysis workflow: source modules pass through the instrumenter (compiler/linker) to produce an instrumented executable; the application run writes local event traces; a parallel analysis condenses them into a global analysis report that answers: Which problem? Where in the code? Which process?


Delay as root cause of wait states: a delay is some additional activity on one process that causes a wait state at a synchronization point. Diagram: a delay inside the work phase on process 1 postpones send S2, so process 2 incurs a late-sender wait state at receive R2. The delay on process 1 causes a late-sender wait state on process 2.
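The late-sender rule above reduces to a one-line computation. A minimal sketch in Python, assuming idealized timestamps (the helper name and the numbers are illustrative, not part of any tool):

```python
# Illustrative helper, not a Scalasca API: the waiting time of a
# late-sender wait state is the gap between the moment the receiver
# posts its receive and the moment the matching send starts.
def late_sender_wait(send_start, recv_start):
    return max(0.0, send_start - recv_start)

# A delay pushes the send on process 1 to t=5.0 while process 2
# already posted the matching receive at t=2.0: 3 s of waiting time.
wait = late_sender_wait(send_start=5.0, recv_start=2.0)
```

If the send starts before the receive is posted, no wait state occurs and the helper returns 0.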

Wait-state propagation: wait states can propagate (diagram: a delay on process 1 propagates wait states across processes 2-4). Account for propagation in the analysis: an extended wait-state classification, and incorporation of long-distance effects into the calculation of delay costs.

Wait-state classification: distinguish propagating and terminal wait states (diagram: the delay causes propagating wait states on the intermediate processes and a terminal wait state on the last one).

Assigning delay costs: delay costs represent the amount of wait time caused by a delay. Short-term costs represent wait states caused directly; long-term costs represent wait states caused via propagation.
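Under a deliberately simplified model, the short-term/long-term split can be sketched as follows (all names and numbers are made up for illustration: a single delay sits at the head of a dependency chain and every wait state downstream is attributed to it):

```python
# Toy attribution: waits_by_distance[0] holds wait time caused directly
# by the delay (short-term cost); later entries hold wait time reached
# only through propagation (long-term cost).
def delay_costs(waits_by_distance):
    short_term = sum(waits_by_distance[:1])
    long_term = sum(waits_by_distance[1:])
    return short_term, long_term

# A 2 s delay causes a 2 s wait on its neighbour, which then propagates
# as further 2 s waits two more hops down the chain.
short, long_ = delay_costs([2.0, 2.0, 2.0])
```

In this toy chain the delay's total cost is 6 s, of which only a third is short-term, mirroring how propagation can dominate the damage a delay does.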

Case study: CESM sea ice model. Analysis of imbalance in the CESM sea ice model, with performance data mapped onto the application topology (distribution of computation, distribution of late-sender waiting time). CICE setup: 2048 processes on BG/P, 1 dipole grid, cartesian grid decomposition.

CESM sea ice model: wait-state formation. Propagating vs. terminal wait states; distribution of delay costs: 25% short-term, 75% long-term.


Critical-path analysis: use automatic trace analysis to extract the critical path; performance indicators show bottlenecks at a single glance (diagram: critical path, shown in red, across processes 1-3).

Critical-path profile and imbalance. From the timeline, the summary profile shows CPU allocation time, whereas the critical-path profile shows wall-clock time consumption. The critical-path imbalance indicator finds inefficient parallelism: Imbalance = T_critical − T_avg.
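The indicator on the slide can be computed directly from a summary profile and a critical-path profile. A hedged sketch with invented numbers (four processes, activity names borrowed from the PEPC example later in the talk):

```python
# Imbalance(a) = T_critical(a) - T_avg(a): time an activity occupies on
# the critical path minus its average time across processes. The data
# below is illustrative only.
def critical_path_imbalance(critical_profile, summary_profile, nprocs):
    return {a: critical_profile.get(a, 0.0) - summary_profile[a] / nprocs
            for a in summary_profile}

summary = {"tree_walk": 40.0, "sum_force": 24.0}   # summed over 4 ranks
critical = {"tree_walk": 16.0, "sum_force": 6.0}   # time on critical path
imb = critical_path_imbalance(critical, summary, nprocs=4)
```

A large positive value (here 6.0 for tree_walk) flags an activity that occupies the critical path far beyond its average share, i.e. inefficient parallelism; a value near zero (sum_force) indicates well-balanced work.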

Example: PEPC. Analysis of the plasma-physics code PEPC using 512 processes on Blue Gene/P. Profile metrics underestimate the performance impact of the tree_walk kernel due to dynamic load imbalance (chart: wall-clock time per application kernel, tree_walk and sum_force, comparing average, maximum, critical-path, and critical-path imbalance values).

Analysis of MPMD programs: processes execute different activities, e.g. master-worker, or the heterogeneous decomposition in ddcMD. This raises complex imbalance analysis issues: it is not supported by existing tools, imbalance quantification needs to incorporate the partition sizes, and there are more knobs to tune. (Figure 2 from Richards et al., Beyond Homogeneous Decomposition, SC 10: task layout on the Blue Gene/P torus; cyan arrows indicate communication from MD tasks to collector tasks, blue arrows communication from collector tasks to FFT tasks.)

Performance impact indicators: denote the allocation-time costs of imbalance; map waiting time onto critical-path activities with excess time; distinguish intra-partition and inter-partition imbalance costs (e.g. high intra-partition with low inter-partition costs, or very high inter-partition with low intra-partition costs).
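One plausible way to split the two cost classes, sketched under loud assumptions: the event format, the partition map, and the rule "classify each wait by whether waiter and cause share a partition" are all illustrative simplifications, not the actual definition used in the talk.

```python
# Toy split of imbalance costs in an MPMD run: waiting time at a
# synchronization counts as intra-partition if both partners belong to
# the same partition, inter-partition otherwise. Data is made up.
def split_imbalance_costs(waits, partition_of):
    intra = inter = 0.0
    for waiter, cause, t in waits:          # (waiting rank, causing rank, seconds)
        if partition_of[waiter] == partition_of[cause]:
            intra += t
        else:
            inter += t
    return intra, inter

partition_of = {0: "particle", 1: "particle", 2: "mesh"}
waits = [(1, 0, 2.0),   # particle task waits for another particle task
         (2, 0, 5.0)]   # mesh task waits for a particle task
intra, inter = split_imbalance_costs(waits, partition_of)
```

High inter-partition costs, as in this toy run, would point at a poor split of resources between partitions rather than imbalance inside one of them.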

Example: ddcMD. The ddcMD molecular dynamics simulation on Blue Gene/P calculates particle and mesh forces in different partitions. The partition sizes are fixed: tune the mesh size to adjust the load balance. (Figure from D. Richards et al., Beyond Homogeneous Decomposition, SC 10: task layout on the Blue Gene/P torus.)

ddcMD: mesh size tuning. A small mesh size increases the workload of the particle tasks; increasing the mesh size shifts the critical path to the mesh tasks. (Charts over mesh size: critical path on particle task, critical path on mesh task, total resource consumption in CPU-hours, runtime in seconds, and inter-partition imbalance costs.)


Implementation: integrated into the Scalasca trace analysis toolset, a highly scalable parallel trace analysis, so far only for wait-state detection.

Parallel trace replay: the application records time-stamped communication events, with one trace file per process (trace rank i: ENTER, SEND, EXIT; trace rank j: ENTER, RECV, EXIT). The analysis traverses the traces in parallel and exchanges information at the original synchronization points.
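A serial re-enactment of the replay idea, with a hypothetical event-tuple format (the real traces are binary files and the analysis runs in parallel; the matching here is simplified to in-order pairing of sends and receives):

```python
# Traverse two per-rank traces, match SEND/RECV pairs in order (as a
# replay would at the original synchronization points), and let the
# receiver compute its late-sender waiting time from the exchanged
# send timestamp. Event format is illustrative, not a real trace format.
def detect_late_sender(trace_i, trace_j):
    sends = [t for kind, t in trace_i if kind == "SEND"]
    recvs = [t for kind, t in trace_j if kind == "RECV"]
    return [max(0.0, s - r) for s, r in zip(sends, recvs)]

trace_rank_i = [("ENTER", 0.0), ("SEND", 4.0), ("EXIT", 5.0)]
trace_rank_j = [("ENTER", 0.0), ("RECV", 1.0), ("EXIT", 5.0)]
waits = detect_late_sender(trace_rank_i, trace_rank_j)
```

Here rank j posted its receive at t=1.0 but the matching send only started at t=4.0, so the replay charges 3 s of late-sender waiting time to rank j.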

Trace analysis extensions: use multiple replay passes; backward replay lets data travel from effect to cause.

Delay detection via backward replay: 1. identify the synchronization interval; 2. determine the time vectors t_s and t_r; 3. transfer the vector t_r; 4. locate the delay; 5. calculate the delay costs. (Diagram: late-sender wait state between send S2 on process 1 and receive R2 on process 2.)
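Steps 2-5 can be sketched under a simplified model: within the synchronization interval, any activity on which the delaying process spent more time than the waiting process counts as delay, and the waiting time is charged to those activities in proportion to their excess. All names and numbers are illustrative.

```python
# t_s, t_r: per-activity time vectors of the delaying (sender) and
# waiting (receiver) process inside the synchronization interval.
# Returns delay costs per activity; the excess time locates the delay,
# and the observed wait time is distributed proportionally over it.
def locate_delay(t_s, t_r, wait):
    excess = {a: max(0.0, t_s.get(a, 0.0) - t_r.get(a, 0.0)) for a in t_s}
    total = sum(excess.values())
    return {a: wait * e / total for a, e in excess.items() if e > 0.0}

# Process 1 spent 3 s more in 'work' than process 2 inside the interval,
# which produced a 3 s late-sender wait state on process 2.
costs = locate_delay({"work": 5.0, "send": 0.5},
                     {"work": 2.0, "send": 0.5}, wait=3.0)
```

Because the backward replay moves from the wait state toward its cause, the receiver's vector t_r is what gets transferred to the sender, where this comparison is carried out.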

Critical-path extraction in backward replay: a flag c marks whether the backward traversal is currently on the critical path; at wait states the flag is handed over so that the path leaves the waiting process and continues on the process that caused the wait (diagram: c=true/c=false annotations along processes 1-3 at receives R1 and R2).
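A toy version of the backward walk, with an invented event format: whenever the traversal reaches a receive that incurred a wait state, the critical path leaves the current rank and continues on the sender's rank, since the waiting time itself is never on the critical path.

```python
# Each rank's trace is a list of (name, sender_rank_or_None, waited)
# tuples; the structure is illustrative only. Walk backward from the
# end of the last-finishing rank, jumping to the sender at every
# receive that waited, and collect the critical-path activities.
def extract_critical_path(traces, end_rank):
    path, rank = [], end_rank
    cursors = {r: len(evts) for r, evts in traces.items()}
    while cursors[rank] > 0:
        cursors[rank] -= 1
        name, sender, waited = traces[rank][cursors[rank]]
        if waited:
            rank = sender          # jump to the cause, skip the wait state
        else:
            path.append((rank, name))
    return list(reversed(path))

traces = {
    1: [("work", None, False), ("send", None, False)],
    2: [("work", None, False), ("recv", 1, True), ("finalize", None, False)],
}
cp = extract_critical_path(traces, end_rank=2)
```

The extracted path runs through rank 1's work and send before returning to rank 2's finalize: exactly the activities whose shortening would shorten the run.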

Scalability: wall time (s) versus number of processes for the Sweep3D benchmark on Blue Gene/P, comparing the uninstrumented program execution, wait-state analysis only (trace replay), wait-state and critical-path analysis (trace replay), and the combined wait-state, delay, and critical-path analysis (trace replay).

Summary: two novel methods to locate and quantify imbalance. Root-cause analysis identifies delays and explains the formation of wait states; critical-path analysis determines the impact of inefficiency on runtime. Highly scalable implementation for O(100k) processes.

Thank you. Further reading: D. Böhme, B. R. de Supinski, M. Geimer, M. Schulz, and F. Wolf: Scalable Critical-Path Based Performance Analysis. IPDPS. D. Böhme, M. Geimer, and F. Wolf: Characterizing Load and Communication Imbalance in Large-Scale Parallel Applications. IPDPS PhD Forum. D. Böhme, M. Geimer, F. Wolf, and L. Arnold: Identifying the root causes of wait states in large-scale parallel applications. ICPP. Best paper award. D. Böhme, M.-A. Hermanns, M. Geimer, and F. Wolf: Performance simulation of non-blocking communication in message-passing applications. PROPER.

SPEC MPI2007 case studies: tracing runtime overhead (uninstrumented execution, instrumented execution, trace replay) and critical-path vs. profile load imbalance metrics (imbalance percentage), across the benchmarks RAxML, 122.tachyon, 121.pop2, 104.milc, 137.lu, 132.zeusmp2, 129.tera_tf, 128.GAPgeofem, 126.lammps, 147.l2wrf2, 145.lGemsFDTD, and 143.dleslie.


More information

Profiling of Data-Parallel Processors

Profiling of Data-Parallel Processors Profiling of Data-Parallel Processors Daniel Kruck 09/02/2014 09/02/2014 Profiling Daniel Kruck 1 / 41 Outline 1 Motivation 2 Background - GPUs 3 Profiler NVIDIA Tools Lynx 4 Optimizations 5 Conclusion

More information

Parallel I/O Libraries and Techniques

Parallel I/O Libraries and Techniques Parallel I/O Libraries and Techniques Mark Howison User Services & Support I/O for scientifc data I/O is commonly used by scientific applications to: Store numerical output from simulations Load initial

More information

Performance Analysis with Periscope

Performance Analysis with Periscope Performance Analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München periscope@lrr.in.tum.de October 2010 Outline Motivation Periscope overview Periscope performance

More information

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Interconnection Networks: Topology. Prof. Natalie Enright Jerger Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

CS4961 Parallel Programming. Lecture 5: Data and Task Parallelism, cont. 9/8/09. Administrative. Mary Hall September 8, 2009.

CS4961 Parallel Programming. Lecture 5: Data and Task Parallelism, cont. 9/8/09. Administrative. Mary Hall September 8, 2009. CS4961 Parallel Programming Lecture 5: Data and Task Parallelism, cont. Administrative Homework 2 posted, due September 10 before class - Use the handin program on the CADE machines - Use the following

More information

Topology and affinity aware hierarchical and distributed load-balancing in Charm++

Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Emmanuel Jeannot, Guillaume Mercier, François Tessier Inria - IPB - LaBRI - University of Bordeaux - Argonne National

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer

More information

Communication Characteristics in the NAS Parallel Benchmarks

Communication Characteristics in the NAS Parallel Benchmarks Communication Characteristics in the NAS Parallel Benchmarks Ahmad Faraj Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 32306 {faraj, xyuan}@cs.fsu.edu Abstract In this

More information

Performance Tools (Paraver/Dimemas)

Performance Tools (Paraver/Dimemas) www.bsc.es Performance Tools (Paraver/Dimemas) Jesús Labarta, Judit Gimenez BSC Enes workshop on exascale techs. Hamburg, March 18 th 2014 Our Tools! Since 1991! Based on traces! Open Source http://www.bsc.es/paraver!

More information

Architecture of Parallel Computer Systems - Performance Benchmarking -

Architecture of Parallel Computer Systems - Performance Benchmarking - Architecture of Parallel Computer Systems - Performance Benchmarking - SoSe 18 L.079.05810 www.uni-paderborn.de/pc2 J. Simon - Architecture of Parallel Computer Systems SoSe 2018 < 1 > Definition of Benchmark

More information

IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers

IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers Abhinav S Bhatele and Laxmikant V Kale ACM Research Competition, SC 08 Outline Why should we consider topology

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

HAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms

HAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms HAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms SIMUTOOLS 2015 Athens, Greece Mario Bielert, Florina M. Ciorba, Kim Feldhoff, Thomas Ilsche, Wolfgang E. Nagel

More information

CSE5351: Parallel Processing Part III

CSE5351: Parallel Processing Part III CSE5351: Parallel Processing Part III -1- Performance Metrics and Benchmarks How should one characterize the performance of applications and systems? What are user s requirements in performance and cost?

More information

Method-Level Phase Behavior in Java Workloads

Method-Level Phase Behavior in Java Workloads Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS

More information

Performance engineering of GemsFDTD computational electromagnetics solver

Performance engineering of GemsFDTD computational electromagnetics solver Performance engineering of GemsFDTD computational electromagnetics solver Ulf Andersson 1 and Brian J. N. Wylie 2 1 PDC Centre for High Performance Computing, KTH Royal Institute of Technology, Stockholm,

More information

Advanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED

Advanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED Advanced Software for the Supercomputer PRIMEHPC FX10 System Configuration of PRIMEHPC FX10 nodes Login Compilation Job submission 6D mesh/torus Interconnect Local file system (Temporary area occupied

More information

The Potential of Diffusive Load Balancing at Large Scale

The Potential of Diffusive Load Balancing at Large Scale Center for Information Services and High Performance Computing The Potential of Diffusive Load Balancing at Large Scale EuroMPI 2016, Edinburgh, 27 September 2016 Matthias Lieber, Kerstin Gößner, Wolfgang

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 14 Parallelism in Software V

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 14 Parallelism in Software V EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 14 Parallelism in Software V Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality, Fall 2011 --

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD

More information

The Flooding Time Synchronization Protocol

The Flooding Time Synchronization Protocol The Flooding Time Synchronization Protocol Miklos Maroti, Branislav Kusy, Gyula Simon and Akos Ledeczi Vanderbilt University Contributions Better understanding of the uncertainties of radio message delivery

More information

Eliminate Threading Errors to Improve Program Stability

Eliminate Threading Errors to Improve Program Stability Eliminate Threading Errors to Improve Program Stability This guide will illustrate how the thread checking capabilities in Parallel Studio can be used to find crucial threading defects early in the development

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Slack: A New Performance Metric for Parallel Programs

Slack: A New Performance Metric for Parallel Programs Slack: A New Performance Metric for Parallel Programs Jeffrey K. Hollingsworth Barton P. Miller hollings@cs.umd.edu bart@cs.wisc.edu Computer Science Department University of Maryland College Park, MD

More information

btrecord and btreplay User Guide

btrecord and btreplay User Guide btrecord and btreplay User Guide Alan D. Brunelle (Alan.Brunelle@hp.com) December 10, 2011 Abstract The btrecord and btreplay tools provide the ability to record and replay IOs captured by the blktrace

More information

Peta-Scale Simulations with the HPC Software Framework walberla:

Peta-Scale Simulations with the HPC Software Framework walberla: Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,

More information

Fast Dynamic Load Balancing for Extreme Scale Systems

Fast Dynamic Load Balancing for Extreme Scale Systems Fast Dynamic Load Balancing for Extreme Scale Systems Cameron W. Smith, Gerrett Diamond, M.S. Shephard Computation Research Center (SCOREC) Rensselaer Polytechnic Institute Outline: n Some comments on

More information

Object Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ

Object Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ 45 Object Placement in Shared Nothing Architecture Zhen He, Jeffrey Xu Yu and Stephen Blackburn Λ Department of Computer Science The Australian National University Canberra, ACT 2611 Email: fzhen.he, Jeffrey.X.Yu,

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE) Some aspects of parallel program design R. Bader (LRZ) G. Hager (RRZE) Finding exploitable concurrency Problem analysis 1. Decompose into subproblems perhaps even hierarchy of subproblems that can simultaneously

More information

Experiences with Achieving Portability across Heterogeneous Architectures

Experiences with Achieving Portability across Heterogeneous Architectures Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore

More information

A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS

A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS Abhinav Bhatele, Eric Bohm, Laxmikant V. Kale Parallel Programming Laboratory Euro-Par 2009 University of Illinois at Urbana-Champaign

More information

Performance without Pain = Productivity Data Layout and Collective Communication in UPC

Performance without Pain = Productivity Data Layout and Collective Communication in UPC Performance without Pain = Productivity Data Layout and Collective Communication in UPC By Rajesh Nishtala (UC Berkeley), George Almási (IBM Watson Research Center), Călin Caşcaval (IBM Watson Research

More information

"Charting the Course to Your Success!" MOC A Developing High-performance Applications using Microsoft Windows HPC Server 2008

Charting the Course to Your Success! MOC A Developing High-performance Applications using Microsoft Windows HPC Server 2008 Description Course Summary This course provides students with the knowledge and skills to develop high-performance computing (HPC) applications for Microsoft. Students learn about the product Microsoft,

More information

Trade- Offs in Cloud Storage Architecture. Stefan Tai

Trade- Offs in Cloud Storage Architecture. Stefan Tai Trade- Offs in Cloud Storage Architecture Stefan Tai Cloud computing is about providing and consuming resources as services There are five essential characteristics of cloud services [NIST] [NIST]: http://csrc.nist.gov/groups/sns/cloud-

More information

I/O at JSC. I/O Infrastructure Workloads, Use Case I/O System Usage and Performance SIONlib: Task-Local I/O. Wolfgang Frings

I/O at JSC. I/O Infrastructure Workloads, Use Case I/O System Usage and Performance SIONlib: Task-Local I/O. Wolfgang Frings Mitglied der Helmholtz-Gemeinschaft I/O at JSC I/O Infrastructure Workloads, Use Case I/O System Usage and Performance SIONlib: Task-Local I/O Wolfgang Frings W.Frings@fz-juelich.de Jülich Supercomputing

More information

Message Passing Interface (MPI) on Intel Xeon Phi coprocessor

Message Passing Interface (MPI) on Intel Xeon Phi coprocessor Message Passing Interface (MPI) on Intel Xeon Phi coprocessor Special considerations for MPI on Intel Xeon Phi and using the Intel Trace Analyzer and Collector Gergana Slavova gergana.s.slavova@intel.com

More information

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort

More information

simulation framework for piecewise regular grids

simulation framework for piecewise regular grids WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler

More information

Lecture 4: Principles of Parallel Algorithm Design (part 4)

Lecture 4: Principles of Parallel Algorithm Design (part 4) Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming

More information

OpenACC Support in Score-P and Vampir

OpenACC Support in Score-P and Vampir Center for Information Services and High Performance Computing (ZIH) OpenACC Support in Score-P and Vampir Hands-On for the Taurus GPU Cluster February 2016 Robert Dietrich (robert.dietrich@tu-dresden.de)

More information

Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems

Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems Minesh I. Patel Ph.D. 1, Karl Jordan 1, Mattew Clark Ph.D. 1, and Devesh Bhatt

More information

Introduction to Performance Engineering

Introduction to Performance Engineering Introduction to Performance Engineering Markus Geimer Jülich Supercomputing Centre (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray) Performance: an old problem

More information

Position Paper: OpenMP scheduling on ARM big.little architecture

Position Paper: OpenMP scheduling on ARM big.little architecture Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM

More information

Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks

Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks Performance Cloning: Technique for isseminating Proprietary pplications as enchmarks jay Joshi (University of Texas) Lieven Eeckhout (Ghent University, elgium) Robert H. ell Jr. (IM Corp.) Lizy John (University

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution Multi-Domain Pattern I. Problem The problem represents computations characterized by an underlying system of mathematical equations, often simulating behaviors of physical objects through discrete time

More information

ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING

ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING Keynote, ADVCOMP 2017 November, 2017, Barcelona, Spain Prepared by: Ahmad Qawasmeh Assistant Professor The Hashemite University,

More information

The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla

The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Venugopalan Ramasubramanian Emin Gün Sirer Presented By: Kamalakar Kambhatla * Slides adapted from the paper -

More information

Introducing Scalileo A Java Based Scaling Framework

Introducing Scalileo A Java Based Scaling Framework Introducing Scalileo A Java Based Scaling Framework Tilmann Rabl, Christian Dellwo, Harald Kosch Chair of Distributed Database Systems First International Conference on Energy-Efficient Computing and Networking

More information