Tutorial: Application MPI Task Placement


1 Tutorial: Application MPI Task Placement. Juan Galvez, Nikhil Jain, Palash Sharma. PPL, University of Illinois at Urbana-Champaign

2 Tutorial Outline: Why task mapping on Blue Waters? When to do mapping? How to do mapping? Example: MILC (near-neighbor comm pattern, >2x speedup). Example: Qbox (subgroup Alltoalls, 1.57x speedup observed). Focus on the automatic tool.

3 Topology-Aware Task Placement What is topo-aware task placement? Place tasks so that communicating tasks are closer in the machine topology: minimize off-node traffic and link contention Not the same as simple rank reordering It takes into account the actual nodes in the job and distance in the network between MPI tasks Why topo-aware task placement? Execution time can improve significantly (> 2x) in large jobs in some applications compared to default BW mapping

4 Mapping example: 2D application grid to 2D torus. 2D application grid, ranks ordered by row. Near-neighbor communication. Each cell is one MPI task.

5 Mapping example: 2D application grid to 2D torus. If the torus has the same shape as the application grid, preserve the grid structure.

6 Mapping example: Random Placement Bad placement Longer paths Higher link utilization Link contention -> Longer communication time

7 2D 10x10 virtual grid, 2D 10x10 torus. Near-neighbor communication: each task sends unit load to each neighbor; a random shortest path is used for each pair of communicating ranks. Results table (values not preserved here) compares row order vs. random placement by avg hops, max hops, avg link load, and max link load.
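The following minimal sketch shows how the hop metrics on this slide can be reproduced for row-order versus random placement. It assumes a 10x10 torus, unit loads, and plain torus hop counts rather than the slide's random shortest-path routing; link-load accounting would follow the same pattern by walking a chosen route for each pair.

import random

P = 10  # torus (and application grid) is P x P

def torus_dist(a, b):
    """Hop count between torus coordinates a and b (wrap-around in x and y)."""
    dx = abs(a[0] - b[0]); dy = abs(a[1] - b[1])
    return min(dx, P - dx) + min(dy, P - dy)

def neighbors(rank):
    """4 near-neighbor partners of a rank in the P x P application grid."""
    x, y = rank % P, rank // P
    return [((x + 1) % P) + y * P, ((x - 1) % P) + y * P,
            x + ((y + 1) % P) * P, x + ((y - 1) % P) * P]

def evaluate(placement):
    """placement[rank] -> torus coordinate; return (avg hops, max hops)."""
    hops = [torus_dist(placement[r], placement[n])
            for r in range(P * P) for n in neighbors(r)]
    return sum(hops) / len(hops), max(hops)

coords = [(i % P, i // P) for i in range(P * P)]
row_order = {r: coords[r] for r in range(P * P)}        # rank r on node r
shuffled = random.sample(coords, len(coords))
random_place = {r: shuffled[r] for r in range(P * P)}   # random permutation

print("row order :", evaluate(row_order))      # expect avg = max = 1 hop
print("random    :", evaluate(random_place))   # expect several hops on average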

8 2D 10x10 virtual grid, 2D 5x20 torus. Results table (values not preserved here) compares row order, random placement, and a (near-)optimal placement by avg hops, max hops, avg link load, and max link load.

9 Default BW mapping: tasks 0..N-1 are assigned in rank order to the node list N0, N1, N2, ... There is no real guarantee on the quality of this map; communicating tasks may not be close to each other. It can work well on some shapes and badly on others. A better mapping is recommended to obtain consistent performance regardless of shape, OR always use a shape that gives good performance.

10 How important is task placement for my application? It depends on: application comm patterns (near-neighbor, collectives), volume of communication, computation-to-communication ratio, allocation shape, and job size (larger jobs have more communication, and potentially larger distances between tasks). Some applications (e.g. MILC) can get 2x-3x or more speedup in execution time on large jobs.

11 When to do mapping? Perhaps the easiest solution is to use an automatic tool and test whether there are actual improvements; worst case: no improvement. Remember that it might only improve for certain job sizes and job shapes. Analyze the communication behavior of the application. Knowledge of the application helps: computation-to-communication ratio, volume of point-to-point traffic, collectives pattern. If the user doesn't have this knowledge, use profiling tools.

12 Profiling Tools. CrayPat: large suite of performance analysis tools, including tracing. mpiP: link-time library, no need to recompile the application; presents statistical information on MPI calls. TopoProfiler: link-time library; collects information on MPI calls; includes topological information (hop-bytes, rank placement in the 3D torus) from the run; extracts the communication graph (messages and bytes communicated between MPI tasks).

13 Using TopoProfiler 1. Build topoprofiler libraries 2. Link with your application 3. After running the application, files generated: Summary file (optional) Comm graph Collectives stats file

14 TopoProfiler Output (sample run; numeric values not preserved in this transcription): Total Execution Time, Total Messages, Total Bytes; Average/Min/Max Messages per rank; Average/Min/Max Bytes per rank; Average/Min/Max Hopbytes per rank (Min on Rank 1672, Max on Rank 3040); Average/Min/Max Idle Time per rank (Min on Rank 4850, Max on Rank 891). Slide annotations: total run time, messages, bytes, total bytes, idle time as a proxy for communication time, and the per-rank hop-bytes definition hopbytes(n) = Σ over peers m of hops(m, n) × bytes(m, n).

15 TopoProfiler Output (continued; numeric values not preserved): Average/Min/Max Point to Point Operation Time per rank (Min on Rank 4177, Max on Rank 6783); Average/Min/Max Collective Operation Time per rank (Min on Rank 2642, Max on Rank 3593); Average/Min/Max number of Send Calls, Recv Calls, and Collective Operations per rank. Slide annotations: time in point-to-point comm, time in collectives.

16 TopoProfiler Output: per-rank table with columns MPI Rank, X, Y, Z, T, Messages, Bytes, Hop-bytes, Idle Time, Send Calls, Recv Calls. Collectives: timing, number of calls, bytes. There is an option to contribute some collectives to the per-rank hop-bytes metric, e.g. to see how close the ranks in a subgroup collective are to each other.
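As an illustration of that option, here is a minimal sketch of folding one subgroup collective into per-rank hop-bytes. It assumes a simple all-pairs traffic model for an Alltoall; this is an assumption for illustration, not necessarily how TopoProfiler attributes collectives.

def add_alltoall_hopbytes(hopbytes, group, bytes_per_pair, coord, dist):
    """hopbytes: dict rank -> accumulated hop-bytes; group: ranks in the
    subgroup collective; coord: rank -> node coordinate; dist: hop count
    between two coordinates. Adds bytes_per_pair * distance for every
    ordered pair of distinct group members."""
    for a in group:
        for b in group:
            if a != b:
                hopbytes[a] = hopbytes.get(a, 0) + bytes_per_pair * dist(coord[a], coord[b])
    return hopbytes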

17 How to do mapping? In the application: hard, though there are ways to access topological info from within the application (e.g. TopoManager). External tools: CrayPat grid_order, Topaware, genm. genm works with any MPI application, requires no source code modification, and works with any communication pattern, including irregular ones.

18 CrayPat grid_order: generates an MPI rank order based on the grid pattern detected by CrayPat profiling tools. Calculated off-line; can place communicating tasks on the same node, but doesn't explicitly place communicating nodes near each other. Doesn't deal with irregular patterns.

19 Topaware: Run on the login node; the user gives the application's virtual topology grid, and Topaware returns the set of geometries on which to run the job. Inside the batch job, call topaware to calculate the map for the given allocation. Limitations: need to know the shape of the application grid (e.g. a 4D application grid partitioned into 8x8x8x16 tasks); the tool assumes a near-neighbor comm pattern in that grid; need to run in specific geometries. Very good performance results with MILC.

20 genm: Communication graph as input. Goal: minimize the average (per-rank) hopbytes, where hopbytes(a) = Σ over peers n of bytes(a, n) × distance(a, n). Keep communicating tasks close, with preference based on the amount of communication. This metric doesn't necessarily correlate perfectly with application execution time, but gives good results. Additionally favors solutions with smaller max hopbytes. Hopbytes in the y direction are given 2x more weight.

21 genm algorithm: place tasks on processors to minimize hopbytes, which is an NP-hard problem. Heuristic algorithm with a random component and a greedy component; multiple trials of random greedy search. Other algorithms were not used because they are either too slow or find worse solutions. A separate instance of the algorithm runs on every processor.
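The slide describes the search only at a high level; below is a minimal sketch, not the actual genm code, of a random-plus-greedy placement heuristic of the kind described: each trial seeds the map with a random task on a random node, then greedily places the heaviest remaining communicator on the free node that adds the least hop-bytes, and the best of several trials is kept.

import random

def hops(a, b, dims):
    """Torus hop count between node coordinates a and b."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def greedy_map(comm, nodes, dims, trials=10):
    """comm[i][j] = bytes between tasks i and j; nodes = free node coordinates."""
    ntasks, best, best_cost = len(comm), None, float("inf")
    for _ in range(trials):
        free, placed = list(nodes), {}
        start = random.randrange(ntasks)                  # random component
        placed[start] = free.pop(random.randrange(len(free)))
        while len(placed) < ntasks:
            # greedy component: pick the heaviest communicator with the placed set
            t = max((i for i in range(ntasks) if i not in placed),
                    key=lambda i: sum(comm[i][j] for j in placed))
            # put it on the free node that adds the least hop-bytes
            k = min(range(len(free)),
                    key=lambda n: sum(comm[t][j] * hops(free[n], placed[j], dims)
                                      for j in placed))
            placed[t] = free.pop(k)
        cost = sum(comm[i][j] * hops(placed[i], placed[j], dims)
                   for i in placed for j in placed) / ntasks
        if cost < best_cost:
            best, best_cost = placed, cost
    return best, best_cost

# Example: 4 tasks in a ring placed on a 1-D torus of length 4
ring = [[0, 10, 0, 10], [10, 0, 10, 0], [0, 10, 0, 10], [10, 0, 10, 0]]
print(greedy_map(ring, nodes=[(i,) for i in range(4)], dims=(4,)))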

22 Running genm. Inside the batch job:
aprun -n NUM_PROCESSORS ./mapper comm_graph.txt
export MPICH_RANK_REORDER_METHOD=3
aprun -n NUM_PROCESSORS ./program args
Runs on any geometry and arbitrary topologies; works with BW service nodes ("holes"). The mapper runs on every processor, for two reasons: it increases the search space (subsets of processors search different solution spaces), and with large problems it compensates for the fact that a single processor can do fewer trials.
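A minimal sketch of how a computed order gets consumed by the runtime: with MPICH_RANK_REORDER_METHOD=3, cray-mpich reads a custom rank order from a file named MPICH_RANK_ORDER in the working directory, written as a comma-separated list of ranks; my understanding (check the intro_mpi man page for the exact semantics) is that the i-th rank listed is placed on the i-th slot of the node/core list. Whether genm's mapper writes this file itself is not stated on the slide, so this helper is an assumed post-processing step.

def write_rank_order(new_order, path="MPICH_RANK_ORDER"):
    """new_order[i] = original MPI rank to place at position i of the node list."""
    with open(path, "w") as f:
        f.write(",".join(str(r) for r in new_order) + "\n")

write_rank_order([0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15])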

23 Mapping Examples: Methodology Profile: Analyze/confirm communication pattern Detect communication issues, e.g. high hopbytes, high point-to-point time, high collectives time Collectives used and timings of each Apply genm automatic task mapping

24 Example: MILC MILC - Lattice QCD 4D lattice (3 spatial dimensions + time) Extremely sensitive to placement Near-neighbor communication pattern

25 MILC Profiling Results (task count not preserved; 8x8x8 shape, default BW mapping; numeric values not preserved): Total Execution Time; Average/Min/Max Hopbytes per rank; Average/Min/Max Idle Time per rank (~40% of time spent on communication); Average/Min/Max Point to Point Operation Time per rank; Average/Min/Max Collective Operation Time per rank.

26 MILC Profiling Results (task count not preserved; 11x2x24 shape, default BW mapping; numeric values not preserved): Total Execution Time; Average/Min/Max Hopbytes per rank; Average/Min/Max Idle Time per rank (~63% of time spent on communication); Average/Min/Max Point to Point Operation Time per rank; Average/Min/Max Collective Operation Time per rank.

27 MILC mapping results on BW: bar chart of program execution time (s) for MILC with 8192 ranks, comparing topaware, genm, and the default mapping across allocation shapes (X * Y * Z) 8x4x8, 11x2x12, and 6x6x8.

28 MILC mapping results on BW: bar chart of program execution time (s) for MILC (rank count not preserved), comparing topaware, genm, and the default mapping across allocation shapes (X * Y * Z) 8x8x8, 11x6x8, and 11x2x24.

29 Example: Qbox. First-principles MD code developed at LLNL; computes the properties of materials directly from the underlying physics equations (Density Functional Theory). The computational workload is split between parallel dense linear algebra (ScaLAPACK library) and a custom 3D Fast Fourier Transform. Gordon Bell Prize 2006 for scaling on BG/L.

30 Qbox communication (diagram): 2D process grid, MPI_Alltoall, MPI_Allreduce.

31 Qbox Profiling Results (numeric values not preserved): Total Execution Time; Average/Min/Max Hopbytes per rank; Average/Min/Max Idle Time per rank (~35% of execution time); Average/Min/Max Point to Point Operation Time per rank; Average/Min/Max Collective Operation Time per rank.

32 Example: Qbox Profiling results. Most of the communication time is spent in Alltoall; these are subgroup Alltoalls (not a global Alltoall). Example: 84 MPI groups of size 256 spend > 20 seconds on collectives (avg = 28, max = 38); the time spent by these groups in MPI_Alltoallv is avg = 24, max = 26. Mapping strategy: cluster these subgroups together, OR let the automatic mapping do it.
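A minimal sketch of the "cluster these subgroups together" strategy, assuming the subgroup membership is known; here it is illustrated with the column groups of a small 2D process grid, which may differ from Qbox's actual grouping. The resulting order can be fed to the rank-reorder mechanism shown on slide 22.

def cluster_groups(groups, nranks):
    """Return new_order where new_order[i] is the original rank placed at slot i."""
    new_order, seen = [], set()
    for g in groups:
        for r in g:
            if r not in seen:       # a rank can belong to several groups;
                seen.add(r)         # keep only its first placement
                new_order.append(r)
    new_order += [r for r in range(nranks) if r not in seen]   # any leftovers
    return new_order

# Example: 16 ranks in a 4x4 process grid (row-major); groups = grid columns
groups = [[row * 4 + col for row in range(4)] for col in range(4)]
print(cluster_groups(groups, 16))   # -> [0, 4, 8, 12, 1, 5, 9, 13, ...]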

33 Qbox Mapping Results on BW: bar charts of program execution time (s) comparing genm vs. the default mapping, for Qbox with 8192 tasks and with a larger task count (value not preserved), across allocation shapes (X * Y * Z) 11x3x8, 11x2x12, and 11x2x24.

34 Thank you! Contact me:
