Tutorial: Application MPI Task Placement
1 Tutorial: Application MPI Task Placement
Juan Galvez, Nikhil Jain, Palash Sharma
PPL, University of Illinois at Urbana-Champaign
2 Tutorial Outline
- Why task mapping on Blue Waters?
- When to do mapping?
- How to do mapping?
- Example: MILC (near-neighbor comm pattern, >2x speedup)
- Example: Qbox (subgroup Alltoalls, 1.57x speedup seen)
- Focus on the automatic tool
3 Topology-Aware Task Placement
What is topo-aware task placement?
- Place tasks so that communicating tasks are closer in the machine topology: minimize off-node traffic and link contention
- Not the same as simple rank reordering: it takes into account the actual nodes in the job and the network distance between MPI tasks
Why topo-aware task placement?
- Execution time can improve significantly (>2x) for some applications in large jobs, compared to the default BW mapping
4 Mapping example: 2D application grid to 2D torus
- 2D application grid, ranks ordered by row
- Near-neighbor communication
- Each cell is an MPI task
5 Mapping example: 2D application grid to 2D torus
- If the torus has the same shape as the application grid, the mapping can preserve the grid structure
6 Mapping example: Random Placement
- Bad placement: longer paths, higher link utilization
- Link contention -> longer communication time
7 2D 10x10 virtual grid on a 2D 10x10 torus
- Near-neighbor communication; each task sends unit load to each neighbor
- A random shortest path is chosen for each pair of communicating ranks (a hop-metric sketch follows below)
[Table: avg hops, max hops, avg link load, max link load for row-order vs. random placement]
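A minimal sketch (not part of the tutorial) of how the hop columns of this table can be computed. It assumes dimension-wise wrap-around shortest-path distances and measures hops only; reproducing the link-load columns would additionally require modeling which route each message takes:

    import random

    N = 10  # side length of both the torus and the virtual grid

    def torus_hops(a, b):
        """Shortest-path hop count between coordinates a and b on an NxN torus."""
        return sum(min(abs(ax - bx), N - abs(ax - bx)) for ax, bx in zip(a, b))

    def neighbors(r):
        """4-point near-neighbor ranks of rank r in the NxN virtual grid."""
        x, y = r % N, r // N
        return [((x + dx) % N) + ((y + dy) % N) * N
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]

    def hop_metrics(place):
        """place maps rank -> (x, y) torus node; returns (avg hops, max hops)."""
        hops = [torus_hops(place[r], place[n])
                for r in range(N * N) for n in neighbors(r)]
        return sum(hops) / len(hops), max(hops)

    row_order = {r: (r % N, r // N) for r in range(N * N)}  # ranks ordered by row
    nodes = list(row_order.values())
    random.shuffle(nodes)
    random_place = dict(enumerate(nodes))                   # random placement

    print("row order (avg, max hops):", hop_metrics(row_order))
    print("random    (avg, max hops):", hop_metrics(random_place))

With matching 10x10 shapes, row order puts every communicating pair exactly one hop apart; a random placement raises both the average and the maximum, which is what the table quantifies.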
8 2D 10x10 virtual grid on a 2D 5x20 torus
[Table: avg hops, max hops, avg link load, max link load for row order, random placement, and a (near-)optimal mapping]
9 Default BW mapping
[Diagram: tasks 0 through N-1 assigned in rank order to the node list N0, N1, N2, ...]
- No real guarantee on the quality of this map; communicating tasks may not be close to each other
- Can work well on some shapes and badly on others
- Better mapping is recommended to obtain consistent performance regardless of shape, OR always use a shape that gives good performance
10 How important is task placement for my application? It depends on:
- Application comm patterns (near-neighbor, collectives)
- Volume of communication
- Computation-to-communication ratio
- Allocation shape
- Job size: larger jobs have more communication and potentially larger distances between tasks
Some applications (e.g. MILC) can get more than 2x-3x speedup in execution time on large jobs.
11 When to do mapping?
- Perhaps the easiest solution is to use an automatic tool and test whether there are actual improvements. Worst case: no improvement.
- Remember that it might only help at certain job sizes and job shapes.
- Analyze the communication behavior of the application. Knowledge of the application helps: computation-to-communication ratio, volume of point-to-point traffic, collectives pattern.
- If the user doesn't have this knowledge: profiling tools.
12 Profiling Tools
- CrayPat: large suite of performance analysis tools, including tracing
- mpiP: link-time library (no need to recompile the application); presents statistical information on MPI calls
- TopoProfiler: link-time library; collects information on MPI calls, including topological information (hop-bytes, rank placement in the 3D torus) from the run; extracts the communication graph (messages and bytes communicated between MPI tasks)
13 Using TopoProfiler
1. Build the TopoProfiler libraries
2. Link them with your application
3. After running the application, files are generated: summary file (optional), comm graph, collectives stats file
14 TopoProfiler Output (summary file)
- Total Execution Time (the total run time), Total Messages, Total Bytes
- Average Messages (rank average), Min Messages (Rank 0), Max Messages (Rank 0)
- Average Bytes (rank average), Min Bytes (Rank 0), Max Bytes (Rank 0)
- Average Hopbytes (rank average), Min Hopbytes (Rank 1672), Max Hopbytes (Rank 3040)
- Average Idle Time (rank average), Min Idle Time (Rank 4850), Max Idle Time (Rank 891)
Idle time approximates communication time; the hop-bytes of rank m are hopbytes(m) = Σ_n hops(m, n) × bytes(m, n)
15 TopoProfiler Output
- Average Point to Point Operation Time (rank average), Min (Rank 4177), Max (Rank 6783): time spent in point-to-point communication
- Average Collective Operation Time (rank average), Min (Rank 2642), Max (Rank 3593): time spent in collectives
- Average number of Send Calls (rank average), Min (Rank 0), Max (Rank 0)
- Average number of Recv Calls (rank average), Min (Rank 0), Max (Rank 0)
- Average number of Collective Operations (rank average), Min (Rank 0), Max (Rank 0)
16 TopoProfiler Output
- Per-rank data: MPI Rank, X, Y, Z, T coordinates, Messages, Bytes, Hop-bytes, Idle Time, Send Calls, Recv Calls
- Collectives: timing, number of calls, bytes
- Option to count some collectives toward the per-rank hop-bytes metric, e.g. to know how close together the ranks of a subgroup collective are
17 How to do mapping?
- In the application: hard. There are ways to access topological info from within the application (e.g. TopoManager).
- CrayPat grid_order tool
- Topaware
- genm (the automatic tool this tutorial focuses on): works with any MPI application, no source code modification necessary, works with any communication pattern, including irregular ones
18 CrayPat grid_order
- Generates an MPI rank order based on the grid pattern detected by CrayPat profiling tools
- Calculated off-line; can place communicating tasks on the same node, but doesn't explicitly place communicating nodes near each other
- Doesn't deal with irregular patterns
19 Topaware
- Run on a login node; the user gives the application's virtual topology grid, and Topaware returns the set of geometries on which to run the job
- Inside the batch job, call Topaware to calculate the map for the given allocation
- Limitations: need to know the shape of the application grid (e.g. a 4D application grid partitioned into 8x8x8x16 tasks); the tool assumes a near-neighbor comm pattern on that grid; need to run in specific geometries
- Very good performance results with MILC
20 genm
- Takes the communication graph as input
- Goal: minimize average (per-rank) hop-bytes, where hopbytes(a) = Σ_n bytes(a, n) × distance(a, n) (a sketch of this objective follows below)
- Keeps communicating tasks close, with preference based on the amount of communication
- Hop-bytes doesn't necessarily correlate perfectly with application execution time, but gives good results
- Additionally favors solutions with a smaller maximum hop-bytes
- Hop-bytes in the Y direction are given 2x more weight
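As a hedged illustration of the objective described above, a plausible form is sketched here; this is an assumption about the metric, not genm's source. The placement map, byte matrix, torus dimensions, and the 2x Y-weight mirror the bullet points:

    def torus_hops(a, b, dims, y_weight=2.0):
        """Wrap-around distance between torus coordinates a and b, with hops
        in the Y dimension (index 1) counted y_weight times."""
        per_dim = [min(abs(x - y), d - abs(x - y))
                   for x, y, d in zip(a, b, dims)]
        per_dim[1] *= y_weight
        return sum(per_dim)

    def rank_hopbytes(a, place, bytes_mat, dims):
        """hopbytes(a) = sum over peers n of bytes(a, n) * distance(a, n)."""
        return sum(b * torus_hops(place[a], place[n], dims)
                   for n, b in enumerate(bytes_mat[a]) if b and n != a)

    def objective(place, bytes_mat, dims):
        """Average per-rank hop-bytes (the quantity minimized), with the
        maximum returned as well to favor solutions with a smaller max."""
        per_rank = [rank_hopbytes(a, place, bytes_mat, dims)
                    for a in range(len(bytes_mat))]
        return sum(per_rank) / len(per_rank), max(per_rank)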
21 genm algorithm
- Algorithm: place tasks on processors to minimize hop-bytes (an NP-hard problem)
- Heuristic algorithm with a random component and a greedy component: multiple trials of randomized greedy search (sketched below)
- Other algorithms were not used because they are either too slow or find worse solutions
- A separate instance of the algorithm runs on every processor
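A sketch of a randomized greedy search in the spirit of this slide; the structure is assumed rather than taken from genm's implementation. Each trial seeds one task at random, then repeatedly places the most communication-heavy unplaced task on the free processor that adds the least hop-bytes, and the best of several trials wins:

    import random

    def greedy_trial(bytes_mat, procs, hops):
        """bytes_mat[m][n]: bytes exchanged between ranks m and n; procs: list
        of processor coordinates; hops(a, b): network distance between two
        coordinates. Returns a dict mapping rank -> index into procs."""
        n_ranks = len(bytes_mat)
        free = set(range(n_ranks))
        order = sorted(range(n_ranks), key=lambda m: -sum(bytes_mat[m]))
        place = {order[0]: random.choice(tuple(free))}  # random seed placement
        free.discard(place[order[0]])
        for m in order[1:]:
            # Greedy step: hop-bytes added by putting m on processor p,
            # counting only the peers that are already placed.
            cost = lambda p: sum(hops(procs[p], procs[place[q]]) * bytes_mat[m][q]
                                 for q in place)
            place[m] = min(free, key=cost)
            free.discard(place[m])
        return place

    def total_hopbytes(place, bytes_mat, procs, hops):
        return sum(hops(procs[place[m]], procs[place[q]]) * bytes_mat[m][q]
                   for m in place for q in place if bytes_mat[m][q])

    def search(bytes_mat, procs, hops, trials=16):
        """Best of several independent randomized greedy trials."""
        return min((greedy_trial(bytes_mat, procs, hops) for _ in range(trials)),
                   key=lambda pl: total_hopbytes(pl, bytes_mat, procs, hops))

Running one instance per processor, as the slide describes, amounts to giving each processor a different random seed (and possibly a different region of the solution space) and keeping the globally best map.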
22 Running genm
Inside the batch job:

    aprun -n NUM_PROCESSORS ./mapper comm_graph.txt
    export MPICH_RANK_REORDER_METHOD=3
    aprun -n NUM_PROCESSORS ./program args

- Runs on any geometry and on arbitrary topologies
- Works with BW service nodes ("holes")
- The mapper runs on every processor, for two reasons: it increases the search space (subsets of processors search different solution spaces), and for large problems it compensates for the fact that a single processor can do fewer trials (see the file-writing sketch below)
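For context, MPICH_RANK_REORDER_METHOD=3 tells Cray MPICH to read the custom order from a file named MPICH_RANK_ORDER in the working directory. A minimal sketch of the final step a mapper like this performs, assuming the comma-separated file format; the writer function and the example permutation are illustrative, not genm's code:

    def write_rank_order(order, path="MPICH_RANK_ORDER"):
        """order[i] is the MPI rank assigned to placement slot i; Cray MPICH
        reads a comma-separated list of ranks from this file (assumed format)."""
        with open(path, "w") as f:
            f.write(",".join(str(r) for r in order) + "\n")

    # Hypothetical 16-rank example: a computed permutation of ranks 0..15.
    write_rank_order([0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15])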
23 Mapping Examples: Methodology
- Profile: analyze/confirm the communication pattern
- Detect communication issues, e.g. high hop-bytes, high point-to-point time, high collectives time
- Identify the collectives used and the timing of each
- Apply genm automatic task mapping
24 Example: MILC
- MILC: Lattice QCD
- 4D lattice (3 spatial dimensions + time)
- Extremely sensitive to placement
- Near-neighbor communication pattern
25 MILC Profiling Results (8x8x8 allocation shape, default BW mapping)
- Total Execution Time:
- Average Hopbytes (rank average), Min Hopbytes (Rank 6624), Max Hopbytes (Rank: )
- Average Idle Time (rank average): ~40% of time spent on communication; Min Idle Time (Rank 4596), Max Idle Time (Rank: )
- Average Point to Point Operation Time (rank average), Min (Rank 3989), Max (Rank: )
- Average Collective Operation Time (rank average), Min (Rank: ), Max (Rank 6041)
26 MILC Profiling Results (11x2x24 allocation shape, default BW mapping)
- Total Execution Time:
- Average Hopbytes (rank average), Min Hopbytes (Rank 2528), Max Hopbytes (Rank 7536)
- Average Idle Time (rank average): ~63% of time spent on communication; Min Idle Time (Rank 9940), Max Idle Time (Rank 7177)
- Average Point to Point Operation Time (rank average), Min (Rank 2665), Max (Rank: )
- Average Collective Operation Time (rank average), Min (Rank 9173), Max (Rank 2665)
27 MILC mapping results on BW
[Chart: MILC, 8192 ranks; program execution time (s) for topaware, genm, and default mappings, across allocation shapes 8x4x8, 11x2x12, and 6x6x8 (X*Y*Z)]
28 MILC mapping results on BW
[Chart: MILC at a larger rank count; program execution time (s) for topaware, genm, and default mappings, across allocation shapes 8x8x8, 11x6x8, and 11x2x24 (X*Y*Z)]
29 Example: Qbox
- First-principles MD code developed at LLNL
- Computes the properties of materials directly from the underlying physics equations (Density Functional Theory)
- The computational workload is split between parallel dense linear algebra (ScaLAPACK library) and a custom 3D Fast Fourier Transform
- Gordon Bell Prize '06 for scaling on BG/L
30 Qbox communication
[Diagram: 2D process grid with MPI_Alltoall and MPI_Allreduce communication]
31 Qbox Profiling Results
- Total Execution Time:
- Average Hopbytes (rank average), Min Hopbytes (Rank 5746), Max Hopbytes (Rank 3007)
- Average Idle Time (rank average): ~35% of execution time; Min Idle Time (Rank 6076), Max Idle Time (Rank 1535)
- Average Point to Point Operation Time (rank average), Min (Rank 79), Max (Rank 7167)
- Average Collective Operation Time (rank average), Min (Rank 6076), Max (Rank 1279)
32 Example: Qbox Profiling Results
- Most of the communication time is spent in Alltoall, and these calls involve subgroups (not a global Alltoall)
- Example: 84 MPI groups of size 256 spend >20 secs on collectives (avg=28, max=38); time spent by these groups in MPI_Alltoallv: avg=24, max=26
- Mapping strategy: cluster these subgroups together, OR let automatic mapping do it
33 Qbox Mapping Results on BW
[Charts: program execution time (s) for genm vs. default; Qbox with 8192 tasks on allocation shape 11x3x8, and Qbox at a larger task count on allocation shapes 11x2x12 and 11x2x24 (X*Y*Z)]
34 Thank you! Contact me:
More information