Communication and Topology-aware Load Balancing in Charm++ with TreeMatch
1 Communication and Topology-aware Load Balancing in Charm++ with TreeMatch
Joint Lab 10th Workshop (IEEE Cluster 2013, Indianapolis, IN)
Emmanuel Jeannot, Esteban Meneses-Rojas, Guillaume Mercier, François Tessier, Gengbin Zheng
November 27, 2013
Emmanuel Jeannot - Communication-aware load balancing - 1 / 25
2 Introduction
Scalable execution of parallel applications:
- The number of cores is increasing, but the memory per core is decreasing.
- Applications will need to communicate even more than they do now.
Issue: process placement should take process affinity into account.
Here: load balancing in Charm++ taking into account:
- load
- affinity
- topology
- migration cost (transfer time)
3 Outline
1 Introduction
2 Problem and models
3 Load balancing for compute-bound applications
4 Load balancing for communication-bound applications
5 Conclusion
5 Charm++ Features
- Parallel object-oriented programming language based on C++.
- Programs are decomposed into a number of cooperating message-driven objects called chares.
- In general there are more chares than processing units.
- Chares are mapped to physical processors by an adaptive runtime system.
- Load balancers can be called to migrate chares.
- Chare placement and load balancing are transparent to the programmer.
6 Chare/Process Placement - Why we should consider it
- Many current and future parallel platforms have several levels of hierarchy (cache hierarchy, memory bus, high-performance network...).
- Application chares/processes do not exchange the same amount of data (affinity).
- The process placement policy may therefore impact performance.
- In this work we deal with tree topologies only.
[Figure: a tree topology, from the switch down through cabinets, nodes, and processors to cores]
7 Problems
Given:
- the parallel machine topology
- the application communication pattern
Map application processes/chares to physical resources (cores) to reduce communication costs.
[Figure: communication matrix zeus16.map, sender rank vs. receiver rank]
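The mapping problem above can be made concrete with a small cost model: weight each pair's traffic by the distance its two cores are apart in the topology tree. This is an illustrative sketch in Python (not TreeMatch's actual algorithm); the `arities` tuple describing the tree fan-out is a hypothetical example shape.

```python
from itertools import product

def core_path(core_id, arities):
    """Path of a core through the topology tree.
    `arities` lists the fan-out at each level, e.g. (2, 2) for
    2 nodes x 2 cores (a hypothetical shape)."""
    path = []
    for a in reversed(arities):
        path.append(core_id % a)
        core_id //= a
    return list(reversed(path))

def hop_distance(i, j, arities):
    """Number of tree levels cores i and j do NOT share (0 = same core)."""
    pi, pj = core_path(i, arities), core_path(j, arities)
    for level in range(len(arities)):
        if pi[level] != pj[level]:
            return len(arities) - level
    return 0

def mapping_cost(comm, sigma, arities):
    """Total message volume weighted by topology distance.
    comm[i][j]: traffic between processes i and j
    sigma[i]:   core on which process i is placed"""
    n = len(comm)
    return sum(comm[i][j] * hop_distance(sigma[i], sigma[j], arities)
               for i, j in product(range(n), range(n)) if i != j)
```

With two chatty pairs (0,1) and (2,3) on a 2-node x 2-core tree, keeping each pair on one node gives a lower cost than splitting them across nodes, which is exactly the objective a topology-aware mapper minimizes.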
8 TreeMatch - The TreeMatch Algorithm
Algorithm and environment to compute process placement based on process affinities and the NUMA topology.
Input:
- The communication pattern of the application: a preliminary execution with a monitored MPI implementation for static placement, or dynamic recording on iterative applications with Charm++.
- A model (tree) of the underlying architecture: hwloc can provide this.
Output:
- A process permutation σ such that σ_i is the core number on which process i has to be bound.
10 Example
σ = (0, 2, 8, 10, 4, 6, 12, 14, 1, 3, 9, 11, 5, 7, 13, 15)
[Figure: communication matrices example16.mat and example16_treematch.mat, sender rank vs. receiver rank]
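The effect of σ on the communication matrix can be visualized by permuting the matrix into its core-level view, as the two heatmaps on this slide do. A minimal sketch, using the σ from the example (the `permute_matrix` helper is hypothetical, not part of TreeMatch):

```python
def permute_matrix(comm, sigma):
    """Core-level view of a process communication matrix.
    sigma[i] is the core assigned to process i (TreeMatch's output),
    so cores sigma[i] and sigma[j] exchange comm[i][j]."""
    n = len(comm)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[sigma[i]][sigma[j]] = comm[i][j]
    return out

# The permutation from the slide's example
sigma = [0, 2, 8, 10, 4, 6, 12, 14, 1, 3, 9, 11, 5, 7, 13, 15]
```

For instance, processes 0 and 8 are mapped to cores 0 and 1, so any traffic between them moves next to the diagonal of the permuted matrix, i.e. onto neighboring cores.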
11 TreeMatch vs. Existing Solutions
Graph partitioners:
- parallel Scotch
- (Par)Metis
Other static algorithms:
- [Träff 02]: placement through graph embedding and graph partitioning
- MPIPP [Chen et al. 2006]: placement through local exchange of processes
- LibTopoMap [Hoefler & Snir 11]: placement through a network model + graph partitioning (ParMetis)
Other topology-aware load-balancing algorithms:
- NUCOLB [L. L. Pilla et al. 2012]: shared-memory machines
- HwTopoLB [L. L. Pilla et al. 2012]
All these solutions require quantitative information about the network and the communication durations. TreeMatch requires only qualitative information about the topology (its structure).
12 Load Balancing
Principle:
- Iterative applications: the load balancer is called at regular intervals.
- Chares are migrated in order to optimize several criteria.
- The Charm++ runtime system provides the chares' load, the chares' affinity, etc.
Constraints:
- dealing with complex modern architectures
- taking into account communications between elements
- the cost of migrations
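For the pure load-levelling side of the problem, the classic baseline (the idea behind Charm++'s GreedyLB, which later slides compare against) is to hand the heaviest chares first to the currently least-loaded core. An illustrative sketch, not the actual Charm++ implementation:

```python
import heapq

def greedy_balance(loads, n_cores):
    """Assign each chare to the currently least-loaded core,
    heaviest chares first. loads[c] is the measured load of chare c;
    returns {chare: core}."""
    heap = [(0.0, core) for core in range(n_cores)]
    heapq.heapify(heap)
    placement = {}
    for chare in sorted(range(len(loads)), key=lambda c: -loads[c]):
        core_load, core = heapq.heappop(heap)
        placement[chare] = core
        heapq.heappush(heap, (core_load + loads[chare], core))
    return placement
```

This levels CPU load well but ignores affinity, topology, and migration cost, which is precisely what the TreeMatch-based balancers add.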
13 What about Charm++? Not so easy... several issues are raised:
- scalability of TreeMatch
- the need to find a relevant compromise between process affinities and load balancing (compute-bound vs. communication-bound applications)
- the impact of chare migrations
- the load balancing time itself
The next slides present two load balancers relying on TreeMatch:
- Compute-bound applications: TMLB_Min_Weight, which applies communication-aware load balancing while favoring CPU load levelling and minimizing migrations.
- Communication-bound applications: TMLB_TreeBased, which performs parallel communication-aware load balancing, giving priority to minimizing the communication cost.
15 Strategy for Charm++ - TMLB_Min_Weight
- Applies TreeMatch on all chares (fake topology: #leaves = #chares).
- Binds chares according to their load, levelling on the less loaded chares: sort each part by CPU load, then combine chare placement and load balancing into groups of chares.
- Uses the Hungarian algorithm to minimize group-of-chares migrations (minimum-weight matching).
[Figure: animation showing chares sorted by CPU load and assembled into groups]
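The final step of TMLB_Min_Weight, matching groups of chares to cores so that as few chares as possible move, can be sketched as follows. The slide names the Hungarian algorithm; this sketch brute-forces the same minimum-weight matching for clarity (fine for a handful of cores, and the helper name is hypothetical):

```python
from itertools import permutations

def min_migration_assignment(groups, current_core):
    """Choose which chare group goes to which core so that the fewest
    chares migrate.
    groups[g]       -> list of chare ids forming group g
    current_core[c] -> core where chare c currently lives
    Returns a tuple assign where group g goes to core assign[g]."""
    n = len(groups)
    def migrations(assign):
        return sum(1 for g, core in enumerate(assign)
                   for chare in groups[g]
                   if current_core[chare] != core)
    return min(permutations(range(n)), key=migrations)
```

With groups that already sit together on some core, the matching simply sends each group back where most of its chares are, yielding zero migrations in the best case.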
18 Results - LeanMD
- Molecular dynamics application: massive imbalance, few communications.
- Experiments on 8 nodes with 8 cores each (Intel Xeon 5550).
[Figure: execution time (in seconds) of LeanMD on 64 cores vs. particles per cell, for Baseline, GreedyLB, RefineLB, and TMLB_Min_Weight]
19 Results - LeanMD Migrations
Compared to TMLB_Min_Weight without migration minimization:
- execution time up to 5% better
- around 200 fewer migrations
[Figure: number of migrated chares in LeanMD (960 chares, 64 cores) vs. particles per cell, for GreedyLB, RefineLB, and TMLB_Min_Weight]
21 Strategy for Charm++ - TMLB_TreeBased
1st step: applies TreeMatch while considering groups of chares on cores.
2nd step: reorders the chares inside each node, each node in parallel:
- defines the subtree
- creates a fake topology with as many leaves as the number of chares, plus some extra (constraints)
- applies TreeMatch on this topology and the chares' communication pattern
- binds chares according to their load (levelling on the less loaded chares)
[Figure: animation showing groups of chares assigned to cores by CPU load, then reordered within each node]
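The two-phase structure described above can be sketched end to end: first pack chatty chares onto the same node, then, independently per node, order that node's chares on its cores. This is an illustrative sketch only, with greedy packing and brute force standing in for TreeMatch at both levels, and it assumes one chare per core:

```python
from itertools import permutations

def two_step_map(comm, n_nodes, cores_per_node):
    """Two-phase hierarchical mapping (illustrative sketch).
    comm[i][j] is the traffic between chares i and j.
    Returns {chare: core}, with cores numbered node-major."""
    n = len(comm)
    remaining = set(range(n))
    nodes = []
    for _ in range(n_nodes):
        # Phase 1: seed each node with the chattiest remaining chare,
        # then grow the group towards its strongest partners.
        seed = max(sorted(remaining),
                   key=lambda c: sum(comm[c][d] for d in remaining))
        group = [seed]
        remaining.remove(seed)
        while len(group) < cores_per_node:
            nxt = max(sorted(remaining),
                      key=lambda c: sum(comm[c][g] for g in group))
            group.append(nxt)
            remaining.remove(nxt)
        nodes.append(group)
    # Phase 2: each node independently (this loop is what TMLB_TreeBased
    # runs in parallel) orders its chares so heavy pairs sit on
    # adjacent cores.
    placement = {}
    for node_id, group in enumerate(nodes):
        order = max(permutations(group),
                    key=lambda p: sum(comm[p[k]][p[k + 1]]
                                      for k in range(len(p) - 1)))
        for slot, chare in enumerate(order):
            placement[chare] = node_id * cores_per_node + slot
    return placement
```

Because phase 2 only touches chares within one node, the per-node work needs no coordination, which is what makes the parallel version of the balancer natural.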
28 Results - kneighbor
- Benchmark application designed to simulate intensive communication between processes.
- Experiments on 8 nodes with 8 cores each (Intel Xeon 5550).
- Particularly compared to RefineCommLB, which takes both load and communication into account and minimizes migrations.
[Figure: execution time (in seconds) of kneighbor on 64 cores with 64, 128, and 256 elements, 1 MB message size, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB, and TMLB_TreeBased]
29 Results - Impact on Communication
Evolution of communications over ten iterations:
[Figure: communication over 10 iterations without any load balancing strategy vs. after the first call of TreeMatchLB (in thousands of messages sent)]
30 Results - Stencil3D
- Three-dimensional stencil with regular communication with fixed neighbors.
- One chare per core: balancing considers only communications.
- Only one load balancing step, after 10 iterations.
- Experiments on 8 nodes with 8 cores each (Intel Xeon 5550).
[Figure: execution time (in seconds) of Stencil3D on 64 cores with 64 elements, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB, and TMLB_TreeBased]
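The fixed, regular pattern of a 3D stencil is easy to reconstruct for experimentation: each chare exchanges messages with its six face neighbors on the grid. A small sketch (the `stencil3d_pattern` helper is hypothetical, not part of the benchmark):

```python
def stencil3d_pattern(nx, ny, nz):
    """Unit-weight communication matrix of a 3D face-neighbor stencil
    on an nx x ny x nz grid of chares."""
    n = nx * ny * nz
    idx = lambda x, y, z: (x * ny + y) * nz + z
    comm = [[0] * n for _ in range(n)]
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                                   (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                    X, Y, Z = x + dx, y + dy, z + dz
                    if 0 <= X < nx and 0 <= Y < ny and 0 <= Z < nz:
                        comm[idx(x, y, z)][idx(X, Y, Z)] = 1
    return comm
```

Such a matrix is a convenient input for a communication-only balancing step like the one evaluated on this slide, since every chare's load is identical.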
31 Results - What about the load balancing time?
- The load balancing time grows linearly as the number of chares is doubled.
- TMLB_TreeBased is clearly slower than the other strategies, but the parallel version is almost implemented.
[Figure: load balancing time (in ms) of the different strategies vs. number of chares, running on 64 cores, for the kneighbor application: DummyLB, GreedyCommLB, GreedyLB, RefineCommLB, TMLB_TreeBased]
33 Conclusion and Future Work
Conclusion:
- The topology is not flat, and process affinities are not homogeneous: taking this information into account when mapping chares yields improvements.
- We need to distinguish between compute-bound and communication-bound applications.
- Several criteria are taken into account: affinity, topology, load, migration cost, etc.
Future work:
- Find a better way to gather the topology (hwloc?).
- Distribute and parallelize TMLB_TreeBased across the different nodes (work in progress with the PPL).
- Make TMLB_TreeBased more scalable for large clusters: allow choosing the level in the hierarchy at which the algorithm is distributed.
- Hybrid architectures? Intel MIC?
- Continue the collaboration between Inria and the PPL.
34 The End
Thanks for your attention! Any questions?
More informationStrong Scaling for Molecular Dynamics Applications
Strong Scaling for Molecular Dynamics Applications Scaling and Molecular Dynamics Strong Scaling How does solution time vary with the number of processors for a fixed problem size Classical Molecular Dynamics
More informationUNDERSTANDING THE IMPACT OF MULTI-CORE ARCHITECTURE IN CLUSTER COMPUTING: A CASE STUDY WITH INTEL DUAL-CORE SYSTEM
UNDERSTANDING THE IMPACT OF MULTI-CORE ARCHITECTURE IN CLUSTER COMPUTING: A CASE STUDY WITH INTEL DUAL-CORE SYSTEM Sweety Sen, Sonali Samanta B.Tech, Information Technology, Dronacharya College of Engineering,
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationMore Data Locality for Static Control Programs on NUMA Architectures
More Data Locality for Static Control Programs on NUMA Architectures Adilla Susungi 1, Albert Cohen 2, Claude Tadonki 1 1 MINES ParisTech, PSL Research University 2 Inria and DI, Ecole Normale Supérieure
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationTechniques to improve the scalability of Checkpoint-Restart
Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart
More informationParallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs
Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung
More informationDISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES
DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Outline System Architectural Design Issues Centralized Architectures Application
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationParallel 3D Sweep Kernel with PaRSEC
Parallel 3D Sweep Kernel with PaRSEC Salli Moustafa Mathieu Faverge Laurent Plagne Pierre Ramet 1 st International Workshop on HPC-CFD in Energy/Transport Domains August 22, 2014 Overview 1. Cartesian
More informationPerformance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture. Alexander Berreth. Markus Bühler, Benedikt Anlauf
PADC Anual Workshop 20 Performance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture Alexander Berreth RECOM Services GmbH, Stuttgart Markus Bühler, Benedikt Anlauf IBM Deutschland
More informationDistributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju
Distributed Data Infrastructures, Fall 2017, Chapter 2 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note: Term Warehouse-scale
More informationCapability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL
SABELA RAMOS, TORSTEN HOEFLER Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL spcl.inf.ethz.ch Microarchitectures are becoming more and more complex CPU L1 CPU L1 CPU L1 CPU
More informationScalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance
Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander*, Esteban Meneses, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy, Laxmikant V. Kale* jliffl2@illinois.edu,
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationAdvanced Job Launching. mapping applications to hardware
Advanced Job Launching mapping applications to hardware A Quick Recap - Glossary of terms Hardware This terminology is used to cover hardware from multiple vendors Socket The hardware you can touch and
More informationBenchmark results on Knight Landing architecture
Benchmark results on Knight Landing architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Milano, 21/04/2017 KNL vs BDW A1 BDW A2 KNL cores per node 2 x 18 @2.3 GHz 1 x 68
More informationMulti-core Computing Lecture 2
Multi-core Computing Lecture 2 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh August 21, 2012 Multi-core Computing Lectures: Progress-to-date
More informationShared memory parallel algorithms in Scotch 6
Shared memory parallel algorithms in Scotch 6 François Pellegrini EQUIPE PROJET BACCHUS Bordeaux Sud-Ouest 29/05/2012 Outline of the talk Context Why shared-memory parallelism in Scotch? How to implement
More informationAdaptive MPI Multirail Tuning for Non-Uniform Input/Output Access
Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access S. Moreaud, B. Goglin and R. Namyst INRIA Runtime team-project University of Bordeaux, France Context Multicore architectures everywhere
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationDynamic load balancing in OSIRIS
Dynamic load balancing in OSIRIS R. A. Fonseca 1,2 1 GoLP/IPFN, Instituto Superior Técnico, Lisboa, Portugal 2 DCTI, ISCTE-Instituto Universitário de Lisboa, Portugal Maintaining parallel load balance
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationUsing Charm++ to Support Multiscale Multiphysics
LA-UR-17-23218 Using Charm++ to Support Multiscale Multiphysics On the Trinity Supercomputer Robert Pavel, Christoph Junghans, Susan M. Mniszewski, Timothy C. Germann April, 18 th 2017 Operated by Los
More informationAutomatic NUMA Balancing. Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP
Automatic NUMA Balancing Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP Automatic NUMA Balancing Agenda What is NUMA, anyway? Automatic NUMA balancing internals
More informationScalable Interaction with Parallel Applications
Scalable Interaction with Parallel Applications Filippo Gioachin Chee Wai Lee Laxmikant V. Kalé Department of Computer Science University of Illinois at Urbana-Champaign Outline Overview Case Studies Charm++
More informationNAMD Performance Benchmark and Profiling. February 2012
NAMD Performance Benchmark and Profiling February 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource -
More informationIntroduction to I/O Efficient Algorithms (External Memory Model)
Introduction to I/O Efficient Algorithms (External Memory Model) Jeff M. Phillips August 30, 2013 Von Neumann Architecture Model: CPU and Memory Read, Write, Operations (+,,,...) constant time polynomially
More informationQuantifying Load Imbalance on Virtualized Enterprise Servers
Quantifying Load Imbalance on Virtualized Enterprise Servers Emmanuel Arzuaga and David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston MA 1 Traditional Data Centers
More informationAccelerated Load Balancing of Unstructured Meshes
Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More informationVincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012
Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload
More informationHANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY
HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY Presenters: Harshitha Menon, Seonmyeong Bak PPL Group Phil Miller, Sam White, Nitin Bhat, Tom Quinn, Jim Phillips, Laxmikant Kale MOTIVATION INTEGRATED
More information