Dr. Gengbin Zheng and Ehsan Totoni, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign. April 18, 2011



BigSim is a function-level simulator for parallel applications on petascale machines. Its emulator runs applications at full scale; its simulator predicts performance by simulating the message passing.

We have improved the simulation framework in several respects:
- Even larger and faster emulation and simulation
- More accurate prediction of sequential performance
- Applying BigSim to predict the Blue Waters machine

Emulation runs on existing (possibly different) clusters, with memory reduction techniques applied. For example, for NAMD we were able to:
- Emulate the 10M-atom benchmark on 256K target cores using only 4K cores of Ranger (2 GB/core)
- Emulate the 100M-atom benchmark on 1.2M target cores using only 48K cores of Jaguar (1.3 GB/core)

Predicting the performance of parallel applications is challenging: it requires accurate prediction of both the sequential execution blocks (SEBs) and the network performance. This work explores techniques that use statistical models for sequential execution blocks:
Gengbin Zheng, Gagan Gupta, Eric Bohm, Isaac Dooley, and Laxmikant V. Kale, "Simulating Large Scale Parallel Applications using Statistical Models for Sequential Execution Blocks", Proceedings of the 16th International Conference on Parallel and Distributed Systems (ICPADS 2010).


Existing approaches for predicting sequential performance have drawbacks:
- CPU scaling: scale execution times from one machine to predict another; not accurate
- Hardware simulators: cycle-accurate timing, but slow
- Performance counters: very hard to use, and platform specific
Can we find an efficient and accurate prediction method?

Our approach:
- Identify parameters that characterize the execution time of an SEB: x_1, x_2, ..., x_n
- With machine learning, build a statistical model of the execution time: T = f(x_1, x_2, ..., x_n)
- Run SEBs in a realistic scenario and collect data points: (x_11, x_21, ..., x_n1) => T_1; (x_12, x_22, ..., x_n2) => T_2; ...


Two ways to collect the data points:
- Slicing (C2): record and replay any processor; instruments the exact parameters; requires a binary-compatible architecture
- Miniaturization (C1): scaled-down execution (smaller dataset, smaller number of processors); not all applications can scale down

Machine learning techniques used:
- Linear regression
- Least median squared linear regression
- SVMreg
We use machine learning software such as weka. As an example: T(N) = 0.036*N^2 + 0.009*N^3 + 12.47 (a rough sketch of this kind of fit follows below).
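As an illustration of this kind of regression, here is a minimal sketch that fits the slide's example model with ordinary least squares; numpy stands in for weka, and the sample data points are invented for illustration, not measured:

```python
import numpy as np

# Hypothetical (N, T) data points collected by running SEBs in a realistic
# scenario; these values are made up for illustration.
N = np.array([16.0, 32.0, 64.0, 128.0, 256.0])
T = np.array([51.0, 343.0, 2520.0, 19500.0, 154000.0])

# Design matrix with the features of the slide's example model:
# T(N) = a*N^2 + b*N^3 + c
X = np.column_stack([N**2, N**3, np.ones_like(N)])

# Ordinary least squares (weka's linear regression plays this role
# in the original work).
(a, b, c), *_ = np.linalg.lstsq(X, T, rcond=None)
print(f"T(N) = {a:.3f}*N^2 + {b:.3f}*N^3 + {c:.2f}")

def predict(n):
    """Predicted SEB duration for problem size n on the target machine."""
    return a * n**2 + b * n**3 + c

print(predict(512))
```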

Validation: using Abe to predict Ranger with the NAS BT benchmark (BT.D.1024). We emulate on 64 cores of Abe, recording 8 target cores, and replay on Ranger to predict.

NAMD STMV 1M-atom benchmark: 4096-core prediction using 512 cores of BG/P.

[Chart: native vs. predicted performance]

Blue Waters will be here soon:
- More than 300K POWER7 cores
- PERCS network: a unique network, not similar to any other, with no theory or legacy for it; applications and runtimes have been tuned for other networks
- NSF acceptance: sustained petaflops for real applications
- Tuning applications typically takes months to years after a machine comes online

The PERCS network is two-level fully connected, with supernodes of 32 nodes each. Link bandwidths:
- 24 GB/s LLocal (LL) links
- 5 GB/s LRemote (LR) links
- 10 GB/s D links
- 192 GB/s to the IBM HUB chip
[Diagram: nodes grouped into supernodes, with supernodes fully connected]
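To make the hierarchy concrete, below is a minimal sketch of how a packet-level model might classify the link used between two nodes, assuming the usual PERCS layout of 8 nodes per drawer and 4 drawers per 32-node supernode (LL within a drawer, LR within a supernode, D between supernodes); the function and constants are illustrative, not BigSim code:

```python
# Schematic classification of PERCS links between two nodes, using the
# bandwidths listed on the slide. The drawer size (8 nodes per drawer,
# 4 drawers per 32-node supernode) is an assumption for illustration.
NODES_PER_DRAWER = 8        # assumed
NODES_PER_SUPERNODE = 32    # from the slide

LINK_BW_GBPS = {"LL": 24, "LR": 5, "D": 10}  # GB/s per the slide

def link_type(node_a, node_b):
    """Return which PERCS link class a direct hop between two nodes uses."""
    if node_a // NODES_PER_SUPERNODE != node_b // NODES_PER_SUPERNODE:
        return "D"    # different supernodes: D link
    if node_a // NODES_PER_DRAWER != node_b // NODES_PER_DRAWER:
        return "LR"   # same supernode, different drawers: LRemote link
    return "LL"       # same drawer: LLocal link

print(link_type(0, 3), link_type(0, 17), link_type(0, 40))
```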

A detailed packet-level network model was developed for Blue Waters and validated against MERCURY and an IH drawer:
- Ping-pong within 1.1% of MERCURY
- Alltoall within 10% of the drawer
The model includes optimizations for that scale and different outputs (link statistics, Projections traces, ...). Some example studies (E. Totoni et al., submitted to SC11):
- Topology-aware mapping
- System noise
- Collective optimization

Topology-aware mapping study: apply different mappings in simulation and evaluate them using the output tools (link utilization, ...).

Example: mapping of a 3D stencil (a rough sketch of the idea follows after this list):
- Default mapping: rank-ordered mapping of MPI tasks to the cores of nodes
- Drawer mapping: 3D cuboids on nodes and drawers
- Supernode mapping: 3D cuboids on supernodes as well; 80% decrease in communication and 20% in overall time
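Below is a minimal sketch contrasting the default rank-ordered mapping with a cuboid (blocked) mapping for a 3D stencil. The grid size, node size, and block shape are placeholders rather than the Blue Waters configuration, and the off-node link count is only a rough proxy for communication volume:

```python
# Sketch contrasting a rank-ordered default mapping with a cuboid (blocked)
# mapping of a 3D stencil onto multi-core nodes. Sizes are placeholders.
import itertools

PX, PY, PZ = 8, 8, 8          # 3D process grid (512 MPI tasks)
CORES_PER_NODE = 32           # placeholder node size
BX, BY, BZ = 4, 4, 2          # cuboid block placed on one node (4*4*2 = 32)

def default_node(x, y, z):
    """Rank-ordered mapping: consecutive ranks fill a node."""
    rank = x * PY * PZ + y * PZ + z
    return rank // CORES_PER_NODE

def cuboid_node(x, y, z):
    """Blocked mapping: each BX x BY x BZ sub-cuboid lands on one node."""
    bx, by, bz = x // BX, y // BY, z // BZ
    return bx * (PY // BY) * (PZ // BZ) + by * (PZ // BZ) + bz

def off_node_links(node_of):
    """Count stencil neighbor pairs that cross node boundaries."""
    count = 0
    for x, y, z in itertools.product(range(PX), range(PY), range(PZ)):
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            nx, ny, nz = x + dx, y + dy, z + dz
            if nx < PX and ny < PY and nz < PZ:
                count += node_of(x, y, z) != node_of(nx, ny, nz)
    return count

print("off-node neighbor links, default:", off_node_links(default_node))
print("off-node neighbor links, cuboid: ", off_node_links(cuboid_node))
```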

BigSim's noise injection support:
- Captured noise traces from a node with the same architecture, replicated on all of the target machine's nodes with random offsets (not synchronized)
- Artificial periodic noise patterns for each core of a node, with periodicity, amplitude, and offset (phase), replicated on all nodes with random offsets
- The simulator increases the length of computations appropriately (a sketch of this follows below)
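Here is a minimal sketch of the periodic-noise idea, assuming a simple square-wave interruption: each noise event of duration `amplitude` that falls inside a computation block stretches it by that amount. The function name and numbers are illustrative:

```python
import math

def stretched_duration(start, work, period, amplitude, offset):
    """Wall-clock duration of `work` seconds of computation starting at
    `start` on a core that is interrupted by periodic noise events of
    length `amplitude`, one every `period`, phase-shifted by `offset`."""
    t = start
    remaining = work
    # First noise event at or after the start of the computation.
    k = max(0, math.ceil((start - offset) / period))
    next_noise = offset + k * period
    while remaining > 0:
        if next_noise <= t:
            # The core is preempted: the noise runs before work resumes.
            t += amplitude
            next_noise += period
        else:
            step = min(remaining, next_noise - t)
            t += step
            remaining -= step
    return t - start

# Example: 1 ms of computation with 100 us of noise every 1 ms.
print(stretched_duration(start=0.0, work=1e-3, period=1e-3,
                         amplitude=100e-6, offset=0.5e-3))
```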

Applications studied:
- NAMD: 10-million-atom system on 256K cores
- MILC: su3_rmd on 4K cores
- kNeighbor with Allreduce: 1 ms sequential computation, 2.4 ms total iteration time
- kNeighbor without Allreduce
We inspect them with different noise frequencies and amplitudes.

NAMD is noise-tolerant, but shows jumps around certain frequencies. MILC is sensitive even in small jobs (4K cores), which are affected almost as much as large jobs.

Simple analytical models are not enough: at high frequencies there is a chance that the small MPI calls inside collectives get hit by noise, delaying execution significantly. We also combine noise patterns (see the sketch below).
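As a plain illustration of combining patterns (not BigSim's implementation), several periodic patterns can simply be merged into one event stream:

```python
def noise_events(period, amplitude, offset, window):
    """(start_time, duration) pairs of one periodic pattern within [0, window)."""
    events, t = [], offset
    while t < window:
        events.append((t, amplitude))
        t += period
    return events

# Two illustrative patterns: frequent short noise plus rarer long noise.
patterns = [
    (1e-3,  50e-6, 0.0),    # 50 us every 1 ms
    (10e-3, 500e-6, 2e-3),  # 500 us every 10 ms, phase-shifted by 2 ms
]
combined = sorted(ev for p, a, o in patterns
                  for ev in noise_events(p, a, o, window=0.1))
print(len(combined), "noise events in a 100 ms window")
```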

We optimized Alltoall for large messages within a supernode (SN). BigSim's link utilization output gives the idea: the outgoing links of a node are not used effectively at all times during Alltoall; the usage of the different links is shifted in time.

New algorithm: each core sends to a different node, using all 31 outgoing links (a schedule sketch follows below). The link utilization now looks better.
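A minimal sketch of the round-based schedule implied here: in each round, the cores of a node target distinct destination nodes so that all 31 outgoing links are driven at once. The 32-node supernode size comes from the earlier slide; the scheduling function is an illustration, not the actual implementation:

```python
# Sketch of an Alltoall send schedule within a 32-node supernode where, in
# every round, the cores of a node target 31 distinct destination nodes so
# that all 31 outgoing links are busy simultaneously.
NODES = 32                # nodes per supernode (from the slides)
LINKS = NODES - 1         # direct links out of each node

def destination(node, core, rnd):
    """Destination node for `core` of `node` in round `rnd` (core < 31)."""
    return (node + 1 + (core + rnd) % LINKS) % NODES

# In any single round, one node's first 31 cores hit 31 distinct nodes:
round0 = {destination(0, core, 0) for core in range(LINKS)}
assert len(round0) == LINKS and 0 not in round0
print(sorted(round0))
```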

The new algorithm is up to 5 times faster than the existing pairwise exchange scheme, and much closer to the theoretical peak bandwidth of the links.