Charm++ for Productivity and Performance


A Submission to the 2011 HPC Class II Challenge

Laxmikant V. Kale, Anshu Arya, Abhinav Bhatele, Abhishek Gupta, Nikhil Jain, Pritish Jetley, Jonathan Lifflander, Phil Miller, Yanhua Sun, Ramprasad Venkataraman, Lukasz Wesolowski, Gengbin Zheng
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign
{kale, ramv}@illinois.edu
LLNL-PRES-513271
SC11: November 15, 2011

Benchmarks

Required:
- Dense LU Factorization
- 1D FFT
- Random Access

Optional:
- Molecular Dynamics
- Barnes-Hut

Charm++ Programming Model
- Object-based: express logic via indexed collections of interacting objects (both data and tasks)
- Over-decomposed: expose more parallelism than the available processors
- Runtime-assisted: scheduling, observation-based adaptivity, load balancing, composition, etc.
- Message-driven: trigger computation by invoking remote entry methods
- Non-blocking, asynchronous: implicitly overlapped data transfer
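
To make the model concrete, here is a minimal, hypothetical sketch of an over-decomposed 1D chare array driven by asynchronous entry-method invocations. It is not code from this submission; the module and method names (worker, Worker, doWork) are illustrative only.

    // worker.ci -- interface description (illustrative names)
    module worker {
      array [1D] Worker {
        entry Worker();
        entry void doWork(int iteration);   // remote, asynchronous entry method
      };
    };

    // C++ caller: over-decompose (many more objects than processors) and
    // trigger computation by sending messages rather than by blocking calls.
    CProxy_Worker workers = CProxy_Worker::ckNew(16 * CkNumPes());
    workers.doWork(0);   // broadcast; each element runs doWork() when its message arrives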

Charm++ Program Structure
- Regular C++ code; no special compilers
- Small parallel interface description file
  - Can contain a control-flow DAG
  - Parsed to generate more C++ code
- Inherit from framework classes to
  - Communicate with remote objects
  - Serialize objects for transmission
  - Exploit modern C++ program design techniques (OO, generics, etc.)
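
A hedged sketch of the C++ side of this structure, continuing the hypothetical worker.ci above: the chare class inherits from a generated base class, and a pup() method lets the runtime serialize the object for transmission or migration. None of this is the submission's actual code.

    #include <vector>
    #include "pup_stl.h"       // pup support for STL containers
    #include "worker.decl.h"   // generated from the interface file

    class Worker : public CBase_Worker {       // inherit from the generated framework class
      std::vector<double> data;                // ordinary C++ members
     public:
      Worker() : data(1024, 0.0) {}
      Worker(CkMigrateMessage *m) {}           // constructor used when migrating

      void doWork(int iteration) {             // entry method, invoked via remote message
        for (double &x : data) x += iteration;
      }

      void pup(PUP::er &p) {                   // serialization for transmission/migration
        CBase_Worker::pup(p);
        p | data;
      }
    };

    #include "worker.def.h"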

Charm++ Capabilities
- Promotes natural expression of parallelism
- Supports modularity
- Overlaps communication and computation
- Automatically balances load
- Automatically handles heterogeneous systems
- Adapts to reduce energy consumption
- Tolerates component failures (see the checkpoint sketch below)

For more info: http://charm.cs.illinois.edu/why/
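
As one concrete illustration of the fault-tolerance support, a minimal sketch of asking the runtime for a coordinated checkpoint. It assumes a main chare Main with an entry method resume() and a readonly proxy mainProxy; these names are illustrative, not the submission's.

    // Checkpoint every chare to disk, then invoke Main::resume() when finished.
    CkCallback cb(CkIndex_Main::resume(), mainProxy);
    CkStartCheckpoint("checkpoint_dir", cb);
    // A later run started with "+restart checkpoint_dir" resumes from this state.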

Metrics: Performance (our implementations in Charm++)

Code          Machine    Max Cores  Best Performance
LU            Cray XT5   8K         67.4% of peak
FFT           IBM BG/P   64K        2.512 TFlop/s
RandomAccess  IBM BG/P   64K        22.19 GUPS
MD            Cray XE6   16K        1.9 ms/step (125K atoms)
              IBM BG/P   64K        11.6 ms/step (1M atoms)
Barnes-Hut    IBM BG/P   16K        27 × 10^9 interactions/s

Metrics: Code Size (our implementations in Charm++)

Code          C++   CI   Total*  Libraries
LU            1231  418  1649    BLAS
FFT           112   47   159     FFTW, Mesh
RandomAccess  155   23   178     Mesh
MD            645   128  773
Barnes-Hut    2871  56   2927    TIPSY

C++: regular C++ code. CI: parallel interface descriptions and control-flow DAG.
* Required logic only, excluding test harness, input generation, verification, etc.

Remember: lots of freebies! Automatic load balancing, fault tolerance, overlap, composition, and portability.

LU: Capabilities
- Composable library
  - Modular program structure
  - Seamless execution structure (interleaved modules)
- Block-centric
  - Algorithm written from a block's perspective
  - Agnostic of processor-level considerations
- Separation of concerns
  - Domain specialist codes the algorithm
  - Systems specialist codes the tuning, resource management, etc. (see the mapping sketch below)

Lines of code and module-specific commits:

Module             CI   C++  Total  Module-specific Commits
Factorization      517  419  936    472/572 (83%)
Mem. Aware Sched.  9    492  501    86/125 (69%)
Mapping            10   72   82     29/42 (69%)
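
The small "Mapping" module is possible because Charm++ lets object placement be specified separately from the algorithm. Below is a hedged sketch of the general mechanism, not the submission's actual mapping code: a custom CkArrayMap whose procNum() the runtime consults when placing chare array elements. BlockCyclicMap and the placement rule are illustrative, and the map would also be declared as a group in the interface file.

    class BlockCyclicMap : public CkArrayMap {
     public:
      int procNum(int /*arrayHdl*/, const CkArrayIndex &idx) override {
        const int *coords = idx.data();        // 2D index of the block
        int row = coords[0], col = coords[1];
        return (row + col * 7) % CkNumPes();   // illustrative placement rule
      }
    };

    // Usage sketch:
    //   CkArrayOptions opts(nBlocks, nBlocks);
    //   opts.setMap(CProxy_BlockCyclicMap::ckNew());
    //   CProxy_Block blocks = CProxy_Block::ckNew(opts);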

LU: Capabilities (continued)
- Flexible data placement: experiment with data layout
- Memory-constrained adaptive lookahead

LU: Performance
Weak scaling on Cray XT5 (N chosen so that the matrix fills 75% of memory): from 128 to 8,192 cores, sustained performance stays between 65.7% and 67.4% of theoretical peak (total TFlop/s vs. number of cores).

LU: Performance
...and strong scaling too (N = 96,000): on IBM BG/P from 128 to 8,192 cores, efficiencies range from roughly 32% to 60% of theoretical peak, plotted alongside the XT5 weak-scaling results (total TFlop/s vs. number of cores).

FFT: Parallel Coordination Code
(Expressed in the parallel interface file as a control-flow DAG.)

    dofft()
      for (phase = 0; phase < 3; ++phase) {
        atomic { sendtranspose(); }                   // send this object's data for the transpose
        for (count = 0; count < P; ++count)
          when recvtranspose[phase](fftmsg *msg)      // wait for each incoming piece of this phase
            atomic { applytranspose(msg); }
        if (phase < 2)
          atomic {
            fftw_execute(plan);                       // local FFTs
            if (phase == 0) twiddle();
          }
      }

FFT: Performance
IBM Blue Gene/P (Intrepid), using 25% of memory, ESSL with FFTW wrappers. GFlop/s vs. cores from 256 to 65,536, comparing a point-to-point all-to-all with the Charm++ mesh all-to-all, against the serial FFT limit. The Charm++ all-to-all is asynchronous, non-blocking, topology-aware, combining, and streaming.

Random Access
What Charm++ brings to the table:
- Productivity
  - Automatically detect completion by sensing quiescence (see the sketch below)
  - Automatically detect the network topology of the partition
- Performance
  - Uses the same Charm++ all-to-all
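
"Sensing quiescence" refers to Charm++'s built-in quiescence detection: the runtime notices when no messages remain in flight or unprocessed, so the benchmark needs no explicit completion counting. A hedged sketch of the idiom, with updaters, Main, and allDone() as illustrative names:

    // Start the asynchronous updates, then ask the runtime to invoke
    // Main::allDone() once every message has been delivered and processed.
    updaters.generateUpdates();   // broadcast to the chare array
    CkStartQD(CkCallback(CkIndex_Main::allDone(), mainProxy));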

Random Access: Performance
IBM Blue Gene/P (Intrepid), 2 GB of memory per node. GUPS vs. number of cores from 128 to 64K, plotted against perfect scaling; Charm++ reaches 22.19 GUPS at 64K cores.

Optional Benchmarks
Why MD and Barnes-Hut?
- Relevant scientific computing kernels
- Challenge the parallelization paradigm
  - Load imbalances
  - Dynamic communication structure
- Express non-trivial parallel control flow

Molecular Dynamics: Overview
1. Mimics the force calculation in NAMD
2. Resembles the miniMD application in the Mantevo benchmark suite
3. SLOC is 773, compared to just under 3,000 lines for miniMD

Figure: (a) 1-away decomposition; (b) 2-awayX decomposition.

MD: Performance
125,000 atoms on Cray XE6 (Hopper). Time per step vs. number of cores (264 to 16,392), without load balancing (No LB) and with refinement-based load balancing (Refine LB); reaches 1.91 ms/step.

MD: Performance
1 million atoms on IBM Blue Gene/P (Intrepid). Speedup vs. number of cores (256 to 65,536) against ideal scaling; 11.6 ms/step at 64K cores.

MD: Performance
The number of cores does not have to be a power of two: time per step on Intrepid for core counts from 64 to 128 in increments of 8.

Barnes-Hut: Productivity
1. Adaptive overlap of computation and communication lets the latency of requests for remote data be hidden by useful local computation on the PEs.
2. Automatic measurement-based load balancing dissociates data decomposition from task assignment: communication is balanced through the oct-decomposition and computation through a separate load-balancing strategy (a minimal sketch follows).
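
The measurement-based load balancing above is driven by the runtime: a migratable chare opts in, periodically hands control to the balancer, and resumes after any migration. A hedged sketch of the idiom (TreePiece and its methods are illustrative names; the corresponding interface-file declarations are omitted, and the balancing strategy itself is chosen at launch, e.g. with +balancer):

    class TreePiece : public CBase_TreePiece {
     public:
      TreePiece() { usesAtSync = true; }     // participate in measurement-based balancing
      TreePiece(CkMigrateMessage *m) {}

      void endOfTimestep() {
        AtSync();                            // hand control to the runtime's load balancer
      }
      void ResumeFromSync() override {       // called back after (possible) migration
        startNextTimestep();                 // continue the simulation
      }
      void startNextTimestep() { /* ... */ }

      void pup(PUP::er &p) {                 // serialize particle data so the object can migrate
        CBase_TreePiece::pup(p);
        /* p | particles; ... */
      }
    };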

Barnes-Hut: Performance
Non-uniform (Plummer) distribution on IBM Blue Gene/P (Intrepid). Time per step vs. cores from 2K to 16K for 10-million- and 50-million-particle datasets; an inset histogram shows the Plummer 100k distribution of particle distances from the center of mass.

Charm++ at SC11
- Temperature-aware load balancing: Tue @ 2:00 pm
- Fault tolerance protocol: PhD Forum, Tue @ 3:45 pm
- NAMD at 200K+ cores: Thu @ 11:00 am
- Topology-aware mapping for PERCS: Thu @ 4:00 pm
- Parallel stochastic optimization: poster
- All-to-all simulations on PERCS: poster

For more info: http://charm.cs.illinois.edu/why/

MeshStreamer: Message Routing and Aggregation