X10 and APGAS at Petascale

Olivier Tardieu (1), Benjamin Herta (1), David Cunningham (2), David Grove (1), Prabhanjan Kambadur (1), Vijay Saraswat (1), Avraham Shinnar (1), Mikio Takeuchi (3), Mandana Vaziri (1)

(1) IBM T.J. Watson Research Center
(2) Google Inc.
(3) IBM Research - Tokyo

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.

Background

X10 tackles the challenge of programming at scale (HPC, cluster, cloud):
- scale out: run across many distributed nodes → this talk & PPAA talk
- scale up: exploit multi-core and accelerators → CGO tutorial
- resilience and elasticity → next talk

X10 is a programming language:
- imperative, object-oriented, strongly-typed, garbage-collected (like Java)
- concurrent and distributed: Asynchronous Partitioned Global Address Space model
- an open-source tool chain developed at IBM Research → X10 2.4.2 just released
- a growing community: X10 workshop at PLDI'14 → CFP at http://x10-lang.org

Double goal: productivity and performance

Outline

- X10 programming model: Asynchronous Partitioned Global Address Space
- Optimizations for scale out: distributed termination detection, high-performance networks, memory management
- Performance results: Power 775 architecture, benchmarks
- Global load balancing: Unbalanced Tree Search at scale

X10 Overview

APGAS: Places and Tasks

[Figure: Places 0 through N, each holding a local heap and a pool of tasks; global references point into remote heaps]

- Task parallelism: async S; finish S
- Concurrency control: when (c) S; atomic S
- Place-shifting operations: at (p) S; at (p) e
- Distributed heap: GlobalRef[T], PlaceLocalHandle[T]

A minimal runnable program combining these constructs is sketched below.
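
This sketch is not from the slides: it is a minimal, self-contained X10 program (the class name HelloAPGAS is hypothetical) that combines finish, async, and at. Since finish crosses place boundaries, the enclosing finish waits for the remote tasks as well.

    public class HelloAPGAS {
        public static def main(Rail[String]) {
            // finish waits for every async it governs, including tasks
            // spawned at remote places via at (p) async
            finish for (p in Place.places()) {
                at (p) async {
                    Console.OUT.println("hello from a task running at " + here);
                }
            }
            Console.OUT.println("all places reported back");
        }
    }

With Managed X10, the number of places is typically chosen at launch time (for example via the X10_NPLACES environment variable).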

APGAS Idioms

Remote procedure call:

    v = at (p) f(arg1, arg2);

Atomic remote update:

    at (ref) async atomic ref() += v;

Active message:

    at (p) async m(arg1, arg2);

SPMD:

    finish for (p in Place.places()) {
        at (p) async {
            for (i in 1..n) {
                async doWork(i);
            }
        }
    }

Divide-and-conquer parallelism:

    def fib(n:Long):Long {
        if (n < 2) return n;
        val x:Long;
        val y:Long;
        finish {
            async x = fib(n-1);
            y = fib(n-2);
        }
        return x + y;
    }

Note: the finish construct is transitive and can cross place boundaries.

Example: BlockDistRail.x10

    public class BlockDistRail[T] {
        protected val sz:Long; // block size
        protected val raw:PlaceLocalHandle[Rail[T]];

        public def this(sz:Long, places:Long){T haszero} {
            this.sz = sz;
            raw = PlaceLocalHandle.make[Rail[T]](PlaceGroup.make(places), ()=>new Rail[T](sz));
        }
        public operator this(i:Long) = (v:T) { at (Place(i/sz)) raw()(i%sz) = v; }
        public operator this(i:Long) = at (Place(i/sz)) raw()(i%sz);
        public static def main(Rail[String]) {
            val rail = new BlockDistRail[Long](5, 4);
            rail(7) = 8;
            Console.OUT.println(rail(7));
        }
    }

[Figure: indices 0-19 block-distributed in blocks of 5 over Places 0-3; after main runs, element 7 (in Place 1's block) holds 8 and all other elements hold 0]

Optimizations for Scale Out

Distributed Termination Detection

Local finish is easy:
- synchronized counter: increment when a task is spawned, decrement when a task ends

Distributed finish is non-trivial:
- the network can reorder increment and decrement messages

X10 algorithm: disambiguation in space → space overhead (sketched below)
- one row of n counters per place, with n places
- when place p spawns a task at place q, increment counter q at place p
- when a task terminates at place p, decrement counter p at place p
- finish is triggered when the sum of each column is zero

Charm++ algorithm: disambiguation in time → communication overhead
- successive non-overlapping waves of termination detection
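
The following is a minimal sequential model of the counter scheme described above, not the runtime's actual implementation (class and method names are illustrative; in the real runtime, row p lives at place p and is updated by messages):

    public class FinishCounters {
        val n:Long;            // number of places
        val counts:Rail[Long]; // n x n counter matrix, row-major: row = owning place

        def this(n:Long) {
            this.n = n;
            this.counts = new Rail[Long](n*n);
        }

        // place src spawns a task at place dst: increment counter dst in row src
        def spawned(src:Long, dst:Long) { counts(src*n + dst) += 1; }

        // a task terminates at place p: decrement counter p in row p
        def terminated(p:Long) { counts(p*n + p) -= 1; }

        // the finish may be triggered once every column sums to zero
        def quiescent():Boolean {
            for (q in 0..(n-1)) {
                var sum:Long = 0;
                for (p in 0..(n-1)) sum += counts(p*n + q);
                if (sum != 0) return false;
            }
            return true;
        }

        public static def main(Rail[String]) {
            val f = new FinishCounters(2);
            f.spawned(0, 1);                    // place 0 spawns a task at place 1
            Console.OUT.println(f.quiescent()); // false: column 1 sums to 1
            f.terminated(1);                    // the task ends at place 1
            Console.OUT.println(f.quiescent()); // true: every column sums to 0
        }
    }

The point of the row-per-place layout is that a task's increment and its matching decrement are recorded in rows owned by different places, which is what "disambiguation in space" refers to.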

Optimized Distributed Termination Detection

Source optimizations:
- aggregate messages at the source
- compress

Software routing:
- aggregate messages at intermediate nodes

Pattern-based specialization:
- put: a finish governing a single task → wait for one ack
- get: a finish governing a round trip → wait for the return task
- local finish: a finish with no remote tasks → single counter
- SPMD finish: a finish with no nested remote tasks → single counter
- irregular/dense finish: a finish with lots of links → software routing

Runtime optimizations + static analysis + pragmas → scalable finish

High-Performance Networks

RDMAs:
- efficient remote memory operations
- asynchronous semantics: just another task → good fit for APGAS

    Array.asyncCopy[Double](src, srcIndex, dst, dstIndex, size);

Collectives:
- multi-point coordination and communication
- networks/APIs biased towards SPMD today → poor fit for APGAS today

    Team.WORLD.barrier(here.id);
    columnTeam.addReduce(columnRole, localMax, Team.MAX);

- future: MPI-3 and beyond → good fit for APGAS (one-sided collectives, endpoints, etc.)

Memory Management

Garbage collector:
- problem: distributed heap; distributed garbage collection is impractical
- solution: segregate local and remote references → issue is contained
  - only local references are automatically collected

Congruent memory allocator:
- problem: low-level requirements
  - large pages required to minimize TLB misses
  - registered pages required for RDMAs
  - congruent addresses required for RDMAs at scale
- solution: dedicated memory allocator → issue is contained
  - congruent registered pages, large pages if available
  - only used for performance-critical arrays
  - only impacts allocation & deallocation

Performance Results

DARPA HPCS/PERCS Prototype (Power 775)

Compute node:
- 32 Power7 cores at 3.84 GHz
- 128 GB DRAM
- peak performance: 982 Gflops
- Torrent interconnect

Drawer: 8 nodes
Rack: 8 to 12 drawers

Full prototype (see the arithmetic check below):
- up to 1,740 compute nodes
- up to 55,680 cores
- up to 1.7 petaflops
- 1 petaflops with 1,024 compute nodes
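
A quick consistency check on these figures (simple arithmetic, not from the slides):

    1,740 nodes x 32 cores/node    = 55,680 cores
    1,740 nodes x 982 Gflops/node  ≈ 1.71 Pflops peak
    1,024 nodes x 982 Gflops/node  ≈ 1.01 Pflops peak (about 1 petaflops)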

DARPA HPCS/PERCS Benchmarks

HPC Challenge benchmarks:
- Linpack: TOP500 (flops)
- Stream Triad: local memory bandwidth
- Random Access: distributed memory bandwidth
- Fast Fourier Transform: mix

Machine learning kernels:
- KMeans: graph clustering
- SSCA1: pattern matching
- SSCA2: irregular graph traversal
- UTS: unbalanced tree traversal

Implemented in X10 as pure scale-out tests:
- one core = one place = one main task
- native libraries for sequential math kernels: ESSL, FFTE, SHA1

Performance at Scale (Weak Scaling)

Benchmark         Cores at scale   Performance at scale   Efficiency vs. single host (weak scaling)   Relative to best available implementation
Stream            55,680           397 TB/s               98%                                          87%
FFT               32,768           28.7 Tflop/s           100%                                         41% (no tuning)
Linpack           32,768           589 Tflop/s            87%                                          85%
RandomAccess      32,768           843 Gup/s              100%                                         81%
KMeans            47,040           -                      98%                                          ?
SSCA1             47,040           -                      98%                                          ?
SSCA2             47,040           245 B edges/s          see paper for details                        ?
UTS (geometric)   55,680           596 B nodes/s          98%                                          prior implementations do not scale

HPCC Class 2 Competition 2012: Best Performance Award

[Figure: weak-scaling charts for the five submitted benchmarks, showing aggregate and per-place performance versus number of places; peak measurements are approximately G-FFT 26,958 Gflops and G-HPL 589,231 Gflops at 32,768 places, EP Stream (Triad) 396,614 GB/s and UTS 596,451 million nodes/s at 55,680 places, and G-RandomAccess 844 Gups]

Global Load Balancing

Unbalanced Tree Search at Scale

Problem:
- count the nodes of a randomly generated tree → unbalanced & unpredictable
- separable random number generator → no locality constraint

Lifeline-based global work stealing [PPoPP'11]:
- n random victims, then p lifelines (hypercube)
- steal (synchronous), then deal (asynchronous)

Novel optimizations:
- use of nested finish scopes → scalable finish
- use of the dense finish pattern for the root finish
- use of the get finish pattern for random steal attempts
- pseudo-random steals → software routing
- compact work-queue encoding (for shallow trees) → less state, smaller messages
- lazy expansion of intervals of nodes (siblings); see the interval sketch below
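
A minimal sketch of the interval idea, with illustrative names rather than the UTS code from the paper: pending sibling nodes under a common parent are kept as an index interval, so dealing half of the work to a thief ships just two numbers (plus the parent's descriptor, omitted here) instead of explicit node records.

    public class Interval {
        var lower:Long;   // index of the next sibling still to be expanded
        var upper:Long;   // one past the last pending sibling index

        def this(lower:Long, upper:Long) {
            this.lower = lower;
            this.upper = upper;
        }

        def size() = upper - lower;
        def canSplit() = size() >= 2;

        // deal the upper half of the interval to a thief; the victim keeps
        // the lower half (caller must check canSplit() first)
        def split():Interval {
            val mid = lower + size()/2;
            val loot = new Interval(mid, upper);
            upper = mid;
            return loot;
        }

        public static def main(Rail[String]) {
            val work = new Interval(0, 8);  // 8 pending siblings
            if (work.canSplit()) {
                val loot = work.split();
                Console.OUT.println("victim keeps [" + work.lower + "," + work.upper
                    + "), thief gets [" + loot.lower + "," + loot.upper + ")");
            }
        }
    }

Nodes in an interval are only materialized lazily when they are actually processed, which keeps both the local work queue and the steal messages small.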

Conclusions

Performance:
- X10 and APGAS can scale to petaflop systems

Productivity:
- X10 and APGAS can implement legacy algorithms, such as statically scheduled and distributed codes
- X10 and APGAS can ease the development of novel scalable codes, including irregular and unbalanced workloads
- APGAS constructs deliver productivity and performance gains at scale

Follow-up work presented at PPAA'14:
- APGAS global load balancing library derived from UTS
- application to SSCA2