X10 and APGAS at Petascale Olivier Tardieu 1, Benjamin Herta 1, David Cunningham 2, David Grove 1, Prabhanjan Kambadur 1, Vijay Saraswat 1, Avraham Shinnar 1, Mikio Takeuchi 3, Mandana Vaziri 1 1 IBM T.J. Watson Research Center 2 Google Inc. 3 IBM Research Tokyo This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
Background

X10 tackles the challenge of programming at scale (HPC, cluster, cloud):
- scale out: run across many distributed nodes (this talk & PPAA talk)
- scale up: exploit multi-core and accelerators (CGO tutorial)
- resilience and elasticity (next talk)

X10 is a programming language:
- imperative, object-oriented, strongly-typed, garbage-collected (like Java)
- concurrent and distributed: Asynchronous Partitioned Global Address Space model
- an open-source tool chain developed at IBM Research (X10 2.4.2 just released)
- a growing community: X10 workshop at PLDI 14, CFP at http://x10-lang.org

Double goal: productivity and performance
Outline
- X10 programming model: Asynchronous Partitioned Global Address Space
- Optimizations for scale out: distributed termination detection, high-performance networks, memory management
- Performance results: Power 775 architecture, benchmarks
- Global load balancing: Unbalanced Tree Search at scale
X10 Overview
APGAS Places and Tasks

[Diagram: Place 0 ... Place N, each with its own local heap and running tasks; global references can point into remote heaps]

- Task parallelism: async S, finish S
- Concurrency control: when(c) S, atomic S
- Place-shifting operations: at(p) S, at(p) e
- Distributed heap: GlobalRef[T], PlaceLocalHandle[T]
APGAS Idioms

Remote procedure call:
  v = at(p) f(arg1, arg2);

Atomic remote update:
  at(ref) async atomic ref() += v;

Active message:
  at(p) async m(arg1, arg2);

SPMD:
  finish for (p in Place.places()) {
    at (p) async {
      for (i in 1..n) {
        async doWork(i);
      }
    }
  }

Divide-and-conquer parallelism:
  def fib(n:Long):Long {
    if (n < 2) return n;
    val x:Long;
    val y:Long;
    finish {
      async x = fib(n-1);
      y = fib(n-2);
    }
    return x + y;
  }

The finish construct is transitive and can cross place boundaries.
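The concurrency-control constructs from the previous slide, when(c) S and atomic S, are not among these idioms. Here is a minimal hedged sketch (not from the slides) of a one-shot latch built with them:

  // A one-shot latch: one task waits, another releases.
  class Latch {
    private var released:Boolean = false;

    // atomic S: update the shared flag atomically with respect to other tasks at this place
    public def release() { atomic released = true; }

    // when (c) S: block the current task until the condition holds, then run S atomically
    public def await() { when (released) {} }
  }

  // usage sketch (doWork() is a placeholder):
  //   val l = new Latch();
  //   async { doWork(); l.release(); }
  //   l.await();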
Example: BlockDistRail.x10

  public class BlockDistRail[T] {
    protected val sz:Long; // block size
    protected val raw:PlaceLocalHandle[Rail[T]];

    public def this(sz:Long, places:Long){T haszero} {
      this.sz = sz;
      raw = PlaceLocalHandle.make[Rail[T]](PlaceGroup.make(places), ()=>new Rail[T](sz));
    }

    public operator this(i:Long) = (v:T) { at (Place(i/sz)) raw()(i%sz) = v; }
    public operator this(i:Long) = at (Place(i/sz)) raw()(i%sz);

    public static def main(Rail[String]) {
      val rail = new BlockDistRail[Long](5, 4);
      rail(7) = 8;
      Console.OUT.println(rail(7));
    }
  }

[Diagram: indices 0-19 distributed in blocks of 5 over Place 0..Place 3; element 7 holds the value 8, all others 0]
Optimizations for Scale Out
Distributed Termination Detection

Local finish is easy:
- synchronized counter: increment when a task is spawned, decrement when a task ends

Distributed finish is non-trivial:
- the network can reorder increment and decrement messages

X10 algorithm: disambiguation in space (space overhead)
- one row of n counters per place, with n places
- when place p spawns a task at place q, increment counter q at place p
- when a task terminates at place p, decrement counter p at place p
- finish is triggered when the sum of each column is zero

Charm++ algorithm: disambiguation in time (communication overhead)
- successive non-overlapping waves of termination detection
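A minimal sketch of the per-place bookkeeping behind the "disambiguation in space" scheme; the class and method names are illustrative, not the runtime's actual data structures:

  // Each place keeps one counter per destination place.
  class FinishRow {
    // row(q) counts tasks this place has spawned at place q,
    // minus terminations observed at this place
    val row = new Rail[Long](Place.numPlaces());

    // place p spawns a task destined for place q: increment counter q at place p
    def spawned(q:Long) { atomic row(q) = row(q) + 1; }

    // a task terminates at this place p: decrement counter p at place p
    def terminated() { atomic row(here.id) = row(here.id) - 1; }
  }
  // The finish is complete when, for every column q,
  // the sum of row(q) over all places is zero.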
Optimized Distributed Termination Detection

Source optimizations:
- aggregate messages at the source
- compress

Software routing:
- aggregate messages at intermediate nodes

Pattern-based specialization:
- put: a finish governing a single task → wait for one ack
- get: a finish governing a round trip → wait for the return task
- local finish: a finish with no remote tasks → single counter
- SPMD finish: a finish with no nested remote tasks → single counter
- irregular/dense finish: a finish with lots of links → software routing

Runtime optimizations + static analysis + pragmas → scalable finish
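One way a program can opt into a specialized finish implementation is through compiler pragmas. A minimal sketch, assuming the @Pragma annotation and the Pragma.FINISH_SPMD constant from the x10.compiler package (names may differ across X10 versions); doWork() is a placeholder for per-place computation:

  import x10.compiler.Pragma;

  // Request the single-counter SPMD finish for a finish whose body spawns
  // exactly one task per place and no nested remote tasks.
  @Pragma(Pragma.FINISH_SPMD) finish for (p in Place.places()) {
    at (p) async doWork();
  }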
High-Performance Networks

RDMAs:
- efficient remote memory operations
- asynchronous semantics: just another task → good fit for APGAS

  Array.asyncCopy[Double](src, srcIndex, dst, dstIndex, size);

Collectives:
- multi-point coordination and communication
- networks/APIs biased towards SPMD today → poor fit for APGAS today

  Team.WORLD.barrier(here.id);
  columnTeam.addReduce(columnRole, localMax, Team.MAX);

- future: MPI-3 and beyond (one-sided collectives, endpoints, etc.) → good fit for APGAS
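Because an asyncCopy is tracked like any other task, an enclosing finish waits for the RDMA to complete. A minimal sketch, assuming the newer Rail/GlobalRail flavor of the copy API (X10 2.4-style; names and signatures may differ across versions) and at least two places:

  val size = 1000;
  val src = new Rail[Double](size, (i:Long) => i as Double);
  // allocate the destination rail at place 1 and return a global handle to it
  val dst = at (Place(1)) GlobalRail[Double](new Rail[Double](size));
  finish {
    // one-sided put: performed as an RDMA where the network supports it,
    // and tracked by the enclosing finish like any other task
    Rail.asyncCopy(src, 0, dst, 0, size);
  }
  // at this point the copy into the remote rail has completed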
Memory Management

Garbage collector
- problem: distributed heap; distributed garbage collection is impractical
- solution: segregate local/remote references → issue is contained
  - only local references are automatically collected

Congruent memory allocator
- problem: low-level requirements
  - large pages required to minimize TLB misses
  - registered pages required for RDMAs
  - congruent addresses required for RDMAs at scale
- solution: dedicated memory allocator → issue is contained
  - congruent registered pages, large pages if available
  - only used for performance-critical arrays
  - only impacts allocation & deallocation
Performance Results
DARPA HPCS/PERCS Prototype (Power 775)

Compute node: 32 POWER7 cores at 3.84 GHz, 128 GB DRAM, peak performance 982 Gflops, Torrent interconnect
Drawer: 8 nodes
Rack: 8 to 12 drawers
Full prototype: up to 1,740 compute nodes, up to 55,680 cores, up to 1.7 petaflops (1 petaflops with 1,024 compute nodes)
DARPA HPCS/PERCS Benchmarks

HPC Challenge benchmarks:
- Linpack: TOP500 (flops)
- Stream Triad: local memory bandwidth
- Random Access: distributed memory bandwidth
- Fast Fourier Transform: mix

Machine learning kernels:
- KMeans: graph clustering
- SSCA1: pattern matching
- SSCA2: irregular graph traversal
- UTS: unbalanced tree traversal

Implemented in X10 as pure scale-out tests: one core = one place = one main task; native libraries for sequential math kernels (ESSL, FFTE, SHA1)
Performance at Scale (Weak Scaling)

Benchmark       | Cores at scale | Performance at scale | Efficiency at scale vs. single host (weak scaling) | Performance vs. best available implementation
Stream          | 55,680         | 397 TB/s             | 98%                   | 87%
FFT             | 32,768         | 28.7 Tflop/s         | 100%                  | 41% (no tuning)
Linpack         | 32,768         | 589 Tflop/s          | 87%                   | 85%
RandomAccess    | 32,768         | 843 Gup/s            | 100%                  | 81%
KMeans          | 47,040         | —                    | 98%                   | ?
SSCA1           | 47,040         | —                    | 98%                   | ?
SSCA2           | 47,040         | 245 B edges/s        | see paper for details | ?
UTS (geometric) | 55,680         | 596 B nodes/s        | 98%                   | prior implementations do not scale
HPCC Class 2 Competition 2012: Best Performance Award

[Charts: absolute and per-place performance vs. number of places for five benchmarks]
- G-FFT: 26,958 Gflops
- G-HPL: 589,231 Gflops
- EP Stream (Triad): 396,614 GB/s
- G-RandomAccess: 844 Gups
- UTS: 596,451 million nodes/s
Global Load Balancing
Unbalanced Tree Search at Scale

Problem:
- count nodes in a randomly generated tree → unbalanced & unpredictable
- separable random number generator → no locality constraint

Lifeline-based global work stealing [PPoPP'11]:
- n random victims, then p lifelines (hypercube)
- steal (synchronous), then deal (asynchronous)

Novel optimizations:
- use of nested finish scopes → scalable finish
- use of the dense finish pattern for the root finish
- use of the get finish pattern for random steal attempts
- pseudo-random steals → software routing
- compact work-queue encoding (for shallow trees) → less state, smaller messages
- lazy expansion of intervals of nodes (siblings)
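A heavily simplified, hedged sketch of the steal-then-deal control flow; all names are illustrative, and the real UTS implementation additionally expands tree nodes, encodes sibling intervals compactly, and routes steal messages through the network:

  class Worker {
    val queue = new Rail[Long](1024);   // illustrative fixed-capacity work queue
    var top:Long = 0;                   // number of pending items
    var active:Boolean = false;

    def push(v:Long) { queue(top++) = v; }

    // Drain local work. When the queue runs dry, the real worker first attempts
    // synchronous steals from random victims, then goes idle and waits for a
    // lifeline deal instead of spinning.
    def process() {
      active = true;
      while (top > 0) {
        val item = queue(--top);
        // expand(item) would push the item's children here
      }
      active = false;
    }

    // An asynchronous deal arriving on a lifeline hands over loot and restarts the worker.
    def deal(loot:Rail[Long]) {
      for (i in 0..(loot.size-1)) push(loot(i));
      if (!active) async process();
    }
  }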
Conclusions

Performance:
- X10 and APGAS can scale to petaflop systems

Productivity:
- X10 and APGAS can implement legacy algorithms, such as statically scheduled and distributed codes
- X10 and APGAS can ease the development of novel scalable codes, including irregular and unbalanced workloads
- APGAS constructs deliver productivity and performance gains at scale

Follow-up work presented at PPAA'14:
- APGAS global load balancing library derived from UTS
- application to SSCA2