The Parallel Boost Graph Library spawn(Active Pebbles)

The Parallel Boost Graph Library spawn(active Pebbles) Nicholas Edmonds and Andrew Lumsdaine Center for Research in Extreme Scale Technologies Indiana University

Origins: the Boost Graph Library (1999). Generic programming with templates; good sequential performance with high-level abstractions; algorithm composition, visitors, property maps, etc. [Lie-Quan Lee, Jeremy Siek, and Andrew Lumsdaine. Generic Graph Algorithms for Sparse Matrix Ordering. ISCOPE, Lecture Notes in Computer Science 1732, 1999.]
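As a point of reference for the sequential-BGL style described here, the following is a minimal sketch (not code from the talk) using Boost.Graph's breadth_first_search with a small visitor that records discovery order; the example graph and the discovery_recorder visitor are illustrative.

    // Minimal sketch of sequential BGL usage: a generic BFS with a visitor.
    #include <boost/graph/adjacency_list.hpp>
    #include <boost/graph/breadth_first_search.hpp>
    #include <iostream>
    #include <vector>

    using Graph  = boost::adjacency_list<boost::vecS, boost::vecS, boost::undirectedS>;
    using Vertex = boost::graph_traits<Graph>::vertex_descriptor;

    // Visitor that records vertices in the order BFS discovers them.
    struct discovery_recorder : boost::default_bfs_visitor {
      std::vector<Vertex>* order = nullptr;
      void discover_vertex(Vertex v, const Graph&) const { order->push_back(v); }
    };

    int main() {
      Graph g(5);
      boost::add_edge(0, 1, g);
      boost::add_edge(0, 2, g);
      boost::add_edge(1, 3, g);
      boost::add_edge(2, 4, g);

      std::vector<Vertex> order;
      discovery_recorder vis;
      vis.order = &order;

      // The generic algorithm; visitors and property maps are the
      // composition hooks the slide refers to.
      boost::breadth_first_search(g, boost::vertex(0, g), boost::visitor(vis));

      for (Vertex v : order) std::cout << v << ' ';
      std::cout << '\n';
    }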

Origins: the Parallel BGL (2005). Lifting abstraction: make what we already have parallel, keeping the same interfaces as the sequential BGL; coarse-grained, BSP execution. [Douglas Gregor and Andrew Lumsdaine. The Parallel BGL: A Generic Library for Distributed Graph Computation. Workshop on Parallel Object-Oriented Scientific Computing, July 2005. Douglas Gregor and Andrew Lumsdaine. Lifting Sequential Graph Algorithms for Distributed-Memory Parallel Computation. Object-Oriented Programming, Systems, Languages, and Applications, October 2005.]

Case Study: Breadth-First Search

    ENQUEUE(Q, s)
    while (Q ≠ ∅)
        u ← DEQUEUE(Q)
        for each v ∈ Adj[u]
            if (color[v] = WHITE)
                color[v] ← GRAY
                ENQUEUE(Q, v)
        color[u] ← BLACK

SPMD Breadth-First Search [Figure: an example graph with vertices a-i partitioned across Processes 0-2; successive frames show each process's queue Q as the search advances, with newly discovered remote vertices (e.g., Q.push(d)) appended to the owning process's queue]

SPMD Breadth-First Search [Figure: Ranks 0-3 execute each level in lockstep: get neighbors, redistribute queues, combine received queues]

SPMD Breadth-First Search (Strong Scaling) [Plot: time in seconds vs. number of nodes (2-128) for SPMD/BSP breadth-first search on 2^27 vertices and 2^29 edges] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (four cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

SPMD Breadth-First Search (Weak Scaling) [Plot: time in seconds vs. number of nodes (1-128) for SPMD/BSP breadth-first search with 2^25 vertices and 2^27 edges per node] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (four cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

Find the Sequential Trap

    ENQUEUE(Q, s)
    while (Q ≠ ∅)
        u ← DEQUEUE(Q)
        for each v ∈ Adj[u]
            if (color[v] = WHITE)
                color[v] ← GRAY
                ENQUEUE(Q, v)
        color[u] ← BLACK

Find the Synchronization Trap

    ENQUEUE(Q, s)
    while (Q ≠ ∅)
        u ← DEQUEUE(Q)
        for each v ∈ Adj[u]
            if (color[v] = WHITE)
                color[v] ← GRAY
                ENQUEUE(Q, v)
        color[u] ← BLACK

    for i in ranks: start receiving in_queue[i] from rank i
    for j in ranks: start sending out_queue[j] to rank j
    synchronize and finish communications
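For concreteness, here is a hedged sketch of the per-level queue exchange the pseudocode above implies, written with plain MPI collectives rather than the PBGL's actual communication layer; the int-encoded vertices and the use of MPI_Alltoall/MPI_Alltoallv are illustrative assumptions.

    // Hedged sketch (not PBGL code): one BSP superstep of queue exchange.
    // Each rank sends out_queue[j] (vertices owned by rank j) to rank j and
    // gathers the incoming frontier for the next level.
    #include <mpi.h>
    #include <algorithm>
    #include <vector>

    std::vector<int> exchange_frontier(const std::vector<std::vector<int>>& out_queue,
                                       MPI_Comm comm) {
      int nranks;  MPI_Comm_size(comm, &nranks);

      // 1. Exchange counts so every rank knows how much it will receive.
      std::vector<int> send_counts(nranks), recv_counts(nranks);
      for (int j = 0; j < nranks; ++j) send_counts[j] = (int)out_queue[j].size();
      MPI_Alltoall(send_counts.data(), 1, MPI_INT,
                   recv_counts.data(), 1, MPI_INT, comm);

      // 2. Flatten send buffers and compute displacements.
      std::vector<int> send_displs(nranks, 0), recv_displs(nranks, 0);
      for (int j = 1; j < nranks; ++j) {
        send_displs[j] = send_displs[j - 1] + send_counts[j - 1];
        recv_displs[j] = recv_displs[j - 1] + recv_counts[j - 1];
      }
      std::vector<int> send_buf(send_displs[nranks - 1] + send_counts[nranks - 1]);
      for (int j = 0; j < nranks; ++j)
        std::copy(out_queue[j].begin(), out_queue[j].end(),
                  send_buf.begin() + send_displs[j]);

      // 3. Exchange the frontier itself; the collective's implicit barrier is
      //    exactly the per-level synchronization the slide calls a trap.
      std::vector<int> recv_buf(recv_displs[nranks - 1] + recv_counts[nranks - 1]);
      MPI_Alltoallv(send_buf.data(), send_counts.data(), send_displs.data(), MPI_INT,
                    recv_buf.data(), recv_counts.data(), recv_displs.data(), MPI_INT,
                    comm);
      return recv_buf;  // combined in_queue for the next BFS level
    }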

What's Wrong Here? SPMD fits traditional scientific applications: coarse-grained, static, balanced, bandwidth-bound with large messages, communicating data. Data-driven applications are the opposite: fine-grained dependencies, dynamic and data-dependent, irregular, latency-bound with small messages, communicating control flow.

Messaging Models. Two-sided (MPI): explicit sends and receives. One-sided (MPI-2 one-sided, ARMCI, PGAS languages): remote put and get operations, plus a limited set of atomic updates into remote memory. Active messages (GASNet, DCMF, LAPI, Charm++, X10, etc.): explicit sends, implicit receives; a user-defined handler is called on the receiver for each message.
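To make the one-sided category concrete, here is a small hedged sketch of the MPI-2 RMA style (a remote put into a peer's exposed window); the neighbor pattern is illustrative and not drawn from the talk.

    // Hedged sketch of the one-sided model: each rank exposes a window and
    // peers write into it with MPI_Put, with no matching receive.
    #include <mpi.h>
    #include <vector>

    void one_sided_example(MPI_Comm comm) {
      int rank, nranks;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nranks);

      std::vector<int> local(nranks, 0);          // exposed memory, one slot per peer
      MPI_Win win;
      MPI_Win_create(local.data(), nranks * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, comm, &win);

      MPI_Win_fence(0, win);                      // open an access epoch
      int value  = rank;
      int target = (rank + 1) % nranks;
      // Write our rank into slot `rank` of the right-hand neighbor's window.
      MPI_Put(&value, 1, MPI_INT, target, /*target_disp=*/rank, 1, MPI_INT, win);
      MPI_Win_fence(0, win);                      // close the epoch: puts are visible

      MPI_Win_free(&win);
    }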

Active Messages. Created by von Eicken et al. for Split-C (1992). Messages are sent explicitly; receivers register handlers but are not involved with individual messages; messages are often asynchronous for higher throughput. [Figure: timeline between Process 1 and Process 2; a message triggers the message handler on the receiver, whose reply triggers the reply handler on the sender]

Active Messages and AM++. Active messages move control flow to the data: fine-grained, asynchronous, with uniformity of access. AM++ is generic and user-level, flexible/modular, supports user-defined reductions, and sends to targets rather than processors. [Diagram: AM++ stack; user message types with coalescing sit above the AM++ transport, alongside termination detection (TD level) and epochs, on top of MPI or a vendor communication library]

Low-Level vs. High-Level AM Systems. Active messaging systems sit (loosely) on a spectrum of features vs. performance. Low-level systems typically have restrictions on message handler behavior, explicit buffer management, etc. High-level systems often provide dynamic load balancing, service discovery, authentication/security, etc. [Spectrum, low to high: DCMF, GASNet, Charm++/X10, Java RMI]

The AM++ Framework. AM++ provides a middle ground between low- and high-level systems: it gets its performance from low-level techniques and its programmability from high-level ones, and high-level features can be built on top of it. [Spectrum, low to high: DCMF, GASNet, Charm++/X10, Java RMI, with AM++ positioned between the two ends]

Messages Can Send Messages. [Diagram: a call to send<T>(t) checks whether owner(t) is local; if so, handler<T>(t) executes immediately, otherwise the message is forwarded toward the owner, and handlers may themselves send further messages. Pluggable termination detection detects network quiescence before send returns control at the end of an epoch]

Active Pebbles. We need to separate what the programmer expresses from what is actually executed: a programming model and an execution model.

Active Pebbles Features. Programming model: active messages (pebbles) and fine-grained addressing (targets). Execution model: flexible message coalescing, message reductions, active routing, and termination detection. The features are synergistic. AM++ is our implementation of the Active Pebbles model.

Programming Model. Program with natural granularity: no need to artificially coarsen object granularity. Transparent addressing, both static and dynamic, local and remote. Bulk, anonymous handling of messages and targets. Epoch model: enforce message delivery with controlled, relaxed consistency. [The slide's embedded code example and rank diagram are not legible in this transcription.]
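In place of the illegible listing, here is a purely illustrative, single-process stand-in for a pebble-style interface. The names (explore_msg, message_type, epoch_guard) are hypothetical and are not the AM++ API; the sketch only shows the shape of the model: a handler bound to a message type, sends addressed to targets, and an epoch that bounds delivery.

    // Illustrative toy only, NOT the AM++ interface.
    #include <cstdint>
    #include <iostream>

    struct explore_msg { std::uint64_t target_vertex; std::uint32_t level; };

    // User-defined handler: in a real runtime this would execute on whichever
    // process owns target_vertex; here it simply runs locally.
    struct explore_handler {
      void operator()(const explore_msg& m) const {
        std::cout << "visit vertex " << m.target_vertex
                  << " at level " << m.level << "\n";
        // ...a real handler would mark the vertex and send further pebbles...
      }
    };

    // Toy message type: immediate local execution stands in for coalescing,
    // routing to the owning process, and remote handler invocation.
    template <typename Msg, typename Handler>
    struct message_type {
      void send(const Msg& m) { Handler()(m); }
    };

    // Toy epoch: a real one would run termination detection in the destructor,
    // returning only once the network is quiescent.
    struct epoch_guard { ~epoch_guard() {} };

    int main() {
      message_type<explore_msg, explore_handler> explore;
      epoch_guard epoch;        // begin epoch
      explore.send({3, 0});     // handlers may themselves call send()
    }                           // end epoch: all pebbles have been handled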

Execution Model. Message coalescing amortizes latency. Message reductions eliminate redundant computation and distribute computation into the network. Active routing exploits the physical topology. Termination detection: because handlers send messages, the runtime must detect quiescence. [Diagram: table.insert(...) pebbles from several processes flow through hypercube routing, coalescing, and single- and multi-source reductions before reaching the owning processes]

Routing + Message Coalescing. Coalescing buffers limit scalability because communication is typically all-to-all, so impose a limited topology with fewer neighbors: better scalability at the cost of higher latency. [Diagram: a 4x4 process grid (P0-P15); with routing, messages from multiple sources to the same destination share buffers at intermediate hops (multi-source coalescing)]
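A minimal sketch of per-destination coalescing as described above, assuming a fixed-capacity buffer and a user-supplied batch-send callback; this is illustrative, not the AM++ implementation.

    // Messages bound for the same neighbor are packed into a buffer and sent
    // as one unit once the buffer fills or is explicitly flushed.
    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <vector>

    template <typename Msg>
    class coalescing_buffer {
    public:
      coalescing_buffer(std::size_t capacity,
                        std::function<void(const std::vector<Msg>&)> send_batch)
          : capacity_(capacity), send_batch_(std::move(send_batch)) {}

      // Amortizes per-message latency: only every capacity_-th call pays the
      // cost of an actual network send.
      void send(const Msg& m) {
        pending_.push_back(m);
        if (pending_.size() >= capacity_) flush();
      }

      void flush() {                      // also called at epoch end
        if (!pending_.empty()) {
          send_batch_(pending_);
          pending_.clear();
        }
      }

    private:
      std::size_t capacity_;
      std::function<void(const std::vector<Msg>&)> send_batch_;
      std::vector<Msg> pending_;
    };

    int main() {
      coalescing_buffer<int> buf(4, [](const std::vector<int>& batch) {
        // stand-in for one network send carrying the whole batch
        std::printf("sending batch of %zu messages\n", batch.size());
      });
      for (int v = 0; v < 10; ++v) buf.send(v);
      buf.flush();
    }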

Message Reductions + Routing. Messages to the same target can often be combined; reductions are application-specific and user-defined. Routing allows cache hits at intermediate hops, automatically synthesizes optimized collectives, and distributes computation into the network. [Diagram: 4x4 process grid showing combined messages converging toward their target]
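A hedged sketch of a message-reduction cache: messages to the same target are combined with a user-defined operation before being forwarded. The class and its interface are illustrative assumptions, not PBGL or AM++ code.

    #include <cstdio>
    #include <functional>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    template <typename Key, typename Value>
    class reduction_cache {
    public:
      using combine_fn = std::function<Value(const Value&, const Value&)>;
      explicit reduction_cache(combine_fn combine) : combine_(std::move(combine)) {}

      // Messages to the same target are merged; duplicates never reach the wire.
      void send(const Key& target, const Value& v) {
        auto it = cache_.find(target);
        if (it == cache_.end()) cache_.emplace(target, v);
        else it->second = combine_(it->second, v);
      }

      // Flush at epoch end (or when the cache is full) and forward one
      // combined message per distinct target.
      std::vector<std::pair<Key, Value>> flush() {
        std::vector<std::pair<Key, Value>> out(cache_.begin(), cache_.end());
        cache_.clear();
        return out;
      }

    private:
      combine_fn combine_;
      std::unordered_map<Key, Value> cache_;
    };

    int main() {
      reduction_cache<int, double> relax([](const double& a, const double& b) {
        return a < b ? a : b;   // SSSP-style: keep the smaller tentative distance
      });
      relax.send(7, 3.5);
      relax.send(7, 2.0);       // combined with the previous message to target 7
      for (const auto& kv : relax.flush())
        std::printf("target %d gets distance %g\n", kv.first, kv.second);
    }

Applied at intermediate routing hops, the same cache is what lets duplicate or combinable messages disappear before they reach the target process, which is how routing plus reduction can behave like a synthesized reduction tree.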

Active Message Abstraction. Pebbles are agnostic as to where they execute; they operate on targets, independent of how messages are processed: network communication (MPI, GASNet, DCMF, IB Verbs, ...), work stealing (Cilk++, task parallelism), OpenMP (over coalescing buffers), or immediate execution in the caller of send(). Thread-safe metadata allows weakening the message order; updates to targets must be atomic, and algorithms may have to tolerate weak consistency.
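One way to read "updates to targets must be atomic": with handlers running concurrently, the WHITE-to-GRAY transition should be claimed with a compare-and-swap so each vertex is discovered exactly once. A hedged sketch, using std::atomic colors rather than the PBGL's property maps.

    #include <atomic>
    #include <cstdint>
    #include <vector>

    enum : std::uint8_t { WHITE = 0, GRAY = 1, BLACK = 2 };

    struct bfs_state {
      std::vector<std::atomic<std::uint8_t>> color;
      explicit bfs_state(std::size_t n) : color(n) {
        for (auto& c : color) c.store(WHITE, std::memory_order_relaxed);
      }
    };

    // Handler body for an explore(v) pebble: returns true if this call won the
    // race and should push v into the next frontier / send further pebbles.
    inline bool try_discover(bfs_state& st, std::size_t v) {
      std::uint8_t expected = WHITE;
      return st.color[v].compare_exchange_strong(expected, GRAY,
                                                 std::memory_order_acq_rel);
    }

    int main() {
      bfs_state st(4);
      bool first  = try_discover(st, 2);   // true: this call claimed vertex 2
      bool second = try_discover(st, 2);   // false: vertex 2 is already GRAY
      return (first && !second) ? 0 : 1;
    }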

Evaluation: Message Latency [plot]

Evaluation: Message Bandwidth [plot]

Active Pebbles is meant to support all kinds of parallelism. We started by optimizing distributed-memory communication; the same features allow integration of fine-grained parallelism.

Active Messages for Work Decomposition. The key idea is to find the natural granularity: each pebble represents an independent computation that can be executed in parallel. Applied to BFS, the one-at-a-time dequeue becomes a parallel loop over the frontier, and queue pushes become message sends:

    ENQUEUE(Q, s)
    while (Q ≠ ∅)
        #pragma omp parallel for
        for each u ∈ Q                  (replaces u ← DEQUEUE(Q))
            for each v ∈ Adj[u]
                if (color[v] = WHITE)
                    color[v] ← GRAY
                    msg send()          (replaces ENQUEUE(Q, v))
            color[u] ← BLACK
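A hedged shared-memory rendering of that transformation in C++ with OpenMP; the adjacency-list representation and the per-thread local queues are illustrative choices, not the PBGL's data structures.

    // Level-synchronous BFS: process the whole current frontier in parallel and
    // claim each neighbor with a CAS so it enters the next frontier exactly once.
    #include <atomic>
    #include <cstdio>
    #include <vector>

    std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int source) {
      const int n = (int)adj.size();
      std::vector<std::atomic<int>> level(n);
      for (auto& l : level) l.store(-1, std::memory_order_relaxed);

      std::vector<int> frontier{source};
      level[source].store(0, std::memory_order_relaxed);

      for (int depth = 0; !frontier.empty(); ++depth) {
        std::vector<int> next;
        #pragma omp parallel
        {
          std::vector<int> local_next;              // per-thread output queue
          #pragma omp for nowait
          for (int i = 0; i < (int)frontier.size(); ++i) {
            for (int w : adj[frontier[i]]) {
              int expected = -1;                    // "WHITE": not yet discovered
              if (level[w].compare_exchange_strong(expected, depth + 1))
                local_next.push_back(w);            // the pebble send becomes a push here
            }
          }
          #pragma omp critical
          next.insert(next.end(), local_next.begin(), local_next.end());
        }
        frontier.swap(next);
      }

      std::vector<int> out(n);
      for (int v = 0; v < n; ++v) out[v] = level[v].load(std::memory_order_relaxed);
      return out;
    }

    int main() {
      std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
      for (int l : bfs_levels(adj, 0)) std::printf("%d ", l);
      std::printf("\n");
    }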

What's Thread-Safe in AP? Thread-safe: messaging, epoch begin/end, and termination detection. Not thread-safe: message creation; modifying message features (routing, coalescing, reductions); modifying termination detection; modifying the number of threads.

Parallel BGL Architecture. AP makes messaging thread-safe; property maps make metadata manipulation thread-safe and allow messages to be processed in arbitrary order. Just as transactional memory generalizes processor atomics to arbitrary transactions, Active Pebbles generalizes one-sided operations to user-defined handlers. [Diagram: threads T0-T3 enter an algorithm class through its operator() entry point; shared state, a LockMap, and graph property maps sit over message types with object-based addressing, reductions, and coalescing, which sit over the AM++ transport with termination detection (TD level) and epochs, over MPI or a vendor communication library; the graph itself is static, for now]
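The LockMap mentioned above can be pictured as a striped lock table: a fixed pool of mutexes indexed by a hash of the vertex, so compound per-vertex metadata updates can be made atomic without one lock per vertex. The sketch below is illustrative and makes no claim about the PBGL's actual LockMap; the vertex_data struct and relax helper are hypothetical.

    #include <array>
    #include <cstddef>
    #include <functional>
    #include <mutex>

    template <typename Key, std::size_t NumStripes = 256>
    class lock_map {
    public:
      std::mutex& lock_for(const Key& k) {
        return stripes_[std::hash<Key>{}(k) % NumStripes];
      }
    private:
      std::array<std::mutex, NumStripes> stripes_;
    };

    // Usage inside a handler: update predecessor and distance together.
    struct vertex_data { int predecessor; double distance; };

    inline void relax(lock_map<int>& locks, vertex_data& d, int v,
                      int new_pred, double new_dist) {
      std::lock_guard<std::mutex> g(locks.lock_for(v));
      if (new_dist < d.distance) { d.distance = new_dist; d.predecessor = new_pred; }
    }

    int main() {
      lock_map<int> locks;
      vertex_data d{-1, 1e9};
      relax(locks, d, /*v=*/5, /*new_pred=*/2, /*new_dist=*/3.0);
      return d.predecessor == 2 ? 0 : 1;
    }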

Active Pebbles Breadth-First Search. [Figure: the example graph from before, partitioned across Processes 0-2; instead of exchanging whole queues per level, processes send explore(vertex, level) pebbles, e.g. explore(b, 1), explore(h, 2), explore(h, 3), which are handled asynchronously on the owning process]

BFS: Strong Scaling. [Plot: time in seconds vs. number of nodes (2-128) for breadth-first search on 2^27 vertices and 2^29 edges, comparing PBGL with AM++ at t=1, 2, and 4 threads] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (two cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

BFS: Weak Scaling. [Plot: time in seconds vs. number of nodes (1-128) for breadth-first search with 2^25 vertices and 2^27 edges per node, comparing PBGL with AM++ at t=1, 2, and 4 threads] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (two cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

Delta-Stepping: Strong Scaling. [Plot: time in seconds (log scale) vs. number of nodes (2-128) for delta-stepping SSSP on 2^27 vertices and 2^29 edges, comparing PBGL with AM++ at t=1, 2, and 4 threads] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (two cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

Delta-Stepping: Weak Scaling. [Plot: time in seconds (log scale) vs. number of nodes (1-128) for delta-stepping SSSP with 2^24 vertices and 2^26 edges per node, comparing PBGL with AM++ at t=1, 2, and 4 threads] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (two cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

Transactional Metadata Updates. Sometimes we want to update non-contiguous, dependent data atomically: the predecessor and BFS level together, or arbitrary visitor code supplied by users. Could we simply write std::atomic<std::pair<T1, T2>>, or std::atomic<Foo<T>> for an arbitrary user-defined struct Foo? These are limited-scope transactions; are they any easier than general ones? [Diagram: the Parallel BGL stack from before, with the atomic-pair question marked at the shared-state and LockMap layers]
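std::atomic does accept any trivially copyable type, so std::atomic<std::pair<T1, T2>> is legal, but it only helps when the hardware can make it lock-free. A portable fallback when the two fields fit in one machine word is to pack them into a single 64-bit atomic; the following is a hedged sketch, not the PBGL's approach.

    // Pack predecessor and BFS level into one 64-bit word so both fields are
    // updated together, or not at all, with a single compare-and-swap.
    #include <atomic>
    #include <cstdint>

    struct packed_meta {                    // 32-bit predecessor + 32-bit level
      static std::uint64_t pack(std::uint32_t pred, std::uint32_t level) {
        return (std::uint64_t(pred) << 32) | level;
      }
      static std::uint32_t pred(std::uint64_t p)  { return std::uint32_t(p >> 32); }
      static std::uint32_t level(std::uint64_t p) { return std::uint32_t(p); }
    };

    constexpr std::uint64_t UNSET = ~std::uint64_t(0);

    // Claim a vertex: set predecessor and level atomically, but only if it has
    // not been discovered yet.
    inline bool try_set_meta(std::atomic<std::uint64_t>& meta,
                             std::uint32_t pred, std::uint32_t level) {
      std::uint64_t expected = UNSET;
      return meta.compare_exchange_strong(expected, packed_meta::pack(pred, level));
    }

    int main() {
      std::atomic<std::uint64_t> meta(UNSET);
      bool won = try_set_meta(meta, /*pred=*/4, /*level=*/2);
      std::uint64_t p = meta.load();
      return (won && packed_meta::pred(p) == 4 && packed_meta::level(p) == 2) ? 0 : 1;
    }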

Summary. Active messages express fine-grained, asynchronous operations elegantly; they are well matched to data-driven problems, they enable fine-grained parallelism (threads, GPUs, FPGAs, ...), and their asynchrony allows latency hiding. Concise expression and efficient execution come from separating the programming and execution models. Active Pebbles is a simple programming model whose execution model maps programs to hardware efficiently; AM++ is our implementation and could be targeted as a runtime by languages (X10, Chapel, ...).

Future Work. Constrained parallelism in shared memory: not entirely work-queue-based, not entirely recursion-based. Acceleration: coalesced message buffers offer great opportunities here if getting to the accelerator is cheap. Optimizing local memory transactions: Intel's TSX in Haswell CPUs, and a declarative language for transaction code generation.

Questions? More info on Active Pebbles: Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. Active Pebbles: Parallel Programming for Data-Driven Applications. ICS '11. More info on AM++: Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. AM++: A Generalized Active Message Framework. PACT '10. More info on the Parallel Boost Graph Library and graph applications: http://www.osl.iu.edu/research/pbgl and http://www.boost.org. Watch for a new release of PBGL based on Active Pebbles, running on AM++, soon! (Ask if you'd like access to a pre-release, very alpha copy.) ngedmond@cs.indiana.edu