The Parallel Boost Graph Library spawn(active Pebbles)


1 The Parallel Boost Graph Library spawn(active Pebbles) Nicholas Edmonds and Andrew Lumsdaine Center for Research in Extreme Scale Technologies Indiana University

2 Origins: Boost Graph Library (1999). Generic programming with templates; good sequential performance with high-level abstractions; algorithm composition, visitors, property maps, etc. Reference: Lie-Quan Lee, Jeremy Siek, and Andrew Lumsdaine. Generic Graph Algorithms for Sparse Matrix Ordering. ISCOPE, Lecture Notes in Computer Science.

3 Origins: Parallel BGL (2005). Lifting abstraction: make what we already have parallel; same interfaces as the sequential BGL; coarse-grained, BSP. References: Douglas Gregor and Andrew Lumsdaine. The Parallel BGL: A Generic Library for Distributed Graph Computation. Workshop on Parallel Object-Oriented Scientific Computing, July 2005. Douglas Gregor and Andrew Lumsdaine. Lifting Sequential Graph Algorithms for Distributed-Memory Parallel Computation. Object-Oriented Programming, Systems, Languages, and Applications, October 2005.

4 Case Study: Breadth-First Search
  ENQUEUE(Q, s)
  while (Q ≠ ∅)
      u ← DEQUEUE(Q)
      for (each v ∈ Adj[u])
          if (color[v] = WHITE)
              color[v] ← GRAY
              ENQUEUE(Q, v)
      color[u] ← BLACK
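As a concrete point of reference, a minimal sequential version of this loop in C++ might look like the sketch below; it uses a plain adjacency-list vector rather than the BGL graph types, and records BFS levels as it goes.

#include <cstddef>
#include <queue>
#include <vector>

enum class Color { White, Gray, Black };

// Minimal sequential BFS over an adjacency list; 's' is the source vertex.
std::vector<std::size_t> bfs_levels(const std::vector<std::vector<std::size_t>>& adj,
                                    std::size_t s) {
  std::vector<Color> color(adj.size(), Color::White);
  std::vector<std::size_t> level(adj.size(), 0);
  std::queue<std::size_t> Q;

  color[s] = Color::Gray;
  Q.push(s);
  while (!Q.empty()) {
    std::size_t u = Q.front();
    Q.pop();
    for (std::size_t v : adj[u]) {
      if (color[v] == Color::White) {   // first discovery of v
        color[v] = Color::Gray;
        level[v] = level[u] + 1;
        Q.push(v);
      }
    }
    color[u] = Color::Black;            // all neighbors of u have been examined
  }
  return level;
}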

5 SPMD Breadth-First Search [Figure: a nine-vertex example graph (a through i) partitioned across Process 0, Process 1, and Process 2, each holding its own queue Q; successive panels show the frontier being expanded level by level, with discovered vertices pushed onto the owning process's queue, e.g. Q.push(d).]

6 SPMD Breadth-First Search [Figure: Ranks 0 through 3 repeat three phases per level: get neighbors, redistribute queues, combine received queues.]

7 SPMD Breadth-First Search (Strong Scaling) [Plot: time [s] for breadth-first search (2^27 vertices, 2^29 edges) vs. number of nodes, SPMD/BSP.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with four cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

8 SPMD Breadth-First Search (Weak Scaling) [Plot: time [s] for breadth-first search (2^25 vertices, 2^27 edges per node) vs. number of nodes, SPMD/BSP.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with four cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

9 Find the Sequential Trap
  ENQUEUE(Q, s)
  while (Q ≠ ∅)
      u ← DEQUEUE(Q)
      for (each v ∈ Adj[u])
          if (color[v] = WHITE)
              color[v] ← GRAY
              ENQUEUE(Q, v)
      color[u] ← BLACK

10 Find the Synchronization Trap
  ENQUEUE(Q, s)
  while (Q ≠ ∅)
      u ← DEQUEUE(Q)
      for (each v ∈ Adj[u])
          if (color[v] = WHITE)
              color[v] ← GRAY
              ENQUEUE(Q, v)
      color[u] ← BLACK
      for i in ranks: start receiving in_queue[i] from rank i
      for j in ranks: start sending out_queue[j] to rank j
      synchronize and finish communications
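The added communication loop is the synchronization trap: every rank must finish the exchange before any rank can start the next level. As a rough illustration (not the PBGL implementation), the per-level queue exchange corresponds to an MPI all-to-all of counts followed by an all-to-all of payloads, and every rank blocks in it:

#include <mpi.h>
#include <vector>

// One BSP superstep of the distributed BFS queue exchange: each rank sends the
// vertices it discovered that live on other ranks and receives its own share.
// 'out_queue[r]' holds vertex ids owned by rank r; the result is this rank's
// combined in-queue for the next level.  (Sketch only.)
std::vector<long> exchange_queues(const std::vector<std::vector<long>>& out_queue,
                                  MPI_Comm comm) {
  int nranks;
  MPI_Comm_size(comm, &nranks);

  std::vector<int> send_counts(nranks), recv_counts(nranks);
  for (int r = 0; r < nranks; ++r)
    send_counts[r] = static_cast<int>(out_queue[r].size());
  MPI_Alltoall(send_counts.data(), 1, MPI_INT, recv_counts.data(), 1, MPI_INT, comm);

  std::vector<int> send_displs(nranks, 0), recv_displs(nranks, 0);
  std::vector<long> send_buf;
  for (int r = 0; r < nranks; ++r) {
    send_displs[r] = static_cast<int>(send_buf.size());
    send_buf.insert(send_buf.end(), out_queue[r].begin(), out_queue[r].end());
  }
  int total_recv = 0;
  for (int r = 0; r < nranks; ++r) { recv_displs[r] = total_recv; total_recv += recv_counts[r]; }

  std::vector<long> in_queue(total_recv);
  MPI_Alltoallv(send_buf.data(), send_counts.data(), send_displs.data(), MPI_LONG,
                in_queue.data(), recv_counts.data(), recv_displs.data(), MPI_LONG, comm);
  return in_queue;  // every rank blocks here: this is the synchronization point
}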

11 What's Wrong Here? (SPMD) Traditional scientific apps: coarse-grained; static; balanced; bandwidth-bound with large messages; communicate data. Data-driven apps: fine-grained dependencies; dynamic, data-dependent; irregular; latency-bound with small messages; communicate control flow.

12 Messaging Models. Two-sided (MPI): explicit sends and receives. One-sided (MPI-2 one-sided, ARMCI, PGAS languages): remote put and get operations; limited set of atomic updates into remote memory. Active messages (GASNet, DCMF, LAPI, Charm++, X10, etc.): explicit sends, implicit receives; a user-defined handler is called on the receiver for each message.

13 Active Messages. Created by von Eicken et al. for Split-C (1992). Messages are sent explicitly; receivers register handlers but are not involved with individual messages; messages are often asynchronous for higher throughput. [Figure: timeline between Process 1 and Process 2: a send triggers the message handler on Process 2, whose reply triggers the reply handler on Process 1.]
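The defining mechanism is that a handler is registered once per message type and is then invoked implicitly for every arriving message. The toy, library-agnostic C++ sketch below shows only that dispatch shape; it is not the GASNet, DCMF, or AM++ API, and it "delivers" locally instead of over a network.

#include <cstddef>
#include <functional>
#include <vector>

// Toy single-process model of active-message dispatch: handlers are registered
// once per message type; "delivering" a message looks up and runs its handler.
struct AMEngine {
  using Handler = std::function<void(const std::vector<char>&)>;

  std::size_t register_handler(Handler h) {       // receiver side, done up front
    handlers.push_back(std::move(h));
    return handlers.size() - 1;                   // message-type id
  }

  // Sender side: a real system would enqueue the bytes on the network; here we
  // deliver immediately to show the implicit-receive behavior.
  void send(std::size_t handler_id, const std::vector<char>& payload) {
    handlers.at(handler_id)(payload);
  }

 private:
  std::vector<Handler> handlers;
};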

14 Active Messages: move control flow to data; fine-grained; asynchronous; uniformity of access; user reductions. AM++: generic; user-level; flexible/modular; send to targets, not processors. [Figure: AM++ software stack: user message types with coalescing layered over the AM++ transport with termination detection, TD levels, and epochs, on top of MPI or a vendor communication library.]

15 Low-Level vs. High-Level AM Systems. Active messaging systems fall (loosely) on a spectrum of features vs. performance. Low-level systems typically have restrictions on message handler behavior, explicit buffer management, etc. High-level systems often provide dynamic load balancing, service discovery, authentication/security, etc. [Spectrum, low to high: DCMF, GASNet, Charm++/X10, Java RMI.]

16 The AM++ Framework. AM++ provides a middle ground between low- and high-level systems: it gets performance from low-level systems and programmability from high-level systems, and high-level features can be built on top of AM++. [Spectrum, low to high: DCMF, GASNet, Charm++/X10, Java RMI, with AM++ occupying the middle ground.]

17 Messages Can Send Messages. A handler may itself call send<T>(t): if owner(t) is local, handler<T>(t) is executed immediately; otherwise the message is forwarded to the owner. Termination detection (pluggable) detects network quiescence, i.e. the point at which no handler will send any further messages.
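Because handlers may send further messages, the runtime cannot count a fixed number of expected receives; it has to detect quiescence. Below is a deliberately simplified sketch of the counting idea: globally, everything sent has been handled. Real detectors (for example four-counter schemes) must also rule out messages in flight across the check; this sketch ignores that race.

#include <mpi.h>
#include <cstdint>

// Simplified quiescence check: the network is quiet when, summed over all
// ranks, every message that was sent has been handled.  This single reduction
// is NOT a correct termination detector on its own; it only illustrates the
// counters a real algorithm maintains.
bool globally_quiescent(std::uint64_t local_sent, std::uint64_t local_handled,
                        MPI_Comm comm) {
  std::uint64_t counts[2] = {local_sent, local_handled};
  std::uint64_t totals[2] = {0, 0};
  MPI_Allreduce(counts, totals, 2, MPI_UINT64_T, MPI_SUM, comm);
  return totals[0] == totals[1];
}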

18 Active Pebbles Need to separate what the programmer expresses from what is actually executed A programming model and an execution model

19 Active Pebbles Features. Programming model: active messages (pebbles); fine-grained addressing (targets). Execution model: flexible message coalescing; message reductions; active routing; termination detection. The features are synergistic. AM++ is our implementation of the Active Pebbles model.

20 Programming Model. Program with natural granularity: no need to artificially coarsen object granularity. Transparent addressing: static and dynamic, local and remote. Bulk, anonymous handling of messages and targets. Epoch model: enforce message delivery; controlled, relaxed consistency (see the epoch sketch below). [Figure: example code distributing messages across several ranks; the code is not legible in this transcription.]
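The toy below illustrates only the epoch delivery guarantee: sends made inside an epoch are buffered, and their handlers are guaranteed to have run by the time the epoch is exited. The class and method names are invented for the sketch and are not the AM++ spelling.

#include <functional>
#include <iostream>
#include <vector>

// Toy illustration of epoch semantics: sends inside an epoch are buffered, and
// exiting the epoch flushes them, so epoch completion is the point at which
// delivery (i.e. handler execution) of all messages is guaranteed.
struct ToyTransport {
  std::vector<std::function<void()>> pending;          // "in-flight" handler calls
  void send(std::function<void()> handler_call) { pending.push_back(std::move(handler_call)); }
  void flush() { for (auto& h : pending) h(); pending.clear(); }
};

struct ScopedEpoch {                                   // RAII: exit == delivery barrier
  explicit ScopedEpoch(ToyTransport& t) : t_(t) {}
  ~ScopedEpoch() { t_.flush(); }
  ToyTransport& t_;
};

int main() {
  ToyTransport transport;
  {
    ScopedEpoch epoch(transport);
    transport.send([] { std::cout << "handler ran\n"; });
    // ... more sends; handlers are only guaranteed to have run at epoch exit
  }                                                    // all handlers have run here
  return 0;
}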

21 Execution Model. Message coalescing: amortize latency. Message reduction: eliminate redundant computation; distribute computation into the network. Active routing: exploit physical topology. Termination detection: handlers send messages; detect quiescence. [Figure: a distributed hash table example on ranks P0 through P3; table.insert pebbles are coalesced, routed over a hypercube, and combined by multi-source and single-source reductions before reaching the owning rank.]

22 Routing + Message Coalescing. Coalescing buffers limit scalability because communication is typically all-to-all. Imposing a limited topology with fewer neighbors gives better scalability at the cost of higher latency. [Figure: a 4x4 grid of processes P0 through P15; with routing, each process keeps coalescing buffers only for its neighbors, and multi-source coalescing merges messages at intermediate hops.]
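For example, if the limited topology is a hypercube on a power-of-two number of ranks (hypercube routing is the example labeled in the execution-model figure above), each rank keeps coalescing buffers only for the log2(P) neighbors that differ from it in one bit, and a message moves one bit-flip closer to its destination per hop. A small sketch of that next-hop rule:

#include <cassert>

// Next hop for hypercube routing: flip the lowest bit in which the current
// rank and the destination differ.  Assumes a power-of-two rank count.
int hypercube_next_hop(int current, int dest) {
  assert(current != dest);
  unsigned diff = static_cast<unsigned>(current ^ dest);
  unsigned lowest_set = diff & (~diff + 1u);   // isolate the lowest differing bit
  return current ^ static_cast<int>(lowest_set);
}
// e.g. routing 0 -> 11 (binary 1011) visits ranks 1, 3, 11: at most log2(P)
// hops, so each rank needs buffers for only log2(P) neighbors instead of P-1.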

23 Message Reductions + Routing. Messages to the same target can often be combined; reductions are application-specific and user-defined. Routing allows cache hits at intermediate hops. This automatically synthesizes optimized collectives and distributes computation into the network. [Figure: 4x4 process grid P0 through P15 showing duplicate messages being eliminated along the routing paths.]
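As an illustration, in a shortest-paths-style computation the messages to a given target carry tentative distances and can be combined by keeping only the minimum. The following sketch of a sender-side (or intermediate-hop) buffer with a user-defined min-reduction shows the idea; it is not the AM++ reduction interface.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// One outgoing coalescing buffer with a min-reduction: messages addressed to
// the same target vertex are merged by keeping the smaller distance, so only
// one message per target survives until the buffer is flushed.
struct DistanceMsg { std::uint64_t target; double distance; };

class ReducingBuffer {
 public:
  void send(const DistanceMsg& m) {
    auto it = cache_.find(m.target);
    if (it == cache_.end()) {
      cache_.emplace(m.target, buffer_.size());
      buffer_.push_back(m);
    } else if (m.distance < buffer_[it->second].distance) {
      buffer_[it->second].distance = m.distance;   // combine: keep the minimum
    }                                              // else: redundant, drop it
  }

  std::vector<DistanceMsg> flush() {               // hand the survivors to the transport
    std::vector<DistanceMsg> out;
    out.swap(buffer_);
    cache_.clear();
    return out;
  }

 private:
  std::unordered_map<std::uint64_t, std::size_t> cache_;  // target -> slot in buffer_
  std::vector<DistanceMsg> buffer_;
};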

24 Active Message Abstraction. Pebbles are agnostic as to where they execute and operate on targets, independent of how messages are processed: network communication (MPI, GASNet, DCMF, IB Verbs), work stealing (Cilk++, task parallelism), OpenMP (over coalescing buffers), or immediate execution in the caller of send(). Thread-safe metadata allows weakening the message order; updates to targets must be atomic, and algorithms may have to tolerate weak consistency.

25 Evaluation: Message Latency [plot not reproduced in the transcription]

26 Evaluation: Message Bandwidth [plot not reproduced in the transcription]

27 Active Pebbles is meant to support all kinds of parallelism. We started with optimizing distributed-memory communication; the same features allow integration of fine-grained parallelism.

28 Active Messages for Work Decomposition. The key idea is to find the natural granularity: each pebble represents an independent computation that can be executed in parallel. In the BFS loop, the serial dequeue (u ← DEQUEUE(Q)) becomes a parallel loop over the current queue, and the queue push becomes a message send (see the sketch after the pseudocode):
  ENQUEUE(Q, s)
  while (Q ≠ ∅)
      #pragma omp parallel for
      for (each u ∈ Q)
          for (each v ∈ Adj[u])
              if (color[v] = WHITE)
                  color[v] ← GRAY
                  ENQUEUE(Q, v)      // queue push becomes msg send()
          color[u] ← BLACK
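A hedged C++/OpenMP sketch of that decomposition for one BFS level: the thread team expands the current frontier, locally owned discoveries go into the next frontier, and remotely owned ones become explore messages. The is_local and send_explore helpers are hypothetical stand-ins (stubbed here so the sketch is self-contained), not PBGL or AM++ calls.

#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the distributed pieces: ownership test and a
// remote "explore" pebble send (stubbed so the sketch compiles).
static bool is_local(std::size_t /*v*/) { return true; }
static void send_explore(std::size_t /*v*/, std::size_t /*level*/) {}

// Expand one BFS level in parallel.  'visited' uses atomic flags so that two
// threads discovering the same vertex race benignly: exactly one wins.
void expand_level(const std::vector<std::vector<std::size_t>>& adj,
                  const std::vector<std::size_t>& frontier,
                  std::vector<std::atomic<bool>>& visited,
                  std::vector<std::size_t>& next_frontier,
                  std::size_t level) {
  #pragma omp parallel
  {
    std::vector<std::size_t> local_next;           // per-thread buffer, merged later
    #pragma omp for nowait
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(frontier.size()); ++i) {
      for (std::size_t v : adj[frontier[i]]) {
        if (!is_local(v)) { send_explore(v, level + 1); continue; }  // pebble to owner
        bool expected = false;
        if (visited[v].compare_exchange_strong(expected, true))
          local_next.push_back(v);                 // first local discovery of v
      }
    }
    #pragma omp critical
    next_frontier.insert(next_frontier.end(), local_next.begin(), local_next.end());
  }
}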

29 What's Thread-safe in AP? Thread-safe: messaging; epoch begin/end; termination detection. NOT thread-safe: message creation; modifying message features (routing, coalescing, reductions); modifying termination detection; modifying the number of threads.

30 Parallel BGL Architecture. AP makes messaging thread-safe; property maps make metadata manipulation thread-safe and allow messages to be processed in arbitrary order. Just as transactional memory generalizes processor atomics to arbitrary transactions, Active Pebbles generalizes one-sided operations to user-defined handlers. [Figure: architecture diagram: threads T0 through T3 drive an algorithm class (entry point, operator(), shared state) over a graph (static, for now), property maps, and a LockMap; message types with object-based addressing, reductions, and coalescing sit on the AM++ transport with collectives, termination detection, TD levels, and epochs, over MPI or a vendor communication library.]

31 Active Pebbles Breadth-First Search [Figure: the nine-vertex example graph again, distributed across Process 0, Process 1, and Process 2; discovery is expressed as explore(vertex, level) pebbles, e.g. explore(b, 1), explore(h, 2), explore(h, 3), sent directly to the owning process.]

32 BFS: Strong Scaling [Plot: time [s] for breadth-first search (2^27 vertices, 2^29 edges) vs. number of nodes, comparing PBGL with AM++ at t=1, 2, and 4 threads.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

33 BFS: Weak Scaling [Plot: time [s] for breadth-first search (2^25 vertices, 2^27 edges per node) vs. number of nodes, comparing PBGL with AM++ at t=1, 2, and 4 threads.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

34 Delta-Stepping: Strong Scaling [Plot: time [s] for delta-stepping SSSP (2^27 vertices, 2^29 edges) vs. number of nodes, comparing PBGL with AM++ at t=1, 2, and 4 threads.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

35 Delta-Stepping: Weak Scaling [Plot: time [s] for delta-stepping SSSP (2^24 vertices, 2^26 edges per node) vs. number of nodes, comparing PBGL with AM++ at t=1, 2, and 4 threads.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

36 Transactional Metadata Updates. Sometimes we want to update non-contiguous, dependent data atomically: the predecessor and BFS level together, or arbitrary visitor code supplied by users. Are limited-scope transactions any easier? Can std::atomic<std::pair<T1, T2>> work? What about std::atomic<Foo<T>> for an arbitrary template <typename T> struct Foo? [Figure: the PBGL architecture diagram again, highlighting the algorithm's shared state, LockMap, graph, and property maps.]
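Short of transactions, one narrow trick for the predecessor-plus-level case is to pack both fields into a single 64-bit word and update it with one compare-and-swap; this only works when the fields fit in a machine word and says nothing about arbitrary visitor code. An illustrative sketch of that workaround (not what PBGL does):

#include <atomic>
#include <cstdint>

// Pack (predecessor, level) into one 64-bit word so both can be updated by a
// single lock-free CAS.  Only works because each field fits in 32 bits; wider
// or truly non-contiguous metadata is exactly where transactions get tempting.
struct VertexMeta {
  static std::uint64_t pack(std::uint32_t pred, std::uint32_t level) {
    return (static_cast<std::uint64_t>(pred) << 32) | level;
  }

  // Claim the vertex if it is still unvisited; returns true for the winner.
  bool try_set(std::uint32_t pred, std::uint32_t level) {
    std::uint64_t expected = pack(UINT32_MAX, UINT32_MAX);   // "unvisited"
    return packed.compare_exchange_strong(expected, pack(pred, level));
  }

  std::atomic<std::uint64_t> packed{pack(UINT32_MAX, UINT32_MAX)};
};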

37 Summary. Active messages express fine-grained, asynchronous operations elegantly; they are well matched to data-driven problems, enable fine-grained parallelism (threads, GPUs, FPGAs, ...), and their asynchrony allows latency hiding. Concise expression and efficient execution come from separating the programming and execution models. Active Pebbles: a simple programming model whose execution model maps programs to hardware efficiently; AM++ is our implementation and could be targeted as a runtime by languages (X10, Chapel, ...).

38 Future Work. Constrained parallelism in shared memory: not entirely work-queue-based, not entirely recursion-based. Acceleration: coalesced message buffers offer great opportunities here if getting to the accelerator is cheap. Optimizing local memory transactions: Intel's TSX in Haswell CPUs; a declarative language for transaction code generation.

39 Questions? More info on Active Pebbles: Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. Active Pebbles: Parallel Programming for Data-Driven Applications. ICS '11. More info on AM++: Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. AM++: A Generalized Active Message Framework. PACT '10. More info on the Parallel Boost Graph Library and graph applications: watch for a new release of PBGL based on Active Pebbles, running on AM++, soon! (Ask if you'd like access to a pre-release, very alpha copy.) ngedmond@cs.indiana.edu
