Grappa: A latency tolerant runtime for large-scale irregular applications

1 Grappa: A latency tolerant runtime for large-scale irregular applications
Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin
Computer Science & Engineering, University of Washington
April 13,

2 We want to solve big ugly problems easily and efficiently on rack scale systems (and beyond)
Abstract example:
- TB+ sized, directed, imbalanced tree
- all memory-resident
- traverse vertices reachable from a given start vertex
Other, more useful examples:
- finding ephemeral patterns in streaming graph data (fraud detection)
- branch-and-bound for optimization (routing delivery vehicles)
- direct sparse linear solvers (SPICE)

3 A single node, serial starting point
    struct Vertex {
      index_t id;
      Vertex * children;
      size_t num_children;
    };

4 A single node, serial starting point
    struct Vertex {
      index_t id;
      Vertex * children;
      size_t num_children;
    };
    int main( int argc, char * argv[] ) {
      Vertex * root = create_big_tree();
      search(root);
      return 0;
    }

5 A single node, serial starting point
    struct Vertex {
      index_t id;
      Vertex * children;
      size_t num_children;
    };
    // Visit a vertex, then recurse into each of its children.
    void search(Vertex * vertex_addr) {
      Vertex v = *vertex_addr;
      Vertex * child0 = v.children;
      for( int i = 0; i < v.num_children; ++i ) {
        search(child0+i);
      }
    }
    int main( int argc, char * argv[] ) {
      Vertex * root = create_big_tree();
      search(root);
      return 0;
    }

8 Add boiler-plate Grappa code
    struct Vertex {
      index_t id;
      Vertex * children;
      size_t num_children;
    };
    void search(Vertex * vertex_addr) {
      Vertex v = *vertex_addr;
      Vertex * child0 = v.children;
      for( int i = 0; i < v.num_children; ++i ) {
        search(child0+i);
      }
    }
    int main( int argc, char * argv[] ) {
      init( &argc, &argv );
      run( []{
        Vertex * root = create_big_tree();
        search(root);
      } );
      finalize();
      return 0;
    }

9 Making graph & vertices into global structures
    struct Vertex {
      index_t id;
      GlobalAddress<Vertex> children;
      size_t num_children;
    };
    void search(GlobalAddress<Vertex> vertex_addr) {
      Vertex v = *vertex_addr;
      GlobalAddress<Vertex> child0 = v.children;
      for( int i = 0; i < v.num_children; ++i ) {
        search(child0+i);
      }
    }
    int main( int argc, char * argv[] ) {
      init( &argc, &argv );
      run( []{
        GlobalAddress<Vertex> root = create_big_global_tree();
        search(root);
      } );
      finalize();
      return 0;
    }

10 Making graph & vertices into global structures
    struct Vertex {
      index_t id;
      GlobalAddress<Vertex> children;
      size_t num_children;
    };
    void search(GlobalAddress<Vertex> vertex_addr) {
      // The plain dereference becomes a delegate read of the global address.
      Vertex v = delegate::read(vertex_addr);
      GlobalAddress<Vertex> child0 = v.children;
      for( int i = 0; i < v.num_children; ++i ) {
        search(child0+i);
      }
    }
    int main( int argc, char * argv[] ) {
      init( &argc, &argv );
      run( []{
        GlobalAddress<Vertex> root = create_big_global_tree();
        search(root);
      } );
      finalize();
      return 0;
    }

11 Make the loop over neighbors parallel
    struct Vertex {
      index_t id;
      GlobalAddress<Vertex> children;
      size_t num_children;
    };
    void search(GlobalAddress<Vertex> vertex_addr) {
      Vertex v = delegate::read(vertex_addr);
      GlobalAddress<Vertex> child0 = v.children;
      // The serial for loop becomes a parallel loop over the children.
      parallel_for( 0, v.num_children, [child0](int64_t i) {
        search(child0+i);
      } );
    }
    int main( int argc, char * argv[] ) {
      init( &argc, &argv );
      run( []{
        GlobalAddress<Vertex> root = create_big_global_tree();
        search(root);
      } );
      finalize();
      return 0;
    }

12 That's it! Grappa code for a rack scale system!
    struct Vertex {
      index_t id;
      GlobalAddress<Vertex> children;
      size_t num_children;
    };
    void search(GlobalAddress<Vertex> vertex_addr) {
      Vertex v = delegate::read(vertex_addr);
      GlobalAddress<Vertex> child0 = v.children;
      parallel_for( 0, v.num_children, [child0](int64_t i) {
        search(child0+i);
      } );
    }
    int main( int argc, char * argv[] ) {
      init( &argc, &argv );
      run( []{
        GlobalAddress<Vertex> root = create_big_global_tree();
        search(root);
      } );
      finalize();
      return 0;
    }
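The slide code depends on the Grappa runtime, so it cannot be compiled on its own. As a rough illustration of the same idea, the sketch below models a global address as a (core, offset) pair and a delegate read as a copy out of the home core's heap, all inside one process. Every name here (`MockGlobalAddress`, `MockVertex`, `MockHeaps`, `mock_delegate_read`, `mock_search`, `mock_build_tree`) is hypothetical and is not part of Grappa; the traversal is serial, and the child list is an explicit vector rather than the contiguous `children` range used on the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a global address: (home core, offset in that core's heap).
struct MockGlobalAddress {
  std::size_t core;
  std::size_t offset;
};

struct MockVertex {
  std::int64_t id;
  std::vector<MockGlobalAddress> children;  // simplification: explicit child list
};

// One heap per simulated core; on a real cluster these would sit on different nodes.
using MockHeaps = std::vector<std::vector<MockVertex>>;

// Stand-in for a delegate read: copy the vertex out of its home core's heap.
inline MockVertex mock_delegate_read(const MockHeaps& heaps, MockGlobalAddress a) {
  return heaps[a.core][a.offset];
}

// Serial traversal that returns the number of vertices reached from addr.
inline std::size_t mock_search(const MockHeaps& heaps, MockGlobalAddress addr) {
  MockVertex v = mock_delegate_read(heaps, addr);
  std::size_t visited = 1;
  for (MockGlobalAddress child : v.children) {
    visited += mock_search(heaps, child);
  }
  return visited;
}

// Builds a small imbalanced tree (a chain of hubs with leaves) and scatters
// the vertices round-robin across the simulated cores. Returns the root.
inline MockGlobalAddress mock_build_tree(MockHeaps& heaps, std::size_t num_vertices) {
  const std::size_t cores = heaps.size();
  auto addr_of = [cores](std::size_t i) {
    return MockGlobalAddress{i % cores, i / cores};
  };
  for (std::size_t i = 0; i < num_vertices; ++i) {
    heaps[i % cores].push_back(MockVertex{static_cast<std::int64_t>(i), {}});
  }
  for (std::size_t i = 1; i < num_vertices; ++i) {
    // Parent index is always smaller than i, so every vertex hangs off the root.
    const std::size_t parent = (i >= 3) ? i - 1 - (i % 3) : 0;
    const MockGlobalAddress p = addr_of(parent);
    heaps[p.core][p.offset].children.push_back(addr_of(i));
  }
  return addr_of(0);
}
```

Calling `mock_build_tree` on a `MockHeaps` with four cores and then `mock_search` on the returned root visits every vertex, even though consecutive vertices live on different simulated cores.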

13 Straightforward to write, but does it work?

14 Comparison against special purpose hardware
Traversing a 1.6B vertex imbalanced tree (UTS T1XL)
[Plot: traversal rate by system; 150 MVerts/s]
System: Nodes
- Grappa: 64-node AMD Interlagos cluster; GHz cores & 64GB RAM per node; 40Gb Infiniband
- Cray XMT: 128-node Cray XMT1; 1 500MHz core & 4GB per node; CrayXT Seastar interconnect

15 Grappa works: we can scale up a big ugly problem easily and efficiently.
But what about Grappa is relevant to rack scale computing when solving big ugly problems?

16 At rack+ system scale, chaos is required
Even easy problems that seem to divide up evenly become irregular at scale:
- processor interruptions
- system asymmetries
Scaling our irregular problems demands over-decomposition and dynamic work redistribution => asynchrony
Chaotic, asynchronous parallelism is required to get efficient use from rack scale (or larger) systems when applying many processors to a single large problem.
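To make the case for over-decomposition and dynamic work redistribution concrete, here is a small deterministic simulation comparing two schedules for the same irregular task list. It is only a sketch: the function names and the skewed workload are invented for illustration, and the greedy "next free worker takes the next task" rule stands in for whatever redistribution mechanism a real runtime uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Static schedule: contiguous equal-count blocks, one block per worker.
// Returns the finish time of the slowest worker.
inline std::uint64_t static_makespan(const std::vector<std::uint64_t>& cost,
                                     std::size_t workers) {
  std::uint64_t worst = 0;
  const std::size_t n = cost.size();
  for (std::size_t w = 0; w < workers; ++w) {
    const std::size_t begin = n * w / workers;
    const std::size_t end = n * (w + 1) / workers;
    std::uint64_t sum = 0;
    for (std::size_t i = begin; i < end; ++i) sum += cost[i];
    worst = std::max(worst, sum);
  }
  return worst;
}

// Dynamic schedule: whichever worker becomes free first takes the next task.
inline std::uint64_t dynamic_makespan(const std::vector<std::uint64_t>& cost,
                                      std::size_t workers) {
  std::priority_queue<std::uint64_t, std::vector<std::uint64_t>,
                      std::greater<std::uint64_t>> free_at;  // min-heap of finish times
  for (std::size_t w = 0; w < workers; ++w) free_at.push(0);
  std::uint64_t finish = 0;
  for (std::uint64_t c : cost) {
    const std::uint64_t done = free_at.top() + c;
    free_at.pop();
    free_at.push(done);
    finish = std::max(finish, done);
  }
  return finish;
}

// Hypothetical irregular workload: the first eighth of the tasks cost 20 units,
// the rest cost 1 unit.
inline std::vector<std::uint64_t> skewed_costs(std::size_t n) {
  std::vector<std::uint64_t> cost(n, 1);
  for (std::size_t i = 0; i < n / 8; ++i) cost[i] = 20;
  return cost;
}
```

With 64 tasks from `skewed_costs` and 4 workers, the static split leaves one worker holding all the expensive tasks (168 time units), while the dynamic schedule finishes in 54 units, which equals the total work divided by the worker count.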

17 Yet, at component scale, order is required
Hardware components are designed for order and structure:
- Caches: efficient when references are grouped or repeated
- Prefetching: efficient when access is predictable
- Pipelines: efficient when there are few computational dependences
- Network interfaces: efficient when messages are infrequent and large (>4KB)
- Atomics, fences: efficient when not used (i.e., when operations do not induce races)
Ordered parallelism is required for efficiency from individual components

18 Grappa addresses this dilemma by using parallel slack and latency tolerance

19 Software prefetching of contexts
[Plot: avg context switch latency (ns) vs. number of workers (0 to 4e+05); series: No prefetching, Prefetching]
At 1K thread operating point, ~50 ns
Pthreads: ns
500K threads: 75 ns
This is switching at the bandwidth limit to DRAM!
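The idea of prefetching a worker's context before switching to it can be sketched with a plain round-robin loop. This is not Grappa's scheduler: `MockContext` and `run_round_robin` are hypothetical, there is no real stack switch, and incrementing a counter stands in for "run the worker until it yields". The only real mechanism shown is the GCC/Clang builtin `__builtin_prefetch`, issued a fixed distance ahead of the worker about to run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical worker context, padded to 128 bytes so that neighbouring
// contexts do not share a cache line.
struct MockContext {
  std::uint64_t counter = 0;
  char padding[120] = {};
};

// Visits workers round-robin. Before "running" worker i, it requests the
// context of worker i + distance so that it is likely cached by the time
// the loop reaches it. Returns the number of simulated context switches.
inline std::uint64_t run_round_robin(std::vector<MockContext>& workers,
                                     std::size_t rounds,
                                     std::size_t distance) {
  std::uint64_t switches = 0;
  const std::size_t n = workers.size();
  for (std::size_t r = 0; r < rounds; ++r) {
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
      // Second argument 1 = prefetch for write, third argument 3 = keep in all cache levels.
      __builtin_prefetch(&workers[(i + distance) % n], 1, 3);
#endif
      workers[i].counter += 1;  // stand-in for running the worker until it yields
      ++switches;
    }
  }
  return switches;
}
```

The prefetch is only a hint, so the loop computes the same result with or without it; any latency benefit would have to be measured on the target machine with far more contexts than fit in cache.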

20 Mitigating low injection rate with aggregation
1. Queue messages in linked list. (Workers 1, 2 and 3 each hold a message on their stack: Msg 1, Msg 2, Msg 3.)
2. Serialize messages using prefetching. (Node 0 packs Msg 1, Msg 2, Msg 3 into one wire message.)
3. Send over network.
4. Deserialize and execute messages. (Node n)
[Plot: maximum bandwidth (bytes/second) vs. message size (bytes); series: Grappa, aggregated; Grappa, raw; GASNet; Raw MPI]
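The four steps above can be mimicked in a single process. The sketch below is an assumption-laden model, not Grappa's aggregator: `MockAggregator`, `MockWireMsg`, and `deliver` are hypothetical names, messages are plain strings, a `std::vector` replaces the linked list, and "sending" just appends to an in-memory log. It shows how many small messages to one destination collapse into a few large wire messages once a byte threshold is reached.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One aggregated message as it would travel over the network.
struct MockWireMsg {
  std::size_t dest;
  std::vector<std::string> msgs;
  std::size_t bytes;
};

class MockAggregator {
 public:
  MockAggregator(std::size_t num_nodes, std::size_t flush_bytes)
      : pending_(num_nodes), pending_bytes_(num_nodes, 0), flush_bytes_(flush_bytes) {}

  // Step 1: queue the message for its destination; flush once enough bytes pile up.
  void send(std::size_t dest, const std::string& payload) {
    pending_[dest].push_back(payload);
    pending_bytes_[dest] += payload.size();
    if (pending_bytes_[dest] >= flush_bytes_) flush(dest);
  }

  // Steps 2 and 3: pack everything queued for dest into one wire message and "send" it.
  void flush(std::size_t dest) {
    if (pending_[dest].empty()) return;
    wire_.push_back(MockWireMsg{dest, std::move(pending_[dest]), pending_bytes_[dest]});
    pending_[dest].clear();
    pending_bytes_[dest] = 0;
  }

  void flush_all() {
    for (std::size_t dest = 0; dest < pending_.size(); ++dest) flush(dest);
  }

  const std::vector<MockWireMsg>& wire() const { return wire_; }

 private:
  std::vector<std::vector<std::string>> pending_;
  std::vector<std::size_t> pending_bytes_;
  std::size_t flush_bytes_;
  std::vector<MockWireMsg> wire_;
};

// Step 4: unpack a wire message and run the handler once per original message.
inline std::size_t deliver(const MockWireMsg& w,
                           const std::function<void(const std::string&)>& handler) {
  for (const std::string& m : w.msgs) handler(m);
  return w.msgs.size();
}
```

Sending 1000 eight-byte messages to one node with a 4096-byte threshold produces two wire messages instead of 1000, which is the effect the bandwidth plot on this slide is about.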

21 Accessing memory through delegates
[Diagram: four DRAMs; var resides in one of them; var+1 requests are issued from elsewhere]
Each word of memory has a designated home core
All accesses to that word run on that core
Requestor blocks until complete

22 Accessing memory through delegates
[Diagram: the var+1 requests arrive at the home core of var]
Since var is private to home core, updates can be applied
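A minimal single-process model of the delegate idea follows. It assumes a block distribution of words over cores and invents all of its names (`MockCore`, `MockMachine`, `delegate_increment`, `run_all`); none of them are Grappa's API. Unlike the slides, the requestor here does not block: requests are simply queued and the home cores drain their queues afterwards, which is enough to show that only the home core ever touches a given word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical model of one core: the words it owns plus a queue of requests.
struct MockCore {
  std::vector<std::int64_t> words;
  std::deque<std::function<void(std::vector<std::int64_t>&)>> inbox;
};

class MockMachine {
 public:
  MockMachine(std::size_t cores, std::size_t words_per_core)
      : cores_(cores), words_per_core_(words_per_core) {
    for (MockCore& c : cores_) c.words.assign(words_per_core, 0);
  }

  // Every global index has exactly one home core (block distribution, assumed).
  std::size_t home_of(std::size_t global_index) const {
    return global_index / words_per_core_;
  }

  // Ship the update to the home core instead of touching remote memory.
  void delegate_increment(std::size_t global_index) {
    const std::size_t local = global_index % words_per_core_;
    cores_[home_of(global_index)].inbox.push_back(
        [local](std::vector<std::int64_t>& words) { words[local] += 1; });
  }

  // Each core drains its own inbox serially, so plain non-atomic updates suffice.
  void run_all() {
    for (MockCore& c : cores_) {
      while (!c.inbox.empty()) {
        c.inbox.front()(c.words);
        c.inbox.pop_front();
      }
    }
  }

  std::int64_t read(std::size_t global_index) const {
    return cores_[home_of(global_index)].words[global_index % words_per_core_];
  }

 private:
  std::vector<MockCore> cores_;
  std::size_t words_per_core_;
};
```

Because every update to a word is executed by that word's home core, the increments need no atomics or fences in this model; the serialization comes from the queue.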

23 Random update BW is good
Minimal context switching
[Plot: GUPS vs. nodes; update type: Blocking, Delegated]
Theoretical peak at 64 nodes is 6.4 GUPS, so this is work in progress.
GUPS pseudocode:
    global int a[big];
    int b[n];
    for (i=0;i<n;i++) a[b[i]]++;
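For reference, the GUPS pseudocode can be written as an ordinary serial C++ function. This is only the single-node form of the loop, with `gups_updates` and `random_indices` as hypothetical helper names; it says nothing about how the blocking or delegated variants measured on the slide are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Serial form of the GUPS loop: a[b[i]]++ for every index in b.
inline void gups_updates(std::vector<std::int64_t>& a,
                         const std::vector<std::size_t>& b) {
  for (std::size_t i = 0; i < b.size(); ++i) {
    a[b[i]]++;
  }
}

// Generates n uniformly random indices into a table of table_size entries.
inline std::vector<std::size_t> random_indices(std::size_t n,
                                               std::size_t table_size,
                                               unsigned seed) {
  std::mt19937_64 gen(seed);
  std::uniform_int_distribution<std::size_t> pick(0, table_size - 1);
  std::vector<std::size_t> b(n);
  for (std::size_t& idx : b) idx = pick(gen);
  return b;
}
```

Each iteration touches an unpredictable location in `a`, which is why the benchmark stresses random-access update bandwidth rather than caches or prefetchers.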

24 General Combining Scheme
Asynchronous (Competitive): Chaos!
- Arbitrary # computations
- Any number of threads
- Timing of interaction arbitrary
Synchronous (Collaborative): Order!
- Single computation
- Number of threads is explicit
- Synchronized, exclusive access to data
[Diagram, in flow order:]
- Asynchronous threads make requests
- combine in funnel; annihilate (or void fn)
- aggregate tries lock; if fail, circulate... re-enter funnel and try again
- got lock! Satisfy aggregate in parallel, synchronously, on the Data Structure
- release lock
- de-aggregate, returning results in parallel to requesting threads
Similar to Combining Funnels: a Dynamic Approach to Software Combining, Nir Shavit, Asaph Zemach
Simon Kahan
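The combine, lock, satisfy, de-aggregate sequence can be illustrated with fetch-and-add requests. The sketch below is a deliberate simplification and not the funnel algorithm from the cited work: it takes a batch of requests that have already met, has no funnel layers or retry path, and runs on one thread. `SharedCounter` and `combined_fetch_add` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Shared data protected by a lock; acquisitions counts how often the lock was taken.
struct SharedCounter {
  std::mutex lock;
  std::int64_t value = 0;
  int acquisitions = 0;
};

// Combines a batch of fetch-and-add requests into one aggregate, applies the
// aggregate under a single lock acquisition, then hands each request the value
// it would have observed had the requests run one after another.
inline std::vector<std::int64_t> combined_fetch_add(SharedCounter& c,
                                                    const std::vector<std::int64_t>& deltas) {
  // Combine: one aggregate delta for the whole batch.
  std::int64_t total = 0;
  for (std::int64_t d : deltas) total += d;

  // Satisfy the aggregate with exclusive access, then release the lock.
  std::int64_t base = 0;
  {
    std::lock_guard<std::mutex> guard(c.lock);
    ++c.acquisitions;
    base = c.value;
    c.value += total;
  }

  // De-aggregate: each requester gets its own prior value (a running prefix sum).
  std::vector<std::int64_t> results;
  results.reserve(deltas.size());
  std::int64_t running = base;
  for (std::int64_t d : deltas) {
    results.push_back(running);
    running += d;
  }
  return results;
}
```

For a counter starting at 10 and requests of 1, 2 and 3, the requesters receive 10, 11 and 13 and the counter ends at 16, with the lock taken once rather than three times.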

25 Conclusion
Grappa allows
- easy expression of asynchronous parallelism, providing the concurrency needed to get high system utilization on big ugly problems on rack scale computers, and
- efficient transformation into ordered parallelism, transforming the chaos into the order for which individual components are designed,
through use of parallel slack to tolerate latency.
What this means is that we can more easily write programs to attack large ugly problems at scale.
Try it!

26 Questions?
