Shoal: smart allocation and replication of memory for parallel programs

Size: px

Start display at page:

Download "Shoal: smart allocation and replication of memory for parallel programs"

Isabel O’Connor’
5 years ago
Views:

1 Shoal: smart allocation and replication of memory for parallel programs Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris $ ETH Zurich $ Oracle Labs Cambridge, UK

2 Problem Congestion on interconnect Load imbalance of memory controllers Interconnect memory controllers Performance of parallel application depends on memory allocation Suboptimal allocation bad performance 2

3 Shoal Memory abstraction: Arrays Statically derive access patterns from code Choose array implementation at runtime Reduces runtime: 4x over naïve memory allocation 3

4 Example: PageRank 250 runtime [s] x8 AMD Opteron 6378 Bulldozer 4 Sockets 512 GB RAM SNAP Twitter graph 41M nodes 1468M edges size in RAM: 2.5 GB cores perfect speedup 4

5 Example: PageRank 250 runtime [s] x8 AMD Opteron 6378 Bulldozer 4 Sockets 512 GB RAM SNAP Twitter graph 41M nodes 1468M edges size in RAM: 2.5 GB cores perfect scalability single-node 5

6 Problem: implicit allocation void *data = malloc(size); memset(data, 0, SIZE); Implicit Linux policy on where to allocate memory First touch all memory on same NUMA node 6

7 What we would like to do? Partitioning Split working space, put on different nodes Replication Copy array Updates: consistency Reduce load-imbalance Localizes access reduces interconnect traffic DMA 2M/1G pages 7

8 What we have today: Explicit placement of memory libnuma Advise Kernel about use of memory region madvise 8

9 SHOAL 9

10 Exploiting DSLs High-level API Efficient parallelization widely used Machine learning Signal processing Graph processing Idea: derive access patterns 10

11 Green Marl: graph storage nodes r_nodes edges r_edges

12 Overview: Green Marl high-level program (e.g. Pagerank) high-level compiler low-level code (e.g. C++) compiler (CC) program 12

13 Modifications to Green Marl high-level program (e.g. Green Marl) high-level compiler analysis low-level code (e.g. C++) memory abstraction compiler (e.g. gcc) program library 13

14 1) Array abstraction high-level program (e.g. Pagerank) high-level compiler analysis low-level code (e.g. C++) memory abstraction compiler (e.g. gcc) program library 14

15 Array abstraction get() and set() copy_from(arr) and init_with(const) array_malloc(size, access_patterns) 15

16 Shoal s access patterns Read-only Sequential Random Indexed indexed: for (i=0; i<size; i++) { foo(arr[i]); } sequential + local 16

17 2) Compiler high-level program (e.g. Pagerank) high-level compiler analysis Use Shoal memory abstraction Extract memory access patterns low-level code (e.g. C++) memory abstraction compiler (e.g. gcc) program library 17

18 Derivation of access patterns Foreach (t: G.Nodes) { Double val = k * Sum(nb: t.innbrs){ nb.rank / nb.outdegree()} ; diff += val - t.pg_rank ; t.pg_rank <= t; } 18

19 Derivation of access patterns Foreach (t: G.Nodes) { Double val = k * Sum(nb: t.innbrs){ nb.rank / nb.outdegree()} ; diff += val - t.pg_rank ; t.pg_rank <= t; } 19

20 Green Marl: graph storage nodes r_nodes edges r_edges

21 Green Marl: graph storage ranks nodes r_nodes edges r_edges

22 Deriving access patterns Sum(nb: t.innbrs) { // }; Operation: InNbrs - neighbors of node t: s = r_nodes[t] e = r_nodes[t+1]-1 nb = [r_edges[x] for x in (s..e-1)] ranks nodes r_nodes edges r_edges

23 Deriving access patterns Sum(nb: t.innbrs) { // }; indexed Operation: InNbrs - neighbors of node read-only t: s = r_nodes[t] sequential e = r_nodes[t+1]-1 read-only nb = [r_edges[x] for x in (s..e-1)] nodes ranks sequential read-only r_nodes edges r_edges

24 Deriving access patterns Sum(nb: t.innbrs) { nb.rank / nb.outdegree() }; Operation: rank - rank of neigbor nb: rnk_tmp = rank[nb] random read-only ranks nodes r_nodes edges r_edges

25 Deriving access patterns Sum(nb: t.innbrs) { nb.rank / nb.outdegree() }; Operation: OutDegree() - of neighbor w: nodes[nb+1] nodes[nb] random read-only ranks nodes r_nodes edges r_edges

26 3) Runtime high-level program (e.g. Pagerank) high-level compiler analysis low-level code (e.g. C++) memory abstraction compiler (e.g. gcc) program library 26

27 Runtime: Choice of arrays start replicated y indexed? n read-only? y fits all nodes? y partitioned n distributed H/W characteristics n 27

28 Runtime: Choice of arrays start replicated y indexed? y partitioned n read-only? n distributed y fits all nodes? #nodes #CPUs Size RAM / node DMA Huge/large page n H/W characteristics 28

29 Shoal workflow high-level program (e.g. Pagerank) high-level compiler analysis access patterns hardware characteristics low-level code (e.g. C++) memory abstraction compiler (e.g. gcc) program library 29

30 Alternative approaches Search-based auto-tuning HW page migration Carrefour: Online analysis of access patterns Simon Fraser University, ASPLOS 2013 Performance counters to monitor accesses Dynamically migrate and replicate pages 30

31 EVALUATION 31

32 Single node allocation runtime [s] cores single-node 32

33 Distribute memory runtime [s] #pragma parallel omp for for (int i=0; i<size; i++) data[i] = 0; Problem: this is OS / HW specific cores single-node distributed Default Green Marl behavior 33

34 Carrefour: reactive tuning runtime [s] cores single-node distributed Carrefour 34

35 Shoal runtime [s] Knowledge of access patterns Replication Distribution Large pages (2M) cores single-node distributed Carrefour Shoal 35

Performance breakdown 250 runtime [s] 200 150 100 50 0 single-node

36 Performance breakdown 250 runtime [s] single-node partitioning + replication partitioning partitioning + replication + 2M pages 36

37 Conclusion Memory abstraction, arrays Compiler analysis derive access patterns Runtime library selects implementation Works well with domain specific languages Also: support for manual annotation Too complex, too dynamic Online Public release next week 37

Callisto-RTS: Fine-Grain Parallel Loops

Callisto-RTS: Fine-Grain Parallel Loops Tim Harris, Oracle Labs Stefan Kaestle, ETH Zurich 7 July 2015 The following is intended to provide some insight into a line of research in Oracle Labs. It is intended