Region-Based Memory Management for Task Dataflow Models


1 Nikolopoulos, D. (2012). Region-Based Memory Management for Task Dataflow Models. First Joint ENCORE & PEPPHER Workshop on Programmability and Portability for Emerging Architectures (EPoPPEA), held in conjunction with the 7th International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC). Queen's University Belfast Research Portal.

2 Region-Based Memory Management for Task Dataflow Models
Dimitrios S. Nikolopoulos
Joint work with: Spyros Lyberis and Polyvios Pratikakis
Computer Architecture and VLSI Systems Laboratory (CARV)
Institute of Computer Science (ICS)
Foundation for Research and Technology Hellas (FORTH)
EPoPPEA, January 24, 2012

3 Outline
1. Introduction (Contribution)
2. Memory Management with Nested Regions
3. Task and Region Semantics (By example; Formal definition)
4. Distributed Memory Regions
5. Early Experimental Results
6. Conclusions

4 The Problem: Manycore Processors and Shared Memory
Shared memory can be hard to scale to many cores
  Cache coherence becomes expensive
  Causes excessive communication and memory contention
Manycore processors relax the shared-memory abstraction
  AMD Opteron: 12 cores, NUMA, MMU per core
  Cell Broadband Engine: 8+1 cores, no shared memory
  GPUs: thousands of cores, small shared memories between few cores
  Intel Single-Chip Cloud Computer (SCC): 48 cores, no shared memory
Increased demand for parallel software development
  Threads and shared memory: accessible, but error-prone
  Message passing: tedious, experts-only
  Both are non-deterministic: difficult to debug and understand

5 Task Parallelism for Distributed and Shared Memory
Available in several contemporary programming models (Sequoia, Cilk, OMPSs)
Recursive task parallelism
  Parallel programs are composed from nested parallel tasks
  Annotate each task with its memory footprint
  The runtime system performs dependency analysis, scheduling, and all required data transfers
Support for region-based memory management
  Easier to express irregular task footprints
  Enables the use of dynamic data structures
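When the footprint is a few contiguous blocks, the annotation is straightforward. The following is a minimal sketch of the idea using OpenMP 4.0 task depend clauses as a stand-in for the footprint annotations of the models above (the deck's own spawn/footprint syntax appears in the later examples); the producer/reduce functions are made up for illustration.

#include <stdio.h>

#define N 1024

static void produce(double *a, int n) {
    for (int i = 0; i < n; i++) a[i] = i;
}

static double reduce(const double *a, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

int main(void) {
    double a[N];
    double sum = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Each task names the contiguous data it reads or writes; the runtime
           derives the producer/consumer dependence from these footprints. */
        #pragma omp task depend(out: a[0:N]) shared(a)
        produce(a, N);

        #pragma omp task depend(in: a[0:N]) shared(a, sum)
        sum = reduce(a, N);
    }   /* implicit barrier: both tasks have completed here */

    printf("sum = %f\n", sum);
    return 0;
}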

6 Benefits
Shared-memory abstraction
  Resembles shared-memory programming
  The runtime system hides data transfers among core memories
  Memory footprints are used to transfer the required data locally before starting a task
  Alternatively, schedule the task near its data
Message-passing execution semantics
  Tasks compute on local data
  Hierarchical scheduling, easier to scale to more cores
  Removes the shared-memory bottleneck
  No need for a complicated SDSM protocol on every access
Deterministic
  Implicit, correct synchronization
  Repeatable behavior
  Formal proof

8 Motivation for Regions
Task memory footprints are easy to express if they are composed of a few objects, contiguous in memory
What if the task memory footprint is:
  A linked list, or part of it?
  A tree, or part of it?
  A graph, or part of it?
Hard to use many common linked data structures
Hard to express irregular algorithms and applications

9 Example: Region-based Memory Management

region G, L, R;
Object *v0;

init () {
    G = newregion();
    L = newsubregion(G);
    R = newsubregion(G);
    v0        = ralloc(G, sizeof(Object));
    v0->left  = ralloc(L, sizeof(Object));
    v0->right = ralloc(R, sizeof(Object));
}

[Figure: region G contains subregions L and R. Object v0 is allocated in G, its left child v1 in region L, and its right child v2 in region R.]
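A self-contained sketch of how such a region API can be implemented on a single node: each region keeps a list of its allocations and of its subregions, so deleting a region reclaims everything inside it at once. This is an illustrative emulation on top of malloc, not the Myrmics implementation; the function names follow the example above, and delregion is a hypothetical name for the bulk-free operation.

#include <stdlib.h>

typedef struct block  { struct block *next; } block_t;
typedef struct region { struct region *child, *sibling; block_t *blocks; } region_t;

/* Create a new top-level region. */
region_t *newregion(void) {
    return calloc(1, sizeof(region_t));
}

/* Create a region nested inside parent; it is reclaimed when parent is. */
region_t *newsubregion(region_t *parent) {
    region_t *r = newregion();
    r->sibling = parent->child;
    parent->child = r;
    return r;
}

/* Allocate an object inside region r; a pointer-sized header links it to r
   (pointer alignment is assumed adequate for this sketch). */
void *ralloc(region_t *r, size_t size) {
    block_t *b = malloc(sizeof(block_t) + size);
    b->next = r->blocks;
    r->blocks = b;
    return b + 1;            /* payload starts right after the header */
}

/* Delete a region: recursively frees all subregions and every object
   allocated in the region, in a single operation. */
void delregion(region_t *r) {
    for (region_t *c = r->child; c != NULL; ) {
        region_t *next = c->sibling;
        delregion(c);
        c = next;
    }
    for (block_t *b = r->blocks; b != NULL; ) {
        block_t *next = b->next;
        free(b);
        b = next;
    }
    free(r);
}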

20 Example: Tasks and Regions

f(Object *v0) {
    if (done) return;
    spawn f(v0->left)  [inout region L];
    spawn f(v0->right) [inout region R];
    spawn h(v0->left)  [inout v0->left];
    spawn h(v0->right) [inout v0->right];
}

[Figure: task f[G] runs on region G and spawns f[L] on region L, f[R] on region R, and h[v1], h[v2] on the individual objects v1 and v2; a spawned task's footprint determines which earlier tasks it must wait for.]
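A region footprint stands for every object allocated in that region or in any of its subregions, which is what makes h[v1] above wait for f[L]: the object v1 lives inside region L. Below is a minimal sketch of this containment test, assuming each object records its home region and each region its parent; this metadata layout is hypothetical, not the runtime's actual one.

#include <stdbool.h>

typedef struct region { struct region *parent; } region_t;
typedef struct object { region_t *home; /* payload omitted */ } object_t;

/* Is region r the region anc itself, or nested (transitively) inside it? */
static bool region_within(const region_t *r, const region_t *anc) {
    for (; r != NULL; r = r->parent)
        if (r == anc)
            return true;
    return false;
}

/* An object footprint overlaps a region footprint when the object's home
   region lies in the named region's subtree; overlapping footprints are
   serialized in spawn order. */
bool object_in_region_footprint(const object_t *o, const region_t *r) {
    return region_within(o->home, r);
}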

26 Deterministic Scheduler: Operational Semantics
Define a small-step operational semantics: T, D, S, R →_p T', D', S', R'
Memory model:
  Global address space: the store S includes all memory addresses
  Distributed memory implementation: each task only accesses memory locations declared in its footprint
Scheduling algorithm:
  Dependency metadata D maintains a task queue per memory location
  It is always possible to spawn a task
  But a task can only run when it is at the head of the queue for all locations in its footprint
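A minimal sketch of this queue discipline, assuming a FIFO of pending tasks per memory location (hypothetical structures, not the runtime's metadata): spawning appends the task to the queue of every location in its footprint, and the task becomes runnable only once it is at the head of all of them.

#include <stdbool.h>
#include <stddef.h>

#define QCAP 64                       /* illustrative fixed capacity */

typedef struct task task_t;

/* One FIFO of pending tasks per memory location that appears in a footprint. */
typedef struct {
    task_t *fifo[QCAP];
    size_t  head, tail;               /* tail - head = number of queued tasks */
} loc_queue_t;

/* Spawning always succeeds: append the task to every location it touches. */
void spawn_enqueue(task_t *t, loc_queue_t *locs[], size_t nlocs) {
    for (size_t i = 0; i < nlocs; i++) {
        loc_queue_t *q = locs[i];
        q->fifo[q->tail++ % QCAP] = t;
    }
}

/* The task may run only when it is at the head of the queue of every
   location in its footprint. */
bool can_run(const task_t *t, loc_queue_t *const locs[], size_t nlocs) {
    for (size_t i = 0; i < nlocs; i++) {
        const loc_queue_t *q = locs[i];
        if (q->head == q->tail || q->fifo[q->head % QCAP] != t)
            return false;
    }
    return true;
}

/* When the task finishes, pop it from each of its queues. */
void retire(loc_queue_t *locs[], size_t nlocs) {
    for (size_t i = 0; i < nlocs; i++)
        locs[i]->head++;
}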

27 Determinism Proof Technique (Pratikakis et al., MSPC '11)
Define a sequential operational semantics: S, e →_s S', e'
Prove sequential equivalence on execution traces
  Intuitively: every parallel execution produces the same value and memory state as the sequential execution
  More precisely: given a program e and a finite (terminating) parallel execution trace for e that produces a value v and a memory state S, we can always construct a sequential execution trace for e that also returns v and produces S
Proof by induction on the parallel trace
  We can always reorder steps in the parallel trace to bring the next sequential step to the front of the trace
  The proof does not require scoped parallelism (sync)
  Similar to a confluence proof
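Restated with the two semantics side by side (a paraphrase of the statement above, not the exact formulation of the MSPC '11 paper):

% Parallel configurations step as  T, D, S, R ->_p T', D', S', R'
% Sequential configurations step as  S, e ->_s S', e'
\begin{theorem}[Sequential equivalence, paraphrased]
  For every program $e$ and every finite parallel trace
  $\langle T_0, D_0, S_0, R_0\rangle \rightarrow_p \cdots \rightarrow_p
   \langle T_n, D_n, S_n, R_n\rangle$
  of $e$ that terminates with value $v$ and store $S$, there exists a
  sequential trace
  $\langle S_0, e\rangle \rightarrow_s \cdots \rightarrow_s \langle S, v\rangle$
  producing the same value $v$ and the same store $S$.
\end{theorem}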

29 Distributed Memory Management API
Object allocation:
  void *sys_alloc(size_t size, rid_t region);
  void sys_balloc(size_t size, rid_t region, int num_objects, void **objects);
  void sys_free(void *ptr);
  void *sys_realloc(void *ptr, size_t size, rid_t rgn);
Region allocation:
  rid_t sys_ralloc(rid_t parent, char level_hint);
  void sys_rfree(rid_t region);
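A hedged usage sketch of this API, recreating the earlier tree example. The rid_t definition, the level_hint value of 0, and the caller-supplied parent region are assumptions made for illustration; the signatures are the ones listed above.

#include <stddef.h>

typedef int rid_t;                                 /* assumed handle type */

/* Signatures as listed above, assumed to come from the runtime header. */
void  *sys_alloc(size_t size, rid_t region);
rid_t  sys_ralloc(rid_t parent, char level_hint);
void   sys_free(void *ptr);
void   sys_rfree(rid_t region);

typedef struct Object { struct Object *left, *right; } Object;

void build_tree(rid_t parent /* assumed: region handle obtained elsewhere */) {
    rid_t G = sys_ralloc(parent, 0);               /* region G */
    rid_t L = sys_ralloc(G, 0);                    /* subregion L of G */
    rid_t R = sys_ralloc(G, 0);                    /* subregion R of G */

    Object *v0 = sys_alloc(sizeof(Object), G);     /* v0 lives in G */
    v0->left   = sys_alloc(sizeof(Object), L);     /* v1 lives in L */
    v0->right  = sys_alloc(sizeof(Object), R);     /* v2 lives in R */

    sys_rfree(G);                                  /* reclaims v0, v1, v2, L, R */
}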

30 Distributed Memory Implementation by Example

A = malloc(10);
B = malloc(20);
C = malloc(5);
D = malloc(80);

# task in A, out B
task1(A, B);

# task in C, out D
task2(C, D);

# task in B, inout D
task3(B, D);

task2(C, D) {
    ...
    # task inout D
    task4(D);
    # wait on D
}

[Figure: the slides step through this program on three cores. Initially core0 owns A and C, core1 owns B, and core2 owns D. Each spawned task's arguments are entered into the alloc() hash, a dep waiting counter tracks arguments not yet available, and tasks whose counter reaches zero enter the ready queue. Ownership follows execution: A is fetched to core1 (which holds B) to run task1, C to core2 (which holds D) to run task2, and B later to core2 to run task3.]

39 Distributed Memory Implementation by Example: task pipeline
Decode:  Enqueue the task arguments into the alloc() hash (from the left/right side); count the non-owned arguments in dep waiting.
Issue:   If dep waiting reaches 0, put the task in the ready queue along with its affinity statistics.
Prepare: Decide where to run; fetch non-local data.
Run:     Execute the task code when the fetch has finished.
Collect: Update ownership in the alloc() hash; update dep waiting for these objects; possibly go back to Issue.
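A minimal sketch of the Issue and Collect bookkeeping described above, assuming a per-task counter of footprint arguments that are not yet available (the dep waiting count) and a simple ready list; the alloc() hash, data fetches, and affinity statistics are left out.

#include <stddef.h>

typedef struct task {
    int dep_waiting;                  /* footprint arguments not yet available */
    struct task *next_ready;
} task_t;

static task_t *ready_list;            /* tasks whose footprints are satisfied */

static void make_ready(task_t *t) {
    t->next_ready = ready_list;
    ready_list = t;
}

/* Issue: after decoding, record how many arguments are not yet owned locally
   (or not yet produced); a task with nothing to wait for is ready at once. */
void issue(task_t *t, int unavailable_args) {
    t->dep_waiting = unavailable_args;
    if (t->dep_waiting == 0)
        make_ready(t);
}

/* Collect: when a finished task releases an object, every task waiting on
   that object drops its counter; tasks reaching zero move to the ready list. */
void collect(task_t *waiters[], size_t nwaiters) {
    for (size_t i = 0; i < nwaiters; i++) {
        if (--waiters[i]->dep_waiting == 0)
            make_ready(waiters[i]);
    }
}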

41 Hierarchical scheduler topology
[Figure: a tree of schedulers, with a single level-0 scheduler at the root, level-1 schedulers beneath it, level-2 schedulers (2A-2F) below those, and workers at level 3.]

42 Splitting region trees for hierarchical scheduling
[Figure: a region tree (root region with subregions L, M, R; children L1, L2, Ma, Mb, R1, Ra; grandchildren Ma1, Mb1, Mb2, Ra1; objects at the leaves) is split across the scheduler hierarchy: the L0 scheduler owns the upper part of the tree, while L1 and L2 schedulers own the subtrees beneath it.]

43 Barnes-Hut, 1.1M bodies, single scheduler
[Figure: execution time (seconds) versus number of workers, broken down into computation, worker communication, scheduler communication, and barrier wait, for the simulate, load-balancing, new-local-tree, and fetch-remote-tree phases; the serial time is shown for reference.]

44 Barnes-Hut, 1.1M bodies, two-level scheduler hierarchy
[Figure: execution time (seconds) versus number of workers (number of L1 schedulers) from 1(1), 2(1), 4(1), 8(1), 16(2), 32(4), 64(8), 128(16), 256(32) up to 512(64), with the same time breakdown as the single-scheduler plot; the serial time is shown for reference.]

45 UPC comparison, 16 worker cores
[Figure: execution time (seconds) versus number of objects for UPC and Myrmics, with 248-B, 504-B, 1016-B, and 2040-B objects.]

46 UPC comparison, 15, B objects
[Figure: execution time (seconds) versus number of workers for UPC, Myrmics with a single scheduler, and Myrmics with hierarchical schedulers.]

48 Conclusions
Regions as a language construct:
  Arbitrary task memory footprints, better productivity
  Shared-memory programming abstractions
  Distributed-memory execution semantics
  Scalability
Work in progress:
  Distributed-memory task-based runtime (Myrmics)
  Compiler support for regions (SCOOP)
  Implementation and testing on an FPGA prototype (Formic)

49 CARV (FORTH-ICS): Formic 512-core FPGA prototype
