Islands RTS: a Hybrid Haskell Runtime System for NUMA Machines

Size: px

Start display at page:

Download "Islands RTS: a Hybrid Haskell Runtime System for NUMA Machines"

Felix Elliott
5 years ago
Views:

1 Islands RTS: a Hybrid Haskell Runtime System for NUMA Machines Marcin Orczyk University of Glasgow May 13, 2011 Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

2 Paralellism CPU clock rates have plateaued the key to increased performance is scalable parallelism from multicore CPUs, through NUMA and massively parallel hardware (GPUs, FPGAs), to clusters and cloud computing Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

3 NUMA 16-core machine composed of 4 quad-core chips Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

4 NUMA cache coherence is an overhead how high will it be at 20 cores? 100 cores? 500 cores? 16-core machine composed of 4 quad-core chips Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

5 NUMA cache coherence is an overhead how high will it be at 20 cores? 100 cores? 500 cores? 16-core machine composed of 4 quad-core chips non-coherent or partially cache coherent architectures Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

6 Islands RTS Haskell runtime system for partially cache coherent machines take advantage of shared memory/coherent cache when available Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

7 Islands RTS Haskell runtime system for partially cache coherent machines take advantage of shared memory/coherent cache when available Islands RTS is a blend of two existing parallel Haskell implementations: GHC: shared memory runtime system with support for parallel evaluation GUM: distributed runtime system based on message passing Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

8 Island island: a set of processing cores sharing a cache coherent memory system is composed of a number of islands Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

9 Island island: a set of processing cores sharing a cache coherent memory system is composed of a number of islands system with 4 cache-coherent islands, each with 4 processing elements Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

10 Island island: a set of processing cores sharing a cache coherent memory system is composed of a number of islands system with 4 cache-coherent islands, each with 4 processing elements one real heap per island virtual shared heap between islands Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

11 Virtual Shared Heap following GUM, Islands RTS provides virtual shared heap closures can transparently reside on different islands Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

12 Virtual Shared Heap following GUM, Islands RTS provides virtual shared heap closures can transparently reside on different islands consider evaluation of the expression (case x of...), where x exists on a different island Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

13 Virtual Shared Heap following GUM, Islands RTS provides virtual shared heap closures can transparently reside on different islands consider evaluation of the expression (case x of...), where x exists on a different island case x of... x Heap of Island A fetch Heap of Island B Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

14 Virtual Shared Heap following GUM, Islands RTS provides virtual shared heap closures can transparently reside on different islands consider evaluation of the expression (case x of...), where x exists on a different island case x of... x Heap of Island A fetch Heap of Island B packing/unpacking closures global addresses message passing layer and protocol stub closures Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

15 Global Addresses triples (,, ) Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

16 Global Addresses triples (island,, ) Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

17 Global Addresses triples (island, slot, ) Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

18 Global Addresses triples (island, slot, weight) weighted reference counting Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

19 Global Addresses triples (island, slot, weight) weighted reference counting prevent garbage collection slots are reused similar to stable pointers Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

20 Global Addresses triples (island, slot, weight) weighted reference counting prevent garbage collection slots are reused similar to stable pointers Islands RTS calls them hard links they are quite heavyweight and the guarantees they provide are often unnecessary e.g. recognising duplicates, speculative evaluation Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

21 Global Addresses Soft Links analogous to weak pointers triples (island, slot, slot2) do not prevent garbage collection slots never reused Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

22 Global Addresses Soft Links analogous to weak pointers triples (island, slot, slot2) do not prevent garbage collection slots never reused GUM used only hard links reasons to distinguish between hard and soft links: clarifies implementation potentially improves performance Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

23 Message Passing Meassages virtual shared heap FETCH(ga-from, ga-to) UPDATE(ga, data) FREE(ga) work distribution FISH SPARK(data) startup, shutdown messages Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

24 Within Island based on GHC 7 multiple islands in the same process changes: eliminating global data structures scheduler loop: hooks for message handling garbage collector: hooks for global addresses new closures Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

25 Static Thunks compiler allocates certain closures statically some of them are thunks Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

26 Static Thunks compiler allocates certain closures statically some of them are thunks closure on island B refers to a static thunk STATIC static thunk closure Heap of Island A Heap of Island B Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

27 Static Thunks compiler allocates certain closures statically some of them are thunks closure on island B refers to a static thunk island A evaluates the static thunk STATIC static ind thunk closure Heap of Island A Heap of Island B Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

28 Static Thunks compiler allocates certain closures statically some of them are thunks closure on island B refers to a static thunk island A evaluates the static thunk island B evaluates thunk in A s heap STATIC static ind ind closure closure Heap of Island A Heap of Island B Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

29 Static Thunks - Solution a layer of indirection Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

30 Static Thunks - Solution a layer of indirection closure on island B refers to a static thunk STATIC static thunk closure Heap of Island A Heap of Island B Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

31 Static Thunks - Solution a layer of indirection closure on island B refers to a static thunk island A evaluates the static thunk STATIC IndIsland A B NULL static ind thunk closure Heap of Island A Heap of Island B Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

32 Static Thunks - Solution a layer of indirection closure on island B refers to a static thunk island A evaluates the static thunk island B accesses the static thunk STATIC IndIsland A B static ind closure thunk Heap of Island A FetchMv Heap of Island B Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

33 Static Thunks - Solution overhead on evaluation and access memory overhead proportional to the number of local islands additional, nontrivial complexity suggestions for solving this problem are welcomed Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

34 Summary parallel hardware becomes hierarchical Islands RTS matches it with the hierarchical architecture of the runtime itself within a cache coherent island - shared memory graph reduction between the islands - virtual heap based on message passing enables exploiting most appropriate mechanisms at each level future directions non-coherent shared memory port to Barrelfish remote islands heterogenous islands Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

35 Questions? Questions? Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

36 Uniqueness of Soft Links we can have tons! Marcin Orczyk (University of Glasgow) Islands RTS May 13, / 16

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to