Asymmetry-aware execution placement on manycore chips

Size: px

Start display at page:

Download "Asymmetry-aware execution placement on manycore chips"

Bruce Farmer
5 years ago
Views:

1 Asymmetry-aware execution placement on manycore chips Alexey Tumanov Joshua Wise, Onur Mutlu, Greg Ganger CARNEGIE MELLON UNIVERSITY

2 Introduction: Core Scaling? Moore s Law continues: can still fit more transistors on a chip 2

3 Introduction: Memory Controllers core MC Multi-core multi-mc chips gain prevalence But, now latency is NOT: uniform memory access latency (UMA) hierarchically non-uniform memory latency (NUMA) 3

4 NUMA NUMA: symmetrically non-uniform crossbar 2x 2x 2x 2x 2x 2x 2x 2x x x 2x 2x x x 2x 2x bottom left MC access from all cores 4

5 ANUMA ANUMA: asymmetrically non-uniform core MC bottom left MC access from all cores 5

6 NUMA vs. ANUMA Latency & bandwidth asymmetries in multi- MC manycore NoC systems are different [numa] sharp division into local/remote that dwarfs other intra-node asymmetries [anuma] gradual continuum of access latencies 6

7 Problem Statement MC access asymmetries can result in suboptimal placement w.r.t. MC access patterns 7

8 Motivating Experiments Bare metal latency differential ~ 14% does it matter? GCC: 1 thread, 1 MC, 1 core at a time, tile linux 1.24e e e e e e+07 14% 1.12e e+07 Manhattan distance to MC e e e e e e e e+07 GCC runtime (µs) not 0 8

9 Motivation Latency differential gets much worse 14% figure is NOT an upper bound Major factor: contention on the NoC NoC node and link contention slows down packets amplifying effect on bare metal asymmetries Examples: up to 100% latency differential in a simulator 9

10 Exploring Solution Space: NUMA Designate grid quadrants into NUMA zones (+) a well-explored approach (+) kernel code readily available (-) lacks flexibility (-) contention avoidance 10

11 Solution Space: data placement OS allocates physical frames in a way that: balances the NoC contention optimizes some objective function maximizes throughput (+) proactive (-) no crystal ball: which pages will be hot? (-) once allocated, can t be easily undone inter-mc page migration worsens NoC contention 11

12 Solution Space: thread placement Move execution closer to MCs used (-) waits for the problem to occur before acting (-) cache thrashing (+) better-informed decisions are possible MC usage statistics collection 12

13 Solution Space: hybrid Combine execution and data placement allocate frames, then adjust execution placement place execution, then adjust data placement (+) Corrective action possible (-) Complexity 13

14 MC Usage Stats Collection Interface: procfs control handle to start/stop collection procfs data file for stats data extraction pages accessed per MC per sampling cycle Implementation instrumented VM to page fault each page fault is attributed to corresponding MC i very inefficient, but does work adjustable sampling frequency (20Hz) trades off overhead vs. accuracy 14

15 Placement Algorithm Place threads nearest to MCs they use greedily based on memory intensity Weighted geometric placement algorithm calculate thread s desired coordinates as a function of MC coordinates thread s access count vector (L1 normal) Placement collisions resolved find second choices in direction of preferred MCs 15

16 Bandwidth Performance Results 550 <1+%'> <=1+%'>' 500,-./0!1!(2'" !$&$;*;'!"#$%&'!%:$;*;' 250 Base (%'" ()*+" (%'" ()*+" 16

17 Bandwidth Performance Results 550 <1+%'> <=1+%'>' 500,-./0!1!(2'" !$&$;*;'!"#$%&'!%:$;*;' 250 Base (%'" Overhead ()*+" (%'" ()*+" 17

18 Bandwidth Performance Results 550 <1+%'> <=1+%'>' 500,-./0!1!(2'" !$&$;*;'!"#$%&'!%:$;*;' 250 Base (%'" Ovrhd Weighted ()*+" (%'" ()*+" 18

19 Bandwidth Performance Results 550 <1+%'> <=1+%'>' 500,-./0!1!(2'" !$&$;*;'!"#$%&'!%:$;*;' 250 Base (%'" Ovrhd Wghtd ()*+" Brute (%'" ()*+" 19

20 Bandwidth Performance Results 550 <1+%'> <=1+%'>' 500,-./0!1!(2'" !$&$;*;'!"#$%&'!%:$;*;' 250 Base (%'" Ovrhd Wghtd ()*+" Brute Base (%'" ()*+" 20

21 Bandwidth Performance Results 550 <1+%'> <=1+%'>' 500,-./0!1!(2'" !$&$;*;'!"#$%&'!%:$;*;' 250 Base (%'" Ovrhd Wghtd ()*+" Brute Base (%'" Ovrhd BC)A# ()*+" Brute 21

22 Runtime Performance Results 1.3e e+07 1 threads 2 threads 4 threads 8 threads 16 threads Minimums Medians Maximums Microseconds to complete bench2 bench2 runtime (us) 1.1e+07 1e+07 9e+06 8e+06 7e+06 6e+06 5e+06 4e+06 B O WG Br N N N N N

23 Runtime Performance Results 1.3e e+07 1 threads 2 threads 4 threads 8 threads 16 threads Minimums Medians Maximums Microseconds to complete bench2 bench2 runtime (us) 1.1e+07 1e+07 9e+06 8e+06 7e+06 6e+06 5e+06 4e+06 B O WG Br B O WG Br N N N N N

24 Runtime Performance Results 1.3e e+07 1 threads 2 threads 4 threads 8 threads 16 threads Minimums Medians Maximums Microseconds to complete bench2 bench2 runtime (us) 1.1e+07 1e+07 9e+06 8e+06 7e+06 6e+06 5e+06 4e+06 B N O 0WG 1 2Br B NO 0WG 1 Br 2 B NO 0WG 1 Br 2 B NO 0WG 1 Br 2 B NO 0WG 1 Br

25 Takeaways Promising, but stat collection overhead is high Call for per-core MC stats counters in hardware Works despite high cost of migration 25

26 Future Work More experimentation to understand benefits as a function of application properties heterogeneous workloads ANUMCA exploring how these same ideas apply to cache cache access is asymmetrically non-uniform too Optimizing the application-level goal SLAs, multi-threaded app. performance goals 26

27 Conclusion Promising, but stat collection overhead is high Call for per-core MC stats counters Works despite high cost of migration Questions? 27

28 References [1] Alexey Tumanov, Joshua Wise, Onur Mutlu, Greg Ganger. Asymmetry-Aware Execution Placement on Manycore Chips. In Proc. of SFMA 13, EuroSys 2013, Czech Republic. [2] Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi. Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems. In Proc. of HPCA [3] Tilera Corporation, The Processor Architecture Overview for the TILEPro Series, Feb [pdf] 28

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea: