Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter


1 Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter


2 Motivation. Memory is a resource shared among the cores. Threads' requests contend for memory, which degrades single-thread performance and can even lead to starvation. How should memory requests be scheduled to increase both system throughput and fairness?

3 Previous Scheduling Algorithms are Biased. [Plot: Maximum Slowdown (lower means better fairness) versus Weighted Speedup (higher means better system throughput) for FRFCFS, STFM, PAR-BS, and ATLAS; some algorithms show a fairness bias, others a system throughput bias.] No previous memory scheduling algorithm provides both the best fairness and the best system throughput.

4 Why do Previous Algorithms Fail? Throughput-biased approach: prioritize less memory-intensive threads. This is good for throughput, but the most intensive thread can starve, which causes unfairness. Fairness-biased approach: threads take turns accessing memory. This does not starve any thread, but the less memory-intensive threads are not prioritized, which reduces throughput. A single policy for all threads is insufficient.

5 Insight: Achieving Best of Both Worlds. For throughput: prioritize memory-non-intensive threads. For fairness: unfairness is caused by memory-intensive threads being prioritized over each other, so shuffle their priorities. Memory-intensive threads have different vulnerability to interference, so shuffle asymmetrically.

6 Outline: Motivation & Insights; Overview; Algorithm; Bringing it All Together; Evaluation; Conclusion

7 Overview: Thread Cluster Memory Scheduling. 1. Group threads into two clusters. 2. Prioritize the non-intensive cluster. 3. Apply a different policy to each cluster. Of the threads in the system, the memory-non-intensive threads form the non-intensive cluster, which gets higher priority and is managed for throughput; the memory-intensive threads form the intensive cluster, which is managed for fairness.


9 TCM Outline: 1. Clustering

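10 Clustering Threads. Step 1: Sort threads by MPKI (misses per kilo-instruction). Step 2: Memory bandwidth usage αT divides the clusters, where T is the total memory bandwidth usage and α is the ClusterThreshold (α < 10%). The lowest-MPKI threads whose combined bandwidth usage stays within αT form the non-intensive cluster; the remaining, higher-MPKI threads form the intensive cluster.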
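The two clustering steps can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the `ThreadStats` record, the `cluster_threads` function, and the sample numbers are hypothetical, and it assumes that threads are admitted to the non-intensive cluster in ascending MPKI order until their combined bandwidth usage would exceed αT.

```python
from typing import NamedTuple


class ThreadStats(NamedTuple):
    name: str
    mpki: float      # misses per kilo-instruction (hypothetical sample input)
    bw_usage: float  # memory bandwidth usage, in any consistent unit


def cluster_threads(threads, cluster_threshold=0.10):
    """Split threads into (non_intensive, intensive) lists.

    Assumption: threads are admitted in ascending MPKI order while their
    combined bandwidth usage stays within cluster_threshold * total usage.
    """
    total_bw = sum(t.bw_usage for t in threads)
    budget = cluster_threshold * total_bw  # this is alpha * T
    non_intensive, intensive = [], []
    used = 0.0
    for t in sorted(threads, key=lambda t: t.mpki):  # Step 1: sort by MPKI
        # Step 2: once one thread overflows the budget, all later
        # (higher-MPKI) threads belong to the intensive cluster.
        if not intensive and used + t.bw_usage <= budget:
            used += t.bw_usage
            non_intensive.append(t)
        else:
            intensive.append(t)
    return non_intensive, intensive


if __name__ == "__main__":
    sample = [
        ThreadStats("A", 0.5, 2.0),
        ThreadStats("B", 1.0, 3.0),
        ThreadStats("C", 20.0, 45.0),
        ThreadStats("D", 30.0, 50.0),
    ]
    light, heavy = cluster_threads(sample)
    print([t.name for t in light], [t.name for t in heavy])
```

Raising `cluster_threshold` in this sketch moves more threads into the non-intensive cluster, which is the knob the later slides describe as trading fairness against throughput.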

11 TCM Outline: 1. Clustering; 2. Between Clusters

12 Prioritization Between Clusters. Prioritize the non-intensive cluster over the intensive cluster. This increases system throughput, because non-intensive threads have greater potential for making progress. It does not degrade fairness, because non-intensive threads are light and rarely interfere with intensive threads.

13 TCM Outline: 1. Clustering; 2. Between Clusters; 3. Non-Intensive Cluster (throughput)

14 Non-Intensive Cluster. Prioritize threads according to MPKI: the thread with the lowest MPKI gets the highest priority, and the thread with the highest MPKI gets the lowest. This increases system throughput, because the least intensive thread has the greatest potential for making progress in the processor.

15 TCM Outline: 1. Clustering; 2. Between Clusters; 3. Non-Intensive Cluster (throughput); 4. Intensive Cluster (fairness)

16 Intensive Cluster. Periodically shuffle the priority of threads, so that each thread takes a turn as the most prioritized thread. This increases fairness. Is treating all threads equally good enough? BUT: equal turns do not mean the same slowdown.

17 Case Study: A Tale of Two Threads. Two intensive threads contend for memory: 1. a random-access thread and 2. a streaming thread. Which is slowed down more easily? When the random-access thread is prioritized, it sees a 1x slowdown and the streaming thread a 7x slowdown. When the streaming thread is prioritized, it sees a 1x slowdown and the random-access thread is slowed down by a larger factor (the number is not legible here). The random-access thread is more easily slowed down.

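18 Why are Threads Different? [Diagram: memory with Banks 1 to 4, each holding rows and one activated row; requests from the random-access thread and the streaming thread, with one random-access request stuck.] Random-access thread: all requests are serviced in parallel across banks (high bank-level parallelism), which makes the thread vulnerable to interference when one request gets stuck. Streaming thread: all requests go to the same row (high row-buffer locality).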


20 Niceness. How to quantify the difference between threads? High bank-level parallelism means vulnerability to interference and increases niceness (+). High row-buffer locality causes interference and decreases niceness (-).
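One way to turn these two opposing factors into a single number is sketched below. It assumes a rank-based definition (a thread's rank by bank-level parallelism minus its rank by row-buffer locality); the slide only states the signs of the two contributions, so the exact formula, the function names, and the sample values are illustrative assumptions.

```python
def rank_positions(values):
    """Map each thread to its 1-based position when sorted by ascending value."""
    ordered = sorted(values, key=values.get)
    return {thread: i + 1 for i, thread in enumerate(ordered)}


def niceness(blp, rbl):
    """Assumed definition: rank by bank-level parallelism (raises niceness)
    minus rank by row-buffer locality (lowers niceness)."""
    b = rank_positions(blp)
    r = rank_positions(rbl)
    return {thread: b[thread] - r[thread] for thread in blp}


if __name__ == "__main__":
    # Hypothetical measurements for the two threads of the case study.
    blp = {"random-access": 3.5, "streaming": 1.1}
    rbl = {"random-access": 0.10, "streaming": 0.95}
    print(niceness(blp, rbl))
```

With these sample inputs the random-access thread comes out nicer than the streaming thread, which matches the direction of the case study: it is the one that is more easily slowed down.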

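21 Shuffling: Round-Robin vs. Niceness-Aware. 1. Round-Robin shuffling. 2. Niceness-Aware shuffling. [Diagram, round-robin: priority order of threads A to D, which range from nicest to least nice, over time, one column per ShuffleInterval. Initial priority order: D C B A. Most prioritized thread in successive intervals: D, A, B, C, D.] GOOD: each thread is prioritized once. What can go wrong?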

22 Shuffling: Round-Robin vs. Niceness-Aware. 1. Round-Robin shuffling. 2. Niceness-Aware shuffling. [Diagram, round-robin: priority order of threads A to D, which range from nicest to least nice, in successive ShuffleIntervals: D C B A, A D C B, B A D C, C B A D, D C B A.] GOOD: each thread is prioritized once. What can go wrong? BAD: nice threads receive lots of interference.
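The round-robin sequence in the diagram is a plain rotation of the priority list, which the following sketch reproduces. The function name and the string encoding of threads are illustrative choices; the sketch covers only the round-robin baseline, not the niceness-aware shuffle.

```python
from collections import Counter, deque


def round_robin_orders(initial, intervals):
    """Return the priority order (most prioritized first) for each interval."""
    order = deque(initial)
    orders = []
    for _ in range(intervals):
        orders.append("".join(order))
        order.rotate(1)  # the lowest-priority thread moves to the top
    return orders


if __name__ == "__main__":
    orders = round_robin_orders("DCBA", 4)
    print(orders)
    # Count how often each thread is the most prioritized one.
    print(Counter(o[0] for o in orders))
```

Over one full rotation every thread is on top exactly once, which is the "GOOD" property on the slide; the rotation ignores niceness entirely, which is the property the slide flags as the problem.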

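23 Shuffling: Round-Robin vs. Niceness-Aware. 1. Round-Robin shuffling. 2. Niceness-Aware shuffling. [Diagram, niceness-aware: priority order of threads A to D, which range from nicest to least nice, over time, one column per ShuffleInterval. Initial priority order: D C B A. Most prioritized thread in successive intervals: D, C, B, A, D.] GOOD: each thread is prioritized once.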

24 Shuffling: Round-Robin vs. Niceness-Aware. 1. Round-Robin shuffling. 2. Niceness-Aware shuffling. [Diagram, niceness-aware: full priority order of threads A to D in each ShuffleInterval. Most prioritized thread in successive intervals: D, C, B, A, D.] GOOD: each thread is prioritized once. GOOD: the least nice thread stays mostly deprioritized.



27 Quantum-Based Operation. Time is divided into quanta (~1M cycles each). During a quantum: monitor thread behavior (1. memory intensity, 2. bank-level parallelism, 3. row-buffer locality). At the beginning of a quantum: perform clustering and compute the niceness of the intensive threads, using the behavior monitored in the previous quantum. Within a quantum, priorities are shuffled once per shuffle interval (~1K cycles).
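The timing relationship between the two periods can be sketched as a small event schedule. The constants take the slide's approximate figures at face value, and the event names and the `events_at` helper are hypothetical labels for illustration.

```python
QUANTUM_CYCLES = 1_000_000       # "~1M cycles" on the slide, taken as exact here
SHUFFLE_INTERVAL_CYCLES = 1_000  # "~1K cycles" on the slide, taken as exact here


def events_at(cycle):
    """Return the periodic actions that fire at the given cycle."""
    events = []
    if cycle % QUANTUM_CYCLES == 0:
        # Beginning of a quantum: use the statistics monitored so far.
        events.append("perform_clustering")
        events.append("compute_niceness")
    if cycle % SHUFFLE_INTERVAL_CYCLES == 0:
        events.append("shuffle_intensive_cluster")
    return events


if __name__ == "__main__":
    print(events_at(0))
    print(events_at(1_000))
    print(QUANTUM_CYCLES // SHUFFLE_INTERVAL_CYCLES, "shuffles per quantum")
```

Under these assumed values, clustering and niceness are recomputed rarely while shuffling happens about a thousand times per quantum, so the shuffle always operates on a fixed set of intensive threads.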

28 TCM Scheduling Algorithm. 1. Highest-rank: requests from higher-ranked threads are prioritized. Non-intensive cluster > intensive cluster. Non-intensive cluster: lower intensity means higher rank. Intensive cluster: rank shuffling. 2. Row-hit: row-buffer hit requests are prioritized. 3. Oldest: older requests are prioritized.
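The three rules form a lexicographic ordering over pending requests, which can be sketched as a sort key. The `Request` record, the `pick_next` function, and the `rank` mapping (larger number means higher rank) are hypothetical names; how ranks are assigned is left to the clustering and shuffling steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    thread: str
    row_hit: bool  # True if the request targets the currently open row
    arrival: int   # cycle at which the request entered the queue


def pick_next(requests, rank):
    """Choose the next request: highest rank, then row hit, then oldest."""
    return min(
        requests,
        key=lambda r: (
            -rank[r.thread],  # 1. higher-ranked thread first
            not r.row_hit,    # 2. row-buffer hits before misses
            r.arrival,        # 3. older requests first
        ),
    )


if __name__ == "__main__":
    rank = {"light": 2, "heavy": 1}  # assumed: larger value = higher rank
    queue = [
        Request("heavy", True, 0),
        Request("light", False, 5),
        Request("light", True, 9),
    ]
    print(pick_next(queue, rank))
```

In the example, the older row-hit request from the lower-ranked thread loses to both requests of the higher-ranked thread, and among those the row hit wins even though it arrived later.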

29 Implementation Costs. Required storage at the memory controller (24 cores):

Thread memory behavior     Storage
MPKI                       ~0.2kb
Bank-level parallelism     ~0.6kb
Row-buffer locality        ~2.9kb
Total                      < 4kbits

No computation is on the critical path.


31 Metrics & Methodology. Metrics: system throughput is measured by Weighted Speedup $= \sum_i \frac{IPC_i^{shared}}{IPC_i^{alone}}$; unfairness is measured by Maximum Slowdown $= \max_i \frac{IPC_i^{alone}}{IPC_i^{shared}}$. Methodology: core model with a 4 GHz processor, 128-entry instruction window, and 512 KB/core L2 cache; memory model: DDR2; 96 multiprogrammed SPEC CPU2006 workloads.
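Both metrics are simple to compute from per-thread IPC values measured when each thread runs alone and when it runs in the shared system. The sketch below uses made-up IPC numbers, and the function names are illustrative.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """System throughput: sum over threads of IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def maximum_slowdown(ipc_shared, ipc_alone):
    """Unfairness: largest per-thread slowdown IPC_alone / IPC_shared."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))


if __name__ == "__main__":
    # Hypothetical two-thread workload.
    ipc_alone = [2.0, 1.0]
    ipc_shared = [1.0, 0.25]
    print(weighted_speedup(ipc_shared, ipc_alone))
    print(maximum_slowdown(ipc_shared, ipc_alone))
```

Weighted speedup rewards a scheduler for the sum of per-thread progress, while maximum slowdown looks only at the worst-off thread, which is why one algorithm can score well on the first and poorly on the second.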

32 Previous Work. FRFCFS [Rixner et al., ISCA00]: prioritizes row-buffer hits. It is thread-oblivious, so it has low throughput and low fairness. STFM [Mutlu et al., MICRO07]: equalizes thread slowdowns. Non-intensive threads are not prioritized, so it has low throughput. PAR-BS [Mutlu et al., ISCA08]: prioritizes the oldest batch of requests while preserving bank-level parallelism. Non-intensive threads are not always prioritized, so it has low throughput. ATLAS [Kim et al., HPCA10]: prioritizes threads with less memory service. The most intensive thread starves, so it has low fairness.

33 Results: Fairness vs. Throughput. [Plot, averaged over 96 workloads: Maximum Slowdown (lower means better fairness) versus Weighted Speedup (higher means better system throughput) for FRFCFS, ATLAS, STFM, PAR-BS, and TCM; annotations on the plot read 5%, 39%, 5%, and 8%.] TCM provides the best fairness and system throughput.


34 Results: Fairness-Throughput Tradeoff. [Plot: Maximum Slowdown (lower means better fairness) versus Weighted Speedup (higher means better system throughput) when each algorithm's configuration parameter is varied, for FRFCFS, STFM, ATLAS, PAR-BS, and TCM; the TCM curve is obtained by adjusting ClusterThreshold.] TCM allows a robust fairness-throughput tradeoff.

35 Operating System Support. ClusterThreshold is a tunable knob: the OS can trade off between fairness and throughput. Enforcing thread weights: the OS assigns weights to threads, and TCM enforces the thread weights within each cluster.


37 Conclusion. No previous memory scheduling algorithm provides both high system throughput and fairness. Problem: they use a single policy for all threads. TCM groups threads into two clusters: 1. Prioritize the non-intensive cluster (throughput). 2. Shuffle priorities in the intensive cluster (fairness). 3. Shuffling should favor nice threads (fairness). TCM provides the best system throughput and fairness.

38 THANK YOU


40 Thread Weight Support. Even if the heaviest-weighted thread happens to be the most intensive thread, it is not prioritized over the least intensive thread.

41 Harmonic Speedup. [Plot: axes oriented toward better fairness and better system throughput.]

42 Shuffling Algorithm Comparison. With niceness-aware shuffling, the average of maximum slowdown is lower and the variance of maximum slowdown is lower. [Table: E(Maximum Slowdown) and VAR(Maximum Slowdown) for the Round-Robin and Niceness-Aware shuffling algorithms.]

43 Sensitivity Results. [Table: System Throughput and Maximum Slowdown as a function of ShuffleInterval (cycles).] [Table: System Throughput (compared to ATLAS) and Maximum Slowdown (compared to ATLAS) as a function of Number of Cores; surviving values: % 3% 2% 1% 1% and -4% -30% -29% -30% -41%.]
