Data Criticality in Network-On-Chip Design. Joshua San Miguel Natalie Enright Jerger
2 Network-On-Chip Efficiency. Efficiency is the ability to produce results with the least amount of waste: wasted time and wasted energy. The NoC accounts for 30-70% of on-chip data access latency [Z. Li, HPCA 2009][A. Sharifi, MICRO 2012] and for 20-30% of total chip power [J. D. Owens, IEEE Micro 2007][S. R. Vangal, IEEE JSSC 2008]. 2
6 Network-On-Chip Efficiency. load request -> data response. Minimize wasted time: deliver data no later than needed. 6
10 Network-On-Chip Efficiency. load request (0x2) -> data response. Minimize wasted energy: deliver data no earlier than needed.
12 Network-On-Chip Efficiency Why store data in blocks of multiple words? Exploit spatial locality in applications Avoid large tag arrays in caches Improve row buffer utilization in DRAM Store data at a coarse granularity, but move data at a fine granularity. 12
16 Network-On-Chip Efficiency Arrive no later than needed. But expensive. Wasted money since some arrive too early. 01:00 03:00 02:00 04:00 16
23 Network-On-Chip Efficiency. Spend just enough money to arrive both no later and no earlier than needed. Deliver data both no later and no earlier than needed: design for data criticality. 23
24 Outline Defining Criticality Data Criticality Data Liveness Measuring Criticality Energy Wasted Addressing Criticality NoCNoC 24
25 Data Criticality Data Criticality is the promptness with which an application uses a data word after fetching it from memory. Critical: used immediately after being fetched. Non-critical: used some time later after being fetched. 25 Defining Criticality
26 Data Criticality (blackscholes):
for (i = 0; i < numOptions; i++) {
    ... = BlkSchlsEqEuroNoDiv(sptprice[i]);
}
Iteration i uses sptprice[i] immediately, while the remaining words of the fetched block are not used until later iterations; criticality decreases across the words of the block. 29 Defining Criticality
30 Data Criticality (fluidanimate):
for (iparneigh = 0; ...; iparneigh++) {
    if (borderneigh) {
        pthread_mutex_lock(...);
        neigh->a[iparneigh] -= ...;
        pthread_mutex_unlock(...);
    } else {
        neigh->a[iparneigh] -= ...;
    }
}
Synchronization on border cells delays accesses to later words of the block, lowering their criticality. 30 Defining Criticality
31 Data Criticality. Data criticality is an inherent consequence of spatial locality and is exhibited by most (if not all) real-world applications. Examples of non-criticality: long-running code between accesses; interference due to thread synchronization; dependences from other cache misses; preemption by the operating system. 31 Defining Criticality
32 Data Criticality vs. Instruction Criticality Instruction (or packet) criticality load miss A load miss B load miss C load miss D 32 Defining Criticality
34 Data Criticality vs. Instruction Criticality Data criticality load miss A load miss B load miss C load miss D 34 Defining Criticality
35 Data Liveness Data Liveness describes whether or not an application uses a data word at all after fetching it from memory. Live-on-arrival (live): used at least once during its cache lifetime. Dead-on-arrival (dead): never used during its cache lifetime. 35 Defining Criticality
36 Data Liveness (fluidanimate):
for (iz++) for (iy++) for (ix++) for (j++) {
    if (border(ix)) ... = cell->v[j].x;
    if (border(iy)) ... = cell->v[j].y;
    if (border(iz)) ... = cell->v[j].z;
}
cell->v layout: | x y z | x y z | x y z | x y z | x y z | 37 Defining Criticality
39 Data Liveness. Data liveness measures the degree of spatial locality in an application. Examples of dead words: unused members of structs; irregular or random access patterns; heap fragmentation; padding between data elements; early evictions due to invalidations, cache pressure or poor replacement policies. 39 Defining Criticality
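The live/dead split can be illustrated with a toy classifier. `classify_words` and its arguments are hypothetical names, assuming we know the block size in words and the set of word offsets touched during the block's cache lifetime:

```python
def classify_words(words_per_block, touched_offsets):
    """Partition a cache block's words into live-on-arrival (used at least
    once during the block's cache lifetime) and dead-on-arrival (never used)."""
    live = [w for w in range(words_per_block) if w in touched_offsets]
    dead = [w for w in range(words_per_block) if w not in touched_offsets]
    return live, dead

# fluidanimate-style pattern: only every third word of a 16-word block
# (e.g., the x fields of an array of 3-word structs) is touched.
live, dead = classify_words(16, {0, 3, 6, 9, 12, 15})
print(len(live), len(dead))  # 6 10
```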
40 Outline Defining Criticality Data Criticality Data Liveness Measuring Criticality Energy Wasted Addressing Criticality NoCNoC 40
41 Measuring Criticality. Timeline of a load miss A: fetch A[i] (fetch latency, from miss to arrival), then use A[i] (access latency, from miss to first use).
43 non-criticality = access latency / fetch latency. The ratio is 1x for critical words and >1x for non-critical words. 43 Measuring Criticality
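The non-criticality ratio can be sketched as a small helper; the function and variable names here are illustrative, not taken from the paper's simulation infrastructure:

```python
def non_criticality(fetch_latency, miss_cycle, first_use_cycle):
    """Non-criticality of a word: access latency (miss to first use)
    divided by fetch latency (miss to arrival). A ratio of 1x means the
    word was used the moment it arrived (critical); larger ratios mean
    the word could have arrived later without hurting performance."""
    access_latency = first_use_cycle - miss_cycle
    return access_latency / fetch_latency

# Word arrives 20 cycles after the miss but is first used 80 cycles after it:
print(non_criticality(20, 0, 80))  # 4.0
```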
44 Measuring Criticality. Full-system simulations: FeS2, BookSim, DSENT. GHz-class OoO cores; 64 kB private L1 per core, 16-word cache blocks; 16 MB shared distributed L2. Baseline NoC configuration: 4 x 4 mesh, 2.0 GHz, 128-bit channels, X-Y routing, 3-stage router pipeline, 6 4-flit VCs per port. Applications: PARSEC and SPLASH-2. 44 Measuring Criticality
45 Measuring Criticality. [Figure: cumulative % of accessed words vs. access latency / fetch latency, 1x-10x.] Very low criticality: blackscholes, bodytrack, fluidanimate, streamcluster, swaptions. 45
46 Low criticality: barnes, lu_cb, water_nsquared, water_spatial. 46
47 High criticality: fft, vips, volrend, x264. 47
48 Very high criticality: canneal, cholesky, radiosity, radix. 48 Measuring Criticality
49 Measuring Criticality: Energy Wasted. Estimate the energy wasted due to non-criticality by modeling an ideal NoC where, for each word, fetch latency = access latency: every word, whether of high or low criticality, arrives exactly when it is first used. 49 Measuring Criticality
52 Measuring Criticality: Energy Wasted, e.g. bodytrack. [Figure: per-subnet CDFs of access latency / fetch latency, 1x-31x, for six subnets.] Words are binned by non-criticality and each bin is assigned a subnet frequency: 1x-1.1x at 2.0 GHz, 1.1x-2x at 1.8 GHz, 2x-3.5x at 1.0 GHz, and the 3.5x-7x, 7x-18x, and beyond-18x bins at MHz-range frequencies. 54 Measuring Criticality
55 Measuring Criticality: Energy Wasted. [Figure: dynamic energy wasted due to non-criticality for barnes, blackscholes, bodytrack, canneal, cholesky, fft, fluidanimate, lu_cb, radiosity, radix, streamcluster, swaptions, vips, volrend, water_nsquared, water_spatial, x264 and geomean; y-axis 0-40%.]
58 [Figure: dynamic energy wasted due to dead words for the same benchmarks; y-axis 0-100%.] 68.8% energy wasted due to non-critical and dead words. 59 Measuring Criticality
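The headline number can be reproduced with a back-of-the-envelope calculation, assuming every word costs the same energy to move through the NoC (a simplification; real per-word energy varies with hop count). The 31/44/25 split below is illustrative, chosen only to land near the reported figure:

```python
def wasted_energy_fraction(word_statuses):
    """Fraction of NoC dynamic energy spent moving words that were either
    non-critical (delivered earlier than needed) or dead (never used),
    under a uniform per-word energy model."""
    wasted = sum(s in ("non-critical", "dead") for s in word_statuses)
    return wasted / len(word_statuses)

# Illustrative mix: 31 critical, 44 non-critical, 25 dead words.
mix = ["critical"] * 31 + ["non-critical"] * 44 + ["dead"] * 25
print(wasted_energy_fraction(mix))  # 0.69
```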
60 Outline Defining Criticality Data Criticality Data Liveness Measuring Criticality Energy Wasted Addressing Criticality NoCNoC 60
61 Addressing Criticality. A criticality-aware NoC design needs to: 1. Predict the criticality of a word prior to fetching it. 2. Separate the fetching of words based on their criticality. 3. Reduce energy consumption in fetching low-criticality words. 4. Eliminate the fetching of dead words. NoCNoC (Non-Critical NoC): a proof-of-concept, criticality-aware design. 61 Addressing Criticality
62 NoCNoC. 1. Predict the criticality of a word prior to fetching it. Binary predictor: either critical or non-critical. On a load miss (e.g., requested word 4), the load's instruction address indexes a prediction table, yielding a per-word prediction vector; the vector travels with the L1 request packet. 69 Addressing Criticality
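A minimal sketch of such a binary criticality predictor. The direct-mapped table, its size, and the train/predict interface are illustrative assumptions, not the paper's exact design:

```python
class CriticalityPredictor:
    """Per-word binary criticality predictor indexed by the address (PC) of
    the load instruction that missed. Each entry holds one predicted-critical
    bit per word of the cache block."""

    def __init__(self, entries=1024, words_per_block=16):
        self.entries = entries
        # Default to critical so mispredictions err toward performance.
        self.table = [[1] * words_per_block for _ in range(entries)]

    def predict(self, load_pc, requested_word):
        vec = list(self.table[load_pc % self.entries])
        vec[requested_word] = 1  # the demanded word is always critical
        return vec

    def train(self, load_pc, word, was_critical):
        self.table[load_pc % self.entries][word] = int(was_critical)

pred = CriticalityPredictor()
pred.train(0x400badc0, word=7, was_critical=False)
vec = pred.predict(0x400badc0, requested_word=4)
print(vec[4], vec[7])  # 1 0
```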
72 NoCNoC 2. Separate the fetching of words based on their criticality. Two physical subnetworks: critical and non-critical. critical non-critical 72 Addressing Criticality
73 NoCNoC. 3. Reduce energy consumption in fetching low-criticality words. DVFS in the non-critical subnetwork: the voltage/frequency level tracks the predicted criticality of the traffic, rising as criticality goes from very low to very high. 77 Addressing Criticality
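The DVFS policy can be sketched as a lookup from predicted non-criticality to an operating point. The frequency ladder and thresholds below are illustrative, loosely echoing the bodytrack binning shown earlier; they are not the paper's actual controller:

```python
# (threshold on access latency / fetch latency, subnetwork frequency in GHz)
DVFS_LADDER = [(1.1, 2.0), (2.0, 1.8), (3.5, 1.0), (7.0, 0.5), (18.0, 0.25)]
FLOOR_GHZ = 0.125  # beyond the last bin, run at the lowest operating point

def pick_frequency(non_criticality):
    """Lower criticality (data needed later) tolerates a slower subnetwork."""
    for threshold, ghz in DVFS_LADDER:
        if non_criticality <= threshold:
            return ghz
    return FLOOR_GHZ

print(pick_frequency(1.0), pick_frequency(5.0))  # 2.0 0.5
```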
78 NoCNoC. 4. Eliminate the fetching of dead words. Binary predictor: either live or dead, using a prediction table indexed by instruction address, as for criticality. 79 Addressing Criticality
80 NoCNoC More details and results in the paper: DVFS scheme Prediction tables Comparison to instruction criticality (Aergia [R. Das, ISCA 2010]) 80 Addressing Criticality
81 NoCNoC: Prediction Accuracy. [Figure: prediction accuracy (correct, over-prediction, under-prediction) for the criticality and liveness predictors; 80-100% range.] Outperforms a prior liveness predictor (70% accuracy) [H. Kim, NOCS 2011]. 82 Addressing Criticality
83 NoCNoC: Performance and Energy. [Figure: dynamic energy and runtime, normalized to baseline.] 83 Addressing Criticality
84 Conclusion Define Data Criticality Deliver data both no later and no earlier than needed. Measure Criticality 68.8% energy wasted due to non-critical and dead words. Address Criticality NoCNoC, a proof-of-concept, criticality-aware design. 84
85 Thank you
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationSOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS
SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power
More informationHyperthreading Technology
Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?
More informationSoftware-Controlled Multithreading Using Informing Memory Operations
Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University
More informationReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard
More informationPhastlane: A Rapid Transit Optical Routing Network
Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens
More informationA Comprehensive Analytical Performance Model of DRAM Caches
A Comprehensive Analytical Performance Model of DRAM Caches Authors: Nagendra Gulur *, Mahesh Mehendale *, and R Govindarajan + Presented by: Sreepathi Pai * Texas Instruments, + Indian Institute of Science
More informationWhy memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho
Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide
More informationFrequent Value Compression in Packet-based NoC Architectures
Frequent Value Compression in Packet-based NoC Architectures Ping Zhou, BoZhao, YuDu, YiXu, Youtao Zhang, Jun Yang, Li Zhao ECE Department CS Department University of Pittsburgh University of Pittsburgh
More informationKey Point. What are Cache lines
Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationCycle accurate transaction-driven simulation with multiple processor simulators
Cycle accurate transaction-driven simulation with multiple processor simulators Dohyung Kim 1a) and Rajesh Gupta 2 1 Engineering Center, Google Korea Ltd. 737 Yeoksam-dong, Gangnam-gu, Seoul 135 984, Korea
More informationIdentifying Ad-hoc Synchronization for Enhanced Race Detection
Identifying Ad-hoc Synchronization for Enhanced Race Detection IPD Tichy Lehrstuhl für Programmiersysteme IPDPS 20 April, 2010 Ali Jannesari / Walter F. Tichy KIT die Kooperation von Forschungszentrum
More informationWhat is the Cost of Determinism?
What is the Cost of Determinism? Abstract Cedomir Segulja We analyze the fundamental performance impact of enforcing a fixed thread communication order to achieve deterministic execution. Our analysis
More informationEnergy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques
Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering
More informationHolistic Routing Algorithm Design to Support Workload Consolidation in NoCs
Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs Sheng Ma, Student Member, IEEE, Natalie Enright Jerger, Member, IEEE, Zhiying Wang, Member, IEEE, Mingche Lai, Member, IEEE,
More informationECE/CS 757: Advanced Computer Architecture II Interconnects
ECE/CS 757: Advanced Computer Architecture II Interconnects Instructor:Mikko H Lipasti Spring 2017 University of Wisconsin-Madison Lecture notes created by Natalie Enright Jerger Lecture Outline Introduction
More informationCharacterizing Multi-threaded Applications based on Shared-Resource Contention
Characterizing Multi-threaded Applications based on Shared-Resource Contention Tanima Dey Wei Wang Jack W. Davidson Mary Lou Soffa Department of Computer Science University of Virginia Charlottesville,
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationGetting Started with
/************************************************************************* * LaCASA Laboratory * * Authors: Aleksandar Milenkovic with help of Mounika Ponugoti * * Email: milenkovic@computer.org * * Date:
More informationOur Many-core Benchmarks Do Not Use That Many Cores Paul D. Bryan Jesse G. Beu Thomas M. Conte Paolo Faraboschi Daniel Ortega
Our Many-core Benchmarks Do Not Use That Many Cores Paul D. Bryan Jesse G. Beu Thomas M. Conte Paolo Faraboschi Daniel Ortega Georgia Institute of Technology HP Labs, Exascale Computing Lab {paul.bryan,
More informationh Coherence Controllers
High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs
More informationAtomic SC for Simple In-order Processors
Atomic SC for Simple In-order Processors Dibakar Gope and Mikko H. Lipasti University of Wisconsin - Madison gope@wisc.edu, mikko@engr.wisc.edu Abstract Sequential consistency is arguably the most intuitive
More informationOutline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA
CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationLecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background
Lecture 15: PCM, Networks Today: PCM wrap-up, projects discussion, on-chip networks background 1 Hard Error Tolerance in PCM PCM cells will eventually fail; important to cause gradual capacity degradation
More informationA Using Multicore Reuse Distance to Study Coherence Directories
A Using Multicore Reuse Distance to Study Coherence Directories MINSHU ZHAO, The MathWorks, Inc. DONALD YEUNG, University of Maryland at College Park Researchers have proposed numerous techniques to improve
More informationARE WE OPTIMIZING HARDWARE FOR
ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS Juan M. Cebrián 1 1 Depart. of Computer and Information Science
More informationInstruction-Level Execution Migration Omer Khan, Mieszko Lis, and Srinivas Devadas
Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2010-019 April 17, 2010 Instruction-Level Execution Migration Omer Khan, Mieszko Lis, and Srinivas Devadas massachusetts
More informationTo Be Silent or Not: On the Impact of Evictions of Clean Data in Cache-Coherent Multicores
Noname manuscript No. (will be inserted by the editor) To Be Silent or Not: On the Impact of Evictions of Clean in Cache-Coherent Multicores Ricardo Fernández-Pascual Alberto Ros Manuel E. Acacio Received:
More informationAchieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation
Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationLecture 15: Large Cache Design III. Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks
Lecture 15: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that
More informationCritical Packet Prioritisation by Slack-Aware Re-routing in On-Chip Networks
Critical Packet Prioritisation by Slack-Aware Re-routing in On-Chip Networks Abhijit Das, Sarath Babu, John Jose, Sangeetha Jose and Maurizio Palesi Dept. of Computer Science and Engineering, Indian Institute
More informationSpatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip
1 Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip Hyungjun Kim, Boris Grot, Paul V. Gratz and Daniel A. Jiménez Department of Electrical and Computer Engineering,
More information