Data Criticality in Network-On-Chip Design. Joshua San Miguel Natalie Enright Jerger


1 Data Criticality in Network-On-Chip Design Joshua San Miguel Natalie Enright Jerger

2 Network-On-Chip Efficiency Efficiency is the ability to produce results with the least amount of waste: wasted time and wasted energy. The NoC accounts for 30-70% of on-chip data access latency [Z. Li, HPCA 2009][A. Sharifi, MICRO 2012] and for 20-30% of total chip power [J. D. Owens, IEEE Micro 2007][S. R. Vangal, IEEE JSSC 2008]. 2

6 Network-On-Chip Efficiency [figure: load request travels out; data response returns] Minimize wasted time: deliver data no later than needed. 6

10 Network-On-Chip Efficiency [figure: load request for word 0x2 travels out; data response returns] Minimize wasted energy: deliver data no earlier than needed.

12 Network-On-Chip Efficiency Why store data in blocks of multiple words? Exploit spatial locality in applications Avoid large tag arrays in caches Improve row buffer utilization in DRAM Store data at a coarse granularity, but move data at a fine granularity. 12

16 Network-On-Chip Efficiency [clock analogy: four deliveries due at 01:00, 02:00, 03:00, 04:00] Arrive no later than needed, but expensive: money is wasted since some arrive too early. 16

23 Network-On-Chip Efficiency [clock analogy: four deliveries due at 01:00, 02:00, 03:00, 04:00] Spend just enough money to arrive both no later and no earlier than needed. Deliver data both no later and no earlier than needed: design for data criticality. 23

24 Outline Defining Criticality Data Criticality Data Liveness Measuring Criticality Energy Wasted Addressing Criticality NoCNoC 24

25 Data Criticality Data Criticality is the promptness with which an application uses a data word after fetching it from memory. Critical: used immediately after being fetched. Non-critical: used some time after being fetched. 25 Defining Criticality

29 Data Criticality blackscholes: for (i++) { ... = BlkSchlsEqEuroNoDiv(sptprice[i]); } [figure: successive words of the fetched sptprice block are used later and later, so criticality decreases across the block] 29 Defining Criticality

30 Data Criticality fluidanimate for (iparneigh++) { if (borderneigh) { pthread_mutex_lock(); neigh->a[iparneigh] -=...; pthread_mutex_unlock(); } else { neigh->a[iparneigh] -=...; } } 30 Defining Criticality

31 Data Criticality Data criticality is an inherent consequence of spatial locality and is exhibited by most (if not all) real-world applications. Examples of non-criticality: Long-running code between accesses Interference due to thread synchronization Dependences from other cache misses Preemption by the operating system 31 Defining Criticality

32 Data Criticality vs. Instruction Criticality Instruction (or packet) criticality load miss A load miss B load miss C load miss D 32 Defining Criticality

34 Data Criticality vs. Instruction Criticality Data criticality load miss A load miss B load miss C load miss D 34 Defining Criticality

35 Data Liveness Data Liveness describes whether or not an application uses a data word at all after fetching it from memory. Live-on-arrival (live): used at least once during its cache lifetime. Dead-on-arrival (dead): never used during its cache lifetime. 35 Defining Criticality

37 Data Liveness fluidanimate: for (iz++) for (iy++) for (ix++) for (j++) { if (border(ix))... = cell->v[j].x; if (border(iy))... = cell->v[j].y; if (border(iz))... = cell->v[j].z; } [figure: cell->v is stored as repeating x y z triples; only the components guarded by a matching border test are ever read] 37 Defining Criticality

39 Data Liveness Data liveness measures the degree of spatial locality in an application. Examples of dead words: Unused members of structs Irregular or random access patterns Heap fragmentation Padding between data elements Early evictions due to invalidations, cache pressure or poor replacement policies 39 Defining Criticality
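The unused-struct-member case above can be made concrete. A minimal sketch with a hypothetical `particle` struct (not taken from the paper): if a traversal reads only `x`, the remaining members and the alignment padding after `flag` arrive as dead words in every fetched cache block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: only `x` is read by some traversal, yet the
 * whole struct (including the padding the compiler inserts after
 * `flag` for 8-byte alignment) rides the NoC in each cache block. */
struct particle {
    double x, y, z;  /* only x is live in this access pattern */
    char flag;       /* 1 byte used; padding follows */
};

/* Fraction of fetched bytes that are live when only `x` is accessed. */
double live_fraction_x_only(void) {
    return (double)sizeof(double) / (double)sizeof(struct particle);
}
```

On a typical 64-bit ABI the struct occupies 32 bytes, so only a quarter of every fetched copy is live; the exact numbers are ABI-dependent.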

40 Outline Defining Criticality Data Criticality Data Liveness Measuring Criticality Energy Wasted Addressing Criticality NoCNoC 40

43 Measuring Criticality [timeline: a load miss on A starts the clock; the fetch latency runs until A[i] arrives; the access latency runs until A[i] is used] non-criticality = access latency / fetch latency (1x for critical words, >1x for non-critical words) 43 Measuring Criticality
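The metric on this slide can be computed directly. A minimal sketch: the strict 1x threshold for "critical" follows the slide text, though the paper may bucket words more finely.

```c
#include <assert.h>

/* Non-criticality of one data word: the ratio of its access latency
 * (load miss until first use) to its fetch latency (load miss until
 * arrival). 1x means the word was used as soon as it arrived. */
double noncriticality(double access_latency, double fetch_latency) {
    return access_latency / fetch_latency;
}

/* Critical words have a ratio of 1x; since a word cannot be used
 * before it arrives, the ratio is never below 1x. */
int is_critical(double access_latency, double fetch_latency) {
    return noncriticality(access_latency, fetch_latency) <= 1.0;
}
```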

44 Measuring Criticality Full-system simulations: FeS2, BookSim, DSENT. OoO cores; 64 KB private L1 per core, 16-word cache blocks; 16 MB shared distributed L2. Baseline NoC configuration: 4 x 4 mesh, 2.0 GHz, 128-bit channels, X-Y routing, 3-stage router pipeline, 6 4-flit VCs per port. Applications: PARSEC and SPLASH-2. 44 Measuring Criticality

45 Measuring Criticality [figure: cumulative % of accessed words vs. access latency / fetch latency, 1x to 10x] Very low criticality: blackscholes, bodytrack, fluidanimate, streamcluster, swaptions 45 Measuring Criticality

46 Measuring Criticality [figure: same axes] Low criticality: barnes, lu_cb, water_nsquared, water_spatial 46 Measuring Criticality

47 Measuring Criticality [figure: same axes] High criticality: fft, vips, volrend, x264 47 Measuring Criticality

48 Measuring Criticality [figure: same axes] Very high criticality: canneal, cholesky, radiosity, radix 48 Measuring Criticality

49 Measuring Criticality Energy Wasted Estimate energy wasted due to non-criticality. Model an ideal NoC where, for each word, fetch latency = access latency. 49 Measuring Criticality

51 Measuring Criticality Energy Wasted Estimate energy wasted due to non-criticality. Model an ideal NoC where, for each word, fetch latency = access latency, stretching delivery further for low-criticality words than for high-criticality words. 51 Measuring Criticality

54 Measuring Criticality Energy Wasted e.g., bodytrack: [figure: accessed words split across six subnets by access latency / fetch latency, each subnet clocked to its bin: 1x-1.1x at 2.0 GHz, 1.1x-2x at 1.8 GHz, 2x-3.5x at 1.0 GHz, and the 3.5x-7x, 7x-18x and >18x bins at still lower MHz-range clocks] 54 Measuring Criticality

55 Measuring Criticality Energy Wasted [bar chart: dynamic energy wasted due to non-criticality for barnes, blackscholes, bodytrack, canneal, cholesky, fft, fluidanimate, lu_cb, radiosity, radix, streamcluster, swaptions, vips, volrend, water_nsquared, water_spatial, x264 and their geomean; y-axis 0-40%] 55 Measuring Criticality

59 Measuring Criticality Energy Wasted [bar chart: dynamic energy wasted due to dead words for the same benchmarks; y-axis 0-100%] 68.8% energy wasted due to non-critical and dead words. 59 Measuring Criticality

60 Outline Defining Criticality Data Criticality Data Liveness Measuring Criticality Energy Wasted Addressing Criticality NoCNoC 60

61 Addressing Criticality A criticality-aware NoC design needs to: 1. Predict the criticality of a word prior to fetching it. 2. Separate the fetching of words based on their criticality. 3. Reduce energy consumption in fetching low-criticality words. 4. Eliminate the fetching of dead words. NoCNoC (Non-Critical NoC) A proof-of-concept, criticality-aware design 61 Addressing Criticality

62 NoCNoC 1. Predict the criticality of a word prior to fetching it. Binary predictor: either critical or non-critical. 62 Addressing Criticality

69 NoCNoC 1. Predict the criticality of a word prior to fetching it. Binary predictor: either critical or non-critical. [figure: on a load miss (requested word 4), the instruction address indexes a prediction table, yielding a per-word prediction vector that is carried in the L1 request packet] 69 Addressing Criticality
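The predictor on this slide can be sketched in a few lines. The table size, index hash, and training policy below are assumptions for illustration, not the paper's configuration; the per-word bit vector and the always-critical requested word follow the figure.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_ENTRIES 1024   /* assumed table size */
#define WORDS_PER_BLOCK 16   /* 16-word cache blocks, as in the setup */

/* One critical/non-critical bit per word of the block. */
static uint16_t pred_table[TABLE_ENTRIES];

/* Assumed index hash on the load's instruction address. */
static unsigned index_of(uintptr_t inst_addr) {
    return (unsigned)((inst_addr >> 2) % TABLE_ENTRIES);
}

/* Build the prediction vector carried in the L1 request packet.
 * The requested word itself is always marked critical. */
uint16_t predict_vector(uintptr_t inst_addr, unsigned requested_word) {
    uint16_t vec = pred_table[index_of(inst_addr)];
    vec |= (uint16_t)(1u << requested_word);
    return vec;
}

/* Training (simplified): remember which words were used critically. */
void train(uintptr_t inst_addr, uint16_t observed_critical) {
    pred_table[index_of(inst_addr)] = observed_critical;
}
```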

72 NoCNoC 2. Separate the fetching of words based on their criticality. Two physical subnetworks: critical and non-critical. 72 Addressing Criticality

73 NoCNoC 3. Reduce energy consumption in fetching low-criticality words. DVFS in non-critical subnetwork. 73 Addressing Criticality

77 NoCNoC 3. Reduce energy consumption in fetching low-criticality words. DVFS in non-critical subnetwork: its voltage and frequency scale down as the criticality of its traffic falls from very high to very low. 77 Addressing Criticality
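As a sketch of the scaling step: a mapping from a word's non-criticality ratio to a clock for the non-critical subnetwork. The bin boundaries echo the bodytrack example on slide 54; the lowest clock value is an assumption, and the paper's actual DVFS scheme differs in detail.

```c
#include <assert.h>

/* Hypothetical clock selection for the non-critical subnetwork,
 * given a word's predicted access latency / fetch latency ratio.
 * Bins follow the bodytrack example; 0.5 GHz is an assumed floor. */
double subnet_clock_ghz(double ratio) {
    if (ratio <= 1.1) return 2.0;  /* near-critical: full speed */
    if (ratio <= 2.0) return 1.8;
    if (ratio <= 3.5) return 1.0;
    return 0.5;                    /* very non-critical: slowest clock */
}
```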

78 NoCNoC 4. Eliminate the fetching of dead words. Binary predictor: either live or dead. 78 Addressing Criticality

79 NoCNoC 4. Eliminate the fetching of dead words. Binary predictor: either live or dead. [figure: as with criticality, the instruction address indexes a prediction table of per-word live/dead bits] 79 Addressing Criticality

80 NoCNoC More details and results in the paper: DVFS scheme Prediction tables Comparison to instruction criticality (Aergia [R. Das, ISCA 2010]) 80 Addressing Criticality

82 NoCNoC Prediction Accuracy [bar chart: correct, over- and under-prediction rates for the criticality and liveness predictors; y-axis 80-100%] Outperforms prior liveness predictor (70% accuracy) [H. Kim, NOCS 2011] 82 Addressing Criticality

83 NoCNoC Performance and Energy [bar chart: dynamic energy and runtime, normalized to baseline] 83 Addressing Criticality

84 Conclusion Define Data Criticality Deliver data both no later and no earlier than needed. Measure Criticality 68.8% energy wasted due to non-critical and dead words. Address Criticality NoCNoC, a proof-of-concept, criticality-aware design. 84

85 Thank you


More information

Performance metrics for caches

Performance metrics for caches Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:

More information

Protozoa: Adaptive Granularity Cache Coherence

Protozoa: Adaptive Granularity Cache Coherence Protozoa: Adaptive Granularity Cache Coherence Hongzhou Zhao, Arrvindh Shriraman, Snehasish Kumar, and Sandhya Dwarkadas Department of Computer Science, University of Rochester School of Computing Sciences,

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes

More information

Stride- and Global History-based DRAM Page Management

Stride- and Global History-based DRAM Page Management 1 Stride- and Global History-based DRAM Page Management Mushfique Junayed Khurshid, Mohit Chainani, Alekhya Perugupalli and Rahul Srikumar University of Wisconsin-Madison Abstract To improve memory system

More information

Rethinking Last-Level Cache Management for Multicores Operating at Near-Threshold

Rethinking Last-Level Cache Management for Multicores Operating at Near-Threshold Rethinking Last-Level Cache Management for Multicores Operating at Near-Threshold Farrukh Hijaz, Omer Khan University of Connecticut Power Efficiency Performance/Watt Multicores enable efficiency Power-performance

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip

More information

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Interconnection Networks: Topology. Prof. Natalie Enright Jerger Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design

More information

Stash Directory: A Scalable Directory for Many- Core Coherence! Socrates Demetriades and Sangyeun Cho

Stash Directory: A Scalable Directory for Many- Core Coherence! Socrates Demetriades and Sangyeun Cho Stash Directory: A Scalable Directory for Many- Core Coherence! Socrates Demetriades and Sangyeun Cho 20 th Interna+onal Symposium On High Performance Computer Architecture (HPCA). Orlando, FL, February

More information

A Composite and Scalable Cache Coherence Protocol for Large Scale CMPs

A Composite and Scalable Cache Coherence Protocol for Large Scale CMPs A Composite and Scalable Cache Coherence Protocol for Large Scale CMPs Yi Xu, Yu Du, Youtao Zhang, Jun Yang Department of Electrical and Computer Engineering Department of Computer Science University of

More information

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU Thomas Moscibroda Microsoft Research Onur Mutlu CMU CPU+L1 CPU+L1 CPU+L1 CPU+L1 Multi-core Chip Cache -Bank Cache -Bank Cache -Bank Cache -Bank CPU+L1 CPU+L1 CPU+L1 CPU+L1 Accelerator, etc Cache -Bank

More information

NVthreads: Practical Persistence for Multi-threaded Applications

NVthreads: Practical Persistence for Multi-threaded Applications NVthreads: Practical Persistence for Multi-threaded Applications Terry Hsu*, Purdue University Helge Brügner*, TU München Indrajit Roy*, Google Inc. Kimberly Keeton, Hewlett Packard Labs Patrick Eugster,

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard

More information

Phastlane: A Rapid Transit Optical Routing Network

Phastlane: A Rapid Transit Optical Routing Network Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens

More information

A Comprehensive Analytical Performance Model of DRAM Caches

A Comprehensive Analytical Performance Model of DRAM Caches A Comprehensive Analytical Performance Model of DRAM Caches Authors: Nagendra Gulur *, Mahesh Mehendale *, and R Govindarajan + Presented by: Sreepathi Pai * Texas Instruments, + Indian Institute of Science

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

Frequent Value Compression in Packet-based NoC Architectures

Frequent Value Compression in Packet-based NoC Architectures Frequent Value Compression in Packet-based NoC Architectures Ping Zhou, BoZhao, YuDu, YiXu, Youtao Zhang, Jun Yang, Li Zhao ECE Department CS Department University of Pittsburgh University of Pittsburgh

More information

Key Point. What are Cache lines

Key Point. What are Cache lines Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

Cycle accurate transaction-driven simulation with multiple processor simulators

Cycle accurate transaction-driven simulation with multiple processor simulators Cycle accurate transaction-driven simulation with multiple processor simulators Dohyung Kim 1a) and Rajesh Gupta 2 1 Engineering Center, Google Korea Ltd. 737 Yeoksam-dong, Gangnam-gu, Seoul 135 984, Korea

More information

Identifying Ad-hoc Synchronization for Enhanced Race Detection

Identifying Ad-hoc Synchronization for Enhanced Race Detection Identifying Ad-hoc Synchronization for Enhanced Race Detection IPD Tichy Lehrstuhl für Programmiersysteme IPDPS 20 April, 2010 Ali Jannesari / Walter F. Tichy KIT die Kooperation von Forschungszentrum

More information

What is the Cost of Determinism?

What is the Cost of Determinism? What is the Cost of Determinism? Abstract Cedomir Segulja We analyze the fundamental performance impact of enforcing a fixed thread communication order to achieve deterministic execution. Our analysis

More information

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering

More information

Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs

Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs Sheng Ma, Student Member, IEEE, Natalie Enright Jerger, Member, IEEE, Zhiying Wang, Member, IEEE, Mingche Lai, Member, IEEE,

More information

ECE/CS 757: Advanced Computer Architecture II Interconnects

ECE/CS 757: Advanced Computer Architecture II Interconnects ECE/CS 757: Advanced Computer Architecture II Interconnects Instructor:Mikko H Lipasti Spring 2017 University of Wisconsin-Madison Lecture notes created by Natalie Enright Jerger Lecture Outline Introduction

More information

Characterizing Multi-threaded Applications based on Shared-Resource Contention

Characterizing Multi-threaded Applications based on Shared-Resource Contention Characterizing Multi-threaded Applications based on Shared-Resource Contention Tanima Dey Wei Wang Jack W. Davidson Mary Lou Soffa Department of Computer Science University of Virginia Charlottesville,

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

Getting Started with

Getting Started with /************************************************************************* * LaCASA Laboratory * * Authors: Aleksandar Milenkovic with help of Mounika Ponugoti * * Email: milenkovic@computer.org * * Date:

More information

Our Many-core Benchmarks Do Not Use That Many Cores Paul D. Bryan Jesse G. Beu Thomas M. Conte Paolo Faraboschi Daniel Ortega

Our Many-core Benchmarks Do Not Use That Many Cores Paul D. Bryan Jesse G. Beu Thomas M. Conte Paolo Faraboschi Daniel Ortega Our Many-core Benchmarks Do Not Use That Many Cores Paul D. Bryan Jesse G. Beu Thomas M. Conte Paolo Faraboschi Daniel Ortega Georgia Institute of Technology HP Labs, Exascale Computing Lab {paul.bryan,

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

Atomic SC for Simple In-order Processors

Atomic SC for Simple In-order Processors Atomic SC for Simple In-order Processors Dibakar Gope and Mikko H. Lipasti University of Wisconsin - Madison gope@wisc.edu, mikko@engr.wisc.edu Abstract Sequential consistency is arguably the most intuitive

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background Lecture 15: PCM, Networks Today: PCM wrap-up, projects discussion, on-chip networks background 1 Hard Error Tolerance in PCM PCM cells will eventually fail; important to cause gradual capacity degradation

More information

A Using Multicore Reuse Distance to Study Coherence Directories

A Using Multicore Reuse Distance to Study Coherence Directories A Using Multicore Reuse Distance to Study Coherence Directories MINSHU ZHAO, The MathWorks, Inc. DONALD YEUNG, University of Maryland at College Park Researchers have proposed numerous techniques to improve

More information

ARE WE OPTIMIZING HARDWARE FOR

ARE WE OPTIMIZING HARDWARE FOR ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS Juan M. Cebrián 1 1 Depart. of Computer and Information Science

More information

Instruction-Level Execution Migration Omer Khan, Mieszko Lis, and Srinivas Devadas

Instruction-Level Execution Migration Omer Khan, Mieszko Lis, and Srinivas Devadas Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2010-019 April 17, 2010 Instruction-Level Execution Migration Omer Khan, Mieszko Lis, and Srinivas Devadas massachusetts

More information

To Be Silent or Not: On the Impact of Evictions of Clean Data in Cache-Coherent Multicores

To Be Silent or Not: On the Impact of Evictions of Clean Data in Cache-Coherent Multicores Noname manuscript No. (will be inserted by the editor) To Be Silent or Not: On the Impact of Evictions of Clean in Cache-Coherent Multicores Ricardo Fernández-Pascual Alberto Ros Manuel E. Acacio Received:

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Lecture 15: Large Cache Design III. Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks

Lecture 15: Large Cache Design III. Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks Lecture 15: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that

More information

Critical Packet Prioritisation by Slack-Aware Re-routing in On-Chip Networks

Critical Packet Prioritisation by Slack-Aware Re-routing in On-Chip Networks Critical Packet Prioritisation by Slack-Aware Re-routing in On-Chip Networks Abhijit Das, Sarath Babu, John Jose, Sangeetha Jose and Maurizio Palesi Dept. of Computer Science and Engineering, Indian Institute

More information

Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip

Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip 1 Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip Hyungjun Kim, Boris Grot, Paul V. Gratz and Daniel A. Jiménez Department of Electrical and Computer Engineering,

More information