Data Criticality in Network-On-Chip Design. Joshua San Miguel Natalie Enright Jerger
2 Network-On-Chip Efficiency. Efficiency is the ability to produce results with the least amount of waste: wasted time and wasted energy. The NoC accounts for 30-70% of on-chip data access latency [Z. Li, HPCA 2009][A. Sharifi, MICRO 2012] and for 20-30% of total chip power [J. D. Owens, IEEE Micro 2007][S. R. Vangal, IEEE JSSC 2008]. 2
6 Network-On-Chip Efficiency. load request -> data response. Minimize wasted time: deliver data no later than needed. 6
10 Network-On-Chip Efficiency. load request (0x2) -> data response. Minimize wasted energy: deliver data no earlier than needed.
12 Network-On-Chip Efficiency Why store data in blocks of multiple words? Exploit spatial locality in applications Avoid large tag arrays in caches Improve row buffer utilization in DRAM Store data at a coarse granularity, but move data at a fine granularity. 12
16 Network-On-Chip Efficiency Arrive no later than needed. But expensive. Wasted money since some arrive too early. 01:00 03:00 02:00 04:00 16
23 Network-On-Chip Efficiency. Spend just enough money to arrive both no later and no earlier than needed. Deliver data both no later and no earlier than needed: design for data criticality. 23
24 Outline Defining Criticality Data Criticality Data Liveness Measuring Criticality Energy Wasted Addressing Criticality NoCNoC 24
25 Data Criticality Data Criticality is the promptness with which an application uses a data word after fetching it from memory. Critical: used immediately after being fetched. Non-critical: used some time later after being fetched. 25 Defining Criticality
26 Data Criticality (blackscholes):
for (i = 0; i < numOptions; i++) {
    ... = BlkSchlsEqEuroNoDiv(sptprice[i]);
}
Iteration i uses sptprice[i] immediately, while the remaining words of the fetched block are not used until later iterations; criticality decreases across the words of the block. 29 Defining Criticality
30 Data Criticality (fluidanimate):
for (iparneigh = 0; ...; iparneigh++) {
    if (borderneigh) {
        pthread_mutex_lock(...);
        neigh->a[iparneigh] -= ...;
        pthread_mutex_unlock(...);
    } else {
        neigh->a[iparneigh] -= ...;
    }
}
Synchronization on border cells delays accesses to later words of the block, lowering their criticality. 30 Defining Criticality
31 Data Criticality. Data criticality is an inherent consequence of spatial locality and is exhibited by most (if not all) real-world applications. Examples of non-criticality: long-running code between accesses; interference due to thread synchronization; dependences from other cache misses; preemption by the operating system. 31 Defining Criticality
32 Data Criticality vs. Instruction Criticality Instruction (or packet) criticality load miss A load miss B load miss C load miss D 32 Defining Criticality
34 Data Criticality vs. Instruction Criticality Data criticality load miss A load miss B load miss C load miss D 34 Defining Criticality
35 Data Liveness Data Liveness describes whether or not an application uses a data word at all after fetching it from memory. Live-on-arrival (live): used at least once during its cache lifetime. Dead-on-arrival (dead): never used during its cache lifetime. 35 Defining Criticality
36 Data Liveness (fluidanimate):
for (iz++) for (iy++) for (ix++) for (j++) {
    if (border(ix)) ... = cell->v[j].x;
    if (border(iy)) ... = cell->v[j].y;
    if (border(iz)) ... = cell->v[j].z;
}
cell->v layout: | x y z | x y z | x y z | x y z | x y z | 37 Defining Criticality
39 Data Liveness. Data liveness measures the degree of spatial locality in an application. Examples of dead words: unused members of structs; irregular or random access patterns; heap fragmentation; padding between data elements; early evictions due to invalidations, cache pressure or poor replacement policies. 39 Defining Criticality
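The live/dead split can be illustrated with a toy classifier. `classify_words` and its arguments are hypothetical names, assuming we know the block size in words and the set of word offsets touched during the block's cache lifetime:

```python
def classify_words(words_per_block, touched_offsets):
    """Partition a cache block's words into live-on-arrival (used at least
    once during the block's cache lifetime) and dead-on-arrival (never used)."""
    live = [w for w in range(words_per_block) if w in touched_offsets]
    dead = [w for w in range(words_per_block) if w not in touched_offsets]
    return live, dead

# fluidanimate-style pattern: only every third word of a 16-word block
# (e.g., the x fields of an array of 3-word structs) is touched.
live, dead = classify_words(16, {0, 3, 6, 9, 12, 15})
print(len(live), len(dead))  # 6 10
```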
40 Outline Defining Criticality Data Criticality Data Liveness Measuring Criticality Energy Wasted Addressing Criticality NoCNoC 40
41 Measuring Criticality. Timeline of a load miss A: fetch A[i] (fetch latency, from miss to arrival), then use A[i] (access latency, from miss to first use).
43 non-criticality = access latency / fetch latency. The ratio is 1x for critical words and >1x for non-critical words. 43 Measuring Criticality
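The non-criticality ratio can be sketched as a small helper; the function and variable names here are illustrative, not taken from the paper's simulation infrastructure:

```python
def non_criticality(fetch_latency, miss_cycle, first_use_cycle):
    """Non-criticality of a word: access latency (miss to first use)
    divided by fetch latency (miss to arrival). A ratio of 1x means the
    word was used the moment it arrived (critical); larger ratios mean
    the word could have arrived later without hurting performance."""
    access_latency = first_use_cycle - miss_cycle
    return access_latency / fetch_latency

# Word arrives 20 cycles after the miss but is first used 80 cycles after it:
print(non_criticality(20, 0, 80))  # 4.0
```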
44 Measuring Criticality. Full-system simulations: FeS2, BookSim, DSENT. GHz-class OoO cores; 64 kB private L1 per core, 16-word cache blocks; 16 MB shared distributed L2. Baseline NoC configuration: 4 x 4 mesh, 2.0 GHz, 128-bit channels, X-Y routing, 3-stage router pipeline, 6 4-flit VCs per port. Applications: PARSEC and SPLASH-2. 44 Measuring Criticality
45 Measuring Criticality. [Figure: cumulative % of accessed words vs. access latency / fetch latency, 1x-10x.] Very low criticality: blackscholes, bodytrack, fluidanimate, streamcluster, swaptions. 45
46 Low criticality: barnes, lu_cb, water_nsquared, water_spatial. 46
47 High criticality: fft, vips, volrend, x264. 47
48 Very high criticality: canneal, cholesky, radiosity, radix. 48 Measuring Criticality
49 Measuring Criticality: Energy Wasted. Estimate the energy wasted due to non-criticality by modeling an ideal NoC where, for each word, fetch latency = access latency: every word, whether of high or low criticality, arrives exactly when it is first used. 49 Measuring Criticality
52 Measuring Criticality: Energy Wasted, e.g. bodytrack. [Figure: per-subnet CDFs of access latency / fetch latency, 1x-31x, for six subnets.] Words are binned by non-criticality and each bin is assigned a subnet frequency: 1x-1.1x at 2.0 GHz, 1.1x-2x at 1.8 GHz, 2x-3.5x at 1.0 GHz, and the 3.5x-7x, 7x-18x, and beyond-18x bins at MHz-range frequencies. 54 Measuring Criticality
55 Measuring Criticality: Energy Wasted. [Figure: dynamic energy wasted due to non-criticality for barnes, blackscholes, bodytrack, canneal, cholesky, fft, fluidanimate, lu_cb, radiosity, radix, streamcluster, swaptions, vips, volrend, water_nsquared, water_spatial, x264 and geomean; y-axis 0-40%.]
58 [Figure: dynamic energy wasted due to dead words for the same benchmarks; y-axis 0-100%.] 68.8% energy wasted due to non-critical and dead words. 59 Measuring Criticality
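The headline number can be reproduced with a back-of-the-envelope calculation, assuming every word costs the same energy to move through the NoC (a simplification; real per-word energy varies with hop count). The 31/44/25 split below is illustrative, chosen only to land near the reported figure:

```python
def wasted_energy_fraction(word_statuses):
    """Fraction of NoC dynamic energy spent moving words that were either
    non-critical (delivered earlier than needed) or dead (never used),
    under a uniform per-word energy model."""
    wasted = sum(s in ("non-critical", "dead") for s in word_statuses)
    return wasted / len(word_statuses)

# Illustrative mix: 31 critical, 44 non-critical, 25 dead words.
mix = ["critical"] * 31 + ["non-critical"] * 44 + ["dead"] * 25
print(wasted_energy_fraction(mix))  # 0.69
```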
60 Outline Defining Criticality Data Criticality Data Liveness Measuring Criticality Energy Wasted Addressing Criticality NoCNoC 60
61 Addressing Criticality. A criticality-aware NoC design needs to: 1. Predict the criticality of a word prior to fetching it. 2. Separate the fetching of words based on their criticality. 3. Reduce energy consumption in fetching low-criticality words. 4. Eliminate the fetching of dead words. NoCNoC (Non-Critical NoC): a proof-of-concept, criticality-aware design. 61 Addressing Criticality
62 NoCNoC. 1. Predict the criticality of a word prior to fetching it. Binary predictor: either critical or non-critical. On a load miss (e.g., requested word 4), the load's instruction address indexes a prediction table, yielding a per-word prediction vector; the vector travels with the L1 request packet. 69 Addressing Criticality
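A minimal sketch of such a binary criticality predictor. The direct-mapped table, its size, and the train/predict interface are illustrative assumptions, not the paper's exact design:

```python
class CriticalityPredictor:
    """Per-word binary criticality predictor indexed by the address (PC) of
    the load instruction that missed. Each entry holds one predicted-critical
    bit per word of the cache block."""

    def __init__(self, entries=1024, words_per_block=16):
        self.entries = entries
        # Default to critical so mispredictions err toward performance.
        self.table = [[1] * words_per_block for _ in range(entries)]

    def predict(self, load_pc, requested_word):
        vec = list(self.table[load_pc % self.entries])
        vec[requested_word] = 1  # the demanded word is always critical
        return vec

    def train(self, load_pc, word, was_critical):
        self.table[load_pc % self.entries][word] = int(was_critical)

pred = CriticalityPredictor()
pred.train(0x400badc0, word=7, was_critical=False)
vec = pred.predict(0x400badc0, requested_word=4)
print(vec[4], vec[7])  # 1 0
```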
72 NoCNoC 2. Separate the fetching of words based on their criticality. Two physical subnetworks: critical and non-critical. critical non-critical 72 Addressing Criticality
73 NoCNoC. 3. Reduce energy consumption in fetching low-criticality words. DVFS in the non-critical subnetwork: the voltage/frequency level tracks the predicted criticality of the traffic, rising as criticality goes from very low to very high. 77 Addressing Criticality
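The DVFS policy can be sketched as a lookup from predicted non-criticality to an operating point. The frequency ladder and thresholds below are illustrative, loosely echoing the bodytrack binning shown earlier; they are not the paper's actual controller:

```python
# (threshold on access latency / fetch latency, subnetwork frequency in GHz)
DVFS_LADDER = [(1.1, 2.0), (2.0, 1.8), (3.5, 1.0), (7.0, 0.5), (18.0, 0.25)]
FLOOR_GHZ = 0.125  # beyond the last bin, run at the lowest operating point

def pick_frequency(non_criticality):
    """Lower criticality (data needed later) tolerates a slower subnetwork."""
    for threshold, ghz in DVFS_LADDER:
        if non_criticality <= threshold:
            return ghz
    return FLOOR_GHZ

print(pick_frequency(1.0), pick_frequency(5.0))  # 2.0 0.5
```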
78 NoCNoC. 4. Eliminate the fetching of dead words. Binary predictor: either live or dead, using a prediction table indexed by instruction address, as for criticality. 79 Addressing Criticality
80 NoCNoC More details and results in the paper: DVFS scheme Prediction tables Comparison to instruction criticality (Aergia [R. Das, ISCA 2010]) 80 Addressing Criticality
81 NoCNoC: Prediction Accuracy. [Figure: prediction accuracy (correct, over-prediction, under-prediction) for the criticality and liveness predictors; 80-100% range.] Outperforms a prior liveness predictor (70% accuracy) [H. Kim, NOCS 2011]. 82 Addressing Criticality
83 NoCNoC: Performance and Energy. [Figure: dynamic energy and runtime, normalized to baseline.] 83 Addressing Criticality
84 Conclusion Define Data Criticality Deliver data both no later and no earlier than needed. Measure Criticality 68.8% energy wasted due to non-critical and dead words. Address Criticality NoCNoC, a proof-of-concept, criticality-aware design. 84
85 Thank you
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationSOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS
SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power
More informationHyperthreading Technology
Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?
More informationSoftware-Controlled Multithreading Using Informing Memory Operations
Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University
More informationReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard
More informationPhastlane: A Rapid Transit Optical Routing Network
Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens
More informationA Comprehensive Analytical Performance Model of DRAM Caches
A Comprehensive Analytical Performance Model of DRAM Caches Authors: Nagendra Gulur *, Mahesh Mehendale *, and R Govindarajan + Presented by: Sreepathi Pai * Texas Instruments, + Indian Institute of Science
More informationWhy memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho
Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide
More informationFrequent Value Compression in Packet-based NoC Architectures
Frequent Value Compression in Packet-based NoC Architectures Ping Zhou, BoZhao, YuDu, YiXu, Youtao Zhang, Jun Yang, Li Zhao ECE Department CS Department University of Pittsburgh University of Pittsburgh
More informationKey Point. What are Cache lines
Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationCycle accurate transaction-driven simulation with multiple processor simulators
Cycle accurate transaction-driven simulation with multiple processor simulators Dohyung Kim 1a) and Rajesh Gupta 2 1 Engineering Center, Google Korea Ltd. 737 Yeoksam-dong, Gangnam-gu, Seoul 135 984, Korea
More informationIdentifying Ad-hoc Synchronization for Enhanced Race Detection
Identifying Ad-hoc Synchronization for Enhanced Race Detection IPD Tichy Lehrstuhl für Programmiersysteme IPDPS 20 April, 2010 Ali Jannesari / Walter F. Tichy KIT die Kooperation von Forschungszentrum
More informationWhat is the Cost of Determinism?
What is the Cost of Determinism? Abstract Cedomir Segulja We analyze the fundamental performance impact of enforcing a fixed thread communication order to achieve deterministic execution. Our analysis
More informationEnergy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques
Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering
More informationHolistic Routing Algorithm Design to Support Workload Consolidation in NoCs
Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs Sheng Ma, Student Member, IEEE, Natalie Enright Jerger, Member, IEEE, Zhiying Wang, Member, IEEE, Mingche Lai, Member, IEEE,
More informationECE/CS 757: Advanced Computer Architecture II Interconnects
ECE/CS 757: Advanced Computer Architecture II Interconnects Instructor:Mikko H Lipasti Spring 2017 University of Wisconsin-Madison Lecture notes created by Natalie Enright Jerger Lecture Outline Introduction
More informationCharacterizing Multi-threaded Applications based on Shared-Resource Contention
Characterizing Multi-threaded Applications based on Shared-Resource Contention Tanima Dey Wei Wang Jack W. Davidson Mary Lou Soffa Department of Computer Science University of Virginia Charlottesville,
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationGetting Started with
/************************************************************************* * LaCASA Laboratory * * Authors: Aleksandar Milenkovic with help of Mounika Ponugoti * * Email: milenkovic@computer.org * * Date:
More informationOur Many-core Benchmarks Do Not Use That Many Cores Paul D. Bryan Jesse G. Beu Thomas M. Conte Paolo Faraboschi Daniel Ortega
Our Many-core Benchmarks Do Not Use That Many Cores Paul D. Bryan Jesse G. Beu Thomas M. Conte Paolo Faraboschi Daniel Ortega Georgia Institute of Technology HP Labs, Exascale Computing Lab {paul.bryan,
More informationh Coherence Controllers
High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs
More informationAtomic SC for Simple In-order Processors
Atomic SC for Simple In-order Processors Dibakar Gope and Mikko H. Lipasti University of Wisconsin - Madison gope@wisc.edu, mikko@engr.wisc.edu Abstract Sequential consistency is arguably the most intuitive
More informationOutline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA
CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationLecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background
Lecture 15: PCM, Networks Today: PCM wrap-up, projects discussion, on-chip networks background 1 Hard Error Tolerance in PCM PCM cells will eventually fail; important to cause gradual capacity degradation
More informationA Using Multicore Reuse Distance to Study Coherence Directories
A Using Multicore Reuse Distance to Study Coherence Directories MINSHU ZHAO, The MathWorks, Inc. DONALD YEUNG, University of Maryland at College Park Researchers have proposed numerous techniques to improve
More informationARE WE OPTIMIZING HARDWARE FOR
ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS Juan M. Cebrián 1 1 Depart. of Computer and Information Science
More informationInstruction-Level Execution Migration Omer Khan, Mieszko Lis, and Srinivas Devadas
Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2010-019 April 17, 2010 Instruction-Level Execution Migration Omer Khan, Mieszko Lis, and Srinivas Devadas massachusetts
More informationTo Be Silent or Not: On the Impact of Evictions of Clean Data in Cache-Coherent Multicores
Noname manuscript No. (will be inserted by the editor) To Be Silent or Not: On the Impact of Evictions of Clean in Cache-Coherent Multicores Ricardo Fernández-Pascual Alberto Ros Manuel E. Acacio Received:
More informationAchieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation
Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationLecture 15: Large Cache Design III. Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks
Lecture 15: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that
More informationCritical Packet Prioritisation by Slack-Aware Re-routing in On-Chip Networks
Critical Packet Prioritisation by Slack-Aware Re-routing in On-Chip Networks Abhijit Das, Sarath Babu, John Jose, Sangeetha Jose and Maurizio Palesi Dept. of Computer Science and Engineering, Indian Institute
More informationSpatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip
1 Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip Hyungjun Kim, Boris Grot, Paul V. Gratz and Daniel A. Jiménez Department of Electrical and Computer Engineering,
More information