Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks

Size: px
Start display at page:

Download "Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks"

Transcription

1 Snatch: Opportunistically Reassigning Power Allocation between and in 3D Stacks Dimitrios Skarlatos, Renji Thomas, Aditya Agrawal, Shibin Qin, Robert Pilawa, Ulya Karpuzcu, Radu Teodorescu, Nam Sung Kim, and Josep Torrellas UIUC, OSU, UMN, NVIDIA 1

2 Motivation: Cost of Power/Ground Pins in 3D stacks DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 2

3 Motivation: Cost of Power/Ground Pins in 3D stacks Size & cost of packages is proportional to # of pins DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 3

4 Motivation: Cost of Power/Ground Pins in 3D stacks Size & cost of packages is proportional to # of pins 3D Stacks: Disjoint Power/Ground pins for and DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 4

5 Motivation: Cost of Power/Ground Pins in 3D stacks Size & cost of packages is proportional to # of pins 3D Stacks: Disjoint Power/Ground pins for and Each dimensioned for the worst case DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 5

6 Motivation: Underutilization of Power Budget High or Power phases 6

7 Contribution: Snatch Dynamically and opportunistically divert power between processor and memory Mem1 die Conventional Mem Mem0 die TSVs die Proc 7

8 Contribution: Snatch Dynamically and opportunistically divert power between processor and memory On-chip voltage regulator connects the two Power Delivery Networks or can consume more power for the same # of pins Mem1 die Snatch Mem Mem0 die TSVs die Proc On-chip 8

9 Impact Compared to Conventional 3D Stacks For same # of power/ground pins: Application can consume more power Up to 23% application speedup For the same maximum power in and Fewer pins, about 30% package cost reduction 9

10 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 10

11 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 11

12 Conventional Implementation V PCB substrate DRAM1 die DRAM0 die die TSVs 1.1V microbumps C4 bumps BGA pins 12V 5.5W 4.5W 12V Cross-section 12

13 Snatch implementation V PCB substrate DRAM1 die DRAM0 die die TSVs 1.1V microbumps C4 bumps BGA pins 12V on-chip Single On-Chip 5.5W 4.5W 12V Cross-section 13

14 Snatch implementation Small 2W on-chip bidirectional on Proc die Bulk of work from off-chip s V PCB substrate DRAM1 die DRAM0 die die TSVs 1.1V microbumps C4 bumps BGA pins 12V on-chip Single On-Chip 5.5W 4.5W 12V Cross-section 14

15 Snatch: Dynamic power reassignment Up/Down convert power Snatched DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins Single on-chip Single On-Chip 15

16 Snatch: Dynamic power reassignment Up/Down convert power Snatched 2W 1.1V V DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins Single on-chip Single On-Chip 16

17 Snatch: Cross-Section Small 2W on-chip bidirectional on Proc die DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins Single on-chip Single On-Chip Cross-section 17

18 Snatch: Top Down Small 2W on-chip bidirectional on Proc die DRAM1 die DRAM0 die die TSVs microbumps 5.5W V On-chip PCB substrate C4 bumps BGA pins Single on-chip Single On-Chip 4.5W 1.1V Cross-section Top Down 18

19 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 19

20 Snatching Power On processor intensive phase 5.5W 5.5W On-chip 4.5W 4.5W 20

21 Snatching Power On processor intensive phase Snatch Power TurboBoost 5.5W 7.5W 2W On-chip 4.5W 2.5W 21

22 Snatching Power On memory intensive phase Snatch Power TurboBoost 5.5W 3.5W 2W On-chip 4.5W 6.5W 22

23 Snatching Decisions or Intensive Phase? 5.5W 5.5W?W On-chip 4.5W 4.5W 23

24 Snatching Decisions or Intensive Phase? How much Power is available? 5.5W 5.5W?W On-chip 4.5W 4.5W 24

25 Snatching Decisions or Intensive Phase? How much Power is available? How much Power can we Snatch? 5.5W 5.5W?W On-chip 4.5W 4.5W 25

26 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs 5.5W?W On-chip 4.5W 26

27 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs Average for activity detection 5.5W?W On-chip 4.5W 27

28 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs Average for activity detection 5.5W MAX for power availability?w On-chip 4.5W 28

29 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs Average for activity detection MAX for power availability 5.5W Avoid hysteresis?w On-chip 4.5W 29

30 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 30

31 Conventional Power Provisioning provisioned for 7.5W On-chip 31

32 Conventional Power Provisioning provisioned for 7.5W 7.5W 7.5W On-chip 32

33 Conventional Power Provisioning provisioned for 7.5W provisioned for 6.5W 7.5W 7.5W On-chip 6.5W 6.5W 33

34 Conventional Power Provisioning provisioned for 7.5W provisioned for 6.5W Total = + = 14W 7.5W 7.5W On-chip 6.5W 6.5W 34

35 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W provisioned for 6.5W Total = + = 14W 7.5W 7.5W On-chip 6.5W 6.5W 35

36 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W provisioned for 6.5W Total = + = 14W 7.5W 7.5W Snatch 2W On-chip 6.5W 6.5W 36

37 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W Total = + = 14W 5.5W 7.5W Snatch 2W 5.5W +-2W On-chip 6.5W 6.5W 37

38 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W - 2W = 4.5W Total = + = 14W 5.5W 5.5W +-2W 7.5W Snatch 2W 6.5W 4.5W On-chip 4.5W +-2W38

39 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W - 2W = 4.5W Total = + = 14W - 4W = 10W Reduce Total Provisioning from 14W to 10W, approx same performance 5.5W 5.5W +-2W 7.5W Snatch 2W On-chip 6.5W 4.5W 4.5W +-2W39

40 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W - 2W = 4.5W Total = + = 14W - 4W = 10W Reduce Total Provisioning from 14W to 10W, approx same performance 5.5W 5.5W +-2W 7.5W Snatch 2W On-chip 30% Reduction in Package Power/Ground Pins 6.5W 4.5W 4.5W +-2W40

41 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 41

42 Conventional Power Provisioning & provisioned for 5.5W & 4.5W On-chip 42

43 Conventional Power Provisioning & provisioned for 5.5W & 4.5W 5.5W 5.5W On-chip 4.5W 4.5W 43

44 Snatch: Provide Additional Power & provisioned for 5.5W & 4.5W Snatch power 5.5W 5.5W Snatch 2W On-chip 4.5W 4.5W 44

45 Snatch: Opportunistically boost performance & provisioned for 5.5W & 4.5W Snatch power and boost performance 5.5W 5.5W +-2W Snatch 2W On-chip 4.5W 4.5W +-2W45

46 Snatch: Boost Performance with Same # of Pins & provisioned for 5.5W & 4.5W Snatch power and boost performance Same # of pins as conventional 5.5W 5.5W +-2W Snatch 2W On-chip 4.5W 4.5W +-2W46

47 Snatch: Boost Performance with Same # of Pins & provisioned for 5.5W & 4.5W Snatch power and boost performance Same # of pins as conventional Higher performance for the same package cost 5.5W Snatch 2W 5.5W +-2W On-chip 4.5W 4.5W +-2W47

48 Snatch: Boost Performance with Same # of Pins & provisioned for 5.5W & 4.5W Snatch power and boost performance Same # of pins as conventional Higher performance for the same package cost IR-drop and EM characteristics remain the same 5.5W Snatch 2W 4.5W 5.5W +-2W On-chip 4.5W +-2W48

49 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 49

50 Evaluation Methodology Case 2: Same # of pins, improved performance : 22nm LP 8-core w/ SESC + McPAT : 4GB 2-layer WideIO2 w/ DRAMSim2 Benchmarks: SPLASH-2, NAS, and SPEC 50

51 Performance 1.3 Baseline Turbo Boost Snatch Speedup Average(Splash + NAS) Average(SPEC) 51

52 Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup Average(Splash + NAS) Average(SPEC) 52

53 Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Average(Splash + NAS) Average(SPEC) 53

54 Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup Average(Splash + NAS) Average(SPEC) Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Snatch P = {5.5W, GHz} M= {4.5W, MHz} DVFS and Snatch up to 2W 54

55 Performance Speedup Baseline Turbo Boost Snatch 25% 8% Average(Splash + NAS) 10% Average(SPEC) Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Snatch boosts performance on average, by 25% Snatch against Baseline and 8% against Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS and Snatch up to 2W for Splash and NAS benchmarks 55

56 Performance Speedup Baseline Turbo Boost Snatch 25% 8% Average(Splash + NAS) 10% Average(SPEC) Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Snatch boosts performance on average, by 10% Snatch against Baseline for SPEC benchmarks P = {5.5W, GHz} M= {4.5W, MHz} DVFS and Snatch up to 2W Negligible gains against Turbo Boost 56

57 Snatching Activity Overview 90 M->P P->M % of Total Time Snatching Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 57

58 Snatching Activity Overview 90 M P P M % of Total Time Snatching Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 58

59 Snatching Activity Overview 90 M P P M % of Total Time Snatching Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 59

60 Snatching Activity Overview 90 M P P M % of Total Time Snatching Application Snatch on average, 30% for Splash+NAS 9.4% for SPEC 0 Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 60

61 More On the Paper Design and Implementation: On-chip Voltage Regulator Snatch Algorithm Additional Evaluation: Snatch Algorithm Power Delivery Network Pin Reliability 3D Stack Thermals 61

62 Summary Snatch: An opportunistic power reassignment design for 3D Stacked architectures Small on-chip bidirectional - phase detection and power availability estimation Up to 23% application speedup Alternatively, about 30% package cost reduction 62

63 Image Source : 63

Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks

Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks Aditya Agrawal, Josep Torrellas and Sachin Idgunji University of Illinois at Urbana Champaign and Nvidia Corporation http://iacoma.cs.uiuc.edu

More information

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection

More information

Flexible Cache Error Protection using an ECC FIFO

Flexible Cache Error Protection using an ECC FIFO Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead

More information

Exploring Hardware Overprovisioning in Power-Constrained, High Performance Computing

Exploring Hardware Overprovisioning in Power-Constrained, High Performance Computing Exploring Hardware Overprovisioning in Power-Constrained, High Performance Computing Tapasya Patki 1 David Lowenthal 1 Barry Rountree 2 Martin Schulz 2 Bronis de Supinski 2 1 The University of Arizona

More information

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard

More information

3D Memory Architecture. Kyushu University

3D Memory Architecture. Kyushu University 3D Memory Architecture Koji Inoue Kyushu University 1 Outline Why 3D? Will 3D always work well? Support Adaptive Execution! Memory Hierarchy Run time Optimization Conclusions 2 Outline Why 3D? Will 3D

More information

PageForge: A Near-Memory Content- Aware Page-Merging Architecture

PageForge: A Near-Memory Content- Aware Page-Merging Architecture PageForge: A Near-Memory Content- Aware Page-Merging Architecture Dimitrios Skarlatos, Nam Sung Kim, and Josep Torrellas University of Illinois at Urbana-Champaign MICRO-50 @ Boston Motivation: Server

More information

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami 3D Implemented dsram/dram HbidC Hybrid Cache Architecture t for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems 1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Physical Design Implementation for 3D IC Methodology and Tools. Dave Noice Vassilios Gerousis

Physical Design Implementation for 3D IC Methodology and Tools. Dave Noice Vassilios Gerousis I NVENTIVE Physical Design Implementation for 3D IC Methodology and Tools Dave Noice Vassilios Gerousis Outline 3D IC Physical components Modeling 3D IC Stack Configuration Physical Design With TSV Summary

More information

Power Bounds and Large Scale Computing

Power Bounds and Large Scale Computing 1 Power Bounds and Large Scale Computing Friday, March 1, 2013 Bronis R. de Supinski 1 Tapasya Patki 2, David K. Lowenthal 2, Barry L. Rountree 1 and Martin Schulz 1 2 University of Arizona This work has

More information

Phastlane: A Rapid Transit Optical Routing Network

Phastlane: A Rapid Transit Optical Routing Network Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens

More information

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com

More information

BREAKING THE MEMORY WALL

BREAKING THE MEMORY WALL BREAKING THE MEMORY WALL CS433 Fall 2015 Dimitrios Skarlatos OUTLINE Introduction Current Trends in Computer Architecture 3D Die Stacking The memory Wall Conclusion INTRODUCTION Ideal Scaling of power

More information

c 2014 Aditya Agrawal

c 2014 Aditya Agrawal c 2014 Aditya Agrawal REFRESH REDUCTION IN DYNAMIC MEMORIES BY ADITYA AGRAWAL DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 14: Photonic Interconnect

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 14: Photonic Interconnect 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 14: Photonic Interconnect Instructor: Ron Dreslinski Winter 2016 1 1 Announcements 2 Remaining lecture schedule 3/15: Photonics

More information

Near-Threshold Computing: Reclaiming Moore s Law

Near-Threshold Computing: Reclaiming Moore s Law 1 Near-Threshold Computing: Reclaiming Moore s Law Dr. Ronald G. Dreslinski Research Fellow Ann Arbor 1 1 Motivation 1000000 Transistors (100,000's) 100000 10000 Power (W) Performance (GOPS) Efficiency (GOPS/W)

More information

Light64: Ligh support for data ra. Darko Marinov, Josep Torrellas. a.cs.uiuc.edu

Light64: Ligh support for data ra. Darko Marinov, Josep Torrellas.   a.cs.uiuc.edu : Ligh htweight hardware support for data ra ce detection ec during systematic testing Adrian Nistor, Darko Marinov, Josep Torrellas University of Illinois, Urbana Champaign http://iacoma a.cs.uiuc.edu

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip

More information

Virtualized and Flexible ECC for Main Memory

Virtualized and Flexible ECC for Main Memory Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

Accelerating Dynamic Data Race Detection Using Static Thread Interference Analysis

Accelerating Dynamic Data Race Detection Using Static Thread Interference Analysis Accelerating Dynamic Data Race Detection Using Static Thread Interference Peng Di and Yulei Sui School of Computer Science and Engineering The University of New South Wales 2052 Sydney Australia March

More information

3D TECHNOLOGIES: SOME PERSPECTIVES FOR MEMORY INTERCONNECT AND CONTROLLER

3D TECHNOLOGIES: SOME PERSPECTIVES FOR MEMORY INTERCONNECT AND CONTROLLER 3D TECHNOLOGIES: SOME PERSPECTIVES FOR MEMORY INTERCONNECT AND CONTROLLER CODES+ISSS: Special session on memory controllers Taipei, October 10 th 2011 Denis Dutoit, Fabien Clermidy, Pascal Vivet {denis.dutoit@cea.fr}

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1 Please find the power point presentation

More information

Energy-centric DVFS Controlling Method for Multi-core Platforms

Energy-centric DVFS Controlling Method for Multi-core Platforms Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Data Criticality in Network-On-Chip Design. Joshua San Miguel Natalie Enright Jerger

Data Criticality in Network-On-Chip Design. Joshua San Miguel Natalie Enright Jerger Data Criticality in Network-On-Chip Design Joshua San Miguel Natalie Enright Jerger Network-On-Chip Efficiency Efficiency is the ability to produce results with the least amount of waste. Wasted time Wasted

More information

Abhishek Pandey Aman Chadha Aditya Prakash

Abhishek Pandey Aman Chadha Aditya Prakash Abhishek Pandey Aman Chadha Aditya Prakash System: Building Blocks Motivation: Problem: Determining when to scale down the frequency at runtime is an intricate task. Proposed Solution: Use Machine learning

More information

Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer

Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer Nikolas Ioannou, Michael Kauschke, Matthias Gries, and Marcelo Cintra University of Edinburgh Intel Labs Braunschweig Introduction

More information

3D Technologies For Low Power Integrated Circuits

3D Technologies For Low Power Integrated Circuits 3D Technologies For Low Power Integrated Circuits Paul Franzon North Carolina State University Raleigh, NC paulf@ncsu.edu 919.515.7351 Outline 3DIC Technology Set Approaches to 3D Specific Power Minimization

More information

Introduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on

Introduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on Introduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on Exercises 2 Contech is An LLVM compiler pass to instrument

More information

Processor and DRAM Integration by TSV- Based 3-D Stacking for Power-Aware SOCs

Processor and DRAM Integration by TSV- Based 3-D Stacking for Power-Aware SOCs Processor and DRAM Integration by TSV- Based 3-D Stacking for Power-Aware SOCs Shin-Shiun Chen, Chun-Kai Hsu, Hsiu-Chuan Shih, and Cheng-Wen Wu Department of Electrical Engineering National Tsing Hua University

More information

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering

More information

Wafer Level Packaging The Promise Evolves Dr. Thomas Di Stefano Centipede Systems, Inc. IWLPC 2008

Wafer Level Packaging The Promise Evolves Dr. Thomas Di Stefano Centipede Systems, Inc. IWLPC 2008 Wafer Level Packaging The Promise Evolves Dr. Thomas Di Stefano Centipede Systems, Inc. IWLPC 2008 / DEVICE 1.E+03 1.E+02 1.E+01 1.E+00 1.E-01 1.E-02 1.E-03 1.E-04 1.E-05 1.E-06 1.E-07 Productivity Gains

More information

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement:

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary

More information

Respin: Rethinking Near- Threshold Multiprocessor Design with Non-Volatile Memory

Respin: Rethinking Near- Threshold Multiprocessor Design with Non-Volatile Memory Respin: Rethinking Near- Threshold Multiprocessor Design with Non-Volatile Memory Computer Architecture Research Lab h"p://arch.cse.ohio-state.edu Universal Demand for Low Power Mobility Ba"ery life Performance

More information

Towards Energy-Proportional Datacenter Memory with Mobile DRAM

Towards Energy-Proportional Datacenter Memory with Mobile DRAM Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University

More information

Congestion-Aware Power Grid. and CMOS Decoupling Capacitors. Pingqiang Zhou Karthikk Sridharan Sachin S. Sapatnekar

Congestion-Aware Power Grid. and CMOS Decoupling Capacitors. Pingqiang Zhou Karthikk Sridharan Sachin S. Sapatnekar Congestion-Aware Power Grid Optimization for 3D circuits Using MIM and CMOS Decoupling Capacitors Pingqiang Zhou Karthikk Sridharan Sachin S. Sapatnekar University of Minnesota 1 Outline Motivation A new

More information

Accurate Characterization of the Variability in Power Consumption in Modern Mobile Processors

Accurate Characterization of the Variability in Power Consumption in Modern Mobile Processors Accurate Characterization of the Variability in Power Consumption in Modern Mobile Processors Bharathan Balaji John McCullough, Rajesh Gupta, Yuvraj Agarwal Computer Science and Engineering, UC San Diego

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information

Scalable series-stacked power delivery architectures for improved efficiency and reduced supply current

Scalable series-stacked power delivery architectures for improved efficiency and reduced supply current Scalable series-stacked power delivery architectures for improved efficiency and reduced supply current Robert Pilawa Enver Candan, Josiah McClurg, Sai Zhang, Pradeep Shenoy* Phil Krein, Naresh Shanbhag

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

Mosaic: Exploiting the Spatial Locality of Process Variation to Reduce Refresh Energy in On-Chip edram Modules

Mosaic: Exploiting the Spatial Locality of Process Variation to Reduce Refresh Energy in On-Chip edram Modules Mosaic: Exploiting the Spatial Locality of Process Variation to Reduce Refresh Energy in On-Chip edram Modules Aditya Agrawal, Amin Ansari and Josep Torrellas University of Illinois at Urbana-Champaign

More information

Lightweight Memory Tracing

Lightweight Memory Tracing Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for

More information

High Performance Memory Opportunities in 2.5D Network Flow Processors

High Performance Memory Opportunities in 2.5D Network Flow Processors High Performance Memory Opportunities in 2.5D Network Flow Processors Jay Seaton, VP Silicon Operations, Netronome Larry Zu, PhD, President, Sarcina Technology LLC August 6, 2013 2013 Netronome 1 Netronome

More information

Stacked Silicon Interconnect Technology (SSIT)

Stacked Silicon Interconnect Technology (SSIT) Stacked Silicon Interconnect Technology (SSIT) Suresh Ramalingam Xilinx Inc. MEPTEC, January 12, 2011 Agenda Background and Motivation Stacked Silicon Interconnect Technology Summary Background and Motivation

More information

There is a paradigm shift in semiconductor industry towards 2.5D and 3D integration of heterogeneous parts to build complex systems.

There is a paradigm shift in semiconductor industry towards 2.5D and 3D integration of heterogeneous parts to build complex systems. Direct Connection and Testing of TSV and Microbump Devices using NanoPierce Contactor for 3D-IC Integration There is a paradigm shift in semiconductor industry towards 2.5D and 3D integration of heterogeneous

More information

Stacked IC Analysis Modeling for Power Noise Impact

Stacked IC Analysis Modeling for Power Noise Impact Si2 Open3D Kick-off Meeting June 7, 2011 Stacked IC Analysis Modeling for Power Noise Impact Aveek Sarkar Vice President Product Engineering & Support Stacked IC Design Needs Implementation Electrical-,

More information

DPPC: Dynamic Power Partitioning and Capping in Chip Multiprocessors

DPPC: Dynamic Power Partitioning and Capping in Chip Multiprocessors DPPC: Dynamic Partitioning and Capping in Chip Multiprocessors Kai Ma 1,2, Xiaorui Wang 1,2, and Yefu Wang 2 1 The Ohio State University 2 University of Tennessee, Knoxville Abstract A key challenge in

More information

Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization

Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization Jeffrey Young, Sudhakar Yalamanchili School of Electrical and Computer Engineering, Georgia Institute of Technology Talk

More information

PicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor

PicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor PicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor Taeho Kgil, Shaun D Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski, Steve Reinhardt, Krisztian Flautner,

More information

From 3D Toolbox to 3D Integration: Examples of Successful 3D Applicative Demonstrators N.Sillon. CEA. All rights reserved

From 3D Toolbox to 3D Integration: Examples of Successful 3D Applicative Demonstrators N.Sillon. CEA. All rights reserved From 3D Toolbox to 3D Integration: Examples of Successful 3D Applicative Demonstrators N.Sillon Agenda Introduction 2,5D: Silicon Interposer 3DIC: Wide I/O Memory-On-Logic 3D Packaging: X-Ray sensor Conclusion

More information

Developed Hybrid Memory System for New SoC. -Why choose Wide I/O?

Developed Hybrid Memory System for New SoC. -Why choose Wide I/O? Developed Hybrid Memory System for New SoC. -Why choose Wide I/O? Takashi Yamada Chief Architect, System LSI Business Division Mobile Forum 2014 Copyright 2014 - Panasonic Agenda 4K (UHD) market and changes

More information

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Jie Meng, Tiansheng Zhang, and Ayse K. Coskun Electrical and Computer Engineering Department, Boston University,

More information

IMEC CORE CMOS P. MARCHAL

IMEC CORE CMOS P. MARCHAL APPLICATIONS & 3D TECHNOLOGY IMEC CORE CMOS P. MARCHAL OUTLINE What is important to spec 3D technology How to set specs for the different applications - Mobile consumer - Memory - High performance Conclusions

More information

Apache s Power Noise Simulation Technologies

Apache s Power Noise Simulation Technologies Enabling Power Efficient i Designs Apache s Power Noise Simulation Technologies 1 Aveek Sarkar VP of Support Apache Design Inc, A wholly owned subsidiary of ANSYS Trends in Today s Electronic Designs Low-power

More information

Low-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM

Low-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM Low-Cost Inter-Linked ubarrays (LIA) Enabling Fast Inter-ubarray Data Movement in DRAM Kevin Chang rashant Nair, Donghyuk Lee, augata Ghose, Moinuddin Qureshi, and Onur Mutlu roblem: Inefficient Bulk Data

More information

39 Does the Sharing of Execution Units Improve Performance/Power of Multicores?

39 Does the Sharing of Execution Units Improve Performance/Power of Multicores? 39 Does the Sharing of Execution Units Improve Performance/Power of Multicores? RANCE RODRIGUES, ISRAEL KOREN AND SANDIP KUNDU, Department of Electrical and Computer Engineering, University of Massachusetts,

More information

Advanced Computer Architecture (CS620)

Advanced Computer Architecture (CS620) Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).

More information

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information

More information

Optimizing GPU Energy Efficiency with 3D Die-Stacking Graphics Memory and Reconfigurable Memory Interface

Optimizing GPU Energy Efficiency with 3D Die-Stacking Graphics Memory and Reconfigurable Memory Interface Optimizing GPU Energy Efficiency with 3D Die-Stacking Graphics Memory and Reconfigurable Memory Interface JISHEN ZHAO, Pennsylvania State University GUANGYU SUN, Peking University GABRIEL H. LOH, Advanced

More information

Economic Viability of Hardware Overprovisioning in Power- Constrained High Performance Compu>ng

Economic Viability of Hardware Overprovisioning in Power- Constrained High Performance Compu>ng Economic Viability of Hardware Overprovisioning in Power- Constrained High Performance Compu>ng Energy Efficient Supercompu1ng, SC 16 November 14, 2016 This work was performed under the auspices of the U.S.

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

UNDERCLOCKED SOFTWARE PREFETCHING: MORE CORES, LESS ENERGY

UNDERCLOCKED SOFTWARE PREFETCHING: MORE CORES, LESS ENERGY ... UNDERCLOCKED SOFTWARE PREFETCHING: MORE CORES, LESS ENERGY... POWER CONSUMPTION IS A CONCERN FOR HELPER-THREAD PREFETCHING THAT USES EXTRA CORES TO SPEED UP THE SINGLE-THREAD EXECUTION, BECAUSE POWER

More information

DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently

DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently : Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign

More information

Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology

Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology Shinobu Miwa Sho Aita Hiroshi Nakamura The University of Tokyo {miwa, aita, nakamura}@hal.ipc.i.u-tokyo.ac.jp

More information

A Study of IR-drop Noise Issues in 3D ICs with Through-Silicon-Vias

A Study of IR-drop Noise Issues in 3D ICs with Through-Silicon-Vias A Study of IR-drop Noise Issues in 3D ICs with Through-Silicon-Vias Moongon Jung and Sung Kyu Lim School of Electrical and Computer Engineering Georgia Institute of Technology, Atlanta, Georgia, USA Email:

More information

AS memory-intensive applications such as web servers,

AS memory-intensive applications such as web servers, 274 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 1, JANUARY 2017 Using Switchable Pins to Increase Off-Chip Bandwidth in Chip-Multiprocessors Shaoming Chen, Samuel Irving, Lu Peng,

More information

Dynamic Cache Pooling in 3D Multicore Processors

Dynamic Cache Pooling in 3D Multicore Processors Dynamic Cache Pooling in 3D Multicore Processors TIANSHENG ZHANG, JIE MENG, and AYSE K. COSKUN, BostonUniversity Resource pooling, where multiple architectural components are shared among cores, is a promising

More information

Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems

Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems ISPASS 2011 Hyojin Choi *, Jongbok Lee +, and Wonyong Sung * hjchoi@dsp.snu.ac.kr, jblee@hansung.ac.kr, wysung@snu.ac.kr * Seoul

More information

Emerging NVM Memory Technologies

Emerging NVM Memory Technologies Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement

More information

Memory Prefetching for the GreenDroid Microprocessor. David Curran December 10, 2012

Memory Prefetching for the GreenDroid Microprocessor. David Curran December 10, 2012 Memory Prefetching for the GreenDroid Microprocessor David Curran December 10, 2012 Outline Memory Prefetchers Overview GreenDroid Overview Problem Description Design Placement Prediction Logic Simulation

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 24

ECE 571 Advanced Microprocessor-Based Design Lecture 24 ECE 571 Advanced Microprocessor-Based Design Lecture 24 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 25 April 2013 Project/HW Reminder Project Presentations. 15-20 minutes.

More information

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More

More information

Imaging Solutions by Mercury Computer Systems

Imaging Solutions by Mercury Computer Systems Imaging Solutions by Mercury Computer Systems Presented By Raj Parihar Computer Architecture Reading Group, UofR Mercury Computer Systems Boston based; designs and builds embedded multi computers Loosely

More information

Xilinx SSI Technology Concept to Silicon Development Overview

Xilinx SSI Technology Concept to Silicon Development Overview Xilinx SSI Technology Concept to Silicon Development Overview Shankar Lakka Aug 27 th, 2012 Agenda Economic Drivers and Technical Challenges Xilinx SSI Technology, Power, Performance SSI Development Overview

More information

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes

More information

Part IV: 3D WiNoC Architectures

Part IV: 3D WiNoC Architectures Wireless NoC as Interconnection Backbone for Multicore Chips: Promises, Challenges, and Recent Developments Part IV: 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan 1 Outline: 3D WiNoC Architectures

More information

Advances in 3D Simulations of Chip/Package/PCB Co-Design

Advances in 3D Simulations of Chip/Package/PCB Co-Design Advances in 3D Simulations of Chip/Package/PCB Co-Design Richard Sjiariel, CST AG Co-design environment Signal Integrity and timing Thermal analysis and stress Power Integrity and noise analysis EMC/EMI

More information

WLSI Extends Si Processing and Supports Moore s Law. Douglas Yu TSMC R&D,

WLSI Extends Si Processing and Supports Moore s Law. Douglas Yu TSMC R&D, WLSI Extends Si Processing and Supports Moore s Law Douglas Yu TSMC R&D, chyu@tsmc.com SiP Summit, Semicon Taiwan, Taipei, Taiwan, Sep. 9 th, 2016 Introduction Moore s Law Challenges Heterogeneous Integration

More information

Accelerating Multi-core Processor Design Space Evaluation Using Automatic Multi-threaded Workload Synthesis

Accelerating Multi-core Processor Design Space Evaluation Using Automatic Multi-threaded Workload Synthesis Accelerating Multi-core Processor Design Space Evaluation Using Automatic Multi-threaded Workload Synthesis Clay Hughes & Tao Li Department of Electrical and Computer Engineering University of Florida

More information

Advanced Heterogeneous Solutions for System Integration

Advanced Heterogeneous Solutions for System Integration Advanced Heterogeneous Solutions for System Integration Kees Joosse Director Sales, Israel TSMC High-Growth Applications Drive Product and Technology Smartphone Cloud Data Center IoT CAGR 12 17 20% 24%

More information

Virtualized ECC: Flexible Reliability in Memory Systems

Virtualized ECC: Flexible Reliability in Memory Systems Virtualized ECC: Flexible Reliability in Memory Systems Doe Hyun Yoon Advisor: Mattan Erez Electrical and Computer Engineering The University of Texas at Austin Motivation Reliability concerns are growing

More information

Good old days ended in Nov. 2002

Good old days ended in Nov. 2002 WaveScalar! Good old days 2 Good old days ended in Nov. 2002 Complexity Clock scaling Area scaling 3 Chip MultiProcessors Low complexity Scalable Fast 4 CMP Problems Hard to program Not practical to scale

More information

Power and Thermal Models. for RAMP2

Power and Thermal Models. for RAMP2 Power and Thermal Models for 2 Jose Renau Department of Computer Engineering, University of California Santa Cruz http://masc.cse.ucsc.edu Motivation Performance not the only first order design parameter

More information

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,

More information

TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks

TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks Gwangsun Kim Arm Research Hayoung Choi, John Kim KAIST High-radix Networks Dragonfly network in Cray XC30 system 1D Flattened butterfly

More information

Comparison & highlight on the last 3D TSV technologies trends Romain Fraux

Comparison & highlight on the last 3D TSV technologies trends Romain Fraux Comparison & highlight on the last 3D TSV technologies trends Romain Fraux Advanced Packaging & MEMS Project Manager European 3D Summit 18 20 January, 2016 Outline About System Plus Consulting 2015 3D

More information

Understanding Reduced-Voltage Operation in Modern DRAM Devices

Understanding Reduced-Voltage Operation in Modern DRAM Devices Understanding Reduced-Voltage Operation in Modern DRAM Devices Experimental Characterization, Analysis, and Mechanisms Kevin Chang A. Giray Yaglikci, Saugata Ghose,Aditya Agrawal *, Niladrish Chatterjee

More information

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack

More information

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Transparent Offloading and Mapping () Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O Connor, Nandita Vijaykumar,

More information

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012 Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso

More information

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015 The Road to the AMD Fiji GPU Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015 Fiji Chip DETAILED LOOK 4GB High-Bandwidth Memory 4096-bit wide interface 512 GB/s

More information

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information