Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks
|
|
- Magnus Fletcher
- 5 years ago
- Views:
Transcription
1 Snatch: Opportunistically Reassigning Power Allocation between and in 3D Stacks Dimitrios Skarlatos, Renji Thomas, Aditya Agrawal, Shibin Qin, Robert Pilawa, Ulya Karpuzcu, Radu Teodorescu, Nam Sung Kim, and Josep Torrellas UIUC, OSU, UMN, NVIDIA 1
2 Motivation: Cost of Power/Ground Pins in 3D stacks DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 2
3 Motivation: Cost of Power/Ground Pins in 3D stacks Size & cost of packages is proportional to # of pins DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 3
4 Motivation: Cost of Power/Ground Pins in 3D stacks Size & cost of packages is proportional to # of pins 3D Stacks: Disjoint Power/Ground pins for and DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 4
5 Motivation: Cost of Power/Ground Pins in 3D stacks Size & cost of packages is proportional to # of pins 3D Stacks: Disjoint Power/Ground pins for and Each dimensioned for the worst case DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 5
6 Motivation: Underutilization of Power Budget High or Power phases 6
7 Contribution: Snatch Dynamically and opportunistically divert power between processor and memory Mem1 die Conventional Mem Mem0 die TSVs die Proc 7
8 Contribution: Snatch Dynamically and opportunistically divert power between processor and memory On-chip voltage regulator connects the two Power Delivery Networks or can consume more power for the same # of pins Mem1 die Snatch Mem Mem0 die TSVs die Proc On-chip 8
9 Impact Compared to Conventional 3D Stacks For same # of power/ground pins: Application can consume more power Up to 23% application speedup For the same maximum power in and Fewer pins, about 30% package cost reduction 9
10 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 10
11 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 11
12 Conventional Implementation V PCB substrate DRAM1 die DRAM0 die die TSVs 1.1V microbumps C4 bumps BGA pins 12V 5.5W 4.5W 12V Cross-section 12
13 Snatch implementation V PCB substrate DRAM1 die DRAM0 die die TSVs 1.1V microbumps C4 bumps BGA pins 12V on-chip Single On-Chip 5.5W 4.5W 12V Cross-section 13
14 Snatch implementation Small 2W on-chip bidirectional on Proc die Bulk of work from off-chip s V PCB substrate DRAM1 die DRAM0 die die TSVs 1.1V microbumps C4 bumps BGA pins 12V on-chip Single On-Chip 5.5W 4.5W 12V Cross-section 14
15 Snatch: Dynamic power reassignment Up/Down convert power Snatched DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins Single on-chip Single On-Chip 15
16 Snatch: Dynamic power reassignment Up/Down convert power Snatched 2W 1.1V V DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins Single on-chip Single On-Chip 16
17 Snatch: Cross-Section Small 2W on-chip bidirectional on Proc die DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins Single on-chip Single On-Chip Cross-section 17
18 Snatch: Top Down Small 2W on-chip bidirectional on Proc die DRAM1 die DRAM0 die die TSVs microbumps 5.5W V On-chip PCB substrate C4 bumps BGA pins Single on-chip Single On-Chip 4.5W 1.1V Cross-section Top Down 18
19 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 19
20 Snatching Power On processor intensive phase 5.5W 5.5W On-chip 4.5W 4.5W 20
21 Snatching Power On processor intensive phase Snatch Power TurboBoost 5.5W 7.5W 2W On-chip 4.5W 2.5W 21
22 Snatching Power On memory intensive phase Snatch Power TurboBoost 5.5W 3.5W 2W On-chip 4.5W 6.5W 22
23 Snatching Decisions or Intensive Phase? 5.5W 5.5W?W On-chip 4.5W 4.5W 23
24 Snatching Decisions or Intensive Phase? How much Power is available? 5.5W 5.5W?W On-chip 4.5W 4.5W 24
25 Snatching Decisions or Intensive Phase? How much Power is available? How much Power can we Snatch? 5.5W 5.5W?W On-chip 4.5W 4.5W 25
26 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs 5.5W?W On-chip 4.5W 26
27 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs Average for activity detection 5.5W?W On-chip 4.5W 27
28 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs Average for activity detection 5.5W MAX for power availability?w On-chip 4.5W 28
29 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs Average for activity detection MAX for power availability 5.5W Avoid hysteresis?w On-chip 4.5W 29
30 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 30
31 Conventional Power Provisioning provisioned for 7.5W On-chip 31
32 Conventional Power Provisioning provisioned for 7.5W 7.5W 7.5W On-chip 32
33 Conventional Power Provisioning provisioned for 7.5W provisioned for 6.5W 7.5W 7.5W On-chip 6.5W 6.5W 33
34 Conventional Power Provisioning provisioned for 7.5W provisioned for 6.5W Total = + = 14W 7.5W 7.5W On-chip 6.5W 6.5W 34
35 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W provisioned for 6.5W Total = + = 14W 7.5W 7.5W On-chip 6.5W 6.5W 35
36 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W provisioned for 6.5W Total = + = 14W 7.5W 7.5W Snatch 2W On-chip 6.5W 6.5W 36
37 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W Total = + = 14W 5.5W 7.5W Snatch 2W 5.5W +-2W On-chip 6.5W 6.5W 37
38 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W - 2W = 4.5W Total = + = 14W 5.5W 5.5W +-2W 7.5W Snatch 2W 6.5W 4.5W On-chip 4.5W +-2W38
39 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W - 2W = 4.5W Total = + = 14W - 4W = 10W Reduce Total Provisioning from 14W to 10W, approx same performance 5.5W 5.5W +-2W 7.5W Snatch 2W On-chip 6.5W 4.5W 4.5W +-2W39
40 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W - 2W = 4.5W Total = + = 14W - 4W = 10W Reduce Total Provisioning from 14W to 10W, approx same performance 5.5W 5.5W +-2W 7.5W Snatch 2W On-chip 30% Reduction in Package Power/Ground Pins 6.5W 4.5W 4.5W +-2W40
41 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 41
42 Conventional Power Provisioning & provisioned for 5.5W & 4.5W On-chip 42
43 Conventional Power Provisioning & provisioned for 5.5W & 4.5W 5.5W 5.5W On-chip 4.5W 4.5W 43
44 Snatch: Provide Additional Power & provisioned for 5.5W & 4.5W Snatch power 5.5W 5.5W Snatch 2W On-chip 4.5W 4.5W 44
45 Snatch: Opportunistically boost performance & provisioned for 5.5W & 4.5W Snatch power and boost performance 5.5W 5.5W +-2W Snatch 2W On-chip 4.5W 4.5W +-2W45
46 Snatch: Boost Performance with Same # of Pins & provisioned for 5.5W & 4.5W Snatch power and boost performance Same # of pins as conventional 5.5W 5.5W +-2W Snatch 2W On-chip 4.5W 4.5W +-2W46
47 Snatch: Boost Performance with Same # of Pins & provisioned for 5.5W & 4.5W Snatch power and boost performance Same # of pins as conventional Higher performance for the same package cost 5.5W Snatch 2W 5.5W +-2W On-chip 4.5W 4.5W +-2W47
48 Snatch: Boost Performance with Same # of Pins & provisioned for 5.5W & 4.5W Snatch power and boost performance Same # of pins as conventional Higher performance for the same package cost IR-drop and EM characteristics remain the same 5.5W Snatch 2W 4.5W 5.5W +-2W On-chip 4.5W +-2W48
49 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 49
50 Evaluation Methodology Case 2: Same # of pins, improved performance : 22nm LP 8-core w/ SESC + McPAT : 4GB 2-layer WideIO2 w/ DRAMSim2 Benchmarks: SPLASH-2, NAS, and SPEC 50
51 Performance 1.3 Baseline Turbo Boost Snatch Speedup Average(Splash + NAS) Average(SPEC) 51
52 Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup Average(Splash + NAS) Average(SPEC) 52
53 Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Average(Splash + NAS) Average(SPEC) 53
54 Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup Average(Splash + NAS) Average(SPEC) Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Snatch P = {5.5W, GHz} M= {4.5W, MHz} DVFS and Snatch up to 2W 54
55 Performance Speedup Baseline Turbo Boost Snatch 25% 8% Average(Splash + NAS) 10% Average(SPEC) Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Snatch boosts performance on average, by 25% Snatch against Baseline and 8% against Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS and Snatch up to 2W for Splash and NAS benchmarks 55
56 Performance Speedup Baseline Turbo Boost Snatch 25% 8% Average(Splash + NAS) 10% Average(SPEC) Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Snatch boosts performance on average, by 10% Snatch against Baseline for SPEC benchmarks P = {5.5W, GHz} M= {4.5W, MHz} DVFS and Snatch up to 2W Negligible gains against Turbo Boost 56
57 Snatching Activity Overview 90 M->P P->M % of Total Time Snatching Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 57
58 Snatching Activity Overview 90 M P P M % of Total Time Snatching Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 58
59 Snatching Activity Overview 90 M P P M % of Total Time Snatching Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 59
60 Snatching Activity Overview 90 M P P M % of Total Time Snatching Application Snatch on average, 30% for Splash+NAS 9.4% for SPEC 0 Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 60
61 More On the Paper Design and Implementation: On-chip Voltage Regulator Snatch Algorithm Additional Evaluation: Snatch Algorithm Power Delivery Network Pin Reliability 3D Stack Thermals 61
62 Summary Snatch: An opportunistic power reassignment design for 3D Stacked architectures Small on-chip bidirectional - phase detection and power availability estimation Up to 23% application speedup Alternatively, about 30% package cost reduction 62
63 Image Source : 63
Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks
Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks Aditya Agrawal, Josep Torrellas and Sachin Idgunji University of Illinois at Urbana Champaign and Nvidia Corporation http://iacoma.cs.uiuc.edu
More informationMemory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez
Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection
More informationFlexible Cache Error Protection using an ECC FIFO
Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead
More informationExploring Hardware Overprovisioning in Power-Constrained, High Performance Computing
Exploring Hardware Overprovisioning in Power-Constrained, High Performance Computing Tapasya Patki 1 David Lowenthal 1 Barry Rountree 2 Martin Schulz 2 Bronis de Supinski 2 1 The University of Arizona
More informationReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard
More information3D Memory Architecture. Kyushu University
3D Memory Architecture Koji Inoue Kyushu University 1 Outline Why 3D? Will 3D always work well? Support Adaptive Execution! Memory Hierarchy Run time Optimization Conclusions 2 Outline Why 3D? Will 3D
More informationPageForge: A Near-Memory Content- Aware Page-Merging Architecture
PageForge: A Near-Memory Content- Aware Page-Merging Architecture Dimitrios Skarlatos, Nam Sung Kim, and Josep Torrellas University of Illinois at Urbana-Champaign MICRO-50 @ Boston Motivation: Server
More informationfor High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami
3D Implemented dsram/dram HbidC Hybrid Cache Architecture t for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University
More informationCache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two
Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu
More informationSwizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems
1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationPhysical Design Implementation for 3D IC Methodology and Tools. Dave Noice Vassilios Gerousis
I NVENTIVE Physical Design Implementation for 3D IC Methodology and Tools Dave Noice Vassilios Gerousis Outline 3D IC Physical components Modeling 3D IC Stack Configuration Physical Design With TSV Summary
More informationPower Bounds and Large Scale Computing
1 Power Bounds and Large Scale Computing Friday, March 1, 2013 Bronis R. de Supinski 1 Tapasya Patki 2, David K. Lowenthal 2, Barry L. Rountree 1 and Martin Schulz 1 2 University of Arizona This work has
More informationPhastlane: A Rapid Transit Optical Routing Network
Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens
More informationAddendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches
Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com
More informationBREAKING THE MEMORY WALL
BREAKING THE MEMORY WALL CS433 Fall 2015 Dimitrios Skarlatos OUTLINE Introduction Current Trends in Computer Architecture 3D Die Stacking The memory Wall Conclusion INTRODUCTION Ideal Scaling of power
More informationc 2014 Aditya Agrawal
c 2014 Aditya Agrawal REFRESH REDUCTION IN DYNAMIC MEMORIES BY ADITYA AGRAWAL DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and
More informationEECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 14: Photonic Interconnect
1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 14: Photonic Interconnect Instructor: Ron Dreslinski Winter 2016 1 1 Announcements 2 Remaining lecture schedule 3/15: Photonics
More informationNear-Threshold Computing: Reclaiming Moore s Law
1 Near-Threshold Computing: Reclaiming Moore s Law Dr. Ronald G. Dreslinski Research Fellow Ann Arbor 1 1 Motivation 1000000 Transistors (100,000's) 100000 10000 Power (W) Performance (GOPS) Efficiency (GOPS/W)
More informationLight64: Ligh support for data ra. Darko Marinov, Josep Torrellas. a.cs.uiuc.edu
: Ligh htweight hardware support for data ra ce detection ec during systematic testing Adrian Nistor, Darko Marinov, Josep Torrellas University of Illinois, Urbana Champaign http://iacoma a.cs.uiuc.edu
More informationEECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects
1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip
More informationVirtualized and Flexible ECC for Main Memory
Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC
More informationPerformance and Power Impact of Issuewidth in Chip-Multiprocessor Cores
Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions
More informationAccelerating Dynamic Data Race Detection Using Static Thread Interference Analysis
Accelerating Dynamic Data Race Detection Using Static Thread Interference Peng Di and Yulei Sui School of Computer Science and Engineering The University of New South Wales 2052 Sydney Australia March
More information3D TECHNOLOGIES: SOME PERSPECTIVES FOR MEMORY INTERCONNECT AND CONTROLLER
3D TECHNOLOGIES: SOME PERSPECTIVES FOR MEMORY INTERCONNECT AND CONTROLLER CODES+ISSS: Special session on memory controllers Taipei, October 10 th 2011 Denis Dutoit, Fabien Clermidy, Pascal Vivet {denis.dutoit@cea.fr}
More informationDecoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1 Please find the power point presentation
More informationEnergy-centric DVFS Controlling Method for Multi-core Platforms
Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To
More informationPresenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs
Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance
More informationData Criticality in Network-On-Chip Design. Joshua San Miguel Natalie Enright Jerger
Data Criticality in Network-On-Chip Design Joshua San Miguel Natalie Enright Jerger Network-On-Chip Efficiency Efficiency is the ability to produce results with the least amount of waste. Wasted time Wasted
More informationAbhishek Pandey Aman Chadha Aditya Prakash
Abhishek Pandey Aman Chadha Aditya Prakash System: Building Blocks Motivation: Problem: Determining when to scale down the frequency at runtime is an intricate task. Proposed Solution: Use Machine learning
More informationPhase-Based Application-Driven Power Management on the Single-chip Cloud Computer
Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer Nikolas Ioannou, Michael Kauschke, Matthias Gries, and Marcelo Cintra University of Edinburgh Intel Labs Braunschweig Introduction
More information3D Technologies For Low Power Integrated Circuits
3D Technologies For Low Power Integrated Circuits Paul Franzon North Carolina State University Raleigh, NC paulf@ncsu.edu 919.515.7351 Outline 3DIC Technology Set Approaches to 3D Specific Power Minimization
More informationIntroduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on
Introduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on Exercises 2 Contech is An LLVM compiler pass to instrument
More informationProcessor and DRAM Integration by TSV- Based 3-D Stacking for Power-Aware SOCs
Processor and DRAM Integration by TSV- Based 3-D Stacking for Power-Aware SOCs Shin-Shiun Chen, Chun-Kai Hsu, Hsiu-Chuan Shih, and Cheng-Wen Wu Department of Electrical Engineering National Tsing Hua University
More informationEnergy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques
Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering
More informationWafer Level Packaging The Promise Evolves Dr. Thomas Di Stefano Centipede Systems, Inc. IWLPC 2008
Wafer Level Packaging The Promise Evolves Dr. Thomas Di Stefano Centipede Systems, Inc. IWLPC 2008 / DEVICE 1.E+03 1.E+02 1.E+01 1.E+00 1.E-01 1.E-02 1.E-03 1.E-04 1.E-05 1.E-06 1.E-07 Productivity Gains
More informationHybrid Cache Architecture (HCA) with Disparate Memory Technologies
Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement:
More informationResource-Conscious Scheduling for Energy Efficiency on Multicore Processors
Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe
More informationInterconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp
Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary
More informationRespin: Rethinking Near- Threshold Multiprocessor Design with Non-Volatile Memory
Respin: Rethinking Near- Threshold Multiprocessor Design with Non-Volatile Memory Computer Architecture Research Lab h"p://arch.cse.ohio-state.edu Universal Demand for Low Power Mobility Ba"ery life Performance
More informationTowards Energy-Proportional Datacenter Memory with Mobile DRAM
Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University
More informationCongestion-Aware Power Grid. and CMOS Decoupling Capacitors. Pingqiang Zhou Karthikk Sridharan Sachin S. Sapatnekar
Congestion-Aware Power Grid Optimization for 3D circuits Using MIM and CMOS Decoupling Capacitors Pingqiang Zhou Karthikk Sridharan Sachin S. Sapatnekar University of Minnesota 1 Outline Motivation A new
More informationAccurate Characterization of the Variability in Power Consumption in Modern Mobile Processors
Accurate Characterization of the Variability in Power Consumption in Modern Mobile Processors Bharathan Balaji John McCullough, Rajesh Gupta, Yuvraj Agarwal Computer Science and Engineering, UC San Diego
More informationRethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization
Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,
More informationScalable series-stacked power delivery architectures for improved efficiency and reduced supply current
Scalable series-stacked power delivery architectures for improved efficiency and reduced supply current Robert Pilawa Enver Candan, Josiah McClurg, Sai Zhang, Pradeep Shenoy* Phil Krein, Naresh Shanbhag
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationMosaic: Exploiting the Spatial Locality of Process Variation to Reduce Refresh Energy in On-Chip edram Modules
Mosaic: Exploiting the Spatial Locality of Process Variation to Reduce Refresh Energy in On-Chip edram Modules Aditya Agrawal, Amin Ansari and Josep Torrellas University of Illinois at Urbana-Champaign
More informationLightweight Memory Tracing
Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for
More informationHigh Performance Memory Opportunities in 2.5D Network Flow Processors
High Performance Memory Opportunities in 2.5D Network Flow Processors Jay Seaton, VP Silicon Operations, Netronome Larry Zu, PhD, President, Sarcina Technology LLC August 6, 2013 2013 Netronome 1 Netronome
More informationStacked Silicon Interconnect Technology (SSIT)
Stacked Silicon Interconnect Technology (SSIT) Suresh Ramalingam Xilinx Inc. MEPTEC, January 12, 2011 Agenda Background and Motivation Stacked Silicon Interconnect Technology Summary Background and Motivation
More informationThere is a paradigm shift in semiconductor industry towards 2.5D and 3D integration of heterogeneous parts to build complex systems.
Direct Connection and Testing of TSV and Microbump Devices using NanoPierce Contactor for 3D-IC Integration There is a paradigm shift in semiconductor industry towards 2.5D and 3D integration of heterogeneous
More informationStacked IC Analysis Modeling for Power Noise Impact
Si2 Open3D Kick-off Meeting June 7, 2011 Stacked IC Analysis Modeling for Power Noise Impact Aveek Sarkar Vice President Product Engineering & Support Stacked IC Design Needs Implementation Electrical-,
More informationDPPC: Dynamic Power Partitioning and Capping in Chip Multiprocessors
DPPC: Dynamic Partitioning and Capping in Chip Multiprocessors Kai Ma 1,2, Xiaorui Wang 1,2, and Yefu Wang 2 1 The Ohio State University 2 University of Tennessee, Knoxville Abstract A key challenge in
More informationDynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization
Dynamic Partitioned Global Address Spaces for Power Efficient DRAM Virtualization Jeffrey Young, Sudhakar Yalamanchili School of Electrical and Computer Engineering, Georgia Institute of Technology Talk
More informationPicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor
PicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor Taeho Kgil, Shaun D Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski, Steve Reinhardt, Krisztian Flautner,
More informationFrom 3D Toolbox to 3D Integration: Examples of Successful 3D Applicative Demonstrators N.Sillon. CEA. All rights reserved
From 3D Toolbox to 3D Integration: Examples of Successful 3D Applicative Demonstrators N.Sillon Agenda Introduction 2,5D: Silicon Interposer 3DIC: Wide I/O Memory-On-Logic 3D Packaging: X-Ray sensor Conclusion
More informationDeveloped Hybrid Memory System for New SoC. -Why choose Wide I/O?
Developed Hybrid Memory System for New SoC. -Why choose Wide I/O? Takashi Yamada Chief Architect, System LSI Business Division Mobile Forum 2014 Copyright 2014 - Panasonic Agenda 4K (UHD) market and changes
More informationDynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors
Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Jie Meng, Tiansheng Zhang, and Ayse K. Coskun Electrical and Computer Engineering Department, Boston University,
More informationIMEC CORE CMOS P. MARCHAL
APPLICATIONS & 3D TECHNOLOGY IMEC CORE CMOS P. MARCHAL OUTLINE What is important to spec 3D technology How to set specs for the different applications - Mobile consumer - Memory - High performance Conclusions
More informationApache s Power Noise Simulation Technologies
Enabling Power Efficient i Designs Apache s Power Noise Simulation Technologies 1 Aveek Sarkar VP of Support Apache Design Inc, A wholly owned subsidiary of ANSYS Trends in Today s Electronic Designs Low-power
More informationLow-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM
Low-Cost Inter-Linked ubarrays (LIA) Enabling Fast Inter-ubarray Data Movement in DRAM Kevin Chang rashant Nair, Donghyuk Lee, augata Ghose, Moinuddin Qureshi, and Onur Mutlu roblem: Inefficient Bulk Data
More information39 Does the Sharing of Execution Units Improve Performance/Power of Multicores?
39 Does the Sharing of Execution Units Improve Performance/Power of Multicores? RANCE RODRIGUES, ISRAEL KOREN AND SANDIP KUNDU, Department of Electrical and Computer Engineering, University of Massachusetts,
More informationAdvanced Computer Architecture (CS620)
Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).
More informationIMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM
IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information
More informationOptimizing GPU Energy Efficiency with 3D Die-Stacking Graphics Memory and Reconfigurable Memory Interface
Optimizing GPU Energy Efficiency with 3D Die-Stacking Graphics Memory and Reconfigurable Memory Interface JISHEN ZHAO, Pennsylvania State University GUANGYU SUN, Peking University GABRIEL H. LOH, Advanced
More informationEconomic Viability of Hardware Overprovisioning in Power- Constrained High Performance Compu>ng
Economic Viability of Hardware Overprovisioning in Power- Constrained High Performance Compu>ng Energy Efficient Supercompu1ng, SC 16 November 14, 2016 This work was performed under the auspices of the U.S.
More informationReducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University
Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck
More informationUNDERCLOCKED SOFTWARE PREFETCHING: MORE CORES, LESS ENERGY
... UNDERCLOCKED SOFTWARE PREFETCHING: MORE CORES, LESS ENERGY... POWER CONSUMPTION IS A CONCERN FOR HELPER-THREAD PREFETCHING THAT USES EXTRA CORES TO SPEED UP THE SINGLE-THREAD EXECUTION, BECAUSE POWER
More informationDeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently
: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign
More informationPerformance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology
Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology Shinobu Miwa Sho Aita Hiroshi Nakamura The University of Tokyo {miwa, aita, nakamura}@hal.ipc.i.u-tokyo.ac.jp
More informationA Study of IR-drop Noise Issues in 3D ICs with Through-Silicon-Vias
A Study of IR-drop Noise Issues in 3D ICs with Through-Silicon-Vias Moongon Jung and Sung Kyu Lim School of Electrical and Computer Engineering Georgia Institute of Technology, Atlanta, Georgia, USA Email:
More informationAS memory-intensive applications such as web servers,
274 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 1, JANUARY 2017 Using Switchable Pins to Increase Off-Chip Bandwidth in Chip-Multiprocessors Shaoming Chen, Samuel Irving, Lu Peng,
More informationDynamic Cache Pooling in 3D Multicore Processors
Dynamic Cache Pooling in 3D Multicore Processors TIANSHENG ZHANG, JIE MENG, and AYSE K. COSKUN, BostonUniversity Resource pooling, where multiple architectural components are shared among cores, is a promising
More informationMemory Access Pattern-Aware DRAM Performance Model for Multi-core Systems
Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems ISPASS 2011 Hyojin Choi *, Jongbok Lee +, and Wonyong Sung * hjchoi@dsp.snu.ac.kr, jblee@hansung.ac.kr, wysung@snu.ac.kr * Seoul
More informationEmerging NVM Memory Technologies
Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement
More informationMemory Prefetching for the GreenDroid Microprocessor. David Curran December 10, 2012
Memory Prefetching for the GreenDroid Microprocessor David Curran December 10, 2012 Outline Memory Prefetchers Overview GreenDroid Overview Problem Description Design Placement Prediction Logic Simulation
More informationThe Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory
The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*
More informationECE 571 Advanced Microprocessor-Based Design Lecture 24
ECE 571 Advanced Microprocessor-Based Design Lecture 24 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 25 April 2013 Project/HW Reminder Project Presentations. 15-20 minutes.
More informationJosé F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2
CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More
More informationImaging Solutions by Mercury Computer Systems
Imaging Solutions by Mercury Computer Systems Presented By Raj Parihar Computer Architecture Reading Group, UofR Mercury Computer Systems Boston based; designs and builds embedded multi computers Loosely
More informationXilinx SSI Technology Concept to Silicon Development Overview
Xilinx SSI Technology Concept to Silicon Development Overview Shankar Lakka Aug 27 th, 2012 Agenda Economic Drivers and Technical Challenges Xilinx SSI Technology, Power, Performance SSI Development Overview
More informationImproving DRAM Performance by Parallelizing Refreshes with Accesses
Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes
More informationPart IV: 3D WiNoC Architectures
Wireless NoC as Interconnection Backbone for Multicore Chips: Promises, Challenges, and Recent Developments Part IV: 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan 1 Outline: 3D WiNoC Architectures
More informationAdvances in 3D Simulations of Chip/Package/PCB Co-Design
Advances in 3D Simulations of Chip/Package/PCB Co-Design Richard Sjiariel, CST AG Co-design environment Signal Integrity and timing Thermal analysis and stress Power Integrity and noise analysis EMC/EMI
More informationWLSI Extends Si Processing and Supports Moore s Law. Douglas Yu TSMC R&D,
WLSI Extends Si Processing and Supports Moore s Law Douglas Yu TSMC R&D, chyu@tsmc.com SiP Summit, Semicon Taiwan, Taipei, Taiwan, Sep. 9 th, 2016 Introduction Moore s Law Challenges Heterogeneous Integration
More informationAccelerating Multi-core Processor Design Space Evaluation Using Automatic Multi-threaded Workload Synthesis
Accelerating Multi-core Processor Design Space Evaluation Using Automatic Multi-threaded Workload Synthesis Clay Hughes & Tao Li Department of Electrical and Computer Engineering University of Florida
More informationAdvanced Heterogeneous Solutions for System Integration
Advanced Heterogeneous Solutions for System Integration Kees Joosse Director Sales, Israel TSMC High-Growth Applications Drive Product and Technology Smartphone Cloud Data Center IoT CAGR 12 17 20% 24%
More informationVirtualized ECC: Flexible Reliability in Memory Systems
Virtualized ECC: Flexible Reliability in Memory Systems Doe Hyun Yoon Advisor: Mattan Erez Electrical and Computer Engineering The University of Texas at Austin Motivation Reliability concerns are growing
More informationGood old days ended in Nov. 2002
WaveScalar! Good old days 2 Good old days ended in Nov. 2002 Complexity Clock scaling Area scaling 3 Chip MultiProcessors Low complexity Scalable Fast 4 CMP Problems Hard to program Not practical to scale
More informationPower and Thermal Models. for RAMP2
Power and Thermal Models for 2 Jose Renau Department of Computer Engineering, University of California Santa Cruz http://masc.cse.ucsc.edu Motivation Performance not the only first order design parameter
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,
More informationTCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks
TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks Gwangsun Kim Arm Research Hayoung Choi, John Kim KAIST High-radix Networks Dragonfly network in Cray XC30 system 1D Flattened butterfly
More informationComparison & highlight on the last 3D TSV technologies trends Romain Fraux
Comparison & highlight on the last 3D TSV technologies trends Romain Fraux Advanced Packaging & MEMS Project Manager European 3D Summit 18 20 January, 2016 Outline About System Plus Consulting 2015 3D
More informationUnderstanding Reduced-Voltage Operation in Modern DRAM Devices
Understanding Reduced-Voltage Operation in Modern DRAM Devices Experimental Characterization, Analysis, and Mechanisms Kevin Chang A. Giray Yaglikci, Saugata Ghose,Aditya Agrawal *, Niladrish Chatterjee
More informationSort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot
Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack
More informationTransparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh
Transparent Offloading and Mapping () Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O Connor, Nandita Vijaykumar,
More informationEnergy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012
Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso
More informationThe Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015
The Road to the AMD Fiji GPU Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015 Fiji Chip DETAILED LOOK 4GB High-Bandwidth Memory 4096-bit wide interface 512 GB/s
More informationChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality
ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce
More informationDEMM: a Dynamic Energy-saving mechanism for Multicore Memories
DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University
More information