Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns
1 Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns
Wen Wen, Lei Zhao, Youtao Zhang, Jun Yang
March 12, 2018
2 Executive Summary
- Problems: performance and reliability of write operations
  - Large sneak currents and IR drop in crossbar ReRAM
- Proposed solution: speed up the RESET operation based on data patterns
  - Profile the number of bitline LRS cells by exploiting the intrinsic in-memory processing capability of crossbar ReRAM
  - Use data compression and a row-address-dependent layout to reduce bitline LRS cells
- Contributions
  - Correlation between the RESET latency and the number of LRS cells on selected bitlines
  - A novel profiling technique to dynamically track bitline data patterns
- Results
  - Performance: 20.5% over baseline, 14.2% over state-of-the-art
  - Dynamic energy: 15.7% less than baseline, 7.6% less than state-of-the-art
3 Outline
- Background and Motivation
- Design Details
  - Low Overhead Runtime Profiling
  - Reduce Bitline LRS Cells
- Evaluation Methodologies
- Experimental Results
- Conclusion
5 ReRAM Cell
- Cell structure: top metal layer, metal oxide (with oxygen ions and oxygen vacancies), bottom metal layer
- Two resistance states: High Resistance State (HRS, logic 0) and Low Resistance State (LRS, logic 1)
- SET: high to low resistance; RESET: low to high resistance
[Figure: ReRAM cell structure and two resistance states]
6 ReRAM Crossbar
- ReRAM array structures
  - 1T1R structure: wordline, sourceline, bitline, ReRAM cell
  - 1D1R structure: wordline, bitline, diode, ReRAM cell
  - 0T1R structure: wordline, bitline, ReRAM cell
- Crossbar structures
  - Smallest 4F² planar cell size, low fabrication cost, and better scalability
  - Sneak currents and IR drop
7 Sneak Currents in Crossbar ReRAM
[Figure: crossbar array with the selected lines biased at V and 0 and the remaining lines at 1/2V; labels: selected cell, half-selected cells, sneak current]
- Diode selectors help but cannot eliminate sneak currents
- Sneak currents lead to a serious IR drop problem
  - This hurts energy efficiency, performance, and write reliability
- The slower RESET operation is the performance bottleneck
  - SET takes less time than RESET [Xu et al HPCA15, Zhang et al DATE16]
8 How does IR drop affect RESET latency?
$t \times e^{k V_d} = C$
- RESET latency is highly sensitive to voltage drop: the two are related by an inverse exponential
- $t$: RESET switching time; $V_d$: voltage drop; $C$ and $k$: experimentally fitted constants
- A voltage drop of 0.4V results in a 10x RESET latency increase [Govoreanu et al IEDM11]
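The sensitivity quoted on this slide can be checked with a few lines of arithmetic. The sketch below assumes the relation has the form t × e^(k·V_d) = C, that V_d is the voltage that actually reaches the cell (so an extra IR drop lowers V_d by the same amount), and that the "0.4V gives 10x" figure lies on this fitted curve. The constant `K` and the function name are illustrative and do not come from the slides.

```python
import math

# Assumption: losing 0.4 V at the cell multiplies RESET latency by 10,
# so exp(K * 0.4) = 10 and K = ln(10) / 0.4 (per volt).
K = math.log(10) / 0.4

def reset_latency_scale(extra_drop_v):
    """Latency multiplier caused by an extra IR drop of extra_drop_v volts.

    From t * exp(K * V_d) = C, lowering V_d by dV scales t by exp(K * dV).
    """
    return math.exp(K * extra_drop_v)

for dv in (0.0, 0.1, 0.2, 0.4):
    print(f"extra drop {dv:.1f} V -> latency x{reset_latency_scale(dv):.2f}")
```

Under these assumptions, even a 0.1V change in IR drop moves the RESET latency by roughly 1.8x, which is why tracking the bitline data pattern can pay off.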
9 Intuitive Thoughts
Facts
1. During RESET, half-selected cells behave as resistive devices.
2. Under the same voltage stress, a half-selected cell in LRS carries a larger sneak current than one in HRS.
3. RESET operations conservatively use the worst-case access latency of all cells in ReRAM arrays.
Observations
1. The number of half-selected cells in LRS affects RESET latency.
2. Dynamically profiling and tracking runtime data patterns may avoid using the worst-case RESET latency for all cells.
TODO List
1. Explore the correlation between RESET latency and the number of bitline LRS cells.
2. Build a runtime profiler to dynamically track bitline data patterns.
3. Reduce the number of LRS cells on bitlines.
10 RESET latency vs. # of LRS cells
[Figure: RESET latency (ns) and voltage drop (V) plotted against bitline LRS cell percentage]
The more LRS cells there are on the bitline, the larger the IR drop caused by the sneak current, and the longer the RESET operation takes.
11 RESET latency vs. # of LRS cells
This impact diminishes as the row gets closer to the write driver.
13 Low Overhead Runtime Profiling (1)
[Figure: Mat0, Mat1, ..., Mat63 form one bitline-sharing-set. Each mat has wordline decoders (rows a0, a1) and a shared ADC & comparator used for profiling, which outputs a 3-bit value. Labels: 64B cacheline a0, 64B cacheline a1.]
- Update: W-Flag = MAX{64 3-bit values}
- The worst-case bitline data pattern within one bitline-sharing-set determines the optimal RESET latency
14 Low Overhead Runtime Profiling (1), continued
[Figure: the same bitline-sharing-set with the profiling path expanded. Dot-product operation: Vread is applied to rows a0 (Ron) and a1 (Roff), giving I1 = Vread/Ron and I2 = Vread/Roff, which sum on the bitline as I = I1 + I2. The current passes through a transmission gate, mux, and S/H stages to a shared ADC and shared comparators, together with inputs from other columns, and yields a 3-bit counter value.]
15 Low Overhead Runtime Profiling (2)
[Figure: dot-product operation. Vread is applied to cells with resistances Ron and Roff, giving I1 = Vread/Ron and I2 = Vread/Roff, which sum on the bitline as I = I1 + I2. The summed current passes through a transmission gate, mux, and sample-and-hold (S/H) stages to a shared ADC and shared comparators, together with inputs from other columns, and yields a 3-bit counter value.]
[Figure: current (I/mA) versus LRS cell percentage (LRS Cell/%), mapping current to a 3-bit value of LRS percentage; bands are labeled with counter values (legible labels: 010, 011, 100, 101, 111) and a safeguarding area is marked]
- Aggregated currents are first converted into digital counters, which represent LRS cell percentages.
- W-Flag is updated by comparing all counters in one bitline-sharing-set to find the worst-case bitline.
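The current-to-counter conversion on this slide can be sketched in software as follows. This is only an illustrative model, not the authors' circuit: the resistance values `R_ON` and `R_OFF`, the 512 cells per bitline, and the eight equal-width bands are assumptions, and the safeguarding area shown in the figure is ignored. The read voltage of 1.5V is taken from the evaluation slide.

```python
V_READ = 1.5   # volts; the evaluation slide lists reads at 1.5V
R_ON = 1e4     # ohms, LRS resistance (assumed for illustration)
R_OFF = 1e6    # ohms, HRS resistance (assumed for illustration)

def bitline_current(cells):
    """Analog sum of cell currents on one bitline; 1 = LRS, 0 = HRS."""
    return sum(V_READ / (R_ON if c else R_OFF) for c in cells)

def to_counter(current, n_cells):
    """Quantize a bitline current into a 3-bit LRS-percentage counter.

    Assumes eight equal-width bands between the all-HRS and all-LRS currents.
    """
    i_min = n_cells * V_READ / R_OFF
    i_max = n_cells * V_READ / R_ON
    frac = (current - i_min) / (i_max - i_min)
    return max(0, min(7, int(frac * 8)))

def w_flag(counters):
    """Worst-case (largest) counter within one bitline-sharing-set."""
    return max(counters)

def bitline(n_lrs, n_cells=512):
    """Hypothetical bitline holding n_lrs LRS cells out of n_cells."""
    return [1] * n_lrs + [0] * (n_cells - n_lrs)

counters = [to_counter(bitline_current(bitline(k)), 512) for k in (0, 100, 300, 500)]
print(counters, w_flag(counters))
```

Because the summed current grows linearly with the number of LRS cells, a single analog read per bitline is enough to estimate the LRS percentage without reading the cells one by one.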
16 Row Address Impact
- So far we have talked about how to profile bitline data patterns, but we have not yet exploited the impact of row address on RESET latency.
- Rows with different addresses are mapped to 8 groups with different worst-case RESET latencies.
[Figure: mat with wordline decoders and write drivers & SA; row address groups #0 (rows 0-63) through #7 (the ranges for groups #1 to #7 were lost in transcription); an arrow labeled "RESET latency decrease"]
17 A Summary of the Profiling Technique
- Finding the worst-case bitline: 3-bit W-Flag
  - Records the percentage of LRS cells in the worst-case bitline
  - Detected periodically in each mat
- Tracking the worst case: 6-bit W-Cnt
  - W-Cnt is cleared when W-Flag is updated
  - W-Cnt is incremented on each write
  - W-Cnt overflow triggers an increment of W-Flag
- RESET latency optimization
  - W-Flag, W-Cnt, and row address are used to determine tWR for RESET
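The W-Flag and W-Cnt bookkeeping summarized above might look like the following sketch. The class and method names are hypothetical. Starting at the most conservative flag value and saturating W-Flag at 7 are assumptions that the slides do not state.

```python
class BitlineSetTracker:
    """Per bitline-sharing-set state: 3-bit W-Flag and 6-bit W-Cnt."""

    W_FLAG_MAX = 7    # largest 3-bit value
    W_CNT_MAX = 63    # largest 6-bit value

    def __init__(self):
        # Assumption: start from the worst case until the first profile runs.
        self.w_flag = self.W_FLAG_MAX
        self.w_cnt = 0

    def on_profile(self, counters):
        """Periodic profiling: take the worst counter and clear W-Cnt."""
        self.w_flag = max(counters)
        self.w_cnt = 0

    def on_write(self):
        """Each write bumps W-Cnt; an overflow bumps W-Flag."""
        if self.w_cnt == self.W_CNT_MAX:
            self.w_cnt = 0
            # Assumption: W-Flag saturates instead of wrapping around.
            self.w_flag = min(self.W_FLAG_MAX, self.w_flag + 1)
        else:
            self.w_cnt += 1

tracker = BitlineSetTracker()
tracker.on_profile([2, 5, 3])
for _ in range(64):
    tracker.on_write()
print(tracker.w_flag, tracker.w_cnt)
```

One way to read the 6-bit width: 64 writes can add at most 64 LRS cells to any single bitline, so between two profiling passes the flag only needs to move up one step per counter overflow.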
18 Determine RESET Timing (ns)
- RESET latency depends on bitline data patterns (W-Flag, W-Cnt) and row address.
[Table: RESET timing in ns, indexed by row address group (derived from the row address) and LRS cell ratio (derived from W-Flag and W-Cnt). The table entries did not survive transcription; only the fragments 142 and 140.9 remain.]
- Callout on the table: the LRS ratio lies between a lower bound (lost in transcription) and 75%, and does not exceed a further bound (also lost) before W-Cnt overflows.
- RESET latency conservatively uses the upper-limit value, which is the worst case in the next LRS cell ratio range.
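A table lookup of this kind could be sketched as below. Several details are assumptions: equal-sized groups of 64 rows (only group #0, rows 0-63, is legible in the transcript), one LRS-ratio range per W-Flag value, and reading "the worst case in the next range" as stepping one range up. The latency values in `dummy` are placeholders chosen only to make the indexing visible; they are not the values from the slide.

```python
ROWS_PER_GROUP = 64   # group #0 covers rows 0-63; equal-sized groups are assumed
N_GROUPS = 8
N_RANGES = 8          # assumed: one LRS-ratio range per 3-bit W-Flag value

def row_group(row):
    """Map a row address to its row address group."""
    return min(row // ROWS_PER_GROUP, N_GROUPS - 1)

def reset_latency(table, row, w_flag):
    """Pick tWR from table[group][range], one range up to stay conservative."""
    rng = min(w_flag + 1, N_RANGES - 1)
    return table[row_group(row)][rng]

# Placeholder table: entry = 10 * group + range. Not real latencies.
dummy = [[float(10 * g + r) for r in range(N_RANGES)] for g in range(N_GROUPS)]

print(reset_latency(dummy, row=130, w_flag=5))
```

With real data, `table` would hold the per-group, per-range worst-case RESET latencies obtained from circuit simulation.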
21 Reduce Bitline LRS Cells
- Naïve data compression cannot help, since:
  - The RESET latency depends on the worst case of all 512 bitlines
  - Naïve data compression may not help the worst-case bitline
- Row-address biased data layout
  - Evenly distributes the extra 0s after compression
[Figure: original data before compression; compressed data before shifting; compressed data after n-bit shifting (labels 0b through 7b); compressed data layout after shifting in a ReRAM mat and across bitline-sharing-sets]
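The effect of a row-address biased layout can be seen in a toy model. The sketch below uses an 8-bit-wide mat and rotates each compressed row by its row address. The rotation rule, the width, and the all-ones payload are assumptions made for illustration; the slides only state that the extra 0s are distributed evenly.

```python
WIDTH = 8  # bits per row in this toy mat (the slide's mats have 512 bitlines)

def place(payload, row, shift=True):
    """Lay out compressed bits in a WIDTH-bit row, padding with 0s (HRS)."""
    line = payload + [0] * (WIDTH - len(payload))
    k = row % WIDTH if shift else 0   # assumed rule: rotate right by row address
    return line[-k:] + line[:-k] if k else line

def worst_bitline(rows):
    """Largest number of LRS cells (1s) found on any bitline (column)."""
    return max(sum(col) for col in zip(*rows))

payload = [1, 1, 1, 1]   # 8 bits compressed to 4, worst case of all 1s
naive = [place(payload, r, shift=False) for r in range(8)]
shifted = [place(payload, r) for r in range(8)]

print(worst_bitline(naive), worst_bitline(shifted))
```

Without shifting, the compressed bits of every row land on the same bitlines, so the worst-case bitline is no better than before compression. With the rotation, the freed 0s spread across all bitlines. A read would undo the rotation using the same row address.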
23 Experimental Methodologies
- Chip-multiprocessor system simulator
  - CPU, multi-level cache, and ReRAM memory
  - ReRAM performance/energy numbers from HSPICE and NVSim
- Memory: 8GB, 1 channel, 2 ranks, 8 chips/rank, 2Gb x8 ReRAM chip, 8 banks/chip, 1024 mats/bank
- ReRAM timing: Read: 18ns@1.5V; SET: 10ns@3V; RESET: dynamically determined@-3V, 88µA
- Workloads
  - SPEC2006, BioBench, and PARSEC
  - WPKI/RPKI: High, Medium, Low
- Compared schemes
  - BL: Conventional ReRAM crossbar design with DSGB
  - RA: The state-of-the-art row address awareness technique [Zhang et al DATE16]
  - LRS: Naïve data pattern profiling technique
  - CMP: LRS + data compression + row-address biased shifting technique
  - ALL: Proposed design with all enhancements
25 Memory Write Latency
[Figure: normalized memory write latency of BL, RA, LRS, CMP, and ALL for ferret, fasta, gems, zeus, gcc, cactus, perl, freq, gobmk, fluid, and the mean, grouped into High, Medium, and Low; callouts: 63%, 54%]
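26 Memory Read Latency
[Figure: normalized memory read latency of BL, RA, LRS, CMP, and ALL for ferret, fasta, gems, zeus, gcc, cactus, perl, freq, gobmk, fluid, and the mean, grouped into High, Medium, and Low; callouts: 38% and one value lost in transcription]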
27 System Performance
[Figure: normalized cycles per instruction (CPI; the lower, the better) of BL, RA, LRS, CMP, and ALL for ferret, fasta, gems, zeus, gcc, cactus, perl, freq, gobmk, fluid, and the mean, grouped into High, Medium, and Low; callouts: 21% and one value lost in transcription]
- Larger performance improvements on benchmarks with high memory intensity
28 Memory Dynamic Energy and EDP
[Figure: normalized dynamic energy (broken down into write, read, and profiling) and normalized EDP of BL, RA, LRS, CMP, and ALL for ferret, fasta, gems, zeus, gcc, cactus, perl, freq, gobmk, fluid, and the mean, grouped into High, Medium, and Low]
- 15.7% and 7.6% dynamic energy reduction, and 31.9% and 19.5% EDP improvement, over baseline and state-of-the-art, respectively.
30 Conclusion
- Problems: performance and reliability of write operations
  - Large sneak currents and IR drop in crossbar ReRAM
- Proposed solution: speed up the RESET operation based on data patterns
  - Profile the number of bitline LRS cells by exploiting the intrinsic in-memory processing capability of crossbar ReRAM
  - Use data compression and a row-address-dependent layout to reduce bitline LRS cells
- Contributions
  - Correlation between the RESET latency and the number of LRS cells on selected bitlines
  - A novel profiling technique to dynamically track bitline data patterns
- Results
  - Performance: 20.5% over baseline, 14.2% over state-of-the-art
  - Dynamic energy: 15.7% less than baseline, 7.6% less than state-of-the-art
31 Thank you!
33 Backup Slides
34 Prior Art
- Performance of the RESET operation
  - Double-sided ground biasing and multi-phase RESET [Xu et al HPCA15]
  - Row address dependent logical regions [Zhang et al DATE16]
  - No studies on the impact of bitline data patterns
- Current accumulation feature of the ReRAM crossbar
  - Using the ReRAM crossbar to implement analog dot-product calculations [Bojnordi et al HPCA16, Chi et al ISCA16, Shafiee et al ISCA16, Song et al HPCA17]
  - Few studies on exploiting this feature to accelerate memory access
- Data patterns in the ReRAM crossbar
  - Correlation between voltage drop and data pattern in the ReRAM crossbar [Liang et al TED2010]
  - Correlation between read latency and bitline data pattern [Zhang et al JETCAS15]
  - Correlation between RESET latency and the number of RESET bits [Xu et al HPCA15]
  - No observations on the correlation between RESET latency and bitline data pattern
35 RESET latency vs. # of LRS cells
- Build and simulate a crossbar ReRAM HSPICE model
- Key parameters are summarized below
Metric | Description | Value
A | Mat size: A wordlines × A bitlines | (lost in transcription)
n | Number of bits to read/write | 8
Rwire | Wire resistance between adjacent cells | 2.82Ω
Kr | Nonlinearity of the selector | 200
Vw | Full selected voltage during write | 3V
- | Voltage biasing scheme | DSGB
36 Overhead Analysis
- Profiling power consumption and area estimated with HSPICE and NVSim at 32nm
- Profiling overhead is small: 3.7x read energy, negligible area overhead
- Counter storage and RESET adjustment
  - 3-bit W-Flag and 6-bit W-Cnt for each bitline-sharing-set
  - 288KB storage overhead of all flags for an 8GB memory
- Profiling overhead of one bank is summarized below
[Table: columns Component, Parameters, Specification, Power/Energy, Area (mm²). Surviving entries: ADC with sampling speed 1.28GS/s, resolution 8-bit, number 8; a power value of 24.48mW whose row is not recoverable; S+H with a number and a power in µW; ReRAM array with mat number, mat size, profiling energy (pJ), read energy (pJ), and leakage (mW). The remaining values were lost in transcription.]
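The storage figure on this slide can be sanity-checked by working backward from the quoted numbers. The sketch assumes binary units (1KB = 1024 bytes, 1GB = 2^30 bytes); the variable names are illustrative.

```python
FLAG_BITS = 3 + 6                     # W-Flag plus W-Cnt per bitline-sharing-set
storage_bits = 288 * 1024 * 8         # 288KB of flag storage (binary KB assumed)
memory_bits = 8 * 2**30 * 8           # 8GB of ReRAM (binary GB assumed)

n_sets = storage_bits // FLAG_BITS    # number of bitline-sharing-sets implied
bits_per_set = memory_bits // n_sets  # data bits covered by one flag pair
overhead = storage_bits / memory_bits # flag storage relative to capacity

print(n_sets, bits_per_set, f"{overhead:.6%}")
```

Under these assumptions the 288KB figure corresponds to 262,144 flag pairs, each covering 32KB of data, for a storage overhead of roughly 0.003% of the memory capacity.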
37 Sensitivity Study
[Figure: normalized CPI and normalized dynamic energy of BL, RA, LRS, CMP, and ALL; one panel shows sensitivity to the number of ADC units (8 and 16 ADC units), the other shows sensitivity to mat size (axis labels partly lost: "256x", "x512"); callout: 14%]
- Doubling the number of ADC units brings only a trivial improvement
  - Only a 1.1% performance improvement observed
- The proposed scheme is slightly worse (by only 1.6%) than RA for the 256 x 256 mat size
  - Profiling overhead is independent of mat size, so it has a larger impact on smaller mats
More informationFrequent Value Compression in Packet-based NoC Architectures
Frequent Value Compression in Packet-based NoC Architectures Ping Zhou, BoZhao, YuDu, YiXu, Youtao Zhang, Jun Yang, Li Zhao ECE Department CS Department University of Pittsburgh University of Pittsburgh
More information+1 (479)
Memory Courtesy of Dr. Daehyun Lim@WSU, Dr. Harris@HMC, Dr. Shmuel Wimer@BIU and Dr. Choi@PSU http://csce.uark.edu +1 (479) 575-6043 yrpeng@uark.edu Memory Arrays Memory Arrays Random Access Memory Serial
More informationViews of Memory. Real machines have limited amounts of memory. Programmer doesn t want to be bothered. 640KB? A few GB? (This laptop = 2GB)
CS6290 Memory Views of Memory Real machines have limited amounts of memory 640KB? A few GB? (This laptop = 2GB) Programmer doesn t want to be bothered Do you think, oh, this computer only has 128MB so
More informationResource-Conscious Scheduling for Energy Efficiency on Multicore Processors
Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe
More informationAn Energy-Efficient High-Performance Deep-Submicron Instruction Cache
An Energy-Efficient High-Performance Deep-Submicron Instruction Cache Michael D. Powell ϒ, Se-Hyun Yang β1, Babak Falsafi β1,kaushikroy ϒ, and T. N. Vijaykumar ϒ ϒ School of Electrical and Computer Engineering
More informationHeatWatch Yixin Luo Saugata Ghose Yu Cai Erich F. Haratsch Onur Mutlu
HeatWatch Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness Yixin Luo Saugata Ghose Yu Cai Erich F. Haratsch Onur Mutlu Storage Technology Drivers
More informationStaged Memory Scheduling
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:
More informationGHz Asynchronous SRAM in 65nm. Jonathan Dama, Andrew Lines Fulcrum Microsystems
GHz Asynchronous SRAM in 65nm Jonathan Dama, Andrew Lines Fulcrum Microsystems Context Three Generations in Production, including: Lowest latency 24-port 10G L2 Ethernet Switch Lowest Latency 24-port 10G
More informationEfficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories
Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu, Norm Jouppi* Carnegie Mellon University * Hewlett-Packard
More informationComputer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:
More informationFault Injection Attacks on Emerging Non-Volatile Memories
Lab of Green and Secure Integrated Circuit Systems (LOGICS) Fault Injection Attacks on Emerging Non-Volatile Memories Mohammad Nasim Imtiaz Khan and Swaroop Ghosh School of Electrical Engineering and Computer
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationLow-Complexity Reorder Buffer Architecture*
Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower
More informationIntegrated CPU and Cache Power Management in Multiple Clock Domain Processors
Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC
More informationSEESAW: Set Enhanced Superpage Aware caching
SEESAW: Set Enhanced Superpage Aware caching http://synergy.ece.gatech.edu/ Set Associativity Mayank Parasar, Abhishek Bhattacharjee Ω, Tushar Krishna School of Electrical and Computer Engineering Georgia
More informationAn Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization
An Energy-Efficient Asymmetric Multi-Processor for HP Virtualization hung Lee and Peter Strazdins*, omputer Systems Group, Research School of omputer Science, The Australian National University (slides
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory / DRAM SRAM = Static RAM SRAM vs. DRAM As long as power is present, data is retained DRAM = Dynamic RAM If you don t do anything, you lose the data SRAM: 6T per bit
More informationA 19.4 nj/decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array. Mingu Kang, Sujan Gonugondla, Naresh Shanbhag
A 19.4 nj/decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array Mingu Kang, Sujan Gonugondla, Naresh Shanbhag University of Illinois at Urbana Champaign Machine Learning under Resource
More informationExploring High Bandwidth Pipelined Cache Architecture for Scaled Technology
Exploring High Bandwidth Pipelined Cache Architecture for Scaled Technology Amit Agarwal, Kaushik Roy, and T. N. Vijaykumar Electrical & Computer Engineering Purdue University, West Lafayette, IN 4797,
More informationAddendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches
Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com
More informationCS3350B Computer Architecture CPU Performance and Profiling
CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada
More informationTing Wu, Chi-Ying Tsui, Mounir Hamdi Hong Kong University of Science & Technology Hong Kong SAR, China
CMOS Crossbar Ting Wu, Chi-Ying Tsui, Mounir Hamdi Hong Kong University of Science & Technology Hong Kong SAR, China OUTLINE Motivations Problems of Designing Large Crossbar Our Approach - Pipelined MUX
More informationA Write-Back-Free 2T1D Embedded. a Dual-Row-Access Low Power Mode.
A Write-Back-Free 2T1D Embedded DRAM with Local Voltage Sensing and a Dual-Row-Access Low Power Mode Wei Zhang, Ki Chul Chun, Chris H. Kim University of Minnesota, Minneapolis, MN zhang758@umn.edu Outline
More informationTowards Energy-Proportional Datacenter Memory with Mobile DRAM
Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University
More information18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013
18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,
More informationMemory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology
Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast
More informationA DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU
A DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU PRESENTED BY ROMAN SHOR Overview Technics of data reduction in storage systems:
More informationIMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM
IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information
More informationEnabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM
Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM Seyed Mohammad Seyedzadeh, Alex K. Jones, Rami Melhem Computer Science Department, Electrical and Computer Engineering
More informationEnabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM
Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM Seyed Mohammad Seyedzadeh, Alex K. Jones, Rami Melhem Computer Science Department, Electrical and Computer Engineering
More information