DRISA: A DRAM-based Reconfigurable In-Situ Accelerator

1 DRISA: A DRAM-based Reconfigurable In-Situ Accelerator. Shuangchen Li, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, Yuan Xie. University of California, Santa Barbara; Memory Solutions Lab, Samsung Semiconductor Inc. Scalable and Energy-efficient Architecture Lab (SEAL), SEAL@UCSB

6 Motivation and Observation. Merging the computing resources and memory fabrics. [Figure: normalized on-chip memory capacity per area (y-axis, 1E+00 to 1E+03) versus normalized peak performance per area (x-axis, 1E+00 to 1E+04). Plotted designs: memory-rich processors (DaDianNao and ShiDianNao ASICs, TITAN X GPU); compute-capable memory (PIM: BufferedComp, NeuroCube); this work.] Memory-rich processor: low memory capacity. Compute-capable memory: low performance. To have both: (1) use DRAM technology; (2) remove system-memory constraints. Building an accelerator with DRAM technology.

12 Key Ideas and Approaches. To have both: (1) use DRAM technology; (2) remove system-memory constraints. Building an accelerator with DRAM technology. Challenge: DRAM technology is logic-incompatible, so only simple Boolean logic operations are available. Goals: general purpose, high performance, reconfigurable. Approaches: compute in the cells and bitlines (NOR) and add shifters (SHIFT); improve parallelism (multi-subarray active, multi-bank active); unblock data movement; optimize activation.

19 Architecture Overview. [Figure: (a) chip, organized as groups of banks; (b) bank, showing mats, subarrays, and bctrl; (c) subarray and mat, showing sctrl, DRAM cells that support Boolean logic operations, and a shifter.] DRAM modifications: change decoders to controllers; change to support logic operations; add shifters. Others: group/bank buffers help internal data transfer; bank/subarray reorganization; split cell-array regions.

23 Make BL Be Able To Compute (1/2). Three solutions: (1) 3T1C: natural NOR on the BL; (2) 1T1C: adds gates or adopts Ambit's methods; (3) 1T1C-adder: adds full adders to the BL. [Figure: three circuit options. 3T1C-NOR: rows Rs, Rt, Rr with wbl, rbl, wwl, and rwl lines. 1T1C-NOR/MIX: rows Rs, Rt, Rr with a logic gate and latch, or an AND/OR variant with pre-load. 1T1C-ADDER: rows Rs, Rt, Rr with latches and an n-bit adder.]

34 Make BL Be Able To Compute (2/2). Example: selector R = (S == 1) ? X : Y, that is, $R = S X + \bar{S} Y$. NOR-only logic, written out to match the seven steps: $R = \mathrm{NOR}(0, \mathrm{NOR}(\mathrm{NOR}(\bar{S}, \bar{X}), \mathrm{NOR}(S, \bar{Y})))$. Step 1: $\bar{X} = \mathrm{NOR}(0, X)$. Step 2: $\bar{Y} = \mathrm{NOR}(0, Y)$. Step 3: $\bar{S} = \mathrm{NOR}(0, S)$. Step 4: $tmp_1 = \mathrm{NOR}(\bar{S}, \bar{X})$. Step 5: $tmp_2 = \mathrm{NOR}(S, \bar{Y})$. Step 6: $\bar{R} = \mathrm{NOR}(tmp_1, tmp_2)$. Step 7: $R = \mathrm{NOR}(0, \bar{R})$. [Figure: rows held on the bitline as the steps proceed: X, Y, S, !X, !Y, !S, !(!X+!S), !(!Y+S), !R, R.]
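The seven NOR steps of the selector example can be checked with a small software model. This is only a sketch: the `nor` and `select` helpers are hypothetical names introduced here, and a Python integer stands in for a row of bitlines (one bit per bitline); none of this is a DRISA interface.

```python
def nor(a: int, b: int, width: int = 8) -> int:
    """Bitwise NOR across `width` bitlines (hypothetical helper, not a DRISA API)."""
    mask = (1 << width) - 1
    return ~(a | b) & mask


def select(s: int, x: int, y: int, width: int = 8) -> int:
    """Per-bit selector R = S ? X : Y, built only from NOR in seven steps."""
    not_x = nor(0, x, width)         # Step 1: !X
    not_y = nor(0, y, width)         # Step 2: !Y
    not_s = nor(0, s, width)         # Step 3: !S
    tmp1 = nor(not_s, not_x, width)  # Step 4: equals S AND X
    tmp2 = nor(s, not_y, width)      # Step 5: equals !S AND Y
    not_r = nor(tmp1, tmp2, width)   # Step 6: !R
    return nor(0, not_r, width)      # Step 7: R


if __name__ == "__main__":
    # Upper four bitlines select X, lower four select Y.
    print(format(select(0b11110000, 0b10101010, 0b01010101), "08b"))  # prints 10100101
```

Every step is one NOR of two rows, so the whole selector costs seven row operations regardless of how many bitlines take part.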

39 Shifters (1/2). Why include shifters: e.g., carry-in propagation. [Figure: operand bits X1, Y1 and X0, Y0 on neighboring bitlines; bit 0 takes C_in0 and produces S0 and C_out0; C_out0 then has to reach the bit-1 bitline as C_in1.]
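A software analogy may make the carry-propagation point concrete. In the sketch below, each bit position of a Python integer plays the role of one bitline; the per-bitline sum and carry-out are computed in place, and the only cross-bitline movement is the one-position shift of the carry. The function name is hypothetical, and XOR/AND are used directly for brevity although the hardware would compose them from NOR.

```python
def add_with_shift(x: int, y: int, width: int = 8) -> int:
    """Add two `width`-bit values using only per-bitline logic plus a shift.

    Illustrative model only: bit i of each integer stands for bitline i.
    """
    mask = (1 << width) - 1
    total, carry = x & mask, y & mask
    while carry:
        partial = total ^ carry          # per-bitline sum, ignoring carries
        carry_out = total & carry        # per-bitline carry-out
        carry = (carry_out << 1) & mask  # SHIFT: C_out of bit i becomes C_in of bit i+1
        total = partial
    return total


if __name__ == "__main__":
    print(add_with_shift(0b0011, 0b0001, width=4))  # prints 4
```

Without the shift, a carry-out would stay on the bitline that produced it and could never influence the next bit position.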

43 Shifters (2/2). Multiple hierarchies: intra-lane: bit shift inside an 8-bit lane; inter-lane: array element shift; forwarding: access any element in the array. [Figure: bitlines grouped into virtual lanes (INT8).]
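The three shifter levels can be pictured with a row modeled as a list of INT8 lane values. This is a sketch under assumptions: the function names are hypothetical, the shift directions are arbitrary, and `forward` shows just one possible reading of "access any element in the array" (copying one lane's element to another lane).

```python
from __future__ import annotations

LANE_BITS = 8                    # one virtual lane holds an INT8 element, as on the slide
LANE_MASK = (1 << LANE_BITS) - 1


def intra_lane_shift(row: list[int], k: int) -> list[int]:
    """Shift every lane left by k bits; bits never cross a lane boundary."""
    return [(value << k) & LANE_MASK for value in row]


def inter_lane_shift(row: list[int], n: int, fill: int = 0) -> list[int]:
    """Move whole elements n lanes toward higher indices, padding with `fill`."""
    n = min(n, len(row))
    return [fill] * n + row[: len(row) - n]


def forward(row: list[int], src: int, dst: int) -> list[int]:
    """Copy the element in lane `src` into lane `dst` (assumed meaning of forwarding)."""
    out = list(row)
    out[dst] = row[src]
    return out


if __name__ == "__main__":
    row = [0x81, 0x02, 0x03, 0x04]
    print(intra_lane_shift(row, 1))  # [2, 4, 6, 8]
    print(inter_lane_shift(row, 1))  # [0, 129, 2, 3]
    print(forward(row, 3, 0))        # [4, 2, 3, 4]
```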

45 Putting Compute-capable BLs and Shifters Together. [Figure: cycles versus operand bit length for CSA and FA; cycles versus operand-1 bit length with operand-2 bit length fixed at 2 bits.] Observations: CSA is preferred: reduction works fine. Affordable MUL: one operand needs to be within 2 bits.
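Assuming the adder preferred on this slide is a carry-save adder (CSA), the reason reductions suit it can be sketched as follows: a 3:2 carry-save step turns three addends into a sum word and a carry word using only bitwise logic and a one-bit shift, so no carry has to ripple until the very last addition. The function names are hypothetical and the model is plain Python on non-negative integers, not the in-DRAM implementation.

```python
from __future__ import annotations


def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """One 3:2 carry-save step; a + b + c == partial_sum + carry."""
    partial_sum = a ^ b ^ c                     # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, moved up one position
    return partial_sum, carry


def reduce_sum(values: list[int]) -> int:
    """Sum many non-negative addends, propagating carries only once at the end."""
    operands = list(values)
    while len(operands) > 2:
        s, c = csa(operands.pop(), operands.pop(), operands.pop())
        operands += [s, c]
    return sum(operands)  # the single carry-propagating addition


if __name__ == "__main__":
    print(reduce_sum([3, 5, 7, 9, 11]))  # prints 35
```

Each carry-save step removes one operand from the list, so summing n addends takes n - 2 such steps plus one final full addition.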

51 Optimizations for high performance. DRAM technology is logic-incompatible: simple Boolean logic, run serially. Goals: general purpose, high performance, reconfigurable. Adopting commodity DRAM: 13 cycles for an 8-bit CSA, with tRC = 46 ns. [Figure: normalized on-chip memory capacity per area versus normalized peak performance per area, marking the un-optimized design, compute-capable memory (PIM), memory-rich processors, and the target.] Optimizations: improve parallelism, unblock data movement, optimize activation.
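A back-of-the-envelope calculation shows why the un-optimized design is slow. The sketch assumes that each of the 13 cycles costs one full row cycle time tRC, which the slide suggests but does not state outright; the `lanes` parameter and both function names are hypothetical.

```python
T_RC_NS = 46         # row cycle time quoted on the slide
CYCLES_PER_ADD = 13  # cycles for one 8-bit addition quoted on the slide


def add_latency_ns(cycles: int = CYCLES_PER_ADD, t_rc_ns: int = T_RC_NS) -> int:
    """Latency of one addition, assuming every cycle takes one tRC."""
    return cycles * t_rc_ns


def adds_per_second(lanes: int) -> float:
    """Throughput if `lanes` additions proceed in parallel (hypothetical parameter)."""
    return lanes * 1e9 / add_latency_ns()


if __name__ == "__main__":
    print(add_latency_ns())  # prints 598
```

Under that assumption a single 8-bit addition takes roughly 0.6 microseconds, so throughput has to come from running very many lanes at once, which is what the parallelism and activation optimizations aim at.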

52 Experiment Setup. DRISA circuit simulator: heavily modified CACTI; digital circuits (controller, logic gates) from Design Compiler synthesis, scaled to the DRAM process with 20% performance overhead and 80% area overhead (ISCAS 99). DRISA performance simulator: a behavior-level simulator that includes a mapping optimization framework. [Figure: simulation flow. Inputs: NN topology, mapping scheme, design options (number of mats/subarrays/banks), device parameters. The performance simulator (in-house) reports latency/cycles; the circuit simulator (Design Compiler + CACTI-3DD) reports circuits, power per operation, speed, power, area, and leakage.]

60 Binary weight, 8-bit activation CNN inference case study. [Figure: performance per area (frames/s/mm2, log scale from 1E-02 to 1E+02) for 3T1C, 1T1C-nor, 1T1C-mixed, 1T1C-adder, and GPU-INT on AlexNet, VGG-16, VGG-19, ResNet-152, and GM.] 3T1C is not good: the lowest area overhead, but large memory cells. 1T1C-adder is not the best: the best peak performance, but low effective performance. 1T1C-mixed is the best solution.

61 More in the paper: microarchitectures of BL-logic operations and the shifter; interface design; optimizations for high performance; impact of variation; CNN mapping and optimizations; detailed experiment setup and more results.

62 Summary. In-situ computing: building an accelerator with DRAM technology. DRAM for large memory capacity. BL-computing logic design plus shifters for general-purpose instructions. Optimized for high computing performance. Experiments on binary CNN acceleration: performance per area is 8.8x that of the ASIC and 7.7x that of the GPU; energy efficiency per area is 1.2x that of the ASIC and 15x that of the GPU.

63 Questions?


More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and

More information

NISC Application and Advantages

NISC Application and Advantages NISC Application and Advantages Daniel D. Gajski Mehrdad Reshadi Center for Embedded Computer Systems University of California, Irvine Irvine, CA 92697-3425, USA {gajski, reshadi}@cecs.uci.edu CECS Technical

More information

HIERARCHICAL DESIGN. RTL Hardware Design by P. Chu. Chapter 13 1

HIERARCHICAL DESIGN. RTL Hardware Design by P. Chu. Chapter 13 1 HIERARCHICAL DESIGN Chapter 13 1 Outline 1. Introduction 2. Components 3. Generics 4. Configuration 5. Other supporting constructs Chapter 13 2 1. Introduction How to deal with 1M gates or more? Hierarchical

More information

Outline HIERARCHICAL DESIGN. 1. Introduction. Benefits of hierarchical design

Outline HIERARCHICAL DESIGN. 1. Introduction. Benefits of hierarchical design Outline HIERARCHICAL DESIGN 1. Introduction 2. Components 3. Generics 4. Configuration 5. Other supporting constructs Chapter 13 1 Chapter 13 2 1. Introduction How to deal with 1M gates or more? Hierarchical

More information

Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns

Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns March 12, 2018 Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns Wen Wen Lei Zhao, Youtao Zhang, Jun Yang Executive Summary Problems: performance and reliability of write operations

More information

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Couture: Tailoring STT-MRAM for Persistent Main Memory. Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung

Couture: Tailoring STT-MRAM for Persistent Main Memory. Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung Couture: Tailoring STT-MRAM for Persistent Main Memory Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung Executive Summary Motivation: DRAM plays an instrumental role in modern

More information

Introduction to Semiconductor Memory Dr. Lynn Fuller Webpage:

Introduction to Semiconductor Memory Dr. Lynn Fuller Webpage: ROCHESTER INSTITUTE OF TECHNOLOGY MICROELECTRONIC ENGINEERING Introduction to Semiconductor Memory Webpage: http://people.rit.edu/lffeee 82 Lomb Memorial Drive Rochester, NY 14623-5604 Tel (585) 475-2035

More information

CMOS Logic Circuit Design Link( リンク ): センター教官講義ノートの下 CMOS 論理回路設計

CMOS Logic Circuit Design   Link( リンク ): センター教官講義ノートの下 CMOS 論理回路設計 CMOS Logic Circuit Design http://www.rcns.hiroshima-u.ac.jp Link( リンク ): センター教官講義ノートの下 CMOS 論理回路設計 Memory Circuits (Part 1) Overview of Memory Types Memory with Address-Based Access Principle of Data Access

More information

General-purpose Reconfigurable Functional Cache architecture. Rajesh Ramanujam. A thesis submitted to the graduate faculty

General-purpose Reconfigurable Functional Cache architecture. Rajesh Ramanujam. A thesis submitted to the graduate faculty General-purpose Reconfigurable Functional Cache architecture by Rajesh Ramanujam A thesis submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

More information

Spiral 1 / Unit 4 Verilog HDL. Digital Circuit Design Steps. Digital Circuit Design OVERVIEW. Mark Redekopp. Description. Verification.

Spiral 1 / Unit 4 Verilog HDL. Digital Circuit Design Steps. Digital Circuit Design OVERVIEW. Mark Redekopp. Description. Verification. 1-4.1 1-4.2 Spiral 1 / Unit 4 Verilog HDL Mark Redekopp OVERVIEW 1-4.3 1-4.4 Digital Circuit Design Steps Digital Circuit Design Description Design and computer-entry of circuit Verification Input Stimulus

More information

Integrated Circuits & Systems

Integrated Circuits & Systems Federal University of Santa Catarina Center for Technology Computer Science & Electronics Engineering Integrated Circuits & Systems INE 5442 Lecture 23-1 guntzel@inf.ufsc.br Semiconductor Memory Classification

More information

Topic #6. Processor Design

Topic #6. Processor Design Topic #6 Processor Design Major Goals! To present the single-cycle implementation and to develop the student's understanding of combinational and clocked sequential circuits and the relationship between

More information

Design Methodologies and Tools. Full-Custom Design

Design Methodologies and Tools. Full-Custom Design Design Methodologies and Tools Design styles Full-custom design Standard-cell design Programmable logic Gate arrays and field-programmable gate arrays (FPGAs) Sea of gates System-on-a-chip (embedded cores)

More information

! Memory. " RAM Memory. " Serial Access Memories. ! Cell size accounts for most of memory array size. ! 6T SRAM Cell. " Used in most commercial chips

! Memory.  RAM Memory.  Serial Access Memories. ! Cell size accounts for most of memory array size. ! 6T SRAM Cell.  Used in most commercial chips ESE 57: Digital Integrated Circuits and VLSI Fundamentals Lec : April 5, 8 Memory: Periphery circuits Today! Memory " RAM Memory " Architecture " Memory core " SRAM " DRAM " Periphery " Serial Access Memories

More information

AC-DIMM: Associative Computing with STT-MRAM

AC-DIMM: Associative Computing with STT-MRAM AC-DIMM: Associative Computing with STT-MRAM Qing Guo, Xiaochen Guo, Ravi Patel Engin Ipek, Eby G. Friedman University of Rochester Published In: ISCA-2013 Motivation Prevalent Trends in Modern Computing:

More information

ECE410 Design Project Spring 2013 Design and Characterization of a CMOS 8-bit pipelined Microprocessor Data Path

ECE410 Design Project Spring 2013 Design and Characterization of a CMOS 8-bit pipelined Microprocessor Data Path ECE410 Design Project Spring 2013 Design and Characterization of a CMOS 8-bit pipelined Microprocessor Data Path Project Summary This project involves the schematic and layout design of an 8-bit microprocessor

More information

+1 (479)

+1 (479) Memory Courtesy of Dr. Daehyun Lim@WSU, Dr. Harris@HMC, Dr. Shmuel Wimer@BIU and Dr. Choi@PSU http://csce.uark.edu +1 (479) 575-6043 yrpeng@uark.edu Memory Arrays Memory Arrays Random Access Memory Serial

More information

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache 2 Virtually Indexed Caches 24-bit virtual address, 4KB page size 12 bits offset and 12 bits

More information

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)

More information

Architectural Support for Large-Scale Visual Search. Carlo C. del Mundo Vincent Lee Armin Alaghi Luis Ceze Mark Oskin

Architectural Support for Large-Scale Visual Search. Carlo C. del Mundo Vincent Lee Armin Alaghi Luis Ceze Mark Oskin Architectural Support for Large-Scale Visual Search Carlo C. del Mundo Vincent Lee Armin Alaghi Luis Ceze Mark Oskin Motivation: Visual Data & Their Applications Rebooting the IT Revolution, SIA, September

More information

Chapter 4. The Processor Designing the datapath

Chapter 4. The Processor Designing the datapath Chapter 4 The Processor Designing the datapath Introduction CPU performance determined by Instruction Count Clock Cycles per Instruction (CPI) and Cycle time Determined by Instruction Set Architecure (ISA)

More information

Lecture 15: DRAM Main Memory Systems. Today: DRAM basics and innovations (Section 2.3)

Lecture 15: DRAM Main Memory Systems. Today: DRAM basics and innovations (Section 2.3) Lecture 15: DRAM Main Memory Systems Today: DRAM basics and innovations (Section 2.3) 1 Memory Architecture Processor Memory Controller Address/Cmd Bank Row Buffer DIMM Data DIMM: a PCB with DRAM chips

More information

DIRECT Rambus DRAM has a high-speed interface of

DIRECT Rambus DRAM has a high-speed interface of 1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 34, NO. 11, NOVEMBER 1999 A 1.6-GByte/s DRAM with Flexible Mapping Redundancy Technique and Additional Refresh Scheme Satoru Takase and Natsuki Kushiyama

More information

Novel Nonvolatile Memory Hierarchies to Realize "Normally-Off Mobile Processors" ASP-DAC 2014

Novel Nonvolatile Memory Hierarchies to Realize Normally-Off Mobile Processors ASP-DAC 2014 Novel Nonvolatile Memory Hierarchies to Realize "Normally-Off Mobile Processors" ASP-DAC 2014 Shinobu Fujita, Kumiko Nomura, Hiroki Noguchi, Susumu Takeda, Keiko Abe Toshiba Corporation, R&D Center Advanced

More information

DRAM Tutorial Lecture. Vivek Seshadri

DRAM Tutorial Lecture. Vivek Seshadri DRAM Tutorial 18-447 Lecture Vivek Seshadri DRAM Module and Chip 2 Goals Cost Latency Bandwidth Parallelism Power Energy 3 DRAM Chip Bank I/O 4 Sense Amplifier top enable Inverter bottom 5 Sense Amplifier

More information

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Abstract Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such

More information

Embedded Memories. Advanced Digital IC Design. What is this about? Presentation Overview. Why is this important? Jingou Lai Sina Borhani

Embedded Memories. Advanced Digital IC Design. What is this about? Presentation Overview. Why is this important? Jingou Lai Sina Borhani 1 Advanced Digital IC Design What is this about? Embedded Memories Jingou Lai Sina Borhani Master students of SoC To introduce the motivation, background and the architecture of the embedded memories.

More information

Structure of Computer Systems

Structure of Computer Systems 288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram

More information

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning Presented by Nils Weller Hardware Acceleration for Data Processing Seminar, Fall 2017 PipeLayer: A Pipelined ReRAM-Based Accelerator for

More information

ECE 2300 Digital Logic & Computer Organization

ECE 2300 Digital Logic & Computer Organization ECE 2300 Digital Logic & Computer Organization Spring 201 Memories Lecture 14: 1 Announcements HW6 will be posted tonight Lab 4b next week: Debug your design before the in-lab exercise Lecture 14: 2 Review:

More information

OCP Engineering Workshop - Telco

OCP Engineering Workshop - Telco OCP Engineering Workshop - Telco Low Latency Mobile Edge Computing Trevor Hiatt Product Management, IDT IDT Company Overview Founded 1980 Workforce Approximately 1,800 employees Headquarters San Jose,

More information

Computer Architecture: Main Memory (Part II) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Main Memory (Part II) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Main Memory (Part II) Prof. Onur Mutlu Carnegie Mellon University Main Memory Lectures These slides are from the Scalable Memory Systems course taught at ACACES 2013 (July 15-19,

More information

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Mohammad Motamedi, Philipp Gysel, Venkatesh Akella and Soheil Ghiasi Electrical and Computer Engineering Department, University

More information

An Introduction to the Logic. Silicon Chips

An Introduction to the Logic. Silicon Chips An Introduction to the Logic of Silicon Chips Here is a photo of a typical silicon chip, taken alongside the tip of my little finger. Modern chips can be made a good deal smaller than the one shown - just

More information

PACE: Power-Aware Computing Engines

PACE: Power-Aware Computing Engines PACE: Power-Aware Computing Engines Krste Asanovic Saman Amarasinghe Martin Rinard Computer Architecture Group MIT Laboratory for Computer Science http://www.cag.lcs.mit.edu/ PACE Approach Energy- Conscious

More information

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University

More information

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung

More information

The Processor That Don't Cost a Thing

The Processor That Don't Cost a Thing The Processor That Don't Cost a Thing Peter Hsu, Ph.D. Peter Hsu Consulting, Inc. http://cs.wisc.edu/~peterhsu DRAM+Processor Commercial demand Heat stiffling industry's growth Heat density limits small

More information

High Performance Computing

High Performance Computing High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason

More information

Boolean Unit (The obvious way)

Boolean Unit (The obvious way) oolean Unit (The obvious way) It is simple to build up a oolean unit using primitive gates and a mux to select the function. Since there is no interconnection between bits, this unit can be simply replicated

More information

Lecture-14 (Memory Hierarchy) CS422-Spring

Lecture-14 (Memory Hierarchy) CS422-Spring Lecture-14 (Memory Hierarchy) CS422-Spring 2018 Biswa@CSE-IITK The Ideal World Instruction Supply Pipeline (Instruction execution) Data Supply - Zero-cycle latency - Infinite capacity - Zero cost - Perfect

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

EXPERIMENT NUMBER 11 REGISTERED ALU DESIGN

EXPERIMENT NUMBER 11 REGISTERED ALU DESIGN 11-1 EXPERIMENT NUMBER 11 REGISTERED ALU DESIGN Purpose Extend the design of the basic four bit adder to include other arithmetic and logic functions. References Wakerly: Section 5.1 Materials Required

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

More information

Unleashing the Power of Embedded DRAM

Unleashing the Power of Embedded DRAM Copyright 2005 Design And Reuse S.A. All rights reserved. Unleashing the Power of Embedded DRAM by Peter Gillingham, MOSAID Technologies Incorporated Ottawa, Canada Abstract Embedded DRAM technology offers

More information

ENGIN 112 Intro to Electrical and Computer Engineering

ENGIN 112 Intro to Electrical and Computer Engineering ENGIN 112 Intro to Electrical and Computer Engineering Lecture 30 Random Access Memory (RAM) Overview Memory is a collection of storage cells with associated input and output circuitry Possible to read

More information