SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator

Size: px
Start display at page:

Download "SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator"

Transcription

1 SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator Shuangchen Li, Alvin Oliver Glova, Xing Hu, Peng Gu, Dimin Niu*, Krishna T. Malladi*, Hongzhong Zheng*, Bob Brennan*, and Yuan Xie University of California, Santa Barbara *Memory Solutions Lab, Samsung Semiconductor Inc.

2 Executive Summary Introducing Stochastic Computing to In-DRAM Computing Architecture: For solving slow multiplication (MUL), Stochastic Leverage large capacity and bandwidth. Computing DRAM-based In-situ Acc. MUL AND Three arithmetic techniques (H 2 D) to improve stochastic computing on such architecture. In-DRAM comp. H 2 D Arithmetic Experiments shows 2.3x performance improvement v.s. w/o stochastic computing. 2

3 ... On-chip memory capacity DRAM in-situ Accelerator Rocks Cells Cells Cells Bitline... DRAM-based In-situ Accelerator rocks: with computing engines at BL-level, tightly bound memory and computing. NOR SA Compute-Capable 2D/3D Memory[neurocube] DRAM Memory-rich Processor[dadiannao] Performance 3

4 ... Cells Cells Cells DRAM in-situ Accelerator Rocks Challenges Bitline NOR SA... DRAM Run Boolean logic operation in SERIAL. For example: R = S X + ሚS Y NOR-only logic R = NOR( NOR( ሚS, X), NOR(S, Y) ) Step-1: X = NOR(0, X) Step-2: Y = NOR(0, Y) Step-3: ሚS = NOR(0, S) Step-4: tmp1 = NOR( ሚS, X) Step-5: tmp2 = NOR(S, Y) Step-6: R = NOR(tmp1,tmp2) Step-7: R = NOR(0, R) X Y S!X!Y!S!(!X+!S)!(!Y+S)!R R 4

5 ... Cycles Cells Cells Cells DRAM in-situ Accelerator Rocks Challenges Bitline NOR SA... DRAM Run Boolean logic operation in SERIAL. For example: R = S X + ሚS Y DRAM-Acc s Big Challenge: Very Slow Multiplications! NOR-only logic X Y S 143 cycles!x for 8-bit MUL! 1.0E+3 1.0E+2 1.0E+1 MUL SC R = NOR(0, R)!Y!S!(!X+!S)!(!Y+S) 1.0E+0!R R = NOR( NOR( ሚS, X), NOR(S, Y) ) Step-7: Binary operand bit length R 4

6 The Opportunity: Stochastic Computing Stochastic Computing (SC): A different data representation and arithmetic (like INT vs. FP) Bitstream representation: value = possibility of appearance of 1 # 1 s = 3 #bits = 6 # 1 s = 2 #bits = 6 5

7 The Opportunity: Stochastic Computing Stochastic Computing (SC): A different data representation and arithmetic (like INT vs. FP) Bitstream representation: value = possibility of appearance of 1 Binary Long SC bitstream & & & & & & MUL Simple bitwise AND! 5

8 Thr/Area (Muls/ns/nm 2 ) Thr/W (Muls/W) 2.94E E+12 The Good: But not a Free Lunch MUL AND, reducing MUL latency by 47x. The Bad: Hurt throughput & energy efficiency Exp-long bitstream (8 256), intensive BW and Capacity demands Stoc-and-Binary conversion overhead The Ugly: The Ugly: Numerical precision loss Unreproducible error, no debug 1.E+4 1.E+3 1.E+2 1.E+1 1.E+0 (a). Stoch. Bin 1.E+13 1.E+12 1.E+11 1.E+10 (b). SC-MUL area AND 1% PC 4% SN G 95% Stoch. Bin 6

9 Cycles Key Idea: Combining DRAM-Acc with SC Stochastic 1.0E+3 1.0E+2 1.0E+1 MUL SC Computing MUL AND 1.0E Binary operand bit length DRAM-based In-situ Acc. 7

10 Key Idea: Combining DRAM-Acc with SC Stochastic Computing MUL AND BIN(5) = DRAM-based In-situ Acc. SC(5) =

11 Key Idea: Combining DRAM-Acc with SC Stochastic Computing MUL AND DRAM-based In-situ Acc. In-DRAM comp. H 2 D Arithmetic 7

12 Related Work BL-level in memory computing architecture is hot. AMBIT[MICRO 17], DIRSA[MICRO 17],Compute Caches[HPCA 17] Stochastic computing is well study since 1960s. Showing promising results on DNN workloads J. Dickson[ICNN 93], K.Kim[DAC 16], DSCNN[ASPLOS 16], DPS[DAC 18], S.K. Khatamifard[CAL 18] This is the first work combines them together. Putting together, they synergistically reinforce the strengths and address the weaknesses of each other. 8

13 controller Decoder Improved Controller Stoch. Num. Generator (SNG) Overview of the Architecture Subarray Data Cell Region BL... Bank Subarray Comping Cell Region 2 Computational SA 3 BL logic 4 Register Bank SNG (optional) Popcount (optional) 5 Simple Shifters (b) Bank (c) Subarray Building upon BL-level in-dram computing architecture. Adding Stochastic number generator (SNG) and popcount (PC), and Improving controller and BL logic design 9

14 MuLs/Cycle/area Can we do better: Introducing H 2 D Arithmetic MUL AND In-DRAM comp. H 2 D Arithmetic 1.0E+3 1.0E+1 1.0E-1 1.0E-3 1.0E-5 1.0E-7 Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. Only 1.5x throughput! MUL Binary operand bit length SC 10

15 H 2 D Arithmetic-1: Hierarchical Representation Observation: trick for O(2 n ), change 2 n to (2 n/2 + 2 n/2 ). Converting MSB-part and LSB-part separately! Example: Binary: (n width) SC bitstream (2 n -1 width) Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. [1,0,0,1] [1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,1] Binary: [MSBs,LSBs] [1,0,0,1] SC bitstream: [1,0,1],[0,1,0] ( (2 n/2-1)*2 width) 11

16 H 2 D Arithmetic-2: Hybrid Binary-Stochastic Observation: Unnecessarily redundant representation, DRISA is fast at 2-bit MUL. Encoding part of SC as BIN, hybrid SC-BIN! SC bitstream: sub-stream: Count #-of- 1 in each substream: Hybrid-SC: >intra sub-stream is BIN >whole stream is still SC [1,1,1,0,1,0,1,0,1,0,0,0,0,0,1] [ 3, 1, 2, 0, 1 ] [ 11, 01, 10, 00, 01 ] Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. 2 n -1 width (2 n -1)/3*2 width 12

17 H 2 D Arithmetic-3: Deterministic SNG Observation: Randomness is used to ensure low correlation, We only do one OP in SC domain anyway, Real random bitstream is unnecessary. Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. LUT-based Stochastic Number Generator: offset 1 s 0 s Operand X (5/16): Operand Y (5/16): [ 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0 ] [ 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0 ] periodically 13

18 Norm. Error RMS 0% -33% 275% -23% -25% -60% H 2 D Arithmetic: Putting Them Together Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error H 2 D further increases the throughput by 6x H 2 D improves precision by 60% baseline Hybrid Hier. Deter. w.o.apc H2D-w.PC 14

19 Norm. Perf./Area Experiments: A Case Study on DNN 10 SCOPE-vanilla SCOPE-hyb SCOPE-hier SCOPE-H2D GPU-INT8 (All Norm. to DRISA) vgg resnet rnn Alex-train gmean SCOPE w/ H 2 D: 2.3x than DRISA, 11.6x than w/o H 2 D 15

20 More In the Paper Architecture design detail. H2D design detail. DNN case study detail. More experiments. Come to our poster! 16

21 Summary Stochastic Computing Stochastic Computing each other. In-DRAM Computing Architecture: Synergistically reinforce the strengths and address the weaknesses of MUL AND In-situ Acc. In-DRAM comp. H 2 D Arithmetic Three DRAM-based arithmetic techniques (H 2 D) to further improve performance, and solve precision problems. 17

22 Thanks! Questions SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator Shuangchen Li, Alvin Oliver Glova, Xing Hu, Peng Gu, Dimin Niu*, Krishna T. Malladi*, Hongzhong Zheng*, Bob Brennan*, and Yuan Xie University of California, Santa Barbara *Memory Solutions Lab, Samsung Semiconductor Inc.

DRISA: A DRAM-based Reconfigurable In-Situ Accelerator

DRISA: A DRAM-based Reconfigurable In-Situ Accelerator DRI: A DRAM-based Reconfigurable In-Situ Accelerator Shuangchen Li, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, Yuan Xie University of California, Santa Barbara Memory Solutions Lab, Samsung

More information

PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory

PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory Scalable and Energy-Efficient Architecture Lab (SEAL) PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in -based Main Memory Ping Chi *, Shuangchen Li *, Tao Zhang, Cong

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru

More information

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das M-Bits Research

More information

Emerging NVM Memory Technologies

Emerging NVM Memory Technologies Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement

More information

2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today.

2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today. ESE534: Computer Organization Previously Day 7: February 6, 2012 Memories Arithmetic: addition, subtraction Reuse: pipelining bit-serial (vectorization) Area/Time Tradeoffs Latency and Throughput 1 2 Today

More information

Low-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM

Low-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM Low-Cost Inter-Linked ubarrays (LIA) Enabling Fast Inter-ubarray Data Movement in DRAM Kevin Chang rashant Nair, Donghyuk Lee, augata Ghose, Moinuddin Qureshi, and Onur Mutlu roblem: Inefficient Bulk Data

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture

Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu Carnegie Mellon University HPCA - 2013 Executive

More information

2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today.

2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today. ESE534: Computer Organization Previously Day 5: February 1, 2010 Memories Arithmetic: addition, subtraction Reuse: pipelining bit-serial (vectorization) shared datapath elements FSMDs Area/Time Tradeoffs

More information

15-740/ Computer Architecture, Fall 2011 Midterm Exam II

15-740/ Computer Architecture, Fall 2011 Midterm Exam II 15-740/18-740 Computer Architecture, Fall 2011 Midterm Exam II Instructor: Onur Mutlu Teaching Assistants: Justin Meza, Yoongu Kim Date: December 2, 2011 Name: Instructions: Problem I (69 points) : Problem

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018 Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous

More information

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know. Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs

More information

Couture: Tailoring STT-MRAM for Persistent Main Memory. Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung

Couture: Tailoring STT-MRAM for Persistent Main Memory. Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung Couture: Tailoring STT-MRAM for Persistent Main Memory Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung Executive Summary Motivation: DRAM plays an instrumental role in modern

More information

ECE 2300 Digital Logic & Computer Organization

ECE 2300 Digital Logic & Computer Organization ECE 2300 Digital Logic & Computer Organization Spring 201 Memories Lecture 14: 1 Announcements HW6 will be posted tonight Lab 4b next week: Debug your design before the in-lab exercise Lecture 14: 2 Review:

More information

Spring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand

Spring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand Main Memory & DRAM Nima Honarmand Main Memory Big Picture 1) Last-level cache sends its memory requests to a Memory Controller Over a system bus of other types of interconnect 2) Memory controller translates

More information

ENGR 303 Introduction to Logic Design Lecture 7. Dr. Chuck Brown Engineering and Computer Information Science Folsom Lake College

ENGR 303 Introduction to Logic Design Lecture 7. Dr. Chuck Brown Engineering and Computer Information Science Folsom Lake College Introduction to Logic Design Lecture 7 Dr. Chuck Brown Engineering and Computer Information Science Folsom Lake College Outline for Todays Lecture Shifter Multiplier / Divider Memory Shifters Logical

More information

EE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring

EE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring EE 260: Introduction to Digital Design Arithmetic II Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa Overview n Integer multiplication n Booth s algorithm n Integer division

More information

Comparing Memory Systems for Chip Multiprocessors

Comparing Memory Systems for Chip Multiprocessors Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University

More information

Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives

Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives Chao Sun 1, Asuka Arakawa 1, Ayumi Soga 1, Chihiro Matsui 1 and Ken Takeuchi 1 1 Chuo University Santa Clara,

More information

Lecture 5. Other Adder Issues

Lecture 5. Other Adder Issues Lecture 5 Other Adder Issues Mark Horowitz Computer Systems Laboratory Stanford University horowitz@stanford.edu Copyright 24 by Mark Horowitz with information from Brucek Khailany 1 Overview Reading There

More information

Lightweight Arithmetic for Mobile Multimedia Devices. IEEE Transactions on Multimedia

Lightweight Arithmetic for Mobile Multimedia Devices. IEEE Transactions on Multimedia Lightweight Arithmetic for Mobile Multimedia Devices Tsuhan Chen Carnegie Mellon University tsuhan@cmu.edu Thanks to Fang Fang and Rob Rutenbar IEEE Transactions on Multimedia EDICS Signal Processing for

More information

Operations, Operands, and Instructions

Operations, Operands, and Instructions Operations, Operands, and Instructions Tom Kelliher, CS 220 Sept. 12, 2011 1 Administrivia Announcements Assignment Read 2.6 2.7. From Last Time Macro-architectural trends; IC fab. Outline 1. Introduction.

More information

An introduction to Machine Learning silicon

An introduction to Machine Learning silicon An introduction to Machine Learning silicon November 28 2017 Insight for Technology Investors AI/ML terminology Artificial Intelligence Machine Learning Deep Learning Algorithms: CNNs, RNNs, etc. Additional

More information

High Efficiency Video Coding. Li Li 2016/10/18

High Efficiency Video Coding. Li Li 2016/10/18 High Efficiency Video Coding Li Li 2016/10/18 Email: lili90th@gmail.com Outline Video coding basics High Efficiency Video Coding Conclusion Digital Video A video is nothing but a number of frames Attributes

More information

Cache/Memory Optimization. - Krishna Parthaje

Cache/Memory Optimization. - Krishna Parthaje Cache/Memory Optimization - Krishna Parthaje Hybrid Cache Architecture Replacing SRAM Cache with Future Memory Technology Suji Lee, Jongpil Jung, and Chong-Min Kyung Department of Electrical Engineering,KAIST

More information

DIRECT Rambus DRAM has a high-speed interface of

DIRECT Rambus DRAM has a high-speed interface of 1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 34, NO. 11, NOVEMBER 1999 A 1.6-GByte/s DRAM with Flexible Mapping Redundancy Technique and Additional Refresh Scheme Satoru Takase and Natsuki Kushiyama

More information

Habermann, Philipp Design and Implementation of a High- Throughput CABAC Hardware Accelerator for the HEVC Decoder

Habermann, Philipp Design and Implementation of a High- Throughput CABAC Hardware Accelerator for the HEVC Decoder Habermann, Philipp Design and Implementation of a High- Throughput CABAC Hardware Accelerator for the HEVC Decoder Conference paper Published version This version is available at https://doi.org/10.14279/depositonce-6780

More information

Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM. Join the Conversation #OpenPOWERSummit

Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM. Join the Conversation #OpenPOWERSummit Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM Join the Conversation #OpenPOWERSummit Moral of the Story OpenPOWER is the best platform to

More information

CS 261 Fall Binary Information (convert to hex) Mike Lam, Professor

CS 261 Fall Binary Information (convert to hex) Mike Lam, Professor CS 261 Fall 2018 Mike Lam, Professor 3735928559 (convert to hex) Binary Information Binary information Topics Base conversions (bin/dec/hex) Data sizes Byte ordering Character and program encodings Bitwise

More information

Final Lecture. A few minutes to wrap up and add some perspective

Final Lecture. A few minutes to wrap up and add some perspective Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection

More information

Using FPGAs as Microservices

Using FPGAs as Microservices Using FPGAs as Microservices David Ojika, Ann Gordon-Ross, Herman Lam, Bhavesh Patel, Gaurav Kaul, Jayson Strayer (University of Florida, DELL EMC, Intel Corporation) The 9 th Workshop on Big Data Benchmarks,

More information

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago CMSC 22200 Computer Architecture Lecture 2: ISA Prof. Yanjing Li Department of Computer Science University of Chicago Administrative Stuff! Lab1 is out! " Due next Thursday (10/6)! Lab2 " Out next Thursday

More information

ORBX 2 Technical Introduction. October 2013

ORBX 2 Technical Introduction. October 2013 ORBX 2 Technical Introduction October 2013 Summary The ORBX 2 video codec is a next generation video codec designed specifically to fulfill the requirements for low latency real time video streaming. It

More information

AC-DIMM: Associative Computing with STT-MRAM

AC-DIMM: Associative Computing with STT-MRAM AC-DIMM: Associative Computing with STT-MRAM Qing Guo, Xiaochen Guo, Ravi Patel Engin Ipek, Eby G. Friedman University of Rochester Published In: ISCA-2013 Motivation Prevalent Trends in Modern Computing:

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Lightweight Arithmetic for Mobile Multimedia Devices

Lightweight Arithmetic for Mobile Multimedia Devices Lightweight Arithmetic for Mobile Multimedia Devices Tsuhan Chen 陳祖翰 Carnegie Mellon University tsuhan@cmu.edu Thanks to Fang Fang and Rob Rutenbar Multimedia Applications on Mobile Devices Multimedia

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

Announcements. ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab 1 due today Reading: Chapter 5.1 5.3 2 1 Overview How to

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and

More information

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University

More information

Xilinx Machine Learning Strategies For Edge

Xilinx Machine Learning Strategies For Edge Xilinx Machine Learning Strategies For Edge Presented By Alvin Clark, Sr. FAE, Northwest The Hottest Research: AI / Machine Learning Nick s ML Model Nick s ML Framework copyright sources: Gospel Coalition

More information

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018 S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific

More information

But first, encode deck of cards. Integer Representation. Two possible representations. Two better representations WELLESLEY CS 240 9/8/15

But first, encode deck of cards. Integer Representation. Two possible representations. Two better representations WELLESLEY CS 240 9/8/15 Integer Representation Representation of integers: unsigned and signed Sign extension Arithmetic and shifting Casting But first, encode deck of cards. cards in suits How do we encode suits, face cards?

More information

Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories

Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories Shuangchen Li 1, Cong Xu 2, Qiaosha Zou 1,5, Jishen Zhao 3,YuLu 4, and Yuan Xie 1 University

More information

VLSID KOLKATA, INDIA January 4-8, 2016

VLSID KOLKATA, INDIA January 4-8, 2016 VLSID 2016 KOLKATA, INDIA January 4-8, 2016 Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures Ishan Thakkar, Sudeep Pasricha Department of Electrical

More information

MNSIM: A Simulation Platform for Memristor-based Neuromorphic Computing System

MNSIM: A Simulation Platform for Memristor-based Neuromorphic Computing System MNSIM: A Simulation Platform for Memristor-based Neuromorphic Computing System Lixue Xia 1, Boxun Li 1, Tianqi Tang 1, Peng Gu 12, Xiling Yin 1, Wenqin Huangfu 1, Pai-Yu Chen 3, Shimeng Yu 3, Yu Cao 3,

More information

Evaluating Computers: Bigger, better, faster, more?

Evaluating Computers: Bigger, better, faster, more? Evaluating Computers: Bigger, better, faster, more? 1 Key Points What does it mean for a computer to be good? What is latency? bandwidth? What is the performance equation? 2 What do you want in a computer?

More information

CAMED: Complexity Adaptive Motion Estimation & Mode Decision for H.264 Video

CAMED: Complexity Adaptive Motion Estimation & Mode Decision for H.264 Video ICASSP 6 CAMED: Complexity Adaptive Motion Estimation & Mode Decision for H.264 Video Yong Wang Prof. Shih-Fu Chang Digital Video and Multimedia (DVMM) Lab, Columbia University Outline Complexity aware

More information

1/19/2009. Data Locality. Exploiting Locality: Caches

1/19/2009. Data Locality. Exploiting Locality: Caches Spring 2009 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Data Locality Temporal: if data item needed now, it is likely to be needed again in near future Spatial: if data item needed now, nearby

More information

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University

More information

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung

More information

arxiv: v1 [cs.ar] 9 May 2018

arxiv: v1 [cs.ar] 9 May 2018 To appear in the 45th ACM/IEEE International Symposium on Computer Architecture (ISCA 28) Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks arxiv:85.378v [cs.ar] 9 May 28 Charles Eckert,

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Lecture-14 (Memory Hierarchy) CS422-Spring

Lecture-14 (Memory Hierarchy) CS422-Spring Lecture-14 (Memory Hierarchy) CS422-Spring 2018 Biswa@CSE-IITK The Ideal World Instruction Supply Pipeline (Instruction execution) Data Supply - Zero-cycle latency - Infinite capacity - Zero cost - Perfect

More information

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning, Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)

More information

Caches 3/23/17. Agenda. The Dataflow Model (of a Computer)

Caches 3/23/17. Agenda. The Dataflow Model (of a Computer) Agenda Caches Samira Khan March 23, 2017 Review from last lecture Data flow model Memory hierarchy More Caches The Dataflow Model (of a Computer) Von Neumann model: An instruction is fetched and executed

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required

More information

Automated Debugging In Data Intensive Scalable Computing Systems

Automated Debugging In Data Intensive Scalable Computing Systems Automated Debugging In Data Intensive Scalable Computing Systems Muhammad Ali Gulzar 1, Matteo Interlandi 3, Xueyuan Han 2, Mingda Li 1, Tyson Condie 1, and Miryung Kim 1 1 University of California, Los

More information

Caches. Samira Khan March 23, 2017

Caches. Samira Khan March 23, 2017 Caches Samira Khan March 23, 2017 Agenda Review from last lecture Data flow model Memory hierarchy More Caches The Dataflow Model (of a Computer) Von Neumann model: An instruction is fetched and executed

More information

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory / DRAM SRAM = Static RAM SRAM vs. DRAM As long as power is present, data is retained DRAM = Dynamic RAM If you don t do anything, you lose the data SRAM: 6T per bit

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

World s most advanced data center accelerator for PCIe-based servers

World s most advanced data center accelerator for PCIe-based servers NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying

More information

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es CS6453 Data-Intensive Systems: Technology trends, Emerging challenges & opportuni=es Rachit Agarwal Slides based on: many many discussions with Ion Stoica, his class, and many industry folks Servers Typical

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Scaling Neural Network Acceleration using Coarse-Grained Parallelism

Scaling Neural Network Acceleration using Coarse-Grained Parallelism Scaling Neural Network Acceleration using Coarse-Grained Parallelism Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2018 Neural Networks (NNs)

More information

Chapter 2A Instructions: Language of the Computer

Chapter 2A Instructions: Language of the Computer Chapter 2A Instructions: Language of the Computer Copyright 2009 Elsevier, Inc. All rights reserved. Instruction Set The repertoire of instructions of a computer Different computers have different instruction

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

A Fast Synchronous Pipelined DRAM Architecture with SRAM Buffers

A Fast Synchronous Pipelined DRAM Architecture with SRAM Buffers A Fast Synchronous Pipelined DRAM Architecture with SRAM Buffers Chi-Weon Yoon, Yon-Kyun Im, Seon-Ho Han, Hoi-Jun Yoo and Tae-Sung Jung* Dept. of Electrical Engineering, KAIST *Samsung Electronics Co.,

More information

CS 265. Computer Architecture. Wei Lu, Ph.D., P.Eng.

CS 265. Computer Architecture. Wei Lu, Ph.D., P.Eng. CS 265 Computer Architecture Wei Lu, Ph.D., P.Eng. Part 3: von Neumann Architecture von Neumann Architecture Our goal: understand the basics of von Neumann architecture, including memory, control unit

More information

Slide Set 5. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng

Slide Set 5. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng Slide Set 5 for ENCM 501 in Winter Term, 2017 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary Winter Term, 2017 ENCM 501 W17 Lectures: Slide

More information

Accelerating computation with FPGAs

Accelerating computation with FPGAs Accelerating computation with FPGAs Michael J. Flynn Maxeler Technologies and Stanford University M. J. Flynn Maxeler Technologies 1 Based on work done by my colleagues at Maxeler, especially Oskar Mencer,

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 3 Arithmetic for Computers Implementation Today Review representations (252/352 recap) Floating point Addition: Ripple

More information

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Jongsoo Park Facebook AI System SW/HW Co-design Team Sep-21 2018 Team Introduction

More information

Lecture 3 Machine Language. Instructions: Instruction Execution cycle. Speaking computer before voice recognition interfaces

Lecture 3 Machine Language. Instructions: Instruction Execution cycle. Speaking computer before voice recognition interfaces Lecture 3 Machine Language Speaking computer before voice recognition interfaces 1 Instructions: Language of the Machine More primitive than higher level languages e.g., no sophisticated control flow Very

More information

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture

ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture Jie Zhang, Miryeong Kwon, Changyoung Park, Myoungsoo Jung, Songkuk Kim Computer Architecture and Memory

More information

8051 Overview and Instruction Set

8051 Overview and Instruction Set 8051 Overview and Instruction Set Curtis A. Nelson Engr 355 1 Microprocessors vs. Microcontrollers Microprocessors are single-chip CPUs used in microcomputers Microcontrollers and microprocessors are different

More information

Integer Representation

Integer Representation Integer Representation Announcements assign0 due tonight assign1 out tomorrow Labs start this week SCPD Note on ofce hours on Piazza Will get an email tonight about labs Goals for Today Introduction to

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition. Carnegie Mellon

Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition. Carnegie Mellon Carnegie Mellon Floating Point 15-213/18-213/14-513/15-513: Introduction to Computer Systems 4 th Lecture, Sept. 6, 2018 Today: Floating Point Background: Fractional binary numbers IEEE floating point

More information

HEAT (Hardware enabled Algorithmic tester) for 2.5D HBM Solution

HEAT (Hardware enabled Algorithmic tester) for 2.5D HBM Solution HEAT (Hardware enabled Algorithmic tester) for 2.5D HBM Solution Kunal Varshney, Open-Silicon Ganesh Venkatkrishnan, Open-Silicon Pankaj Prajapati, Open-Silicon May 9, 9, 2016 1 Agenda High Bandwidth Memory

More information

Unsigned Binary Integers

Unsigned Binary Integers Unsigned Binary Integers Given an n-bit number x x n 1 n 2 1 0 n 12 xn 22 x12 x02 Range: 0 to +2 n 1 Example 2.4 Signed and Unsigned Numbers 0000 0000 0000 0000 0000 0000 0000 1011 2 = 0 + + 1 2 3 + 0

More information

Unsigned Binary Integers

Unsigned Binary Integers Unsigned Binary Integers Given an n-bit number x x n 1 n 2 1 0 n 12 xn 22 x12 x02 Range: 0 to +2 n 1 Example 2.4 Signed and Unsigned Numbers 0000 0000 0000 0000 0000 0000 0000 1011 2 = 0 + + 1 2 3 + 0

More information

Computer Architecture. R. Poss

Computer Architecture. R. Poss Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion

More information