SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator
|
|
- Maud Oliver
- 5 years ago
- Views:
Transcription
1 SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator Shuangchen Li, Alvin Oliver Glova, Xing Hu, Peng Gu, Dimin Niu*, Krishna T. Malladi*, Hongzhong Zheng*, Bob Brennan*, and Yuan Xie University of California, Santa Barbara *Memory Solutions Lab, Samsung Semiconductor Inc.
2 Executive Summary Introducing Stochastic Computing to In-DRAM Computing Architecture: For solving slow multiplication (MUL), Stochastic Leverage large capacity and bandwidth. Computing DRAM-based In-situ Acc. MUL AND Three arithmetic techniques (H 2 D) to improve stochastic computing on such architecture. In-DRAM comp. H 2 D Arithmetic Experiments shows 2.3x performance improvement v.s. w/o stochastic computing. 2
3 ... On-chip memory capacity DRAM in-situ Accelerator Rocks Cells Cells Cells Bitline... DRAM-based In-situ Accelerator rocks: with computing engines at BL-level, tightly bound memory and computing. NOR SA Compute-Capable 2D/3D Memory[neurocube] DRAM Memory-rich Processor[dadiannao] Performance 3
4 ... Cells Cells Cells DRAM in-situ Accelerator Rocks Challenges Bitline NOR SA... DRAM Run Boolean logic operation in SERIAL. For example: R = S X + ሚS Y NOR-only logic R = NOR( NOR( ሚS, X), NOR(S, Y) ) Step-1: X = NOR(0, X) Step-2: Y = NOR(0, Y) Step-3: ሚS = NOR(0, S) Step-4: tmp1 = NOR( ሚS, X) Step-5: tmp2 = NOR(S, Y) Step-6: R = NOR(tmp1,tmp2) Step-7: R = NOR(0, R) X Y S!X!Y!S!(!X+!S)!(!Y+S)!R R 4
5 ... Cycles Cells Cells Cells DRAM in-situ Accelerator Rocks Challenges Bitline NOR SA... DRAM Run Boolean logic operation in SERIAL. For example: R = S X + ሚS Y DRAM-Acc s Big Challenge: Very Slow Multiplications! NOR-only logic X Y S 143 cycles!x for 8-bit MUL! 1.0E+3 1.0E+2 1.0E+1 MUL SC R = NOR(0, R)!Y!S!(!X+!S)!(!Y+S) 1.0E+0!R R = NOR( NOR( ሚS, X), NOR(S, Y) ) Step-7: Binary operand bit length R 4
6 The Opportunity: Stochastic Computing Stochastic Computing (SC): A different data representation and arithmetic (like INT vs. FP) Bitstream representation: value = possibility of appearance of 1 # 1 s = 3 #bits = 6 # 1 s = 2 #bits = 6 5
7 The Opportunity: Stochastic Computing Stochastic Computing (SC): A different data representation and arithmetic (like INT vs. FP) Bitstream representation: value = possibility of appearance of 1 Binary Long SC bitstream & & & & & & MUL Simple bitwise AND! 5
8 Thr/Area (Muls/ns/nm 2 ) Thr/W (Muls/W) 2.94E E+12 The Good: But not a Free Lunch MUL AND, reducing MUL latency by 47x. The Bad: Hurt throughput & energy efficiency Exp-long bitstream (8 256), intensive BW and Capacity demands Stoc-and-Binary conversion overhead The Ugly: The Ugly: Numerical precision loss Unreproducible error, no debug 1.E+4 1.E+3 1.E+2 1.E+1 1.E+0 (a). Stoch. Bin 1.E+13 1.E+12 1.E+11 1.E+10 (b). SC-MUL area AND 1% PC 4% SN G 95% Stoch. Bin 6
9 Cycles Key Idea: Combining DRAM-Acc with SC Stochastic 1.0E+3 1.0E+2 1.0E+1 MUL SC Computing MUL AND 1.0E Binary operand bit length DRAM-based In-situ Acc. 7
10 Key Idea: Combining DRAM-Acc with SC Stochastic Computing MUL AND BIN(5) = DRAM-based In-situ Acc. SC(5) =
11 Key Idea: Combining DRAM-Acc with SC Stochastic Computing MUL AND DRAM-based In-situ Acc. In-DRAM comp. H 2 D Arithmetic 7
12 Related Work BL-level in memory computing architecture is hot. AMBIT[MICRO 17], DIRSA[MICRO 17],Compute Caches[HPCA 17] Stochastic computing is well study since 1960s. Showing promising results on DNN workloads J. Dickson[ICNN 93], K.Kim[DAC 16], DSCNN[ASPLOS 16], DPS[DAC 18], S.K. Khatamifard[CAL 18] This is the first work combines them together. Putting together, they synergistically reinforce the strengths and address the weaknesses of each other. 8
13 controller Decoder Improved Controller Stoch. Num. Generator (SNG) Overview of the Architecture Subarray Data Cell Region BL... Bank Subarray Comping Cell Region 2 Computational SA 3 BL logic 4 Register Bank SNG (optional) Popcount (optional) 5 Simple Shifters (b) Bank (c) Subarray Building upon BL-level in-dram computing architecture. Adding Stochastic number generator (SNG) and popcount (PC), and Improving controller and BL logic design 9
14 MuLs/Cycle/area Can we do better: Introducing H 2 D Arithmetic MUL AND In-DRAM comp. H 2 D Arithmetic 1.0E+3 1.0E+1 1.0E-1 1.0E-3 1.0E-5 1.0E-7 Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. Only 1.5x throughput! MUL Binary operand bit length SC 10
15 H 2 D Arithmetic-1: Hierarchical Representation Observation: trick for O(2 n ), change 2 n to (2 n/2 + 2 n/2 ). Converting MSB-part and LSB-part separately! Example: Binary: (n width) SC bitstream (2 n -1 width) Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. [1,0,0,1] [1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,1] Binary: [MSBs,LSBs] [1,0,0,1] SC bitstream: [1,0,1],[0,1,0] ( (2 n/2-1)*2 width) 11
16 H 2 D Arithmetic-2: Hybrid Binary-Stochastic Observation: Unnecessarily redundant representation, DRISA is fast at 2-bit MUL. Encoding part of SC as BIN, hybrid SC-BIN! SC bitstream: sub-stream: Count #-of- 1 in each substream: Hybrid-SC: >intra sub-stream is BIN >whole stream is still SC [1,1,1,0,1,0,1,0,1,0,0,0,0,0,1] [ 3, 1, 2, 0, 1 ] [ 11, 01, 10, 00, 01 ] Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. 2 n -1 width (2 n -1)/3*2 width 12
17 H 2 D Arithmetic-3: Deterministic SNG Observation: Randomness is used to ensure low correlation, We only do one OP in SC domain anyway, Real random bitstream is unnecessary. Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. LUT-based Stochastic Number Generator: offset 1 s 0 s Operand X (5/16): Operand Y (5/16): [ 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0 ] [ 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0 ] periodically 13
18 Norm. Error RMS 0% -33% 275% -23% -25% -60% H 2 D Arithmetic: Putting Them Together Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error H 2 D further increases the throughput by 6x H 2 D improves precision by 60% baseline Hybrid Hier. Deter. w.o.apc H2D-w.PC 14
19 Norm. Perf./Area Experiments: A Case Study on DNN 10 SCOPE-vanilla SCOPE-hyb SCOPE-hier SCOPE-H2D GPU-INT8 (All Norm. to DRISA) vgg resnet rnn Alex-train gmean SCOPE w/ H 2 D: 2.3x than DRISA, 11.6x than w/o H 2 D 15
20 More In the Paper Architecture design detail. H2D design detail. DNN case study detail. More experiments. Come to our poster! 16
21 Summary Stochastic Computing Stochastic Computing each other. In-DRAM Computing Architecture: Synergistically reinforce the strengths and address the weaknesses of MUL AND In-situ Acc. In-DRAM comp. H 2 D Arithmetic Three DRAM-based arithmetic techniques (H 2 D) to further improve performance, and solve precision problems. 17
22 Thanks! Questions SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator Shuangchen Li, Alvin Oliver Glova, Xing Hu, Peng Gu, Dimin Niu*, Krishna T. Malladi*, Hongzhong Zheng*, Bob Brennan*, and Yuan Xie University of California, Santa Barbara *Memory Solutions Lab, Samsung Semiconductor Inc.
DRISA: A DRAM-based Reconfigurable In-Situ Accelerator
DRI: A DRAM-based Reconfigurable In-Situ Accelerator Shuangchen Li, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, Yuan Xie University of California, Santa Barbara Memory Solutions Lab, Samsung
More informationPRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory
Scalable and Energy-Efficient Architecture Lab (SEAL) PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in -based Main Memory Ping Chi *, Shuangchen Li *, Tao Zhang, Cong
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationDNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
More informationAccelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru
More informationNeural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das M-Bits Research
More informationEmerging NVM Memory Technologies
Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement
More information2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today.
ESE534: Computer Organization Previously Day 7: February 6, 2012 Memories Arithmetic: addition, subtraction Reuse: pipelining bit-serial (vectorization) Area/Time Tradeoffs Latency and Throughput 1 2 Today
More informationLow-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM
Low-Cost Inter-Linked ubarrays (LIA) Enabling Fast Inter-ubarray Data Movement in DRAM Kevin Chang rashant Nair, Donghyuk Lee, augata Ghose, Moinuddin Qureshi, and Onur Mutlu roblem: Inefficient Bulk Data
More informationAddressing the Memory Wall
Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the
More informationTiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture
Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu Carnegie Mellon University HPCA - 2013 Executive
More information2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today.
ESE534: Computer Organization Previously Day 5: February 1, 2010 Memories Arithmetic: addition, subtraction Reuse: pipelining bit-serial (vectorization) shared datapath elements FSMDs Area/Time Tradeoffs
More information15-740/ Computer Architecture, Fall 2011 Midterm Exam II
15-740/18-740 Computer Architecture, Fall 2011 Midterm Exam II Instructor: Onur Mutlu Teaching Assistants: Justin Meza, Yoongu Kim Date: December 2, 2011 Name: Instructions: Problem I (69 points) : Problem
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationAdaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018
Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous
More informationAdministrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.
Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs
More informationCouture: Tailoring STT-MRAM for Persistent Main Memory. Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung
Couture: Tailoring STT-MRAM for Persistent Main Memory Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung Executive Summary Motivation: DRAM plays an instrumental role in modern
More informationECE 2300 Digital Logic & Computer Organization
ECE 2300 Digital Logic & Computer Organization Spring 201 Memories Lecture 14: 1 Announcements HW6 will be posted tonight Lab 4b next week: Debug your design before the in-lab exercise Lecture 14: 2 Review:
More informationSpring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand
Main Memory & DRAM Nima Honarmand Main Memory Big Picture 1) Last-level cache sends its memory requests to a Memory Controller Over a system bus of other types of interconnect 2) Memory controller translates
More informationENGR 303 Introduction to Logic Design Lecture 7. Dr. Chuck Brown Engineering and Computer Information Science Folsom Lake College
Introduction to Logic Design Lecture 7 Dr. Chuck Brown Engineering and Computer Information Science Folsom Lake College Outline for Todays Lecture Shifter Multiplier / Divider Memory Shifters Logical
More informationEE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring
EE 260: Introduction to Digital Design Arithmetic II Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa Overview n Integer multiplication n Booth s algorithm n Integer division
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationMiddleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives
Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives Chao Sun 1, Asuka Arakawa 1, Ayumi Soga 1, Chihiro Matsui 1 and Ken Takeuchi 1 1 Chuo University Santa Clara,
More informationLecture 5. Other Adder Issues
Lecture 5 Other Adder Issues Mark Horowitz Computer Systems Laboratory Stanford University horowitz@stanford.edu Copyright 24 by Mark Horowitz with information from Brucek Khailany 1 Overview Reading There
More informationLightweight Arithmetic for Mobile Multimedia Devices. IEEE Transactions on Multimedia
Lightweight Arithmetic for Mobile Multimedia Devices Tsuhan Chen Carnegie Mellon University tsuhan@cmu.edu Thanks to Fang Fang and Rob Rutenbar IEEE Transactions on Multimedia EDICS Signal Processing for
More informationOperations, Operands, and Instructions
Operations, Operands, and Instructions Tom Kelliher, CS 220 Sept. 12, 2011 1 Administrivia Announcements Assignment Read 2.6 2.7. From Last Time Macro-architectural trends; IC fab. Outline 1. Introduction.
More informationAn introduction to Machine Learning silicon
An introduction to Machine Learning silicon November 28 2017 Insight for Technology Investors AI/ML terminology Artificial Intelligence Machine Learning Deep Learning Algorithms: CNNs, RNNs, etc. Additional
More informationHigh Efficiency Video Coding. Li Li 2016/10/18
High Efficiency Video Coding Li Li 2016/10/18 Email: lili90th@gmail.com Outline Video coding basics High Efficiency Video Coding Conclusion Digital Video A video is nothing but a number of frames Attributes
More informationCache/Memory Optimization. - Krishna Parthaje
Cache/Memory Optimization - Krishna Parthaje Hybrid Cache Architecture Replacing SRAM Cache with Future Memory Technology Suji Lee, Jongpil Jung, and Chong-Min Kyung Department of Electrical Engineering,KAIST
More informationDIRECT Rambus DRAM has a high-speed interface of
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 34, NO. 11, NOVEMBER 1999 A 1.6-GByte/s DRAM with Flexible Mapping Redundancy Technique and Additional Refresh Scheme Satoru Takase and Natsuki Kushiyama
More informationHabermann, Philipp Design and Implementation of a High- Throughput CABAC Hardware Accelerator for the HEVC Decoder
Habermann, Philipp Design and Implementation of a High- Throughput CABAC Hardware Accelerator for the HEVC Decoder Conference paper Published version This version is available at https://doi.org/10.14279/depositonce-6780
More informationFacilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM. Join the Conversation #OpenPOWERSummit
Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM Join the Conversation #OpenPOWERSummit Moral of the Story OpenPOWER is the best platform to
More informationCS 261 Fall Binary Information (convert to hex) Mike Lam, Professor
CS 261 Fall 2018 Mike Lam, Professor 3735928559 (convert to hex) Binary Information Binary information Topics Base conversions (bin/dec/hex) Data sizes Byte ordering Character and program encodings Bitwise
More informationFinal Lecture. A few minutes to wrap up and add some perspective
Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection
More informationUsing FPGAs as Microservices
Using FPGAs as Microservices David Ojika, Ann Gordon-Ross, Herman Lam, Bhavesh Patel, Gaurav Kaul, Jayson Strayer (University of Florida, DELL EMC, Intel Corporation) The 9 th Workshop on Big Data Benchmarks,
More informationCMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago
CMSC 22200 Computer Architecture Lecture 2: ISA Prof. Yanjing Li Department of Computer Science University of Chicago Administrative Stuff! Lab1 is out! " Due next Thursday (10/6)! Lab2 " Out next Thursday
More informationORBX 2 Technical Introduction. October 2013
ORBX 2 Technical Introduction October 2013 Summary The ORBX 2 video codec is a next generation video codec designed specifically to fulfill the requirements for low latency real time video streaming. It
More informationAC-DIMM: Associative Computing with STT-MRAM
AC-DIMM: Associative Computing with STT-MRAM Qing Guo, Xiaochen Guo, Ravi Patel Engin Ipek, Eby G. Friedman University of Rochester Published In: ISCA-2013 Motivation Prevalent Trends in Modern Computing:
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationLightweight Arithmetic for Mobile Multimedia Devices
Lightweight Arithmetic for Mobile Multimedia Devices Tsuhan Chen 陳祖翰 Carnegie Mellon University tsuhan@cmu.edu Thanks to Fang Fang and Rob Rutenbar Multimedia Applications on Mobile Devices Multimedia
More informationReducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University
Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck
More informationAnnouncements. ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy. Edward Suh Computer Systems Laboratory
ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab 1 due today Reading: Chapter 5.1 5.3 2 1 Overview How to
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and
More informationVersal: AI Engine & Programming Environment
Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY
More informationEFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES
EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University
More informationXilinx Machine Learning Strategies For Edge
Xilinx Machine Learning Strategies For Edge Presented By Alvin Clark, Sr. FAE, Northwest The Hottest Research: AI / Machine Learning Nick s ML Model Nick s ML Framework copyright sources: Gospel Coalition
More informationS WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018
S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific
More informationBut first, encode deck of cards. Integer Representation. Two possible representations. Two better representations WELLESLEY CS 240 9/8/15
Integer Representation Representation of integers: unsigned and signed Sign extension Arithmetic and shifting Casting But first, encode deck of cards. cards in suits How do we encode suits, face cards?
More informationPinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories
Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories Shuangchen Li 1, Cong Xu 2, Qiaosha Zou 1,5, Jishen Zhao 3,YuLu 4, and Yuan Xie 1 University
More informationVLSID KOLKATA, INDIA January 4-8, 2016
VLSID 2016 KOLKATA, INDIA January 4-8, 2016 Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures Ishan Thakkar, Sudeep Pasricha Department of Electrical
More informationMNSIM: A Simulation Platform for Memristor-based Neuromorphic Computing System
MNSIM: A Simulation Platform for Memristor-based Neuromorphic Computing System Lixue Xia 1, Boxun Li 1, Tianqi Tang 1, Peng Gu 12, Xiling Yin 1, Wenqin Huangfu 1, Pai-Yu Chen 3, Shimeng Yu 3, Yu Cao 3,
More informationEvaluating Computers: Bigger, better, faster, more?
Evaluating Computers: Bigger, better, faster, more? 1 Key Points What does it mean for a computer to be good? What is latency? bandwidth? What is the performance equation? 2 What do you want in a computer?
More informationCAMED: Complexity Adaptive Motion Estimation & Mode Decision for H.264 Video
ICASSP 6 CAMED: Complexity Adaptive Motion Estimation & Mode Decision for H.264 Video Yong Wang Prof. Shih-Fu Chang Digital Video and Multimedia (DVMM) Lab, Columbia University Outline Complexity aware
More information1/19/2009. Data Locality. Exploiting Locality: Caches
Spring 2009 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Data Locality Temporal: if data item needed now, it is likely to be needed again in near future Spatial: if data item needed now, nearby
More informationEvaluating STT-RAM as an Energy-Efficient Main Memory Alternative
Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University
More informationMulti-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung
More informationarxiv: v1 [cs.ar] 9 May 2018
To appear in the 45th ACM/IEEE International Symposium on Computer Architecture (ISCA 28) Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks arxiv:85.378v [cs.ar] 9 May 28 Charles Eckert,
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationLecture-14 (Memory Hierarchy) CS422-Spring
Lecture-14 (Memory Hierarchy) CS422-Spring 2018 Biswa@CSE-IITK The Ideal World Instruction Supply Pipeline (Instruction execution) Data Supply - Zero-cycle latency - Infinite capacity - Zero cost - Perfect
More informationIndex. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,
Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationScaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research
Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)
More informationCaches 3/23/17. Agenda. The Dataflow Model (of a Computer)
Agenda Caches Samira Khan March 23, 2017 Review from last lecture Data flow model Memory hierarchy More Caches The Dataflow Model (of a Computer) Von Neumann model: An instruction is fetched and executed
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required
More informationAutomated Debugging In Data Intensive Scalable Computing Systems
Automated Debugging In Data Intensive Scalable Computing Systems Muhammad Ali Gulzar 1, Matteo Interlandi 3, Xueyuan Han 2, Mingda Li 1, Tyson Condie 1, and Miryung Kim 1 1 University of California, Los
More informationCaches. Samira Khan March 23, 2017
Caches Samira Khan March 23, 2017 Agenda Review from last lecture Data flow model Memory hierarchy More Caches The Dataflow Model (of a Computer) Von Neumann model: An instruction is fetched and executed
More informationCan FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory / DRAM SRAM = Static RAM SRAM vs. DRAM As long as power is present, data is retained DRAM = Dynamic RAM If you don t do anything, you lose the data SRAM: 6T per bit
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationWorld s most advanced data center accelerator for PCIe-based servers
NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying
More informationCS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es
CS6453 Data-Intensive Systems: Technology trends, Emerging challenges & opportuni=es Rachit Agarwal Slides based on: many many discussions with Ion Stoica, his class, and many industry folks Servers Typical
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationScaling Neural Network Acceleration using Coarse-Grained Parallelism
Scaling Neural Network Acceleration using Coarse-Grained Parallelism Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2018 Neural Networks (NNs)
More informationChapter 2A Instructions: Language of the Computer
Chapter 2A Instructions: Language of the Computer Copyright 2009 Elsevier, Inc. All rights reserved. Instruction Set The repertoire of instructions of a computer Different computers have different instruction
More informationStorage I/O Summary. Lecture 16: Multimedia and DSP Architectures
Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal
More informationA Fast Synchronous Pipelined DRAM Architecture with SRAM Buffers
A Fast Synchronous Pipelined DRAM Architecture with SRAM Buffers Chi-Weon Yoon, Yon-Kyun Im, Seon-Ho Han, Hoi-Jun Yoo and Tae-Sung Jung* Dept. of Electrical Engineering, KAIST *Samsung Electronics Co.,
More informationCS 265. Computer Architecture. Wei Lu, Ph.D., P.Eng.
CS 265 Computer Architecture Wei Lu, Ph.D., P.Eng. Part 3: von Neumann Architecture von Neumann Architecture Our goal: understand the basics of von Neumann architecture, including memory, control unit
More informationSlide Set 5. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng
Slide Set 5 for ENCM 501 in Winter Term, 2017 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary Winter Term, 2017 ENCM 501 W17 Lectures: Slide
More informationAccelerating computation with FPGAs
Accelerating computation with FPGAs Michael J. Flynn Maxeler Technologies and Stanford University M. J. Flynn Maxeler Technologies 1 Based on work done by my colleagues at Maxeler, especially Oskar Mencer,
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 3 Arithmetic for Computers Implementation Today Review representations (252/352 recap) Floating point Addition: Ripple
More informationDeep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Jongsoo Park Facebook AI System SW/HW Co-design Team Sep-21 2018 Team Introduction
More informationLecture 3 Machine Language. Instructions: Instruction Execution cycle. Speaking computer before voice recognition interfaces
Lecture 3 Machine Language Speaking computer before voice recognition interfaces 1 Instructions: Language of the Machine More primitive than higher level languages e.g., no sophisticated control flow Very
More informationSpring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand
Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture
ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture Jie Zhang, Miryeong Kwon, Changyoung Park, Myoungsoo Jung, Songkuk Kim Computer Architecture and Memory
More information8051 Overview and Instruction Set
8051 Overview and Instruction Set Curtis A. Nelson Engr 355 1 Microprocessors vs. Microcontrollers Microprocessors are single-chip CPUs used in microcomputers Microcontrollers and microprocessors are different
More informationInteger Representation
Integer Representation Announcements assign0 due tonight assign1 out tomorrow Labs start this week SCPD Note on ofce hours on Piazza Will get an email tonight about labs Goals for Today Introduction to
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationBryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition. Carnegie Mellon
Carnegie Mellon Floating Point 15-213/18-213/14-513/15-513: Introduction to Computer Systems 4 th Lecture, Sept. 6, 2018 Today: Floating Point Background: Fractional binary numbers IEEE floating point
More informationHEAT (Hardware enabled Algorithmic tester) for 2.5D HBM Solution
HEAT (Hardware enabled Algorithmic tester) for 2.5D HBM Solution Kunal Varshney, Open-Silicon Ganesh Venkatkrishnan, Open-Silicon Pankaj Prajapati, Open-Silicon May 9, 9, 2016 1 Agenda High Bandwidth Memory
More informationUnsigned Binary Integers
Unsigned Binary Integers Given an n-bit number x x n 1 n 2 1 0 n 12 xn 22 x12 x02 Range: 0 to +2 n 1 Example 2.4 Signed and Unsigned Numbers 0000 0000 0000 0000 0000 0000 0000 1011 2 = 0 + + 1 2 3 + 0
More informationUnsigned Binary Integers
Unsigned Binary Integers Given an n-bit number x x n 1 n 2 1 0 n 12 xn 22 x12 x02 Range: 0 to +2 n 1 Example 2.4 Signed and Unsigned Numbers 0000 0000 0000 0000 0000 0000 0000 1011 2 = 0 + + 1 2 3 + 0
More informationComputer Architecture. R. Poss
Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion
More information