Deep Learning Hardware Acceleration
1 Deep Learning Hardware Acceleration. Jorge Albericio+, Alberto Delmas Lascorz, Patrick Judd, Sayeh Sharify, Tayler Hetherington*, Natalie Enright Jerger, Tor Aamodt*, Andreas Moshovos. (+ now at NVIDIA)
2 Disclaimer The University of Toronto has filed patent applications for the mentioned technologies.
3 Deep Learning: Where Does the Time Go? Convolutional Neural Networks, e.g., image classification: the vast majority of execution time goes to inner products.
4 Deep Learning: Where Does the Time Go? The inner products are multiply-accumulate operations over activations and weights.
5 SIMD: Exploit Computation Structure. DaDianNao: 4K terms/cycle (16 tiles, each with 256 multiplier lanes).
6 Our Approach: improve by exploiting value properties while maintaining massive parallelism, wide memory accesses, SIMD lanes, and no modifications to the networks.
7 Longer-Term Goal: One Architecture to Rule Them All.
8 Value Properties to Exploit? Many zero and near-zero values.
9 Value Properties to Exploit? Varying precision needs.
10 Our Results: Performance. With accuracy preserved (no loss or at most 1%), Cnvlutin (ISCA 16), Stripes (MICRO 16), Tartan (arXiv + ICLR workshop), and Pragmatic achieve speedups in the 1.6x to 4.3x range over DaDianNao, which was itself ~3x faster than GPUs.
11 Our Results: Memory Footprint and Bandwidth. Proteus: 44% less memory bandwidth and footprint.
12 Roadmap: avoiding computations with zero values; performance from precision; performance from zero bits; reducing footprint and bandwidth.
13 #1: Skipping Ineffectual Activations
14 Cnvlutin (ISCA 16): many of the activation x weight multiplications are ineffectual.
15 Many activations and weights are intrinsically ineffectual (zero). About 45% of runtime activation values are zero, a fraction that is stable across inputs, yet no single activation is always zero. Zero activations are computed at runtime; zero weights are known in advance and are not pursued in this work.
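A minimal sketch (assuming NumPy and a layer output before its non-linearity) of how one might measure the fraction of ineffectual, i.e. zero, activations that Cnvlutin targets:

```python
import numpy as np

def zero_activation_fraction(pre_activation: np.ndarray) -> float:
    """Fraction of output activations that are exactly zero after ReLU.

    `pre_activation` is a layer's output before the non-linearity,
    e.g. shape (channels, height, width) for one convolutional layer.
    """
    activations = np.maximum(pre_activation, 0.0)  # ReLU
    return float(np.mean(activations == 0.0))

# Toy example: roughly half of a zero-mean input is negative,
# so roughly half of the post-ReLU activations are zero.
rng = np.random.default_rng(0)
layer_out = rng.standard_normal((256, 13, 13))
print(f"zero activations: {zero_activation_fraction(layer_out):.1%}")
```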
16 Cnvlutin: many ineffectual multiplications.
17 Cnvlutin: many more ineffectual multiplications once both operands a and b are considered.
18 Cnvlutin: beating fast and dumb SIMD is hard. On-the-fly ineffectual-product elimination yields performance and energy gains; optionally, a small accuracy loss buys additional performance.
19 Cnvlutin, no accuracy loss: +52% performance, -7% power, +5% area. Relaxing the ineffectual criterion buys better performance (another 6%), and even more with some accuracy loss.
20 Deep Learning: Convolutional Neural Networks. An input image passes through a stack of layers and the network outputs a classification ("beaver, maybe").
21 Deep Learning: Convolutional Networks. N filters of K x K x i weights slide over the i-channel input activations to produce N output activation maps. The vast majority of execution time is spent in the convolutional layers.
22 Why are there so many zero neurons? Hypothesis: each filter detects a feature; where the feature is absent, the output neuron is zero.
23 SIMD: Exploit Computation Structure. DaDianNao: 4K terms/cycle (16 tiles, each with 256 multiplier lanes).
24 Skipping Ineffectual Activations, Key Challenge: when processing all activations, all lanes operate in lock step.
25 Naïve Solution: 16 independent, narrow activation streams; but then there are no wide memory accesses.
26 Removing Zeroes: encode activations at the output of each layer, as they are written to Neuron Memory between layer i and layer i+1.
27 Maintaining Wide Accesses While Skipping Zeroes, #1: partition Neuron Memory (NM) into 16 slices over 16 banks; processing order does not matter.
28 Maintaining Wide Accesses While Skipping Zeroes, #2: fetch and maintain one container per slice; a container holds up to 16 non-zero neurons.
29 Maintaining Wide Accesses While Skipping Zeroes, #3: keep each neuron lane supplied with one neuron per cycle.
30 Maintaining Wide Accesses While Skipping Zeroes, #4: when a container is exhausted, fetch the next one within the same slice.
31 Maintaining Wide Accesses While Skipping Zeroes: a container stores only non-zero values, encoded as (value, 4-bit offset) pairs; one extra bit could mark a container as encoded vs. raw.
32 Inside Each Neuron Container: ZFNAf (Zero-Free Neuron Array Format) enables skipping of ineffectual neurons. Only the non-zero neurons plus their offsets are stored, at brick granularity.
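A minimal sketch of the ZFNAf idea, for illustration only (the brick size, field names, and in-memory layout below are simplifications of the hardware format described on the slide): each 16-entry brick keeps only its non-zero neurons together with a 4-bit offset recording each value's original position.

```python
from typing import List, Tuple

BRICK = 16  # neurons per brick; offsets fit in 4 bits

def zfnaf_encode(brick: List[int]) -> List[Tuple[int, int]]:
    """Encode one brick as (value, offset) pairs, keeping only non-zeros."""
    assert len(brick) == BRICK
    return [(v, off) for off, v in enumerate(brick) if v != 0]

def zfnaf_decode(encoded: List[Tuple[int, int]]) -> List[int]:
    """Reconstruct the dense brick from its (value, offset) pairs."""
    brick = [0] * BRICK
    for value, offset in encoded:
        brick[offset] = value
    return brick

dense = [0, 7, 0, 0, 3, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 1]
packed = zfnaf_encode(dense)       # [(7, 1), (3, 4), (9, 10), (1, 15)]
assert zfnaf_decode(packed) == dense
```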
33 Cnvlutin, No Accuracy Loss: per-network speedup results (higher is better).
34 Loosening the Ineffectual Neuron Criterion: treat neurons close to zero as zero. Open questions: are such thresholds robust, and how do we find the best ones?
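A minimal sketch of the loosened criterion (the threshold values and the use of a single global threshold are assumptions for illustration, not the talk's calibration procedure): treat activations with magnitude at or below a threshold as ineffectual and see how many more multiplications could be skipped.

```python
import numpy as np

def skippable_fraction(activations: np.ndarray, threshold: float) -> float:
    """Fraction of activations treated as ineffectual under |a| <= threshold."""
    return float(np.mean(np.abs(activations) <= threshold))

rng = np.random.default_rng(1)
acts = np.maximum(rng.standard_normal(10_000), 0.0)  # post-ReLU activations
for t in (0.0, 0.05, 0.1):
    print(f"threshold={t:>4}: skip {skippable_fraction(acts, t):.1%}")
```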
35 #2: Exploiting Precision
36 Another Property of CNNs: is the operand precision required really a fixed 16 bits?
37 CNNs: Precision Requirements Vary. The operand precision required is not fixed; it varies from 5 bits to 13 bits.
38 Stripes: with p-bit operands, execution time scales as p/16 of the baseline (speedup = 16/p). Performance + energy efficiency + an accuracy knob.
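A minimal sketch of the execution-time relation on this slide (the per-layer precisions and runtime shares below are made-up numbers): each layer ideally runs in p/16 of its baseline time, so the network-level speedup follows from summing the scaled per-layer times.

```python
def stripes_speedup(layers):
    """Ideal Stripes speedup for a network.

    `layers` is a list of (baseline_time_fraction, precision_bits) pairs;
    each layer's time scales from 16-bit down to p-bit as p / 16.
    """
    baseline = sum(frac for frac, _ in layers)
    accelerated = sum(frac * p / 16.0 for frac, p in layers)
    return baseline / accelerated

# Hypothetical per-layer profile: (share of baseline runtime, required bits).
profile = [(0.30, 9), (0.45, 7), (0.25, 11)]
print(f"ideal speedup: {stripes_speedup(profile):.2f}x")
```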
39 Stripes, Key Concept: a bit-parallel unit processes 2 2b x 2b terms per step; naive bit-serial processing yields only 2 1b x 2b terms per step, so Stripes processes 4 1b x 2b terms in parallel to keep throughput. The devil is in the details: carefully choose what to serialize and what to reuse, keeping the same input wires as the baseline.
40 SIMD: Exploit Computation Structure. DaDianNao: 4K terms/cycle.
41 Stripes Bit-Serial Engine.
42 Compensating for Bit-Serial Compute Bandwidth Loss: each tile processes 16 windows concurrently, 16 neurons each, against 16 filters, producing 16 partial output neurons.
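A minimal functional sketch of bit-serial processing (not the Stripes datapath itself, and with activations assumed unsigned for simplicity): activations are fed one bit per cycle, each cycle adds the weights gated by that activation bit into a shifted accumulator, and after p cycles the result equals the bit-parallel inner product.

```python
def bit_serial_dot(activations, weights, precision=16):
    """Inner product with activations processed one bit per cycle (MSB first).

    Activations are unsigned p-bit integers; weights are signed integers.
    The loop takes `precision` cycles, which is what lets Stripes trade
    reduced precision for execution time.
    """
    acc = 0
    for bit in range(precision - 1, -1, -1):        # one cycle per bit
        column = sum(w for a, w in zip(activations, weights)
                     if (a >> bit) & 1)              # 1-bit x weight terms
        acc = (acc << 1) + column                    # shift-and-add
    return acc

acts, wts = [3, 5, 12, 0], [2, -1, 4, 7]
assert bit_serial_dot(acts, wts, precision=4) == sum(a * w for a, w in zip(acts, wts))
```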
43 Stripes, no accuracy loss: +192% performance, -57% energy, +32% area. More performance with accuracy loss. (Results exclude the older LeNet and Convnet networks.)
44 Stripes: Performance Boost. Per-network speedup results (higher is better).
45 Fully-Connected Layers? Each tile has no weight reuse, so it cannot process 16 windows.
46 Fully-Connected Layers: each synapse connects input neurons to output neurons with its own weight; with no weight reuse, the 16-window trick does not apply.
47 TARTAN: Accelerating Fully-Connected Layers. Illustrated on a baseline bit-parallel engine with one activation and one weight, both 2 bits.
48 Bit-Parallel Engine, processing one activation x weight. Cycle 1: activation a1 and weight W.
49 Bit-Parallel Engine, processing another pair. Cycle 2: activation a2 and weight W; a1 x W + a2 x W completes over two cycles.
50 TARTAN engine: 2 x 1b activation inputs; 2b or 2 x 1b weight inputs.
51 TARTAN, Convolutional Layer Processing. Cycle 1: load the 2b weight into the BRs.
52 TARTAN, weight x 1st bit of two activations. Cycle 2: multiply W with bit 1 of activations a1 and a2.
53 TARTAN, weight x 2nd bit of two activations. Cycle 3: multiply W with the 2nd bit of a1 and a2, and load a new W into the BR. A 3-stage pipeline completes two (2b activation x 2b weight) products.
54 TARTAN, Fully-Connected Layers, Loading Weights. What is different? Weights cannot be reused. Cycle 1: load the first bit of two different weights into the ARs.
55 TARTAN, Fully-Connected Layers, Loading Weights. Cycle 2: load the 2nd bit of w1 and w2 into the ARs; a different weight is loaded into each unit.
56 TARTAN, Fully-Connected Layers, Processing Activations. Cycle 3: move the ARs into the BRs and proceed as before over two cycles. A 5-stage pipeline completes two (2b activation x 2b weight) products.
57 TARTAN, Result Summary. Bit-serial TARTAN: 2.4x faster than DaDianNao, 1.25x more energy efficient at the same frequency, 1.5x area overhead. 2-bit-at-a-time TARTAN: 1.6x faster than DaDianNao, roughly the same energy efficiency, 1.25x area overhead.
58 Bit-Pragmatic Engine: the information content of the operands varies.
59 Inner Products: we want to compute A x B. Looking at A and B, which bits really matter?
60 Zero Bit Content, 16-bit fixed point: only 8% of bits are non-zero once precision is reduced, about 15% otherwise.
61 Zero Bit Content, 8-bit quantized (TensorFlow-like): only 27% of bits are non-zero.
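A minimal sketch of the measurement behind these numbers (the sample values below are made up): count the fraction of 1 bits across fixed-point activation values, which is the work Pragmatic actually has to perform.

```python
def essential_bit_fraction(values, bits=16):
    """Fraction of 1 bits across `bits`-bit two's-complement representations."""
    total_ones = 0
    for v in values:
        total_ones += bin(v & ((1 << bits) - 1)).count("1")
    return total_ones / (len(values) * bits)

# Hypothetical post-ReLU activations quantized to 16-bit fixed point.
sample = [0, 0, 3, 0, 17, 0, 142, 1, 0, 6]
print(f"non-zero bits: {essential_bit_fraction(sample):.1%}")
```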
62 Pragmatic Concept: start from the bit-parallel engine.
63 Pragmatic Concept: use shift-and-add. Simply modify Stripes? Too large, and it requires cross-lane synchronization.
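A minimal functional sketch of the shift-and-add idea (not Pragmatic's datapath, and assuming non-negative activations): decompose each activation into the positions of its 1 bits and accumulate the weight shifted by each position, so the work is proportional to the number of essential bits rather than the full precision.

```python
def one_bit_positions(value: int):
    """Positions of the 1 bits of a non-negative activation."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

def pragmatic_dot(activations, weights):
    """Inner product computed with one shift-and-add per essential bit."""
    acc = 0
    steps = 0
    for a, w in zip(activations, weights):
        for off in one_bit_positions(a):  # zero bits are skipped entirely
            acc += w << off               # shift-and-add instead of multiply
            steps += 1
    return acc, steps

acts, wts = [3, 0, 12, 1], [2, 9, -1, 4]
result, work = pragmatic_dot(acts, wts)
assert result == sum(a * w for a, w in zip(acts, wts))
print(f"result={result}, shift-and-add steps={work}")  # 5 steps for 4 products
```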
64 Bit-Parallel Engine.
65 Stripes bit-serial engine.
66 Pragmatic, a naive Stripes extension? Problem #1: too large. Each lane would need its own wide shifter and big adders to handle arbitrary bit offsets, roughly 3.7x area overhead just for the datapath.
67 Solution to #1: 2-Stage Shifting. Process 1-bit positions in groups whose offsets differ by at most N: a shared coarse shift plus a small per-lane shift. Some opportunity loss, but much lower area overhead.
68 Solution to #1: 2-Stage Shifting, example with a small N: terms whose bit position falls outside the current N-bit window wait for a later group. Some opportunity loss, but much lower area overhead.
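A minimal cycle-count model of 2-stage shifting, a simplification under the assumption that each cycle all lanes in a group share one base shift and may consume one 1-bit position within N bits of that base; terms outside the window stall, which is the "some opportunity loss" the slide mentions.

```python
def two_stage_cycles(lane_bit_positions, n=2):
    """Cycles to drain one group of lanes under 2-stage shifting.

    `lane_bit_positions` holds, per lane, the positions (low to high) of the
    1 bits still to be processed.  Each cycle a shared base shift is chosen
    (the smallest pending position) and every lane may consume at most one
    position within [base, base + n]; the rest wait for a later cycle.
    """
    pending = [sorted(p) for p in lane_bit_positions]
    cycles = 0
    while any(pending):
        base = min(p[0] for p in pending if p)
        for p in pending:
            if p and p[0] <= base + n:   # fits in the small per-lane shifter
                p.pop(0)
        cycles += 1
    return cycles

# Four lanes with different essential-bit patterns.
group = [[0, 3], [7], [2], []]
print(two_stage_cycles(group, n=2))  # 3 cycles: one more than the per-lane ideal of 2
```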
69 Lane Synchronization: lanes have different numbers of 1 bits and drift out of sync, which could require fetching up to 256 different activations from NM. Options: keep all lanes synchronized, which costs nothing, or add an extra register for weights so columns can advance by one, which has some cost but much better performance.
70 Speedup and Energy Efficiency vs. DaDianNao (results chart).
71 Bit-Pragmatic, no accuracy loss: +31% performance, -48% energy, with some area overhead. Better with 8-bit quantization; 4.3x with encoding.
72 Reducing Memory Footprint and Bandwidth
73 Proteus: the operand precision required varies, so store values in reduced precision in memory. Less bandwidth, less energy.
74 Proteus: pick a per-layer precision for both weights (synapses) and data (activations/neurons). A layered extension, compatible with existing systems.
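A minimal sketch of the footprint saving (the per-layer precisions and value counts below are hypothetical): Proteus-style storage packs each layer's 16-bit values into only as many bits as that layer needs, so the saving is simply the precision ratio weighted by layer size.

```python
def packed_footprint_bits(layers):
    """Total bits when each layer stores its values at its own precision.

    `layers` is a list of (value_count, precision_bits) pairs.
    """
    return sum(count * bits for count, bits in layers)

# Hypothetical network: (number of activations/weights, required precision).
layers = [(2_000_000, 9), (4_000_000, 7), (1_000_000, 11)]
baseline = sum(count * 16 for count, _ in layers)
packed = packed_footprint_bits(layers)
print(f"footprint reduction: {1 - packed / baseline:.1%}")
```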
75 Conventional Format, Base Precision: data physically aligns with the unit inputs.
76 Conventional Format, Base Precision, packed naively: a shuffling network is needed to route synapses, a 4K x 4K crossbar taking any of 4K input bits to any of 4K output bit positions.
77 Proteus Key Idea: pack along data-lane columns. Then only local shufflers (16b input, 16b output) are needed, which is much simpler.
78 Proteus 44% less memory bandwidth
79 What's Next: training; a prototype; design-space exploration (lower-end configurations); a unified architecture (dispatcher + compute); other workloads such as computational photography and general-purpose compute.
80 A Value-Based Approach to Acceleration: more properties remain to be discovered and exploited; for example, filters overlap significantly. CNNs are one class; other networks use the same layers with different relative importance. Training is another target.