direct hardware mapping of cnns on fpga-based smart cameras

Size: px
Start display at page:

Download "direct hardware mapping of cnns on fpga-based smart cameras"

Transcription

1 direct hardware mapping of cnns on fpga-based smart cameras Workshop on Architecture of Smart Cameras Kamel ABDELOUAHAB, Francois BERRY, Maxime PELCAT, Jocelyn SEROT, Jean-Charles QUINTON Cordoba, June 3, 2017 DREAM Institut Pascal, Clermont-Ferrand, France

2 Convolutional Neural Networks (CNNs) Deep CNNs = The state-of-the-art in image classification Good candidates for implementation on smart camera nodes Computationally intensive 1 A. Canziani and al. An Analysis of Deep CNN Models for Practical Applications Arxiv J. Malik and al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation IEEE CVPR

3 CNN Structure... "cat" (conv+act+pool) x L FC - Convolutional layers Feature extractors Computationally intensive 90 % of execution time - Fully connected layers Classifiers Memory heavy conv1 conv2 conv3 conv4 conv5 fc6 fc7 fc8 convlayers 2.41 GOP 97% fclayers 0.06 GOP 3% (a) - Computation Load fclayers 468 Mbits 98% convlayers 18 Mbits 2% (b) - Parameter Size 2

4 Convolutional Layers - Deep CNNs are deep because of convlayers : from 5 to A large amount of parallelism... [ f[n, i, j] = act b[n] + C K K c=1 p=1 q=1 ] Φ[c, i + p, j + q].θ[n, c, p, q] (1) C : Number of input channels conv00 conv01 Σ0 act0 0 N : Number of output units conv02 K : Size of the convolution kernels Θ : Learned parameters of a given layer b : Learned bias offset 0 1 conv10 conv11 conv12 conv20 conv21 conv22 Σ1 Σ2 act1 act2 1 2 act : Non linear activation function -N C Convolutions = N C K 2 MAC op -Example N=5,C=3 -Real life CNNs : N=256,C=96,K=5 2 conv30 conv31 conv32 conv40 conv41 conv42 Σ3 Σ4 act3 act

5 Hardware Accelerators for CNNs Trained on GPU Clusters Deployed on dedicated hardware For smart camera nodes: FPGAs Fine grain parallelism Hardware flexibility Low power consumption (vs GPUs) Energy Efficiency Throughput CPU GPU FPGA dev 4

6 Architectures of FPGA-Based CNN accelerators Custom SIMD processors - OpenCL kernels Roofline model : Computation to communication trade-off Fixed-point arithmetic for CNNs 1 Peemen - Memory-centric accelerator design for CNNs. In ICCD 13 2 Qiu - Going Deeper with Embedded FPGA Platform for CNNs. FPGA 16 3 Suda - Throughput-Optimized OpenCL FPGA Accelerator for Large-Scale CNNs, FPGA Meloni - Curbing the Roofline : a Scalable and Flexible Architecture for CNNs on FPGA 5 Zhang - Optimizing FPGA-based Accelerator Design for Deep CNNs. FPGA 15 5 Gysel - Hardware-oriented Approximation of CNNs - ICLR 16 5

7 Dataflow MoC for CNNs - Purely data driven execution model - CNNs : a dataflow process network (DPN) - Apply the dataflow MoC to CNNs - Stream based processing Actors : Granularity? Communication channels : FIFO? Literature: NeuFlow, fpgaconvnet... A limited number of processing tiles 1 Farabet - NeuFlow: A runtime reconfigurable dataflow processor for vision - CVPR 11 2 Venieris - fpgaconvnet: Automated Mapping of CNNs on FPGAs 6

8 Direct Hardware Mapping of CNNs Our contribution: Direct Hardware Mapping (DHM) of CNN entities Each CNN actor is physically mapped on the device Each CNN connection is mapped on a signal Process feature maps on the fly / No memory bandwidth limitation 1 op/clk : Throughput ++ limitation : Area and resources availability 7

9 Memory Optimization 1 Memory print of implementation = Pipelined convolution engine Pipelined Convolution = Neighborhood extractor + MAC unit N C convolution engine per layer, N C K 2 MB of storage. Solution : Factorize neighborhood extractors (NEF) p22 p21 p20 line buffer 0 p00 p12 p11 p10 line buffer 1 p01.. p02 p01 p00 pkk Figure: Architecture of a 3 3 neighbourhood extractor : 2 Buffers with image length size are required to perform a 3 3 convolution on streams of pixels p ij Figure: Architecture of a parallel 3 3 MAC unit. Extracted pixels are weighted using 9 multipliers and accumulated to compute one convolution per clock-cycle. 8

10 Memory Optimization 2 -Factorizing Neighbourhood Extraction reduces the memory buffers by a factor of N conv 00 conv 01 conv 02 Σ 0 act ne act 0 0 conv 10 conv 11 Σ 1 act 1 1 act 1 conv 12 1 conv 20 conv 21 conv 22 Σ 2 act ne act 2 2 conv 30 conv 31 conv 32 Σ 3 act 3 3 act 3 conv 40 conv 41 conv 42 Σ 4 act ne act 4 9

11 Memory Optimization wo/ nef w/ nef Required Memory (Bits) conv1 conv2 conv3 conv4 conv5 Figure: Ratio of memory requirements between architectures w/ and wo/ NEF for Alexnet convolutional layers: 390% less memory is required when implementing NEF 10

12 Computation : Fixed point Arithmetic N C K 2 MAC operation per layer : Limited by # DSPs LeNet5 conv2 (C = 6, N = 16, K = 5) requires 2400 multipliers. Available Multipliers in the biggest Cyclone V Multiplication with Logic Elements - Fixed point arithmetic J O(nbits 2 ) acc (%) Logic Elements (kle) Bit-width (bits) LeNet5 CIFAR10 SVHN

13 Multiplication with logic elements [ f[n, i, j] = act b[n] + C K c=1 p=1 q=1 - Θ[n, c, p, q] : Pre-learned convolution kernel : Considered as constant. - Hard-code as generics that parametrize convolution engines. K ] Φ[c, i + p, j + q].θ[n, c, p, q] Θ[n, c, p, q] = 0 : Remove multiplier and connection Θ[n, c, p, q] = 1 : Replace with a straightforward signal Θ[n, c, p, q] = pow2 : Multiplication with shift registers 1100 Zeros Ones Pow2 57% Lenet 3% 40% 33% Cifar10 25% 16% 37% SVHN 13% 46% Hardware resource of a MAC operation (ALM) % Weights with a null / power-of-two value 12

14 Haddoc2 Framework - From High level CNN Model to Low Level Hardware Description Energy Efficiency.prototxt toplevel.vhd Throughput.caffemodel Haddoc 2 params.vhd Caffe Hardware dev - Direct Hardware Mapping : Structural RTL transcription of the CNN graph - Constant multipliers : Convolution kernels hard-coded as VHDL generics - Platform and device independent VHDL 1 Jia - Caffe: Convolutional Architecture for Fast Feature Embedding. In ACM International Conference on Multimedia

15 Haddoc2 Framework 14

16 Implementation Results with Haddoc2 a b LeNet5 FaceDetect CarType Logic Elements (ALMs) (44%) 6158 (5%) (42%) DSP Blocks 1 0 (0 %) 0 (0%) 0 (0%) Block Memory Bits 2752 (1%) (1%) (1%) Frequency (MHz) Processing capabilities (Gops/s) Slices (88%) 6221 (11%) (89%) DSP Blocks 1 0 (0%) 0 (0%) 0 (0%) LUTs as Memory 420 (1%) 1458 (2%) 1154 (1%) Frequency (MHz) Processing capabilities (Gops/s) Table: Resource Utilization of the Haddoc2-generated convolutional layers with 5-bit representation on: a- an Intel Cyclone V FPGA, b- a Xilinx Kintex 7 FPGA. 15

17 Thanks for listening! 16

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School

More information

arxiv: v1 [cs.dc] 20 Nov 2017

arxiv: v1 [cs.dc] 20 Nov 2017 TACTICS TO DIRECTLY MAP CNN GRAPHS ON EMBEDDED FPGAS Kamel Abdelouahab 1, Maxime Pelcat 1,2, Jocelyn Sérot 1, Cédric Bourrasset 3, and François Berry 1 arxiv:1712.04322v1 [cs.dc] 20 Nov 2017 1 Institut

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms

Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk

More information

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru

More information

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)

More information

Accelerating CNN inference on FPGAs: A Survey

Accelerating CNN inference on FPGAs: A Survey Accelerating CNN inference on FPGAs: A Survey Kamel Abdelouahab, Maxime Pelcat, François Berry, Jocelyn Sérot To cite this version: Kamel Abdelouahab, Maxime Pelcat, François Berry, Jocelyn Sérot. Accelerating

More information

Revolutionizing the Datacenter

Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Xiaofan Zhang1, Mohamed El Hadedy1, Wen-mei Hwu1, Nam Sung Kim1, Jinjun Xiong2, Deming Chen1 1 University of Illinois Urbana-Champaign

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks Yu-Hsin Chen 1, Joel Emer 1, 2, Vivienne Sze 1 1 MIT 2 NVIDIA 1 Contributions of This Work A novel energy-efficient

More information

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang Profiling the Performance of Binarized Neural Networks Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang 1 Outline Project Significance Prior Work Research Objectives Hypotheses Testing Framework

More information

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Chen Zhang 1, Peng Li 3, Guangyu Sun 1,2, Yijin Guan 1, Bingjun Xiao 3, Jason Cong 1,2,3 1 Peking University 2 PKU/UCLA Joint

More information

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Mohammad Motamedi, Philipp Gysel, Venkatesh Akella and Soheil Ghiasi Electrical and Computer Engineering Department, University

More information

FPGA implementations of Histograms of Oriented Gradients in FPGA

FPGA implementations of Histograms of Oriented Gradients in FPGA FPGA implementations of Histograms of Oriented Gradients in FPGA C. Bourrasset 1, L. Maggiani 2,3, C. Salvadori 2,3, J. Sérot 1, P. Pagano 2,3 and F. Berry 1 1 Institut Pascal- D.R.E.A.M - Aubière, France

More information

Reconfigurable Acceleration of 3D-CNNs for Human Action Recognition with Block Floating-Point Representation

Reconfigurable Acceleration of 3D-CNNs for Human Action Recognition with Block Floating-Point Representation Reconfigurable Acceleration of 3D-CNNs for Human Action Recognition with Block Floating-Point Representation Hongxiang Fan, Ho-Cheung Ng, Shuanglong Liu, Zhiqiang Que, Xinyu Niu, Wayne Luk Dept. of Computing,

More information

A HOG-based Real-time and Multi-scale Pedestrian Detector Demonstration System on FPGA

A HOG-based Real-time and Multi-scale Pedestrian Detector Demonstration System on FPGA Institute of Microelectronic Systems A HOG-based Real-time and Multi-scale Pedestrian Detector Demonstration System on FPGA J. Dürre, D. Paradzik and H. Blume FPGA 2018 Outline Pedestrian detection with

More information

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Abstract Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

Brainchip OCTOBER

Brainchip OCTOBER Brainchip OCTOBER 2017 1 Agenda Neuromorphic computing background Akida Neuromorphic System-on-Chip (NSoC) Brainchip OCTOBER 2017 2 Neuromorphic Computing Background Brainchip OCTOBER 2017 3 A Brief History

More information

DNN Accelerator Architectures

DNN Accelerator Architectures DNN Accelerator Architectures ISCA Tutorial (2017) Website: http://eyeriss.mit.edu/tutorial.html Joel Emer, Vivienne Sze, Yu-Hsin Chen 1 2 Highly-Parallel Compute Paradigms Temporal Architecture (SIMD/SIMT)

More information

Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster

Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster Energy-Efficient CNN Implementation on a Deeply Pipelined Cluster Chen Zhang 1, Di Wu, Jiayu Sun 1, Guangyu Sun 1,3, Guojie Luo 1,3, and Jason Cong 1,,3 1 Center for Energy-Efficient Computing and Applications,

More information

CNNS FROM THE BASICS TO RECENT ADVANCES. Dmytro Mishkin Center for Machine Perception Czech Technical University in Prague

CNNS FROM THE BASICS TO RECENT ADVANCES. Dmytro Mishkin Center for Machine Perception Czech Technical University in Prague CNNS FROM THE BASICS TO RECENT ADVANCES Dmytro Mishkin Center for Machine Perception Czech Technical University in Prague ducha.aiki@gmail.com OUTLINE Short review of the CNN design Architecture progress

More information

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning, Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110

More information

Simplify System Complexity

Simplify System Complexity 1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

In Live Computer Vision

In Live Computer Vision EVA 2 : Exploiting Temporal Redundancy In Live Computer Vision Mark Buckler, Philip Bedoukian, Suren Jayasuriya, Adrian Sampson International Symposium on Computer Architecture (ISCA) Tuesday June 5, 2018

More information

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:

More information

The OpenVX Computer Vision and Neural Network Inference

The OpenVX Computer Vision and Neural Network Inference The OpenVX Computer and Neural Network Inference Standard for Portable, Efficient Code Radhakrishna Giduthuri Editor, OpenVX Khronos Group radha.giduthuri@amd.com @RadhaGiduthuri Copyright 2018 Khronos

More information

Deploying Deep Neural Networks in the Embedded Space

Deploying Deep Neural Networks in the Embedded Space Deploying Deep Neural Networks in the Embedded Space Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis 2 nd International Workshop on Embedded and Mobile Deep Learning (EMDL) MobiSys,

More information

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm

More information

FPGAs for Image Processing

FPGAs for Image Processing FPGAs for Image Processing A DSL and program transformations Rob Stewart Greg Michaelson Idress Ibrahim Deepayan Bhowmik Andy Wallace Paulo Garcia Heriot-Watt University 10 May 2016 What I will say 1.

More information

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,

More information

Spatial Localization and Detection. Lecture 8-1

Spatial Localization and Detection. Lecture 8-1 Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday

More information

Lecture: Deep Convolutional Neural Networks

Lecture: Deep Convolutional Neural Networks Lecture: Deep Convolutional Neural Networks Shubhang Desai Stanford Vision and Learning Lab 1 Today s agenda Deep convolutional networks History of CNNs CNN dev Architecture search 2 Previously argmax

More information

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1 fpgaconvnet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs Stylianos I. Venieris, Student Member, IEEE, and Christos-Savvas

More information

Computer Engineering Mekelweg 4, 2628 CD Delft The Netherlands MSc THESIS. Exploring Convolutional Neural Networks on the

Computer Engineering Mekelweg 4, 2628 CD Delft The Netherlands   MSc THESIS. Exploring Convolutional Neural Networks on the Computer Engineering Mekelweg 4, 2628 CD Delft The Netherlands http://ce.et.tudelft.nl/ 2018 MSc THESIS Exploring Convolutional Neural Networks on the ρ-vex architecture Jonathan Tetteroo Abstract As machine

More information

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently

More information

Xilinx ML Suite Overview

Xilinx ML Suite Overview Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame

More information

Reconfigurable Cell Array for DSP Applications

Reconfigurable Cell Array for DSP Applications Outline econfigurable Cell Array for DSP Applications Chenxin Zhang Department of Electrical and Information Technology Lund University, Sweden econfigurable computing Coarse-grained reconfigurable cell

More information

Low-Power Neural Processor for Embedded Human and Face detection

Low-Power Neural Processor for Embedded Human and Face detection Low-Power Neural Processor for Embedded Human and Face detection Olivier Brousse 1, Olivier Boisard 1, Michel Paindavoine 1,2, Jean-Marc Philippe, Alexandre Carbon (1) GlobalSensing Technologies (GST)

More information

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Jongsoo Park Facebook AI System SW/HW Co-design Team Sep-21 2018 Team Introduction

More information

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural

More information

Reconfigurable Computing. Introduction

Reconfigurable Computing. Introduction Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

High-Throughput and High-Accuracy Classification with Convolutional Ternary Neural Networks. Frédéric Pétrot, Adrien Prost-Boucle, Alban Bourge

High-Throughput and High-Accuracy Classification with Convolutional Ternary Neural Networks. Frédéric Pétrot, Adrien Prost-Boucle, Alban Bourge High-Throughput and High-Accuracy Classification with Convolutional Ternary Neural Networks Frédéric Pétrot, Adrien Prost-Boucle, Alban Bourge International Workshop on Highly Efficient Neural Processing

More information

USING DATAFLOW TO OPTIMIZE ENERGY EFFICIENCY OF DEEP NEURAL NETWORK ACCELERATORS

USING DATAFLOW TO OPTIMIZE ENERGY EFFICIENCY OF DEEP NEURAL NETWORK ACCELERATORS ... USING DATAFLOW TO OPTIMIZE ENERGY EFFICIENCY OF DEEP NEURAL NETWORK ACCELERATORS... Yu-Hsin Chen Massachusetts Institute of Technology Joel Emer Nvidia and Massachusetts Institute of Technology Vivienne

More information

Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm

Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction

More information

Distributed Vision Processing in Smart Camera Networks

Distributed Vision Processing in Smart Camera Networks Distributed Vision Processing in Smart Camera Networks CVPR-07 Hamid Aghajan, Stanford University, USA François Berry, Univ. Blaise Pascal, France Horst Bischof, TU Graz, Austria Richard Kleihorst, NXP

More information

BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques

BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques Jingyang Zhu 1, Zhiliang Qian 2*, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed

More information

Simplify System Complexity

Simplify System Complexity Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016

The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016 The Path to Embedded Vision & AI using a Low Power Vision DSP Yair Siegel, Director of Segment Marketing Hotchips August 2016 Presentation Outline Introduction The Need for Embedded Vision & AI Vision

More information

Semantic Segmentation

Semantic Segmentation Semantic Segmentation UCLA:https://goo.gl/images/I0VTi2 OUTLINE Semantic Segmentation Why? Paper to talk about: Fully Convolutional Networks for Semantic Segmentation. J. Long, E. Shelhamer, and T. Darrell,

More information

Deep Learning Hardware Acceleration

Deep Learning Hardware Acceleration * Deep Learning Hardware Acceleration Jorge Albericio + Alberto Delmas Lascorz Patrick Judd Sayeh Sharify Tayler Hetherington* Natalie Enright Jerger Tor Aamodt* + now at NVIDIA Andreas Moshovos Disclaimer

More information

Switched by Input: Power Efficient Structure for RRAMbased Convolutional Neural Network

Switched by Input: Power Efficient Structure for RRAMbased Convolutional Neural Network Switched by Input: Power Efficient Structure for RRAMbased Convolutional Neural Network Lixue Xia, Tianqi Tang, Wenqin Huangfu, Ming Cheng, Xiling Yin, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E., Tsinghua

More information

arxiv: v1 [cs.cv] 11 Feb 2018

arxiv: v1 [cs.cv] 11 Feb 2018 arxiv:8.8v [cs.cv] Feb 8 - Partitioning of Deep Neural Networks with Feature Space Encoding for Resource-Constrained Internet-of-Things Platforms ABSTRACT Jong Hwan Ko, Taesik Na, Mohammad Faisal Amir,

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Efficient Edge Computing for Deep Neural Networks and Beyond Vivienne Sze In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang, Sertac

More information

Towards Design Space Exploration and Optimization of Fast Algorithms for Convolutional Neural Networks (CNNs) on FPGAs

Towards Design Space Exploration and Optimization of Fast Algorithms for Convolutional Neural Networks (CNNs) on FPGAs Preprint: Accepted at 22nd IEEE Design, Automation & Test in Europe Conference & Exhibition (DATE 19) Towards Design Space Exploration and Optimization of Fast Algorithms for Convolutional Neural Networks

More information

THE NVIDIA DEEP LEARNING ACCELERATOR

THE NVIDIA DEEP LEARNING ACCELERATOR THE NVIDIA DEEP LEARNING ACCELERATOR INTRODUCTION NVDLA NVIDIA Deep Learning Accelerator Developed as part of Xavier NVIDIA s SOC for autonomous driving applications Optimized for Convolutional Neural

More information

YOLO9000: Better, Faster, Stronger

YOLO9000: Better, Faster, Stronger YOLO9000: Better, Faster, Stronger Date: January 24, 2018 Prepared by Haris Khan (University of Toronto) Haris Khan CSC2548: Machine Learning in Computer Vision 1 Overview 1. Motivation for one-shot object

More information

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab. [ICIP 2017] Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab., POSTECH Pedestrian Detection Goal To draw bounding boxes that

More information

arxiv: v1 [cs.cv] 20 May 2016

arxiv: v1 [cs.cv] 20 May 2016 arxiv:1605.06402v1 [cs.cv] 20 May 2016 Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks By Philipp Matthias Gysel B.S. (Bern University of Applied Sciences, Switzerland) 2012

More information

CNP: An FPGA-based Processor for Convolutional Networks

CNP: An FPGA-based Processor for Convolutional Networks Clément Farabet clement.farabet@gmail.com Computational & Biological Learning Laboratory Courant Institute, NYU Joint work with: Yann LeCun, Cyril Poulet, Jefferson Y. Han Now collaborating with Eugenio

More information

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm Deep Learning on Arm Cortex-M Microcontrollers Rod Crawford Director Software Technologies, Arm What is Machine Learning (ML)? Artificial Intelligence Machine Learning Deep Learning Neural Networks Additional

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),

More information

Convolutional Neural Networks

Convolutional Neural Networks NPFL114, Lecture 4 Convolutional Neural Networks Milan Straka March 25, 2019 Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics unless otherwise

More information

DeepIndex for Accurate and Efficient Image Retrieval

DeepIndex for Accurate and Efficient Image Retrieval DeepIndex for Accurate and Efficient Image Retrieval Yu Liu, Yanming Guo, Song Wu, Michael S. Lew Media Lab, Leiden Institute of Advance Computer Science Outline Motivation Proposed Approach Results Conclusions

More information

Deep Learning Requirements for Autonomous Vehicles

Deep Learning Requirements for Autonomous Vehicles Deep Learning Requirements for Autonomous Vehicles Pierre Paulin, Director of R&D Synopsys Inc. Chipex, 1 May 2018 1 Agenda Deep Learning and Convolutional Neural Networks for Embedded Vision Automotive

More information

How to Estimate the Energy Consumption of Deep Neural Networks

How to Estimate the Energy Consumption of Deep Neural Networks How to Estimate the Energy Consumption of Deep Neural Networks Tien-Ju Yang, Yu-Hsin Chen, Joel Emer, Vivienne Sze MIT 1 Problem of DNNs Recognition Smart Drone AI Computation DNN 15k 300k OP/Px DPM 0.1k

More information

FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review

FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review Date of publication 2018 00, 0000, date of current version 2018 00, 0000. Digital Object Identifier 10.1109/ACCESS.2018.2890150.DOI arxiv:1901.00121v1 [cs.ne] 1 Jan 2019 FPGA-based Accelerators of Deep

More information

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs

More information

Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks Chen Zhang 1,2,3, Zhenman Fang 2, Peipei Zhou 2, Peichen Pan 3, and Jason Cong 1,2,3 1 Center for Energy-Efficient

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Parallel Reconfigurable Hardware Architectures for Video Processing Applications

Parallel Reconfigurable Hardware Architectures for Video Processing Applications Parallel Reconfigurable Hardware Architectures for Video Processing Applications PhD Dissertation Presented by Karim Ali Supervised by Prof. Jean-Luc Dekeyser Dr. HdR. Rabie Ben Atitallah Parallel Reconfigurable

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017 Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Computer Vision Lecture 16 Deep Learning for Object Categorization 14.01.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar registration period

More information

Graph Streaming Processor

Graph Streaming Processor Graph Streaming Processor A Next-Generation Computing Architecture Val G. Cook Chief Software Architect Satyaki Koneru Chief Technology Officer Ke Yin Chief Scientist Dinakar Munagala Chief Executive Officer

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and

More information

Intel HLS Compiler: Fast Design, Coding, and Hardware

Intel HLS Compiler: Fast Design, Coding, and Hardware white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager

More information

Software Defined Hardware

Software Defined Hardware Software Defined Hardware For data intensive computation Wade Shen DARPA I2O September 19, 2017 1 Goal Statement Build runtime reconfigurable hardware and software that enables near ASIC performance (within

More information

INTRODUCTION TO FPGA ARCHITECTURE

INTRODUCTION TO FPGA ARCHITECTURE 3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)

More information

Fully Convolutional Networks for Semantic Segmentation

Fully Convolutional Networks for Semantic Segmentation Fully Convolutional Networks for Semantic Segmentation Jonathan Long* Evan Shelhamer* Trevor Darrell UC Berkeley Chaim Ginzburg for Deep Learning seminar 1 Semantic Segmentation Define a pixel-wise labeling

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

The Xilinx XC6200 chip, the software tools and the board development tools

The Xilinx XC6200 chip, the software tools and the board development tools The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions

More information

Transfer Learning. Style Transfer in Deep Learning

Transfer Learning. Style Transfer in Deep Learning Transfer Learning & Style Transfer in Deep Learning 4-DEC-2016 Gal Barzilai, Ram Machlev Deep Learning Seminar School of Electrical Engineering Tel Aviv University Part 1: Transfer Learning in Deep Learning

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

An introduction to Machine Learning silicon

An introduction to Machine Learning silicon An introduction to Machine Learning silicon November 28 2017 Insight for Technology Investors AI/ML terminology Artificial Intelligence Machine Learning Deep Learning Algorithms: CNNs, RNNs, etc. Additional

More information

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

More information

FPGA 101. Field programmable gate arrays in action

FPGA 101. Field programmable gate arrays in action FPGA 101 Field programmable gate arrays in action About me Karsten Becker Head of electronics @Part-Time Scientists PhD candidate @TUHH FPGA Architecture 2 What is an FPGA Programmable Logic Programmable

More information

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters

More information

arxiv: v2 [cs.ar] 15 May 2018

arxiv: v2 [cs.ar] 15 May 2018 [DL] A Survey of FPGA Based Neural Network Accelerator arxiv:1712.08934v2 [cs.ar] 15 May 2018 KAIYUAN GUO, SHULIN ZENG, JINCHENG YU, YU WANG AND HUAZHONG YANG, Tsinghua University, China Recent researches

More information

Object Detection Based on Deep Learning

Object Detection Based on Deep Learning Object Detection Based on Deep Learning Yurii Pashchenko AI Ukraine 2016, Kharkiv, 2016 Image classification (mostly what you ve seen) http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-detection.pdf

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information