Scaling Convolutional Neural Networks on Reconfigurable Logic - Michaela Blott, Principal Engineer, Xilinx Research


1 Scaling Convolutional Neural Networks on Reconfigurable Logic - Michaela Blott, Principal Engineer, Xilinx Research

2 Team: Nick Fraser (Xilinx & USydney), Yaman Umuroglu (Xilinx & NTNU), Giulio Gambardella (Xilinx). Mission: investigate & exploit novel trends in machine learning that play to the strengths of FPGAs: reduced-precision neural networks.

3 Executive Summary: FPGAs can do trillions of reduced-precision synaptic operations per second, and neural nets can put this to good use. The result is inference accelerators that classify tens of thousands to millions of images per second at < 25 W on today's hardware.

4 Background

5 Convolutional Neural Networks: CNN computation is linear algebra, originally on floating-point data types. It demands lots of computation and lots of parameters (memory); for ImageNet, AlexNet needs 244 MB & 1.5 GOPS, VGG16 552 MB & 30.8 GOPS, and GoogleNet 41.9 MB & 3.0 GOPS. This is not suitable for energy-constrained computing environments. The core operation is the accumulation Output(w,h,m) += input(w+x,h+y,d) * filter(m,x,y,d); the challenge is billions of floating-point multiply-accumulate ops & tens of megabytes of parameter data.
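
The single accumulation statement above expands into a loop nest over output pixels, output feature maps and the filter's receptive field. A minimal illustrative sketch in plain C++ (array layouts, padding handling and dimension names are assumptions made here, not the FINN implementation):

  // Naive direct convolution: W x H x M outputs, D input channels, K x K filters.
  // The input is assumed to be (H+K-1) x (W+K-1) x D, row-major, channels innermost.
  void conv_layer(const float *input, const float *filter, float *output,
                  int W, int H, int D, int M, int K) {
      for (int m = 0; m < M; m++)                  // output feature maps
          for (int h = 0; h < H; h++)              // output rows
              for (int w = 0; w < W; w++) {        // output columns
                  float acc = 0.0f;
                  for (int d = 0; d < D; d++)          // input channels
                      for (int y = 0; y < K; y++)      // filter rows
                          for (int x = 0; x < K; x++)  // filter columns
                              acc += input[((h + y) * (W + K - 1) + (w + x)) * D + d]
                                   * filter[((m * K + y) * K + x) * D + d];
                  output[(h * W + w) * M + m] = acc;   // Output(w,h,m)
              }
  }

With floating-point data, this loop nest is exactly the billions of multiply-accumulates and tens of megabytes of weights the slide describes; the rest of the deck is about shrinking both.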

6 Increasingly Reduced Precision Networks: Floating-point (FP) CNNs contain a lot of redundancy. Reducing precision is shown to work down to 2 bits without loss of accuracy (B. Dally, EMDNN 2016), with ternary networks on par with FP for AlexNet top-1 and top-5 and for ResNet-20/32/44/56. Reducing to the extreme, binary and almost-binary neural networks (BNNs) work at a small loss of accuracy for large networks. Classification error in % (lower is better), sources [4][5]:
  Quantization                          MNIST   SVHN    CIFAR-10
  Binary weights, binary activations    0.96%   2.53%   10.15%
  Binary weights, FP activations        1.29%   2.30%    9.90%
  FP weights, FP activations            0.94%   1.69%    7.62%
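
For reference, the binarization behind these numbers is conceptually just a sign function applied to weights (and, for fully binarized networks, to activations as well). A minimal sketch, not tied to any particular training framework; the ternary threshold t is an illustrative parameter:

  #include <cstdint>

  // Quantize a floating-point value to {-1, +1} (binary) or {-1, 0, +1} (ternary).
  inline int8_t binarize(float w) { return (w >= 0.0f) ? +1 : -1; }

  inline int8_t ternarize(float w, float t) {      // t: assumed ternarization threshold
      if (w >  t) return +1;
      if (w < -t) return -1;
      return 0;
  }

Training typically keeps full-precision shadow weights and applies such quantizers only in the forward pass, which is how the accuracy loss is kept small.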

7 Accuracy of Binary Networks Is Improving: published results for FP CNNs, BNNs and extreme reduced-precision NNs. [Chart: Top-5 error on ImageNet versus publication date, 07/2015 through 09/2017, for CNNs, reduced-precision networks and BNNs.] BNNs are new, and accuracy results are improving rapidly, for example through new training techniques, topological changes and other methods.

8 Potential of Reduced Precision on FPGAs: Cost per operation is greatly reduced; for a BNN, for example, a multiply-accumulate becomes an XNOR plus a bit count. Memory cost is greatly reduced: large networks can fit entirely into on-chip memory (OCM: UltraRAM, BRAM; a VU9P at 16nm offers 43 MB), giving more memory bandwidth at lower energy. Today's FPGAs (100Ks of LUTs, Ks of DSPs) have a much higher peak performance for reduced-precision operations: when applications are sufficiently parallel, FPGA performance is inversely proportional to the cost per operation, and lower cost per op plus massive parallelism means more ops every cycle. The slide's table compares, per precision, cost per op in LUTs and DSPs, MB needed for AlexNet, and peak TOps/s; the peak-throughput columns (roughly 100x separates 32b from 1b):
  Precision   TOps/s (KU115)*   TOps/s (ZU19EG)*
  1b          ~46               ~66
  4b          ~11               ~16
  8b          ~3                ~4
  16b         ~1                ~1
  32b         ~0.5              ~0.3
  *Assumptions: application can fill the device to 70% (fully parallelizable), 250 MHz.
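
To make "multiply-accumulate becomes XNOR with bit counts" concrete, here is a minimal C++ sketch of a binary dot product over 64 packed weight/activation bits (the 64-bit packing and the function name are illustrative assumptions, not FINN's actual kernel):

  #include <bit>       // std::popcount (C++20)
  #include <cstdint>

  // Dot product of two 64-element {-1,+1} vectors packed one bit per element
  // (bit = 1 encodes +1, bit = 0 encodes -1).
  // XNOR marks positions where the operands agree; popcount counts them.
  // dot = (#matches) - (#mismatches) = 2 * popcount(~(a ^ b)) - 64.
  inline int binary_dot64(uint64_t a, uint64_t b) {
      uint64_t matches = ~(a ^ b);              // bitwise XNOR
      return 2 * std::popcount(matches) - 64;
  }

One LUT-based XNOR/popcount datapath like this replaces a full multiplier and accumulator, which is where the large per-operation cost reduction in the table comes from.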

9 Potential of BNNs on FPGAs (ZU19EG): Fewer LUTs per op yields a higher peak performance (1 TOPS at 8b versus 66 TOPS at 1b), and staying on-chip achieves more of that peak (0.1 TOPS versus roughly 40 TOPS). Reduced precision allows us to scale NN performance on FPGAs to unprecedented levels. Assumption: operational intensity for 8b and 1b AlexNet, assuming 1.45 GOps/image and 61 MB (8b) / 7.6 MB (1b) of weights.
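
A rough roofline-style sanity check of that reasoning (the off-chip bandwidth figure below is my own illustrative assumption, so the printed numbers will differ from the slide's roofline; the per-image ops, weight sizes and peaks come from the slides):

  #include <algorithm>
  #include <cstdio>

  // Roofline: attainable throughput = min(compute peak, operational intensity * bandwidth).
  double attainable_tops(double ops_per_byte, double bytes_per_s, double peak_tops) {
      return std::min(peak_tops, ops_per_byte * bytes_per_s / 1e12);
  }

  int main() {
      const double ops  = 1.45e9;     // AlexNet ops per image (slide)
      const double dram = 20.0e9;     // assumed off-chip bandwidth, 20 GB/s
      // 8b weights: 61 MB/image, ~1 TOps/s peak; 1b weights: 7.6 MB/image, 66 TOps/s peak
      printf("8b, weights off-chip: %.2f TOps/s\n", attainable_tops(ops / 61.0e6, dram, 1.0));
      printf("1b, weights off-chip: %.2f TOps/s\n", attainable_tops(ops / 7.6e6, dram, 66.0));
      printf("1b, weights on-chip:  up to %.0f TOps/s peak\n", 66.0);
      return 0;
  }

Keeping all weights in on-chip memory removes the DRAM term entirely, which is how a dataflow design gets close to the compute roofline.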

10 Exploitation of Reduced Precision Neural Networks through FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

11 FINN Design Principles: Custom-tailored hardware for optimal performance and power efficiency, with customized data types and a customized dataflow architecture that matches the network topology. Exploit compile-time optimizations to simplify the generated hardware. Keep all parameters on-chip for higher energy efficiency and performance. Provide flexibility in the architecture to scale solutions. Support portability and rapid exploration through high-level design entry, in C++ and most recently in OpenCL.

12 Heterogeneous Dataflow Architecture: Not a systolic array (with a scheduling network over the processing engines and looping over the PEs), but a dataflow architecture customized to match each layer's compute requirement, so that all layers have equivalent throughput and one-size-fits-all penalties are avoided (for example, a 1 MOPS layer gets 1 PE while a 10 MOPS layer gets 10 PEs). Each layer consumes and produces data in the same order to minimize buffering and latency: FIFOs only, no ping-pong buffers. Layers are different instantiations of a C++ template class (the MVTU).
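
A minimal sketch of that balancing: pick each layer's parallelism (PE count) in proportion to its operation count so that every layer finishes a frame within the same cycle budget. The function name and the simplification that one PE retires one op per cycle are assumptions for illustration, not FINN's actual folding algorithm:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Given per-layer op counts and a per-frame cycle budget, choose a PE count
  // per layer so that ops / PEs fits within the budget for every layer.
  std::vector<int> allocate_pes(const std::vector<int64_t> &layer_ops,
                                int64_t cycles_per_frame) {
      std::vector<int> pes;
      pes.reserve(layer_ops.size());
      for (int64_t ops : layer_ops) {
          int64_t p = (ops + cycles_per_frame - 1) / cycles_per_frame;  // ceil division
          pes.push_back(static_cast<int>(std::max<int64_t>(1, p)));
      }
      return pes;
  }

  // Example: with a 1M-cycle frame budget, a 1 MOPS layer gets 1 PE and a
  // 10 MOPS layer gets 10 PEs, matching the slide's illustration.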

13 Architecture of a Matrix-Vector Threshold Unit (MVTU): Fully connected layers and convolutional layers are both mapped onto matrix-vector-threshold units (MVTUs). MVTUs support folding over output feature maps / neurons and folding over weights / synapses (OFM folding and weight folding). They are weight- and output-stationary: weights and popcounts are retained locally. Max-pool units are optionally placed behind the MVTUs.
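
At its core, an MVTU computes a binary matrix-vector product and then compares each accumulated popcount against a per-neuron threshold, producing the next layer's binary activations. A compact behavioural sketch in plain C++ (packing, container types and names are illustrative assumptions, not the streaming HLS implementation):

  #include <bit>
  #include <cstdint>
  #include <vector>

  // One output neuron: XNOR-popcount accumulation over packed 64-bit words,
  // then binarization by comparison against a precomputed threshold.
  inline bool mvtu_neuron(const std::vector<uint64_t> &weight_row,
                          const std::vector<uint64_t> &activations,
                          int threshold) {
      int acc = 0;
      for (size_t i = 0; i < weight_row.size(); ++i)
          acc += std::popcount(~(weight_row[i] ^ activations[i]));  // XNOR + bit count
      return acc >= threshold;                                      // thresholding = activation
  }

  // Matrix-vector-threshold: one packed weight row per output neuron (OFM).
  std::vector<bool> mvtu(const std::vector<std::vector<uint64_t>> &weights,
                         const std::vector<uint64_t> &activations,
                         const std::vector<int> &thresholds) {
      std::vector<bool> out(weights.size());
      for (size_t n = 0; n < weights.size(); ++n)
          out[n] = mvtu_neuron(weights[n], activations, thresholds[n]);
      return out;
  }

Folding in the real unit simply time-multiplexes this work: neuron (OFM) folding shares a PE across several output rows, and weight (synaptic) folding processes each row one SIMD word at a time.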

14 Synthesizable C++ Network Description:

  void DoCompute(ap_uint<64> *in, ap_uint<64> *out) {
  #pragma HLS DATAFLOW
      // Stream definitions
      stream<ap_uint<64> > memInStrm("memInStrm");
      stream<ap_uint<64> > inStrm("inStrm");
      // intermediate streams (declarations elided on the original slide; widths assumed)
      stream<ap_uint<16> > inter0("inter0"), inter1("inter1"), inter2("inter2");
      stream<ap_uint<16> > outStream("outStream");
      stream<ap_uint<64> > memOutStrm("memOutStrm");

      // Move image in from PS memory
      Mem2Stream<64, inBytesPadded>(in, memInStrm);
      // (width adaptation from memInStrm to inStrm is elided on the slide)

      // Layer instantiations, connected by streams
      StreamingMatrixVector<L0_SIMD, L0_PE, 16, L0_MW, L0_MH, L0_WMEM, L0_TMEM>
          (inStrm, inter0, weightMem0, thresMem0);
      StreamingMatrixVector<L1_SIMD, L1_PE, 16, L1_MW, L1_MH, L1_WMEM, L1_TMEM>
          (inter0, inter1, weightMem1, thresMem1);
      StreamingMatrixVector<L2_SIMD, L2_PE, 16, L2_MW, L2_MH, L2_WMEM, L2_TMEM>
          (inter1, inter2, weightMem2, thresMem2);
      StreamingMatrixVector<L3_SIMD, L3_PE, 16, L3_MW, L3_MH, L3_WMEM, L3_TMEM>
          (inter2, outStream, weightMem3, thresMem3);

      // Move results to PS memory
      StreamingCast<ap_uint<16>, ap_uint<64> >(outStream, memOutStrm);
      Stream2Mem<64, outBytesPadded>(memOutStrm, out);
  }

15 Work Flow for Exploration of NNs on FPGAs: First prototype integration with tiny-dnn and Theano (TensorFlow and Caffe in progress). All code is in C/C++ and can execute on CPU or FPGA; no RTL is needed. The result is a fast workflow, integrated with standard frameworks, with the flexibility to support different topologies, sizes, rates and resources across different devices (Z7045, KU115, Z7020).

16 Experimental Results: Embedded platforms (Zynq Z7045 & Z7020): ZC706 and the PYNQ open-source platform. Server-class accelerator: ADM-PCIE-8K5 in an OpenPOWER system (and x86 with SDAccel).

17 Input Data: Solitaire demo (Xilinx demo center & Embedded World); MNIST handwritten digits; Street View house numbers (SVHN); CIFAR-10 (cats, dogs, etc.); playing cards; ImageNet in progress now.

18 Test Networks: MLP (multilayer perceptron): input images are 28x28-pixel black & white handwritten digits; 3 FC layers with 1024 neurons each; compute: MOPS/frame. CNV (a VGG-16 derivative CNN): input images are 32x32-pixel RGB images (SVHN, CIFAR-10, traffic signs); layers: 2 (3x3) Conv + MaxPool + 2 (3x3) Conv + MaxPool + 2 Conv + 3 FC; compute: GOPS/frame.
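
As a back-of-the-envelope check on where the per-frame op counts come from, here is an illustrative calculation for the MLP (counting one multiply-accumulate as two operations and assuming a final 10-way classifier layer; the printed figure is my own estimate, not a number from the slide):

  #include <cstdint>
  #include <cstdio>

  // Ops for one fully connected layer: 2 * inputs * outputs (MAC counted as 2 ops).
  constexpr int64_t fc_ops(int64_t in, int64_t out) { return 2 * in * out; }

  int main() {
      // 28x28 inputs -> 1024 -> 1024 -> 1024 -> 10 classes (shapes per the slide,
      // 10-way output layer assumed for the handwritten-digit task)
      int64_t ops = fc_ops(28 * 28, 1024)
                  + fc_ops(1024, 1024)
                  + fc_ops(1024, 1024)
                  + fc_ops(1024, 10);
      printf("MLP: ~%.1f MOPS per frame\n", ops / 1e6);   // on the order of a few MOPS
      return 0;
  }

The CNV network's convolution layers dominate its count in the same way, which is why it lands in the GOPS-per-frame range.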

19 Results - Performance, Latency, Power & Resources (FPS, with resource utilization in parentheses):
  Max throughput, Z7045: FC-MNIST-S 12.3M FPS (39%); FC-MNIST-L 1.5M FPS (36%); CNV-CIFAR10-S 21.9K FPS (25%)
  Fixed ~12K FPS target, Z7045: FC-MNIST-S 12.2K FPS (2%); FC-MNIST-L 12.2K FPS (3%); CNV-CIFAR10-S 11.6K FPS (18%)
  KU115: CNV-CIFAR10-L 12.0K FPS (59%), 671 GOPS/s, < 41 W
  Takeaways: unprecedented classification rates, comparable to AlexNet; scalability to extremely small footprints; ultra-low latency (P4: ~11 ms), for robotics, AR and UAVs; roughly 3x the classification rate of the best measured GPU numbers today.

20 Status & Next Steps: Initial proofs of concept & demos are operational and demonstrate the potential; there is an open-source release on PYNQ with a Python API. We continue to progress the technology investigation: large NNs; higher (but no more than 8b!) & mixed-precision NNs; improving accuracy through novel techniques; design-space trade-offs between accuracy, performance and resources. We are also interested in understanding system-level integration better: how does ML plug into database systems? Heterogeneous at the system, node, or device level?

21 Thank You
