Low-Power Neural Processor for Embedded Human and Face detection

Olivier Brousse (1), Olivier Boisard (1), Michel Paindavoine (1,2), Jean-Marc Philippe (3), Alexandre Carbon (3)
(1) GlobalSensing Technologies (GST), Dijon, France, https://gsensing.eu
(2) LEAD, Université de Bourgogne, CNRS, Dijon, France
(3) DACLE, CEA LIST, Nano-innov, Palaiseau, France
June 23rd, 2016, NeuroSTIC 2016, O. Brousse

Introduction
One way to balance performance against complexity is to take inspiration from human vision and its detection and recognition performance.
Simple tasks, human brain vs. von Neumann computer (such as a PC):
- The brain recognizes this image in less than one second (image not reproduced)
- But the computer calculates this in less than one second: 398387.86 x 498.07 = ?
Proposed artificial vision model for embedded systems:
- Arithmetic calculations, used for example in image filtering: von Neumann (or Harvard) architectures
- Object recognition from natural images: neuro-inspired architectures
Human intelligence: Artificial Intelligence on silicon

Outline
- Introduction
- Neuro-Inspired Vision Models
- Hardware Accelerator for Neuro-Inspired applications
- Application examples
- Conclusion

Deep Neural Network Models
ImageNet classification (Hinton's team, hired by Google)
- 1.2 million high-resolution images, 1,000 different classes
- Top-5 error rate: 17% (huge improvement)
- Figure: learned features on first layer
Facebook's DeepFace program (labs head: Y. LeCun)
- 4 million images, 4,000 identities
- 97.25% accuracy, vs. 97.53% human performance

State-of-the-art in Recognition
Databases listed in order of increasing complexity:
Database | Content | # Images | # Classes | Best score
MNIST | Handwritten digits | 60,000 + 10,000 | 10 | 99.79% [3]
GTSRB | Traffic signs | ~ 50,000 | 43 | 99.46% [4]
CIFAR-10 | airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck | 50,000 + 10,000 | 10 | 91.2% [5]
Caltech-101 | | ~ 50,000 | 101 | 86.5% [6]
ImageNet | | ~ 1,000,000 | 1,000 | Top-5 83% [1]
DeepFace | | ~ 4,000,000 | 4,000 | 97.25% [2]
The state of the art is a Deep Neural Network every time.

CNNs Organization
Deep = number of layers >> 1

State-of-the-art CNN Example
The German Traffic Sign Recognition Benchmark (GTSRB): 43 traffic sign types, > 50,000 images
- Neurons: 287,843
- Synapses: 1,388,800
- Total memory: 1.5 MB (with 8-bit synapses)
- Connections: 124,121,800
Near-human recognition (> 98%) [3]
[3] D. Ciresan, U. Meier, J. Masci, J. Schmidhuber, Multi-column deep neural network for traffic sign classification, Neural Networks (32), pp. 333-338, 2012
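The network figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes one byte per 8-bit synapse and decimal megabytes, and the slide does not break down what makes up the rest of the 1.5 MB total. The ratio of connections to synapses is shown because it gives a rough idea of how often each stored weight is reused in a convolutional network.

```python
# Figures quoted on the slide for the GTSRB network.
neurons = 287_843
synapses = 1_388_800
connections = 124_121_800
bits_per_synapse = 8

# Storage for the weights alone (assumption: 1 MB = 1e6 bytes).
weight_bytes = synapses * bits_per_synapse // 8
weight_mb = weight_bytes / 1e6

# Average number of connections that share one stored synapse (weight reuse).
reuse_per_synapse = connections / synapses
# Average number of incoming connections per neuron.
fan_in = connections / neurons

print(f"weights: {weight_mb:.2f} MB")
print(f"connections per synapse: {reuse_per_synapse:.1f}")
print(f"connections per neuron: {fan_in:.1f}")
```

The weights account for about 1.39 MB, which is in line with the 1.5 MB total quoted on the slide.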

Another Neuro-Inspired Model: Hmax (a Neuroscience Approach)
Hmax model: Serre et al., IEEE PAMI 2007; Poggio et al., J Neurophysiol, 2007

Neuro-Inspired Models: The Hmax
S1 layer using Gabor filters
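As an illustration of what an S1 layer built from Gabor filters computes, here is a minimal NumPy sketch. It is not the implementation used in the talk: the kernel size, wavelength, sigma, aspect ratio and the four orientations are illustrative assumptions, and the function names `gabor_kernel` and `s1_layer` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gabor_kernel(size, theta, wavelength, sigma, gamma=0.3):
    """Real-valued Gabor kernel (odd size), zero mean and unit norm."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinates by the filter orientation.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    g = envelope * np.cos(2 * np.pi * xr / wavelength)
    g -= g.mean()
    return g / np.linalg.norm(g)

def s1_layer(image, kernels):
    """Absolute filter response for each kernel ('valid' positions only)."""
    k = kernels[0].shape[0]
    windows = sliding_window_view(image, (k, k))  # shape (H-k+1, W-k+1, k, k)
    return np.stack([np.abs(np.einsum("ijkl,kl->ij", windows, kern))
                     for kern in kernels])

# Illustrative bank: one 7x7 filter for each of 4 orientations.
thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
bank = [gabor_kernel(7, t, wavelength=3.5, sigma=2.8) for t in thetas]

image = np.random.default_rng(0).random((32, 32))  # stand-in for a real image
s1 = s1_layer(image, bank)
print(s1.shape)  # one response map per orientation
```

Each multiply-accumulate inside the `einsum` call is one MAC, which is why the S1 stage dominates the operation counts quoted later in the talk.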

Neuro-Inspired Models: The Hmax
Figure panels: Original Image, Gabor Filters

Neuro-Inspired Models: The Hmax
Hmax model performances

Hmax accelerator: Complexity
64 Gabor filters
Complexity for a 1 Mpixel image:
- S1: optimized Gabor filters: 2.9 GMAC
- C1: max: 0.13 GOP
- RBF neural network: 0.4 GOP
- Total: 3.43 GMAC & OP
One IP camera, 1 Mpixel @ 30 fps: 103 GOP/sec
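The total and the per-second figure follow directly from the per-stage counts. The short sketch below reproduces that arithmetic. Like the slide, it adds MACs and other operations together as if they were the same unit.

```python
# Per-image operation counts from the slide, in giga-operations (GMAC or GOP).
stages = {
    "S1 Gabor filters": 2.9,
    "C1 max": 0.13,
    "RBF neural network": 0.4,
}
frames_per_second = 30

per_image = sum(stages.values())            # giga-operations per 1 Mpixel image
per_second = per_image * frames_per_second  # giga-operations per second

print(f"per image:  {per_image:.2f} G operations")
print(f"per second: {per_second:.1f} G operations")
```

The result, about 102.9, is rounded to 103 GOP/sec on the slide.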

PNeuro accelerator (Joint Laboratory CEA & GST, initiated in 2013)
Objective: design a processor that integrates, within the same chip, signal processing functions and neuronal functions (Hmax, CNN)
PNeuro: a cascadable parallel architecture
Diagram labels: Data In (signals, images); three clusters of NeuroCores; Classification Result; From Previous NeuroDSP; To Next NeuroDSP

PNeuro accelerator overview

PNeuro accelerator: Main Specifications
- Programmable NeuroCores, each able to perform image/signal processing and neural functions
- Optimized for MAC and neural operations
- Signal processing: convolution filters, etc.
- Neural functions: weighted sum of inputs
- Can perform non-linear operations (maxima, tanh, ...)
- 1 NeuroCore represents 1 neuron
- NeuroCores can be time-multiplexed to implement bigger networks
- Optimized memory accesses for data locality and reuse
Variable number of clusters to accommodate different application domains and related performances
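The specifications above describe a NeuroCore as one neuron (a weighted sum of inputs followed by a non-linear operation) that can be time-multiplexed to cover larger networks. The following sketch shows that idea in software only. It does not model the PNeuro hardware, and the names `neuron` and `run_layer`, as well as the layer sizes, are hypothetical.

```python
import numpy as np

def neuron(inputs, weights, activation=np.tanh):
    """One neuron: weighted sum of the inputs, then a non-linear function."""
    return activation(np.dot(weights, inputs))

def run_layer(inputs, weight_rows, n_cores=4):
    """Evaluate a layer with more neurons than cores by time multiplexing.

    In each pass, every core computes one neuron; the number of passes is
    the number of neurons divided by the number of cores, rounded up.
    """
    outputs = np.empty(len(weight_rows))
    passes = 0
    for start in range(0, len(weight_rows), n_cores):
        stop = min(start + n_cores, len(weight_rows))
        for i in range(start, stop):   # these run in parallel on real cores
            outputs[i] = neuron(inputs, weight_rows[i])
        passes += 1
    return outputs, passes

rng = np.random.default_rng(0)
x = rng.standard_normal(16)         # 16 inputs (illustrative)
W = rng.standard_normal((10, 16))   # 10 neurons mapped onto 4 cores
y, passes = run_layer(x, W)
print(passes)
```

With 10 neurons and 4 cores the layer takes 3 passes, and the last pass leaves two cores idle.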

PNeuro accelerator: Performances
Profiling results, based on FDSOI 28 nm technology:
- One cluster of 4 Neuro-Cores @ 1 GHz: 32 GMAC/sec with 70 mW power consumption, including memories and the controller
- 32 Neuro-Cores @ 1 GHz: 1024 GMAC/sec, 2.2 W
- Energy efficiency: 465 GMAC/sec per W
Full Hmax:
- One IP camera, 1 Mpixel @ 30 fps: 103 GOP/sec
- Needs 4 clusters of 4 Neuro-Cores (sup[103/32], i.e. 103/32 rounded up): 280 mW
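The sizing on this slide can be reproduced with a short calculation. The sketch below assumes that one GOP of workload consumes one GMAC of cluster capacity, as the slide implicitly does. It also notes, as an observation rather than a correction, that the 1024 GMAC/sec and 2.2 W figures are numerically consistent with 32 clusters (32 x 32 GMAC/sec and 32 x 70 mW = 2.24 W).

```python
import math

CLUSTER_GMAC_PER_S = 32   # one cluster of 4 Neuro-Cores @ 1 GHz (from the slide)
CLUSTER_POWER_MW = 70     # same cluster, memories and controller included

def clusters_needed(load_gop_per_s):
    """Round up: a partially used cluster still has to be instantiated."""
    return math.ceil(load_gop_per_s / CLUSTER_GMAC_PER_S)

n_clusters = clusters_needed(103)             # full Hmax, 1 Mpixel @ 30 fps
power_mw = n_clusters * CLUSTER_POWER_MW

# Energy efficiency at the two quoted operating points, in GMAC/sec per watt.
eff_one_cluster = CLUSTER_GMAC_PER_S / (CLUSTER_POWER_MW / 1000)
eff_large = 1024 / 2.2

# Assumption: reading the 1024 GMAC/sec point as a multiple of one cluster.
clusters_at_1024 = 1024 // CLUSTER_GMAC_PER_S

print(n_clusters, power_mw)
print(round(eff_one_cluster), round(eff_large), clusters_at_1024)
```

Both operating points give roughly the same efficiency (about 457 and 465 GMAC/sec per W), which suggests that power scales almost linearly with the number of clusters.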

Face Detection Application Example (1/2)

Face Detection Application Example (2/2)
Complexity calculation divided by 8 (merge 8 scales)
For one camera, 1 Mpixel @ 30 fps: 12.9 GOP/sec (103 GOP/sec / 8)
- Needs one cluster with 2 NeuroCores: power consumption < 35 mW
- For a VGA image @ 30 fps, only 1 NeuroCore: < 20 mW
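The NeuroCore counts above can be recovered from the cluster figures quoted earlier in the talk (32 GMAC/sec and 70 mW for 4 Neuro-Cores). The sketch below rests on assumptions that the slides do not state: throughput and power are split evenly across the cores of a cluster, the load scales linearly with pixel count, 1 Mpixel means 1,000,000 pixels, and one GOP of load uses one GMAC of capacity.

```python
import math

CLUSTER_GMAC_PER_S = 32.0   # one cluster of 4 Neuro-Cores @ 1 GHz
CLUSTER_POWER_MW = 70.0     # memories and controller included
CORES_PER_CLUSTER = 4

# Assumption: even split of throughput and power across cores.
CORE_GMAC_PER_S = CLUSTER_GMAC_PER_S / CORES_PER_CLUSTER   # 8 GMAC/sec
CORE_POWER_MW = CLUSTER_POWER_MW / CORES_PER_CLUSTER       # 17.5 mW

def cores_needed(load_gop_per_s):
    """Number of NeuroCores needed to sustain the given load."""
    return math.ceil(load_gop_per_s / CORE_GMAC_PER_S)

full_hmax = 103.0                        # GOP/sec, 1 Mpixel @ 30 fps
face_1mpix = full_hmax / 8               # merging 8 scales
face_vga = face_1mpix * (640 * 480) / 1_000_000   # assumption: linear in pixels

for name, load in [("1 Mpixel", face_1mpix), ("VGA", face_vga)]:
    n = cores_needed(load)
    print(f"{name}: {load:.1f} GOP/sec, {n} core(s), about {n * CORE_POWER_MW:.1f} mW")
```

Under these assumptions the estimate is 2 cores at about 35 mW for 1 Mpixel and 1 core at about 17.5 mW for VGA, close to the "< 35 mW" and "< 20 mW" bounds on the slide.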

Human detection: Hmin
- 64 Gabor filters (7x7 to 37x37) + original image
- Local maxima (C1)
- C1 output
- Classification with RBF neural network (S2, C2)
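The two stages that follow the Gabor filters, local maxima (C1) and classification with radial basis functions, can be sketched as follows. This is a simplified illustration, not the Hmin implementation from the talk: the pooling here is spatial only and non-overlapping, the RBF is a plain Gaussian of the distance to stored prototypes, and the names `c1_max_pool` and `rbf_responses` and all sizes are hypothetical.

```python
import numpy as np

def c1_max_pool(s1, pool=2):
    """C1: local maximum over non-overlapping pool x pool neighbourhoods.

    s1 has shape (n_filters, height, width); edges that do not fill a
    complete neighbourhood are dropped.
    """
    n, h, w = s1.shape
    h, w = h - h % pool, w - w % pool
    blocks = s1[:, :h, :w].reshape(n, h // pool, pool, w // pool, pool)
    return blocks.max(axis=(2, 4))

def rbf_responses(features, centers, sigma=1.0):
    """Gaussian RBF units: response is 1 at a center and decays with distance."""
    d2 = ((centers - features) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
s1 = rng.random((4, 8, 8))             # stand-in for Gabor filter outputs
c1 = c1_max_pool(s1)                   # shape (4, 4, 4)
features = c1.ravel()

# Hypothetical prototypes: the first one equals the input, the others are random.
centers = np.vstack([features, rng.random((2, features.size))])
responses = rbf_responses(features, centers)
print(c1.shape, responses.argmax())
```

The unit whose prototype matches the input responds with exactly 1, and a classifier would typically combine these responses, for example by taking the strongest one.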

Human Detection Application Example

Human Detection Application Example
Pipeline: S1 layer (Gabor filters) -> C1 layer (max pooling) -> RBF classification -> human detected
To reduce complexity, optimization from Masquelier et al. (PLoS Computational Biology, 2007):
- 5 image scales (1, 0.7, 0.5, 0.35 and 0.25)
- 4 orientations
- One Gabor filter (15x15) per scale and per orientation: 20 Gabor filters
Complexity for a 1 Mpixel image:
- S1: optimized Gabor filters: 2.9 GMAC -> 0.6 GMAC
- C1: max: 0.1 GOP
- RBF neural network: 0.3 GOP
- Total: 3.43 GMAC & OP -> 1 GMAC & OP
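The reduction factor can be checked stage by stage. The sketch below takes the optimized counts from this slide and the original counts from the earlier Hmax complexity slide (2.9 GMAC, 0.13 GOP, 0.4 GOP). Pairing the stages this way is an assumption, since this slide only shows the old values for S1 and for the total.

```python
# Giga-operations per 1 Mpixel image.
original = {"S1": 2.9, "C1": 0.13, "RBF": 0.4}    # earlier complexity slide
optimized = {"S1": 0.6, "C1": 0.1, "RBF": 0.3}    # this slide

n_scales, n_orientations = 5, 4
n_filters = n_scales * n_orientations   # one 15x15 filter per scale and orientation

total_original = sum(original.values())
total_optimized = sum(optimized.values())
reduction = total_original / total_optimized

for stage in original:
    ratio = original[stage] / optimized[stage]
    print(f"{stage}: {original[stage]} -> {optimized[stage]} ({ratio:.1f}x)")
print(f"filters: 64 -> {n_filters}")
print(f"total: {total_original:.2f} -> {total_optimized:.2f} ({reduction:.2f}x)")
```

Most of the saving comes from the S1 stage, where going from 64 to 20 filters cuts the count by almost a factor of 5.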

Human Detection Application Example
Complexity calculation divided by 3.43:
- Original Hmax, one IP camera, 1 Mpixel @ 30 fps: 103 GOP/sec
- Optimized Hmax, one IP camera, 1 Mpixel @ 30 fps: 30 GOP/sec
Using FDSOI 28 nm technology, one cluster of 4 Neuro-Cores @ 1 GHz: 32 GMAC/sec with 70 mW power consumption
- Optimized Hmax needs 4 Neuro-Cores for one IP camera, 1 Mpixel @ 30 fps: power consumption 70 mW
- For a VGA image @ 30 fps, only 2 Neuro-Cores: 35 mW

Human Detection Application Examples

Conclusion
- PNeuro architecture optimized for neuro-inspired algorithms: Hmax, Convolutional Neural Networks and, more generally, Deep Neural Networks
- PNeuro: a cascadable parallel architecture
- Performance in FDSOI 28 nm makes embedded applications with very low power consumption feasible:
- Face detection needs only 20 mW for a VGA image @ 30 fps
- Human detection needs only 35 mW for a VGA image @ 30 fps
- PNeuro is also implemented on FPGA

PNeuro on FPGA
First demonstration of an FPGA-based PNeuro
- Single-cluster configuration (4 Neuro-Cores)
- Embedded CNN application (60 neurons in the hidden layer, 450 KOps)
- Face extraction, 18,000 images in the database, 96% recognition rate
Same application ported to 5 different architectures
- Embedded CPU: Raspberry PI 2 B, Odroid Xu3
- Embedded GPU: NVidia Tegra K1 (batch)
- Desktop CPU: Intel I7
- PNeuro, quad Neuro-Cores, using an in-house prototyping board
Target | Frequency | Energy efficiency
Intel I7 (CPU) | 3400 MHz | 160 images/W
Quad ARM A15 (CPU) | 2000 MHz | 350 images/W
Quad ARM A7 (CPU) | 900 MHz | 380 images/W
Tegra K1 (GPU) | 850 MHz | 600 images/W
PNeuro (FPGA) | 100 MHz | 2000 images/W
The FPGA approach is already competitive with existing CPU & GPU solutions
First FPGA product developed for early 2017 by GST
Embedded FPGA: Artix 100 (~1 W), 17.6 cm² for the board, including one cluster
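To put the energy-efficiency table in perspective, the short sketch below expresses each target relative to the FPGA-based PNeuro. It uses only the values quoted in the table and makes no assumption about how "images/W" was measured.

```python
# Energy efficiency from the table, in images/W.
efficiency = {
    "Intel I7 (CPU)": 160,
    "Quad ARM A15 (CPU)": 350,
    "Quad ARM A7 (CPU)": 380,
    "Tegra K1 (GPU)": 600,
    "PNeuro (FPGA)": 2000,
}

pneuro = efficiency["PNeuro (FPGA)"]
# How many times more images per watt PNeuro delivers than each target.
advantage = {target: pneuro / value for target, value in efficiency.items()}

for target, ratio in advantage.items():
    print(f"{target}: {ratio:.1f}x")
```

On these figures, PNeuro on FPGA at 100 MHz is about 3.3 times more efficient than the Tegra K1 GPU and 12.5 times more efficient than the desktop CPU.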

Article in EETimes
Embedded World demonstration (Feb. 2016)
http://www.electronics-eetimes.com/news/licensible-ip-core-accelerates-neural-networks

Thank you!