Low-Power Neural Processor for Embedded Human and Face detection

Olivier Brousse (1), Olivier Boisard (1), Michel Paindavoine (1,2), Jean-Marc Philippe (3), Alexandre Carbon (3)
(1) GlobalSensing Technologies (GST), Dijon, France, https://gsensing.eu
(2) LEAD, Université de Bourgogne, CNRS, Dijon, France
(3) DACLE, CEA LIST, Nano-innov, Palaiseau, France
June 23rd 2016, NeuroSTIC 2016 - O. Brousse

Introduction
One way to optimize performance versus complexity is to draw on bio-inspired models of human vision and its performance in terms of detection and recognition.
Simple tasks, human brain vs. Von Neumann computer (such as a PC):
- Recognizes this image in less than one second
- But calculates in less than one second (398387.86 x 498.07 = ?)
Artificial vision model proposal for embedded systems:
- Arithmetic calculations, used in image filtering for example:
  -> Von Neumann (or Harvard) architectures
- Object recognition from natural images:
  -> Neuro-inspired
Human intelligence: Artificial Intelligence on Silicon

Outline
- Introduction
- Neuro-Inspired Vision Models
- Hardware Accelerator for Neuro-Inspired applications
- Application examples
- Conclusion

Deep Neural Network Models
ImageNet classification (Hinton's team, hired by Google):
- 1.2 million high-resolution images, 1,000 different classes
- Top-5 error rate of 17% (huge improvement)
- Figure: learned features on the first layer
Facebook's DeepFace program (lab head: Y. LeCun):
- 4 million images, 4,000 identities
- 97.25% accuracy, vs. 97.53% human performance

State-of-the-art in Recognition
Rows are ordered by increasing complexity.

Database      Content             # Images          # Classes  Best score
MNIST         Handwritten digits  60,000 + 10,000   10         99.79% [3]
GTSRB         Traffic signs       ~ 50,000          43         99.46% [4]
CIFAR-10      10 object classes   50,000 + 10,000   10         91.2% [5]
Caltech-101   -                   ~ 50,000          101        86.5% [6]
ImageNet      -                   ~ 1,000,000       1,000      Top-5 83% [1]
DeepFace      -                   ~ 4,000,000       4,000      97.25% [2]

CIFAR-10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
The state of the art is a Deep Neural Network every time.

CNNs Organization
Deep = number of layers >> 1

State-of-the-art CNN Example
The German Traffic Sign Recognition Benchmark (GTSRB): 43 traffic sign types, > 50,000 images
- Neurons: 287,843
- Synapses: 1,388,800
- Total memory: 1.5 MB (with 8-bit synapses)
- Connections: 124,121,800
- Near-human recognition (> 98%) [3]
[3] D. Ciresan, U. Meier, J. Masci, J. Schmidhuber, "Multi-column deep neural network for traffic sign classification", Neural Networks (32), pp. 333-338, 2012
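The GTSRB network figures can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes that each synapse is stored as one 8-bit weight and that a "connection" is one use of a stored weight. The raw weight storage it computes is a little below the 1.5 MB total quoted on the slide; what the remainder covers is not stated in the source.

```python
# Figures quoted on the GTSRB slide.
NEURONS = 287_843
SYNAPSES = 1_388_800        # stored weights
CONNECTIONS = 124_121_800   # weight uses per forward pass
BITS_PER_SYNAPSE = 8

# Raw weight storage, assuming one 8-bit value per synapse.
weight_bytes = SYNAPSES * BITS_PER_SYNAPSE // 8
weight_megabytes = weight_bytes / 1e6

# Average number of times each stored weight is reused. A value far
# above 1 reflects the weight sharing of convolutional layers.
reuse_per_weight = CONNECTIONS / SYNAPSES

print(f"weight storage: {weight_megabytes:.2f} MB")       # 1.39 MB
print(f"average reuse per weight: {reuse_per_weight:.1f}")  # 89.4
```

The high reuse factor is one reason the later slides stress data locality and reuse in the accelerator's memory accesses.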

Another Neuro-Inspired Model: Hmax (a Neuroscience Approach)
Hmax model: Serre et al., IEEE PAMI, 2007; Poggio et al., J Neurophysiol, 2007

Neuro-Inspired Models: The Hmax
S1 layer using Gabor filters
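As an illustration of what an S1 filter looks like, the sketch below builds a small bank of Gabor kernels with NumPy, using the common form of a Gaussian envelope multiplied by a cosine carrier. The function name `gabor_kernel`, the parameter values (size 7, sigma 2.8, wavelength 3.5, aspect ratio 0.3) and the zero-mean, unit-norm normalization are assumptions chosen for the example, not values taken from the slides.

```python
import numpy as np

def gabor_kernel(size, theta, sigma, wavelength, gamma=0.3):
    """Return a size x size Gabor kernel oriented at angle theta (radians).

    Hypothetical helper for illustration; size is expected to be odd.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinate frame to the filter orientation.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    # Gaussian envelope (elongated by gamma) times a cosine carrier.
    envelope = np.exp(-(x_rot**2 + (gamma * y_rot)**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength)
    kernel = envelope * carrier
    # Normalize to zero mean and unit L2 norm.
    kernel -= kernel.mean()
    kernel /= np.linalg.norm(kernel)
    return kernel

# Four orientations, as an example bank for one filter size.
ORIENTATIONS = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
bank = [gabor_kernel(7, t, sigma=2.8, wavelength=3.5) for t in ORIENTATIONS]
print(len(bank), bank[0].shape)
```

An S1 response map would then be obtained by convolving the input image with each kernel of the bank.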

Neuro-Inspired Models: The Hmax
Figure: original image and Gabor filters

Neuro-Inspired Models: The Hmax
Hmax model performance

Hmax accelerator: Complexity
64 Gabor filters, 1 Mpixel image complexity:
- S1, optimized Gabor filters: 2.9 GMAC
- C1, max: 0.13 GOP
- RBF neural network: 0.4 GOP
- Total: 3.43 GMAC & OP
One IP camera, 1 Mpixel @ 30 fps: 103 GOP/sec
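The per-frame total and the per-second throughput follow directly from the layer figures. The short sketch below reproduces the arithmetic, counting one MAC as one operation as the slide does; the variable names are illustrative.

```python
# Per-frame cost of each Hmax stage for a 1 Mpixel image (from the slide).
S1_GMAC = 2.9    # optimized Gabor filters
C1_GOP = 0.13    # local max
RBF_GOP = 0.4    # RBF neural network
FRAMES_PER_SECOND = 30

# One MAC is counted as one operation, as on the slide.
per_frame_gop = S1_GMAC + C1_GOP + RBF_GOP
per_second_gop = per_frame_gop * FRAMES_PER_SECOND

print(f"per frame:  {per_frame_gop:.2f} GMAC & OP")   # 3.43
print(f"per second: {per_second_gop:.1f} GOP/sec")    # 102.9, quoted as 103
```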

PNeuro accelerator (joint laboratory CEA & GST, initiated in 2013)
Objective: design a processor that integrates signal processing functions and neural functions (Hmax, CNN) within the same chip
PNeuro: a cascadable parallel architecture
Figure labels: Data In (Signals, Images); Cluster NeuroCores (x3); Classification Result; From Previous NeuroDSP; To Next NeuroDSP

PNeuro accelerator overview

PNeuro accelerator: Main Specifications
- Programmable NeuroCores, each able to perform image/signal processing and neural functions
- Optimized for MAC and neural operations
  - Signal processing: convolution filters, etc.
  - Neural functions: weighted sum of inputs
  - Can perform non-linear operations (maxima, tanh, etc.)
- 1 NeuroCore represents 1 neuron
- NeuroCores can be time-multiplexed to implement bigger networks
- Optimized memory accesses for data locality and reuse
- Variable number of clusters to accommodate different application domains and related performance
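The idea of time multiplexing can be sketched in software: a layer with more neurons than available NeuroCores is computed in several passes, each pass handling as many neurons as there are cores. The code below is a functional model only, under the assumption that each core computes one weighted sum followed by tanh; the function `layer_time_multiplexed` is hypothetical and does not describe PNeuro's actual programming model or scheduling.

```python
import math
import numpy as np

def layer_time_multiplexed(inputs, weights, n_cores):
    """Compute tanh(weights @ inputs) in passes of at most n_cores neurons.

    Returns the layer outputs and the number of passes that were needed.
    """
    n_neurons = weights.shape[0]
    outputs = np.empty(n_neurons)
    n_passes = math.ceil(n_neurons / n_cores)
    for p in range(n_passes):
        start = p * n_cores
        stop = min(start + n_cores, n_neurons)
        # In hardware, the neurons of one pass would run in parallel,
        # one per core; here they are simply looped over.
        for neuron in range(start, stop):
            acc = 0.0
            for w, x in zip(weights[neuron], inputs):
                acc += w * x            # one multiply-accumulate (MAC)
            outputs[neuron] = math.tanh(acc)  # non-linear operation
    return outputs, n_passes

rng = np.random.default_rng(0)
x = rng.standard_normal(16)           # 16 inputs
W = rng.standard_normal((10, 16))     # 10 neurons
y, passes = layer_time_multiplexed(x, W, n_cores=4)
print(passes)  # 3 passes for 10 neurons on 4 cores
```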

PNeuro accelerator: Performances
Profiling result, based on FDSOI 28 nm technology:
- One cluster of 4 Neuro-Cores @ 1 GHz: 32 GMAC/sec with 70 mW power consumption, including memories and the controller
- 32 Neuro-Cores @ 1 GHz: 1024 GMAC/sec, 2.2 W
- Energy efficiency: 465 GMAC/sec/W
Full Hmax:
- One IP camera, 1 Mpixel @ 30 fps: 103 GOP/sec
- Needs 4 clusters of 4 Neuro-Cores (ceil(103/32) = 4): 280 mW
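The cluster count, power and energy efficiency quoted above can be reproduced as follows. This is a rough sizing sketch that assumes each operation of the workload occupies one MAC slot and that power scales linearly with the number of clusters. As an arithmetic observation, the 1024 GMAC/sec and 2.2 W figures correspond to 32 times the single-cluster values (32 x 32 GMAC/sec and 32 x 70 mW = 2.24 W).

```python
import math

# Single-cluster figures from the slide (4 Neuro-Cores @ 1 GHz).
CLUSTER_GMAC_PER_SEC = 32.0
CLUSTER_POWER_MW = 70.0

# Full Hmax workload for one 1 Mpixel camera @ 30 fps.
HMAX_GOP_PER_SEC = 103.0

# Round up: a fraction of a cluster still requires a whole cluster.
clusters = math.ceil(HMAX_GOP_PER_SEC / CLUSTER_GMAC_PER_SEC)
power_mw = clusters * CLUSTER_POWER_MW

# Energy efficiency from the large configuration and from one cluster.
efficiency_large = 1024.0 / 2.2
efficiency_cluster = CLUSTER_GMAC_PER_SEC / (CLUSTER_POWER_MW / 1000.0)

print(clusters, power_mw)                           # 4 280.0
print(f"{efficiency_large:.0f} GMAC/sec/W")         # 465
print(f"{efficiency_cluster:.0f} GMAC/sec/W")       # 457
```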

Face Detection Application Example (1/2)

Face Detection Application Example (2/2)
- Complexity calculation divided by 8 (merge 8 scales)
- For one camera, 1 Mpixel @ 30 fps: 12.9 GOP/sec (103 GOP/sec / 8)
- Needs one cluster with 2 NeuroCores: power consumption < 35 mW
- For a VGA image @ 30 fps, only 1 NeuroCore: < 20 mW
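The NeuroCore counts above can be checked with a small sizing sketch. It rests on several assumptions not stated on the slide: each NeuroCore delivers a quarter of the cluster throughput (32 / 4 = 8 GMAC/sec), "1 Mpixel" means 10^6 pixels, the workload scales linearly with the pixel count, and VGA is 640 x 480. The helper `cores_needed` is hypothetical.

```python
import math

CORE_GMAC_PER_SEC = 32.0 / 4       # assumed per-core share of one cluster
FULL_HMAX_GOP_PER_SEC = 103.0      # 1 Mpixel @ 30 fps, full Hmax
SCALE_MERGE_FACTOR = 8             # complexity divided by 8 (8 scales merged)
MPIXEL = 1_000_000                 # assumed meaning of "1 Mpixel"
VGA_PIXELS = 640 * 480

def cores_needed(gop_per_sec):
    """Smallest whole number of NeuroCores covering the workload."""
    return math.ceil(gop_per_sec / CORE_GMAC_PER_SEC)

# Face detection workload for a 1 Mpixel stream and for a VGA stream.
face_1m = FULL_HMAX_GOP_PER_SEC / SCALE_MERGE_FACTOR
face_vga = face_1m * VGA_PIXELS / MPIXEL

print(f"1 Mpixel: {face_1m:.1f} GOP/sec -> {cores_needed(face_1m)} cores")
print(f"VGA:      {face_vga:.1f} GOP/sec -> {cores_needed(face_vga)} cores")
```

Under these assumptions the sketch gives 2 NeuroCores for the 1 Mpixel stream and 1 NeuroCore for VGA, matching the slide.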

Human detection: Hmin
- 64 Gabor filters (7x7 to 37x37) + original image
- Local maxima (C1)
- C1 output
- Classification with RBF neural network (S2, C2)
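The two stages after the Gabor filters can be sketched with NumPy. The code below is a simplified illustration, not the presented implementation: the C1 stage is reduced to a max over non-overlapping blocks of a single S1 map, and the RBF unit is a plain Gaussian of the distance to a stored prototype. The functions `c1_local_max` and `rbf_response` are hypothetical. The filter count also assumes that sizes step by 2 pixels from 7 to 37 and that there are 4 orientations, which is one way to arrive at 64 filters.

```python
import numpy as np

# Assumption: sizes 7, 9, ..., 37 (16 sizes) times 4 orientations = 64 filters.
FILTER_SIZES = list(range(7, 38, 2))
N_ORIENTATIONS = 4

def c1_local_max(s1_map, pool):
    """Max over non-overlapping pool x pool blocks of a 2-D S1 map."""
    h, w = s1_map.shape
    h_crop, w_crop = h - h % pool, w - w % pool   # drop incomplete blocks
    blocks = s1_map[:h_crop, :w_crop].reshape(
        h_crop // pool, pool, w_crop // pool, pool)
    return blocks.max(axis=(1, 3))

def rbf_response(features, prototype, sigma=1.0):
    """Gaussian RBF unit: 1.0 when features equal the prototype."""
    diff = features.ravel() - prototype.ravel()
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma**2)))

s1 = np.arange(16.0).reshape(4, 4)
c1 = c1_local_max(s1, pool=2)
print(len(FILTER_SIZES) * N_ORIENTATIONS)  # 64
print(c1)                                  # [[ 5.  7.] [13. 15.]]
print(rbf_response(c1, c1))                # 1.0
```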

Human Detection Application Example

Human Detection Application Example
Pipeline: S1 layer (Gabor filters), C1 layer (max pooling), RBF classification, human detected
To reduce complexity, optimization from Masquelier et al. (PLoS Computational Biology, 2007):
- 5 image scales (1, 0.7, 0.5, 0.35 and 0.25)
- 4 orientations
- One Gabor filter (15x15) per scale and per orientation: 20 Gabor filters
1 Mpixel image complexity:
- S1, optimized Gabor filters: 2.9 GMAC reduced to 0.6 GMAC
- C1, max: 0.1 GOP
- RBF neural network: 0.3 GOP
- Total: 3.43 GMAC & OP reduced to 1 GMAC & OP
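The size of the reduced filter bank and the resulting complexity reduction can be checked with the sketch below. The orientation angles are an assumption (the slide only says "4 orientations"), and the original per-stage figures are those given earlier in the talk for the full 64-filter Hmax.

```python
from itertools import product

SCALES = [1, 0.7, 0.5, 0.35, 0.25]
ORIENTATIONS_DEG = [0, 45, 90, 135]   # assumed angles; slide says "4 orientations"
FILTER_SIZE = 15

# One 15x15 Gabor filter per (scale, orientation) pair.
bank = [(scale, angle, FILTER_SIZE)
        for scale, angle in product(SCALES, ORIENTATIONS_DEG)]

# Per-frame cost in GMAC & OP for a 1 Mpixel image.
original = {"S1": 2.9, "C1": 0.13, "RBF": 0.4}
optimized = {"S1": 0.6, "C1": 0.1, "RBF": 0.3}

total_original = sum(original.values())
total_optimized = sum(optimized.values())
reduction = total_original / total_optimized
optimized_gop_per_sec = total_optimized * 30   # 30 fps

print(len(bank))                                   # 20
print(f"reduction factor: {reduction:.2f}")        # 3.43
print(f"{optimized_gop_per_sec:.0f} GOP/sec")      # 30
```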

Human Detection Application Example
Complexity calculation divided by 3.43:
- Original Hmax, one IP camera, 1 Mpixel @ 30 fps: 103 GOP/sec
- Optimized Hmax, one IP camera, 1 Mpixel @ 30 fps: 30 GOP/sec
Using FDSOI 28 nm technology, one cluster of 4 Neuro-Cores @ 1 GHz: 32 GMAC/sec with 70 mW power consumption
- Optimized Hmax needs 4 Neuro-Cores for one IP camera, 1 Mpixel @ 30 fps: power consumption 70 mW
- For a VGA image @ 30 fps, only 2 Neuro-Cores: 35 mW

Human Detection Application Examples

Conclusion
- PNeuro architecture optimized for neuro-inspired algorithms: Hmax, Convolutional Neural Networks and, more generally, Deep Neural Networks
- PNeuro: a cascadable parallel architecture
- Performance in FDSOI 28 nm makes embedded applications with very low power consumption feasible:
  - Face detection needs only 20 mW for a VGA image @ 30 fps
  - Human detection needs only 35 mW for a VGA image @ 30 fps
- PNeuro is also implemented on FPGA

PNeuro on FPGA
First demonstration of an FPGA-based PNeuro:
- Single cluster configuration (4 Neuro-Cores)
- Embedded CNN application (60 neurons on the hidden layer, 450 KOps)
- Face extraction, 18,000 images in the database, 96% recognition rate
Same application ported to 5 different architectures:
- Embedded CPU: Raspberry Pi 2 B, Odroid XU3
- Embedded GPU: NVidia Tegra K1 (batch)
- Desktop CPU: Intel I7
- PNeuro, quad Neuro-Cores, using an in-house prototyping board

Target               Frequency   Energy efficiency
Intel I7 (CPU)       3400 MHz    160 images/W
Quad ARM A15 (CPU)   2000 MHz    350 images/W
Quad ARM A7 (CPU)    900 MHz     380 images/W
Tegra K1 (GPU)       850 MHz     600 images/W
PNeuro (FPGA)        100 MHz     2000 images/W

- The FPGA approach is already competitive with existing CPU and GPU solutions
- First FPGA product developed for early 2017 by GST
- Embedded FPGA: Artix 100 (~1 W), 17.6 cm² for the board, including one cluster
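To put the energy efficiency column in perspective, the sketch below computes how many times more images per watt the FPGA-based PNeuro processes compared with each other target. It uses only the values from the table; the dictionary and variable names are illustrative.

```python
# Energy efficiency in images/W, as listed in the table.
EFFICIENCY_IMAGES_PER_W = {
    "Intel I7 (CPU)": 160,
    "Quad ARM A15 (CPU)": 350,
    "Quad ARM A7 (CPU)": 380,
    "Tegra K1 (GPU)": 600,
    "PNeuro (FPGA)": 2000,
}

pneuro = EFFICIENCY_IMAGES_PER_W["PNeuro (FPGA)"]

# Ratio of PNeuro's efficiency to each other target's efficiency.
gains = {
    name: pneuro / value
    for name, value in EFFICIENCY_IMAGES_PER_W.items()
    if name != "PNeuro (FPGA)"
}

for name, gain in gains.items():
    print(f"{name}: PNeuro is {gain:.1f}x more energy efficient")
```

The ratios range from about 3.3x against the Tegra K1 to 12.5x against the Intel I7, with the PNeuro running at a much lower clock frequency.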

Article in EETimes
Embedded World demonstration (Feb. 2016)
http://www.electronics-eetimes.com/news/licensible-ip-core-accelerates-neural-networks

Thank you!