Deep Learning @ ST: Ultra Low Power Artificial Neural Network SoC in 28 nm FD-SOI. Nitin Chawla

1 Deep Learning @ ST: Ultra Low Power Artificial Neural Network SoC in 28 nm FD-SOI. Nitin Chawla, Senior Principal Engineer and Senior Member of Technical Staff at STMicroelectronics

2 Outline
Introduction
Chip architecture
ANN (DCNN) mapping
Co-Processor Subsystem
Chip Implementation
Results

3 Sensors Invasion & Data Explosion
By 2020, billions of sensors could gather 50 zettabytes (50 billion TB) of data annually (source: CeNSE).

4 Deep Convolutional Neural Networks: a key enabler
DCNNs excel in many computer vision applications, such as object detection and scene classification.
Example tags on LSVRC2014 images (courtesy of clarifai.com): Coffee, Croissant, Beverage, Morning, Breakfast, Food; Winter, Snow, Cold, Mammal, Dog, Arctic.
Reference: Stanford's CS 231N course by Andrej Karpathy and Justin Johnson.

5 A Zoo of Artificial Neural Networks Credit

6 DCNNs' Complexity Evolution
[Chart: operations (GOPS) and parameters (millions) per network]
ANNs ( ): 3 layers
AlexNet (2012): 7 layers
GoogLeNet (2014): 22 layers
VGG19 (2014): 19 layers
ResNet (2015): 152 layers

7 AlexNet basics
Layer                          Operations   Parameters
CONV 11x11, RELU, NORM, POOL   105M         35K
CONV 5x5, RELU, POOL           223M         307K
CONV 3x3, RELU                 149M         884K
CONV 3x3, RELU                 224M         649K
CONV 3x3, RELU, POOL           74M          442K
FC                             37M          37M
FC                             16M          16M
FC                             4M           4M
Total operations: 832 M. Total parameters: ~60M.
Figure legend: pooling, activation, kernels, feature maps.
Krizhevsky et al., NIPS 2012
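The totals on this slide can be checked by adding up the per-layer figures. The sketch below only sums the numbers as printed (M = millions, K = thousands); the variable names are illustrative and not from the talk.

```python
# Per-layer figures as printed on the slide.
# Order: five CONV stages followed by three FC layers.
OPS_M = [105, 223, 149, 224, 74, 37, 16, 4]   # operations, in millions
CONV_PARAMS_K = [35, 307, 884, 649, 442]      # conv kernel parameters, in thousands
FC_PARAMS_M = [37, 16, 4]                     # fully connected parameters, in millions

total_ops_m = sum(OPS_M)
conv_params_k = sum(CONV_PARAMS_K)
total_params_m = sum(FC_PARAMS_M) + conv_params_k / 1000

print(f"total operations: {total_ops_m} M")        # 832 M
print(f"conv parameters:  {conv_params_k} K")      # 2317 K
print(f"all parameters:   {total_params_m:.1f} M") # 59.3 M, i.e. ~60M
```

Almost all of the parameters sit in the FC layers, while the conv layers hold only about 2.3 M of them.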

8 Rationale for an Artificial Neural Network SoC
A deep-learning SoC for embedded applications would be very useful; IoT devices, for example, can benefit greatly from ANN capabilities.
BUT:
DCNNs on embedded devices make TOPS/W the figure of merit (FOM) for energy efficiency.
Power, cost, scalability, and bandwidth are constrained (e.g. no high-end CPUs, no GPUs).
Hardwired datapaths are ruled out by the vast selection of neural networks and ongoing research.

10 A complete SoC for ANN (DCNN) applications
DSP cluster:
8 DSP clusters, each with 2 custom 32-bit DSPs, a 4-way 16 KB I-cache, 64 KB local RAM, and a shared 64 KB RAM.
6 µW/MHz at 0.6 V, up to 1 GHz in ST 28 nm FD-SOI technology.
ISA extensions for DCNN execution.
Excess capacity for additional processing (e.g. ROI selection, filtering, etc.).
[Block diagram labels: STRED5, LXBAR, I+D MEM 64K, I$ 16K, C-MEM 64K, CXBAR, DSPc, Full xbar (64 bits), Coprocessors subsystem, SRAM, C2C Link, HOST + PERIPH + EXT MEM]

11 A complete SoC for ANN (DCNN) applications
4 MB (4x16x64 KB) of shared RAM banks, organized as 4 groups with a 64-bit bus port each to sustain peak DCNN throughput.
Used as a software-controlled L2 cache for feature maps and parameters, sized to accommodate all conv stages of a DCNN with an AlexNet level of complexity (compressed).
Each 64 KB bank has an individual sleep line, so it can be activated on demand and draws less power when not active.
Relative energy/power per word access: local SRAM 1x, on-chip SRAM 10x, LPDDR 100x.
[Block diagram labels: DSPc, Full xbar (64 bits), Coprocessors subsystem, SRAM, C2C Link, HOST + PERIPH + EXT MEM]
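To make the memory figures on this slide concrete, here is a minimal sketch that recomputes the shared RAM capacity and applies the 1x / 10x / 100x energy ratios to a hypothetical access mix. The access counts and the helper name `relative_energy` are assumptions chosen for illustration; only the bank organization and the ratios come from the slide.

```python
# Shared RAM organization from the slide: 4 groups x 16 banks x 64 KB.
GROUPS, BANKS_PER_GROUP, BANK_KB = 4, 16, 64
total_kb = GROUPS * BANKS_PER_GROUP * BANK_KB   # 4096 KB = 4 MB

# Relative energy per word access, as listed on the slide.
RELATIVE_ENERGY = {"local_sram": 1, "onchip_sram": 10, "lpddr": 100}

def relative_energy(accesses):
    """Sum the relative energy for a dict of {memory level: word accesses}."""
    return sum(RELATIVE_ENERGY[level] * n for level, n in accesses.items())

# Hypothetical access mixes (not measured data): every word fetched from
# LPDDR versus 90% of the words served from the on-chip SRAM.
all_external = relative_energy({"lpddr": 1000})
mostly_onchip = relative_energy({"onchip_sram": 900, "lpddr": 100})

print(total_kb, all_external, mostly_onchip)
```

Under these assumed counts the second mix costs about a fifth of the first, which is the motivation for keeping feature maps and parameters on chip.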

12 A complete SoC for ANN (DCNN) applications
8 convolution accelerators (CA 0 ... CA 7).
Configurable framework supports data-flow based processing.
16 stream engines: linked list, line/column stride, X/Y padding, rounding, packing and scaling.
Additional IPs: H264, MJPEG, 2 Census, 2 croppers, corner detector, 4 color converters, 4 sensor input IFs, 1 DVI output IF, digital MIC array IF.
[Block diagram labels: Display, Sensor IF, Color convert, H264, MJPEG, Stream Switch, CA 0 ... CA 7, Bus Interface, DSPc, Full xbar (64 bits), Coprocessors subsystem, SRAM, C2C Link, HOST + PERIPH + EXT MEM]

13 A complete SoC for ANN (DCNN) applications
Up to 32 Gbps (4 lanes of 8 Gbps) low-power chip-to-chip (C2C) high-speed link.
Connected to the main crossbar, allowing an extension of the internal bus to off-chip targets.
Enables both homogeneous and heterogeneous multi-chip configurations.
[Block diagram labels: DSPc, Full xbar (64 bits), Coprocessors subsystem, SRAM, C2C Link, HOST + PERIPH + EXT MEM]

15 AlexNet HW/SW partitioning
Legend: HW Acc, DSP.
Operations per layer: CONV 11x11 (RELU, NORM, POOL) 105M; CONV 5x5 (RELU, POOL) 223M; CONV 3x3 (RELU) 149M; CONV 3x3 (RELU) 224M; CONV 3x3 (RELU, POOL) 74M; FC 37M; FC 16M; FC 4M.
Total operations: 832 M.
CONV layers: 85-90% of total operations; 1 Conv Acc ~= 16 DSPs.
Non-conv layers are mapped to the DSPs to accommodate future DCNN evolution (leaky RELU, etc.).
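The partitioning rule on this slide can be sketched as a simple dispatcher. This is an illustration under stated assumptions, not the chip's scheduler. The function `assign` and the flat stage list are hypothetical. The counts of 8 convolution accelerators and 8 clusters of 2 DSPs come from the architecture slides, and the "1 Conv Acc ~= 16 DSPs" equivalence is taken at face value.

```python
# Figures from the slides: 8 convolution accelerators (CAs),
# 8 DSP clusters with 2 DSPs each, and "1 Conv Acc ~= 16 DSPs".
NUM_CAS = 8
NUM_DSPS = 8 * 2
DSPS_PER_CA = 16

def assign(stage):
    """Hypothetical dispatcher: CONV stages go to the CAs, the rest to DSPs."""
    return "HW Acc" if stage.startswith("CONV") else "DSP"

# AlexNet stages in the order shown on the slide.
STAGES = ["CONV 11x11", "RELU", "NORM", "POOL",
          "CONV 5x5", "RELU", "POOL",
          "CONV 3x3", "RELU",
          "CONV 3x3", "RELU",
          "CONV 3x3", "RELU", "POOL",
          "FC", "FC", "FC"]
plan = [(stage, assign(stage)) for stage in STAGES]

# Convolution capacity of the CA array expressed in DSP equivalents,
# and its ratio to the number of DSPs actually on the chip.
ca_array_in_dsps = NUM_CAS * DSPS_PER_CA
ratio_to_dsp_array = ca_array_in_dsps / NUM_DSPS

for stage, target in plan:
    print(f"{stage:12s} -> {target}")
print(ca_array_in_dsps, ratio_to_dsp_array)
```

Keeping the activation, normalization, pooling and FC stages in software is what lets a new activation function be supported without changing the hardware.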

16 AlexNet memory footprint
Legend: SRAM, EXT MEM.
Parameters per layer: CONV 11x11 (RELU, NORM, POOL) 35K; CONV 5x5 (RELU, POOL) 307K; CONV 3x3 (RELU) 884K; CONV 3x3 (RELU) 649K; CONV 3x3 (RELU, POOL) 442K; FC 37M; FC 16M; FC 4M.
Total parameters: ~60M.
On-chip SRAM: 2318 KB for parameters (8 bits) and 1436 KB for feature maps (16 bits).
~10 MB of external RAM for the FC layers.
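A quick check that this footprint fits the on-chip memory: the sketch below adds the two SRAM figures from this slide and compares them with the 4 x 16 banks of 64 KB described earlier in the deck. The bank count needed assumes perfectly dense packing, which is a simplification, and the variable names are illustrative.

```python
import math

# On-chip footprint figures from this slide.
PARAMS_KB = 2318      # conv parameters at 8 bits
FMAPS_KB = 1436       # feature maps at 16 bits

# Shared RAM organization from the architecture slides.
BANK_KB = 64
TOTAL_BANKS = 4 * 16

onchip_kb = PARAMS_KB + FMAPS_KB
# Assumption: data is packed densely, so banks are filled one after another.
banks_needed = math.ceil(onchip_kb / BANK_KB)
fits = banks_needed <= TOTAL_BANKS
idle_banks = TOTAL_BANKS - banks_needed

print(onchip_kb, banks_needed, fits, idle_banks)
```

Under that packing assumption a few banks stay unused and could remain in sleep mode through their individual sleep lines.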

17 Logical to physical mapping
Feature maps and kernels are sliced into batches that are processed iteratively, and the results are accumulated.
The batch size is set per layer, matching feature and kernel parameters to HW resources and ceilings.
[Figure labels: IN FEATURE MAP (FEATURE WIDTH, FEATURE HEIGHT, FEATURE DEPTH), BATCH 0 ... BATCH N, BATCH SIZE, KER 0 ... KER Q, KERNEL WIDTH, Σ, OUT FEATURE MAP]
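The batching idea can be illustrated in a few lines of plain Python. This sketch assumes the slicing runs along the feature depth, as the Σ symbols in the figure suggest, and shows that accumulating per-batch partial results reproduces the full convolution. The function names and test data are made up for the example.

```python
def conv_depth_slice(fmap, kernel, d0, d1):
    """Valid, stride-1 convolution restricted to depth channels d0..d1-1."""
    h, w = len(fmap[0]), len(fmap[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out = [[0] * (w - kw + 1) for _ in range(h - kh + 1)]
    for d in range(d0, d1):
        for y in range(h - kh + 1):
            for x in range(w - kw + 1):
                out[y][x] += sum(fmap[d][y + i][x + j] * kernel[d][i][j]
                                 for i in range(kh) for j in range(kw))
    return out

def conv_batched(fmap, kernel, batch_size):
    """Process the depth in batches and accumulate the partial results."""
    depth = len(fmap)
    acc = None
    for d0 in range(0, depth, batch_size):
        part = conv_depth_slice(fmap, kernel, d0, min(d0 + batch_size, depth))
        if acc is None:
            acc = part
        else:
            acc = [[a + p for a, p in zip(row_a, row_p)]
                   for row_a, row_p in zip(acc, part)]
    return acc

# Small integer test data: depth 4, 4x4 feature map, 3x3 kernel.
DEPTH, SIZE, K = 4, 4, 3
fmap = [[[d + y * SIZE + x for x in range(SIZE)] for y in range(SIZE)]
        for d in range(DEPTH)]
kernel = [[[(d + 1) * (i - j) for j in range(K)] for i in range(K)]
          for d in range(DEPTH)]

full = conv_depth_slice(fmap, kernel, 0, DEPTH)
print(all(conv_batched(fmap, kernel, b) == full for b in (1, 2, 3, 4)))
```

Because the result does not depend on the batch size, the size can be chosen per layer to fit whatever buffer capacity is available.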

19 Reconfigurable Accelerator framework
Design time: parametric design.
Startup: blocks are configured at startup; unidirectional stream links create ad-hoc processing chains.
Runtime: engines run autonomously through linked lists and sync up with the DSP clusters through interrupts.
[Block diagram labels: Image Sensor IF & ISP (x2), Display out (DVI) Interface, Color convert, Cropper, H264, MJPEG, Ctrl Regs, Stream Switch, stream engines E0 ... E15, Bus Arbiter & System Bus Interface, CA 0 ... CA 7; data streams: COMP. IMAGE, RGB IMAGE, BATCH -1, BATCH, FEATURE, KERNEL]

20 Reconfigurable Accelerator framework
Virtual stream links:
Ferry data to/from accelerators, interfaces and engines.
A flow control mechanism is provided.
Streams can be multicast to multiple destinations.
More flexible than hardware data paths.
More power efficient than a bus.
[Block diagram labels: Image Sensor IF & ISP (x2), Display out (DVI) Interface, Color convert, Cropper, H264, MJPEG, Ctrl Regs, Stream Switch, stream engines E0 ... E15, Bus Arbiter & System Bus Interface, CA 0 ... CA 7; data streams: COMP. IMAGE, RGB IMAGE, BATCH -1, BATCH, FEATURE, KERNEL]
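A toy model can help picture the stream switch described above: links are unidirectional, each destination port is driven by one source, and one source can be multicast to several destinations. The class `StreamSwitch`, its methods, and the port names such as `CA0.feature` are hypothetical and do not reflect the chip's register interface; flow control is not modeled.

```python
class StreamSwitch:
    """Toy routing table: each destination port listens to one source port."""

    def __init__(self):
        self.source_of = {}          # destination port -> source port

    def link(self, src, dst):
        """Create a unidirectional stream link from src to dst."""
        if dst in self.source_of:
            raise ValueError(f"{dst} is already driven by {self.source_of[dst]}")
        self.source_of[dst] = src

    def destinations(self, src):
        """All destination ports currently fed by src."""
        return sorted(d for d, s in self.source_of.items() if s == src)

    def push(self, src, word):
        """Deliver one word to every destination linked to src (multicast)."""
        return {dst: word for dst in self.destinations(src)}


# Configure a small chain at "startup": stream engine E0 multicasts the
# feature stream to two accelerators, while E1 and E2 feed their kernels.
sw = StreamSwitch()
sw.link("E0", "CA0.feature")
sw.link("E0", "CA1.feature")
sw.link("E1", "CA0.kernel")
sw.link("E2", "CA1.kernel")

delivered = sw.push("E0", 0x1234)
print(delivered)
```

In this model the feature data is read once by E0 and reaches both accelerators, which is the kind of saving multicast offers over issuing two separate reads.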

21 Exploit parallelism and locality
Three execution modes are shown: parallel batch execution, chained batch execution, and parallel and chained batch execution.
Chained and parallel batch execution on multiple accelerators reduces bandwidth, power, and the number of required channels.
[Figure labels: FEATURE, FEATURE (next batch), KERNEL 0/1, KERNEL 0-3, BATCH -1 K0, BATCH -1 K1, CA 0 ... CA N, CA 0.0 ... CA 0.M, CA N.0 ... CA N.M, OUT K0, OUT K1]
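One way to see why chaining reduces bandwidth is to count how often a partial result has to travel through memory. The model below is a deliberate simplification and an assumption, not a description of the actual hardware: a chain of accelerators forwards its partial sum from one stage to the next, so only the result at the end of each pass is written out, and each pass after the first reads the previous partial sum back in.

```python
import math

def partial_sum_transfers(num_batches, chain_length):
    """Memory transfers of the partial output for one output feature map.

    Toy model: each pass through the chain consumes `chain_length` batches,
    writes one partial result, and every pass except the first reads the
    previous partial result back.
    """
    passes = math.ceil(num_batches / chain_length)
    writes = passes
    reads = passes - 1
    return reads + writes

# Eight batches accumulated with chains of 1, 2, 4 and 8 accelerators.
transfers = {m: partial_sum_transfers(8, m) for m in (1, 2, 4, 8)}
print(transfers)   # {1: 15, 2: 7, 4: 3, 8: 1}
```

In this model a single accelerator moves the partial result 15 times, while a chain of two moves it 7 times.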

22 Parameter Compression
Kernel weights can be quantized non-linearly with 8 or fewer bits (e.g. with KNN); the convolution accelerator supports decompression in HW.
AlexNet top-1 classification error rate increase of 0.3%.
[Figure panel: Layer 1]
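Non-linear quantization of this kind is commonly done with a codebook: each weight is replaced by the index of its nearest codebook entry, and decompression is a table lookup. The sketch below uses 1-D k-means (Lloyd's algorithm) to build the codebook, which is a substitution chosen for illustration. The slide names KNN and does not say how the codebook is actually derived. All function names and the sample weights are hypothetical.

```python
def nearest(centroids, w):
    """Index of the codebook entry closest to weight w."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - w))

def build_codebook(weights, bits, iterations=20):
    """1-D k-means with 2**bits centroids and a deterministic start."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for w in weights:
            buckets[nearest(centroids, w)].append(w)
        # Empty buckets keep their previous centroid.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids

def compress(weights, centroids):
    """Replace every weight by its codebook index."""
    return [nearest(centroids, w) for w in weights]

def decompress(indices, centroids):
    """Table lookup, the step a hardware decompressor would perform."""
    return [centroids[i] for i in indices]

# Made-up weights clustered around a few values, quantized to 2 bits.
weights = [-0.9, -0.8, -0.1, 0.0, 0.1, 0.75, 0.8, 0.85]
codebook = build_codebook(weights, bits=2)
indices = compress(weights, codebook)
restored = decompress(indices, codebook)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(indices, max_error)
```

Because the centroids follow the weight distribution instead of a uniform grid, few bits are enough where the weights cluster. In this made-up example the stored data shrinks to 2 bits per weight plus a four-entry table.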

24 Near-threshold FD-SOI technology
Body bias: needed for temperature and process compensation in ULV and ULP design.

25 Wide Voltage Range, High Performance, Low Voltage Monosupply SRAM
0.120 µm² single p-well bitcell with reduced variability.
In-situ tracking of bitcell current and programmable read time for best speed and lowest dynamic power.
In-situ tracking of wordline delay and slope for robust low-voltage read/write.
Energy conservation:
Independent array and periphery power switches.
In-built isolation for FSM stability in power-down modes.
Extensive internal signal and clock Programmable buffers to optimize performance and power across instances.

26 Ultra-Wide DVFS Range
LVT design with heterogeneous poly-bias levels -> performance vs leakage.
GALS and low-insertion-delay clock networks to minimize on-chip variation margins.
Mono-supply memories with fine-grained power switches and sleep mode.
DVFS energy efficiency improvements via body bias.
FBB dynamic range split between temperature (T) and process (P) compensation.
[Plot: wide DVFS range; axes: frequency, GOPS/W]

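28 AlexNet CAs Performance
[Table, values not preserved. Columns: Layer, MOPS, Time [ms], Load %, then GOPS/W and Pwr [mW] (max, avg) for each of two precisions, 16(F)x16(W)->16 and 8(F)x8(W)->16. Last row: Total.]
Conditions: supply voltage (value not preserved) V, 25 °C, 4 chains of 2 CAs, batch of 1 image (227x227).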

29 AlexNet Complete Application
0.6 V, 10 FPS at 37.5 mW, 2 chained CAs.
[Block diagram labels: Sensor IF, CROP 227x227, RGB->YUV, KER. MEM, IN FMAP, CA, CA, OUT FMAP, DSPs, MJPEG, MEM, ARM + SPI, To PC]
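The operating point on this slide translates directly into energy per frame. The sketch below also derives an application-level operation rate and efficiency, under two assumptions that the slide does not confirm: that all 832 M AlexNet operations from the earlier slide are executed for every frame, and that the 37.5 mW figure covers that whole workload.

```python
# Operating point from the slide.
POWER_MW = 37.5
FPS = 10

# Assumption: every frame executes the full 832 M operations listed
# on the "AlexNet basics" slide.
OPS_PER_FRAME_M = 832

energy_per_frame_mj = POWER_MW / FPS                   # mW / (frames/s) = mJ per frame
ops_per_second_g = OPS_PER_FRAME_M * FPS / 1000        # GOPS
app_gops_per_w = ops_per_second_g / (POWER_MW / 1000)  # GOPS per watt

print(f"{energy_per_frame_mj:.2f} mJ per frame")
print(f"{ops_per_second_g:.2f} GOPS sustained")
print(f"{app_gops_per_w:.1f} GOPS/W at application level")
```

This is a whole-application figure at a fixed frame rate, so it is not directly comparable with the accelerator-only efficiency numbers reported elsewhere in the deck.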

30 Deep Learning Demonstration
AlexNet Object Recognition
Emotion Detection
Autonomous Game Control

31 Summary
An ultra-low-power SoC for ANN (DCNN) real-world embedded and IoT applications.
Designed in 28 nm FD-SOI, with ultra-wide DVFS capability.
Reconfigurable accelerated data-flow framework.
Parametric HW accelerator for the computational bottlenecks of large ANNs (DCNNs).
Exploits different kinds of parallelism to improve performance and reduce power.
DSP array with an optimized ISA for additional processing needs.
Average peak efficiency of 2.9 TOPS/W on 28 nm FD-SOI silicon.

32 THANK YOU!
