Deep ST, Ultra Low Power Artificial Neural Network SOC in 28 FD-SOI. Nitin Chawla
1 Deep ST, Ultra Low Power Artificial Neural Network SOC in 28 FD-SOI Nitin Chawla, Senior Principal Engineer and Senior Member of Technical Staff at STMicroelectronics
2 Outline: Introduction; Chip architecture; ANN (DCNN) mapping; Co-Processor Subsystem; Chip Implementation; Results
3 Sensors Invasion & Data Explosion. By 2020, billions of sensors could gather 50 zettabytes (50 billion TB) of data annually. Source: CeNSE.
4 Deep Convolutional Neural Networks: a key enabler. DCNNs excel in many computer vision applications, such as object detection and scene classification. Example image tags (LSVRC2014 images, courtesy of clarifai.com): Coffee, Croissant, Beverage, Morning, Breakfast, Food; Winter, Snow, Cold, Mammal, Dog, Arctic. Reference: Stanford's CS 231N course by Andrej Karpathy and Justin Johnson.
5 A Zoo of Artificial Neural Networks Credit
6 DCNNs' Complexity Evolution. [Chart: operations (GOPS) and parameters (millions) per network.] ANNs ( ): 3 layers; AlexNet (2012): 7 layers; GoogLeNet (2014): 22 layers; VGG19 (2014): 19 layers; ResNet (2015): 152 layers.
7 AlexNet basics (Krizhevsky et al., NIPS 2012)
Layer | Operations | Parameters
CONV 11x11, RELU, NORM, POOL | 105M | 35K
CONV 5x5, RELU, POOL | 223M | 307K
CONV 3x3, RELU | 149M | 884K
CONV 3x3, RELU | 224M | 649K
CONV 3x3, RELU, POOL | 74M | 442K
FC | 37M | 37M
FC | 16M | 16M
FC | 4M | 4M
Total | 832M | ~60M
Figure legend: pooling, activation, kernels, feature maps.
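The totals on the AlexNet basics slide can be cross-checked from its per-layer figures. The sketch below is only a sanity check on the transcribed numbers; the pairing of each figure with a layer follows the order in which the values appear on the slide, which is an assumption.

```python
# Per-layer figures as transcribed from the slide.
# Columns: layer label, operations (millions), parameters (millions).
# The layer-to-value pairing is assumed from the order on the slide.
LAYERS = [
    ("CONV 11x11", 105, 0.035),
    ("CONV 5x5", 223, 0.307),
    ("CONV 3x3", 149, 0.884),
    ("CONV 3x3", 224, 0.649),
    ("CONV 3x3", 74, 0.442),
    ("FC", 37, 37.0),
    ("FC", 16, 16.0),
    ("FC", 4, 4.0),
]


def totals(layers):
    """Return (total operations, total parameters), both in millions."""
    total_ops = sum(ops for _, ops, _ in layers)
    total_params = sum(params for _, _, params in layers)
    return total_ops, total_params


total_ops_m, total_params_m = totals(LAYERS)
print(f"operations: {total_ops_m} M")
print(f"parameters: {total_params_m:.1f} M")
```

The sums come to 832 M operations and about 59.3 M parameters, in line with the slide's "832 M" and "~60M".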
8 Rationale for an Artificial Neural Network SoC. A deep-learning SoC would be very useful for embedded applications; IoT devices, for example, can benefit greatly from ANN capabilities. BUT: running DCNNs on embedded devices makes TOPS/W the figure of merit (FOM) for energy efficiency. Power, cost, scalability, and bandwidth are constrained (e.g. no high-end CPUs, no GPUs). Hardwired datapaths are ruled out by the vast selection of neural networks and ongoing research.
10 A complete SoC for ANN(DCNN) applications. [Block diagram labels: DSP Cluster, STRED5, LXBAR, I+D MEM 64K, I$ 16K, C-MEM 64K, CXBAR, DSPc, Full xbar (64 bits), Coprocessors subsystem, SRAM, C2C Link, HOST + PERIPH + EXT MEM.] 8 DSP clusters, each with 2 custom 32-bit DSPs, a 4-way 16KB I-Cache, 64KB of local RAM, and a shared 64KB RAM. 6 µW/MHz at 0.6 V, up to 1 GHz, in ST FD-SOI 28nm technology. ISA extensions for DCNN execution. Excess capacity for additional processing (e.g. ROI selection, filtering, etc.).
11 A complete SoC for ANN(DCNN) applications. [Block diagram labels: DSPc, Full xbar (64 bits), Coprocessors subsystem, SRAM, C2C Link, HOST + PERIPH + EXT MEM.] 4 MB (4x16x64KB) of shared RAM banks, organized as 4 groups, each with a 64-bit bus port, to sustain peak DCNN throughput. Used as a SW-controlled L2 cache for feature maps and parameters, sized to accommodate all conv stages of a DCNN with an AlexNet level of complexity (compressed). Each 64KB bank has an individual sleep-line control, so it can be activated selectively on demand and consumes less power when inactive. Energy/power per word access: local SRAM 1x, on-chip SRAM 10x, LPDDR 100x.
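The relative access costs quoted on the shared-RAM slide (1x, 10x, 100x) suggest a simple way to compare memory traffic patterns. The sketch below is illustrative only: the function name and the access counts in the example are hypothetical, and reading "4x16x64KB" as 4 groups of 16 banks is an assumption.

```python
# Relative energy per word access, as quoted on the slide.
RELATIVE_ENERGY = {"local_sram": 1, "onchip_sram": 10, "lpddr": 100}

# Shared RAM capacity, reading "4x16x64KB" as 4 groups x 16 banks x 64 KB (assumed).
GROUPS, BANKS_PER_GROUP, BANK_KB = 4, 16, 64
SHARED_RAM_KB = GROUPS * BANKS_PER_GROUP * BANK_KB


def relative_energy(accesses):
    """Weight word-access counts per memory level by their relative cost."""
    return sum(RELATIVE_ENERGY[level] * count for level, count in accesses.items())


# Hypothetical traffic mixes: the same 1110 word accesses, distributed differently.
mostly_local = {"local_sram": 1000, "onchip_sram": 100, "lpddr": 10}
mostly_external = {"local_sram": 10, "onchip_sram": 100, "lpddr": 1000}

print(SHARED_RAM_KB, "KB of shared RAM")
print(relative_energy(mostly_local), "vs", relative_energy(mostly_external))
```

With these made-up counts, the external-heavy mix costs roughly 34 times more than the local-heavy one, which is the motivation for keeping feature maps and parameters on chip.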
12 A complete SoC for ANN(DCNN) applications. [Block diagram labels: Display, Sensor IF, Color convert, H264, MJPEG, Stream Switch, CA 0 ... CA 7, Bus Interface, DSPc, Full xbar (64 bits), Coprocessors subsystem, SRAM, C2C Link, HOST + PERIPH + EXT MEM.] 8 convolution accelerators. Configurable framework that supports data-flow based processing. 16 stream engines: linked list, line/column stride, X/Y padding, rounding, packing & scaling. Additional IPs: H264, MJPEG, 2 Census, 2 croppers, corner detector, 4 color converters, 4 sensor input IFs, 1 DVI output IF, digital MIC array IF.
13 A complete SoC for ANN(DCNN) applications. [Block diagram labels: DSPc, Full xbar (64 bits), Coprocessors subsystem, SRAM, C2C Link, HOST + PERIPH + EXT MEM.] Low-power chip-to-chip (C2C) high-speed link, up to 32 Gbps (4 lanes of 8 Gbps). Connected to the main Xbar, it extends the internal bus to off-chip targets. Enables both homogeneous and heterogeneous multi-chip configurations.
15 AlexNet HW/SW partitioning. [Figure legend: HW Acc, DSP.] Per-layer operations (total 832 M): CONV 11x11, RELU, NORM, POOL: 105M; CONV 5x5, RELU, POOL: 223M; CONV 3x3, RELU: 149M; CONV 3x3, RELU: 224M; CONV 3x3, RELU, POOL: 74M; FC: 37M; FC: 16M; FC: 4M. 85-90% of total operations. CONV layers: 1 Conv Acc ~= 16 DSPs. Non-conv layers go to the DSPs to accommodate future DCNN evolution (leaky RELU, etc.).
16 AlexNet memory footprint. [Figure legend: SRAM, EXT MEM.] Parameters per layer (total ~60M): CONV 11x11, RELU, NORM, POOL: 35K; CONV 5x5, RELU, POOL: 307K; CONV 3x3, RELU: 884K; CONV 3x3, RELU: 649K; CONV 3x3, RELU, POOL: 442K; FC: 37M; FC: 16M; FC: 4M. On-chip SRAM: 2318 KB for parameters (8 bits) and 1436 KB for feature maps (16 bits). ~10 MB of external RAM for the FC layers.
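The on-chip figures on the memory footprint slide can be checked against the 4 MB of shared RAM described earlier in the deck. This is a back-of-the-envelope sketch: it assumes the 4 MB is made of 64 banks of 64 KB, and the idea of counting idle banks as candidates for the per-bank sleep control is an illustration, not a statement about how the chip is actually operated.

```python
import math

BANK_KB = 64
TOTAL_BANKS = 4 * 16              # assumed reading of "4x16x64KB"
SRAM_KB = TOTAL_BANKS * BANK_KB   # 4096 KB = 4 MB

PARAMS_KB = 2318   # conv parameters at 8 bits (from the slide)
FMAPS_KB = 1436    # feature maps at 16 bits (from the slide)

footprint_kb = PARAMS_KB + FMAPS_KB
headroom_kb = SRAM_KB - footprint_kb
banks_in_use = math.ceil(footprint_kb / BANK_KB)
banks_idle = TOTAL_BANKS - banks_in_use

print(f"footprint: {footprint_kb} KB of {SRAM_KB} KB, headroom {headroom_kb} KB")
print(f"banks in use: {banks_in_use}, idle banks: {banks_idle}")
```

Under these assumptions the conv-stage footprint is 3754 KB, which fits in the 4096 KB of shared RAM with 342 KB to spare.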
17 Logical to physical mapping. [Diagram labels: IN FEATURE MAP, FEATURE WIDTH, FEATURE HEIGHT, FEATURE DEPTH, BATCH 0 ... BATCH N, BATCH SIZE, KER 0 ... KER Q, KERNEL WIDTH, Σ, OUT FEATURE MAP.] Feature maps and kernels are sliced into batches that are processed iteratively, and the results are accumulated. The batch size is set per layer, matching feature and kernel parameters to HW resources and ceilings.
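The batching idea on the mapping slide can be illustrated in software. The sketch below assumes that batches are slices along the feature depth and that each batch contributes a partial sum to every output map; the function names and the toy data are hypothetical, and nothing here reflects the actual hardware scheduling.

```python
def conv2d_valid(plane, kernel):
    """Valid 2-D correlation of one feature-map plane with one kernel plane."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(plane) - kh + 1, len(plane[0]) - kw + 1
    return [[sum(plane[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(ow)] for y in range(oh)]


def add_planes(a, b):
    """Element-wise sum of two equally sized planes."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def conv_layer_batched(fmaps, kernels, batch_size):
    """fmaps: D planes. kernels: Q kernels, each with D planes.
    Depth is processed batch_size planes at a time; partial sums accumulate."""
    depth = len(fmaps)
    oh = len(fmaps[0]) - len(kernels[0][0]) + 1
    ow = len(fmaps[0][0]) - len(kernels[0][0][0]) + 1
    outs = [[[0] * ow for _ in range(oh)] for _ in kernels]
    for start in range(0, depth, batch_size):          # one batch per iteration
        stop = min(start + batch_size, depth)
        for q, kernel in enumerate(kernels):           # one accumulator per kernel
            for d in range(start, stop):
                outs[q] = add_planes(outs[q], conv2d_valid(fmaps[d], kernel[d]))
    return outs


# Toy integer data: depth 4, 5x5 planes, two 3x3 kernels.
FMAPS = [[[(d + 5 * y + x) % 7 for x in range(5)] for y in range(5)] for d in range(4)]
KERNELS = [[[[q + d + i - j for j in range(3)] for i in range(3)]
            for d in range(4)] for q in range(2)]

full = conv_layer_batched(FMAPS, KERNELS, batch_size=4)
sliced = conv_layer_batched(FMAPS, KERNELS, batch_size=1)
print(full == sliced)
```

Because accumulation is a plain sum over depth, the output is the same for any batch size, so the batch size can be chosen per layer purely to fit the available resources.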
19 Reconfigurable Accelerator framework. [Block diagram labels: Image Sensor IF & ISP (x2), Display out (DVI) Interface, Color convert, Cropper, H264, MJPEG, Ctrl Regs, Stream Switch, stream engines E0 ... E15, Bus Arbiter & System Bus Interface, CA 0 ... CA 7; streams: COMP. IMAGE, RGB IMAGE, BATCH -1, BATCH, FEATURE, KERNEL.] Design-time parametric. STARTUP: blocks are configured, and unidirectional stream links create ad-hoc processing chains. RUNTIME: engines run autonomously through linked lists and sync up with the DSP clusters via interrupts.
20 Reconfigurable Accelerator framework. [Block diagram labels: Image Sensor IF & ISP (x2), Display out (DVI) Interface, Color convert, Cropper, H264, MJPEG, Ctrl Regs, Stream Switch, stream engines E0 ... E15, Bus Arbiter & System Bus Interface, CA 0 ... CA 7; streams: COMP. IMAGE, RGB IMAGE, BATCH -1, BATCH, FEATURE, KERNEL.] Virtual stream links: ferry data to/from accelerators, interfaces, and engines; a flow-control mechanism is provided; streams can be multicast to multiple destinations; more flexible than hardwired data paths; more power efficient than a bus.
21 Exploit parallelism and locality. [Three diagrams: Parallel Batch Execution, Chained Batch Execution, Parallel and Chained Batch Execution. Labels: FEATURE, FEATURE (next batch), KERNEL 0/1, KERNEL0-3, BATCH -1 K0, BATCH -1 K1, CA 0 ... CA N, CA 0.0 ... CA N.M, OUT K0, OUT K1.] Chained and parallel batch execution on multiple accelerators reduces bandwidth, power, and the number of required channels.
22 Parameter Compression. [Chart: Layer 1.] Kernel weights can be quantized non-linearly with 8 or fewer bits (e.g. with KNN). The Convolution Accelerator supports decompression in HW. AlexNet top-1 classification error rate increase of 0.3%.
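Non-linear quantization of kernel weights can be sketched as a codebook lookup: each weight is stored as a short index, and decompression is a table read. The slide names KNN as one option; the sketch below swaps in a plain 1-D k-means (Lloyd) codebook instead, which is an assumption, as are the function names, the 4-bit code width, and the synthetic Gaussian weights.

```python
import random


def nearest(value, codebook):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda k: abs(value - codebook[k]))


def kmeans_1d(values, n_codes, iters=20):
    """Build a non-uniform codebook with 1-D k-means (Lloyd iterations)."""
    lo, hi = min(values), max(values)
    # Start from evenly spaced centroids across the weight range.
    codebook = [lo + (hi - lo) * (k + 0.5) / n_codes for k in range(n_codes)]
    for _ in range(iters):
        buckets = [[] for _ in codebook]
        for v in values:
            buckets[nearest(v, codebook)].append(v)
        # Move each centroid to the mean of its bucket; keep it if the bucket is empty.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook


def compress(values, codebook):
    return [nearest(v, codebook) for v in values]


def decompress(indices, codebook):
    return [codebook[i] for i in indices]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
WEIGHTS = [random.gauss(0.0, 0.05) for _ in range(500)]  # synthetic stand-in weights

uniform = kmeans_1d(WEIGHTS, 16, iters=0)   # evenly spaced 4-bit codebook
learned = kmeans_1d(WEIGHTS, 16)            # non-linear 4-bit codebook
indices = compress(WEIGHTS, learned)
restored = decompress(indices, learned)
print(mse(WEIGHTS, restored) <= mse(WEIGHTS, decompress(compress(WEIGHTS, uniform), uniform)))
```

The learned codebook places more levels where weights are dense, so its reconstruction error is no worse than that of evenly spaced levels at the same bit width.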
24 Near-threshold FD-SOI technology. Body bias: needed for temperature and process compensation in ULV and ULP design.
25 Wide Voltage Range High Performance Low Voltage Monosupply SRAM. 0.120 µm² single p-well bitcell with reduced variability. In-situ tracking of bitcell current and programmable read time for best speed and lowest dynamic power. In-situ tracking of wordline delay and slope for robust low-voltage read/write. Energy conservation: independent array and periphery power switches; in-built isolation for FSM stability in power-down modes. Extensive internal signal and clock Programmable buffers to optimize performance and power across instances.
26 Ultra-Wide DVFS Range. LVT design with heterogeneous poly-bias levels to trade performance against leakage. GALS and low-insertion-delay clock networks to minimize on-chip variation margins. Mono-supply memories with fine-grained power switches and sleep mode. DVFS energy-efficiency improvements via body bias. FBB dynamic range split between temperature and process compensation. [Chart: wide DVFS range, frequency and GOPS/W.]
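28 AlexNet CAs Performance. [Table; cell values were not preserved in the transcription. Columns: Layer, MOPS, Time [ms], Load %, then GOPs/W (max, avg) and Pwr [mW] for each of the 16(F)x16(W)->16 and 8(F)x8(W)->16 configurations. Last row: Total.] V 25C, 4 chains of 2 CAs, batch of 1 image (227x227).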
29 AlexNet Complete Application. [Data-flow diagram labels: Sensor IF, CROP 227x227, RGB->YUV, MJPEG, KER. MEM, IN FMAP, CA, CA, OUT FMAP, DSPs, MEM, ARM + SPI, To PC.] 0.6 V, 10 FPS with 37.5 mW, 2 chained CAs.
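The operating point quoted for the complete application (10 FPS at 37.5 mW) translates directly into an energy per frame. The sketch below also derives an application-level throughput figure by assuming that each frame costs the 832 M operations listed earlier in the deck; that pairing is an assumption, and the result is not comparable to the accelerator's peak efficiency figure.

```python
POWER_W = 37.5e-3        # 37.5 mW (from the slide)
FPS = 10                 # frames per second (from the slide)
OPS_PER_FRAME = 832e6    # assumption: one AlexNet pass, per the earlier totals

energy_per_frame_j = POWER_W / FPS
ops_per_second = OPS_PER_FRAME * FPS
ops_per_watt = ops_per_second / POWER_W   # whole-application figure, not a peak

print(f"energy per frame: {energy_per_frame_j * 1e3:.2f} mJ")
print(f"throughput: {ops_per_second / 1e9:.2f} GOPS")
print(f"application-level efficiency: {ops_per_watt / 1e9:.0f} GOPS/W")
```

Under these assumptions each classified frame costs 3.75 mJ.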
30 Deep Learning Demonstration: AlexNet object recognition, emotion detection, autonomous game control.
31 Summary. An ultra-low-power SoC for real-world embedded and IoT ANN(DCNN) applications. Designed in 28nm FD-SOI with ultra-wide DVFS capability. Reconfigurable accelerated data-flow framework. Parametric HW accelerator for the computational bottlenecks of large ANNs (DCNNs). Exploits different kinds of parallelism to improve performance and reduce power. DSP array with an optimized ISA for additional processing needs. Average peak efficiency of 2.9 TOPS/W on 28nm FD-SOI silicon.
32 THANK YOU!
I N S T I T U T D E R E C H E R C H E T E C H N O L O G I Q U E L évolution des architectures et des technologies d intégration des circuits intégrés dans les Data centers 10/04/2017 Les Rendez-vous de
More information2D/3D Graphics Accelerator for Mobile Multimedia Applications. Ramchan Woo, Sohn, Seong-Jun Song, Young-Don
RAMP-IV: A Low-Power and High-Performance 2D/3D Graphics Accelerator for Mobile Multimedia Applications Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae,, and Hoi-Jun Yoo oratory Dept. of EECS,
More informationMulti-Core SoCs for ADAS and Image Recognition Applications
Multi-Core SoCs for ADAS and Image Recognition Applications Takashi Miyamori, Senior Manager Embedded Core Technology Development Department Center for Semiconductor Research & Development Storage Device
More informationVLSI Design Automation. Maurizio Palesi
VLSI Design Automation 1 Outline Technology trends VLSI Design flow (an overview) 2 Outline Technology trends VLSI Design flow (an overview) 3 IC Products Processors CPU, DSP, Controllers Memory chips
More informationMachine learning for the Internet of Things
Machine learning for the Internet of Things Chris Shore Director of Embedded Solutions Arm 2018 Arm Limited April 2018 More Intelligence at the Edge Arm Cortex-M Expanding opportunity for the embedded
More informationSpiNNaker - a million core ARM-powered neural HPC
The Advanced Processor Technologies Group SpiNNaker - a million core ARM-powered neural HPC Cameron Patterson cameron.patterson@cs.man.ac.uk School of Computer Science, The University of Manchester, UK
More informationA Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on
A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on on-chip Donghyun Kim, Kangmin Lee, Se-joong Lee and Hoi-Jun Yoo Semiconductor System Laboratory, Dept. of EECS, Korea Advanced
More informationLow-Power Neural Processor for Embedded Human and Face detection
Low-Power Neural Processor for Embedded Human and Face detection Olivier Brousse 1, Olivier Boisard 1, Michel Paindavoine 1,2, Jean-Marc Philippe, Alexandre Carbon (1) GlobalSensing Technologies (GST)
More informationOptimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Chen Zhang 1, Peng Li 3, Guangyu Sun 1,2, Yijin Guan 1, Bingjun Xiao 3, Jason Cong 1,2,3 1 Peking University 2 PKU/UCLA Joint
More informationInterconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp
Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary
More informationAttack Your SoC Power Challenges with Virtual Prototyping
Attack Your SoC Power Challenges with Virtual Prototyping Stefan Thiel Gunnar Braun Accellera Systems Initiative 1 Agenda Part #1: Power-aware Architecture Definition Part #2: Power-aware Software Development
More informationLow-Power Processor Solutions for Always-on Devices
Low-Power Processor Solutions for Always-on Devices Pieter van der Wolf MPSoC 2014 July 7 11, 2014 2014 Synopsys, Inc. All rights reserved. 1 Always-on Mobile Devices Mobile devices on the move Mobile
More informationAn Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki
An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &
More informationSmall is the New Big: Data Analytics on the Edge
Small is the New Big: Data Analytics on the Edge An overview of processors and algorithms for deep learning techniques on the edge Dr. Abhay Samant VP Engineering, Hiller Measurements Adjunct Faculty,
More informationHello Edge: Keyword Spotting on Microcontrollers
Hello Edge: Keyword Spotting on Microcontrollers Yundong Zhang, Naveen Suda, Liangzhen Lai and Vikas Chandra ARM Research, Stanford University arxiv.org, 2017 Presented by Mohammad Mofrad University of
More informationINTRODUCTION TO DEEP LEARNING
INTRODUCTION TO DEEP LEARNING CONTENTS Introduction to deep learning Contents 1. Examples 2. Machine learning 3. Neural networks 4. Deep learning 5. Convolutional neural networks 6. Conclusion 7. Additional
More information