Close to the Edge: How Neural Network Inferencing is Migrating to Specialised DSPs in State-of-the-Art SoCs
Marcus Binning, Sept 2018, Lund


Science Fiction, Science Fact (or Consumer Device): The Babel Fish. "The Babel fish is small, yellow, leech-like - and probably the oddest thing in the universe. It feeds on brain wave energy, absorbing all unconscious frequencies and then excreting telepathically a matrix formed from the conscious frequencies and nerve signals picked up from the speech centres of the brain, the practical upshot of which is that if you stick one in your ear, you can instantly understand anything said to you in any form of language: the speech you hear decodes the brain wave matrix." From: The Hitchhiker's Guide to the Galaxy, Douglas Adams. One of the latest focus areas for AI is automatic language translation. It's a really hard problem. 2018 Cadence Design Systems, Inc.

What is AI? Merriam-Webster defines artificial intelligence this way: "A branch of computer science dealing with the simulation of intelligent behavior in computers; the capability of a machine to imitate intelligent human behavior." The English Oxford Living Dictionary gives this definition: "The theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages."

What are Neural Networks? neural network (noun): a computer system modelled on the human brain and nervous system. Neural networks are the building blocks of AI: Convolutional (CNN), Recurrent (RNN), Long Short-Term Memory (LSTM) - an alphabet soup. They are diverse and targeted at solving different types of problems: spatial, temporal, and so on.

The Basics of Real-Time Neural Networks: Training vs Inferencing in Embedded Systems
Training runs once per database, is server-based, and is very compute intensive (on the order of 10^4 to 10^10 TMAC/s across a server farm): selection of a layered network, then iterative derivation of coefficients by stochastic-descent error minimisation over a labeled dataset.
Deployment ("inference") runs on every image, is device-based, and is still compute intensive (0.5 to 10 TMAC/s embedded): a single-pass evaluation of the input image, yielding the most probable label.
Most important for embedded systems: power.
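The training/inference asymmetry above can be sketched in a few lines of Python. This is a toy one-weight model, not any particular Cadence flow: training iterates gradient descent over the whole labeled dataset, while deployment is a single forward pass per input.

```python
# Toy illustration: training loops over a labeled dataset many times,
# while inference is a single forward pass per input.

def train(dataset, epochs=100, lr=0.05):
    """Iterative stochastic-descent error minimisation (runs once, offline)."""
    w = 0.0
    for _ in range(epochs):
        for x, label in dataset:
            pred = w * x                    # forward pass
            grad = 2 * (pred - label) * x   # gradient of squared error
            w -= lr * grad                  # weight update
    return w

def infer(w, x):
    """Single-pass evaluation (runs on every input, on-device)."""
    return w * x

# Learn y = 3x from labeled data, then deploy the frozen weight.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = train(data)
print(round(infer(w, 4.0), 2))
```

The frozen weight is all the device ever sees; the expensive loop never runs at the edge.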

Integrated with Other Processing: AI does not exist in isolation; there is pre-processing and post-processing. Solutions need to reflect this, especially for embedded systems, because the energy cost of moving data is relatively high. Examples follow.

AI-Based Application Trends: A Mix of Vision and AI
Face detection: multi-scale pyramid (vision operation) + face detector (AI operation)
HDR: scene detection (AI operation) + HDR processing (vision operation)
Bokeh with a single camera: segmentation (AI operation) + blurring and de-blurring (vision operation)
The industry is moving from traditional feature-based embedded vision to AI-based algorithms, but all use cases still have a mix of vision and AI operations - hence the need for both vision and AI processing in e.g. the camera pipeline.

Smart Speaker Processing Chain (Audio/Voice): a mix of traditional DSP processing and AI, mapped to CDNS software components.
Pipeline: multi-mic input / speaker output -> front-end pre-processing block -> mono speech/audio frames -> feature extraction block (DSP processing) -> features in a frame -> network/layer decoding (AI processing) -> acoustic units in a frame -> language/word mapping -> natural language / key words.
The DSP processing stage uses a low-level kernel library containing many commonly used components, including support for math functions like tanh(). The AI stage uses an additional software library, primarily focused on network/layer decoding functions; it also contains a few modules providing basic support for Viterbi/hashing/beam-search kernels and, imported from the NDSP library, specific instances of feature-extraction modules plus sigmoid/tanh/softmax.
Example DSPs commonly used in this domain: HiFi 3/3z, HiFi 4.
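The frame-based structure of that chain can be sketched as follows. This is a hypothetical minimal model, not the CDNS libraries: the frame sizes, the single log-energy "feature", and the one-neuron "decoder" are all stand-ins chosen for illustration, with tanh() standing for the kernel-library activation functions.

```python
import math

def frames(samples, frame_len=160, hop=80):
    """Front-end: slice the mono stream into overlapping frames."""
    for start in range(0, len(samples) - frame_len + 1, hop):
        yield samples[start:start + frame_len]

def features(frame):
    """Feature extraction: one log-energy value per frame (a toy stand-in
    for the real feature-extraction block)."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-12)

def decode(feat, weight=0.8, bias=-1.0):
    """Network/layer decoding: a single neuron using the tanh() activation
    that the DSP kernel library would supply."""
    return math.tanh(weight * feat + bias)

audio = [math.sin(0.1 * n) for n in range(1600)]  # a short test tone
scores = [decode(features(f)) for f in frames(audio)]
print(len(scores))
```

The point is structural: the DSP-flavoured stages (framing, features) and the AI-flavoured stage (decoding) share the same per-frame cadence, which is why one DSP can host the whole chain.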

The Majority of AI Inferences Are in the Cloud Today. "Alexa, when is my new camera arriving?" Smart assistant, voice search, travel assistant, translation, navigation assistant, store finder.

On-Device ("At the Edge") AI: Why?
Low-latency requirements: natural dialogue in speech assistants requires less than 200 ms latency; real-time decision making in automotive, robots, AR/VR, etc. needs low latency.
Lack of good connectivity: smart-city cameras are difficult to connect to existing networks; inspection drones for wind turbines and power lines operate in rural areas.
Privacy: with smart home video cameras and smart assistants, consumers desire privacy.

Target Markets for On-Device AI Inferencing: IoT (<0.5 TMAC/s), Mobile (0.5-2 TMAC/s), AR/VR (1-4 TMAC/s), Smart Surveillance (2-10 TMAC/s), Autonomous Vehicles (10s-100s of TMAC/s).

On-Device AI Processing Needs Are Increasing
Mobile: on-device AI experiences like face detection and people recognition at video capture rates.
AR/VR headsets: on-device AI for object detection, people recognition, gesture recognition, and eye tracking.
Surveillance cameras: on-device AI for family-or-stranger recognition and anomaly detection.
Drones and robots: on-device AI to recognize subjects, objects, obstacles, emotions, etc.
Automotive: on-device AI to recognize pedestrians, cars, signs, lanes, driver alertness, etc. for ADAS and AV.

CNN Algorithm Development Trends: increasing computational requirements (~16X in <4 years), network architectures changing regularly, and new applications and markets.
Network / MACs per image:
AlexNet (2012): 724,406,816
Inception V3 (2015): 5,713,232,480
ResNet-101 (2015): 7,570,194,432
ResNet-152 (2015): 11,282,415,616
AlexNet uses bigger convolutions; Inception V3 and ResNet use smaller convolutions. Linear networks vs. branching networks. Markets: automotive, server, home (voice-activated digital assistants), mobile, surveillance - all demanding low power.
How do you pick an inference hardware platform today (2018) for a product shipping in 2020-2022+? How do you achieve low-power efficiency yet remain flexible?
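The MACs-per-image figures in the table come straight from the layer geometry: each output pixel of each output channel costs one kernel-sized dot product. A quick sanity check of that arithmetic, using AlexNet's well-known first layer (96 filters of 11x11x3 producing a 55x55 output grid):

```python
def conv_macs(out_h, out_w, k_h, k_w, c_in, c_out):
    """MACs for one convolution layer: every output pixel of every output
    channel needs a k_h * k_w * c_in dot product."""
    return out_h * out_w * c_out * (k_h * k_w * c_in)

# AlexNet layer 1: 96 filters of 11x11x3 over a 55x55 output grid.
layer1 = conv_macs(55, 55, 11, 11, 3, 96)
print(layer1)  # 105,415,200 MACs, roughly a seventh of the ~724M total
```

Summing this formula over all five convolutional layers (plus the fully connected layers) yields totals of the magnitude shown in the table.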

AI = Big Problem Size, Requiring Big (SoC) Solutions. What is realistic to deploy?
Kirin 980 from HiSilicon: "The Kirin 980 integrates 6.9 billion transistors in an area of less than 1 square centimeter ... Kirin 980 can quickly adapt to AI scenes such as face recognition, object recognition, object detection, image segmentation and intelligent translation with the power of a dual-core NPU achieving 4500 images per minute." https://consumer.huawei.com/en/campaign/kirin980/
A12 Bionic from Apple: "The company says it's the industry's first 7-nanometer chip and contains 6.9 billion transistors." https://www.engadget.com/2018/09/12/apple-a12-bionic-7-nanometer-chip/ "The Neural Engine is incredibly fast, able to perform five trillion operations per second. It's incredibly efficient, which enables it to do all kinds of new things in real time." https://www.apple.com/uk/iphone-xr/a12-bionic/

"When all you have is a hammer, everything looks like a nail." - Abraham Maslow

Edge Processing of AI Requires NEW Architectures
AI is fundamentally a software problem: TRUE
Software can easily run on standard processors: TRUE
Software performance can be scaled up on GPUs: TRUE
AI software can run at the required performance/power levels on standard processor/GPU platforms: FALSE
It's obvious what the next generation of AI platforms should look like: FALSE

What Characteristics Does an NN DSP Need?
High computation throughput at low power: different number representations, from (typically) 8-bit fixed point to floating point, and lots of MACs, arranged flexibly enough to handle the different kernels.
A supporting ISA: "just enough" usually means arithmetic, logical, some shifts, and vector shuffling. You don't need all the bells and whistles associated with traditional computer vision; keep it lean and focussed, with enough flexibility to handle all visible and predicted (!) requirements.
High data throughput from L1 memory: the computation units must be fed in a sustained manner. There may be some tricks (e.g. compression taking advantage of zeroes).
Connection to bulk memory (DMA): the network model will not fit in L1 memory (pretty much a given); access and timing of memory fetches from off-chip memory to L1 are usually handled by DMA.
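The "compression taking advantage of zeroes" trick can be illustrated with simple run-length coding of zero weights. This is a toy scheme for intuition only, not Cadence's actual on-chip format: sparse weight streams shrink because runs of zeros collapse to a single token, reducing the bandwidth needed from bulk memory.

```python
def compress(weights):
    """Run-length-code zeros: a (0, run_length) pair replaces each zero run."""
    out, i = [], 0
    while i < len(weights):
        if weights[i] == 0:
            run = 0
            while i < len(weights) and weights[i] == 0:
                run += 1
                i += 1
            out.append((0, run))
        else:
            out.append((weights[i], 1))
            i += 1
    return out

def decompress(stream):
    """Expand the (value, count) pairs back into the raw weight stream."""
    out = []
    for value, count in stream:
        out.extend([value] * count)
    return out

w = [3, 0, 0, 0, 0, -2, 0, 0, 7]       # a sparse weight vector
packed = compress(w)
print(packed, decompress(packed) == w)
```

Nine weights become five tokens here; real hardware schemes are denser, but the principle of skipping zeros on the way from DRAM to the MACs is the same.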

Quantisation / Tiling: Fundamentals for Embedded Processing of NNs
In the reference trained model (e.g. Caffe, TensorFlow), the number representation is usually single-precision floating point, and the entire image is available/visible to the model.
In an embedded solution, the number representation is usually much lower: quantised to 16-bit, 8-bit, or even lower. The inferencing system can only process parts of the image at a time ("tiles").
Embedded systems must handle long latencies to bulk (off-chip) memory, and must quantise intelligently to avoid degrading NN performance (accuracy).

Vision P6 DSP Running the AlexNet Convolutional Neural Network
AlexNet: winner of the ImageNet (ILSVRC) 2012 contest, trained for 1000 different classes, and the most often quoted benchmark for CNN classifiers. CNN example: 5 convolutional and 3 fully connected layers; input is a 227x227 RGB image patch (ROI). AlexNet visualization from http://ethereon.github.io/netscope/quickstart.html
Vision P6 detection result: admiral 0.941991, ladybug 0.000517, monarch 0.000287, tench 0.000057, goldfish 0.000057
Cadence AlexNet implementation: based on the Caffe 32-bit floating-point AlexNet model, using 8-bit coefficients and 8-bit data computations; a pure C P6 implementation with no library dependencies such as BLAS, NumPy, etc.
Patches ("tiles") are fetched from memory as 3x227x227 data. The network does not decide which patches to fetch from the main image; that decision is made by other systems.
As each layer is processed, the related coefficients ("weights") must be fetched from memory (usually off-chip) by the DMA system, in time for processing by the MAC units.
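The tile-fetch step can be sketched as a strided crop from a larger frame, the way a DMA engine copies a region of interest into L1 memory row by row. The frame dimensions and coordinates below are hypothetical, purely for illustration.

```python
def extract_tile(frame, top, left, size=227):
    """Crop a size x size region of interest from a larger frame (a model
    of a DMA engine's row-by-row strided copy into L1 memory)."""
    h = len(frame)
    w = len(frame[0])
    if top + size > h or left + size > w:
        raise ValueError("tile does not fit inside the frame")
    return [row[left:left + size] for row in frame[top:top + size]]

# A hypothetical 480x640 single-channel frame; fetch one classifier input tile.
frame = [[(y + x) % 256 for x in range(640)] for y in range(480)]
tile = extract_tile(frame, top=100, left=200)
print(len(tile), len(tile[0]))
```

As the slide notes, choosing `top` and `left` is not the network's job: a separate detection or tracking system nominates the ROI, and the classifier only ever sees the 227x227 tile.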

Inception V3 Accuracy on Vision P6 (8-bit Data and 8-bit Weights)
Inception V3 details: input ROI 299x299x3; 110 layers; compute requirement 5.78 GMAC; bandwidth requirement (8-bit weights) 19.4 MB.
Vision P6 accuracy (loss <1%) using 8-bit quantized data and weights:
Top-1 accuracy: 74.00% float, 73.29% 8-bit fixed point
Top-5 accuracy: 91.62% float, 91.18% 8-bit fixed point
(Accuracy tested over 50K images in the ImageNet validation set.) There is only marginal loss of network accuracy due to quantisation.
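For readers unfamiliar with the two metrics above: top-1 counts a prediction as correct only when the best-scoring class is the true label, while top-5 accepts the true label anywhere in the five best. A minimal sketch with made-up score vectors (three "images", five classes):

```python
def topk_correct(scores, label, k):
    """True if the correct label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]

# Toy evaluation: (score vector, true label) per image.
preds = [([0.1, 0.6, 0.2, 0.05, 0.05], 1),   # top-1 correct
         ([0.5, 0.3, 0.1, 0.07, 0.03], 1),   # correct class ranked second
         ([0.4, 0.3, 0.2, 0.06, 0.04], 4)]   # correct class ranked last
top1 = sum(topk_correct(s, y, 1) for s, y in preds) / len(preds)
top2 = sum(topk_correct(s, y, 2) for s, y in preds) / len(preds)
print(top1, top2)
```

The ImageNet numbers in the table are exactly this computation with k=1 and k=5 over the 50K-image validation set.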

Examples of (Vision + NN) Specialised DSPs from Cadence. Sometimes you want computer vision + NN capability in one core; sometimes you want only NN capability, but more efficient. Usually (our philosophy) you need lots of flexibility, plus the opportunity to differentiate. This means: program in C using convenient vector types; good debug, modelling, integration, libraries, supporting software, and compilers; and the ability to add ISA features (custom instructions) if desired.

Vision P6 Architecture
VLIW & SIMD: 5 slots; 64-way 8-bit, 32-way 16-bit, 16-way 32-bit
ALU ops (max 4 of 5 slots): 256 8-bit, 128 16-bit, 64 32-bit
MAC (1 of 5 slots): 256 (8x8), 128 (8x16), 64 (16x16)
Memory width: 1024 bits; 2 vector load/store units
Vector registers: 32
SuperGather: 32 non-contiguous locations read/written per instruction
Bus interface: AXI4
iDMA: no alignment restrictions; local-memory-to-local-memory transfers
Target frequency (reference core): 800 MHz @ 28nm, 1.1 GHz @ 16nm (with overdrive)
Optional: vector floating point, ECC
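What a wide SIMD MAC actually does can be modelled lane by lane. This is an illustrative model of the concept, not the P6 ISA: 64 8-bit products are formed in parallel and accumulated into wider registers so partial sums never overflow 8 bits, with a horizontal reduction only at the end of the loop.

```python
def vector_mac(acc, a, b):
    """Model of one 64-way SIMD MAC instruction: per-lane int8 x int8
    products accumulated into wider (here Python-int) accumulators."""
    assert len(a) == len(b) == len(acc) == 64
    return [acc[i] + a[i] * b[i] for i in range(64)]

a = [(i % 5) - 2 for i in range(64)]        # int8-range test data
b = [((i * 3) % 7) - 3 for i in range(64)]
acc = [0] * 64
acc = vector_mac(acc, a, b)                 # one "cycle" of the vector loop
print(sum(acc))                             # horizontal reduction at the end
```

One such instruction per cycle, in one of the five VLIW slots, is how the "256 (8x8) MACs" style of throughput in the table is reached (the real datapath issues multiple lanes of this shape at once).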

Vision C5 Architecture: Tensilica Vision C5 DSP for Neural Networks
A complete stand-alone DSP to run all NN layers; general purpose, programmable and flexible; scalable to multi-TMAC designs.
Fixed-point DSP with 8-bit and 16-bit support: 1024 8x8 MACs per cycle, 512 16x16 MACs per cycle
512-bit vector register file; 1024-bit registers (pairing two 512-bit registers); 128-way 8-bit SIMD, 64-way 16-bit SIMD
3- or 4-slot VLIW: load/store/pack pairs with ALU/MAC ops
1024-bit memory width; dual load/store support, including aligned data
On-the-fly decompression support; special addressing mode for efficient access of 3-D data; richer set of convolution multipliers (signed and unsigned); extensive data rearrangement and selection
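Numbers like "1024 MACs per cycle" translate directly into a back-of-envelope latency bound. The sketch below is a rough model, not a benchmark: the 50% utilisation figure is an assumption pulled out of thin air to stand in for DMA stalls and layer overheads, and real performance depends on the network and memory system.

```python
def inference_time_ms(total_macs, macs_per_cycle=1024, freq_hz=1.1e9,
                      utilisation=0.5):
    """Back-of-envelope bound: cycles = MACs / (peak MACs-per-cycle x
    achieved utilisation). Utilisation is an assumed, illustrative figure."""
    cycles = total_macs / (macs_per_cycle * utilisation)
    return 1000.0 * cycles / freq_hz

# Inception V3, quoted earlier at 5.78 GMAC per image.
print(round(inference_time_ms(5.78e9), 1))
```

Even this crude estimate (around 10 ms per image under the assumptions above) shows why a multi-hundred-GMAC workload is tractable at camera frame rates on a MAC array of this width, where a scalar core would be orders of magnitude short.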

Automated Tool, ISS, Model, RTL, and EDA Script Generation
Base processor: dozens of templates for many common applications. Pre-verified options: off-the-shelf DSPs, interfaces, peripherals, debug, etc. (Tensilica IP). Optional customization: create your own instructions, data types, registers, and interfaces (customer IP). Iterate in minutes!
Outputs: a complete hardware design (pre-verified synthesizable RTL, EDA scripts, test suite) and advanced software tools (IDE, C/C++ compiler, debugger, ISS simulator, SystemC models, DSP code libraries).

Hardware is Not Enough
Hardware must be programmed, debugged, integrated, and modelled. Embedded software must integrate at a high level with other software; be self-sufficient (usually libraries are required); integrate computations with data fetch; and be easily maintained - no assembly programming, thank you very much! A vision-based example follows.

Xtensa Neural Network Compiler (XNNC)
XNNC connects to existing industry CNN frameworks (Caffe, TensorFlow) by taking their trained-model descriptions and auto-generating optimized code for Cadence CNN DSPs.
Three major components: the CNN parser, with float-to-fixed-point conversion; CNN code generation and optimization; and a CNN library with meta information for the Vision DSPs, producing optimized target-specific code for Vision/AI DSPs.
First CNN framework supported: Caffe, followed by TensorFlow, for the Vision DSPs.
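The parser's float-to-fixed conversion step can be sketched as choosing a per-layer Q format: inspect the layer's float values, work out how many integer bits the largest magnitude needs, and give the rest of the word to fractional precision. This is a generic illustration of the idea, not XNNC's actual algorithm.

```python
import math

def q_format(values, bits=8):
    """Pick a per-layer power-of-two scale: the number of fractional bits
    left after reserving sign and integer bits for max(|v|)."""
    vmax = max(abs(v) for v in values)
    int_bits = max(0, math.ceil(math.log2(vmax + 1e-12)))
    frac_bits = bits - 1 - int_bits        # 1 bit reserved for the sign
    return frac_bits

def to_fixed(v, frac_bits):
    """Convert a float to a signed fixed-point integer at that Q format."""
    return round(v * (1 << frac_bits))

layer_weights = [1.7, -0.3, 0.9, -1.2]     # hypothetical layer values
frac = q_format(layer_weights)             # needs 1 integer bit, so Q1.6
fixed = [to_fixed(v, frac) for v in layer_weights]
print(frac, fixed)
```

Doing this per layer (or per channel) rather than once for the whole network is what keeps the quantisation loss down to the sub-1% figures seen on the Inception V3 slide.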

Summary: What does it take to run AI inferencing at the edge? The right architecture, with the right high-level software integration and the right scalability and flexibility - right now.
Sep 19th 2018: Cadence Launches New Tensilica DNA 100 Processor IP Delivering Industry-Leading Performance and Power Efficiency for On-Device AI Applications. http://www.cadence.com/go/dna100

2018 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo, and the other Cadence marks found at www.cadence.com/go/trademarks are trademarks or registered trademarks of Cadence Design Systems, Inc. All other trademarks are the property of their respective owners.