The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing, HotChips, August 2016


Presentation Outline: Introduction; The Need for Embedded Vision & AI; Vision Processor; Deep Learning; Path to Low Power AI Devices; Summary 2

CEVA: Licensor of ultra-low-power signal processing IPs for embedded devices - Imaging & Vision; Audio, Voice, Sensing; Connectivity; Communication. >7 billion CEVA-powered devices shipped worldwide 3

Corporate Introduction. Corporate facts: headquartered in Mountain View, Calif.; publicly traded (NASDAQ: CEVA); 273 employees, >200 engineers; profitable, cash positive, >$138M cash; world-leading IP supplier since 1991. Worldwide operations: US (Mountain View, CA; Austin, TX; Detroit, MI), Israel, Ireland, France, U.K., Sweden, China, Taiwan, Korea & Japan 4

CEVA Value Chain (diagram): CEVA supplies silicon IP, development tools, software and support to IDMs building chips for OEMs, cooperating with foundries, design-service companies and partners on complete solutions (HW + SW + services). Providing hardware, software, tools and support from license to deployment! 5

The Need for Embedded Vision & AI: Security & Surveillance; Visual Perception & Analytics; ADAS & Autonomous Cars; Augmented Reality; Drones 6

CEVA-XM4 Vision DSP Highlights: 1.5 GHz max frequency @ 28nm HPM; 14-stage pipeline; 128 MACs, 8-way VLIW, 4096-bit vector engine; 32-way parallel random memory access enabling serial-code vectorization; sliding-window & sliding-pattern mechanisms for efficient 2D data processing; silicon proven, 15+ design wins across a variety of products; completed ISO 26262 compliance & process certification 7

CEVA-XM4 Performance Highlights: fixed- and floating-point math, available in both vector and scalar units; 8/16/32/64-bit fixed-point precision; native non-linear operations (1/x, √x, 1/√x) in a single cycle, supported in both fixed- and floating-point. Multipliers per cycle: scalar fixed-point 4, scalar floating-point 4, vector fixed-point 128, vector floating-point 16. Multiple 128- or 256-bit AXI interfaces; extendible ISA. High performance, yet flexible in precision and operation 8
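As a rough sanity check not stated on the slide, the 128 MACs per cycle combined with the 1.5 GHz maximum clock quoted earlier imply a peak multiply-accumulate rate of

\[
128\ \tfrac{\text{MAC}}{\text{cycle}} \times 1.5\ \text{GHz} \;=\; 192\ \text{GMAC/s} \;\approx\; 384\ \text{GOPS},
\]

counting each MAC as two operations (multiply and add); sustained throughput naturally depends on the kernel and its memory behaviour.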

CEVA-XM4 Block Diagram (block labels): parallel random memory access; 4-way I-cache; scalar and vector floating-point units; 128 MACs; multi-core interfaces; automatic management of HW accelerators; specialized DMA; CEVA-Xtend user instructions (vector & scalar); CEVA-Connect for customer HW accelerators 9

CEVA Vision Platform Layers (diagram). Software layer - CEVA and partner software products: Digital Video Stabilizer (DVS), Super-Resolution (SR), Face Detection & Recognition, Universal Object Recognition, Pedestrian Detection, ADAS algorithms (FCW, LDW) and 3D Depth Map Creation, providing OEM differentiation; SW Toolset and App Dev. Kit (ADK); CEVA-CV libraries, CEVA CNN framework (CDNN) and CEVA-CV API / OpenVX; on the host side, an RTOS, CPU offload and a CPU-DSP link communication layer. Hardware layer: CEVA-XM4 DSP core and a Hardware Development Kit (*available for XM4 only) 10

CEVA-ToolBox - Eclipse-based SW Dev. Tools: advanced Eclipse-based IDE; optimizing C/C++ tailored compiler with auto-vectorization; extensive Vec-C support - C language extensions in OpenCL-like syntax, vector types for C/C++ (short8, ushort32, ...), vectorization from operators; linker and utilities; Automatic Build Optimizer; CEVA-Xtend GUI instruction builder; built-in debugger, simulator & profiler; target emulation dev. kit. (Screenshots: Build Optimizer, IDE/Debugger view, Profiler function graph.) Complete software development tools, focused on ease of use and quick SW porting for performance optimization 11
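To make the Vec-C idea concrete - vector types with ordinary C operators applied element-wise - here is a minimal sketch using the portable GCC/Clang vector_size extension. The short8 typedef and the element-wise operators mimic the OpenCL-like syntax the slide mentions; the actual CEVA Vec-C types and intrinsics differ, so treat this as an approximation rather than CEVA-ToolBox code.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for an OpenCL-style short8 vector type (8 x 16-bit lanes),
 * approximated here with the GCC/Clang vector_size extension. */
typedef int16_t short8 __attribute__((vector_size(16)));

int main(void)
{
    short8 a = {1, 2, 3, 4, 5, 6, 7, 8};
    short8 b = {10, 20, 30, 40, 50, 60, 70, 80};

    /* "Vectorization from operators": plain C operators act on all
     * eight lanes at once, one expression per vector operation. */
    short8 sum  = a + b;
    short8 prod = a * b;

    for (int i = 0; i < 8; i++)
        printf("lane %d: sum=%d prod=%d\n", i, sum[i], prod[i]);
    return 0;
}
```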

Parallel Random Memory Access Mechanism: the CEVA-XM4 scatter-gather capability loads/stores vector elements from/into multiple memory locations in a single cycle, enabling serial-code vectorization; it can load values from 32 addresses per cycle. Example: an image histogram requires a random memory access per pixel. Each value within a vector can generate its own address, which allows the operation to be vectorized into multiple parallel accesses within a single cycle. (Diagram: vector register lanes b0..b7 addressing L1 memory.) 12
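For reference, this is the scalar histogram kernel the slide alludes to. Because every pixel updates a data-dependent bin, the loop cannot be vectorized with contiguous loads and stores alone; per-lane (scatter-gather) addressing such as the XM4's 32-addresses-per-cycle mechanism is what makes it vectorizable. The function name and types below are illustrative, not CEVA library code.

```c
#include <stdint.h>
#include <string.h>

/* Scalar 8-bit image histogram: each pixel increments a bin whose address
 * depends on the pixel value, i.e. a random memory access per pixel. */
void histogram_u8(const uint8_t *pixels, size_t n, uint32_t hist[256])
{
    memset(hist, 0, 256 * sizeof(uint32_t));
    for (size_t i = 0; i < n; i++)
        hist[pixels[i]]++;   /* data-dependent address: needs gather/scatter to vectorize */
}
```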

CEVA-XM4 Power Consumption per Block Running Histogram (power-analysis chart broken down into core, PMSS, PMEM, DMSS, DMEM and others, comparing the run using the special histogram HW support against the run without it, i.e. the conventional DSP/CPU approach): the special histogram HW consumes 1.89x higher power per cycle, but the kernel runs 100x faster - roughly 50x better power efficiency for the entire calculation! 13
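The ~50x figure follows directly from the two numbers on the slide, since energy is power multiplied by time:

\[
\frac{E_{\text{without HW}}}{E_{\text{with HW}}}
 \;=\; \frac{P_{\text{without}} \cdot t_{\text{without}}}{P_{\text{with}} \cdot t_{\text{with}}}
 \;=\; \frac{1 \times 100}{1.89 \times 1}
 \;\approx\; 53,
\]

i.e. roughly 50x less energy for the whole histogram despite the higher per-cycle power of the dedicated hardware.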

Sliding-Window Data Processing Mechanism: the CEVA-XM4 processes 2D data efficiently by taking advantage of pixel overlap in image processing, reusing the same data to produce multiple outputs. This significantly increases processing capability, saves external memory bandwidth, frees system buses for other tasks and reduces power consumption. (Diagram: input-image reuse - for 16 MACs with 512-bit bandwidth, only 176 bits are actually loaded.) 14

Input Data Reuse Example: the vector instruction performs the arithmetic operation multiple times in parallel, producing multiple results, and the next sliding-window operation is offset by a step from the previous operation. (Diagram: overlapping input data vectors A and B are multiplied by the repeated coefficient set A, B, C, D and accumulated into adjacent outputs.) 15
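A scalar sketch of the computation being illustrated: a 4-tap filter slid across the input, where adjacent outputs reuse three of their four input samples. On the XM4 the multiply-accumulates for several adjacent outputs are issued in parallel from data already held in the vector register rather than reloaded from memory; the C below shows only the access pattern, not the vector code, and the names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* 4-tap sliding-window filter: output[i] = sum over k of coeff[k] * input[i + k].
 * Adjacent outputs share 3 of their 4 input samples - the overlap the
 * sliding-window mechanism reuses to cut memory traffic and power. */
void filter4(const int16_t *input, size_t n_in,
             const int16_t coeff[4], int32_t *output)
{
    for (size_t i = 0; i + 4 <= n_in; i++) {
        int32_t acc = 0;
        for (int k = 0; k < 4; k++)
            acc += (int32_t)coeff[k] * input[i + k];
        output[i] = acc;
    }
}
```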

Sliding-Window Pattern Mechanism: the sliding-window pattern (i.e. sliding pattern) efficiently handles sparse filters. Sparse filters are defined as filters that contain some coefficients equal to zero; instead of multiplying the input data by zero, the sliding pattern skips the zero coefficients, improving efficiency and keeping power low 16

Sliding-Window Pattern Example: the example describes a sparse filter with ~45% non-zero coefficients. The sliding pattern achieves 92% multiplier utilization, completing the filter in 3 cycles; a standard implementation reaches ~55% multiplier utilization in 5 cycles. (Diagram: sparse filter, sliding pattern and sliding window, with zero coefficients marked.) 17
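The essence of the sliding pattern in scalar terms: record the positions of the non-zero coefficients once, then visit only those positions, so no multiplier cycle (and no power) is spent on zeros. A hedged sketch of the idea with illustrative names, not the XM4 instruction sequence:

```c
#include <stdint.h>
#include <stddef.h>

/* Sparse 1D filter: 'pattern' lists the offsets of the non-zero coefficients
 * within a window of length 'window', mirroring how the sliding-pattern
 * mechanism skips zero coefficients instead of multiplying by them. */
void sparse_filter(const int16_t *input, size_t n_in,
                   const int16_t *coeff, const int *pattern, int n_taps,
                   int window, int32_t *output)
{
    for (size_t i = 0; i + (size_t)window <= n_in; i++) {
        int32_t acc = 0;
        for (int k = 0; k < n_taps; k++)   /* non-zero taps only */
            acc += (int32_t)coeff[pattern[k]] * input[i + pattern[k]];
        output[i] = acc;
    }
}
```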

Sliding-Window Pattern Power Savings - multiplier utilization:
                  Sliding Window    Sliding Pattern
Convolution 5x5   62.5%             89%
Sobel 3x3         75%               75%
Average 7x7       87.5%             94%
(Chart: relative power consumption of CEVA-XM4 "Sliding Window" vs. "Sliding Pattern" for Convolution 5x5, Sobel 3x3 and Average 7x7.) 18

Sliding Window, Sliding Pattern Power Savings (chart): relative power consumption, on a scale up to ~90x, of CPU, GPU, CEVA-XM4 scalar only, CEVA-XM4 "Sliding Window" and CEVA-XM4 "Sliding Pattern" for Convolution 5x5, Sobel 3x3 and Average 7x7 19

CEVA Deep Neural Network (CDNN2): 2nd-generation SW framework supporting the Caffe and TensorFlow frameworks, various networks*, all network topologies, all the leading layers and variable ROI. Push-button conversion from pre-trained networks to optimized real-time implementations accelerates machine-learning deployment for embedded systems. Optimized for the CEVA-XM4 vision DSP. (*) Including AlexNet, GoogLeNet, ResNet, SegNet, VGG, NIN and others 20

CEVA Network Generator (flow diagram): an image database is used to train the network in Caffe or TensorFlow; the pre-trained network (Deploy.caffemodel), network structure (Deploy.prototxt), normalization image (Deploy.binaryproto) and network labeling (Deploy.labels) feed the CEVA Network Generator, which outputs a CEVA-XM4-optimized network 21

Real-Time CDNN2 Application Flow (diagram): input image, ROI extraction, CDNN2 recognition - the example image is classified as "Horse" 22

Real-Time CNN Object Recognition Demo: live AlexNet object recognition (live CDNN2 demo, example input classified as "Daisy"), enabling milliwatt products vs. watts on a GPU; the network is converted in under 10 minutes! (Demo setup: FHD webcam input over HDMI, an i.MX6 host connected over PCIe to the CEVA-XM4 running on an FPGA.) 23

Vision & Deep Learning SoC Example: a typical SoC for recording devices such as action / dash / surveillance cameras, enabling OEM differentiation by adding highest-end camera features. Ultra low power: full vision system below 1 W @ 28nm. (Block diagram: sensor I/F (MIPI), ISP HW, encoder HW, CPU @ 1 GHz, CEVA-XM4 @ 800 MHz, DDR controller and USB I/F connected over AXI.) 24

Brite Semi Vision & Deep Learning SoC: high-end vision system, available early 2017 25

AR/VR Reference System Example using Inuitive NU4000: high-end AR/VR system supporting depth sensing, deep learning, SLAM and vision, while cutting power by a significant factor using the CEVA-XM4 vision processor 26

Why Is a DSP Better as a Vision Processor? A domain-specific architecture focused on computer vision, with a dedicated ISA & mechanisms that enable higher performance & utilization: 8-way VLIW (up to 8 separate instructions issued in parallel), 32-way parallel load-store sustained per cycle, and a combination of strong vector and scalar code (a CPU is good mostly at scalar code; a GPU is good mostly for the parallel portion and weak on memory accesses). It enables flexible fixed-point math for better power efficiency, combined with floating-point for time-to-market and higher dynamic range, and it maximizes local data reuse, limiting DDR memory bandwidth - a significant memory power saving that a GPU is not able to exploit well - while enabling more efficient deep-learning algorithm development. The CEVA-XM4, in combination with its SW platform, saves time-to-market and extends device battery life: 33% faster processing*, 17x smaller*, 9x lower power* (* vs. mobile GPU) 27

You are welcome to visit us at the demo table to hear more. Thank you. Contact: Yairs@ceva-dsp.com