How to Build Optimized ML Applications with Arm Software


How to Build Optimized ML Applications with Arm Software Arm Technical Symposia 2018 ML Group

Overview
Today we will talk about applied machine learning (ML) on Arm. My aim for today is to show you just how much can be achieved on mobile and edge devices.
We will cover:
- An overview of the ML platform
- A look at some use cases
- The software and tools we use
- Our approach to development and deployment

Solving Problems

Why Do We Look at Use Cases?
ML is being used for problems across a wide number of fields, and each has unique features and subtleties that need to be considered.
Our approach:
- Focus
- Deep dives into key/emerging technologies
- Broad sweep of the problems people try to solve
- Pick a focus across the huge variety of markets and use cases for machine learning
- Understand the impact on hardware and software optimization
- Realistic test corpus to tune performance and features
- Improving usability of the platform
- Making interfaces, IP, features and tools better

Use Case Fragments
Combining these pieces helps to solve larger problems.
Image Processing
- Noise reduction and super resolution for better video quality
- Foreground/background segmentation for live video and camera stills
Speech and Audio
- Keyword and noise spotting (KWS) in low power for system wakeup
- Speech recognition, translation, text to speech (TTS)
Vision
- Object detection for autonomous driving, mobile or security
- Identifying points of interest for augmented reality mapping
User Interface Improvement
- Natural language processing (NLP) to mine text and improve UIs
- Improving OS power management and scheduling

Face ID and Unlock
- Arm Cortex-M7 processes input from a low-resolution sensor using an NN in an always-on power budget
- This object detection gates the processing of RGBD camera data that runs on the Cortex-A73
- Higher quality face detection removes false positives, and spoof detection uses depth information to make sure the face isn't just a photograph
- The identity verification stage is a neural network designed to highlight differentiable features in faces
- Moving stages to the right IP is natural when using Arm NN and Compute Library
Pipeline: Low-Resolution Sensor -> Object Wakeup -> RGBD Camera -> Face Detection -> Spoof Detection -> Alignment Extraction and Normalization -> Identity Verification
IP used: Cortex-M7 using CMSIS-NN; Cortex-A73 CPU; Mali-G71 GPU or ML Processor
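The gating idea on this slide, where a cheap always-on stage decides whether the expensive stages run at all, can be sketched as a simple cascade. This is only an illustration: the stage functions, score fields and thresholds below are hypothetical stand-ins, not the networks or values used in the Arm demo.

```python
from typing import Callable, List, Optional

# A stage takes a frame (a dict of measurements) and returns it, possibly
# enriched, or None to reject it and stop the pipeline early.
Stage = Callable[[dict], Optional[dict]]


def run_cascade(frame: dict, stages: List[Stage]) -> Optional[dict]:
    """Run stages in order; later (costlier) stages only run if earlier ones pass."""
    result: Optional[dict] = frame
    for stage in stages:
        result = stage(result)
        if result is None:
            return None
    return result


# Hypothetical stand-ins for the slide's stages. All thresholds are invented.
def object_wakeup(frame: dict) -> Optional[dict]:
    # Always-on check on the low-resolution sensor (Cortex-M7 on the slide).
    return frame if frame["motion_score"] > 0.5 else None


def face_detection(frame: dict) -> Optional[dict]:
    # Higher-quality detection that removes false positives.
    return frame if frame["face_score"] > 0.8 else None


def spoof_detection(frame: dict) -> Optional[dict]:
    # A flat photograph shows almost no depth variation across the face.
    return frame if frame["depth_range_mm"] > 20.0 else None


def identity_verification(frame: dict) -> Optional[dict]:
    return {**frame, "verified": frame["match_score"] > 0.9}


PIPELINE = [object_wakeup, face_detection, spoof_detection, identity_verification]
```

A frame rejected by `object_wakeup` never touches the later stages, which is what keeps the always-on power budget small.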

Face ID
- Face ID is a key technology being deployed in phones, homes and cities
- It enables unlocking of phones, making payments, logging in and opening doors
- This use case is tuned for CPU, GPU and NPU
[Chart: Face ID Performance & Memory Use. Series: Time (ms), Memory (MB). Categories: ACL 18.1, ACL 18.5, ML Processor.]
* 8-core Mali-G71 silicon @850MHz and Arm NPU @1GHz

Babelfish

Going Fishing
Why look at the Babelfish problem?
- NN-based automatic speech recognition (ASR) for edge is moving out of the research space
  Now more accurate than humans on word-error rates
- ASR, translation and generation are very complex
  Speech generation is more complex than vision in some cases
- This needs more than just CPU and GPU
  Complex workloads start to need tera-OP class hardware
- ASR on the edge is good for users
  Latency is better, privacy is maintained for sensitive conversations
- Combining ASR and vision is compelling

Real-Time Subtitles (Augmented Reality)
Assist deaf people by adding a virtual layer of real-time subtitles to their view.
- Who: identify the speaker's ID using a speaker-recognition system
- Where: using a visual face-identification system, we find the bounding box of that speaker
- What: automatic speech recognition produces a transcript from the audio feed (for example, "Let's go")
Diagram: Speaker Identification + Face Identification -> Voice and Image Associated -> Speech Recognition


Real-time, on-device speech recognition
60 FPS face detection and audio-visual speaker recognition

How Much of This Is Possible?
Most of it!
- Automatic speech recognition is approaching human level
- Translation systems are competent on simple sentences
- Speech synthesis is acceptable/functional or better
- Examples exist in the wild on server class today, many becoming possible in constrained budgets
- High-quality, real-time ASR can operate in under 1 TOP/s
- Speech synthesis can be performed in very limited budgets
- End-to-end speech recognition is continuing to improve
Pipeline: Speech -> Speech Recognition -> Machine Translation -> Speech Synthesis -> Translated Speech

How Did We Deploy This Use Case?

Project Trillium: Arm's ML Computing Platform

What Do We Learn?
Application: use case
- Access control: face identification, object detection
- Camera stills: noise reduction, super resolution, semantic segmentation
- Video: super resolution, semantic segmentation
- Keyword spotting, voice assistants, natural language processing, text to speech
- Driving: autonomous driving, driver assistance
Hardware categories: CPU and GPU alone; CPU and GPU and dedicated AI hardware
Not every problem needs to be solved with a dedicated NPU. A combination of good software and hardware is needed to deploy machine learning successfully and easily.

What is Arm NN?
An inference engine for edge machine learning
1. Program through standard ML frameworks
2. Use accelerated support on Android
3. Designed to support the Arm ML Processor and third-party IP
Key Arm NN aims:
- Well optimized for Arm CPUs, GPUs and NPUs
- Interoperation with other inference engines
- Low overhead for embedded systems
Diagram:
- (1) Network import (e.g. TensorFlow, Caffe, ONNX) -> Parsers -> Arm NN
- (2) Application or existing inference engine (e.g. Android NN) -> Arm NN API -> Arm NN
- (3) Arm NN -> Device interface -> Compute Library (Cortex-A CPUs, Mali GPUs), Arm NPU, 3rd party IP

Collaborate to Improve Standard ML Software Interfaces
developer.arm.com/arm-nn
- Arm has donated Arm NN to Linaro to improve the common software interface for ML
- Start contributing code directly and help us add exciting new features
- Join the Linaro Machine Learning Initiative
Who should join: domain experts, machine learning experts, platform developers

The First Step: Speech Recognition
There have been decades of work on speech recognition, leading to a well-structured breakdown that is often still used when applying ML techniques:
- Extract features (frequency space)
- Capture phonemes (probable noises)
- Deduce probable sentence (grammar/context aware)
Pipeline: Audio Input -> Feature Extraction -> Frequency Space Features -> Acoustic Model -> Phonemes -> Decoder -> Final Transcript
A Language Model (recommended) feeds the Decoder.
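As a rough illustration of the first step above (moving audio into frequency space), the sketch below splits a signal into overlapping frames and computes a windowed power spectrum per frame. The frame length, hop size and the plain DFT are assumptions chosen for readability; a production front end would typically use an FFT and often mel filterbanks or MFCCs, and the slides do not say which features Arm used.

```python
import cmath
import math
from typing import List


def frame_signal(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split the signal into overlapping frames of frame_len samples."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def power_spectrum(frame: List[float]) -> List[float]:
    """Hann-window one frame and return |DFT|^2 for bins 0..N/2."""
    n = len(frame)
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        # Naive O(N^2) DFT, kept for clarity rather than speed.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def extract_features(samples: List[float], frame_len: int = 64, hop: int = 32) -> List[List[float]]:
    """Frequency-space features: one power spectrum per frame."""
    return [power_spectrum(f) for f in frame_signal(samples, frame_len, hop)]
```

Each row of the result is what the acoustic model would consume in place of raw samples.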

Tools for Optimization and Conditioning
Conditioning
- Quantizing and retraining networks from FP32 to INT8 without accuracy loss (F32 -> I8)
- Compression of weights and reduction of operations by pruning and clustering weights
Optimizing
- Streamline adds support for Arm NN in addition to the GPU driver to show end-to-end timing
- System reports provide feedback on what should be optimized
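Pruning and clustering, mentioned above, can be illustrated with two small generic routines: magnitude pruning zeroes the smallest weights, and 1-D k-means clustering replaces each weight with one of a few shared centroids so only an index per weight needs storing. This is a textbook-style sketch with made-up parameters, not Arm's tooling.

```python
from typing import List, Tuple


def prune_by_magnitude(weights: List[float], sparsity: float) -> List[float]:
    """Zero the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def cluster_weights(
    weights: List[float], n_clusters: int, iterations: int = 20
) -> Tuple[List[float], List[int]]:
    """1-D k-means. Returns (centroids, centroid index for each weight)."""
    lo, hi = min(weights), max(weights)
    # Start with centroids spread evenly over the weight range.
    centroids = [lo + (hi - lo) * (c + 0.5) / n_clusters for c in range(n_clusters)]
    assignment = [0] * len(weights)
    for _ in range(iterations):
        # Assign each weight to its nearest centroid.
        assignment = [
            min(range(n_clusters), key=lambda c: abs(w - centroids[c]))
            for w in weights
        ]
        # Move each centroid to the mean of its members (if it has any).
        for c in range(n_clusters):
            members = [w for w, a in zip(weights, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignment
```

After clustering, a layer can be stored as a small centroid table plus one index per weight, and retraining is normally used to recover any accuracy lost.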

Quantization to 8-bit: It Works
Theory for CNN and practical deployment possible today

Network (Recognition) | FP32 Top-1 Accuracy | FP32 Top-5 Accuracy | UINT8 Top-1 Accuracy | UINT8 Top-5 Accuracy
--- | --- | --- | --- | ---
VGG 16 (ImageNet) | 69.88% | 89.16% | 68.39% | 88.26%
VGG 19 (ImageNet) | 70.05% | 89.17% | 68.65% | 88.31%
ResNet V2 101 (ImageNet) | 70.50% | 89.75% | 66.56% | 86.01%
ResNet V2 152 (ImageNet) | 71.04% | 90.04% | 68.93% | 86.70%
Inception V3 (ImageNet) | 76.07% | 92.59% | 74.60% | 91.78%
Inception V4 (ImageNet) | 78.33% | 93.93% | 77.07% | 93.29%
MobileNet 1.0 224 | 69.60% | 89.12% | 62.75% | 84.25%
MobileNet 1.0 224 - retrained | 70.90% | 89.90% | 69.70% | 89.50%

Network (Segmentation) | FP32 Pixel Accuracy | FP32 F1-Score | UINT8 Pixel Accuracy | UINT8 F1-Score
--- | --- | --- | --- | ---
Fully-connected Networks (KITTI) | 94.15% | 72.30% | 95.61% | 84.79%
Fully-connected Networks (VOC 2011) | 85.96% | 50.86% | 85.92% | 50.48%

Source: Arm ML Technology group research using custom techniques; updated MobileNet retraining from Google
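For readers unfamiliar with what "FP32 to UINT8" means in practice, here is a minimal sketch of generic affine (scale and zero-point) quantization. It assumes a simple min/max range per tensor and no retraining; the results in the table come from Arm's own custom techniques, which the slides do not describe.

```python
from typing import List, Tuple

QMIN, QMAX = 0, 255  # representable range of an unsigned 8-bit value


def quant_params(values: List[float]) -> Tuple[float, int]:
    """Choose scale and zero point so [min, max] maps onto [QMIN, QMAX]."""
    # Force the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN)
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    zero_point = int(round(QMIN - lo / scale))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int) -> List[int]:
    """real -> uint8: q = clamp(round(v / scale) + zero_point)."""
    return [
        max(QMIN, min(QMAX, int(round(v / scale)) + zero_point))
        for v in values
    ]


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """uint8 -> real: v is approximately scale * (q - zero_point)."""
    return [scale * (x - zero_point) for x in q]
```

The round-trip error per value is bounded by roughly half a quantization step, which is why networks with narrow, well-behaved value ranges lose little accuracy, while others (such as the non-retrained MobileNet row) benefit from retraining.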

Distillation of Speech Networks
Teaching a small convolutional network from a large recurrent one
What is distillation?
Selected student and teacher networks: wav2letter (student), DeepSpeech (teacher)
Diagram: Training Data (LibriSpeech) -> Teacher (Large RNN) -> Soft Logits -> Optimize Joint Loss <- Student (Small CNN) <- Training Data (LibriSpeech)

Distillation of Speech Networks
Result: better accuracy and generalization with headroom for further improvement
[Chart: Letter Error Rates (No Language Model), axis 0 to 0.12. Bars: Original Paper Wav2Letter, Our Retrained Wav2Letter, Baidu's DeepSpeech 2. Legend: Standard wav2letter, Distilled wav2letter.]
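The "joint loss" in the distillation diagram can be illustrated with the commonly used formulation that blends a hard-label loss with a soft-target loss computed at a raised temperature. This is a simplified per-frame classification sketch: the weighting `alpha` and the `temperature` are assumed values, and real speech models such as wav2letter train with sequence-level losses that are not shown here.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with optional temperature smoothing."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target: List[float], predicted: List[float]) -> float:
    """H(target, predicted) = -sum(t * log(p))."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    alpha: float = 0.5,        # assumed weight between hard and soft terms
    temperature: float = 2.0,  # assumed softening temperature
) -> float:
    """Joint loss: alpha * hard-label CE + (1 - alpha) * T^2 * soft-target CE."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, softmax(student_logits))
    soft_loss = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    # T^2 keeps the soft term's gradient magnitude comparable across temperatures.
    return alpha * hard_loss + (1.0 - alpha) * temperature ** 2 * soft_loss
```

The soft targets carry the teacher's view of which wrong answers are nearly right, which is the extra signal a small student network learns from.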

Performance Results

Pipeline Stage | Operations per Inference (on 1 Second / 1 Frame / Letter) | Inferences per Second of Realtime Input | Operations per Second Required | Mid-end 2017 Mobile (ms per second of input)* | Complex Version Processing Time
--- | --- | --- | --- | --- | ---
Feature Extraction | 0.0032 GOP/s | 138 samples per second | 0.44 GOP/s | 24 ms | 24 ms
Acoustic Model (Wav2Letter) | 7.3 GOP/s | 4, due to overlapping 1 second window with 0.25 second stride | 29.2 GOP/s | 200 ms | 760 ms (small accuracy improvements)
Language Model, KenLM (Mozilla/Baidu) | 10-800 GOP/s | 0.33, as the language model works on a 3 second window | 3.3-264 GOP/s | 91 ms | 2380 ms (more languages)
Translation / NLP | 124-562 MOP/s | 15 | 1.9-8.4 GOP/s | 20 ms | 88 ms (accuracy improvements)
Face Detection | 5.4-65 GOP/s | 30-60 FPS | 162 GOP/s to 3.9 TOP/s | 30 ms | 200 ms (accuracy improvements)
Face Verification | 8.46 GOP/s | 3 FPS | 12.3 GOP/s | 442 ms | 442 ms
Speaker ID | 20.28 MOP/s | 1, one sample a second | 20.28 MOP/s | 10 ms | 10 ms
Speech Generation | 0.1-1.3 TOP/s | 22 kHz | 0.1-1.3 TOP/s | 280 ms | 3250 ms (more realistic speech)
Combined | | | 0.4-5.9 TOP/s | 809 ms | 6146 ms

* Measured on the Arm NN 18.05 release; notable optimizations have been introduced since.
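The "Operations per Second Required" column follows from the two columns before it: operations per inference multiplied by inferences per second of real-time input. The sketch below reproduces that arithmetic for a few rows, treating the first column as operations per inference as its header states, and adds a real-time factor derived from the "ms per second of input" column. The function names are illustrative only.

```python
def required_ops_per_second(ops_per_inference: float, inferences_per_second: float) -> float:
    """Sustained throughput needed to keep up with real-time input."""
    return ops_per_inference * inferences_per_second


def realtime_factor(ms_per_second_of_input: float) -> float:
    """Processing time divided by input duration; below 1.0 keeps up with real time."""
    return ms_per_second_of_input / 1000.0


# Feature extraction: 0.0032 GOP per inference at 138 inferences per second.
feature_gops = required_ops_per_second(0.0032, 138)

# Acoustic model: 7.3 GOP per 1 s window, 0.25 s stride gives 4 windows per second.
acoustic_gops = required_ops_per_second(7.3, 4)

# Language model: 10 to 800 GOP per 3 s window, so 0.33 inferences per second.
language_gops = (
    required_ops_per_second(10, 0.33),
    required_ops_per_second(800, 0.33),
)

# Combined pipeline timings from the table, in ms per second of input.
baseline_rtf = realtime_factor(809)
complex_rtf = realtime_factor(6146)
```

On these figures the baseline pipeline has a real-time factor of about 0.81, while the complex version at about 6.1 would not keep up with live input on the same device.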

Conclusion
- This is possible today on production devices
  On today's platforms using CPUs and GPUs
  And Arm NPU platforms appearing over the next year
- A lot can be achieved on the shoulders of giants
  The ML research community is not just for huge companies
  Retraining with distillation is a key tool
  We provide tool flows to extract sizable performance gains
- We're continuing to work to make Babelfish a reality
  We're not far from this aim; the next year or two will be exciting

Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! 감사합니다 धन्यवाद

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks