Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager

Similar documents
HPE Deep Learning Cookbook: Recipes to Run Deep Learning Workloads. Natalia Vassilieva, Sergey Serebryakov

SUPERCHARGE DEEP LEARNING WITH DGX-1. Markus Weber SC16 - November 2016

TESLA V100 PERFORMANCE GUIDE. Life Sciences Applications

An introduction to Machine Learning silicon

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017

Fuzzy Set Theory in Computer Vision: Example 3

Brainchip OCTOBER

Unified Deep Learning with CPU, GPU, and FPGA Technologies

Deep Learning with Intel DAAL

Cisco UCS C480 ML M5 Rack Server Performance Characterization

Code Mania Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python:

Evaluating On-Node GPU Interconnects for Deep Learning Workloads

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

Deep Learning Processing Technologies for Embedded Systems. October 2018

INTRODUCTION TO DEEP LEARNING

Defense Data Generation in Distributed Deep Learning System Se-Yoon Oh / ADD-IDAR

TOWARDS ACCELERATED DEEP LEARNING IN HPC AND HYPERSCALE ARCHITECTURES Environnement logiciel pour l apprentissage profond dans un contexte HPC

In partnership with. VelocityAI REFERENCE ARCHITECTURE WHITE PAPER

AI for HPC and HPC for AI Workflows: The Differences, Gaps and Opportunities with Data Management

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications

Architectures for Scalable Media Object Search

Building the Most Efficient Machine Learning System

Introduction to Deep Learning in Signal Processing & Communications with MATLAB

Fast Hardware For AI

How to Build Optimized ML Applications with Arm Software

NVIDIA PLATFORM FOR AI

Demystifying Deep Learning

Application of Deep Learning Techniques in Satellite Telemetry Analysis.

MACHINE LEARNING WITH NVIDIA AND IBM POWER AI

Real Time Monitoring of CCTV Camera Images Using Object Detectors and Scene Classification for Retail and Surveillance Applications

DEEP LEARNING AND DIGITS DEEP LEARNING GPU TRAINING SYSTEM

Deep Learning mit PowerAI - Ein Überblick

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal

Demystifying Deep Learning

Object recognition and computer vision using MATLAB and NVIDIA Deep Learning SDK

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Machine Learning on VMware vsphere with NVIDIA GPUs

How to Build Optimized ML Applications with Arm Software

A performance comparison of Deep Learning frameworks on KNL

Using CNN Across Intel Architecture

ENDURING DIFFERENTIATION. Timothy Lanfear

ENDURING DIFFERENTIATION Timothy Lanfear

Lecture 12: Model Serving. CSE599W: Spring 2018

Why data science is the new frontier in software development

How GPUs Power Comcast's X1 Voice Remote and Smart Video Analytics. Jan Neumann Comcast Labs DC May 10th, 2017

World s most advanced data center accelerator for PCIe-based servers

Building the Most Efficient Machine Learning System

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

GPU ACCELERATED COMPUTING. 1 st AlsaCalcul GPU Challenge, 14-Jun-2016, Strasbourg Frédéric Parienté, Tesla Accelerated Computing, NVIDIA Corporation

DEEP LEARNING WITH GPUS Maxim Milakov, Senior HPC DevTech Engineer, NVIDIA

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications

DL Tutorial. Xudong Cao

POINT CLOUD DEEP LEARNING

Accelerating your Embedded Vision / Machine Learning design with the revision Stack. Giles Peckham, Xilinx

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

YOLO9000: Better, Faster, Stronger

CNN optimization. Rassadin A

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015

High Performance Computing

Speculations about Computer Architecture in Next Three Years. Jan. 20, 2018

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Contents PART I: CLOUD, BIG DATA, AND COGNITIVE COMPUTING 1

n N c CIni.o ewsrg.au

Deep Learning Accelerators

Convolutional Neural Network Layer Reordering for Acceleration

Deep Learning with Tensorflow AlexNet

RECENT TRENDS IN GPU ARCHITECTURES. Perspectives of GPU computing in Science, 26 th Sept 2016

Deep Neural Networks for Market Prediction on the Xeon Phi *

GPU-Accelerated Deep Learning

Convolutional Neural Networks

Deploying Deep Learning Networks to Embedded GPUs and CPUs

Convolutional Neural Networks: Applications and a short timeline. 7th Deep Learning Meetup Kornel Kis Vienna,

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

Scaling Deep Learning. Bryan

Profiling DNN Workloads on a Volta-based DGX-1 System

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

TESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications

MIXED PRECISION TRAINING OF NEURAL NETWORKS. Carl Case, Senior Architect, NVIDIA

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

Semantic Image Search. Alex Egg

Xilinx ML Suite Overview

Deep Learning. Volker Tresp Summer 2014

Deep Learning Performance and Cost Evaluation

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems

Convolutional Sequence to Sequence Learning. Denis Yarats with Jonas Gehring, Michael Auli, David Grangier, Yann Dauphin Facebook AI Research

Autonomous Driving Solutions

Sony's deep learning software "Neural Network Libraries/Console and its use cases in Sony

Low-Power Neural Processor for Embedded Human and Face detection

NVIDIA DEEP LEARNING INSTITUTE

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Deep Learning Performance and Cost Evaluation

Democratizing Machine Learning on Kubernetes

Transcription:

Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager

Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance Self-driving cars Medical imaging Robotics Interactive voice response (IVR) systems Voice interfaces (Mobile, Cars, Gaming, Home) Security (speaker identification) Health care Simultaneous interpretation Search and ranking Sentiment analysis Machine translation Question answering Recommendation engines Advertising Fraud detection AI challenges Drug discovery Sensor data analysis Diagnostic support 2

Is Deep Learning a universal algorithm? Deep Learning Cat detector 3

Is Deep Learning a universal algorithm? Deep Learning Machine translator Cat detector 4

Types of artificial neural networks Topology to fit data characteristics Convolutional: Images Fully connected: Speech, text, sensor Recurrent: Speech, text, sensor Input Hidden Layer 1 Hidden Layer 2 Output Input Hidden Layer 1 Hidden Layer 2 Output Input Hidden Layer 1 Output 5

Applications break down Images Video Speech Text Sensor Tissue classification in medical images Video surveillance Speech recognition Sentiment analysis Predictive maintenance Detection Look for a known object/pattern Generation Generate content Classification Assign a label from a predefined set of labels Anomaly detection Look for abnormal, unknown patterns Other Fraud detection 6

One size does NOT fit all Application Data type Model (topology of artificial neural network): - How many layers - How many neurons per layer - Connections between neurons (types of layers) Data size 7

Deep learning ecosystem Keras Software Hardware 8

How to pick the right hardware/software stack? 9

Popular models Name Type Model size (# params) Model size (MB) GFLOPs (forward pass) AlexNet CNN 60,965,224 233 MB 0.7 GoogleNet CNN 6,998,552 27 MB 1.6 VGG-16 CNN 138,357,544 528 MB 15.5 VGG-19 CNN 143,667,240 548 MB 19.6 ResNet50 CNN 25,610,269 98 MB 3.9 ResNet101 CNN 44,654,608 170 MB 7.6 ResNet152 CNN 60,344,387 230 MB 11.3 Eng Acoustic Model RNN 34,678,784 132 MB 0.035 TextCNN CNN 151,690 0.6 MB 0.009 10

Popular models Name Type Model size (# params) Model size (MB) GFLOPs (forward pass) AlexNet CNN 60,965,224 233 MB 0.7 GoogleNet CNN 6,998,552 27 MB 1.6 VGG-16 CNN 138,357,544 528 MB 15.5 VGG-19 CNN 143,667,240 548 MB 19.6 ResNet50 CNN 25,610,269 98 MB 3.9 ResNet101 CNN 44,654,608 170 MB 7.6 ResNet152 CNN 60,344,387 230 MB 11.3 Eng Acoustic Model RNN 34,678,784 132 MB 0.035 TextCNN CNN 151,690 0.6 MB 0.009 11

Compute requirements Name Type Model size (# params) Model size (MB) GFLOPs (forward pass) ResNet152 CNN 60,344,387 230 MB 11.3 Training data: 14M images (ImageNet) FLOPs per epoch: 3 11.3 10 9 14 10 6 5 10 17 1 epoch per hour: ~140 TFLOPS Today s hardware: Google TPU2: 180 TFLOPS Tensor ops NVIDIA Tesla V100: 15 TFLOPS SP (30 TFLOPS FP16, 120 TFLOPS Tensor ops), 12 GB memory NVIDIA Tesla P100: 10.6 TFLOPS SP, 16 GB memory NVIDIA Tesla K40: 4.29 TFLOPS SP, 12 GB memory NVIDIA Tesla K80: 5.6 TFLOPS SP (8.74 TFLOPS SP with GPU boost), 24 GB memory INTEL Xeon Phi: 2.4 TFLOPS SP 12

Distributed training Model parallelism Data parallelism Node A Flower Predictions Batch Node A Flower Predictions Batch Errors Errors Node B Batch House Predictions Node B Errors 13

Which configuration is better? 14

Deep Learning Cookbook helps to pick the right HW/SW stack Benchmarking Suite Benchmarking scripts Reference models Performance metrics Knowledgebase Performance results - 11 reference models - 8 frameworks - 8 hardware systems Performance and scalability models - Machine learning (SVR) to predict performance of core operations - Analytical communication models - Analytical models for overall performance - Performance prediction for arbitrary ANNs - Scalability prediction - Optimal HW/SW configuration for a given workload - Reference solutions 15

16

Selected observations and tips Larger models are easier to scale (such as ResNet and VGG) A single GPU can hold only small batches (the rest of memory is occupied by a model) Fast interconnect is more important for less compute-intensive models (FC) A rule of thumb: 2 CPU threads per GPU A rule of thumb: RAM = 2 x GPU memory x number of GPUs 17

Thank you Natalia Vassilieva nvassilieva@hpe.com 18