Pluto A Distributed Heterogeneous Deep Learning Framework. Jun Yang, Yan Chen Large Scale Learning, Alibaba Cloud

Similar documents
Code Mania Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python:

Research Faculty Summit Systems Fueling future disruptions

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters

Relay: a high level differentiable IR. Jared Roesch TVMConf December 12th, 2018

Deep Learning Applications

NVIDIA DEEP LEARNING INSTITUTE

NVIDIA DLI HANDS-ON TRAINING COURSE CATALOG

A NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

CS231N Section. Video Understanding 6/1/2018

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

High-Performance Data Loading and Augmentation for Deep Neural Network Training

Unified Deep Learning with CPU, GPU, and FPGA Technologies

OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS

A new approach for supervised power disaggregation by using a deep recurrent LSTM network

Using Machine Learning to Optimize Storage Systems

Deep Learning on AWS with TensorFlow and Apache MXNet

CNN optimization. Rassadin A

The OpenVX Computer Vision and Neural Network Inference

Xilinx ML Suite Overview

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

1 Overview Definitions (read this section carefully) 2

Diagnostics in Testing and Performance Engineering

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Machine Learning on VMware vsphere with NVIDIA GPUs

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Deep learning in MATLAB From Concept to CUDA Code

Interpretable Machine Learning with Applications to Banking

GPU Accelerated Machine Learning for Bond Price Prediction

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System

Efficient Lists Intersection by CPU- GPU Cooperative Computing

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Hello Edge: Keyword Spotting on Microcontrollers

Multimodal Gesture Recognition using Multi-stream Recurrent Neural Network

CSC 578 Neural Networks and Deep Learning

Introduction to Deep Learning in Signal Processing & Communications with MATLAB

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia

Sentiment Classification of Food Reviews

Generative Adversarial Text to Image Synthesis

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018

Leveraging AI on the Cloud to transform your business. Florida Business Analytics Forum 2018 at University of South Florida

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features

Deep Learning for Embedded Security Evaluation

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

Tutorial on Machine Learning Tools

Demystifying Deep Learning

The Hitchhiker s Guide to TensorFlow:

Practical Deep Learning

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

Outline. Morning program Preliminaries Semantic matching Learning to rank Entities

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA

Wu Zhiwen.

S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer

A Deep Relevance Matching Model for Ad-hoc Retrieval

CafeGPI. Single-Sided Communication for Scalable Deep Learning

Advanced Search Algorithms

Agile CI/CD with Jenkins and/at ZeroStack. Kiran Bondalapati CTO, Co-Founder & Jenkins Admin ZeroStack, Inc. (

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications

Mask R-CNN. By Kaiming He, Georgia Gkioxari, Piotr Dollar and Ross Girshick Presented By Aditya Sanghi

Neural Network Exchange Format

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13

A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing

Keras: Handwritten Digit Recognition using MNIST Dataset

Unifying Big Data Workloads in Apache Spark

Demystifying Deep Learning

Rolling Bearing Diagnosis Based on CNN-LSTM and Various Condition Dataset

Deep Learning Accelerators

Recurrent Neural Network (RNN) Industrial AI Lab.

GUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning

Parallel FFT Program Optimizations on Heterogeneous Computers

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager

Face Image Quality Assessment for Face Selection in Surveillance Video using Convolutional Neural Networks

Visualization and text mining of patent and non-patent data

AI & Machine Learning at Amazon

CUED-RNNLM An Open-Source Toolkit for Efficient Training and Evaluation of Recurrent Neural Network Language Models

Face Recognition Using Vector Quantization Histogram and Support Vector Machine Classifier Rong-sheng LI, Fei-fei LEE *, Yan YAN and Qiu CHEN

Binary Convolutional Neural Network on RRAM

Jan Pedersen 22 July 2010

Vinnie Saini Cloud Solution Architect Big Data & AI

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

Contents. Preface to the Second Edition

Intelligent System for AI. 清大資工周志遠 AII Workshop

Convolutional Layer Pooling Layer Fully Connected Layer Regularization

HPE Deep Learning Cookbook: Recipes to Run Deep Learning Workloads. Natalia Vassilieva, Sergey Serebryakov

Model-Driven Geo-Elasticity In Database Clouds

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

WITH INTEL TECHNOLOGIES

Automation.

Hybrid Implementation of 3D Kirchhoff Migration

Web Usage Mining. Overview Session 1. This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web

Enable AI on Mobile Devices

Oracle Big Data Connectors

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT

Transcription:

Pluto A Distributed Heterogeneous Deep Learning Framework Jun Yang, Yan Chen Large Scale Learning, Alibaba Cloud

Outline PAI(Platform of Artificial Intelligence) PAI Overview Deep Learning with PAI Pluto PAI DL Application Chatbot Engine Summary 2

Machine Learning Platforms 3

PAI Overview Frontend PAI WEB Console PAI IDE Algorithms Feature Engineering Statistic Methods Machine Learning Deep Learning Distributed Computing PAI SDK MR/MPI/PS/Graph/Pluto Fuxi Scheduler CPU/GPU/FPGA/ASIC/ Serving Data Storage OSS Database: ODPS/RDS Streaming data: DataHub/TT/Kafka Tutorial: data.aliyun.com 4

PAI Project Search Experiment Data Source Component Model Serving 5

Machine Learning with PAI Data Preprocessing Feature Engineering Statistics Modeling Deep Learning Application Sampling & Filtering Data Merge Fill Missing Values Feature Transformatio n Feature Selection Feature Importance Correlation Coefficients Histogram Hypothesis Test Binary Classificatio n Multiple Classificatio n Clustering DNN CNN RNN Pluto NLP Search & Rec. Image Process Normalizatio n Feature Generation Visualization Regression A La Carte Network Analysis Prediction Financial Section Evaluation 6

Deep Learning with PAI 7

PAI TensorFlow Rich Data IO Distributed Job Optimization (Multi. GPU/CPUs) Easy model Serving Hyper Parameter Tuning 8

Pluto 9

Single-card Optimization Compiler-oriented strategy Fuse small ops into bigger one Reduce CUDA kernel launch overhead Prepare data layout friendly with low-level computation library Memory optimization Here again compiler-oriented tactics Dependency analysis Lifetime analysis 10

Multi-cards Optimization Heuristic-based Model Parallelism Both model weights and feature map taken into consideration Memory allocator strategy taken into consideration A greedy allocation algorithm With pre-run support 11

Multi-cards Optimization Hybrid-parallelism Mixture of data-parallelism and model-parallelism For communication-intensive parts, consider model-parallelism For computation-intensive parts, consider data-parallelism Tricks Integrate seamlessly with computation graph style Happier with pyramid network 12

Multi-cards Optimization Hybrid-parallelism(cont.) M40 Result K40 Result 13

Multi-cards Optimization Late-multiply Customized for fully-connected layers Trade-off between computation and communication W avg : [N l,n l+1 ], X:[M, N l ], E:[M, N l+1 ], here N l,n l+1 layer sizes, M is mini-batch size 14

Multi-cards Optimization Late-multiply(cont.) 15

Multi-cards Optimization Heuristic-based MA Automatic batch-size selection Learning rate auto-tuning Happier with sequential model 16

Multi-cards Optimization Heuristic-based MA(cont.) Model Metrics Training Time in Wallclock 17

Inference Optimization Quantization Significantly reduce model size(4x) Around 2X speed-up on average Binarized Neural Network Binarize model weights Convert floating point computation into bit manipulation Both model size and computation speed significantly improved Training process needs to be manipulated to compensate for accuracy Happier with CNN, but for RNN 18

PAI DL Application 19

AliMe Personal Assistant Bot in E-commerce AliMe for Customers AliMe for Sellers AliMe for Enterprises From 海青 @ 云栖大会 20

Open-Domain Conversations Retrieval Model Learning to rank Query QA pairs Knowledge Base Q 1 -A 1 : s 1 Q 2 -A 2 : s 2 Q 3 -A 3 : s 3... Q n -A n : s n A1 Generation Model Sequence to Sequence (Seq2Seq) Model Cho et al., 2014 Recurrent Neural Networks: LSTM, GRU (our choice) 21

A Hybrid Conversation Model based on Seq2Seq Overview Query IR Candidates Answer Rerank Score > T Yes Output No Chat logs SNS data QA pairs KnowledgeBase Seq2Seq Model Answer Generation Retrieval Module Seq2Seq Based Rerank and Generation Modules [AliMe Chat: Minghui Qiu et al., ACL 2017] 22

PAI DL Support for AliMe Both the offline training and online serving backed by PAI Through heuristic-based MA, the offline training task has 2.8X convergence speed-up with 4 cards setting Through quantization, the online serving task has 1.5X speed-up on commodity CPU servers. 23

Conclusion PAI DL End2end machine learning platform Support big data analytics Optimized Deep learning algorithms Scheduling on CPU/GPU cloud More data intelligence Pluto Distributed optimization engine of PAI DL PAI DL Application PAI DL makes it easy to build DL methods for industrial applications SCAN BARCODE! START YOUR TRIAL! 24

We are hiring! J muzhuo.yj@alibaba-inc.com chenyan.cy@alibaba-inc.com 25

Reference AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine, Minghui Qiu et al., ACL 2017. Deep Learning with PAI: a Case Study of AliMe, Minghui Qiu et al., Deep Learning Summit 2017. TensorFlow in AliMe, Jun Yang et al., Shanghai GDG Mar., 2017. 26

Thanks!