Research Faculty Summit Systems Fueling future disruptions
|
|
- Lizbeth McBride
- 5 years ago
- Views:
Transcription
1 Research Faculty Summit 2018 Systems Fueling future disruptions
2 Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia
3 System Challenge in Deep Learning Innovations are emerging very fast in deep learning area New DNN models and workload patterns RNN, CNN, GAN, reinforcement learning, graph neural network, etc. Diverse and emerging hardware accelerators, GPU, FPGA, ASICs, edge devices, NV-Link, RDMA, etc. Compiler stack is key to bridge framework and hardware Combine information of computation graph and hardware Optimize for both local execution and distributed scalability Critical for both training and inference Intermediate Representation (IR) x w * b + y Compiler & Optimizer Infrastructure Execution backends
4 Wolong: Optimizer Stack for Deep Learning System innovation to bridge application and hardware General computation graph optimization Software and hardware co-design Just-in-time compiler Transparent optimization Communication efficiency Accelerator execution efficiency Memory efficiency Tensor Placement Optimizer: Memory layout and placement optimization Intermediate Representation (graph of operators) Global Optimizer: Optimize distributed training over RDMA Graph analyzer/ RDMA memcpy library Local Optimizer: Optimize execution with JIT compiler Operator batching/ kernel fusion Execution Runtime CPU, GPU, RDMA devices
5 Global Optimizer Fast Distributed Deep Learning Computation over RDMA
6 Distributed Dataflow Graph Execution Deep learning computation is modeled as dataflow graph Achieve parallel manner through graph partitioning Model parallelism vs. data parallelism Tensor transmission across server becomes bottlenecks Dispatch partitions Partition YY graph HH * σσ * WW 11 XX WW 22 Latency (ms) Send-Recv Benchmark grpc RDMA 10x Send HH σσ WW 11 * XX Recv YY * WW 2 message size (MB) Server 0 Server 1
7 General Message Passing Library (e.g., RPC) Unavoidable memory copy overhead in RPC Generally designed for dynamic data structure Lacks knowledge of actual data placement and size Extra memory copy from data serialization Software/hardware co-design to completely remove memory copy overhead Leverage runtime application information RDMA network HH Send σσ * WW 11 XX Application memory t RPC managed buffer t t YY * Recv WW 22 Application memory t RPC managed buffer t Server 0 Server 1
8 Combine Dataflow Graph Computation with RDMA Tensor abstraction in deep learning computation Consists of a plain byte array with sufficiently large size (tens of KB to MB) Do NOT require variant data serialization/deserialization Do NOT require extra batching since access pattern is already sequential RDMA enables to manage local and distributed memory in a unified view One-side RDMA R/W : efficient memory copy between host memory GPU-Direct RDMA : efficient memory copy between host and device memory Global graph optimizer for distributed computation Has the entire view and control of memory placement among devices and servers Capable of making globally optimized strategy for tensor data placement in runtime
9 Optimized Communication Mechanism Transfer statically placed tensor through one-side RDMA write Phase I: graph analyzing RDMA-based zero-copy communication Phase II: graph execution Send HH σσ * WW 11 XX... Source Tensor 1 RDMA lib: Conduct remote memory copy One-sided RDMA write (Polling flag byte) Dest Tensor Recv YY * WW 22 Server 0 Tensor Manager: Detect the source tensor place Re-allocate as RDMA memory Tensor Manager: Pre-allocate RDMA compatible receive tensor Server 1
10 Global Optimizer: Performance Evaluation Improve training throughput, convergence speed and scalability Deep Learning Benchmarks Convergence of Seq2Seq Translation Throughput (mini-batchs/sec) x 2.3x 2.6x 4.2x 4.0x 8.1x AlexNet Inception FC LSTM GRU VGG16 Perplexity x speed up Run time (s) TensorFlow(gRPC) Wolong TensorFlow(gRPC) Wolong(RDMA) More details in our paper: RPC Considered Harmful: Fast Distributed Deep Learning on RDMA * Experiments are conducted on 8 servers 8 Nvidia GTX 1080 GPUs; The translation model uses WTM 15 datasets;
11 Local Optimizer Kernel Fusion for Deep Learning on GPU
12 Motivation Deep learning frameworks model computation as graph of primitive operators Expressivity to represent arbitrary neural network structure Flexibility to run on multi-device and multi-server through graph partitioning Significant framework overhead to schedule thousands of operators Kernel-launch overhead Cross operator communication overhead Too fine-grained to leverage vendor s library Example: 80-step LSTM model Contains 1686 operators in TensorFlow runtime (ms) LSTM 512x512 (80steps) TensorFlow CUDNN FuseKernel
13 DL Frameworks vs. Vendor Provided Library Flexibility Deep learning frameworks E.g., TensorFlow, PyTorch, CNTK Embrace flexibility and expressivity Performance inefficiency DL framework + Compiler Generate library-like code in runtime Win both of the worlds Hardware specific library E.g., cudnn, cublas, MKL Designed for extreme efficiency Impossible to handle customized or new network structure Efficiency
14 Wolong Compiler Design Computation graph level optimization Graph rewriting based on computational equivalence Common subexpression elimination, constant folding etc. Operator batching: automatically batch same type operators to better leverage batch efficiency Target and application specific runtime compilation Static shape and type inference Static memory planning Aggressive kernel fusion Computational Graph (IR) x * + y w b Graph level optimizer Target specific JIT compiler Execution runtime
15 Wolong Compiler Execution Workflow Detect optimize subgraph Code generation LSTM Relu Sigmoid Mul BiasAdd Shape inference Memory planning JIT compile FusedKernel Input graph Operator batching Cache hit in later iterations Rewrite graph Output FusedKernel Before Graph Execution Runtime Optimization Input Input
16 Graph Level Optimization: Operator Batching Automatically conduct GEMM fusion and static memory placement optimization M 0 [2,4] M 2 [4,1] M 1 [2,4] Concat(M 0, M 1 ) M 2 [4,1] Tensor M 0 (2,4) Tensor M 1 (2,4) O 0 [2,1] O 1 [2,1] Concat(O 0, O 1 ) Concat Tensor M 2 (4,4)
17 JIT Compilation: Kernel Fusion Leverage aggressive kernel fusion to completely remove scheduling overhead Element-wise (i.e., point-wise) operators No cross-element dependency between operators Better leverage cache, register locality h = ssssssssssssss(xx 1 + xx 2 ) h Sig Add xx 1 xx h xx 1 xx 2 global void kernel_0(float *x1, float *x2, float *h) { int idx = blockidx.x * blockdim.x + threadidx.x; if (idx < 1024) { float temp0 = x1[idx] + x2[idx]; float temp1 = sigmoidf(temp0); h[idx] = temp1; } }
18 JIT Compilation: Kernel Fusion Fuse arbitrary (non element-wise) operators in to single kernel Operator data dependency may introduce cross threads data dependency in kernel Need global synchronization to guarantee correctness Cross operator communication uses device memory E.g., fuse two matrix multiplications: Z = AA BB CC AA MM MM BB ZZ CC void kernel_0(float *A, float *B, float *C, float *Z) { if (idx < 1024) { buffer[idx] = _f(a, B); Global_Sync(); Z[idx] = _f(buffer, C); h[idx] = temp1; } }
19 Graph Computation in DL Frameworks Operators (kernels) are scheduled (launched) one by one CPU SM SM SM SM SM SM PCIe GPU block0 block1 block2 block3 block4 block5 Overhead of kernel launching, DAG scheduling, memory copy, etc. Conv block6 block7 block8 block0 block1 block2 block3 block4 block5 block6 block7 block8 block9 block1 0
20 Arbitrary Kernel Fusion Is Limited by GPU Architechture Hard to conduct global synchronization across all threads PCIe GPU CPU SM SM SM SM SM SM Conv barrier Fused as single kernel block0 Conv block6 Conv block1 Conv block7 Conv block2 Conv block8 Conv block3 Conv block9 Conv block4 Conv block10 Conv block5 Conv Never complete due to waiting for barrier Never be scheduled
21 Our Solution: Persistent Threads and Virtual Blocks Assign virtual block task to persistent threads PCIe GPU CPU SM SM SM SM SM SM block0 block1 block2 block3 block4 block5 Conv barrier vblock0 vblock1 vblock2 vblock3 vblock4 vblock5 vblock6 vblock7 vblock8 Fused as single kernel vblock0 vblock1 vblock2 vblock3 vblock4 vblock5 vblock6 vblock7 vblock8 vblock9 vblock1 0
22 Kernel Packing Explore graph level parallelism in static code generation GPU PCIe CPU SM SM SM SM SM SM [64, 512] [512, 128] block0 block1 block2 block3 block4 block5 vblock0 vblock1 Under utilization of GPU resource σσ AAAAAA vblock0 vblock1 vblock2 vblock3 vblock0 vblock1 vblock2 vblock3 vblock4 vblock5 [64, 128] vblock6
23 Code Generation Code Generator Operator kernels: device functions A σσ Relu C B //kernel code generated by Wolong compiler global void kernel_0(float *A, float *B, float *C) { int idx = blockidx.x * blockdim.x + threadidx.x; if (idx < 1024) { float temp0 = sigmoidf(a[idx]); buffer0[idx] = temp0; } GlobalSync(); for (int tix = bx; tix < ey; tix += offx) { for (int tiy = by; tiy < ey; tiy += offy) { (buffer0, B, buffer1, 1024, 1024, 128); } } if (idx < 1024) { float temp2 = reluf(temp1); C[idx] = temp2; } } //device function for sigmoid operator device float sigmoidf(float in) { return 1.f / (1.f + expf(-in)); } //device function for relu operator device float reluf(float in) { return fmaxf(0.f, in); } Operator device functions //device function for operator device float (float *a, float *b, float *c, int m, int n, int k) { if (thread_ix < m && thread_iy < k) { float temp = 0.f; for (int i = 0; i < n; ++i) { temp += a[tix*n+i] * b[k*i+tiy]; } c[k * tix + tiy] = temp; } }, Add, Mul, Sub, Div, Relu, Sigmoid, Tanh, Split, Max, Min, Convolution, etc.
24 Performance of End-to-end Kernel Fusion RNN inference benchmark (LSTM-128uints-80steps) Fused 1686 operators into 1 kernel Wolong fused_kernel Avg Runtime(ms) LSTM Benchmark 10.9x TensorFlow XLA OpBatch Fusion Experiments are conducted on Nvidia GTX 1080 Ti GPUs
25 Conclusion A compiler infrastructure is critical for both cloud and edge AI Optimize for fast distributed training in cloud Optimize for efficient inference on accelerator devices System innovations to bridge applications and diverse hardware Common intermediate representation (IR) Co-design software and hardware for extreme efficiency Wolong prototype has demonstrated the initial improvements Up to 8x speedup on training workloads Up to 10x speedup on inference benchmark
26 Thank You! Systems Fueling future disruptions
27
28 Distributed Graph Optimizer of Wolong Transfer dynamically allocated tensor through RDMA write/read Phase I: graph analyzing Supports GPUDirect RDMA as well Phase II: graph execution Send HH σσ * WW 11 XX Tensor meta data 1... One-sided RDMA write One-sided RDMA read Tensor meta data 01 Allocate... Recv YY * WW 22 Source Tensor Dest Tensor Server 0 Server 1
GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13
GPU FOR DEEP LEARNING chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 Why Deep Learning Boost Today? Nvidia SDK for Deep Learning? Agenda CUDA 8.0 cudnn TensorRT (GIE) NCCL DIGITS 2 Why Deep Learning
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationPouya Kousha Fall 2018 CSE 5194 Prof. DK Panda
Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Motivation And Intro Programming Model Spark Data Transformation Model Construction Model Training Model Inference Execution Model Data Parallel Training
More informationRecurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks
Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and
More informationAutomatic Code Generation TVM Stack
Automatic Code Generation TVM Stack CSE 599W Spring TVM stack is an active project by saml.cs.washington.edu and many partners in the open source community The Gap between Framework and Hardware Frameworks
More informationEnd to End Optimization Stack for Deep Learning
End to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul G. Allen School of Computer Science & Engineering University of Washington Collaborators University of Washington AWS AI Team
More informationNVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS
TECHNICAL OVERVIEW NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS A Guide to the Optimized Framework Containers on NVIDIA GPU Cloud Introduction Artificial intelligence is helping to solve some of the most
More informationDeep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
More informationNVIDIA FOR DEEP LEARNING. Bill Veenhuis
NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA
More informationEFFICIENT INFERENCE WITH TENSORRT. Han Vanholder
EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60
More informationRelay: a high level differentiable IR. Jared Roesch TVMConf December 12th, 2018
Relay: a high level differentiable IR Jared Roesch TVMConf December 12th, 2018!1 This represents months of joint work with lots of great folks:!2 TVM Stack Optimization Relay High-Level Differentiable
More informationDeep Learning on Modern Architectures. Keren Zhou 4/17/2017
Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others
More informationInference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA
Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos
More informationHigh Performance Computing
High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason
More informationShrinath Shanbhag Senior Software Engineer Microsoft Corporation
Accelerating GPU inferencing with DirectML and DirectX 12 Shrinath Shanbhag Senior Software Engineer Microsoft Corporation Machine Learning Machine learning has become immensely popular over the last decade
More informationNvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018
Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks
More informationTensorFlow: A System for Learning-Scale Machine Learning. Google Brain
TensorFlow: A System for Learning-Scale Machine Learning Google Brain The Problem Machine learning is everywhere This is in large part due to: 1. Invention of more sophisticated machine learning models
More informationDistributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability
Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationTVM Stack Overview. Tianqi Chen
TVM Stack Overview Tianqi Chen Beginning of TVM Story Acclerator Beginning of TVM Story Acclerator Beginning of TVM Story Beginning of TVM Story Acclerator // Pseudo-code for convolution program for the
More informationDeep learning in MATLAB From Concept to CUDA Code
Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,
More informationTwo FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters
Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed
More informationHigh-Performance Data Loading and Augmentation for Deep Neural Network Training
High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose
More informationImplementing Long-term Recurrent Convolutional Network Using HLS on POWER System
Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Xiaofan Zhang1, Mohamed El Hadedy1, Wen-mei Hwu1, Nam Sung Kim1, Jinjun Xiong2, Deming Chen1 1 University of Illinois Urbana-Champaign
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationXilinx ML Suite Overview
Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationGPU Coder: Automatic CUDA and TensorRT code generation from MATLAB
GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB Ram Kokku 2018 The MathWorks, Inc. 1 GPUs and CUDA programming faster Performance CUDA OpenCL C/C++ GPU Coder MATLAB Python Ease of programming
More informationEmbarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA
Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA Pierre Nowodzienski Engineer pierre.nowodzienski@mathworks.fr 2018 The MathWorks, Inc. 1 From Data to Business value Make decisions Get
More informationThe OpenVX Computer Vision and Neural Network Inference
The OpenVX Computer and Neural Network Inference Standard for Portable, Efficient Code Radhakrishna Giduthuri Editor, OpenVX Khronos Group radha.giduthuri@amd.com @RadhaGiduthuri Copyright 2018 Khronos
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationOPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS
April 4-7, 2016 Silicon Valley OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS Jeremy Appleyard, 7 April 2016 RECURRENT NEURAL NETWORKS Output is fed into input Perform the same operation repeatedly
More informationRealtime Signal Processing on Nvidia TX2 using CUDA
Realtime Signal Processing on Nvidia TX2 using CUDA Armin Weiss Dr. Amin Mazloumian Dr. Matthias Rosenthal Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of
More informationOperator Vectorization Library A TensorFlow Plugin
Operator Vectorization Library A TensorFlow Plugin Matthew Pickett, Karen Brems, Florian Raudies Hewlett Packard Labs HPE-2016-94 Keyword(s): Machine learning; GPU acceleration; TensorFlow Abstract: TensorFlow
More informationGPU-Accelerated Deep Learning
GPU-Accelerated Deep Learning July 6 th, 2016. Greg Heinrich. Credits: Alison B. Lowndes, Julie Bernauer, Leo K. Tam. PRACTICAL DEEP LEARNING EXAMPLES Image Classification, Object Detection, Localization,
More informationA NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017
A NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017 TWO FORCES DRIVING THE FUTURE OF COMPUTING 10 7 Transistors (thousands) 10 6 10 5 1.1X per year 10 4 10 3 10 2 1.5X per year Single-threaded
More informationTVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos
More informationSpeculations about Computer Architecture in Next Three Years. Jan. 20, 2018
Speculations about Computer Architecture in Next Three Years shuchang.zhou@gmail.com Jan. 20, 2018 About me https://zsc.github.io/ Source-to-source transformation Cache simulation Compiler Optimization
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationCS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4
CS/CoE 1541 Final exam (Fall 2017). Name: This is the cumulative final exam given in the Fall of 2017. Question 1 (12 points): was on Chapter 4 Question 2 (13 points): was on Chapter 4 For Exam 2, you
More informationTowards Scalable Machine Learning
Towards Scalable Machine Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Fraunhofer Center Machnine Larning Outline I Introduction
More informationGUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning
GUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning Koichi Shirahata, Youri Coppens, Takuya Fukagai, Yasumoto Tomita, and Atsushi Ike FUJITSU LABORATORIES LTD. March 27, 2018 0 Deep
More informationEnabling the future of Artificial intelligence
Enabling the future of Artificial intelligence Contents AI Overview Intel Nervana AI products Hardware Software Intel Nervana Deep Learning Platform Learn more - Intel Nervana AI Academy Artificial Intelligence,
More informationProcessing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach
2018 51st Annual IEEE/ACM International Symposium on Microarchitecture Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach Jiawen Liu*, Hengyu Zhao*, Matheus Almeida
More informationScaling Neural Network Acceleration using Coarse-Grained Parallelism
Scaling Neural Network Acceleration using Coarse-Grained Parallelism Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2018 Neural Networks (NNs)
More informationDeep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Jongsoo Park Facebook AI System SW/HW Co-design Team Sep-21 2018 Team Introduction
More informationScalable Distributed Training with Parameter Hub: a whirlwind tour
Scalable Distributed Training with Parameter Hub: a whirlwind tour TVM Stack Optimization High-Level Differentiable IR Tensor Expression IR AutoTVM LLVM, CUDA, Metal VTA AutoVTA Edge FPGA Cloud FPGA ASIC
More informationLayer-wise Performance Bottleneck Analysis of Deep Neural Networks
Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao University of California, Santa Cruz, *The College of William
More informationGraph Streaming Processor
Graph Streaming Processor A Next-Generation Computing Architecture Val G. Cook Chief Software Architect Satyaki Koneru Chief Technology Officer Ke Yin Chief Scientist Dinakar Munagala Chief Executive Officer
More informationCharacterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures
Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi
More informationProfiling GPU Code. Jeremy Appleyard, February 2016
Profiling GPU Code Jeremy Appleyard, February 2016 What is Profiling? Measuring Performance Measuring application performance Usually the aim is to reduce runtime Simple profiling: How long does an operation
More informationTESLA V100 PERFORMANCE GUIDE. Life Sciences Applications
TESLA V100 PERFORMANCE GUIDE Life Sciences Applications NOVEMBER 2017 TESLA V100 PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important
More informationGPU Programming with Ateji PX June 8 th Ateji All rights reserved.
GPU Programming with Ateji PX June 8 th 2010 Ateji All rights reserved. Goals Write once, run everywhere, even on a GPU Target heterogeneous architectures from Java GPU accelerators OpenCL standard Get
More informationNVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI
NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI Overview Unparalleled Value Product Portfolio Software Platform From Desk to Data Center to Cloud Summary AI researchers depend on computing performance to gain
More informationSlalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware
Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware Florian Tramèr (joint work with Dan Boneh) Intel, Santa Clara August 30 th 2018 Trusted execution of ML: 3 motivating
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationUnified Deep Learning with CPU, GPU, and FPGA Technologies
Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine
More informationIntelligent Hybrid Flash Management
Intelligent Hybrid Flash Management Jérôme Gaysse Senior Technology&Market Analyst jerome.gaysse@silinnov-consulting.com Flash Memory Summit 2018 Santa Clara, CA 1 Research context Analysis of system &
More informationCode Optimizations for High Performance GPU Computing
Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate
More informationDEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017
DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE Dennis Lui August 2017 THE RISE OF GPU COMPUTING APPLICATIONS 10 7 10 6 GPU-Computing perf 1.5X per year 1000X by 2025 ALGORITHMS 10 5 1.1X
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationDeep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm
Deep Learning on Arm Cortex-M Microcontrollers Rod Crawford Director Software Technologies, Arm What is Machine Learning (ML)? Artificial Intelligence Machine Learning Deep Learning Neural Networks Additional
More informationOPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi
OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015 Outline About nervana Optimizing deep learning at assembler level Limited precision for deep learning neon benchmarks 2 About nervana
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationPoseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters Hao Zhang Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jianliang Wei, Pengtao Xie,
More informationMIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius
MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety
More informationModule Memory and Data Locality
GPU Teaching Kit Accelerated Computing Module 4.4 - Memory and Data Locality Tiled Matrix Multiplication Kernel Objective To learn to write a tiled matrix-multiplication kernel Loading and using tiles
More informationLecture 6: Optimize for Hardware Backends. CSE599G1: Spring 2017
Lecture 6: Optimize for Hardware Backends CSE599G1: Spring 2017 Where are we High level Packages User API Programming API Gradient Calculation (Differentiation API) System Components Computational Graph
More informationWu Zhiwen.
Wu Zhiwen zhiwen.wu@intel.com Agenda Background information OpenCV DNN module OpenCL acceleration Vulkan backend Sample 2 What is OpenCV? Open Source Compute Vision (OpenCV) library 2500+ Optimized algorithms
More informationDeep Learning with Tensorflow AlexNet
Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification
More informationBinary Convolutional Neural Network on RRAM
Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua
More informationPluto A Distributed Heterogeneous Deep Learning Framework. Jun Yang, Yan Chen Large Scale Learning, Alibaba Cloud
Pluto A Distributed Heterogeneous Deep Learning Framework Jun Yang, Yan Chen Large Scale Learning, Alibaba Cloud Outline PAI(Platform of Artificial Intelligence) PAI Overview Deep Learning with PAI Pluto
More informationVTA: Open & Flexible DL Acceleration. Thierry Moreau TVM Conference, Dec 12th 2018
VTA: Open & Flexible DL Acceleration Thierry Moreau TVM Conference, Dec 12th 2018 TVM Stack High-Level Differentiable IR Tensor Expression IR LLVM CUDA Metal TVM Stack High-Level Differentiable IR Tensor
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationSlalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware
Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware Florian Tramèr (joint work with Dan Boneh) Stanford security lunch June 13 th Trusted execution of ML: 3 motivating
More informationImplementing Deep Learning for Video Analytics on Tegra X1.
Implementing Deep Learning for Video Analytics on Tegra X1 research@hertasecurity.com Index Who we are, what we do Video analytics pipeline Video decoding Facial detection and preprocessing DNN: learning
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationS8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer
S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer 2 100 倍以上速く 本当に可能ですか? 2 DOUGLAS ADAMS BABEL FISH Neural Machine Translation Unit 3 4 OVER 100X FASTER, IS IT REALLY POSSIBLE?
More informationImproving Performance of Machine Learning Workloads
Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,
More informationDeep Learning Compiler
Deep Learning Compiler AWS AI Acknowledgement Amazon Sagemaker Neo Enables developers to train machine learning models once and run them anywhere in the cloud and at the edge Hardware targets Intel CPU,
More informationInference Engine compiler and SDK FWDNXT 2018
Inference Engine compiler and SDK 1 Deep Learning processor Best performance per power Best utilization Efficient use of memory bandwidth Low latency Scalability: IoT to cloud 2 Deep Learning processor
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationCUDA Architecture & Programming Model
CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New
More informationCan FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:
More informationAdaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018
Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationTuring Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA
Turing Architecture and CUDA 10 New Features Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture New SM Architecture Multi-Precision Tensor Core RT Core Turing MPS Inference Accelerated,
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationDNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationMatchbox. automatic batching for imperative deep learning NVIDIA GTC, 2018/3/28. James Bradbury
Matchbox automatic batching for imperative deep learning James Bradbury NVIDIA GTC, 2018/3/28 Roadmap Imperative deep learning How manual batching works Other tools for automatic batching The Matchbox
More informationAutoTVM & Device Fleet
AutoTVM & Device Fleet ` Learning to Optimize Tensor Programs Frameworks High-level data flow graph and optimizations Hardware Learning to Optimize Tensor Programs Frameworks High-level data flow graph
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationVOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017
VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100
More information