Scalable Distributed Training with Parameter Hub: a whirlwind tour
Transcription
1 Scalable Distributed Training with Parameter Hub: a whirlwind tour
2 TVM Stack [diagram: Optimization layers (High-Level Differentiable IR, Tensor Expression IR, AutoTVM) over backends (LLVM, CUDA, Metal; VTA with AutoVTA) targeting a hardware fleet (Edge FPGA, Cloud FPGA, ASIC)]
3 [The same TVM stack diagram, extended with Your Cloud and Active Topology Probing.] Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.
4 Parameter Hub: an optimized, topology-aware, and dynamic mechanism for inter-machine communication, in the cloud-based training context. Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy.
6 Deep learning constitutes an important workload in the cloud today; major cloud providers all have an ecosystem for cloud learning.
8 Server demand for DL inference across data centers nearly quadrupled in less than 2 years. (Source: Facebook)
10 EC2 reclaims your GPU instances as it runs out of capacity.
12 Distributed Training: independent forward/backward passes + coordinated parameter exchange. [Timeline: the parameter server runs A1 O1 A2 O2 while Worker 1 and Worker 2 each run F1 B1 F2 B2 F3 B3.] Legend: (F)orward pass, (B)ackward pass, (A)ggregation, (O)ptimization.
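To make the exchange concrete, below is a minimal single-process sketch of one synchronous data-parallel step. The F/B/A/O structure follows the slide; the parameter count, learning rate, and stand-in gradients are illustrative assumptions, not PHub's API.

    // Synchronous data-parallel training with a central parameter server:
    // workers produce gradients (F+B), the server aggregates (A) and optimizes (O).
    #include <cstdio>
    #include <vector>

    int main() {
        const int kWorkers = 2, kParams = 4;           // toy sizes (assumed)
        std::vector<float> weights(kParams, 1.0f);     // replicated on every worker
        const float lr = 0.1f;

        for (int step = 0; step < 3; ++step) {
            // Each worker: (F)orward + (B)ackward on its own minibatch.
            std::vector<std::vector<float>> grads(kWorkers, std::vector<float>(kParams));
            for (int w = 0; w < kWorkers; ++w)
                for (int p = 0; p < kParams; ++p)
                    grads[w][p] = 0.01f * (w + 1);     // stand-in for real backprop

            // Parameter server: (A)ggregate the gradients from all workers...
            std::vector<float> agg(kParams, 0.0f);
            for (int w = 0; w < kWorkers; ++w)
                for (int p = 0; p < kParams; ++p)
                    agg[p] += grads[w][p];

            // ...then (O)ptimize and broadcast the new weights to the workers.
            for (int p = 0; p < kParams; ++p)
                weights[p] -= lr * agg[p] / kWorkers;
            std::printf("step %d: weights[0] = %f\n", step, weights[0]);
        }
    }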
14 Distributed Training Today, in the context of the cloud. [Diagram: machines with GPUs and plain machines attached to the network core.]
15 Distributed Training Today: forward and backward passes run in the workers. [Diagram: Worker 1, Worker 2, PS 1, PS 2 around the network core.]
16 Distributed Training Today: aggregation and optimization run in the parameter servers. [Same diagram.]
17 Distributed training is communication bound, and the problem gets worse over time: the bottleneck shifts to the network. With modern GPUs most of the time is spent on communication, so making GPUs faster will do little to increase throughput; it only wastes compute resources. [Chart: per-iteration seconds for ResNet 269 on GRID 520, K80, M60, and V100, split into GPU idle waiting on network vs. GPU and network active.]
18 Distributed training is communication bound. [Chart: the same breakdown for AlexNet, ResNet 269, Inception V3, and GoogleNet.]
19 Bottlenecks in DDNN training: the mapping of the training workload to the cloud is inefficient. [Diagram: workers and parameter servers around the network core.]
20 Bottlenecks in DDNN training: framework bottlenecks. [Diagram: inside each worker, the path GPU → training framework → network.]
22 Bottlenecks in DDNN training: framework bottlenecks. [Chart: per-iteration seconds for ResNet 269, Inception, GoogleNet, and AlexNet, broken down into compute; data copy and communication; aggregator; optimizer; synchronization and other overheads.]
24 Bottlenecks in DDNN training: the mapping of the training workload to the cloud is inefficient. [Same diagram as slide 19.]
25 Bottlenecks in DDNN training: bandwidth bottleneck. [Same diagram.]
26 Bottlenecks in cloud-based DDNN training: insufficient bandwidth. What is the minimum bandwidth required for each popular NN so that communication does not bottleneck computation? (Measured with 8 workers, GTX 1080 Ti, central parameter servers, MXNet.) GoogleNet / Inception: 40 Gbps. ResNet: 100 Gbps. AlexNet: 1200 Gbps. Cloud bandwidth today: 10-25 Gbps.
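The arithmetic behind these figures is straightforward: with central parameter servers, each worker pushes its gradients and pulls fresh weights every iteration, roughly twice the model size on the wire per worker per step. A back-of-envelope sketch; the model sizes and per-iteration compute times below are rough placeholder assumptions chosen to be consistent with the slide (the talk's exact numbers come from measured MXNet runs):

    // required_bandwidth ~= 2 * model_size / per-iteration compute time
    #include <cstdio>

    int main() {
        struct Net { const char* name; double model_mb; double compute_ms; };
        // model_mb: parameter/gradient size; compute_ms: one F+B pass on a
        // GTX 1080 Ti. Both columns are placeholders, not measurements.
        Net nets[] = { {"GoogleNet", 27.0, 11.0},
                       {"ResNet-269", 390.0, 62.0},
                       {"AlexNet", 240.0, 3.2} };
        for (const Net& n : nets) {
            double gbits_per_iter = 2.0 * n.model_mb * 8.0 / 1000.0;  // push + pull
            double gbps = gbits_per_iter / (n.compute_ms / 1000.0);
            std::printf("%-10s needs ~%.0f Gbps per worker\n", n.name, gbps);
        }
    }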
31 Bottlenecks in cloud-based DDNN training: the mapping of the training workload to the cloud is inefficient. [Same diagram.]
32 Bottlenecks in cloud-based DDNN training: deployment-related overhead. Transient congestion, or oversubscription by design, makes cross-rack communication more expensive than intra-rack communication. [Chart: measured bandwidth between hosts in Cluster 1 and Cluster 2, dropping to about 4 Gbps across racks.]
34 Parameter Hub optimizations: codesigning software, hardware, and cluster configuration for efficient cloud-based DDNN training.
35 Eliminating framework bottlenecks. PHub optimizations: streamlining the DDNN training pipeline (GPU → data copy → aggregation → optimization → network).
37 Software optimizations: gradient memory. [Diagram: gradient buffers in CPU memory on the PS side, workers and PSs around the network core.]
39 Software optimizations: gradient aggregation and optimization. Four ways to organize the cores (contrasted in the sketch below):
- Each core reads the input queue from a different worker and writes to a different location in the output queue. Requires synchronization.
- Wide aggregation (used in MXNet): for each input queue, launch a series of threads for aggregation. Too much coherence and synchronization.
- Tall aggregation: sequentially aggregate the same portion of gradients within each queue. Great locality, no synchronization.
- NUMA-aware tree reduction: organize processors into a hierarchy (NUMA 0 / NUMA 1). Great locality, no synchronization.
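A minimal sketch contrasting wide vs. tall for W worker buffers summed on C cores; the sizes and thread layout are illustrative, not PHub's implementation:

    #include <thread>
    #include <vector>

    constexpr int W = 4, C = 4, N = 1 << 20;   // workers, cores, floats per gradient

    // Wide: all C threads cooperate on one worker's buffer at a time; threads
    // must rendezvous (join) after every worker, and the shared output buffer
    // is touched by every core.
    void wide(const std::vector<std::vector<float>>& g, std::vector<float>& out) {
        for (int w = 0; w < W; ++w) {
            std::vector<std::thread> ts;
            for (int c = 0; c < C; ++c)
                ts.emplace_back([&, c] {
                    for (int i = c * (N / C); i < (c + 1) * (N / C); ++i)
                        out[i] += g[w][i];
                });
            for (auto& t : ts) t.join();       // synchronization point per worker
        }
    }

    // Tall: each core owns a fixed chunk and sums that chunk across all
    // workers by itself -- no sharing between cores, one join at the end.
    void tall(const std::vector<std::vector<float>>& g, std::vector<float>& out) {
        std::vector<std::thread> ts;
        for (int c = 0; c < C; ++c)
            ts.emplace_back([&, c] {
                for (int w = 0; w < W; ++w)
                    for (int i = c * (N / C); i < (c + 1) * (N / C); ++i)
                        out[i] += g[w][i];
            });
        for (auto& t : ts) t.join();
    }

    int main() {
        std::vector<std::vector<float>> g(W, std::vector<float>(N, 1.0f));
        std::vector<float> out(N, 0.0f);
        tall(g, out);                          // swap in wide(g, out) to compare
    }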
41 Software optimizations: tall aggregation and optimization, with core mappings. [Diagram: the gradient array for key 0 from 8 workers, chunked and aggregated.]
- Chunk a gradient into a series of virtual gradients deterministically.
- Each virtual gradient is mapped to a particular core on the server.
- Virtual gradients are transferred independently.
- A chunk is only processed by a single core, maintaining maximum locality.
A minimal mapping sketch follows the list.
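A simple hash suffices for the deterministic mapping: every machine computes the same (key, chunk) → core assignment locally, so a virtual gradient can travel independently and always lands on its owning core. The hash here is an assumption for illustration, not PHub's actual scheme:

    #include <cstdint>
    #include <cstdio>

    struct VGrad { uint32_t key; uint32_t chunk; };  // a "virtual gradient"

    // Deterministic: the same (key, chunk) maps to the same core everywhere.
    uint32_t core_of(VGrad v, uint32_t num_cores) {
        return (v.key * 2654435761u + v.chunk) % num_cores;
    }

    int main() {
        for (uint32_t c = 0; c < 4; ++c)             // chunks of key 0
            std::printf("key 0, chunk %u -> core %u\n", c, core_of({0, c}, 10));
    }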
45 Software optimizations: tall aggregation and optimization. When aggregation of a chunk is done:
- PHub optimizes the chunk with the same core that aggregated it.
- FP32-level streaming aggregation and optimization hides communication latency.
[Diagram: the gradient array for key 0 from 8 workers, now aggregated and optimized.] A sketch of this per-chunk flow appears below.
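A sketch of the per-chunk flow, under the assumption of a plain SGD optimizer and a per-chunk arrival counter (PHub's real pipeline interleaves this with the network at finer granularity):

    #include <vector>

    struct Chunk {
        std::vector<float> agg;   // running sum for this chunk
        int arrived = 0;          // workers that have contributed so far
    };

    // Called on the chunk's owning core whenever a worker's piece arrives.
    void on_gradient(Chunk& c, const float* grad, size_t n,
                     std::vector<float>& weights, size_t offset,
                     int num_workers, float lr) {
        for (size_t i = 0; i < n; ++i) c.agg[i] += grad[i];   // streaming aggregate
        if (++c.arrived == num_workers) {
            // Last piece is in: optimize on the SAME core, data still in cache.
            for (size_t i = 0; i < n; ++i)
                weights[offset + i] -= lr * c.agg[i] / num_workers;
            c.agg.assign(c.agg.size(), 0.0f);                 // reset for next step
            c.arrived = 0;
        }
    }

    int main() {
        const int workers = 8; const size_t n = 4;
        Chunk c{std::vector<float>(n, 0.0f)};
        std::vector<float> w(n, 1.0f), g(n, 0.1f);
        for (int k = 0; k < workers; ++k)
            on_gradient(c, g.data(), n, w, 0, workers, 0.5f);
    }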
48 Eliminating deployment bottlenecks. PHub hierarchical reduction: reducing cross-rack traffic.
50 Two-phase hierarchical aggregation: rack-scale parameter service. [Diagram: in each rack, worker/PS nodes connect to a PBox aggregator; racks connect through the cluster network.]
52 Two-phase hierarchical aggregation: adapting to the datacenter network topology. 1. Intra-rack central aggregation at each rack's aggregator. 2. Inter-rack aggregation between aggregators. Result: N times traffic reduction on cross-rack links. (Sketched below.)
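A minimal sketch of the two phases; with N nodes per rack, only one aggregated gradient leaves each rack instead of N, which is where the N-times reduction comes from. The rack layout and values are illustrative:

    #include <cstdio>
    #include <vector>

    using Grad = std::vector<float>;

    Grad reduce(const std::vector<Grad>& gs) {       // elementwise sum
        Grad out(gs[0].size(), 0.0f);
        for (const Grad& g : gs)
            for (size_t i = 0; i < g.size(); ++i) out[i] += g[i];
        return out;
    }

    int main() {
        // racks[r][n] = gradient from node n in rack r (toy values).
        std::vector<std::vector<Grad>> racks = {
            {Grad{1, 1}, Grad{2, 2}, Grad{3, 3}},
            {Grad{4, 4}, Grad{5, 5}, Grad{6, 6}},
        };
        // Phase 1: intra-rack central aggregation (traffic stays in the rack).
        std::vector<Grad> per_rack;
        for (const auto& r : racks) per_rack.push_back(reduce(r));
        // Phase 2: inter-rack aggregation -- one gradient per rack crosses
        // the oversubscribed core links.
        Grad global = reduce(per_rack);
        std::printf("global[0] = %.0f\n", global[0]);  // 21
    }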
55 Efficient DDNN training in commercial cloud: active topology probing. Pipeline: a DPDK-based latency probe runs across the VMs (Azure/EC2) → distance matrix → clustering algorithms → inferred network topology* → automagic schedule generation → hierarchical reduction plan.
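A sketch of what the probing stage computes: measure pairwise latencies, cluster VMs whose mutual latency falls below a threshold (near VMs likely share a rack), and feed the clusters into the reduction plan. The latencies, the threshold, and the greedy clustering are all assumptions for illustration:

    #include <cstdio>
    #include <vector>

    int main() {
        // Distance matrix from the latency probe, in microseconds (made up).
        std::vector<std::vector<double>> d = {
            {0, 20, 90, 95},
            {20, 0, 92, 88},
            {90, 92, 0, 25},
            {95, 88, 25, 0}};
        const double same_rack_us = 50.0;            // clustering threshold (assumed)

        int n = (int)d.size(), racks = 0;
        std::vector<int> rack(n, -1);
        for (int i = 0; i < n; ++i) {                // greedy single-link clustering
            if (rack[i] >= 0) continue;
            rack[i] = racks++;
            for (int j = i + 1; j < n; ++j)
                if (rack[j] < 0 && d[i][j] < same_rack_us) rack[j] = rack[i];
        }
        // The inferred grouping drives the schedule: phase 1 reduces within
        // each cluster, phase 2 across the cluster leaders.
        for (int i = 0; i < n; ++i)
            std::printf("VM %d -> inferred rack %d\n", i, rack[i]);
    }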
63 Performance in commercial cloud with PHub, vs. Facebook Gloo and vs. ring reduction. [Charts: results on Azure (Standard NC6) and EC2 (P3.2xlarge).] Setup: Windows Azure and Amazon EC2, 32 instances, up to 10 Gbps. Standard_NC6: Nvidia K80, batch size 512. P3.2xlarge: Nvidia V100, batch size 512. Facebook Caffe2/PyTorch, ResNet.
64 Framework integration: support for MXNet/PyTorch/Caffe2.
    // Create a PHub instance from the cluster and application configuration.
    auto phub = std::make_shared<phub>(cfg.redisip, nmap, keysize, appaddrs,
                                       cntr, sizeof(float), cfg.rank, plp);
    phub->toggleuseschedule(pschedule); // opt in to the reduction schedule
    phub->reduce();                     // run a gradient reduction
65 [Closing slide: the TVM stack diagram again, with Your Cloud and Active Topology Probing.] Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.
68 Hardware Parameter Hub
70 Hardware Parameter Hub: balanced computation and communication resources. [Diagram: ConnectX-3 cards providing 560+ Gbps of network bandwidth against 800 Gbps of PCIe.] Fully supported by the software Parameter Hub.
71 Hardware Parameter Hub: 35 GB/s aggregation throughput; supports 100+ ResNet-50 training nodes with a single machine. [Chart: aggregation throughput of Gloo HD, Gloo Ring, PS-Lite, and PHub SW.]
72 Hardware Parameter Hub: 25% better training throughput/$. ResNet-50; see paper for detailed estimates.
Paper: Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training. University of Washington and Microsoft Research. arXiv cs.DC, 21 May 2018.