Scalable Distributed Training with Parameter Hub: a whirlwind tour

1 Scalable Distributed Training with Parameter Hub: a whirlwind tour

2 TVM Stack [Diagram: Optimization; High-Level Differentiable IR; Tensor Expression IR / AutoTVM; LLVM, CUDA, Metal / VTA, AutoVTA; Edge FPGA, Cloud FPGA, ASIC; Hardware Fleet; extended with Your Cloud and Active Topology Probing] Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.

4 Parameter Hub Optimized, topology-aware and dynamic mechanism for inter-machine communication (in the cloud-based training context). Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee and Arvind Krishnamurthy

6 Deep Learning constitutes an important workload in the cloud today. Major cloud providers all have an ecosystem for cloud-based learning.

8 Server demand for DL inference across data centers nearly quadrupled in less than 2 years. Source: Facebook 6

10 EC2 reclaims your GPU instances when it runs out of capacity.

12 Distributed Training INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE [Timeline: Worker 1 and Worker 2 each run F1 B1, F2 B2, F3 B3 ...; the parameter server runs A1 O1, A2 O2 ... between each backward pass and the next forward pass. (F)orward Pass, (B)ackward Pass, (A)ggregation, (O)ptimization]
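
To make the timeline concrete, here is a minimal sketch of one synchronous data-parallel iteration with a central parameter server. This is illustrative only (not PHub's or MXNet's code); the types, the stubbed forward/backward pass, and the plain SGD update are assumptions.

    #include <cstddef>
    #include <vector>

    struct Gradient { std::vector<float> values; };
    struct Model    { std::vector<float> weights; };

    // Illustrative stub: a worker's independent (F)orward and (B)ackward pass
    // on its own mini-batch would go here.
    Gradient forward_backward(const Model& m, int /*worker_id*/) {
        return Gradient{std::vector<float>(m.weights.size(), 0.01f)};
    }

    // Parameter-server side: coordinated (A)ggregation and (O)ptimization.
    void train_step(Model& m, int num_workers, float lr) {
        std::vector<float> sum(m.weights.size(), 0.0f);
        for (int w = 0; w < num_workers; ++w) {              // gradients pushed by every worker
            Gradient g = forward_backward(m, w);
            for (std::size_t i = 0; i < sum.size(); ++i) sum[i] += g.values[i];
        }
        for (std::size_t i = 0; i < m.weights.size(); ++i)   // plain SGD on the aggregate
            m.weights[i] -= lr * (sum[i] / num_workers);
        // Workers pull the updated weights before their next forward pass.
    }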

14 Distributed Training Today IN THE CONTEXT OF THE CLOUD [Diagram: machines with GPUs and plain machines connected through the network core]

15 Distributed Training Today FORWARD AND BACKWARD PASSES IN WORKER Network Core Worker 1 PS 1 PS 2 Worker 2 11

16 Distributed Training Today AGGREGATION AND OPTIMIZATION IN PS Network Core Worker 1 PS 1 PS 2 Worker 2 12

17 Distributed training is communication bound - Problem gets worse over time: shifting bottleneck. - With modern GPUs, most of the time is spent on communication. - Making GPUs faster will do little to increase throughput. - Wasting compute resources. [Chart: ResNet 269 iteration time in seconds on GRID 520, K80, M60, and V100, split into GPU and network active vs. GPU idle, waiting on network]

18 Distributed training is communication bound [Chart: results for AlexNet, ResNet 269, Inception V3, and GoogleNet]

19 Bottlenecks in DDNN training MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT. Network Core Worker 1 PS 1 PS 2 Worker 2 15

20 Bottlenecks in DDNN training FRAMEWORK BOTTLENECKS Network Core Worker 1 PS 1 PS 2 Worker 2 16

21 Bottlenecks in DDNN training FRAMEWORK BOTTLENECKS [Diagram: inside each worker, GPU → training framework → network; workers and parameter servers connected through the network core]

23 Bottlenecks in DDNN training FRAMEWORK BOTTLENECKS [Chart: per-iteration time in seconds for ResNet 269, Inception, GoogleNet, and AlexNet, broken down into Compute, Data Copy and Communication, Aggregator, Optimizer, and Synchronization and other Overheads]

24 Bottlenecks in DDNN training MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT. Network Core Worker 1 PS 1 PS 2 Worker 2 18

25 Bottlenecks in DDNN training BANDWIDTH BOTTLENECK Network Core Worker 1 PS 2 PS 1 Worker 2 19

26 Bottlenecks in Cloud-based DDNN training INSUFFICIENT BANDWIDTH Minimum bandwidth required for each of the popular NNs so that communication does not bottleneck computation (8 workers, GTX 1080 Ti, central parameter servers, MXNet): GoogleNet / Inception: 40 Gbps; ResNet: 100 Gbps; AlexNet: 1200 Gbps. Cloud bandwidth today is only 10 to 25 Gbps.
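
A back-of-the-envelope sketch of where such numbers come from, with assumed model size and compute time (these constants are illustrative, not the paper's measurements): communication must move roughly two model-sized transfers per iteration within the GPU's compute time.

    #include <cstdio>

    int main() {
        const double model_mb       = 240.0;  // roughly AlexNet-sized parameters (assumption)
        const double compute_sec    = 0.030;  // per-iteration forward+backward time on a fast GPU (assumption)
        const int    workers        = 8;      // as in the 8-worker GTX 1080 Ti setup

        const double bytes_per_iter = 2.0 * model_mb * 1e6;          // push gradients + pull weights
        const double worker_gbps    = bytes_per_iter * 8 / compute_sec / 1e9;
        const double central_gbps   = worker_gbps * workers;         // a central PS sees all workers

        std::printf("per worker: %.0f Gbps, at a central PS: %.0f Gbps\n",
                    worker_gbps, central_gbps);
        return 0;
    }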

31 Bottlenecks in Cloud-based DDNN training MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT. Network Core Worker 1 PS 1 PS 2 Worker 2 21

32 Bottlenecks in Cloud-based DDNN training DEPLOYMENT-RELATED OVERHEAD Network Core Worker 1 PS 1 PS 2 Worker 2 22

33 Bottlenecks in Cloud-based DDNN training DEPLOYMENT-RELATED OVERHEAD Transient congestion, or oversubscription by design: cross-rack communication cost is higher than intra-rack communication. [Chart: measured bandwidth between hosts in Cluster 1 and Cluster 2, in Gbps, as low as 4 Gbps]

34 Parameter Hub Optimizations CODESIGNING SOFTWARE AND HARDWARE WITH CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING

35 Eliminating framework bottlenecks: PHub Optimizations: streamlining DDNN training pipeline Network Core GPU Data Copy Aggregation Optimization Network Worker 1 PS 1 PS 2 Worker 2 25

37 Software Optimizations GRADIENTS MEMORY Network Core Worker 1 CPU PS 2 PS 1 Worker 2 27

39 Software Optimizations GRADIENT AGGREGATION AND OPTIMIZATION [Figure: candidate aggregation schemes across cores on NUMA 0 / NUMA 1] - One option: each core reads the input queue from a different worker and writes to a different location in the output queue. - Wide Aggregation (used in MXNet): for each input queue, launch a group of threads for aggregation; requires synchronization. - Tall Aggregation: each core sequentially aggregates the same portion of gradients within each queue; great locality, no synchronization. - NUMA-aware tree reduction: organize processors into a hierarchy and perform a tree reduction; too much coherence and synchronization.

41 Software Optimizations TALL AGGREGATION AND OPTIMIZATION Core Mappings - Chunk a gradient into a series of virtual gradients deterministically. - A virtual gradient is mapped to a particular core on the server. Gradient Array for Key 0 from 8 workers 29

43 Software Optimizations TALL AGGREGATION AND OPTIMIZATION Core Mappings - Chunk a gradient into a series of virtual gradients deterministically. - A virtual gradient is mapped to a particular core on the server. - Virtual gradients are transferred independently. Gradient Array for Key 0 from 8 workers 29

44 Software Optimizations TALL AGGREGATION AND OPTIMIZATION Core Mappings Aggregated - Chunk a gradient into a series of virtual gradients deterministically. - A virtual gradient is mapped to a particular core on the server. - Virtual gradients are transferred independently. - A chunk is only processed by a single core: maintaining maximum locality. Gradient Array for Key 0 from 8 workers
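
A minimal sketch of that chunking and core mapping (illustrative; the chunk size and the hash used for the mapping are assumptions, not PHub's actual constants). Because the mapping is deterministic, workers and the server agree on which core owns which chunk without any coordination.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct VirtualGradient {            // one chunk of a key's gradient
        uint32_t    key;
        uint32_t    chunk_index;
        std::size_t offset, length;     // position within the full gradient array
    };

    // Deterministic chunk -> core mapping: the same chunk always lands on the same core.
    int core_for_chunk(uint32_t key, uint32_t chunk_index, int num_cores) {
        return static_cast<int>((key * 2654435761u + chunk_index) % num_cores);
    }

    // Split one key's gradient into independently transferable virtual gradients.
    std::vector<VirtualGradient> chunk_gradient(uint32_t key, std::size_t grad_len,
                                                std::size_t chunk_len) {
        std::vector<VirtualGradient> chunks;
        for (std::size_t off = 0, idx = 0; off < grad_len; off += chunk_len, ++idx)
            chunks.push_back({key, static_cast<uint32_t>(idx), off,
                              std::min(chunk_len, grad_len - off)});
        return chunks;
    }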

45 Software Optimizations TALL AGGREGATION AND OPTIMIZATION When aggregation of a chunk is done: - PHub optimizes the chunk with the same core that aggregated it. - FP32-level streaming aggregation and optimization to hide communication latency. [Figure: gradient array for Key 0 from 8 workers, with chunks marked Aggregated and Optimized]
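
A sketch of the per-core work implied here (illustrative; the receive machinery is omitted and plain SGD stands in for the optimizer): the core that owns a chunk aggregates it across workers and immediately applies the optimizer to that same chunk, so optimization of completed chunks overlaps with communication of chunks still in flight.

    #include <cstddef>
    #include <vector>

    // Runs on one server core for one chunk it owns; since no other core touches
    // this range of the weights, no synchronization is needed.
    void aggregate_and_optimize_chunk(std::vector<float>& weights,
                                      const std::vector<const float*>& worker_chunks,
                                      std::size_t offset, std::size_t len, float lr) {
        const float n = static_cast<float>(worker_chunks.size());
        for (std::size_t i = 0; i < len; ++i) {
            float sum = 0.0f;
            for (const float* g : worker_chunks) sum += g[i];   // aggregate (A)
            weights[offset + i] -= lr * (sum / n);              // optimize (O) on the same core
        }
        // Called per chunk as soon as that chunk has arrived from every worker,
        // which is how the FP32-level streaming hides communication latency.
    }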

48 Eliminating deployment bottlenecks: PHub hierarchical reduction: reducing cross-rack traffic Network Core Worker 1 PS 1 PS 2 Worker 2

50 Two-Phase Hierarchical Aggregation RACK SCALE PARAMETER SERVICE Cluster Network CM Rack Worker/PS 1 Worker/PS N Worker/PS 1 Worker/PS 2 34

51 Two-Phase Hierarchical Aggregation RACK SCALE PARAMETER SERVICE Cluster Network CM Rack Aggregator Worker/PS 1 Worker/PS N PBox Worker/PS 1 Worker/PS 2 35

52 Two-Phase Hierarchical Aggregation ADAPTING TO THE DATACENTER NETWORK TOPOLOGY [Diagram: racks of Worker/PS nodes, each rack with its own Aggregator, connected through the cluster network] 1. Intra-rack central aggregation. 2. Inter-rack aggregation. N times traffic reduction!
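
A sketch of the two-phase reduction (illustrative; all transport is omitted): each rack first reduces its own workers' gradients at the rack aggregator, and only one partial sum per rack crosses racks, which is where the N-times cross-rack traffic reduction comes from.

    #include <cstddef>
    #include <vector>

    using Grad = std::vector<float>;

    static Grad sum_grads(const std::vector<Grad>& gs) {
        Grad out(gs.front().size(), 0.0f);
        for (const Grad& g : gs)
            for (std::size_t i = 0; i < out.size(); ++i) out[i] += g[i];
        return out;
    }

    // Phase 1: intra-rack central aggregation at each rack's aggregator.
    // Phase 2: inter-rack aggregation over one partial sum per rack.
    Grad hierarchical_reduce(const std::vector<std::vector<Grad>>& grads_by_rack) {
        std::vector<Grad> rack_partials;
        for (const auto& rack : grads_by_rack)
            rack_partials.push_back(sum_grads(rack));   // traffic stays inside the rack
        return sum_grads(rack_partials);                // only this phase crosses racks
    }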

55 Efficient DDNN Training in Commercial Cloud ACTIVE TOPOLOGY PROBING [Pipeline: VMs on Azure/EC2 → DPDK-based latency probe → distance matrix → clustering algorithms → inferred network topology* → automagic schedule generation → hierarchical reduction plan]
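
A sketch of the grouping step in that pipeline (illustrative; the DPDK probe itself and the real clustering algorithm are out of scope, and the threshold is an assumption): VMs whose mutual probe latency is low are placed in the same inferred rack, and each group then becomes one intra-rack reduction group for the hierarchical plan.

    #include <vector>

    // latency_us[i][j]: pairwise latency from the DPDK-based probe (distance matrix).
    // Greedy threshold grouping; a real implementation can use proper clustering.
    std::vector<std::vector<int>> infer_racks(const std::vector<std::vector<double>>& latency_us,
                                              double same_rack_threshold_us) {
        const int n = static_cast<int>(latency_us.size());
        std::vector<int> group(n, -1);
        std::vector<std::vector<int>> racks;
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < i && group[i] < 0; ++j)
                if (latency_us[i][j] < same_rack_threshold_us) group[i] = group[j];
            if (group[i] < 0) { group[i] = static_cast<int>(racks.size()); racks.emplace_back(); }
            racks[group[i]].push_back(i);     // VM i joins its inferred rack
        }
        return racks;   // each inner vector feeds one intra-rack aggregation group
    }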

63 Performance in commercial cloud with PHub [Charts: results on Azure (Standard NC6) and EC2 (P3.2xlarge), PHub vs. Facebook Gloo and vs. ring reduction] Windows Azure and Amazon EC2, 32 instances, up to 10 Gbps. Standard_NC6: Nvidia K80, batch size = 512. P3.2xlarge: Nvidia V100, batch size = 512. Facebook Caffe2 / PyTorch, ResNet.

64 Framework Integration Support for MXNet/PyTorch/Caffe2. Integration is a few calls (identifier casing as transcribed from the slide; 'var' corrected to 'auto'):
    auto phub = std::make_shared<phub>(cfg.redisip, nmap, keysize, appaddrs,
                                       cntr, sizeof(float), cfg.rank, plp);  // construct the PHub instance
    phub->toggleuseschedule(pschedule);  // use the generated reduction schedule
    phub->reduce();                      // perform the gradient reduction

65 [TVM stack diagram again: Optimization; High-Level Differentiable IR; Tensor Expression IR / AutoTVM; LLVM, CUDA, Metal / VTA, AutoVTA; Edge FPGA, Cloud FPGA, ASIC; Hardware Fleet; extended with Your Cloud and Active Topology Probing] Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.

68 Hardware Parameter Hub

70 Hardware Parameter Hub Balanced computation and communication resources: ConnectX-3 cards, 560+ Gbps network bandwidth, 800 Gbps PCIe bandwidth. Fully supported by the software Parameter Hub.

71 Hardware Parameter Hub 35 GB/s aggregation throughput. Supports 100+ ResNet-50 training nodes with a single machine. [Chart: aggregation throughput compared against Gloo HD, Gloo Ring, PS-Lite, and PHub SW]
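
A back-of-the-envelope check of that claim with assumed per-node numbers (the gradient size and iteration rate below are illustrative, not measurements from the paper):

    #include <cstdio>

    int main() {
        const double grad_gb         = 0.1;   // ~25M ResNet-50 parameters in FP32 (assumption)
        const double iters_per_sec   = 3.0;   // per-node training rate (assumption)
        const double phub_gb_per_sec = 35.0;  // aggregation throughput quoted on the slide

        const double per_node_gb_per_sec = grad_gb * iters_per_sec;   // GB/s of gradients per node
        std::printf("supported nodes ~ %.0f\n", phub_gb_per_sec / per_node_gb_per_sec);
        return 0;
    }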

72 Hardware Parameter Hub ResNet-50; see paper for detailed estimates. 25% better training throughput/$.

Paper: Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training. Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy. University of Washington / Microsoft Research. arXiv v1 [cs.DC], 21 May 2018.
