Why AI Frameworks Need (not only) RDMA?

Why AI Frameworks Need (not only) RDMA? With Design and Implementation Experience of Networking Support on TensorFlow GDR, Apache MXNet, WeChat Amber, and Tencent Angel. Bairen Yi (byi@connect.ust.hk), Jingrong Chen, Junxue Zhang, Jiacheng Xia, Li Chen, and Kai Chen. SING Group, WHAT Lab, Big Data Institute, Hong Kong University of Science & Technology. 1

A Perspective from the (non-HPC) System and Networking Community. Prof. Kai Chen and the System & Networking Lab @ HKUST. Research interests include networked systems design and implementation, data center networks, data-centric networking, and cloud & big data systems. 5 papers in NSDI '15-'18, 4 papers in SIGCOMM '15-'17. Collaborations with industrial partners including Tencent & Huawei on real-world systems in AI, Big Data, and Cloud. 2

Industrial Experience from an Academic Lab? Worked on real-world AI & Big Data systems such as CoDA with Huawei (2015), Tencent Angel (2016), and WeChat Amber (2017). Contributed network optimisation patches to several open source projects, including TensorFlow & Apache MXNet. A recently funded startup in Beijing provides commodity, high-performance data center networking solutions to AI teams of data scientists, developers, and operations. 3

Agenda: a glance at commodity data center networking; convergence of networking in cloud data centers; anti-patterns with high-performance networks; end-to-end design principles for AI frameworks. 4

The Rise of Cloud Computing 40 Azure Regions across the World 5

Data Centers: what does a data center network look like? 6


What Modern Data Center Networks Offer. High throughput: a single connection at line rate. Low latency: 99.9% tail latency within 200 μs*. Scalability: >100,000 nodes in a routable IP network. Commodity: <$100 per 25 GbE port (<$500 per 100 GbE port). *RDMA over Commodity Ethernet at Scale, SIGCOMM '16. 8

Convergence of Data Center Networking Technologies. InfiniBand, OmniPath, Fibre Channel, and PLX (PCIe switch) fabrics are being replaced by a single Ethernet: four networks (and their switches) collapse into one. Networking applications converge onto IP as well: computation, storage, messaging, and remote management. (Routable) RDMA over Converged Ethernet (RoCEv2). 9

Evolution of Network I/O. Latency dropped from ~10 ms to ~10 μs. From kernel to user space (VMA, DPDK, Onload); from software to hardware (RoCE, iWARP). Reduced CPU load for better performance. 10

Software Anti-Patterns. "High-performance networking is costly, and we can only afford commodity Ethernet (a.k.a. AWS VPC)." "Network communication hurts performance, so we need to avoid communication as much as possible." "You get either high-level APIs with poor performance or low-level APIs with high performance, not both." 11

Networking APIs Revisited. Messaging libraries: socket, 0MQ, Netty, Akka. RPC libraries: gRPC, Thrift, Dubbo, brpc. Encoding libraries: protobuf, Thrift, Kryo, FlatBuffers. Compression libraries: zlib, Snappy, LZ4. 12

~100 MB/s encoding throughput. Lesson learnt: do not encode your data when your network is >100 Gbps. Taken from: https://google.github.io/flatbuffers/flatbuffers_benchmarks.html 13

~850 MB/s compression throughput. Lesson learnt: do not compress your data when your network is >100 Gbps. Taken from: https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/ 14
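
A rough back-of-envelope check (not from the slides) makes the two lessons above concrete: 100 Gbps is roughly 12.5 GB/s of payload, so an encoder at ~100 MB/s per core would need on the order of 125 cores to keep the link busy, and even a ~850 MB/s compressor would need about 15. The toy C program below only prints these ratios; the per-core throughput figures are the ones quoted above, used here as assumptions rather than measurements.

```c
/* Back-of-envelope check (not from the slides): how many CPU cores of
 * encoding or compression throughput it takes to keep a 100 Gbps link busy,
 * using the ~100 MB/s and ~850 MB/s per-core figures quoted above. */
#include <stdio.h>

int main(void) {
    const double link_gbps = 100.0;                    /* line rate in Gbit/s */
    const double link_mbs  = link_gbps * 1000.0 / 8.0; /* = 12500 MB/s        */
    const double encode_mbs   = 100.0;                 /* FlatBuffers-class encoder, per core */
    const double compress_mbs = 850.0;                 /* fast compressor, per core           */

    printf("100 Gbps link ~= %.0f MB/s of payload\n", link_mbs);
    printf("cores needed to saturate it with encoding:    ~%.0f\n", link_mbs / encode_mbs);
    printf("cores needed to saturate it with compression: ~%.0f\n", link_mbs / compress_mbs);
    return 0;
}
```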

How about Memory Copy? 15

Memory write bandwidth vs. buffer size (16 KB to 64 MB), cached vs. cache-bypass. Cached writes: 37.7 GB/s at 16 KB, dropping to 3.0 GB/s once the buffer exceeds the L3 cache. Cache-bypassed writes: 3.0 GB/s at all buffer sizes. Taken from: http://zsmith.co/bandwidth.html 16
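
For readers who want to reproduce the flavour of this result, the sketch below is a crude memcpy bandwidth probe in C (it is not the zsmith.co benchmark and does not use cache-bypassing stores): buffers that fit in cache should report far higher figures than buffers larger than L3. The buffer sizes and iteration count are arbitrary choices.

```c
/* Crude memory-copy bandwidth probe (a sketch, not the zsmith.co benchmark):
 * copies a buffer repeatedly and reports GB/s. Small buffers that fit in cache
 * should report far higher figures than buffers larger than the L3 cache. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double copy_gbs(size_t bytes, int iters) {
    char *src = malloc(bytes), *dst = malloc(bytes);
    if (!src || !dst) { perror("malloc"); exit(1); }
    memset(src, 0xAB, bytes);   /* touch pages up front so the timed loop   */
    memset(dst, 0, bytes);      /* measures copies, not page faults         */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    free(src); free(dst);
    return (double)bytes * iters / secs / 1e9;
}

int main(void) {
    size_t sizes[] = { 16 << 10, 256 << 10, 4 << 20, 64 << 20 }; /* 16 KB .. 64 MB */
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("%8zu KB: %6.1f GB/s\n", sizes[i] >> 10, copy_gbs(sizes[i], 200));
    return 0;
}
```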

AI Applications are Bottlenecked by their Anti-Patterns. The performance of Spark 1.4 increases only 2% at the median even if the network is infinitely fast*! Encoding, compression, serialisation, and memory copying take the most CPU cycles, not networking (nor disk I/O; about 20% better if it is infinitely fast). The software architecture makes the CPU its bottleneck. *Making Sense of Performance in Data Analytics Frameworks, NSDI '15. 17

(Diagram) Performance gain eaten by software forwarding of data: between the network device (RDMA???), the computation device (CUDA???), and the storage device (NVMe???) sit the networking protocol stack, memory copy, compression, serialization, the job scheduler, and the file system. 18

Yahoo's TensorFlow RDMA*: removing serialization/deserialization gives 1.5 GB/s throughput. 19

0-Copy Dataflow. The end-to-end offloaded dataflow pipeline: NVMe storage, RDMA networks, and GPU accelerators. Lesson learnt from network switches: we need to separate the control plane and the data plane; CPU/software for a flexible control plane, hardware offloading for a high-performance data plane. 20
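
As a concrete illustration of this control-plane/data-plane split, the hedged sketch below posts a one-sided RDMA write with the verbs API: the CPU only fills in a work request (control plane) and the NIC moves the bytes (data plane). It assumes an already-connected QP and an already-registered memory region; qp, mr, remote_addr, and rkey are placeholders set up elsewhere, and completion polling is omitted.

```c
/* Sketch of the control-plane / data-plane split with RDMA verbs.
 * Assumes a connected QP and a registered MR already exist (qp, mr,
 * remote_addr, rkey are placeholders set up elsewhere). The CPU only
 * builds and posts a work request; the NIC performs the transfer. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: no remote CPU involved   */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion to poll later  */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* returns 0 on success */
}
```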

(Diagram) Disaggregated Architecture: a CPU/software control plane on top of a high-performance data plane of offloaded computation, networking, and storage (computation device, network device, storage device), built around lean data structures and P2P switching. 21


TensorFlow GDR 23

TensorFlow GDR. We kept 100% of gRPC calls without introducing significant overhead compared to pure RDMA. Easy to fall back to gRPC in mixed deployments. Fewer code changes compared to the verbs implementation, with even more features (<1,000 lines of C++, mostly for GPUDirect RDMA and RDMA memory management). 24
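
The sketch below is not the TensorFlow GDR source, just a minimal illustration of the GPUDirect RDMA idea it relies on: with NVIDIA's peer-memory kernel module loaded, a cudaMalloc'd device pointer can be registered with ibv_reg_mr, so the NIC can DMA directly to and from GPU memory without staging through host buffers. The function name and the assumption that a protection domain already exists are mine.

```c
/* Minimal illustration (not the TensorFlow GDR source) of GPUDirect RDMA:
 * with the NVIDIA peer-memory module loaded, a cudaMalloc'd pointer can be
 * registered with ibv_reg_mr so the NIC DMAs straight into GPU memory.
 * Assumes an ibv_pd obtained elsewhere; error handling is trimmed. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdio.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes, void **gpu_ptr)
{
    if (cudaMalloc(gpu_ptr, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return NULL;
    }

    /* Same call used for host memory; the peer-memory module resolves the
     * GPU pages, so no bounce buffer or cudaMemcpy sits on the data path. */
    struct ibv_mr *mr = ibv_reg_mr(pd, *gpu_ptr, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        fprintf(stderr, "ibv_reg_mr on GPU memory failed (is nv_peer_mem loaded?)\n");
    return mr;
}
```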

WeChat's Amber w/ RDMA. Taken from: http://www.infoq.com/cn/news/2017/08/roce-wechat-amber 25

(Chart) WeChat's Amber w/ RDMA: scalability ratio (0MQ for TCP) on the Ego Network, Deep Conversation, and Object Recognition workloads. Taken from: http://www.infoq.com/cn/news/2017/08/roce-wechat-amber 26

To Copy or Not To Copy. RDMA buffers need to be registered (pinned) through ibv_reg_mr before send/recv. Pinning memory pages through get_user_pages in the kernel is costly, especially when done frequently for small buffers. Typically we introduce transmit/receive-side ring buffers backed by huge pages so registered RDMA buffers can be reused; the extra copy into those buffers introduces buffer bloat and added latency. 27
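
A minimal sketch of the "pin once, reuse" pattern described above, with names and sizes of my own choosing: a huge-page-backed staging ring is allocated with mmap(MAP_HUGETLB) and registered a single time, so small messages pay a memcpy into the ring instead of a per-message ibv_reg_mr/get_user_pages round trip.

```c
/* Sketch of the "pin once, reuse" pattern described above: a huge-page-backed
 * staging buffer registered a single time, so small messages are copied into
 * it instead of paying ibv_reg_mr / get_user_pages per message.
 * MAP_HUGETLB needs reserved huge pages (vm.nr_hugepages); pd comes from elsewhere. */
#define _GNU_SOURCE
#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <stdio.h>

#define RING_BYTES (64UL << 20)   /* 64 MB staging ring (illustrative size) */

struct ibv_mr *alloc_registered_ring(struct ibv_pd *pd, void **ring_out)
{
    void *ring = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ring == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return NULL;
    }

    /* One registration up front; senders then memcpy payloads into the ring
     * and post them with mr->lkey, trading one extra copy for no
     * per-message pinning cost. */
    struct ibv_mr *mr = ibv_reg_mr(pd, ring, RING_BYTES,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        munmap(ring, RING_BYTES);
        return NULL;
    }
    *ring_out = ring;
    return mr;
}
```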

Looking Forward. Unified Virtual Memory (UVM) in CUDA 6. On-Demand Paging (ODP) in OFED 4. MSG_ZEROCOPY by Google in Linux 4.14. Heterogeneous Memory Management (HMM) for a universal, coherent host+device memory space. 28
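
Of these, MSG_ZEROCOPY is the easiest to try today; the sketch below shows the opt-in and send path on Linux 4.14+ (completion handling on the socket error queue is omitted, and the fallback #defines are only for older userspace headers).

```c
/* Minimal sketch of Linux MSG_ZEROCOPY (4.14+), mentioned above.
 * The kernel pins the pages and transmits without copying; the buffer must
 * stay untouched until a completion arrives on the socket error queue
 * (polled with recvmsg(MSG_ERRQUEUE), omitted here for brevity). */
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60           /* value from <asm-generic/socket.h> on older headers */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
        perror("setsockopt(SO_ZEROCOPY)");   /* kernel or NIC may not support it */
        return -1;
    }
    /* The kernel falls back to an ordinary copying send if zerocopy cannot be used. */
    ssize_t n = send(fd, buf, len, MSG_ZEROCOPY);
    if (n < 0)
        perror("send(MSG_ZEROCOPY)");
    return (int)n;
}
```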

Conclusion. Embrace the end-to-end principle when designing AI frameworks for data-intensive applications. Revisit old operating system concepts and learn how to write low-level programs for your hardware. Combining high-level APIs with efficient offloading actually works, as long as the hardware does all the heavy lifting and the software stays in the control plane. 29

Questions? 30