Construct a Sharable GPU Farm for Data Scientists. Layne Peng, Dell EMC OCTO, TRIGr


Agenda
- Introduction
- Problems of consuming GPU
- GPU-as-a-Service (project )
- How to manage GPUs in the farm?
- How to schedule jobs to GPU hosts?
- How to integrate with cloud platforms?
- Next steps and contacts

Technology Research Innovation Group (TRIGr)
- Innovation
- Advanced Research
- Proof of Concept
- User Feedback
- Agile Roadmap

About me
Email: layne.peng@dell.com
Twitter: @layne_peng
Wechat:
Layne Peng, Principal Technologist at OCTO, experienced in cloud computing and SDDC, responsible for cloud-computing-related initiatives since joining in 2011; holds 10+ patents and is the author of a big data and cloud computing book.

What is GPU & GPGPU?
Graphics Processing Unit
- Designed for parallel processing
- Latency: increase clock speed
- Throughput: increase concurrent task execution
General-Purpose GPU (GPGPU)
- Why: scientific computing and graphics processing both reduce to matrix & vector operations
- Prerequisite: key features supported in the GPU, such as floating-point computation
- Easy to use: CUDA => Theano, TensorFlow, etc. (a minimal framework example follows this slide)
- Widely used in machine learning and deep learning
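
As a minimal illustration of the GPGPU idea above, the hedged TensorFlow 2.x sketch below pins a matrix multiplication to a GPU device when one is visible; the tensor shapes are arbitrary and chosen only for the example.

    import tensorflow as tf  # assumes TensorFlow 2.x with GPU support installed

    # List visible GPUs; fall back to CPU when none are present.
    gpus = tf.config.list_physical_devices("GPU")
    device = "/GPU:0" if gpus else "/CPU:0"

    with tf.device(device):
        # A matrix-style workload -- the kind of computation GPGPU accelerates.
        a = tf.random.uniform((4096, 4096))
        b = tf.random.uniform((4096, 4096))
        c = tf.matmul(a, b)

    print("ran on:", c.device)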

Problems of consuming GPU today
- Not widely deployed
- High cost
- Low availability when sharing
- Hard to monitor at a fine granularity
What the project targets instead:
✓ Remote GPU as a Service
✓ Fine-grained resource control
✓ Sharable
✓ Fine-grained resource monitoring and tracing
Offering GPU in the cloud makes it more complex:
- How to abstract the GPU resource? (pass-through is the common solution today)
- Performance issues caused by the network
- Resource discovery and provisioning

Project GPU-as-a-Service
Remote GPU Access
- Consume remote GPU resources transparently
- Based on a queue model and various network fabrics
GPU Sharing
- N:1 model: multiple users consume the same GPU accelerator
- Fine-grained control and management
GPU Chaining
- 1:N model: multiple GPU hosts chained as one to serve a single large application
- Dynamic number of GPUs fitted to the application
Smart Scheduling
- GPU pooling, discovery and provisioning
- Fine-grained resource monitoring and tracing
- Pluggable intelligent scheduling algorithms:
  o Heterogeneous resources
  o Network tradeoffs

Overview of project
- Provides transparent remote GPU resource access: interception of the GPU library calls
- Intelligent selection of the network fabric: TCP or RDMA
- Abstracts the GPU resource to : a device that, from the client's view, behaves the same as a GPU, with a small management granularity (a conceptual sketch follows this slide)
- GPU farm: GPU host management, sharable resource pool, chaining features, intelligent scheduling
[Diagram: CUDA applications (C) call into a GPU library on the client, which forwards requests over the network fabric (TCP, RDMA) to GPU libraries running on hosts in the GPU farm]
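
The slide shows no code, but conceptually the interception layer turns a local library call into a message that a remote GPU host services. The Python sketch below is a hypothetical, greatly simplified illustration of that queue/forwarding model; names such as RemoteGPUStub, the JSON encoding, and the plain-TCP transport are assumptions, not the project's actual implementation.

    import json
    import socket

    class RemoteGPUStub:
        """Hypothetical client-side stub: intercepts a 'GPU call' and forwards it
        over a plain TCP socket to a GPU host, mimicking the queue-based model."""

        def __init__(self, host: str, port: int):
            self.addr = (host, port)

        def call(self, op: str, **kwargs):
            # Serialize the intercepted call; a real system would ship binary
            # buffers and could pick RDMA instead of TCP for the transport.
            request = json.dumps({"op": op, "args": kwargs}).encode()
            with socket.create_connection(self.addr) as s:
                s.sendall(request + b"\n")
                reply = s.makefile().readline()
            return json.loads(reply)

    # Example (hypothetical endpoint): matrix multiply executed on a remote GPU host.
    # gpu = RemoteGPUStub("gpu-farm.example.com", 9000)
    # result = gpu.call("matmul", m=4096, n=4096, k=4096)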

How to manage GPUs in the farm?
Managing the GPU farm: Consul based (a registration sketch follows this slide)
Basic cluster features:
- Add new nodes
- Remove nodes
- Detect failures
- Expel failed nodes and resurrect them
Key takeaways:
- Abstract the hardware resource at a sufficiently high level
- Learn from the experience of microservice architectures (MSA)
Note:
- Master: an executable binary based on Consul with added features; it can also be deployed on a node with GPUs
- : an extension of Consul with check scripts and special configuration
[Diagram: manager nodes and GPU nodes connected by cluster links; actions shown: a. request to add to the GPU farm, b. request/force to leave the GPU farm, c. failure detected and the failed node expelled]
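
As a hedged sketch of the Consul-based approach, a GPU node could register itself with its local Consul agent over the standard /v1/agent/service/register HTTP endpoint and attach a health check that probes the device. The service name, port, tags, and check_gpu.sh script below are assumptions for illustration, not the project's actual configuration; script checks also require the agent to be started with script checks enabled.

    import requests  # plain HTTP client; the Consul agent API listens on 8500 by default

    # Hypothetical service definition for one GPU node. check_gpu.sh is assumed to
    # wrap something like nvidia-smi to verify the device is healthy.
    service = {
        "Name": "gpu-node",
        "ID": "gpu-node-01",
        "Port": 9000,
        "Tags": ["gpu", "farm"],
        "Check": {
            "Args": ["/usr/local/bin/check_gpu.sh"],
            "Interval": "10s",
            "Timeout": "5s",
        },
    }

    # Register with the local Consul agent; Consul's gossip handles membership and
    # failure detection, which gives the add/remove/expel behavior described above.
    resp = requests.put("http://127.0.0.1:8500/v1/agent/service/register", json=service)
    resp.raise_for_status()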

How to select GPU hosts to chain?
GPUs can be chained and exposed to the application as one large machine
GPU hosts are selected for chaining by:
- Capacity
- Queue length (crowdedness)
- Network radius: hops, bandwidth
Key takeaway: the network is a first-class consideration when deciding which GPUs serve an application, both in the remote case and when chaining hosts (see the selection sketch after this slide)
[Diagram: 1. select nodes according to capacity and network radius; 2. chain GPUs from the selected hosts to serve as one big machine; 3. offer them to the client via the Allocation API / Global Scheduler (XaaS_ )]
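
The slide lists the selection criteria but not an algorithm. The sketch below is one plausible greedy interpretation, where all field names, weights, and the scoring formula are assumptions for illustration: rank hosts by free capacity and queue length, penalize network distance, and keep adding hosts until the requested GPU count is covered.

    from dataclasses import dataclass

    @dataclass
    class GpuHost:
        name: str
        free_gpus: int      # capacity
        queue_length: int   # crowdedness
        hops: int           # network radius from the client
        bandwidth_gbps: float

    def select_hosts(hosts, gpus_needed, w_cap=1.0, w_queue=0.5, w_hops=2.0):
        """Greedy chaining: prefer roomy, idle, nearby hosts until the request is covered."""
        def score(h):
            return (w_cap * h.free_gpus - w_queue * h.queue_length
                    - w_hops * h.hops + 0.1 * h.bandwidth_gbps)

        chosen, remaining = [], gpus_needed
        for h in sorted(hosts, key=score, reverse=True):
            if remaining <= 0:
                break
            if h.free_gpus > 0:
                chosen.append(h)
                remaining -= h.free_gpus
        return chosen if remaining <= 0 else []  # empty result: request cannot be satisfied

    # Example: chain 6 GPUs out of a small farm.
    farm = [GpuHost("g1", 4, 2, 1, 40.0), GpuHost("g2", 2, 0, 1, 40.0), GpuHost("g3", 8, 5, 3, 10.0)]
    print([h.name for h in select_hosts(farm, gpus_needed=6)])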

How to schedule jobs to GPU hosts?
Two-level scheduling:
- 1st level: find GPU hosts in the farm
- 2nd level: find a device for each CUDA request
Global smart scheduling:
- Global status monitoring and tracing
- Network topology and analysis
- Queue-based priority
- Pluggable scheduling strategies (see the sketch after this slide)
Key takeaway: think about resource scheduling from a global view
[Diagram: client application -> GPU library -> frontend smart dispatcher with a task queue; the host dispatcher applies pluggable strategies (Network Strategy, Q-Length Strategy, H-Res Strategy) backed by intelligent analysis and a distributed global state store]
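
A hedged sketch of what a pluggable scheduling strategy could look like in code; the class and method names are assumptions, since the slide does not show the project's actual interfaces. Each strategy scores hosts, and the two-level dispatcher first picks a host and then a device on that host.

    from abc import ABC, abstractmethod

    class HostStrategy(ABC):
        """Pluggable first-level strategy: rank candidate GPU hosts for a job."""
        @abstractmethod
        def score(self, host: dict, job: dict) -> float: ...

    class QueueLengthStrategy(HostStrategy):
        def score(self, host, job):
            return -host["queue_length"]          # shorter queue wins

    class NetworkStrategy(HostStrategy):
        def score(self, host, job):
            return -host["hops"]                  # closer host wins

    class TwoLevelDispatcher:
        def __init__(self, strategy: HostStrategy):
            self.strategy = strategy              # swap strategies without touching dispatch logic

        def dispatch(self, hosts, job):
            host = max(hosts, key=lambda h: self.strategy.score(h, job))   # 1st level: host
            device = min(host["devices"], key=lambda d: d["utilization"])  # 2nd level: device
            return host["name"], device["id"]

    # Example usage with the queue-length strategy.
    hosts = [
        {"name": "g1", "queue_length": 3, "hops": 1,
         "devices": [{"id": 0, "utilization": 0.7}, {"id": 1, "utilization": 0.2}]},
        {"name": "g2", "queue_length": 0, "hops": 2,
         "devices": [{"id": 0, "utilization": 0.5}]},
    ]
    print(TwoLevelDispatcher(QueueLengthStrategy()).dispatch(hosts, job={"gpus": 1}))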

Leverage ML to solve heterogeneous scheduling?
Heterogeneous scheduling:
- Heterogeneous hardware resources
- Heterogeneous network fabrics
- An NP-hard problem: greedy algorithm, or something better?
Pluggable scheduler design:
- Abstraction of the scheduler API
- Machine learning jobs have workload patterns
- Metadata matters for scheduling
- Use machine learning to solve the scheduling problem of machine learning (a hedged sketch follows this slide)
Status: on-going
[Diagram: training phase learns a model from synced node attributes & status; in the inference phase the trained model sits behind the Scheduler API (External_Scheduler, Colocation_Scheduler) and the Global Scheduler / Allocation API (XaaS_ ): 1. job requests, 2. allocation strategies, 3. execute the job via the cluster manager]
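
The idea on this slide (still in progress, per the status note) is to learn allocation strategies from job metadata. The toy sketch below assumes scikit-learn and invented feature names, values, and strategy labels; it only illustrates how a simple classifier could map job metadata to a placement decision.

    # Toy illustration only: a scikit-learn classifier trained on hypothetical
    # job-metadata features to predict an allocation strategy label.
    from sklearn.ensemble import RandomForestClassifier

    # Features (all invented): [model_size_gb, batch_size, gpus_requested, is_distributed]
    X_train = [
        [0.5,  32, 1, 0],
        [4.0, 256, 4, 1],
        [1.2,  64, 2, 0],
        [8.0, 512, 8, 1],
    ]
    # Labels: which allocation strategy worked best for that workload pattern.
    y_train = ["single-host", "chained-rdma", "single-host", "chained-rdma"]

    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

    # Inference phase: an incoming job's metadata is mapped to a strategy.
    new_job = [[3.5, 128, 4, 1]]
    print(model.predict(new_job)[0])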

How to integrate with cloud platforms?
Offering allocation strategies as a service:
- High-level abstraction of the allocation/scheduling functions
- Non-invasive design, easy to integrate with the existing scheduler used in a cloud-native platform (a hypothetical request is sketched after this slide)
Access GPUs transparently from applications deployed in the cloud-native platform
Ease of integration with cloud-native platforms:
- Mesos/Marathon
- Kubernetes
- Cloud Foundry (demo at World 2017)
[Diagram: 1. job requests, 2. allocation strategies returned by the Allocation API / Global Scheduler (XaaS_ ) using synced node attributes & status, 3. the cluster manager executes the job, 4. the application transparently consumes GPU as a service]
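
To make "allocation strategies as a service" concrete, the sketch below shows how an external scheduler might call such an API. The endpoint path, payload fields, and response shape are all hypothetical, since the slide does not specify the API contract.

    import requests

    # Hypothetical Allocation API endpoint exposed by the global scheduler.
    ALLOCATION_API = "http://gpu-farm.example.com:8080/v1/allocations"

    def request_allocation(job_name: str, gpus: int, prefer_rdma: bool = True):
        """Ask the GPU farm for an allocation strategy instead of placing the job ourselves."""
        payload = {"job": job_name, "gpus": gpus, "prefer_rdma": prefer_rdma}
        resp = requests.post(ALLOCATION_API, json=payload, timeout=10)
        resp.raise_for_status()
        # Assumed response shape: which hosts/devices to use and over which fabric.
        return resp.json()

    # Example: a cluster-manager plugin would hand this strategy to its own executor.
    # strategy = request_allocation("resnet-training", gpus=4)
    # print(strategy)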

Example: Cloud Foundry Integration
A friendly machine learning interface to the project:
- Receives jobs from users
- Manages the cluster
- Starts the containers for jobs
With the Cloud Foundry offering, the machine learning platform doesn't need to consider how the environment is configured or how hardware acceleration is provided: "Run my app with GPU! I don't care how!"
Workflow (see the binding sketch after this slide):
1. The data scientist deploys a machine learning task/application with the CLI or GUI
2. A GPU service is provisioned for the application according to the XaaS_ allocation strategies and bound to the application
3. Users consume the task or application
[Diagram: Cloud Foundry components (Diego Brain, Diego Cells, Loggregator, Marketplace, Router) in front of the Allocation API / Global Scheduler (XaaS_ ) and the GPU farm]
Reach out: Victor Fong from the Dojo, victor.fong@dell.com, @victorfong
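
In Cloud Foundry, a bound service surfaces its credentials to the application through the VCAP_SERVICES environment variable. The sketch below shows how an app could read the endpoint of a bound GPU service; the service label "gpu-service" and the credential field names are assumptions, since each service broker defines its own schema.

    import json
    import os

    def gpu_service_endpoint(label: str = "gpu-service"):
        """Read the bound GPU service's endpoint from Cloud Foundry's VCAP_SERVICES."""
        vcap = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
        for instance in vcap.get(label, []):
            creds = instance.get("credentials", {})
            # Assumed credential fields; a real broker may name them differently.
            return creds.get("host"), creds.get("port")
        return None, None

    host, port = gpu_service_endpoint()
    if host:
        print(f"GPU farm reachable at {host}:{port}")
    else:
        print("No GPU service bound; running without acceleration")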

Next steps and contacts
Intelligent scheduling:
- Schedule according to application metadata
- Model and calculate network distance
- Introduce machine learning to solve heterogeneous resource scheduling
Keep improving performance:
- Intelligent data movement and loading
- Enterprise features such as snapshots
XaaS:
- More hardware resources as a service, such as FPGAaaS
- Globally maximized performance
Reach out to the team:
- Jack Harwood: jack.harwood@dell.com (distinguished engineer in )
- Frank Zhao: junping.zhao@dell.com (consultant engineer, leads the data path)
- Layne Peng: layne.peng@dell.com