Using GPUaaS in Cloud Foundry
Agenda
- Introduction
- GPUaaS
- Cloud Foundry Integration
Technology Research Innovation Group
- Innovation
- Advanced Research
- Proof of Concept
- User Feedback
- Agile Roadmap
Technology Research Innovation Group
- Frank Zhao, Consultant Technologist: leading the GPUaaS data path; distributed systems, microservices, real-time processing and analytics; 4 granted patents, 50+ pending patents
- Layne Peng, Principal Technologist: cloud computing and SDDC; 10+ patents; author of a big data and cloud computing book
- Victor Fong (@victorkfong): Sensei at the Dell EMC Dojo; Cloud Foundry contributor
GPU-as-a-Service: Project Asaka
What is GPU & GPGPU?
Graphical Processing Unit:
- Designed for parallel processing
- Latency: increase clock speed; throughput: increase concurrent task execution
General-Purpose GPU:
- Why: scientific computing and graphical processing both reduce to matrix & vector operations
- Prerequisite: key features supported in the GPU, such as floating-point computation
- Easy to use: CUDA => Theano, TensorFlow, etc.
- Popular use cases: scientific calculation, ML/DL, crypto/Bitcoin, etc.
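The "matrix & vector" point above comes down to data parallelism: each output element depends only on the same index of the inputs, so all elements can be computed concurrently. A minimal pure-Python sketch of that pattern (a vector add, the canonical shape of a CUDA kernel):

```python
# Conceptual sketch: the data-parallel pattern GPUs accelerate.
# Each output element depends only on one index of the inputs, so a GPU
# can compute every element concurrently (one thread per index).
def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

# On a GPU the same loop becomes a CUDA kernel launched over len(a)
# threads; frameworks like Theano or TensorFlow emit such kernels
# automatically, which is why they sit naturally on top of CUDA.
print(vector_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```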
Problems today:
- Not widely deployed
- High cost
- Low availability when sharing
- Hard to monitor at fine granularity
Offering GPU in the cloud makes it more complex:
- How to abstract the GPU resource? Today's solutions are pass-through
- Performance issues caused by the network
- Resource discovery and provisioning
Remote GPU as a Service:
- Fine-grained resource control
- Sharable
- Fine-grained resource monitoring and tracing
Project Asaka - GPU as a Service
Remote GPU Access:
- Consume remote GPU resources transparently
- Based on a queue model and various network fabrics
GPU Sharing:
- N:1 model: multiple users consume the same GPU accelerator
- Fine-grained control and management
GPU Chaining:
- 1:N model: multiple GPU hosts chained as one to serve a single large application
- Dynamically fit the number of GPUs to the application
Smart Scheduling:
- GPU pooling, discovery and provisioning
- Fine-grained resource monitoring and tracing
- Pluggable intelligent scheduling algorithms: heterogeneous resources, network tradeoffs
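The N:1 sharing model above rests on the queue model: several clients enqueue requests, and one GPU worker drains the queue. A hedged sketch (not Asaka's actual code; the "kernel" is a placeholder):

```python
import queue
import threading

# Hedged sketch of N:1 GPU sharing via a queue: three clients submit
# requests to one shared queue served by a single GPU worker thread.
requests = queue.Queue()
results = {}

def gpu_worker():
    # Stands in for one physical GPU draining the shared request queue.
    while True:
        item = requests.get()
        if item is None:          # shutdown sentinel
            break
        client_id, payload = item
        results[client_id] = payload * 2  # placeholder "kernel" work
        requests.task_done()

t = threading.Thread(target=gpu_worker)
t.start()
for cid in range(3):              # three clients sharing one GPU
    requests.put((cid, cid + 1))
requests.join()                   # wait until every request is served
requests.put(None)
t.join()
print(results)  # {0: 2, 1: 4, 2: 6}
```

The same queue gives the control point for fine-grained management: priorities, quotas, and per-client accounting can all be applied where requests are enqueued.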
How to access remote GPU transparently?
- Transparent remote GPU resource access via CUDA API interception: the Asaka GPU Library exports the CUDA APIs and forwards calls to a vgpu over the network fabric
- Fine-grained GPU resource management and control on the vgpu side, including GPU resource chaining
- Intelligent selection of network fabric: TCP or RDMA
- Compatible with existing CUDA applications
(Diagram: CUDA application -> Asaka GPU Library -> network fabric (TCP, RDMA) -> vgpu)
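The interception idea can be sketched as follows: the shim exports the same symbols the application already calls, marshals each call, and ships it to the vgpu server instead of touching local hardware. This is an illustrative sketch, not Asaka's real API; the function names, wire format, and returned pointer value are all hypothetical:

```python
import json

def remote_vgpu(request_bytes):
    # Stand-in for the vgpu server at the far end of the network fabric
    # (TCP or RDMA in the real system). Returns a canned success reply.
    call = json.loads(request_bytes)
    if call["api"] == "cudaMalloc":
        return json.dumps({"status": 0, "ptr": 0x1000}).encode()
    return json.dumps({"status": 0}).encode()

def intercepted(api_name, **args):
    # Shim side: marshal the CUDA call and forward it, so the
    # application needs no code change to run against a remote GPU.
    request = json.dumps({"api": api_name, "args": args}).encode()
    return json.loads(remote_vgpu(request))

# The application "calls CUDA" as usual; the shim makes it remote.
resp = intercepted("cudaMalloc", size=4096)
print(resp["status"], hex(resp["ptr"]))  # 0 0x1000
```

In a native deployment the shim would be a shared library interposed on the real CUDA runtime (e.g. via the dynamic linker), which is what makes the approach compatible with existing CUDA applications.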
How to manage GPU pool and scheduling?
Pooling and managing GPU resources:
- Consul-based: add/remove nodes, failure detection, resurrection
GPU scheduling challenges:
- Two-level dispatching: 1st level finds GPU hosts; 2nd level routes CUDA requests to devices
- GPU server chaining
Global intelligence and smart scheduling:
- Global status monitoring and tracing
- Network topology analysis
- Queue-based priority profiles
- Pluggable scheduling algorithms (Network, Q-Length and H-Res strategies)
(Diagram: client application with Asaka GPU Library -> frontend tasks queue -> smart dispatcher / host dispatcher, backed by a distributed global state store and intelligent analysis)
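The pluggable-strategy idea for first-level dispatch can be sketched as interchangeable functions that each pick a host from the global state store. A minimal sketch, with hypothetical host names and metrics:

```python
# Hedged sketch of pluggable first-level dispatch strategies: each
# strategy maps the same host-state view to a chosen GPU host, so the
# dispatcher can swap them without changing anything else.
def q_length_strategy(hosts):
    # "Q-Length" idea: pick the host with the shortest pending queue.
    # hosts: {host_name: pending request count} from the state store.
    return min(hosts, key=hosts.get)

def network_strategy(hosts, rtt_ms):
    # "Network" idea: prefer the host with the lowest round-trip time.
    return min(hosts, key=lambda h: rtt_ms[h])

hosts = {"gpu-a": 5, "gpu-b": 2, "gpu-c": 7}
print(q_length_strategy(hosts))  # gpu-b (shortest queue)
print(network_strategy(hosts, {"gpu-a": 0.2, "gpu-b": 1.5, "gpu-c": 0.4}))  # gpu-a
```

A heterogeneous-resource ("H-Res") strategy would follow the same shape, scoring hosts by device capabilities instead of queue length or latency.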
How to integrate with a Cloud Native Platform?
Offering allocation strategies as a service:
- High-level abstraction of scheduling functions
- Non-invasive design, easy to integrate with existing schedulers in a Cloud Native Platform
- Applications deployed in the Cloud Native Platform access GPUs transparently
Flow:
1. The cluster manager sends job requests to the Allocation API (global scheduler / XaaS controller), which syncs node attributes & status
2. The scheduler returns allocation strategies
3. The cluster manager executes the job
4. The application transparently consumes GPU as a service
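Steps 1-2 of the flow above amount to: the platform posts a job spec, and the global scheduler answers with a placement. A hedged sketch of what such an allocation call might look like (job fields, node names, and the response shape are all illustrative, not Asaka's real API):

```python
# Hedged sketch of the Allocation API: given a job's GPU demand and the
# synced node status, return an allocation strategy (which nodes to use).
def allocate(job, node_status):
    # Pick nodes with enough free GPUs, up to the number the job needs.
    chosen = [name for name, status in node_status.items()
              if status["free_gpus"] >= job["gpus_per_node"]][: job["nodes"]]
    if len(chosen) < job["nodes"]:
        return {"job": job["id"], "error": "insufficient GPU capacity"}
    return {"job": job["id"], "nodes": chosen}

# Node attributes & status, as synced from the cluster (step 1).
status = {"cell-1": {"free_gpus": 2},
          "cell-2": {"free_gpus": 0},
          "cell-3": {"free_gpus": 4}}
plan = allocate({"id": "job-42", "nodes": 2, "gpus_per_node": 2}, status)
print(plan)  # {'job': 'job-42', 'nodes': ['cell-1', 'cell-3']}
```

Because the platform's own scheduler only consumes this answer, it needs no internal changes, which is the non-invasive part of the design.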
Asaka Demo
Cloud Foundry What is it and why should I care?
Daily Struggle:
- Need to MOVE at a faster pace
- Constant requests: new machines, new databases, new services
- Manual installation processes
- Things crashing
- Developers writing crappy code
Let me solve that for you - a software platform for rapid pace:
- Self-help marketplace
- Automated orchestration
- High availability with scalability
- Enforces good software patterns
- Identical environments for dev, test and production
What is Cloud Foundry?
(Architecture diagram) Components: Cloud Controller, Diego Brain, Diego Cells, Router, Loggregator, Marketplace; DevOps interact via the CLI or Web GUI; users reach applications through the Router.
Cloud Foundry Marketplace
- Fully self-serviced
- Industry-standard services available
- Different service levels
- Chargeback based on usage
- Easy API
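The "easy API" behind the marketplace is a service broker: a broker advertises its services and plans through a catalog endpoint, and the plans are what expose different service levels. A sketch of the catalog document a GPU broker might serve, following the Open Service Broker API shape (the service name, IDs, and plans here are hypothetical):

```python
# Hedged sketch: the catalog payload a hypothetical GPU service broker
# might return to the marketplace. Service/plan names are illustrative.
catalog = {
    "services": [{
        "name": "asaka-gpu",
        "id": "gpu-service-0001",
        "description": "Remote GPU access as a service",
        "bindable": True,               # apps can bind to get credentials
        "plans": [
            {"id": "plan-shared", "name": "shared",
             "description": "N:1 shared GPU, charged back by usage"},
            {"id": "plan-dedicated", "name": "dedicated",
             "description": "Chained GPUs dedicated to one application"},
        ],
    }]
}

# Plans are how the marketplace surfaces different service levels.
print([plan["name"] for plan in catalog["services"][0]["plans"]])  # ['shared', 'dedicated']
```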
Cloud Foundry with Asaka: "Run my app with GPU! I don't care how!"
But why?
Pain points today:
- New requests for GPU
- Manual management and orchestration
- No network traffic control
- Hard to predict budget
- GPUs not released after use
- GPU-dependent apps cannot run on CF
With Asaka:
- Self-service marketplace
- Automated orchestration of the entire stack
- Centralized compute location
- Pay as you go
- Easy GPU service provisioning
- GPU access from Cloud Foundry!
Cloud Foundry with Asaka
(Architecture diagram: the standard Cloud Foundry components - Router, Cloud Controller, Diego Brain, Loggregator, Marketplace, Diego Cells - connected through the XaaS Controller to an Asaka cluster of GPU nodes)
Runtime Stack: Cloud Foundry automates the runtime stack below the application, so DevOps only care about the application itself.
Runtime Architecture: ML Application -> Python Buildpack -> AsakaFS -> Asaka Cluster
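The point of this stack is that the application layer stays untouched: the buildpack and AsakaFS sit beneath the app, so its numeric code runs unchanged whether the GPU is local or remote. As an illustration, a tiny pure-Python matrix multiply stands in for a framework call; under Asaka, the CUDA calls a real framework makes beneath code like this would be intercepted and served by the cluster with no edit to the app:

```python
# Illustrative app-layer code: nothing here knows about Asaka. In a real
# ML app this would be a framework call whose underlying CUDA calls the
# Asaka layers below the buildpack would forward to the remote cluster.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```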
Cloud Foundry with Asaka Demo!
What's Next?
Upcoming Talks
- Cloud Foundry 101: Monday, May 8th, 12pm, Zeno 4602
- 7 Lessons Learned from Implementing DevOps: Tuesday, May 9th, 12pm, Murano 3201A
- Adventure of Pair Programming: Thursday, May 11th, 1pm, Zeno 4602
- DevOps from a Developer Perspective: Thursday, May 11th, 10am, Zeno 4702
Next Steps
- Intelligent scheduling
- Data path
- XaaS (FPGAaaS, etc.)
Reach out and pair with us!
jack.harwood@dell.com
victor.fong@dell.com
Performance
- MemcpyH2D: ~98% of native
- MemcpyD2H: ~25% of native
- Further improvement on D2H: possibly a platform issue; optimize the communication flow between host and client
(Charts: Memcpy H2D and Memcpy D2H throughput for Native, TCP and RDMA across 1MB-32MB transfer sizes)