Distributing Computation to Large GPU Clusters

Stanley Oliver
5 years ago
1 Distributing Computation to Large GPU Clusters

3 What is this about? DiCE: Software library for writing applications scaling to many GPUs and CPUs in a cluster. Used since 2003 in our rendering products: NVIDIA Iray, NVIDIA IndeX

4 Why are we presenting this here? DiCE is a base technology in IndeX. Clustering / networking / distribution based on DiCE. DiCE API exposed by IndeX: distribute pre-computation of data for IndeX, do your own calculations. Nothing in DiCE is specific to rendering.

5 Design Goals Provide a software library which can be used by domain experts to write scalable software for GPU clusters. Not required: low-level parallelization / networking knowledge Not specific to a special domain (e.g. rendering) Easy to use... High performance, meant for interactive applications Other solutions: OpenMP, MPI, UPC,...

6 Unique Combination of Features Simple programming model Ease of deployment / commodity hardware Unified multi-core and cluster parallelization CUDA support Dynamic clustering Focus on interactive applications Multi-user support Genuine distributed system: All hosts are equal

7 Overview Application C++ API Job System Datastore Networking / Clustering

12 DiCE and IndeX Application IndeX C++ API Job System Datastore Networking / Clustering

13 Networking / Clustering Application C++ API Job System Datastore Networking / Clustering

14 Networking / Clustering Handles cluster building and data transfers Self-organizing, dynamic addition and removal of hosts Tested with up to 1000 hosts Several networking protocols for different environments Provides to application List of hosts in cluster; same on all hosts! Notification for new / leaving hosts List of resources in cluster (GPUs, CPUs)

15 Network Layer: UDP with Multicast Self Organization: Multicast address identifies cluster Multicast beacon packets to detect other hosts Election process to elect one synchronizer Synchronizer organizes hosts Multicast / unicast used for bulk data transfers Especially effective for many hosts One layer of sub-clustering
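The slide names an election of one synchronizer but does not describe the rule. The following is a minimal sketch of one possible rule, not DiCE's actual protocol: every host collects the ids seen in beacon packets and treats the lowest id as synchronizer. `Host_id`, `Beacon_view` and the lowest-id rule are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <set>

// Hypothetical host identifier; the slides do not say how hosts are identified.
using Host_id = std::uint32_t;

// The set of hosts one host currently knows about from received beacons.
class Beacon_view {
public:
    // Called when a beacon packet from `sender` arrives.
    void on_beacon(Host_id sender) { m_hosts.insert(sender); }

    // Called when `silent` has not sent a beacon for too long.
    void on_timeout(Host_id silent) { m_hosts.erase(silent); }

    // Assumed election rule: the lowest known id wins. Hosts that have seen
    // the same beacons reach the same answer without exchanging more messages.
    // Returns false if no host is known yet.
    bool synchronizer(Host_id& out) const {
        if (m_hosts.empty()) return false;
        out = *m_hosts.begin();  // std::set keeps ids sorted ascending
        return true;
    }

private:
    std::set<Host_id> m_hosts;
};
```

When the current synchronizer stops sending beacons, `on_timeout` removes it and the next-lowest id takes over, which matches the dynamic addition and removal of hosts described earlier.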

16 Network Layer: TCP with multicast discovery For networks with low bandwidth multicast UDP multicast layer used for discovering hosts TCP used for all data transport

17 Network Layer: TCP with host list For networks which do not support multicast (e.g. AWS) Host list used for building network Does not have to be complete At least one host Still self-organizing and dynamic TCP used for all data transport
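To illustrate why an incomplete host list with at least one reachable host is enough, here is a small in-memory simulation. It is a sketch under assumptions, not the DiCE wire protocol: `Cluster`, `join` and the rule "ask the first reachable seed for its member list" are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Host = std::string;

// Simulated cluster: every running host mapped to the peers it knows.
using Cluster = std::map<Host, std::set<Host>>;

// Join `self` using a possibly incomplete seed list.
// Returns false if none of the seeds is currently running.
bool join(Cluster& cluster, const Host& self, const std::vector<Host>& seeds) {
    for (const Host& seed : seeds) {
        Cluster::const_iterator it = cluster.find(seed);
        if (it == cluster.end()) continue;       // seed not running, try the next one

        std::set<Host> members = it->second;     // the seed's view of the cluster
        members.insert(seed);

        // Announce the new host to every member, then adopt the member list.
        for (const Host& m : members) cluster[m].insert(self);
        cluster[self] = members;
        return true;
    }
    return false;
}
```

A host that only lists `b` as seed still ends up knowing `a`, because `b` passes on its own view. That is the sense in which the list "does not have to be complete".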

18 Network Layer: InfiniBand Native InfiniBand with remote DMA (RDMA) Not a standalone network layer IP based layers used for clustering Most communication over IP layers RDMA used for speeding up bulk data transfer Fastest transmissions > 30 Gbit/s end-to-end

19 Network Layer Not exposed to application! Rely on Datastore and Job System!

20 Datastore Application C++ API Job System Datastore Networking / Clustering

21 Datastore In memory NoSQL datastore for arbitrary C++ objects Store object on some host / retrieve on any host Numeric id / string identify objects Multi-version capability for multi-user Data transport transparent to application

22 Datastore Objects (your class) class My_adder { float m_a; int m_b; float sum() { return m_a + m_b; } };

23 Datastore Objects (arbitrary member variables) class My_adder { float m_a; int m_b; float sum() { return m_a + m_b; } };

24 Datastore Objects (arbitrary member functions) class My_adder { float m_a; int m_b; float sum() { return m_a + m_b; } };

25 Datastore Objects (derive from base class) class My_adder : public Element< UUID > { float m_a; int m_b; float sum() { return m_a + m_b; } };

26 Datastore Objects (implement serialization) class My_adder : public Element< UUID > { float m_a; int m_b; void serialize(iserializer* serializer) { serializer->write(m_a); serializer->write(m_b); } };

27 Datastore Objects (implement deserialization) class My_adder : public Element< UUID > { float m_a; int m_b; void serialize(iserializer* serializer); void deserialize(ideserializer* deserializer) { deserializer->read(m_a); deserializer->read(m_b); } };

28 Datastore Objects (register class) class My_adder : public Element< UUID > { float m_a; int m_b; void serialize(iserializer* serializer); void deserialize(ideserializer* deserializer); }; register_serializable_class< My_adder >();
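The serialize / deserialize pair from the slides can be tried out without DiCE. The sketch below replaces the library's serializer interfaces with two hypothetical stand-ins, `Byte_writer` and `Byte_reader`, that copy plain values into and out of a byte vector. The base class and class registration are omitted, so this only shows the round trip, not how DiCE transports objects.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the serializer on the slides: appends raw bytes.
class Byte_writer {
public:
    template <typename T> void write(const T& value) {
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
        m_bytes.insert(m_bytes.end(), p, p + sizeof(T));
    }
    const std::vector<unsigned char>& bytes() const { return m_bytes; }
private:
    std::vector<unsigned char> m_bytes;
};

// Hypothetical stand-in for the deserializer: reads values back in the same order.
class Byte_reader {
public:
    explicit Byte_reader(const std::vector<unsigned char>& bytes)
        : m_bytes(bytes), m_pos(0) {}
    template <typename T> void read(T& value) {
        if (m_pos + sizeof(T) > m_bytes.size())
            throw std::out_of_range("read past end of buffer");
        std::memcpy(&value, m_bytes.data() + m_pos, sizeof(T));  // plain types only
        m_pos += sizeof(T);
    }
private:
    std::vector<unsigned char> m_bytes;
    std::size_t m_pos;
};

// The class from the slides, without the DiCE base class.
class My_adder {
public:
    My_adder(float a = 0.0f, int b = 0) : m_a(a), m_b(b) {}
    float sum() const { return m_a + m_b; }

    // Members must be written and read in the same order.
    void serialize(Byte_writer* serializer) const {
        serializer->write(m_a);
        serializer->write(m_b);
    }
    void deserialize(Byte_reader* deserializer) {
        deserializer->read(m_a);
        deserializer->read(m_b);
    }
private:
    float m_a;
    int m_b;
};
```

Serializing an object on one side and deserializing the bytes into a default-constructed object on the other side yields an object with the same `sum()`. Raw byte copies like this assume both hosts share endianness and type sizes.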

29 Datastore Accessing object will make sure it is available! Per host cache for objects Store more data in cluster than a single host could Configurable max cache size Redundant storage for handling host failure Configurable redundancy level Automatic rebalancing in case of failure
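A per-host cache with a configurable maximum size needs some eviction policy, which the slide does not name. The sketch below assumes least-recently-used eviction and counts entries rather than bytes. `Host_cache` and its methods are hypothetical names, and a miss simply returns `nullptr` where a real datastore would fetch the object from another host.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-host object cache with LRU eviction (an assumed policy).
class Host_cache {
public:
    explicit Host_cache(std::size_t max_entries) : m_max(max_entries) {}

    // Insert or replace an object; evicts the least recently used entry
    // when the configured maximum is exceeded.
    void put(int id, const std::string& data) {
        Index::iterator it = m_index.find(id);
        if (it != m_index.end()) m_lru.erase(it->second);
        m_lru.push_front(std::make_pair(id, data));
        m_index[id] = m_lru.begin();
        if (m_lru.size() > m_max) {
            m_index.erase(m_lru.back().first);
            m_lru.pop_back();
        }
    }

    // Returns the cached object, or nullptr on a miss.
    const std::string* get(int id) {
        Index::iterator it = m_index.find(id);
        if (it == m_index.end()) return nullptr;
        m_lru.splice(m_lru.begin(), m_lru, it->second);  // mark as most recently used
        return &it->second->second;
    }

    std::size_t size() const { return m_lru.size(); }

private:
    typedef std::list<std::pair<int, std::string> > Entries;
    typedef std::unordered_map<int, Entries::iterator> Index;

    std::size_t m_max;
    Entries m_lru;   // front = most recently used
    Index m_index;   // id -> position in m_lru
};
```

Because each host keeps only a bounded working set, the cluster as a whole can hold more data than any single host, as the slide states.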

35 Datastore Transactions Important for multi-user operation ACID Atomicity: transaction abort, commit, automatic abort in case of failure Consistency: cluster-wide locks available Isolation: starting a transaction freezes its view of the datastore Durability: redundancy

36 Transaction Isolation T11 A X T7

37 Transaction Isolation Isolation based on multi-version capability T11 A 5 X 9 T7

38 Transaction Isolation Isolation based on multi-version capability Copy-on-write T11 A 5 X 9 X 10 T7

40 Transaction Isolation Isolation based on multi-version capability Copy-on-write T11 A 5 X 10
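Isolation through multiple versions and copy-on-write can be sketched with a store that never overwrites: each write adds a new version, and a transaction reads the newest version that existed when it started. This is a simplified illustration, not the DiCE datastore: `Versioned_store` is hypothetical, writes take effect immediately, and commit, abort and cleanup of old versions are left out.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical multi-version store: key -> (version -> value).
class Versioned_store {
public:
    typedef std::uint64_t Version;

    Versioned_store() : m_next(1) {}

    // Starting a transaction freezes its view at the latest existing version.
    Version begin() const { return m_next - 1; }

    // Copy-on-write: a write never replaces an old version, it adds a new one.
    Version write(const std::string& key, int value) {
        m_data[key][m_next] = value;
        return m_next++;
    }

    // Read the newest version of `key` that is not newer than `snapshot`.
    // Returns false if the key did not exist at that point.
    bool read(const std::string& key, Version snapshot, int& out) const {
        Data::const_iterator k = m_data.find(key);
        if (k == m_data.end()) return false;
        Versions::const_iterator v = k->second.upper_bound(snapshot);
        if (v == k->second.begin()) return false;
        out = std::prev(v)->second;
        return true;
    }

private:
    typedef std::map<Version, int> Versions;
    typedef std::map<std::string, Versions> Data;

    Version m_next;
    Data m_data;
};
```

A transaction that started before a write to X keeps reading the old X, while one that starts afterwards sees the new one. Once no running transaction can reach an old version it could be discarded, which this sketch does not do.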

41 Job System Application C++ API Job System Datastore Networking / Clustering

42 Parallelization Model Programmer: split work into n fragments! As independent as possible Small enough, but still efficient Potentially thousands per frame! No a priori knowledge about resources in the cluster! Data transport through datastore Goal: Distribute work over all GPUs / CPUs in cluster

43 Parallelization Model Fragmented job: similar to a CUDA kernel Implement C++ class with at least one function: void execute_fragment(int i, int n) { } Ask DiCE to execute job in n fragments DiCE calls execute_fragment() once for every fragment (i = 0 ... n-1) DiCE assigns CPU core and/or GPU exclusively to fragment Job decides if it needs a GPU Job execution has access to all members and member functions
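The fragmented-job idea can be tried on a single machine with standard threads. The sketch below is an approximation under assumptions: `Fragmented_job` and `execute_fragmented` are hypothetical names, only CPU threads are used, and fragments are handed out through an atomic counter rather than by DiCE's scheduler.

```cpp
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical job interface modeled on execute_fragment(i, n) from the slide.
class Fragmented_job {
public:
    virtual ~Fragmented_job() {}
    virtual void execute_fragment(int i, int n) = 0;
};

// Calls execute_fragment() exactly once for every i in 0 ... n-1, using up
// to `workers` threads. Idle threads claim the next unprocessed fragment.
void execute_fragmented(Fragmented_job& job, int n, unsigned workers) {
    if (workers == 0) workers = 1;
    std::atomic<int> next(0);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&job, &next, n]() {
            for (int i = next.fetch_add(1); i < n; i = next.fetch_add(1))
                job.execute_fragment(i, n);
        });
    }
    for (std::thread& t : pool) t.join();
}

// Example job: sums the integers 0 ... count-1. Each fragment handles its own
// slice and writes only to its own slot, so fragments stay independent.
class Sum_job : public Fragmented_job {
public:
    Sum_job(int count, int n) : m_count(count), m_partial(n, 0) {}

    void execute_fragment(int i, int n) override {
        const long long begin = static_cast<long long>(m_count) * i / n;
        const long long end = static_cast<long long>(m_count) * (i + 1) / n;
        long long sum = 0;
        for (long long v = begin; v < end; ++v) sum += v;
        m_partial[i] = sum;  // access to members, as on the slide
    }

    long long total() const {
        return std::accumulate(m_partial.begin(), m_partial.end(), 0LL);
    }

private:
    int m_count;
    std::vector<long long> m_partial;
};
```

The job does not know how many workers exist, only how many fragments there are. This mirrors the requirement that the programmer has no a priori knowledge about resources.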

44 Parallelization Model - Cluster Not a shared memory model! Idea: Split execution and integration of results void execute_remote(int i, int n, OUT){ } Remote host void receive_result(int i, int n, IN) { } Origin host execute_remote()+receive_result() = execute_fragment()
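The split into execution and integration can be illustrated without a network. In this sketch, which is an assumption-laden simulation rather than the DiCE API, `Payload` stands in for the OUT / IN parameters, a copy of the job plays the remote host, and odd fragments are arbitrarily treated as remote.

```cpp
#include <vector>

// Hypothetical result payload standing in for the OUT / IN parameters.
typedef std::vector<int> Payload;

class Square_job {
public:
    explicit Square_job(int n) : m_results(n, 0) {}

    // Runs on a remote host: computes and writes only into the payload,
    // because the origin host's memory is not shared.
    void execute_remote(int i, int /*n*/, Payload& out) const {
        out.push_back(i * i);
    }

    // Runs on the origin host: integrates a received payload into the job.
    void receive_result(int i, int /*n*/, const Payload& in) {
        m_results[i] = in.at(0);
    }

    // Local execution is both steps back to back.
    void execute_fragment(int i, int n) {
        Payload p;
        execute_remote(i, n, p);
        receive_result(i, n, p);
    }

    const std::vector<int>& results() const { return m_results; }

private:
    std::vector<int> m_results;
};

// Simulated scheduler: odd fragments run "remotely" on a copy of the job,
// even fragments run locally on the origin host.
void run(Square_job& origin, int n) {
    const Square_job remote_copy = origin;  // the job as shipped to another host
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 1) {
            Payload wire;                            // what would cross the network
            remote_copy.execute_remote(i, n, wire);
            origin.receive_result(i, n, wire);
        } else {
            origin.execute_fragment(i, n);
        }
    }
}
```

Whether a fragment ran locally or remotely, the origin job ends up with the same results, which is the equality stated on the slide.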

45 Parallelization Model Single Host My_job: Scene, Camera, Framebuf[ ] Fragment 0: GPU 1 Fragment 1: GPU 1 Fragment 2: GPU 2 Fragment 3: GPU 2 Fragment 4: GPU 1 Fragment 5: GPU 2

46 Parallelization Model Single Host My_job: Scene, Camera, Framebuf[ ] Fragment 0: GPU 1 Fragment 1: GPU 1 Fragment 2: GPU 2 Fragment 3: GPU 2 Fragment 4: GPU 1 Fragment 5: GPU 2 Execute fragment 5 Execute fragment 4 Execute fragment 3 Execute fragment 2 Execute fragment 1 Execute fragment 0

50 Parallelization Model

51 Parallelization Model 3 Hosts Host 1: My_job (Scene, Camera, Framebuf[ ]) Fragment 0: GPU 1, Host 1 Fragment 1: GPU 1, Host 2 Fragment 2: GPU 2, Host 2 Fragment 3: GPU 2, Host 1 Fragment 4: GPU 1, Host 3 Fragment 5: GPU 2, Host 3 Host 2 Host 3

52 Parallelization Model 3 Hosts Host 1: My_job (Scene, Camera, Framebuf[ ]) Fragment 0: GPU 1, Host 1 Fragment 1: GPU 1, Host 2 Fragment 2: GPU 2, Host 2 Fragment 3: GPU 2, Host 1 Fragment 4: GPU 1, Host 3 Fragment 5: GPU 2, Host 3 Host 2: My_job (Scene, Camera) Host 3: My_job (Scene, Camera)

53 Parallelization Model 3 Hosts Host 1: My_job (Scene, Camera, Framebuf[ ]) Fragment 0: GPU 1, Host 1 Fragment 1: GPU 1, Host 2 Fragment 2: GPU 2, Host 2 Fragment 3: GPU 2, Host 1 Fragment 4: GPU 1, Host 3 Fragment 5: GPU 2, Host 3 Host 1: Execute fragment 3, Execute fragment 0 Host 2: My_job (Scene, Camera), Execute remote 2, Execute remote 1 Host 3: My_job (Scene, Camera), Execute remote 5, Execute remote 4

55 Parallelization Model 3 Hosts Host 1: My_job (Scene, Camera, Framebuf[ ]) Fragment 0: GPU 1, Host 1 Fragment 1: GPU 1, Host 2 Fragment 2: GPU 2, Host 2 Fragment 3: GPU 2, Host 1 Fragment 4: GPU 1, Host 3 Fragment 5: GPU 2, Host 3 Host 1: Receive result 5, Receive result 4, Execute fragment 3, Receive result 2, Receive result 1, Execute fragment 0 Host 2: My_job (Scene, Camera), Execute remote 2, Execute remote 1 Host 3: My_job (Scene, Camera), Execute remote 5, Execute remote 4

56 Parallelization Model 3 Hosts

57 Parallelization Model - Hierarchical Viewer Host Compositor Job Compositor Host Compositor Fragment Rendering Job Render Host Rendering Fragment GPU Job GPUs GPU Fragment

58 Other Features More multi-user capabilities (scopes) Futures Global logging system HTTP Server RTMP Video streaming Cloud Bridge...

59 Summary DiCE is a library for writing parallel applications DiCE used in our rendering products Available to those using IndeX

60 Questions?
