The rCUDA technology: an inexpensive way to improve the performance of GPU-based clusters. Federico Silla

1 The rCUDA technology: an inexpensive way to improve the performance of GPU-based clusters. Federico Silla, Technical University of Valencia, Spain

2 The scope of this talk Delft, April /47

3 More flexible use of GPUs. rCUDA: a software technology that enables a more flexible use of GPUs in computing facilities.

4 Overhead introduced by rCUDA. CUDASW++: bioinformatics software for Smith-Waterman protein database searches. Testbed: NVIDIA Tesla K20, Mellanox ConnectX-3 single-port adapters. [Figure: execution time (s) and rCUDA overhead (%) versus sequence length; series: CUDA, rCUDA FDR, rCUDA QDR, rCUDA GbE, plus the FDR, QDR, and GbE overheads; lower is better.] Small overhead when using InfiniBand.
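The overhead percentages on this slide are presumably the relative slowdown of a remote (rCUDA) run with respect to the native CUDA run. A minimal sketch of that calculation follows; the function name and the sample times are illustrative assumptions, not values read from the plot.

```python
def overhead_percent(t_native: float, t_remote: float) -> float:
    """Relative slowdown of the remote run versus the native run, in percent."""
    if t_native <= 0:
        raise ValueError("native execution time must be positive")
    return (t_remote - t_native) / t_native * 100.0


# Illustrative times only (seconds); they are NOT measurements from the talk.
example = overhead_percent(t_native=10.0, t_remote=10.5)
print(f"overhead: {example:.1f}%")
```

A negative result would mean the remote run was faster than the native one.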

5 1: more GPUs for a single application. As many GPUs as there are in the cluster may be provided to a single application.

6 1: more GPUs for a single application

7 1: more GPUs for a single application

8 1: more GPUs for a single application. MonteCarlo Multi-GPU (from NVIDIA SDK). [Figure: two plots, one where higher is better and one where lower is better.]

9 2: increased cluster performance. GPUs can be shared among jobs running in remote clients. [Figure labels: App 1 to App 9.]

10 2: increased cluster performance. Test bench for studying rCUDA performance at cluster level: SLURM used as job scheduler; InfiniBand ConnectX-3 based cluster; dual-socket Intel Xeon E5-2620 v2 nodes: 1 node without GPU, hosting the main SLURM controller, and 8 nodes, each with one NVIDIA K20. Four applications used: LAMMPS, GPU-Blast, mCUDA-MEME, and GROMACS (no GPU). Three workload sizes: small, medium, large.

11 2: increased cluster performance

12 3: less cost with more performance. Let's reduce the number of GPUs in the cluster. [Figure labels: 43% less, 41% less, 42% less.]

13 4: reduced energy consumption

14 Increasing throughput in current clusters. Why rCUDA: the problem with GPU-enabled clusters; the enabler for higher cluster throughput at lower cost; engineering the enabler; final considerations.

15 Increasing throughput in current clusters. Why rCUDA: the problem with GPU-enabled clusters; the enabler for higher cluster throughput at lower cost; engineering the enabler; final considerations.

16 Characteristics of GPU-based clusters. A GPU computing facility is usually a set of independent self-contained nodes that leverage the shared-nothing approach: nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster), and GPUs can only be used within the node they are attached to.

17 First concern with accelerated clusters. Applications can only use the GPUs located within their node. Non-accelerated applications keep GPUs idle in the nodes where they use all the cores: a CPU-only application spreading over these four nodes would make their GPUs unavailable for accelerated applications.

18 Money leakage in current clusters? For some workloads, GPUs may be idle for significant periods of time. Initial acquisition costs are not amortized. Space: GPUs reduce CPU density. Energy: idle GPUs keep consuming power. [Figure: idle power (Watts) over time (s) for a 1-GPU node and a 4-GPU node; label: 25%.] 1-GPU node: two E5-2620 v2 sockets and 32 GB DDR3 RAM, one Tesla K20 GPU. 4-GPU node: two E5-2620 v2 sockets and 128 GB DDR3 RAM, four Tesla K20 GPUs.

19 Second concern with accelerated clusters. Applications can only use the GPUs located within their node: multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if they are idle). [Figure: a multi-GPU application in execution; all the GPUs in the other nodes cannot be used by it.]

20 One more concern with accelerated clusters. Do applications completely squeeze the GPUs present in the cluster? Even if all GPUs are assigned to running applications, computational resources inside GPUs may not be fully used: application presenting a low level of parallelism; CPU code being executed (GPU assigned but not working); GPU-core stall due to lack of data; etc.

21 Sharing a given GPU among jobs. Several GPU-Blast instances concurrently executed on the same GPU. Each instance uses about 1.5 GB of GPU memory.
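How many such instances can share one GPU is bounded, at the very least, by GPU memory. The sketch below computes that memory-only bound. It assumes the per-instance figure and the GPU capacity are expressed in the same unit; the 5 GB capacity is that of a Tesla K20, used here only as an example since the slide does not name the GPU. Compute contention is ignored.

```python
import math


def instances_per_gpu(gpu_memory: float, per_instance_memory: float) -> int:
    """Upper bound on concurrent instances, considering GPU memory only."""
    if gpu_memory < 0 or per_instance_memory <= 0:
        raise ValueError("memory sizes must be positive")
    return math.floor(gpu_memory / per_instance_memory)


# Assumed example: a 5 GB GPU and instances of about 1.5 GB each.
print(instances_per_gpu(gpu_memory=5.0, per_instance_memory=1.5))
```

Whether that many instances actually run well together depends on how busy each one keeps the GPU cores, which this bound says nothing about.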

22 Why is GPU-cluster performance lost? In summary: there are scenarios where GPUs are available but cannot be used, and accelerated applications do not make use of GPUs 100% of the time. In conclusion: we are losing GPU cycles, thus reducing cluster performance.

23 We need something more in the cluster. The current model for using GPUs is too rigid. What is missing is some flexibility for using the GPUs in the cluster.

24 Increasing throughput in current clusters. Why rCUDA: the problem with GPU-enabled clusters; the enabler for higher cluster throughput at lower cost; engineering the enabler; final considerations.

25 What is needed for increased flexibility? Two ingredients are required to cook a higher-throughput GPU-based cluster: a way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization), and enhanced job schedulers that take into account the new shared GPUs.

26 Remote GPU virtualization vision. Remote GPU virtualization allows a new vision of a GPU deployment, moving from the usual cluster configuration to the following one. [Figure: cluster nodes attached to the interconnection.]

27 Remote GPU virtualization vision. [Figure: physical configuration and logical configuration of the cluster, with logical connections over the interconnection.]

28 Busy CPU cores are no longer a problem. [Figure: physical configuration and logical configuration of the cluster, with logical connections over the interconnection.]
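The effect of busy cores can be illustrated with a toy model. In the sketch below, the `Node` class, both helper functions, and the five-node cluster are hypothetical. It assumes a GPU is usable locally only when its own node has a free core, and it ignores any CPU cost on the node that hosts a remote GPU.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    free_cores: int  # CPU cores not taken by running jobs
    free_gpus: int   # GPUs not assigned to any job


def usable_gpus_local(nodes: List[Node]) -> int:
    """Without remote virtualization a GPU is reachable only from its own
    node, so it is usable only if that node still has a free CPU core."""
    return sum(n.free_gpus for n in nodes if n.free_cores > 0)


def usable_gpus_remote(nodes: List[Node]) -> int:
    """With remote virtualization any node with a free core can host the
    application, and every free GPU in the cluster is reachable from it."""
    if not any(n.free_cores > 0 for n in nodes):
        return 0
    return sum(n.free_gpus for n in nodes)


# Hypothetical cluster: a CPU-only job fills four GPU nodes,
# while a fifth node has spare cores but no GPU.
cluster = [Node(f"gpu{i}", free_cores=0, free_gpus=1) for i in range(4)]
cluster.append(Node("cpu0", free_cores=12, free_gpus=0))

print(usable_gpus_local(cluster), usable_gpus_remote(cluster))
```

In this made-up scenario no GPU is usable under the local-only model, while all four become usable once they can be reached over the network.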

29 Multi-GPU applications get benefit. GPU virtualization is also useful for multi-GPU applications. Without GPU virtualization: only the GPUs in the node can be provided to the application. With GPU virtualization: many GPUs in the cluster can be provided to the application, through logical connections over the interconnection.

30 About the second ingredient. Current job schedulers, like SLURM, know about real GPUs but cannot manage virtual GPUs. Enhancing schedulers is required to take effective advantage of GPU virtualization.

31 More about enhanced scheduling. One step further: enhancing the scheduler so that GPU servers are put into low-power sleeping modes as soon as their acceleration features are not required.

32 Enhancing scheduling even more. Going even further: support task migration; consolidate tasks into as few servers as possible.

33 Increasing throughput in current clusters. Why rCUDA: the problem with GPU-enabled clusters; the enabler for higher cluster throughput at lower cost; engineering the enabler (I); final considerations.

34 Basics of the rCUDA framework. Basic CUDA behavior.

35 Basics of the rCUDA framework. rCUDA is binary compatible with CUDA 6.5.

36 Bandwidth is a concern for rCUDA. [Figures: performance of pinned memory; performance of pageable memory.]

37 Performance of applications with rCUDA. CUDA-MEME application. Testbed: NVIDIA Tesla K40, Mellanox ConnectX-3 single-port (FDR) and Connect-IB adapters. [Figure label: 0.19%; lower is better.]

38 Increasing throughput in current clusters. Why rCUDA: the problem with GPU-enabled clusters; the enabler for higher cluster throughput at lower cost; engineering the enabler (II); final considerations.

39 Integrating rCUDA with SLURM. SLURM (Simple Linux Utility for Resource Management) is the job scheduler. SLURM does not understand virtualized GPUs, so a new GRES (generic resource) is added in order to manage them. Where the GPUs are in the system is completely transparent to the user. In the job script, or in the submission command, the user specifies the number of rGPUs (remote GPUs) required by the job. The amount of GPU memory required by the job may also be specified.
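The request described above (a number of rGPUs plus an optional memory amount per GPU) can be pictured with a toy allocator over a cluster-wide GPU pool. This is only a sketch under stated assumptions: it is not the SLURM GRES plugin nor any rCUDA interface, and the `GPU` class, the `allocate_rgpus` function, the packing policy, and the 5000 MB capacities are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GPU:
    node: str            # node that physically hosts the GPU
    index: int           # GPU index within that node
    total_mem_mb: int
    used_mem_mb: int = 0

    @property
    def free_mem_mb(self) -> int:
        return self.total_mem_mb - self.used_mem_mb


def allocate_rgpus(pool: List[GPU], count: int, mem_mb: int) -> Optional[List[GPU]]:
    """Reserve mem_mb on `count` distinct GPUs anywhere in the cluster.

    All-or-nothing: returns None and leaves the pool untouched when fewer
    than `count` GPUs have enough free memory.
    """
    candidates = [g for g in pool if g.free_mem_mb >= mem_mb]
    if len(candidates) < count:
        return None
    # Assumed policy: pack onto the fullest GPUs first so empty ones stay free.
    candidates.sort(key=lambda g: g.free_mem_mb)
    chosen = candidates[:count]
    for g in chosen:
        g.used_mem_mb += mem_mb
    return chosen


# Hypothetical two-node pool; the job never names a node, only what it needs.
pool = [GPU("node1", 0, 5000), GPU("node2", 0, 5000)]
job_a = allocate_rgpus(pool, count=1, mem_mb=1500)
job_b = allocate_rgpus(pool, count=1, mem_mb=1500)   # shares the GPU used by job_a
job_c = allocate_rgpus(pool, count=2, mem_mb=4000)   # cannot be satisfied right now
```

The job only states how many GPUs and how much memory it needs; which node hosts each GPU is decided by the allocator, which is the transparency the slide refers to.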

40 The basic idea about SLURM

41 The basic idea about SLURM + rCUDA. GPUs are decoupled from nodes. All jobs are executed in less time.

42 Sharing remote GPUs among jobs. GPU 0 is scheduled to be shared among jobs. GPUs are decoupled from nodes. All jobs are executed in even less time.

43 Cluster performance with rCUDA+SLURM

44 Cluster performance with rCUDA+SLURM. Let's reduce the number of GPUs in the cluster. [Figure labels: 43% less, 42% less, 41% less.]

45 Increasing throughput in current clusters. Why rCUDA: the problem with GPU-enabled clusters; the enabler for higher cluster throughput at lower cost; engineering the enabler; final considerations.

46 rCUDA is the enabling technology for: High Throughput Computing: sharing remote GPUs makes applications execute slower, BUT more throughput (jobs/time) is achieved; datacenter administrators can choose between HPC and HTC. Green Computing: GPU migration and application migration make it possible to devote just the required computing resources to the current workload. More flexible system upgrades: GPU and CPU updates become independent from each other; attaching GPU boxes to non GPU-enabled clusters is possible.
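The trade-off between per-application speed and throughput (jobs/time) can be made concrete with a little arithmetic. The sketch below uses made-up durations purely for illustration; they are not results from the talk, and the function name is an assumption.

```python
def throughput(jobs_completed: int, makespan_s: float) -> float:
    """Jobs completed per second over the whole run."""
    if makespan_s <= 0:
        raise ValueError("makespan must be positive")
    return jobs_completed / makespan_s


# Hypothetical scenario with two jobs and one GPU.
# Exclusive use: each job takes 100 s and they run one after the other.
exclusive = throughput(jobs_completed=2, makespan_s=200.0)
# Shared use: both jobs run concurrently and each is slowed down to 150 s.
shared = throughput(jobs_completed=2, makespan_s=150.0)

print(f"exclusive: {exclusive:.4f} jobs/s, shared: {shared:.4f} jobs/s")
```

In this invented case each job runs 50% longer when sharing the GPU, yet the pair finishes sooner, so throughput rises. That is the HPC versus HTC choice left to the administrator.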

47 Get a free copy of rCUDA at [URL not preserved]. More than 500 requests. @rcuda_. The rCUDA team: Carlos Reaño, Javier Prades, Fernando Campos, Rocío Alegre, Federico Silla, José Duato, Antonio Peña (1). (1) Former student, now at Argonne National Lab. (USA)
