Increasing the efficiency of your GPU-enabled cluster with rcuda. Federico Silla, Technical University of Valencia, Spain

1 Increasing the efficiency of your GPU-enabled cluster with rcuda. Federico Silla, Technical University of Valencia, Spain

2 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

3 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

4 Current computing needs: Many applications require a lot of computing resources, so execution time is usually increased. Applications are accelerated to get their execution time reduced. GPU computing has experienced a remarkable growth in the last years.

5 GPUs reduce energy and time. GPU-Blast: accelerated version of NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool. Command: blastp db sorted_env_nr query SequenceLength_ txt -num_threads X -gpu [t f]. Test node: dual socket E v2 Intel Xeon with an NVIDIA K20. (Chart: GPU-enabled runs achieve speedups of 2.36x and 3.56x.)

6 Current computing facilities: The basic building block is a node with one or more GPUs.

7 Current computing facilities. From the programming point of view: a set of nodes, each one with one or more CPUs (with several cores per CPU) and one or more GPUs (typically between 1 and 4), plus an interconnection network.

8 Current computing facilities: A computing facility is usually a set of independent, self-contained nodes that leverage the shared-nothing approach. Nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster). GPUs can only be used within the node they are attached to.

9 Money leakage in current clusters? For many workloads, GPUs may be idle for long periods of time: initial acquisition costs are not amortized; space: GPUs reduce density; energy: idle GPUs keep consuming power. (Chart: idle power (Watts) over time (s) for a 1 GPU node, 2 E5-2620V2 sockets, 32GB DDR3 RAM, one Tesla K20, and a 4 GPU node, 2 E5-2620V2 sockets, 128GB DDR3 RAM, four Tesla K20 GPUs.)

10 Further concerns in accelerated clusters: Applications can only use the GPUs located within their node. Non-accelerated applications keep GPUs idle in the nodes when they use all the cores. Multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if they are idle).

11 We need something else in the cluster: What is missing is some flexibility for using the GPUs in the cluster.

12 Increasing flexibility. How to address current concerns: a way of addressing the idle GPU concern is to share the GPUs present in the cluster among all the nodes; once GPUs are shared, their amount can be reduced. This would increase GPU utilization and lower power consumption, at the same time reducing initial acquisition costs.

13 What is needed for increased flexibility? This new cluster configuration requires: a way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization), and enhanced job schedulers that take into account the new virtual GPUs.

14 Remote GPU virtualization vision: Remote GPU virtualization allows a new vision of a GPU deployment, moving from the usual cluster configuration to the following one.

15 Remote GPU virtualization vision. (Diagram: physical configuration vs. logical configuration, with logical connections across the interconnection network.)

16 Busy cores are no longer a problem. (Diagram: physical configuration vs. logical configuration, with logical connections across the interconnection network.)

17 Multi-GPU applications also benefit: GPU virtualization is also useful for multi-GPU applications. Without GPU virtualization, only the GPUs in the node can be provided to the application; with GPU virtualization, many GPUs in the cluster can be provided to the application. (Diagram: logical connections across the interconnection network.)

18 Remote GPU virtualization vision: GPU virtualization allows all nodes to access all GPUs. (Diagram: real local GPUs without GPU virtualization vs. virtualized remote GPUs with GPU virtualization.)

19 More about reducing energy consumption One step further: enhancing the scheduling process so that servers are put into low-power sleeping modes as soon as their acceleration features are not required

20 GPU boxes. Going even beyond: consolidating GPUs into dedicated boxes (no CPU power required), allowing GPU task migration. TRUE GREEN COMPUTING

21 GPU task migration: Box A has 4 GPUs and only one is busy; Box B has 8 GPUs but only two are busy. Move jobs from Box B to Box A and switch off Box B. Migration should be transparent to applications (decided by the global scheduler). TRUE GREEN COMPUTING

22 GPU task migration (cont.)

23 Problem with remote GPU virtualization: The main GPU virtualization drawback is the increased latency and reduced bandwidth to the remote GPU. (Chart: influence of data transfers for SGEMM, time devoted to data transfers (%) vs. matrix size, for pinned and non-pinned memory; data from a matrix-matrix multiplication using a local GPU!!!)

24 Remote GPU virtualization frameworks. Several efforts have been made regarding GPU virtualization during the last years. Publicly available: rCUDA (CUDA 6.0), GVirtuS (CUDA 3.2), DS-CUDA (CUDA 4.1). NOT publicly available: vCUDA (CUDA 1.1), GViM (CUDA 1.1), GridCUDA (CUDA 2.3), V-GPU (CUDA 4.0).

25 Remote GPU virtualization frameworks. Performance comparison of GPU virtualization solutions on: Intel Xeon E (6 cores) 2.0 GHz (Sandy Bridge architecture), NVIDIA Tesla K20, Mellanox ConnectX-3 single-port InfiniBand adapter (FDR), SX6025 Mellanox switch, CentOS, Mellanox OFED. Latency (measured by transferring 64 bytes):
            Pageable H2D   Pinned H2D   Pageable D2H   Pinned D2H
CUDA        34.3           4.3          16.2           5.2
rCUDA       94.5           23.1         292.2          6.0
GVirtuS     184.2          200.3        168.4          182.8
DS-CUDA     45.9           -            26.5           -

26 Remote GPU virtualization frameworks. (Charts: bandwidth (MB/sec) of a copy between GPU and CPU memories vs. transfer size (MB), for Host-to-Device and Device-to-Host transfers with pageable and pinned memory.)

27 Applications tested with rcuda. rcuda has been successfully tested with the following applications: NVIDIA CUDA SDK samples, LAMMPS, WideLM, CUDASW++, OpenFOAM, HOOMD-blue, mCUDA-MEME, GPU-Blast, GROMACS, GAMESS, DL-POLY, HPL. The list keeps growing.

28 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

29 Basics of the rcuda framework: A framework enabling a CUDA-based application running in one (or some) node(s) to access GPUs in other nodes. It is useful for applications that do not make use of GPUs all the time (moderate level of data parallelism) and for multi-GPU computing.

30 Basics of the rcuda framework Basic CUDA behavior

31 Basics of the rcuda framework

32 Basics of the rcuda framework

33 Basics of the rcuda framework

34 How to declare remote GPUs: Environment variables are properly initialized on the client side and used by the rcuda client (transparently to the application). They specify the server name/IP address and the amount of GPUs exposed to applications.
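
As an illustration only (the parsing below is hypothetical, not rcuda's actual code; the variable names are those shown later on slide 56), a client-side library could pick up the remote GPUs like this:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical sketch: read the RCUDA_DEVICE_* variables set on the
       client side (variable names as listed on slide 56). */
    int main(void)
    {
        const char *count_str = getenv("RCUDA_DEVICE_COUNT");
        int count = count_str ? atoi(count_str) : 0;

        for (int i = 0; i < count; i++) {
            char name[32], server[256];
            int gpu = 0;
            snprintf(name, sizeof(name), "RCUDA_DEVICE_%d", i);
            const char *val = getenv(name);   /* e.g. "rcu16:0" = server : GPU */
            if (val && sscanf(val, "%255[^:]:%d", server, &gpu) == 2)
                printf("remote device %d -> server %s, GPU %d\n", i, server, gpu);
        }
        return 0;
    }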

35 Basics of the rcuda framework: rcuda uses a proprietary communication protocol. Example: 1) initialization, 2) memory allocation on the remote GPU, 3) CPU to GPU memory transfer of the input data, 4) kernel execution, 5) GPU to CPU memory transfer of the results, 6) GPU memory release, 7) communication channel closing and server process finalization.
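
The client-side view of this sequence is just ordinary CUDA code; rcuda intercepts the calls and forwards them to the remote server. A minimal sketch (hypothetical kernel, sizes and names, not taken from the slides) whose calls map onto steps 1-7:

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *v, int n)          /* hypothetical kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, bytes);                              /* 1) initialization (first CUDA call) and 2) memory allocation on the remote GPU */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);    /* 3) CPU to GPU memory transfer of the input data */
        scale<<<(n + 255) / 256, 256>>>(d, n);              /* 4) kernel execution */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);    /* 5) GPU to CPU memory transfer of the results */
        printf("h[0] = %f\n", h[0]);
        cudaFree(d);                                        /* 6) GPU memory release */
        cudaDeviceReset();                                  /* 7) communication channel closing and server process finalization */
        free(h);
        return 0;
    }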

36 rcuda presents a modular architecture

37 rcuda uses optimized transfers. rcuda features optimized communications: use of GPUDirect RDMA to move data between GPUs, pipelined transfers to improve performance, preallocated pinned memory buffers, and an optimal pipeline block size.
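
The pipelining idea can be sketched as follows (illustrative only, not rcuda's internal code; in rcuda the data arrives from the network rather than from a local pageable buffer, and the block size is the one discussed on the next slide). Two preallocated pinned staging buffers are used so that filling one buffer overlaps with the asynchronous DMA of the other to the GPU:

    #include <cuda_runtime.h>
    #include <string.h>

    /* Hypothetical pipelined host-to-device copy through two preallocated
       pinned staging buffers of size 'block'. */
    void pipelined_h2d(void *dst_dev, const char *src_host, size_t total, size_t block)
    {
        char *stage[2];
        cudaStream_t stream[2];
        for (int i = 0; i < 2; i++) {
            cudaMallocHost((void **)&stage[i], block);   /* preallocated pinned memory buffers */
            cudaStreamCreate(&stream[i]);
        }
        size_t off = 0;
        int i = 0;
        while (off < total) {
            size_t chunk = (total - off < block) ? (total - off) : block;
            cudaStreamSynchronize(stream[i]);            /* wait until this staging buffer is free again */
            memcpy(stage[i], src_host + off, chunk);     /* fill the pinned buffer (in rcuda: data received from the network) */
            cudaMemcpyAsync((char *)dst_dev + off, stage[i], chunk,
                            cudaMemcpyHostToDevice, stream[i]);   /* overlapped DMA to the GPU */
            off += chunk;
            i ^= 1;
        }
        for (int j = 0; j < 2; j++) {
            cudaStreamSynchronize(stream[j]);
            cudaStreamDestroy(stream[j]);
            cudaFreeHost(stage[j]);
        }
    }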

38 Basic performance analysis: pipeline block size for InfiniBand FDR. NVIDIA Tesla K20; Mellanox ConnectX-3 + SX6025 Mellanox switch. It was 2MB with IB QDR.

39 Basic performance analysis. (Charts: bandwidth of a copy between GPU and CPU memories, for Host-to-Device pinned memory and Host-to-Device pageable memory.)

40 Basic performance analysis Latency study: copy a small dataset (64 bytes)

41 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

42 Performance of the rcuda framework. Test system: Intel Xeon E (6 cores) 2.0 GHz (Sandy Bridge architecture), NVIDIA Tesla K20, Mellanox ConnectX-3 single-port InfiniBand adapter (FDR), SX6025 Mellanox switch, Cisco SLM2014 switch (1Gbps Ethernet), CentOS, Mellanox OFED.

43 Single-GPU applications: test of CUDA SDK examples. (Chart: normalized execution time of CUDA SDK examples for CUDA, rcuda over FDR InfiniBand, and rcuda over 1Gbps Ethernet.)

44 Single-GPU applications: CUDASW++, bioinformatics software for Smith-Waterman protein database searches. (Chart: execution time (s) and rcuda overhead (%) vs. sequence length, for CUDA and for rcuda over FDR, QDR, and GbE.)

45 Single-GPU applications: GPU-Blast, accelerated version of the NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool. (Chart: execution time (s) and rcuda overhead (%) vs. sequence length, for CUDA and for rcuda over FDR, QDR, and GbE.)

46 Multi-GPU applications. Test system for multi-GPU applications: For CUDA tests, one node with 2 quad-core Intel Xeon E5440 processors and a Tesla S2050 (4 Tesla GPUs); each thread (1-4) uses one GPU. For rcuda tests, 8 nodes, each with 2 quad-core Intel Xeon E5520 processors and 1 NVIDIA Tesla C2050, connected by InfiniBand QDR; the test runs in one node and uses up to all the GPUs from the others; each thread (1-8) uses one GPU.
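
A minimal sketch of this thread-per-GPU pattern (hypothetical code, using OpenMP for the host threads): each thread binds to one device with cudaSetDevice, and with rcuda the devices reported by cudaGetDeviceCount may transparently live in other nodes.

    #include <cuda_runtime.h>
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);      /* under rcuda this counts the remote GPUs exposed to this node */
        if (ngpus < 1) return 1;

        #pragma omp parallel num_threads(ngpus)
        {
            int id = omp_get_thread_num();
            cudaSetDevice(id);           /* each thread uses one GPU, local or remote */
            void *buf = NULL;
            cudaMalloc(&buf, 1 << 20);   /* per-GPU work would go here */
            cudaFree(buf);
            printf("thread %d bound to device %d\n", id, id);
        }
        return 0;
    }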

47 Multi-GPU applications: MonteCarloMultiGPU (from the NVIDIA SDK).

48 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

49 Integrating rcuda with SLURM. SLURM (Simple Linux Utility for Resource Management) is a job scheduler. SLURM does not know about virtualized GPUs, so a new GRES (generic resource) is added in order to manage virtualized GPUs. Where the GPUs are in the system is completely transparent to the user. In the job script, or in the submission command, the user specifies the number of rGPUs (remote GPUs) required by the job; the amount of GPU memory required by the job may also be specified.

50 The basic idea about SLURM

51 The basic idea about SLURM + rcuda: GPUs are decoupled from nodes; all jobs are executed in less time.

52 Sharing remote GPUs among jobs: GPU 0 is scheduled to be shared among jobs; GPUs are decoupled from nodes; all jobs are executed in even less time.

53 SLURM configuration (slurm.conf):
ClusterName=rcu
GresTypes=gpu,rgpu
NodeName=rcu16 NodeAddr=rcu16 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7990 Gres=rgpu:4,gpu:4

54 SLURM configuration (scontrol show node):
NodeName=rcu16 Arch=x86_64 CoresPerSocket=4 CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null) Gres=rgpu:4,gpu:4
NodeAddr=rcu16 NodeHostName=rcu16 OS=Linux RealMemory=7682 Sockets=1 State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
BootTime= T18:45:35 SlurmdStartTime= T10:02:04

55 SLURM configuration (gres.conf):
Name=rgpu File=/dev/nvidia0 Cuda=2.1 Mem=1535m
Name=rgpu File=/dev/nvidia1 Cuda=3.5 Mem=1535m
Name=rgpu File=/dev/nvidia2 Cuda=1.2 Mem=1535m
Name=rgpu File=/dev/nvidia3 Cuda=1.2 Mem=1535m
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

56 Submitting jobs with SLURM. Submit a job:
$ srun -N1 --gres=rgpu:4:512m script.sh
Environment variables are initialized by SLURM and used by the rcuda client (transparently to the user):
RCUDA_DEVICE_COUNT=4
RCUDA_DEVICE_0=rcu16:0
RCUDA_DEVICE_1=rcu16:1
RCUDA_DEVICE_2=rcu16:2
RCUDA_DEVICE_3=rcu16:3
(each RCUDA_DEVICE_n value has the form server name/IP address : GPU)

57 Integrating rcuda with SLURM. Test bench for checking SLURM correctness: an old heterogeneous cluster with 28 nodes: 24 nodes without GPU, with 2 AMD Opteron 1.4GHz and 2GB RAM; 1 node without GPU, quad-core i7 3.5GHz, 8GB RAM; 3 nodes with GeForce GTX 670 or 590 GPUs (2GB), quad-core i7 3GHz; 1Gbps Ethernet interconnect; Scientific Linux release 6.1 and also openSUSE 11.2 (dual boot).

58 Integrating rcuda with SLURM. Test bench for analyzing SLURM+rCUDA performance: InfiniBand ConnectX-3 based cluster, CentOS 6.4 Linux, dual socket E v2 Intel Xeon based nodes: 3 nodes without GPU, 1 node with an NVIDIA K40, 10 nodes with an NVIDIA K20, 1 node with 4 NVIDIA K20 GPUs. First tests carried out (more complex tests are on the way) used 1 node hosting the main SLURM controller, 1 node with the K20, 1 node without GPU, and 1 node with the K40.

59 GPU-Blast using the K40 remotely. (Chart: theoretical, exclusive, and shared executions for CUDA, rcuda over Ethernet (Eth), and rcuda over InfiniBand (IB).)

60 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

61 Why rcuda with virtual machines? Current clusters frequently leverage virtual machines in order to attain energy and cost reductions; Xen and KVM virtual machines are commonly used. Applications being executed inside virtual machines need access to GPU computing resources. Different Xen and KVM virtual machines can share an InfiniBand network card thanks to the virtualization features of the InfiniBand driver. NVIDIA drivers do not provide such virtualization features, thus preventing the concurrent usage of the GPU from several virtual machines. rcuda can be used to provide concurrent access to GPUs.

62 How to connect an IB card to a VM: PCI pass-through is used to assign an IB card (either a real card or a virtual function, VF) to a given VM; the IB card manages the several virtual copies. (Diagram: real IB card and virtual IB cards (VFs).)

63 Performance analysis on KVM VMs: RDMA write between a computer with KVM VMs and a remote computer. (Charts: bandwidth (MB/s), normalized latency, and normalized bandwidth vs. transfer size (Bytes). RealX3: actual ConnectX-3 (1 port) card; VF: virtual copy used from the host OS; VM: virtual card copy used from the VM.)

64 CUDA BWtest from a KVM VM: computer with KVM VMs accessing a remote computer with a Tesla K20.

65 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

66 rcuda and low-power processors. There is a clear trend to improve the performance-power ratio in HPC facilities: by leveraging accelerators (Tesla K20 by NVIDIA, FirePro by AMD, Xeon Phi by Intel, etc.) and, more recently, by using low-power processors (Intel Atom, ARM, etc.). rcuda allows attaching a pool of accelerators to a computing facility: fewer accelerators than nodes, accelerators shared among nodes, and a performance-power ratio probably improved further. How interesting is a heterogeneous platform leveraging powerful processors, low-power ones, and remote GPUs? Which is the best configuration? Is energy effectively saved?

67 rcuda and low-power processors. The computing platforms analyzed in this study are: KAYLA: NVIDIA Tegra 3 ARM Cortex-A9 quad-core (1.4 GHz), 2GB DDR3 RAM and Intel 82574L Gigabit Ethernet controller. ATOM: Intel Atom quad-core S1260 (2.0 GHz), 8GB DDR3 RAM and Intel I350 Gigabit Ethernet controller; no PCIe connector for a GPU. XEON: Intel Xeon X3440 quad-core (2.4GHz), 8GB DDR3 RAM and 82574L Gigabit Ethernet controller. The accelerators analyzed are: CARMA: NVIDIA Quadro 1000M (96 cores) with 2GB DDR5 RAM; the rest of the system is the same as for KAYLA. FERMI: NVIDIA GeForce GTX480 Fermi (448 cores) with 1,280 MB DDR3/GDDR5 RAM. (Photo: CARMA platform.)

68 rcuda and low-power processors. 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom).

69 rcuda and low-power processors, 1st analysis (cont.): The XEON+FERMI combination achieves the best performance.

70 rcuda and low-power processors, 1st analysis (cont.): LAMMPS makes a more intensive use of the GPU than CUDASW++.

71 rcuda and low-power processors, 1st analysis (cont.): The lower performance of the PCIe bus in KAYLA reduces performance.

72 rcuda and low-power processors, 1st analysis (cont.): The 96 cores of CARMA provide a noticeably lower performance than a more powerful GPU.

73 rcuda and low-power processors, 1st analysis (cont.): CARMA requires less power.

74 rcuda and low-power processors, 1st analysis (cont.): When execution time and required power are combined into energy, XEON+FERMI is more efficient.

75 rcuda and low-power processors, 1st analysis, summing up: if execution time is a concern, use XEON+FERMI; if power is to be minimized, CARMA is a good choice; the lower performance of KAYLA's PCIe decreases the interest of this option for single-system local-GPU usage; if energy must be minimized, XEON+FERMI is the right choice.

76 rcuda and low-power processors. 2nd analysis: use of remote GPUs. The Quadro within the CARMA system is discarded as rcuda server due to its poor performance; only the FERMI is considered. Six different combinations of rcuda clients and servers are feasible (only one client system and one server system are used in each experiment):
Combination   Client   Server
1             KAYLA    KAYLA+FERMI
2             ATOM     KAYLA+FERMI
3             XEON     KAYLA+FERMI
4             KAYLA    XEON+FERMI
5             ATOM     XEON+FERMI
6             XEON     XEON+FERMI

77 rcuda and low-power processors. 2nd analysis: use of remote GPUs; CUDASW++. (Charts: KAYLA, ATOM and XEON clients against KAYLA+FERMI and XEON+FERMI servers.)

78 rcuda and low-power processors, 2nd analysis (cont.): The KAYLA-based server presents a much higher execution time.

79 rcuda and low-power processors, 2nd analysis (cont.): The network interface for KAYLA delivers much less than the expected 1Gbps bandwidth.

80 rcuda and low-power processors, 2nd analysis (cont.): The lower bandwidth of KAYLA (and ATOM) can also be noticed when using the XEON server.

81 rcuda and low-power processors, 2nd analysis (cont.): Most of the power and energy is required by the server side, as it hosts the GPU.

82 rcuda and low-power processors, 2nd analysis (cont.): The much shorter execution time of XEON-based servers renders more energy-efficient systems.

83 rcuda and low-power processors. 2nd analysis: use of remote GPUs; LAMMPS. Combinations tested:
Combination   Client   Server
1             KAYLA    KAYLA+FERMI
2             KAYLA    XEON+FERMI
3             ATOM     XEON+FERMI
4             XEON     XEON+FERMI

84 rcuda and low-power processors, 2nd analysis (cont.): Similar conclusions to CUDASW++ apply to LAMMPS, due to the network bottleneck.

85 rcuda and low-power processors. Atom-based systems with PCIe 2.0 x8 may hold an InfiniBand card; InfiniBand QDR adapters would noticeably increase network bandwidth.

86 Summary about rcuda. rcuda is the enabling technology for: High Throughput Computing: using remote GPUs does not make applications execute faster, and sharing remote GPUs makes applications execute slower, BUT more throughput (jobs/time) is achieved; datacenter administrators can choose between HPC and HTC. Green Computing: GPU migration and application migration allow devoting just the required computing resources to the current load. More flexible system upgrades: GPU and CPU updates become independent from each other; adding GPU boxes to non-GPU-enabled clusters is possible.

87 You can get a free copy of rcuda at
