Increasing the efficiency of your GPU-enabled cluster with rcuda. Federico Silla, Technical University of Valencia, Spain

1 Increasing the efficiency of your GPU-enabled cluster with rcuda. Federico Silla, Technical University of Valencia, Spain

2 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

3 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

4 Current computing needs: Many applications require a lot of computing resources, so execution time is usually increased. Applications are accelerated to get their execution time reduced. GPU computing has experienced a remarkable growth in the last years.

5 GPUs reduce energy and time. GPU-Blast: accelerated version of NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool. Command: blastp db sorted_env_nr query SequenceLength_ txt -num_threads X -gpu [t f]. Test node: dual socket E v2 Intel Xeon with an NVIDIA K20. (Chart: GPU-enabled runs achieve speedups of 2.36x and 3.56x.)

6 Current computing facilities: The basic building block is a node with one or more GPUs.

7 Current computing facilities. From the programming point of view: a set of nodes, each one with one or more CPUs (with several cores per CPU) and one or more GPUs (typically between 1 and 4), plus an interconnection network.

8 Current computing facilities: A computing facility is usually a set of independent, self-contained nodes that leverage the shared-nothing approach. Nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster). GPUs can only be used within the node they are attached to.

9 Money leakage in current clusters? For many workloads, GPUs may be idle for long periods of time: initial acquisition costs are not amortized; space: GPUs reduce density; energy: idle GPUs keep consuming power. (Chart: idle power (Watts) over time (s) for a 1 GPU node, 2 E5-2620V2 sockets, 32GB DDR3 RAM, one Tesla K20, and a 4 GPU node, 2 E5-2620V2 sockets, 128GB DDR3 RAM, four Tesla K20 GPUs.)

10 Further concerns in accelerated clusters: Applications can only use the GPUs located within their node. Non-accelerated applications keep GPUs idle in the nodes when they use all the cores. Multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if they are idle).

11 We need something else in the cluster: What is missing is some flexibility for using the GPUs in the cluster.

12 Increasing flexibility. How to address current concerns: a way of addressing the idle GPU concern is to share the GPUs present in the cluster among all the nodes; once GPUs are shared, their amount can be reduced. This would increase GPU utilization and lower power consumption, at the same time reducing initial acquisition costs.

13 What is needed for increased flexibility? This new cluster configuration requires: a way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization), and enhanced job schedulers that take into account the new virtual GPUs.

14 Remote GPU virtualization vision: Remote GPU virtualization allows a new vision of a GPU deployment, moving from the usual cluster configuration to the following one.

15 Remote GPU virtualization vision. (Diagram: physical configuration vs. logical configuration, with logical connections across the interconnection network.)

16 Busy cores are no longer a problem. (Diagram: physical configuration vs. logical configuration, with logical connections across the interconnection network.)

17 Multi-GPU applications also benefit: GPU virtualization is also useful for multi-GPU applications. Without GPU virtualization, only the GPUs in the node can be provided to the application; with GPU virtualization, many GPUs in the cluster can be provided to the application. (Diagram: logical connections across the interconnection network.)

18 Remote GPU virtualization vision: GPU virtualization allows all nodes to access all GPUs. (Diagram: real local GPUs without GPU virtualization vs. virtualized remote GPUs with GPU virtualization.)

19 More about reducing energy consumption One step further: enhancing the scheduling process so that servers are put into low-power sleeping modes as soon as their acceleration features are not required

20 GPU boxes. Going even beyond: consolidating GPUs into dedicated boxes (no CPU power required), allowing GPU task migration. TRUE GREEN COMPUTING

21 GPU task migration: Box A has 4 GPUs and only one is busy; Box B has 8 GPUs but only two are busy. Move jobs from Box B to Box A and switch off Box B. Migration should be transparent to applications (decided by the global scheduler). TRUE GREEN COMPUTING

22 GPU task migration (cont.)

23 Problem with remote GPU virtualization: The main GPU virtualization drawback is the increased latency and reduced bandwidth to the remote GPU. (Chart: influence of data transfers for SGEMM, time devoted to data transfers (%) vs. matrix size, for pinned and non-pinned memory; data from a matrix-matrix multiplication using a local GPU!!!)

24 Remote GPU virtualization frameworks. Several efforts have been made regarding GPU virtualization during the last years. Publicly available: rCUDA (CUDA 6.0), GVirtuS (CUDA 3.2), DS-CUDA (CUDA 4.1). NOT publicly available: vCUDA (CUDA 1.1), GViM (CUDA 1.1), GridCUDA (CUDA 2.3), V-GPU (CUDA 4.0).

25 Remote GPU virtualization frameworks. Performance comparison of GPU virtualization solutions on: Intel Xeon E (6 cores) 2.0 GHz (Sandy Bridge architecture), NVIDIA Tesla K20, Mellanox ConnectX-3 single-port InfiniBand adapter (FDR), SX6025 Mellanox switch, CentOS, Mellanox OFED. Latency (measured by transferring 64 bytes):
            Pageable H2D   Pinned H2D   Pageable D2H   Pinned D2H
CUDA        34.3           4.3          16.2           5.2
rCUDA       94.5           23.1         292.2          6.0
GVirtuS     184.2          200.3        168.4          182.8
DS-CUDA     45.9           -            26.5           -

26 Remote GPU virtualization frameworks. (Charts: bandwidth (MB/sec) of a copy between GPU and CPU memories vs. transfer size (MB), for Host-to-Device and Device-to-Host transfers with pageable and pinned memory.)

27 Applications tested with rcuda. rcuda has been successfully tested with the following applications: NVIDIA CUDA SDK samples, LAMMPS, WideLM, CUDASW++, OpenFOAM, HOOMD-blue, mCUDA-MEME, GPU-Blast, GROMACS, GAMESS, DL-POLY, HPL. The list keeps growing.

28 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

29 Basics of the rcuda framework: A framework enabling a CUDA-based application running in one (or some) node(s) to access GPUs in other nodes. It is useful for applications that do not make use of GPUs all the time (moderate level of data parallelism) and for multi-GPU computing.

30 Basics of the rcuda framework Basic CUDA behavior

31 Basics of the rcuda framework

32 Basics of the rcuda framework

33 Basics of the rcuda framework

34 How to declare remote GPUs: Environment variables are properly initialized on the client side and used by the rcuda client (transparently to the application). They specify the server name/IP address and the amount of GPUs exposed to applications.
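
As an illustration only (the parsing below is hypothetical, not rcuda's actual code; the variable names are those shown later on slide 56), a client-side library could pick up the remote GPUs like this:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical sketch: read the RCUDA_DEVICE_* variables set on the
       client side (variable names as listed on slide 56). */
    int main(void)
    {
        const char *count_str = getenv("RCUDA_DEVICE_COUNT");
        int count = count_str ? atoi(count_str) : 0;

        for (int i = 0; i < count; i++) {
            char name[32], server[256];
            int gpu = 0;
            snprintf(name, sizeof(name), "RCUDA_DEVICE_%d", i);
            const char *val = getenv(name);   /* e.g. "rcu16:0" = server : GPU */
            if (val && sscanf(val, "%255[^:]:%d", server, &gpu) == 2)
                printf("remote device %d -> server %s, GPU %d\n", i, server, gpu);
        }
        return 0;
    }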

35 Basics of the rcuda framework: rcuda uses a proprietary communication protocol. Example: 1) initialization, 2) memory allocation on the remote GPU, 3) CPU to GPU memory transfer of the input data, 4) kernel execution, 5) GPU to CPU memory transfer of the results, 6) GPU memory release, 7) communication channel closing and server process finalization.
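
The client-side view of this sequence is just ordinary CUDA code; rcuda intercepts the calls and forwards them to the remote server. A minimal sketch (hypothetical kernel, sizes and names, not taken from the slides) whose calls map onto steps 1-7:

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *v, int n)          /* hypothetical kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, bytes);                              /* 1) initialization (first CUDA call) and 2) memory allocation on the remote GPU */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);    /* 3) CPU to GPU memory transfer of the input data */
        scale<<<(n + 255) / 256, 256>>>(d, n);              /* 4) kernel execution */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);    /* 5) GPU to CPU memory transfer of the results */
        printf("h[0] = %f\n", h[0]);
        cudaFree(d);                                        /* 6) GPU memory release */
        cudaDeviceReset();                                  /* 7) communication channel closing and server process finalization */
        free(h);
        return 0;
    }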

36 rcuda presents a modular architecture

37 rcuda uses optimized transfers. rcuda features optimized communications: use of GPUDirect RDMA to move data between GPUs, pipelined transfers to improve performance, preallocated pinned memory buffers, and an optimal pipeline block size.
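
The pipelining idea can be sketched as follows (illustrative only, not rcuda's internal code; in rcuda the data arrives from the network rather than from a local pageable buffer, and the block size is the one discussed on the next slide). Two preallocated pinned staging buffers are used so that filling one buffer overlaps with the asynchronous DMA of the other to the GPU:

    #include <cuda_runtime.h>
    #include <string.h>

    /* Hypothetical pipelined host-to-device copy through two preallocated
       pinned staging buffers of size 'block'. */
    void pipelined_h2d(void *dst_dev, const char *src_host, size_t total, size_t block)
    {
        char *stage[2];
        cudaStream_t stream[2];
        for (int i = 0; i < 2; i++) {
            cudaMallocHost((void **)&stage[i], block);   /* preallocated pinned memory buffers */
            cudaStreamCreate(&stream[i]);
        }
        size_t off = 0;
        int i = 0;
        while (off < total) {
            size_t chunk = (total - off < block) ? (total - off) : block;
            cudaStreamSynchronize(stream[i]);            /* wait until this staging buffer is free again */
            memcpy(stage[i], src_host + off, chunk);     /* fill the pinned buffer (in rcuda: data received from the network) */
            cudaMemcpyAsync((char *)dst_dev + off, stage[i], chunk,
                            cudaMemcpyHostToDevice, stream[i]);   /* overlapped DMA to the GPU */
            off += chunk;
            i ^= 1;
        }
        for (int j = 0; j < 2; j++) {
            cudaStreamSynchronize(stream[j]);
            cudaStreamDestroy(stream[j]);
            cudaFreeHost(stage[j]);
        }
    }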

38 Basic performance analysis: pipeline block size for InfiniBand FDR. NVIDIA Tesla K20; Mellanox ConnectX-3 + SX6025 Mellanox switch. It was 2MB with IB QDR.

39 Basic performance analysis. (Charts: bandwidth of a copy between GPU and CPU memories, for Host-to-Device pinned memory and Host-to-Device pageable memory.)

40 Basic performance analysis Latency study: copy a small dataset (64 bytes)

41 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

42 Performance of the rcuda framework. Test system: Intel Xeon E (6 cores) 2.0 GHz (Sandy Bridge architecture), NVIDIA Tesla K20, Mellanox ConnectX-3 single-port InfiniBand adapter (FDR), SX6025 Mellanox switch, Cisco SLM2014 switch (1Gbps Ethernet), CentOS, Mellanox OFED.

43 Single-GPU applications: test of CUDA SDK examples. (Chart: normalized execution time of CUDA SDK examples for CUDA, rcuda over FDR InfiniBand, and rcuda over 1Gbps Ethernet.)

44 Single-GPU applications: CUDASW++, bioinformatics software for Smith-Waterman protein database searches. (Chart: execution time (s) and rcuda overhead (%) vs. sequence length, for CUDA and for rcuda over FDR, QDR, and GbE.)

45 Single-GPU applications: GPU-Blast, accelerated version of the NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool. (Chart: execution time (s) and rcuda overhead (%) vs. sequence length, for CUDA and for rcuda over FDR, QDR, and GbE.)

46 Multi-GPU applications. Test system for multi-GPU applications: For CUDA tests, one node with 2 quad-core Intel Xeon E5440 processors and a Tesla S2050 (4 Tesla GPUs); each thread (1-4) uses one GPU. For rcuda tests, 8 nodes, each with 2 quad-core Intel Xeon E5520 processors and 1 NVIDIA Tesla C2050, connected by InfiniBand QDR; the test runs in one node and uses up to all the GPUs from the others; each thread (1-8) uses one GPU.
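
A minimal sketch of this thread-per-GPU pattern (hypothetical code, using OpenMP for the host threads): each thread binds to one device with cudaSetDevice, and with rcuda the devices reported by cudaGetDeviceCount may transparently live in other nodes.

    #include <cuda_runtime.h>
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);      /* under rcuda this counts the remote GPUs exposed to this node */
        if (ngpus < 1) return 1;

        #pragma omp parallel num_threads(ngpus)
        {
            int id = omp_get_thread_num();
            cudaSetDevice(id);           /* each thread uses one GPU, local or remote */
            void *buf = NULL;
            cudaMalloc(&buf, 1 << 20);   /* per-GPU work would go here */
            cudaFree(buf);
            printf("thread %d bound to device %d\n", id, id);
        }
        return 0;
    }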

47 Multi-GPU applications: MonteCarloMultiGPU (from the NVIDIA SDK).

48 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

49 Integrating rcuda with SLURM. SLURM (Simple Linux Utility for Resource Management) is a job scheduler. SLURM does not know about virtualized GPUs, so a new GRES (generic resource) is added in order to manage virtualized GPUs. Where the GPUs are in the system is completely transparent to the user. In the job script, or in the submission command, the user specifies the number of rGPUs (remote GPUs) required by the job; the amount of GPU memory required by the job may also be specified.

50 The basic idea about SLURM

51 The basic idea about SLURM + rcuda: GPUs are decoupled from nodes; all jobs are executed in less time.

52 Sharing remote GPUs among jobs: GPU 0 is scheduled to be shared among jobs; GPUs are decoupled from nodes; all jobs are executed in even less time.

53 SLURM configuration (slurm.conf):
ClusterName=rcu
GresTypes=gpu,rgpu
NodeName=rcu16 NodeAddr=rcu16 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7990 Gres=rgpu:4,gpu:4

54 SLURM configuration (scontrol show node):
NodeName=rcu16 Arch=x86_64 CoresPerSocket=4 CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null) Gres=rgpu:4,gpu:4
NodeAddr=rcu16 NodeHostName=rcu16 OS=Linux RealMemory=7682 Sockets=1 State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
BootTime= T18:45:35 SlurmdStartTime= T10:02:04

55 SLURM configuration (gres.conf):
Name=rgpu File=/dev/nvidia0 Cuda=2.1 Mem=1535m
Name=rgpu File=/dev/nvidia1 Cuda=3.5 Mem=1535m
Name=rgpu File=/dev/nvidia2 Cuda=1.2 Mem=1535m
Name=rgpu File=/dev/nvidia3 Cuda=1.2 Mem=1535m
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

56 Submitting jobs with SLURM. Submit a job:
$ srun -N1 --gres=rgpu:4:512m script.sh
Environment variables are initialized by SLURM and used by the rcuda client (transparently to the user):
RCUDA_DEVICE_COUNT=4
RCUDA_DEVICE_0=rcu16:0
RCUDA_DEVICE_1=rcu16:1
RCUDA_DEVICE_2=rcu16:2
RCUDA_DEVICE_3=rcu16:3
(each RCUDA_DEVICE_n value has the form server name/IP address : GPU)

57 Integrating rcuda with SLURM. Test bench for checking SLURM correctness: an old heterogeneous cluster with 28 nodes: 24 nodes without GPU, with 2 AMD Opteron 1.4GHz and 2GB RAM; 1 node without GPU, quad-core i7 3.5GHz, 8GB RAM; 3 nodes with GeForce GTX 670 or 590 GPUs (2GB), quad-core i7 3GHz; 1Gbps Ethernet interconnect; Scientific Linux release 6.1 and also openSUSE 11.2 (dual boot).

58 Integrating rcuda with SLURM. Test bench for analyzing SLURM+rCUDA performance: InfiniBand ConnectX-3 based cluster, CentOS 6.4 Linux, dual socket E v2 Intel Xeon based nodes: 3 nodes without GPU, 1 node with an NVIDIA K40, 10 nodes with an NVIDIA K20, 1 node with 4 NVIDIA K20 GPUs. First tests carried out (more complex tests are on the way) used 1 node hosting the main SLURM controller, 1 node with the K20, 1 node without GPU, and 1 node with the K40.

59 GPU-Blast using the K40 remotely. (Chart: theoretical, exclusive, and shared executions for CUDA, rcuda over Ethernet (Eth), and rcuda over InfiniBand (IB).)

60 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

61 Why rcuda with virtual machines? Current clusters frequently leverage virtual machines in order to attain energy and cost reductions; Xen and KVM virtual machines are commonly used. Applications being executed inside virtual machines need access to GPU computing resources. Different Xen and KVM virtual machines can share an InfiniBand network card thanks to the virtualization features of the InfiniBand driver. NVIDIA drivers do not provide such virtualization features, thus preventing the concurrent usage of the GPU from several virtual machines. rcuda can be used to provide concurrent access to GPUs.

62 How to connect an IB card to a VM: PCI pass-through is used to assign an IB card (either a real card or a virtual function, VF) to a given VM; the IB card manages the several virtual copies. (Diagram: real IB card and virtual IB cards (VFs).)

63 Performance analysis on KVM VMs: RDMA write between a computer with KVM VMs and a remote computer. (Charts: bandwidth (MB/s), normalized latency, and normalized bandwidth vs. transfer size (Bytes). RealX3: actual ConnectX-3 (1 port) card; VF: virtual copy used from the host OS; VM: virtual card copy used from the VM.)

64 CUDA BWtest from a KVM VM: computer with KVM VMs accessing a remote computer with a Tesla K20.

65 Outline: Why remote GPU virtualization? How does rcuda work? The performance of the rcuda framework. Scheduling virtual GPUs with SLURM. rcuda and KVM virtual machines. Low-power processors and rcuda.

66 rcuda and low-power processors. There is a clear trend to improve the performance-power ratio in HPC facilities: by leveraging accelerators (Tesla K20 by NVIDIA, FirePro by AMD, Xeon Phi by Intel, etc.) and, more recently, by using low-power processors (Intel Atom, ARM, etc.). rcuda allows attaching a pool of accelerators to a computing facility: fewer accelerators than nodes, accelerators shared among nodes, and a performance-power ratio probably improved further. How interesting is a heterogeneous platform leveraging powerful processors, low-power ones, and remote GPUs? Which is the best configuration? Is energy effectively saved?

67 rcuda and low-power processors. The computing platforms analyzed in this study are: KAYLA: NVIDIA Tegra 3 ARM Cortex-A9 quad-core (1.4 GHz), 2GB DDR3 RAM and Intel 82574L Gigabit Ethernet controller. ATOM: Intel Atom quad-core S1260 (2.0 GHz), 8GB DDR3 RAM and Intel I350 Gigabit Ethernet controller; no PCIe connector for a GPU. XEON: Intel Xeon X3440 quad-core (2.4GHz), 8GB DDR3 RAM and 82574L Gigabit Ethernet controller. The accelerators analyzed are: CARMA: NVIDIA Quadro 1000M (96 cores) with 2GB DDR5 RAM; the rest of the system is the same as for KAYLA. FERMI: NVIDIA GeForce GTX480 Fermi (448 cores) with 1,280 MB DDR3/GDDR5 RAM. (Photo: CARMA platform.)

68 rcuda and low-power processors. 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom).

69 rcuda and low-power processors, 1st analysis (cont.): The XEON+FERMI combination achieves the best performance.

70 rcuda and low-power processors, 1st analysis (cont.): LAMMPS makes a more intensive use of the GPU than CUDASW++.

71 rcuda and low-power processors, 1st analysis (cont.): The lower performance of the PCIe bus in KAYLA reduces performance.

72 rcuda and low-power processors, 1st analysis (cont.): The 96 cores of CARMA provide a noticeably lower performance than a more powerful GPU.

73 rcuda and low-power processors, 1st analysis (cont.): CARMA requires less power.

74 rcuda and low-power processors, 1st analysis (cont.): When execution time and required power are combined into energy, XEON+FERMI is more efficient.

75 rcuda and low-power processors, 1st analysis, summing up: if execution time is a concern, use XEON+FERMI; if power is to be minimized, CARMA is a good choice; the lower performance of KAYLA's PCIe decreases the interest of this option for single-system local-GPU usage; if energy must be minimized, XEON+FERMI is the right choice.

76 rcuda and low-power processors. 2nd analysis: use of remote GPUs. The Quadro within the CARMA system is discarded as rcuda server due to its poor performance; only the FERMI is considered. Six different combinations of rcuda clients and servers are feasible (only one client system and one server system are used in each experiment):
Combination   Client   Server
1             KAYLA    KAYLA+FERMI
2             ATOM     KAYLA+FERMI
3             XEON     KAYLA+FERMI
4             KAYLA    XEON+FERMI
5             ATOM     XEON+FERMI
6             XEON     XEON+FERMI

77 rcuda and low-power processors. 2nd analysis: use of remote GPUs; CUDASW++. (Charts: KAYLA, ATOM and XEON clients against KAYLA+FERMI and XEON+FERMI servers.)

78 rcuda and low-power processors, 2nd analysis (cont.): The KAYLA-based server presents a much higher execution time.

79 rcuda and low-power processors, 2nd analysis (cont.): The network interface for KAYLA delivers much less than the expected 1Gbps bandwidth.

80 rcuda and low-power processors, 2nd analysis (cont.): The lower bandwidth of KAYLA (and ATOM) can also be noticed when using the XEON server.

81 rcuda and low-power processors, 2nd analysis (cont.): Most of the power and energy is required by the server side, as it hosts the GPU.

82 rcuda and low-power processors, 2nd analysis (cont.): The much shorter execution time of XEON-based servers renders more energy-efficient systems.

83 rcuda and low-power processors. 2nd analysis: use of remote GPUs; LAMMPS. Combinations tested:
Combination   Client   Server
1             KAYLA    KAYLA+FERMI
2             KAYLA    XEON+FERMI
3             ATOM     XEON+FERMI
4             XEON     XEON+FERMI

84 rcuda and low-power processors, 2nd analysis (cont.): Similar conclusions to CUDASW++ apply to LAMMPS, due to the network bottleneck.

85 rcuda and low-power processors. Atom-based systems with PCIe 2.0 x8 may hold an InfiniBand card; InfiniBand QDR adapters would noticeably increase network bandwidth.

86 Summary about rcuda. rcuda is the enabling technology for: High Throughput Computing: using remote GPUs does not make applications execute faster, and sharing remote GPUs makes applications execute slower, BUT more throughput (jobs/time) is achieved; datacenter administrators can choose between HPC and HTC. Green Computing: GPU migration and application migration allow devoting just the required computing resources to the current load. More flexible system upgrades: GPU and CPU updates become independent from each other; adding GPU boxes to non-GPU-enabled clusters is possible.

87 You can get a free copy of rcuda at
