rcuda: towards energy-efficiency in GPU computing by leveraging low-power processors and InfiniBand interconnects


rcuda: towards energy-efficiency in GPU computing by leveraging low-power processors and InfiniBand interconnects. Federico Silla, Technical University of Valencia, Spain. Joint research effort.

Outline: Current GPU computing facilities; How GPU virtualization adds value to your cluster; The rcuda framework: basics and performance; Integrating rcuda with SLURM; rcuda and low-power processors.

Current GPU computing facilities. GPU computing defines all the technological issues for using the GPU computational power for executing general-purpose applications. GPU computing has experienced a remarkable growth in the last years. [Chart: number of accelerated systems in the Top500 list, June 2013, by accelerator type: Clearspeed, IBM PowerCell, NVIDIA, ATI/AMD, Intel MIC/Phi]

Current GPU computing facilities. The basic building block is a node with one or more GPUs. [Diagram: node with main memory and CPU, with the GPU(s) attached through PCIe]

Current GPU computing facilities. From the programming point of view: a set of nodes, each one with: one or more CPUs (probably with several cores per CPU), one or more GPUs (typically between 1 and 4), and disjoint memory spaces for each GPU and the CPUs; plus an interconnection network.

Current GPU computing facilities. For some kinds of code the use of GPUs brings huge benefits in terms of performance and energy. There must be data parallelism in the code: this is the only way to take benefit from the hundreds of processors within a GPU. Different scenarios from the point of view of the application: low level of data parallelism; high level of data parallelism; moderate level of data parallelism; applications for multi-GPU computing, leveraging up to the 4 GPUs inside the node.

Outline: Current GPU computing facilities; How GPU virtualization adds value to your cluster; The rcuda framework: basics and performance; Integrating rcuda with SLURM; rcuda and low-power processors.

How GPU virtualization adds value. A computing facility is usually a set of independent, self-contained nodes that leverage the shared-nothing approach: nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster), and GPUs can only be used within the node they are attached to. [Diagram: cluster of nodes, each with main memory, CPU and GPU on PCIe, linked by the interconnection network]

How GPU virtualization adds value. Current trend in GPU deployment: one or more GPUs in every node. CONCERNS: multi-GPU applications running on a subset of the nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if they are idle); for applications with moderate levels of data parallelism, many GPUs in the cluster may be idle for long periods of time.

How GPU virtualization adds value. A way of addressing the first concern is by sharing the GPUs present in the cluster among all the nodes. For the second concern, once GPUs are shared, their amount can be reduced. This would increase GPU utilization, also lowering power consumption, at the same time that initial acquisition costs are reduced. This new configuration requires: 1) a way of seamlessly sharing GPUs across nodes in the cluster (GPU virtualization); 2) enhanced job schedulers that take the new configuration into account.

How GPU virtualization adds value. In summary, GPU virtualization allows a new vision of a GPU deployment, moving from the usual cluster configuration [diagram: one GPU attached to each node through PCIe] to the following one.

How GPU virtualization adds value. Physical configuration versus logical configuration: physically, each GPU is attached to one node through PCIe; logically, every node holds logical connections to the GPUs in the cluster. [Diagrams: physical and logical cluster configurations]

How GPU virtualization adds value. GPU virtualization is also useful for multi-GPU applications. Without GPU virtualization, only the GPUs in the node can be provided to the application; with GPU virtualization, many GPUs in the cluster can be provided to the application. [Diagram]

How GPU virtualization adds value. GPU virtualization allows all nodes to access all GPUs: the real local GPUs of the physical configuration become virtualized remote GPUs in the logical configuration. [Diagram: physical versus logical configuration]

How GPU virtualization adds value. One step further: enhancing the scheduling process so that GPU servers are put into low-power sleeping modes as soon as their acceleration features are not required.

How GPU virtualization adds value. Going even beyond: consolidating GPUs into dedicated GPU boxes (no CPU power required) and allowing GPU task migration? TRUE GREEN COMPUTING.

How GPU virtualization adds value. The main drawback of GPU virtualization is the increased latency and reduced bandwidth to the remote GPU. [Chart: influence of data transfers for SGEMM; time devoted to data transfers (%) versus matrix size, for pinned and non-pinned memory; data from a matrix-matrix multiplication using a local GPU!]

How GPU virtualization adds value. Several efforts have been made regarding GPU virtualization during the last years. Publicly available: rcuda (CUDA 5.0), GVirtuS (CUDA 3.2), DS-CUDA (CUDA 4.1). NOT publicly available: vCUDA (CUDA 1.1), GViM (CUDA 1.1), GridCUDA (CUDA 2.3), V-GPU (CUDA 4.0).

How GPU virtualization adds value. Performance comparison of GPU virtualization solutions. Test bench: Intel Xeon E5-2620 (6 cores) at 2.0 GHz (Sandy Bridge architecture), NVIDIA Tesla K20, Mellanox ConnectX-3 single-port InfiniBand adapter (FDR), SX6025 Mellanox switch, CentOS 6.3 + Mellanox OFED 1.5.3. Latency (measured by transferring 64 bytes):

             Pageable H2D   Pinned H2D   Pageable D2H   Pinned D2H
  CUDA            34.3          4.3          16.2           5.2
  rcuda           94.5         23.1         292.2           6.0
  GVirtuS        184.2        200.3         168.4         182.8
  DS-CUDA         45.9            -          26.5             -

How GPU virtualization adds value. Bandwidth of a copy between GPU and CPU memories. [Charts: host-to-device and device-to-host bandwidth (MB/s) versus transfer size (MB), for pageable and pinned memory]

How GPU virtualization adds value. rcuda has been successfully tested with the following applications: NVIDIA CUDA SDK samples, LAMMPS, WideLM, CUDASW++, OpenFOAM, HOOMD-blue, mCUDA-MEME, GPU-Blast, GROMACS.

Outline: Current GPU computing facilities; How GPU virtualization adds value to your cluster; The rcuda framework: basics and performance; Integrating rcuda with SLURM; rcuda and low-power processors.

Basics of the rcuda framework. A framework enabling a CUDA-based application running in one (or some) node(s) to access GPUs in other nodes. It is useful for: moderate levels of data parallelism; applications for multi-GPU computing.

Basics of the rcuda framework. [Diagram, built up over several slides: in the original local scenario, the CUDA application calls the CUDA libraries and uses the GPU in its own node; with rcuda, the application on the client side calls the rcuda library instead, which forwards each CUDA call to the rcuda daemon on the server side, where the real CUDA libraries execute it on the server's GPU device.]

Basics of the rcuda framework. rcuda uses a proprietary communication protocol. Example: 1) initialization; 2) memory allocation on the remote GPU; 3) CPU-to-GPU memory transfer of the input data; 4) kernel execution; 5) GPU-to-CPU memory transfer of the results; 6) GPU memory release; 7) communication channel closing and server process finalization.
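For reference, the sequence above corresponds to an ordinary CUDA host program; a minimal sketch is shown below (the vector-increment kernel and the data sizes are illustrative, not taken from the slides). Under rcuda, the client application stays exactly like this: each runtime call is intercepted and executed on the remote GPU.

    /* Minimal CUDA host-code sketch of the protocol steps listed above. */
    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    __global__ void inc(float *v, int n) {            /* illustrative kernel */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] += 1.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);
        float *d = NULL;
        memset(h, 0, bytes);

        /* 1) initialization happens implicitly on the first runtime call */
        cudaMalloc((void **)&d, bytes);                   /* 2) memory allocation on the remote GPU   */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* 3) CPU-to-GPU transfer of the input data */
        inc<<<(n + 255) / 256, 256>>>(d, n);              /* 4) kernel execution                      */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  /* 5) GPU-to-CPU transfer of the results    */
        cudaFree(d);                                      /* 6) GPU memory release                    */
        cudaDeviceReset();                                /* roughly 7) channel closing and cleanup   */

        free(h);
        return 0;
    }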

Basics of the rcuda framework. rcuda features a modular architecture. [Diagram: on the client side, the application calls the interception library, which sits on top of a communications layer with TCP and InfiniBand interfaces; on the server side, a matching communications layer feeds the server daemon, which calls the real CUDA library.]
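To make the modular structure concrete, the toy skeleton below sketches the general shape of such an interception library, assuming it is built (in C) as a drop-in replacement for the CUDA Runtime library. It is not rcuda's actual source: the communications interface, the wire format and every helper name are invented for the example.

    /* Toy skeleton of the client-side interception library: it exports a CUDA
     * Runtime entry point and hands the call to a pluggable communications
     * module (a TCP stub here; an InfiniBand module would expose the same two
     * functions). */
    #include <cuda_runtime_api.h>   /* declares the API being re-implemented */
    #include <stddef.h>
    #include <string.h>

    typedef struct {                /* transport selected when the library starts */
        int (*send)(const void *buf, size_t len);
        int (*recv)(void *buf, size_t len);
    } comm_module_t;

    /* Stub TCP module so the sketch is self-contained; a real module would talk
     * to the rcuda server daemon over a socket or over InfiniBand verbs. */
    static int tcp_send(const void *buf, size_t len) { (void)buf; (void)len; return 0; }
    static int tcp_recv(void *buf, size_t len)       { memset(buf, 0, len);  return 0; }
    static comm_module_t tcp_module = { tcp_send, tcp_recv };
    static comm_module_t *comm = &tcp_module;

    enum { RPC_CUDA_MALLOC = 1 };   /* hypothetical wire opcode */

    cudaError_t cudaMalloc(void **devPtr, size_t size)
    {
        /* Marshal the request and ship it to the server daemon, which runs the
         * real cudaMalloc on its local GPU. */
        struct { int op; size_t size; } req = { RPC_CUDA_MALLOC, size };
        comm->send(&req, sizeof(req));

        /* The reply carries the error code plus an opaque handle that stands in
         * for the device pointer on the client side. */
        struct { cudaError_t err; unsigned long long handle; } rep;
        comm->recv(&rep, sizeof(rep));
        *devPtr = (void *)rep.handle;
        return rep.err;
    }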

Basics of the rcuda framework. rcuda features optimized communications: use of GPUDirect to avoid unnecessary memory copies; pipelined transfers to improve performance; preallocated pinned memory buffers; optimal pipeline block size.
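The sketch below illustrates the pipelining idea for the host-to-device direction, assuming a hypothetical net_recv() transport helper and an arbitrary 1 MB block size: two preallocated pinned buffers are alternated so that receiving one chunk from the network overlaps the asynchronous copy of the previous chunk to the GPU. This is an illustration of the technique, not rcuda's implementation.

    /* Server-side sketch of a pipelined host-to-device transfer. */
    #include <cuda_runtime.h>
    #include <stddef.h>
    #include <string.h>

    #define BLOCK (1 << 20)   /* placeholder pipeline block size (1 MB) */

    /* Stub standing in for the real transport: fills a host buffer with data
     * received from the client connection. */
    static size_t net_recv(void *buf, size_t len) { memset(buf, 0, len); return len; }

    void pipelined_h2d(void *dev_dst, size_t total)
    {
        static void        *pinned[2];       /* pinned buffers, preallocated once and reused */
        static cudaEvent_t  done[2];
        static cudaStream_t stream;
        static int          ready = 0;

        if (!ready) {
            cudaHostAlloc(&pinned[0], BLOCK, cudaHostAllocDefault);
            cudaHostAlloc(&pinned[1], BLOCK, cudaHostAllocDefault);
            cudaEventCreate(&done[0]);
            cudaEventCreate(&done[1]);
            cudaStreamCreate(&stream);
            ready = 1;
        }

        size_t offset = 0;
        int    buf = 0;
        while (offset < total) {
            size_t chunk = (total - offset < BLOCK) ? total - offset : BLOCK;

            cudaEventSynchronize(done[buf]);   /* wait until the previous copy from this buffer finished */
            net_recv(pinned[buf], chunk);      /* stage the next chunk from the network                  */
            cudaMemcpyAsync((char *)dev_dst + offset, pinned[buf], chunk,
                            cudaMemcpyHostToDevice, stream);   /* overlaps the next net_recv             */
            cudaEventRecord(done[buf], stream);

            offset += chunk;
            buf ^= 1;                          /* alternate between the two pinned buffers               */
        }
        cudaStreamSynchronize(stream);         /* all chunks have reached the GPU                        */
    }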

Basics of the rcuda framework. Pipeline block size for InfiniBand FDR (NVIDIA Tesla K20; Mellanox ConnectX-3 + SX6025 Mellanox switch). The optimal block size was 2 MB with IB QDR.

Basics of the rcuda framework. Bandwidth of a copy between GPU and CPU memories. [Charts: host-to-device bandwidth for pinned and pageable memory]

Basics of the rcuda framework. Latency study: copy of a small dataset (64 bytes).

Outline: Current GPU computing facilities; How GPU virtualization adds value to your cluster; The rcuda framework: basics and performance; Integrating rcuda with SLURM; rcuda and low-power processors.

Performance of the rcuda framework. Test system: Intel Xeon E5-2620 (6 cores) at 2.0 GHz (Sandy Bridge architecture), NVIDIA Tesla K20, Mellanox ConnectX-3 single-port InfiniBand adapter (FDR), SX6025 Mellanox switch, Cisco SLM2014 switch (1 Gbps Ethernet), CentOS 6.3 + Mellanox OFED 1.5.3.

Performance of the rcuda framework. Test: CUDA SDK examples. [Chart: normalized execution time of the CUDA SDK samples for CUDA, rcuda over FDR InfiniBand, and rcuda over 1 Gbps Ethernet]

Performance of the rcuda framework. CUDASW++: bioinformatics software for Smith-Waterman protein database searches. [Chart: execution time (s) and rcuda overhead (%) versus sequence length, for CUDA and for rcuda over FDR, QDR and GbE]

Performance of the rcuda framework. GPU-Blast: accelerated version of NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool. [Chart: execution time (s) and rcuda overhead (%) versus sequence length, for CUDA and for rcuda over FDR, QDR and GbE]

Performance of the rcuda framework. Test system for multi-GPU experiments. For the CUDA tests, one node with: 2 quad-core Intel Xeon E5440 processors and a Tesla S2050 (4 Tesla GPUs); each thread (1-4) uses one GPU. For the rcuda tests, 8 nodes with: 2 quad-core Intel Xeon E5520 processors, 1 NVIDIA Tesla C2050, and InfiniBand QDR; the test runs in one node and uses up to all the GPUs of the other nodes; each thread (1-8) uses one GPU.

Performance of the rcuda framework. MonteCarlo multi-GPU (from the NVIDIA SDK).

Performance of the rcuda framework. GEMM multi-GPU using libflame, a high-performance dense linear algebra library (http://www.cs.utexas.edu/~flame/web/).

Outline: Current GPU computing facilities; How GPU virtualization adds value to your cluster; The rcuda framework: basics and performance; Integrating rcuda with SLURM; rcuda and low-power processors.

Integrating rcuda with SLURM. SLURM (Simple Linux Utility for Resource Management) is a job scheduler that does not understand virtualized GPUs. We add a new GRES (generic resource) in order to manage virtualized GPUs. Where the GPUs are in the system is completely transparent to the user: in the job script or in the submission command, the user simply specifies the number of rGPUs (remote GPUs) required by the job.

Integrating rcuda with SLURM. Changes to slurm.conf:

ClusterName=rcu
GresTypes=gpu,rgpu
NodeName=rcu16 NodeAddr=rcu16 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7990 Gres=rgpu:4,gpu:4

Integrating rcuda with SLURM. Output of scontrol show node:

NodeName=rcu16 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
Gres=rgpu:4,gpu:4
NodeAddr=rcu16 NodeHostName=rcu16 OS=Linux
RealMemory=7682 Sockets=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
BootTime=2013-04-24T18:45:35 SlurmdStartTime=2013-04-30T10:02:04

Integrating rcuda with SLURM. Changes to gres.conf:

Name=rgpu File=/dev/nvidia0 Cuda=2.1 Mem=1535m
Name=rgpu File=/dev/nvidia1 Cuda=3.5 Mem=1535m
Name=rgpu File=/dev/nvidia2 Cuda=1.2 Mem=1535m
Name=rgpu File=/dev/nvidia3 Cuda=1.2 Mem=1535m
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

Integrating rcuda with SLURM. Submitting a job:

$ srun -N1 --gres=rgpu:4:512m script.sh

Environment variables are initialized by SLURM and used by the rcuda client (transparently to the user); each value holds a server name or IP address and a GPU index:

RCUDA_DEVICE_COUNT=4
RCUDA_DEVICE_0=rcu16:0
RCUDA_DEVICE_1=rcu16:1
RCUDA_DEVICE_2=rcu16:2
RCUDA_DEVICE_3=rcu16:3
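As an illustration of that format, a small standalone C program could parse the variables as follows; the real rcuda client library does this internally and transparently, so the snippet is only a sketch.

    /* Illustrative parsing of the RCUDA_* variables exported by SLURM. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *count_str = getenv("RCUDA_DEVICE_COUNT");
        int count = count_str ? atoi(count_str) : 0;

        for (int i = 0; i < count; i++) {
            char var[32];
            snprintf(var, sizeof(var), "RCUDA_DEVICE_%d", i);

            const char *spec = getenv(var);      /* e.g. "rcu16:0" -> server:gpu index */
            if (!spec)
                continue;

            char server[128];
            int  gpu = 0;
            if (sscanf(spec, "%127[^:]:%d", server, &gpu) == 2)
                printf("remote GPU %d -> server %s, device %d\n", i, server, gpu);
        }
        return 0;
    }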

Integrating rcuda with SLURM. GPUs are decoupled from nodes; all jobs are executed in less time.

Integrating rcuda with SLURM. Test bench for the SLURM experiments: an old heterogeneous cluster with 28 nodes: 1 node without GPU, with 2 AMD Opteron 1.4 GHz, 6 GB RAM; 23 nodes without GPU, with 2 AMD Opteron 1.4 GHz, 2 GB RAM; 1 node without GPU, quad-core i7 3.5 GHz, 8 GB RAM; 1 node with GeForce GTX 670 2 GB, quad-core i7 3 GHz, 8 GB RAM; 1 node with GeForce GTX 670 2 GB, Core2Duo 2.4 GHz, 2 GB RAM; 1 node with 2 GeForce GTX 590 3 GB, quad-core i7 3.5 GHz, 8 GB RAM. 1 Gbps Ethernet. Scientific Linux release 6.1 and also openSUSE 11.2 (dual boot).

Integrating rcuda with SLURM. matrixmul from NVIDIA's SDK (matrix A 320x320, matrix B 640x320). Total execution time including SLURM.

Integrating rcuda with SLURM. matrixmul from NVIDIA's SDK: the higher performance of rcuda is due to a pre-initialized CUDA environment in the rcuda server. Total execution time including SLURM.

Integrating rcuda with SLURM. matrixmul from NVIDIA's SDK: the time for initializing the CUDA environment is about 0.9 seconds. Total execution time including SLURM.

Integrating rcuda with SLURM. matrixmul from NVIDIA's SDK: scheduling the use of remote GPUs takes less time than finding an available local GPU. Total execution time including SLURM.

Integrating rcuda with SLURM. Sharing remote GPUs among processes (matrixmul). Total execution time including SLURM.

Integrating rcuda with SLURM. Sharing remote GPUs among processes: sorting samples from NVIDIA's SDK.

Integrating rcuda with SLURM. Sharing remote GPUs among processes (CUDASW++): the GPU memory gets full with 5 concurrent executions.

Outline: Current GPU computing facilities; How GPU virtualization adds value to your cluster; The rcuda framework: basics and performance; Integrating rcuda with SLURM; rcuda and low-power processors.

rcuda and low-power processors. There is a clear trend to improve the performance-power ratio in HPC facilities: by leveraging accelerators (Tesla K20 by NVIDIA, FirePro by AMD, Xeon Phi by Intel, etc.) and, more recently, by using low-power processors (Intel Atom, ARM, etc.). rcuda allows a pool of accelerators to be attached to a computing facility: fewer accelerators than nodes, accelerators shared among nodes, and the performance-power ratio probably improved further. How interesting is a heterogeneous platform leveraging powerful processors, low-power ones, and remote GPUs? Which is the best configuration? Is energy effectively saved?

rcuda and low-power processors. The computing platforms analyzed in this study are: KAYLA: NVIDIA Tegra 3 ARM Cortex-A9 quad-core CPU (1.4 GHz), 2 GB DDR3 RAM and Intel 82574L Gigabit Ethernet controller. ATOM: Intel Atom quad-core CPU S1260 (2.0 GHz), 8 GB DDR3 RAM and Intel I350 Gigabit Ethernet controller; no PCIe connector for a GPU. XEON: Intel Xeon X3440 quad-core CPU (2.4 GHz), 8 GB DDR3 RAM and 82574L Gigabit Ethernet controller. The accelerators analyzed are: CARMA: NVIDIA Quadro 1000M GPU (96 cores) with 2 GB DDR5 RAM; the rest of the system is the same as for KAYLA. FERMI: NVIDIA GeForce GTX480 Fermi GPU (448 cores) with 1,280 MB DDR3/GDDR5 RAM. [Photo: the CARMA platform]

rcuda and low-power processors. 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom). [Charts: execution time for each CPU+GPU combination.] Observations: the XEON+FERMI combination achieves the best performance; LAMMPS makes a more intensive use of the GPU than CUDASW++; the lower performance of the PCIe bus in KAYLA reduces performance; the 96 cores of CARMA provide a noticeably lower performance than a more powerful GPU.

rcuda and low-power processors. 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom). CARMA requires less power.

rcuda and low-power processors. 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom). When execution time and required power are combined into energy, XEON+FERMI is more efficient.

rcuda and low-power processors. 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom). Summing up: if execution time is a concern, then use XEON+FERMI; if power is to be minimized, CARMA is a good choice; the lower performance of KAYLA's PCIe decreases the interest of this option for single-system local-GPU usage; if energy must be minimized, XEON+FERMI is the right choice.

rcuda and low-power processors. 2nd analysis: use of remote GPUs. The use of the Quadro GPU within the CARMA system as an rcuda server is discarded due to its poor performance; only the FERMI GPU is considered. Six different combinations of rcuda clients and servers are feasible (only one client system and one server system are used in each experiment):

  Combination   Client   Server
  1             KAYLA    KAYLA+FERMI
  2             ATOM     KAYLA+FERMI
  3             XEON     KAYLA+FERMI
  4             KAYLA    XEON+FERMI
  5             ATOM     XEON+FERMI
  6             XEON     XEON+FERMI

rcuda and low-power processors. 2nd analysis: use of remote GPUs; CUDASW++. [Charts: results for each client (KAYLA, ATOM, XEON) paired with each server (KAYLA+FERMI, XEON+FERMI)]

rcuda and low-power processors. 2nd analysis: use of remote GPUs; CUDASW++. The KAYLA-based server presents much higher execution time.

rcuda and low-power processors. 2nd analysis: use of remote GPUs; CUDASW++. The network interface for KAYLA delivers much less than the expected 1 Gbps bandwidth.

rcuda and low-power processors. 2nd analysis: use of remote GPUs; CUDASW++. The lower bandwidth of KAYLA (and ATOM) can also be noticed when using the XEON server.

rcuda and low-power processors. 2nd analysis: use of remote GPUs; CUDASW++. Most of the power and energy is required by the server side, as it hosts the GPU.

rcuda and low-power processors. 2nd analysis: use of remote GPUs; CUDASW++. The much shorter execution time of the XEON-based servers renders more energy-efficient systems.

rcuda and low-power processors. 2nd analysis: use of remote GPUs; LAMMPS. Four client/server combinations are considered:

  Combination   Client   Server
  1             KAYLA    KAYLA+FERMI
  2             KAYLA    XEON+FERMI
  3             ATOM     XEON+FERMI
  4             XEON     XEON+FERMI

rcuda and low-power processors. 2nd analysis: use of remote GPUs; LAMMPS. Similar conclusions to CUDASW++ apply to LAMMPS, due to the network bottleneck.

Outline: Current GPU computing facilities; How GPU virtualization adds value to your cluster; The rcuda framework: basics and performance; Integrating rcuda with SLURM; rcuda and low-power processors; rcuda work in progress.

rcuda work in progress: scheduling in SLURM taking into account the actual GPU utilization; analysis of the benefits of using the SLURM+rCUDA suite in real production clusters with an InfiniBand fabric; testing rcuda with more applications from NVIDIA's list of Popular GPU-accelerated Applications; support for CUDA 5.5; integration of GPUDirect RDMA into rcuda communications; further analysis of hybrid ARM-XEON scenarios; use of better network protocols for 1 Gbps Ethernet; use of InfiniBand fabrics.

http://www.rcuda.net (more than 350 requests). Thanks to Mellanox for its support to this work. The rcuda team: Antonio Peña (1), Carlos Reaño, Federico Silla, José Duato, Adrián Castelló, Sergio Iserte, Vicente Roca, Rafael Mayo, Enrique S. Quintana-Ortí. (1) Now at Argonne National Laboratory (USA).