Pedraforca: a First ARM + GPU Cluster for HPC

www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez

We ve hit the power wall ALL computers are limited by power consumption

Energy-efficient approaches Multi-core Fujitsu Ultra SPARC VIIIfx Intel SandyBridge AMD Bulldozer Low-power processors IBM BlueGene/Q Compute accelerators IBM Cell NVIDIA Tesla AMD Radeon Intel Xeon Phi

The next step in the commodity chain HPC Servers Desktop Mobile Build the next HPC system on commodity and super-commodity components 100M tablets in 2012 750M smartphones in 2012

NVIDIA Tegra: Commodity CPU + GPU platform Tegra 2 Dual-core ARM Cortex-A9 ULP Embedded GPU Tegra 3 Quad-core ARM Cortex-A9 12-core Embedded GPU Tegra 4 Quad-core ARM Cortex-A15 72-core Embedded GPU

Tibidabo: The first ARM multicore cluster Q7 Tegra 2 2 x Cortex-A9 @ 1GHz 2 GFLOPS 5 Watts (?) 0.4 GFLOPS / W Q7 carrier board 2 x Cortex-A9 2 GFLOPS 1 GbE + 100 MbE 7 Watts 0.3 GFLOPS / W 1U Rackable blade 8 nodes 16 GFLOPS 65 Watts 0.25 GFLOPS / W Proof of concept It is possible to deploy a cluster of smartphone processors Enable software stack development 2 Racks 32 blade containers 256 nodes 512 cores 9x 48-port 1GbE switch 512 GFLOPS 3.4 Kwatt 0.15 GFLOPS / W

HPC System software stack on ARM ATLAS Source files (C, C++, FORTRAN, ) Compiler(s) gcc gfortran OmpSs Paraver GASNet MPI Executable(s) Scientific libraries FFTW HDF5 Developer tools Scalasca Cluster management (Slurm) OmpSs runtime library (NANOS++) CUDA Linux Linux Linux OpenCL CPU CPU GPU GPU CPU GPU Open source system software stack Ubuntu/Debian Linux OS GNU compilers gcc, g++, gfortran Scientific libraries ATLAS, FFTW, HDF5,... Slurm cluster management Runtime libraries MPICH2, CUDA, OmpSs toolchain* Developer tools Paraver, Scalasca Allinea DDT debugger * S3232 - OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators. Thursday, 10:00, Marriott Ballroom 3

Porting applications to ARM Application Domain Institution Prog. Model MPI OpenMP Other Scalability YALES2 Combustion CNRS/CORIA Y >32K EUTERPE Fusion BSC Y Y >60K SPECFEM3D Wave propagation CNRS Y CUDA, SMPSs >150K, >1K GPU MP2C Multi-particle collision JSC Y >65K BigDFT Elect. Structure CEA Y Y CUDA, OpenCL >2K, >300 GPU Quantum Expresso Elect. Strcuture CINECA Y Y CUDA Good PEPC Coulomg + gravitational forces ARM port JSC Y Pthreads, SMPSs >300K SMMP Protein folding JSC Y OpenCL 16K ProFASI Protein folding JSC Y Good COSMO Weather forecast CINECA Y Y BQCD Particle physics LRZ Y Y ~300K Porting full-scale HPC applications to ARM cluster requires minimal effort

CARMA: CUDA on ARM developer kit Tegra3 SoC Quad-core ARM Cortex-A9 6 PCIe lanes (gen1) Quadro 1000M CUDA supported 1 GbE First hybrid ARM + CUDA platform

CARMA Kit: Energy Efficiency CARMA platform is much more energy-efficient than Tegra3 alone

Pedraforca v1: The first ARM + GPU cluster Development cluster of 16 CARMA kits @ BSC

Pedraforca v1: Initial application performance results Only 3.72 GLFOPS in Linpack but DGEMM: 21.3 GFLOPS (0.78 GFLOPS/W) SGEMM: 127.8 GFLOPS (5.04 GFLOPS/W) Low PCIe bandwidth (400 MB/s peak) No overlap of data transfers and computation

Pedraforca v2: Next generation ARM + GPU platform Tegra3 Q7 module 4x ARM Cortex-A9 @ 1.3 GHz 2GB DDR2 Mini-ITX carrier 4x PCIe Gen1 SATA 2.0 1 GbE NVIDIA Tesla K20 16x PCIe Gen3 1170 GFLOPS (peak) Mellanox ConnectX-3 8x PCIe Gen3 40 Gb/s 2.5 SSD 250 GB SATA 3 MLC Ethernet 1 Gb/s (service + storage) InfiniBand 40 Gb/s (MPI)

Pedraforca: Rack enclosure 2x GbE switch 4x IB switch Login nodes Intel SandyBridge E5 64x Compute nodes 4x ARM Cortex-A9 1x NVIDIA Tesla K20 NFS Storage

Pedraforca: Interconnect GbE IB GbE GbE network for service and storage IB network for MPI With extra ports to connect to other clusters IB IB IB

GPU-accelerated cluster vs. GPU-accelerator cluster Current GPU clusters Fixed ratio of CPU to GPU Unused GPU in not-accelerated apps Unused CPU in heavily accelerated apps Decouple CPU from GPU Off-load kernels to remote GPU Direct GPU to GPU data transfers Orchestrated by light-weight ARM CPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU Interconnection network CPU CPU CPU GPU GPU CPU CPU CPU GPU GPU CPU CPU CPU GPU GPU CPU CPU CPU GPU GPU Interconnection network

Conclusions CARMA is not an HPC solution but it enables software development already Pedraforca is the second generation ARM + GPU prototype GPU-accelerator cluster, instead of GPU-accelerated cluster ARM CPU used to orchestrate direct GPU to GPU communication CPU + GPU integration is happening already Embedded mobile platforms with OpenCL capable GPU Get ready for your next generation CPU + GPU platforms!

We re hiring! Do you want to work on the next generation of energy-efficient HPC systems? Lead the way to the Exascale? Change the HPC world forever? http://www.bsc.es/about_bsc/employment/vacancies Senior Researchers in Energy-Efficient Supercomputers HPC Application Developers