Pedraforca: a First ARM + GPU Cluster for HPC

Similar documents
The Mont-Blanc Project

Building supercomputers from commodity embedded chips

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Barcelona Supercomputing Center

E4-ARKA: ARM64+GPU+IB is Now Here Piero Altoè. ARM64 and GPGPU

HPC state of play. HPC Ecosystem. HPC system supply. HPC use 24% EU 4,3% EU. Application software & tools. Academia 23% Bio-sciences 22% CAE 21%

The Mont-Blanc approach towards Exascale

Building supercomputers from embedded technologies

European energy efficient supercomputer project

Butterfly effect of porting scientific applications to ARM-based platforms

The rcuda middleware and applications

Please cite this article using the following BibTex entry:

The Mont-Blanc project Updates from the Barcelona Supercomputing Center

Exploiting CUDA Dynamic Parallelism for low power ARM based prototypes

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker

SNAP Performance Benchmark and Profiling. April 2014

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker

Increasing the efficiency of your GPU-enabled cluster with rcuda. Federico Silla Technical University of Valencia Spain

Designed for Maximum Accelerator Performance

GPU Architecture. Alan Gray EPCC The University of Edinburgh

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Addressing Heterogeneity in Manycore Applications

Exploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation

OCTOPUS Performance Benchmark and Profiling. June 2015

High Performance Computing with Accelerators

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

GROMACS (GPU) Performance Benchmark and Profiling. February 2016

Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)

MILC Performance Benchmark and Profiling. April 2013

HPC Architectures. Types of resource currently in use

LAMMPS-KOKKOS Performance Benchmark and Profiling. September 2015

Smarter Clusters from the Supercomputer Experts

Trends in HPC (hardware complexity and software challenges)

CPMD Performance Benchmark and Profiling. February 2014

rcuda: towards energy-efficiency in GPU computing by leveraging low-power processors and InfiniBand interconnects

n N c CIni.o ewsrg.au

HPC Technology Update Challenges or Chances?

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D

Exascale: challenges and opportunities in a power constrained world

The rcuda technology: an inexpensive way to improve the performance of GPU-based clusters Federico Silla

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Agenda

The Arm Technology Ecosystem: Current Products and Future Outlook

Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Interconnect Your Future

Arm in HPC. Toshinori Kujiraoka Sales Manager, APAC HPC Tools Arm Arm Limited

To hear the audio, please be sure to dial in: ID#

Arm Processor Technology Update and Roadmap

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

FUJITSU Server PRIMERGY CX400 M4 Workload-specific power in a modular form factor. 0 Copyright 2018 FUJITSU LIMITED

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

Vectorisation and Portable Programming using OpenCL

Fujitsu HPC Roadmap Beyond Petascale Computing. Toshiyuki Shimizu Fujitsu Limited

I/O at JSC. I/O Infrastructure Workloads, Use Case I/O System Usage and Performance SIONlib: Task-Local I/O. Wolfgang Frings

GPUs and Emerging Architectures

Illinois Proposal Considerations Greg Bauer

Mapping MPI+X Applications to Multi-GPU Architectures

Martin Dubois, ing. Contents

Game-changing Extreme GPU computing with The Dell PowerEdge C4130

Overview of Architectures and Programming Languages for Parallel Computing

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

HPC with Multicore and GPUs

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

CS500 SMARTER CLUSTER SUPERCOMPUTERS

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Headline in Arial Bold 30pt. Visualisation using the Grid Jeff Adie Principal Systems Engineer, SAPK July 2008

Carlo Cavazzoni, HPC department, CINECA

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

ANSYS Fluent 14 Performance Benchmark and Profiling. October 2012

Debugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

Introduction Case Studies A Parametrized Generator Evaluation Conclusions and Future Work BOAST. A Generative Language for Intensive Computing Kernels

A176 Cyclone. GPGPU Fanless Small FF RediBuilt Supercomputer. IT and Instrumentation for industry. Aitech I/O

EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière

CME 213 S PRING Eric Darve

JÜLICH SUPERCOMPUTING CENTRE Site Introduction Michael Stephan Forschungszentrum Jülich

FUJITSU PHI Turnkey Solution

MPI RUNTIMES AT JSC, NOW AND IN THE FUTURE

InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Mellanox Technologies Maximize Cluster Performance and Productivity. Gilad Shainer, October, 2007

THE LEADER IN VISUAL COMPUTING

HPC projects. Grischa Bolls

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

Sun Lustre Storage System Simplifying and Accelerating Lustre Deployments

Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

GPU Computing with NVIDIA s new Kepler Architecture

GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester

2008 International ANSYS Conference

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015

Transcription:

www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez

We ve hit the power wall ALL computers are limited by power consumption

Energy-efficient approaches Multi-core Fujitsu Ultra SPARC VIIIfx Intel SandyBridge AMD Bulldozer Low-power processors IBM BlueGene/Q Compute accelerators IBM Cell NVIDIA Tesla AMD Radeon Intel Xeon Phi

The next step in the commodity chain HPC Servers Desktop Mobile Build the next HPC system on commodity and super-commodity components 100M tablets in 2012 750M smartphones in 2012

NVIDIA Tegra: Commodity CPU + GPU platform Tegra 2 Dual-core ARM Cortex-A9 ULP Embedded GPU Tegra 3 Quad-core ARM Cortex-A9 12-core Embedded GPU Tegra 4 Quad-core ARM Cortex-A15 72-core Embedded GPU

Tibidabo: The first ARM multicore cluster Q7 Tegra 2 2 x Cortex-A9 @ 1GHz 2 GFLOPS 5 Watts (?) 0.4 GFLOPS / W Q7 carrier board 2 x Cortex-A9 2 GFLOPS 1 GbE + 100 MbE 7 Watts 0.3 GFLOPS / W 1U Rackable blade 8 nodes 16 GFLOPS 65 Watts 0.25 GFLOPS / W Proof of concept It is possible to deploy a cluster of smartphone processors Enable software stack development 2 Racks 32 blade containers 256 nodes 512 cores 9x 48-port 1GbE switch 512 GFLOPS 3.4 Kwatt 0.15 GFLOPS / W

HPC System software stack on ARM ATLAS Source files (C, C++, FORTRAN, ) Compiler(s) gcc gfortran OmpSs Paraver GASNet MPI Executable(s) Scientific libraries FFTW HDF5 Developer tools Scalasca Cluster management (Slurm) OmpSs runtime library (NANOS++) CUDA Linux Linux Linux OpenCL CPU CPU GPU GPU CPU GPU Open source system software stack Ubuntu/Debian Linux OS GNU compilers gcc, g++, gfortran Scientific libraries ATLAS, FFTW, HDF5,... Slurm cluster management Runtime libraries MPICH2, CUDA, OmpSs toolchain* Developer tools Paraver, Scalasca Allinea DDT debugger * S3232 - OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators. Thursday, 10:00, Marriott Ballroom 3

Porting applications to ARM Application Domain Institution Prog. Model MPI OpenMP Other Scalability YALES2 Combustion CNRS/CORIA Y >32K EUTERPE Fusion BSC Y Y >60K SPECFEM3D Wave propagation CNRS Y CUDA, SMPSs >150K, >1K GPU MP2C Multi-particle collision JSC Y >65K BigDFT Elect. Structure CEA Y Y CUDA, OpenCL >2K, >300 GPU Quantum Expresso Elect. Strcuture CINECA Y Y CUDA Good PEPC Coulomg + gravitational forces ARM port JSC Y Pthreads, SMPSs >300K SMMP Protein folding JSC Y OpenCL 16K ProFASI Protein folding JSC Y Good COSMO Weather forecast CINECA Y Y BQCD Particle physics LRZ Y Y ~300K Porting full-scale HPC applications to ARM cluster requires minimal effort

CARMA: CUDA on ARM developer kit Tegra3 SoC Quad-core ARM Cortex-A9 6 PCIe lanes (gen1) Quadro 1000M CUDA supported 1 GbE First hybrid ARM + CUDA platform

CARMA Kit: Energy Efficiency CARMA platform is much more energy-efficient than Tegra3 alone

Pedraforca v1: The first ARM + GPU cluster Development cluster of 16 CARMA kits @ BSC

Pedraforca v1: Initial application performance results Only 3.72 GLFOPS in Linpack but DGEMM: 21.3 GFLOPS (0.78 GFLOPS/W) SGEMM: 127.8 GFLOPS (5.04 GFLOPS/W) Low PCIe bandwidth (400 MB/s peak) No overlap of data transfers and computation

Pedraforca v2: Next generation ARM + GPU platform Tegra3 Q7 module 4x ARM Cortex-A9 @ 1.3 GHz 2GB DDR2 Mini-ITX carrier 4x PCIe Gen1 SATA 2.0 1 GbE NVIDIA Tesla K20 16x PCIe Gen3 1170 GFLOPS (peak) Mellanox ConnectX-3 8x PCIe Gen3 40 Gb/s 2.5 SSD 250 GB SATA 3 MLC Ethernet 1 Gb/s (service + storage) InfiniBand 40 Gb/s (MPI)

Pedraforca: Rack enclosure 2x GbE switch 4x IB switch Login nodes Intel SandyBridge E5 64x Compute nodes 4x ARM Cortex-A9 1x NVIDIA Tesla K20 NFS Storage

Pedraforca: Interconnect GbE IB GbE GbE network for service and storage IB network for MPI With extra ports to connect to other clusters IB IB IB

GPU-accelerated cluster vs. GPU-accelerator cluster Current GPU clusters Fixed ratio of CPU to GPU Unused GPU in not-accelerated apps Unused CPU in heavily accelerated apps Decouple CPU from GPU Off-load kernels to remote GPU Direct GPU to GPU data transfers Orchestrated by light-weight ARM CPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU CPU GPU Interconnection network CPU CPU CPU GPU GPU CPU CPU CPU GPU GPU CPU CPU CPU GPU GPU CPU CPU CPU GPU GPU Interconnection network

Conclusions CARMA is not an HPC solution but it enables software development already Pedraforca is the second generation ARM + GPU prototype GPU-accelerator cluster, instead of GPU-accelerated cluster ARM CPU used to orchestrate direct GPU to GPU communication CPU + GPU integration is happening already Embedded mobile platforms with OpenCL capable GPU Get ready for your next generation CPU + GPU platforms!

We re hiring! Do you want to work on the next generation of energy-efficient HPC systems? Lead the way to the Exascale? Change the HPC world forever? http://www.bsc.es/about_bsc/employment/vacancies Senior Researchers in Energy-Efficient Supercomputers HPC Application Developers