Exploiting CUDA Dynamic Parallelism for low power ARM based prototypes


www.bsc.es

Exploiting CUDA Dynamic Parallelism for low power ARM based prototypes
Vishal Mehta, Engineer, Barcelona Supercomputing Center (vishal.mehta@bsc.es)

BSC/UPC CUDA Centre of Excellence (CCOE)

Training: building an education program on parallel programming using CUDA, OpenCL and OmpSs; the PUMPS summer school (2010-2015) and courses at BSC and UPC.

Research: generation, simulation and rendering of large varied animated crowds, presented using OmpSs at the current GTC; HERTA Security's GPU-based machine learning for real-time face recognition and bio-marketing, also presented at this GTC; and exploring the potential of low-power GPU clusters as high-performance platforms, with involvement in the Mont-Blanc and PRACE prototypes.

Top500 Power Consumption Evolution

[Chart: power consumption in MW of the TOP10, TOP50 and TOP500 systems from 2008 to 2013, growing by x5.04, x3.13 and x3.25 respectively over five years.] Higher performance, at the expense of higher power.

Mont-Blanc Project (http://www.montblanc-project.eu)

A European approach for energy-efficient HPC systems. Objectives:
- Develop a full energy-efficient HPC prototype using low-power, commercially available embedded technology.
- Develop a portfolio of exascale applications to be run on this new generation of HPC systems.
- Explore different alternatives for the compute node (from low-power mobile sockets to special-purpose high-end ARM chips) and their implications for the rest of the system.

Partners: [logos]

Euroserver Project (http://www.euroserver-project.eu)

A European approach for energy-efficient data servers. Objectives:
- Reduce energy consumption by (i) using 64-bit ARM cores, (ii) drastically reducing the core-to-memory distance, and (iii) improving energy proportionality.
- Reduce the cost to build and operate each microserver through (i) improved manufacturing yield, (ii) reduced physical volume of the packaged interposer module, and (iii) an energy-efficient semiconductor process (FD-SOI).

Partners: [logos]

Mont-Blanc Prototype Ecosystem

Outline: 1. Pedraforca Prototype Architecture 2. Evaluation Application 3. Exploiting Dynamic Parallelism 4. Some Benchmarks and Results 5. Limitations & Conclusions

Pedraforca: Prototype Node Architecture. E4 ARKA single-node desktop unit.

Pedraforca: Cluster
- 3 bullx 1200 racks
- 78 compute nodes
- 2 login nodes
- 4 36-port InfiniBand switches (MPI)
- 2 50-port GbE switches (storage)

Comparing Power Budgets

x86_64-based system (quad-core Intel i5-3570K @ 3.4 GHz, ASUS P8Z77-V Pro), max power usage:
- Tesla K20: 235 W
- Board: 80 W
- CPU: 90 W
- Total: 405 W

Low-power ARM system (Tegra 3, quad-core ARM A9 @ 1.3 GHz, Mini-ITX carrier), max power usage:
- Tesla K20: 235 W
- Board: 25 W
- CPU: 5 W
- Total: 265 W

Outline: 1. Pedraforca Prototype Architecture 2. Evaluation Application 3. Exploiting Dynamic Parallelism 4. Some Benchmarks and Results 5. Limitations & Conclusions

Thick-Restarted Lanczos Algorithm in Lattice QCD

Each point on the lattice holds an SU(3) vector (complex double), and the links connecting points are 3x3 SU(3) matrices (complex double), one configuration per time step t. A thick-restarted Lanczos algorithm is used to generate eigenpairs of the lattice. About 80% of the runtime is spent in cuBLAS routines, with on average 60,000-90,000 cuBLAS calls depending on the lattice configuration. Lattices from multiple time steps are processed in parallel.
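
As a point of reference, the per-site data might be laid out along these lines (a minimal sketch; the type and field names are illustrative, not from the original code):

    // Illustrative lattice data layout (names are assumptions).
    #include <cuComplex.h>

    struct SU3Vector { cuDoubleComplex c[3]; };     // one per lattice site
    struct SU3Matrix { cuDoubleComplex m[3][3]; };  // one per link between sites

    struct TimeSlice {
        int nsites;         // sites in this time step
        SU3Vector *site;    // device array, nsites entries
        SU3Matrix *link;    // device array, one matrix per link direction per site
    };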

Evaluation Example: Lanczos Iteration

Starting from an initial vector v_0, each of the N iterations performs:
- Apply matrix: V_i = A(V_i-1)
- Compute alpha: α_i = dot(V_i, V_i-1)
- AXPY kernel: V_i = V_i - α_i V_i-1 - β_i-1 V_i-2
- Global orthogonalization
- Compute beta: β_i = Euclidean norm of V_i
- New subspace vector: V_i = V_i / β_i

The iteration involves a large number of BLAS operations and is dominated by the global orthogonalization module, which itself consists of BLAS calls. It is implemented using cuBLAS, highly modularized and easy to use. Iterations are not independent of each other.
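
Spelled out as host code, one iteration maps almost one-to-one onto cuBLAS level-1 calls (a minimal sketch assuming the cuBLAS v2 host API; Applymatrix and the vector arguments are placeholders, and the global orthogonalization step is omitted):

    // One Lanczos iteration driven from the CPU (sketch only).
    #include <cublas_v2.h>
    #include <cuComplex.h>

    __global__ void Applymatrix(const cuDoubleComplex *in, cuDoubleComplex *out, int n);

    void lanczos_step(cublasHandle_t h, int n,
                      cuDoubleComplex *v_prev2,   // V_i-2 (device)
                      cuDoubleComplex *v_prev,    // V_i-1 (device)
                      cuDoubleComplex *v,         // V_i   (device, output)
                      double beta_prev,           // beta_i-1
                      double *beta)               // beta_i (output)
    {
        cuDoubleComplex alpha, coef;

        Applymatrix<<<(n + 255) / 256, 256>>>(v_prev, v, n);   // V_i = A(V_i-1)

        cublasZdotc(h, n, v, 1, v_prev, 1, &alpha);            // alpha_i = dot(V_i, V_i-1)

        coef = make_cuDoubleComplex(-cuCreal(alpha), -cuCimag(alpha));
        cublasZaxpy(h, n, &coef, v_prev, 1, v, 1);             // V_i -= alpha_i * V_i-1

        coef = make_cuDoubleComplex(-beta_prev, 0.0);
        cublasZaxpy(h, n, &coef, v_prev2, 1, v, 1);            // V_i -= beta_i-1 * V_i-2

        cublasDznrm2(h, n, v, 1, beta);                        // beta_i = ||V_i||

        coef = make_cuDoubleComplex(1.0 / *beta, 0.0);
        cublasZscal(h, n, &coef, v, 1);                        // V_i /= beta_i
        // (global orthogonalization against the stored basis omitted)
    }

Each of these calls is an individual kernel launch from the host, which is exactly the property the next section exploits.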

Algorithm Implementation for the Prototype

The CPU works as coordinator (CPU pipeline: start to end), while the GPU executes kernels as a slave (GPU pipeline: apply matrix, cuBLAS dot kernel, cuBLAS AXPY kernel, with serial dependencies between them). Bottlenecks: there is a large number of calls to cuBLAS, the overall algorithm is serial, and performance is dominated by the CPU's capability to launch cuBLAS kernels. The ARM processor is not fast enough to quickly launch kernels on the GPU, so the GPU is underutilized.

Outline: 1. Pedraforca Prototype Architecture 2. Evaluation Application 3. Exploiting Dynamic Parallelism 4. Some Benchmarks and Results 5. Limitations & Conclusions

Exploiting Dynamic Parallelism

The reason for dynamic parallelism is to let the GPU adapt to the data. Can we exploit it further, to remove bottlenecks and save power?
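
As a reminder of the mechanism itself (a generic sketch, not the application code), dynamic parallelism lets a kernel size and launch child kernels from the device, with no round trip to the CPU:

    // Generic dynamic-parallelism sketch: the parent reads a data-dependent
    // size on the device and launches a child kernel accordingly.
    __global__ void child(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    __global__ void parent(float *data, const int *sizes)
    {
        int n = sizes[blockIdx.x];                  // work size discovered on the GPU
        child<<<(n + 127) / 128, 128>>>(data, n);   // launched without CPU involvement
    }

Dynamic parallelism requires compute capability 3.5 or higher (e.g. the Tesla K20 used here) and compilation with relocatable device code (-rdc=true).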

Approach for Exploiting Dynamic Parallelism on the Low-Power Prototype

In the original scheme the CPU works as coordinator: the CPU pipeline launches the apply-matrix, cuBLAS dot and cuBLAS AXPY kernels one by one, and the GPU executes them as a slave, with serial dependencies between them. In the new scheme the CPU only starts and ends a wrapper kernel; the wrapper, with a single control thread, coordinates the tasks on the GPU itself, launching the apply-matrix, cuBLAS dot and cuBLAS AXPY kernels while respecting the same serial dependencies.

Example code 1 - Simple Wrapper

Original code:

    __global__ void Applymatrix(..., ...);

    int main() {
        copyToGPU();
        Applymatrix<<<..., ...>>>();
        cublasZdot();
        cublasZaxpy();
        copyFromGPU();
    }

Code with wrapper:

    __global__ void Applymatrix(..., ...);

    __global__ void wrapper(..., ...) {
        Applymatrix<<<..., ...>>>();
        cublasZdot();
        cublasZaxpy();
    }

    int main() {
        copyToGPU();
        wrapper<<<1,1>>>();
        copyFromGPU();
    }
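
Filled in a little for context, this is roughly what a compilable wrapper could look like with the device-side cuBLAS API that accompanied dynamic parallelism in this CUDA generation (the handle management, scratch pointer and synchronization points are our assumptions, not the original code):

    // Sketch only: wrapper using the CUDA 5/6-era device-side cuBLAS API
    // (compile with -arch=sm_35 -rdc=true, link cublas_device and cudadevrt).
    #include <cublas_v2.h>
    #include <cuComplex.h>

    __global__ void Applymatrix(cuDoubleComplex *in, cuDoubleComplex *out, int n);

    __global__ void wrapper(cuDoubleComplex *v_prev, cuDoubleComplex *v,
                            cuDoubleComplex *alpha,  // global-memory scratch (assumption)
                            int n)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);                   // handle created on the device

        Applymatrix<<<(n + 255) / 256, 256>>>(v_prev, v, n);
        cudaDeviceSynchronize();                 // child kernel must finish first
                                                 // (device-side sync was legal in this era)
        cublasZdotc(handle, n, v, 1, v_prev, 1, alpha);
        cudaDeviceSynchronize();                 // *alpha is needed below

        cuDoubleComplex minus_alpha =
            make_cuDoubleComplex(-cuCreal(*alpha), -cuCimag(*alpha));
        cublasZaxpy(handle, n, &minus_alpha, v_prev, 1, v, 1);

        cublasDestroy(handle);
    }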

Multiple Threads in Wrapper

When the wrapper is executed with more than one thread in order to process multiple instances, e.g. wrapper<<<1,2>>>(), there is a problem: threads in the same block launch their kernels one after another, so the multiple instances are not executed simultaneously.

Our goal is to remove the bottleneck caused by multiple threads in the wrapper. The solution: CUDA streams created on the GPU side, so that each wrapper thread's sequence (apply matrix, cuBLAS dot kernel, cuBLAS AXPY kernel) can proceed concurrently with the others.

Solution: Processing Multiple Instances with CUDA Streams

Modification to the code: each wrapper thread creates its own non-blocking stream on the device and issues its work into it.

    __global__ void wrapper(..., ...)
    {
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
        cublasSetStream(handle, stream);

        Applymatrix<<<grid, block, 0, stream>>>();
        cublasZdot();
        cublasZaxpy();

        cudaStreamDestroy(stream);
    }

With two wrapper threads, the two instances (apply matrix, cuBLAS dot kernel, cuBLAS AXPY kernel) now overlap on the GPU instead of serializing.
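
On the host side, processing several instances then collapses to a single launch (a sketch; the d_instances argument and its marshalling are illustrative):

    // Host side: one launch drives two concurrent instances; each wrapper
    // thread creates its own non-blocking stream as shown above.
    wrapper<<<1, 2>>>(d_instances);   // per-instance state, indexed by threadIdx.x
    cudaDeviceSynchronize();          // wait for both instances

The non-blocking flag is the key design choice: launches issued by different threads into a block's default stream serialize, which is precisely the behaviour the previous slide diagnosed.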

Outline: 1. Pedraforca Prototype Architecture 2. Evaluation Application 3. Exploiting Dynamic Parallelism 4. Some Benchmarks and Results 5. Limitations & Conclusions

cuBLAS Kernel Launch Scaling

    No. of kernel calls   cuBLAS calls by CPU (s)   cuBLAS calls by GPU thread (s)   Speedup
    1 x 10^3                      1.72                       1.43                    1.20x
    3 x 10^3                      2.23                       1.62                    1.37x
    5 x 10^3                      4.70                       2.90                    1.62x
    10 x 10^3                     7.52                       3.50                    2.14x
    50 x 10^3                    11.78                       4.20                    2.80x

The cuBLAS level-1 routines measured break down as roughly 40% reduction kernels, 30% AXPY and 30% dot product. [Chart: speedup versus number of cuBLAS calls.]

Application Performance (High-Frequency CPU)

Execution time in seconds on a quad-core Intel i5-3570K @ 3.4 GHz, by lattice size:

    Lattice size                     24     32     48
    Kernel calls by CPU             4.4    6.4   11.2
    Kernel calls by CPU (streams)   2.3    4.1    7.5
    Kernel calls by GPU             5.2    7.6   12.8
    Kernel calls by GPU (streams)   2.8    5.2    8.7

Code with the wrapper may be slower on a system with a fast CPU.

Application Performance (Pedraforca Prototype)

Execution time in seconds on the Tegra 3 (quad-core ARM A9 @ 1.3 GHz), by lattice size:

    Lattice size                     24     32     48
    Kernel calls by CPU            13.6   23.5   40.6
    Kernel calls by CPU (streams)  15.2   20.4   36.4
    Kernel calls by GPU             5.3    7.5   13.1
    Kernel calls by GPU (streams)   2.7    5.2    9.0

Code with the wrapper kernel performs better on the ARM-based system.

Comparing Systems

System A: quad-core Intel i5-3570K @ 3.4 GHz with a Tesla K20. System B: quad-core ARM A9 @ 1.3 GHz (Tegra 3) with a Tesla K20.

Comparing Power Footprint (Without CUDA Streams)

A: all kernels launched by the CPU (quad-core Intel i5-3570K @ 3.4 GHz). B: all kernels launched by the GPU (Tegra 3, quad-core ARM A9 @ 1.3 GHz).

    QCD lattice   Execution time (s)   Average power (W)   Energy consumption (J)
    size             A        B           A       B            A          B
    24              4.4      5.3         367     245         1614.8     1298.5
    32              6.4      7.5         359     246         2297.6     1845.0
    48             11.2     13.1         365     243         4088.0     3183.3

[Chart: energy savings of B over A, roughly 19-22% across lattice sizes.]
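
The energy column is simply E = P x t: for lattice size 24, 367 W x 4.4 s = 1614.8 J on system A versus 245 W x 5.3 s = 1298.5 J on system B, so B saves (1614.8 - 1298.5) / 1614.8, about 19.6%, despite running longer.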

Comparing Power Footprint (With CUDA Streams)

A: all kernels launched by the CPU (quad-core Intel i5-3570K @ 3.4 GHz). B: all kernels launched by the GPU (Tegra 3, quad-core ARM A9 @ 1.3 GHz).

    QCD lattice   Execution time (s)   Average power (W)   Energy consumption (J)
    size             A        B           A       B            A          B
    24              2.3      2.7         420     286          966.0      772.2
    32              4.1      5.2         426     287         1746.6     1392.4
    48              7.5      9.0         425     282         3187.5     2538.0

[Chart: energy savings of B over A, roughly 20% across lattice sizes.]

Scaling Across the Cluster

State-of-the-art technologies such as GPUDirect and CUDA-aware MPI can significantly improve data transfers among multiple nodes. The wrapper kernel ensures that the low-frequency CPU has sufficient time for communication. [Diagrams: without GPUDirect, I/O with the network card is staged through the CPU while the GPU pipeline carries the dynamic CUDA processing load; with GPUDirect, the network card accesses GPU memory directly.]
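
For illustration, this is roughly what a CUDA-aware MPI transfer looks like (a sketch; it assumes an MPI build with CUDA support, in which device pointers can be handed directly to MPI and GPUDirect lets the interconnect reach GPU memory without staging through the host):

    /* Minimal CUDA-aware MPI sketch: device buffers passed directly to MPI
       (requires a CUDA-aware MPI build; the buffer size is illustrative). */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *d_buf;
        cudaMalloc(&d_buf, 1024 * sizeof(double));

        if (rank == 0)
            MPI_Send(d_buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   // from GPU memory
        else if (rank == 1)
            MPI_Recv(d_buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                               // into GPU memory

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }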

Pedraforca Limitations

The current node uses a 32-bit SoC. Good news: driver support now covers both 32-bit and 64-bit SoCs, with up to 4 GB of memory supported.

Conclusions

With CUDA dynamic parallelism and CUDA streams in action, we are able to save roughly 20% of the energy on the Pedraforca prototype. CUDA dynamic parallelism helps reduce GPU-CPU communication, so a faster CPU is not always necessary. More libraries supporting dynamic parallelism need to be developed. Embedding ARM cores inside a big accelerator like a Tesla could be promising.