Transcription:

@NCInews

NCI and Raijin National Computational Infrastructure

Our Partners

Accelerators are general purpose, highly parallel processors with high FLOPs/watt and FLOPs/$; the unit of execution is the kernel, and they have a separate memory subsystem.

GPGPU example: NVIDIA Tesla K80
- 2 x 2496 cores (562 MHz / 875 MHz)
- 2 x 12 GB RAM
- 500 GB/s memory bandwidth
- 2.91 TFLOPS double precision, 8.74 TFLOPS single precision

Coprocessor example: Intel Xeon Phi 7120X (MIC architecture)
- 61 cores (244 threads)
- 16 GB RAM
- 352 GB/s memory bandwidth
- 1.2 TFLOPS double precision
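As a minimal sketch (not from the slides, assuming a standard CUDA toolchain) of what "kernel as the unit of execution" and "separate memory subsystem" mean in practice: data is copied into device memory, a kernel is launched across many threads, and results are copied back.

// Minimal CUDA sketch: explicit device memory plus a kernel launch.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *a, float c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= c;                      // each thread handles one element
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                          // separate device memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                // kernel launch
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    free(h);
    return 0;
}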

Dell C4130 node topology: CPU0 (12 cores) and CPU1 (12 cores) connected by QPI; four Tesla K80 cards (GPU0/GPU1, GPU2/GPU3, GPU4/GPU5, GPU6/GPU7), each on a PCIe-3 x16 link; IB FDR on a PCIe-3 x8 link.

Raijin node vs. Dell C4130 node:
- Processor: 2 SandyBridge Xeon E5-2670 CPUs vs. 2 Haswell Xeon E5-2670 v3 CPUs
- #Cores: 16 vs. 24
- Memory: 32 GB vs. 128 GB
- Network: InfiniBand FDR vs. InfiniBand FDR
- Accelerator: none vs. 4 NVIDIA Tesla K80s

NVIDIA Tesla K80 GPU. One K80 board carries two GK210 GPUs behind an on-board PCIe switch, each with 12 GB GDDR5, attached to the host through a PCIe Gen3 connector (15.7 GB/s).
- CUDA cores: 4992
- Memory: 24 GB (48 x 256M)
- Memory bandwidth: 480 GB/s (384-bit wide)
- Clock: 562 MHz base, 875 MHz max
- Power: 300 W max
- Single precision: 5.61 / 8.74 TFLOPS (base / max clock)
- Double precision: 1.87 / 2.91 TFLOPS (base / max clock)
- Architecture: Kepler

GK210 and SMX
- Number of SMX: 13 (of 15 in the full GK210 design)
- Manufacturing: TSMC 28 nm
- Register file size: 512 KB per SMX
- Shared memory / L1 cache: 128 KB per SMX
- Transistor count: 7.1 B
- Per SMX: 192 single-precision cores, 64 double-precision units, 32 special-function units, 32 load/store units
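Several of these figures can be checked at run time through the CUDA runtime API. The sketch below is not from the deck; it simply queries each visible device (a K80 board appears as two devices) and prints SM count, clock, memory size, and per-SM register file and shared memory as reported by cudaGetDeviceProperties.

// Query device properties to cross-check the board and SMX specs above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);                    // one K80 board shows up as two devices
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("device %d: %s\n", d, p.name);
        printf("  SMs: %d, clock: %.0f MHz, memory: %.1f GB\n",
               p.multiProcessorCount, p.clockRate / 1000.0,
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  register file per SM: %d KB, shared memory per SM: %zu KB\n",
               p.regsPerMultiprocessor * 4 / 1024,        // 32-bit registers -> KB
               p.sharedMemPerMultiprocessor / 1024);
    }
    return 0;
}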

Software Stack
- OS: CentOS, kernel 3.14.46.el6.x86_64
- OFED: Mellanox OFED 2.1
- Host compiler: Intel-CC/12.1.9.293
- MPI library: OpenMPI/1.6.5
- MKL library: Intel-MKL/12.1.9.293
- CUDA toolkit: CUDA 6.5
- CUDA driver: 340.87

HPL GFLOPS and Speedups
[Chart: measured HPL GFLOPS and speedup relative to one Raijin node for the Raijin, Haswell, Haswell+2K80s, Haswell+4K80s, 2-node 4K80 and 2-node 8K80 configurations. The Raijin baseline is 302 GFLOPS (1.00x), the Haswell node reaches 742 GFLOPS (2.46x), and the best result is 13960 GFLOPS (46.20x) on two nodes with 8 K80s; the intermediate GPU configurations fall in between.]
- Binary version: hpl-2.1_cuda-6.5 from NVIDIA
- Auto boost is used in all the tests (a manually set clock may give better results)
- Some experiments are not fully tuned (e.g., half GPUs)
- Speedups are based on one Raijin node

Power Consumption (Watts)
[Charts: system power (W) over time during HPL for the 2-node 4K80, 2-node 8K80, Haswell+2K80 and Haswell+4K80 configurations, on a 0-3500 W scale.]
- System power readings taken from ipmi-sensors
- As a reference, 2 Raijin nodes consume ~600 W

GPU Autoboost Clock
[Chart: GPU clock (MHz) and GPU power (W) over time while benchmarking HPL on a single node using 8 GPUs.]
- Power consumption shown is for the GPUs only
- The clock ranges from 374 MHz to 875 MHz

NAMD
[Chart: NAMD speedup relative to one Raijin node for the apoa, f1atpase and stmv benchmarks on Raijin, Haswell (24 cores), Haswell+2K80s and Haswell+4K80s. Raijin is the 1.00 baseline, the Haswell node gives roughly 2.0-2.3x, and the K80 configurations range from about 5.7x up to 13.7x.]
- GPU version: NAMD 2.10_Linux_x86_64-multicore-CUDA
- CPU version: NAMD 2.10
- Speedups are based on one Raijin node

NAMD STMV Comparison with Raijin
[Chart: days/ns and power versus number of Raijin nodes (4 to 36), compared against one GPU node.]
- The performance of 24 Raijin nodes using MPI is similar to one GPU node: 0.696 vs. 0.681 days/ns
- The power consumption is 5463 W for the 24 nodes compared to 3111 W for the GPU node

HPL Tuning on a GPU node
[Chart: HPL GFLOPS on a single node with 8 GPUs, same input, for three code versions: fermi (3804), naïve (5936) and highly-tuned (6659).]
- Code version does matter: from the fermi code to the NVIDIA hpl-2.1 binary
- Tuning does matter: an optimised binary alone is not sufficient

Hybrid Programming Model
- NUMA-aware, accelerator-aware: 1 billion vs. 1000 x 1000 x 1000
- MPI + OpenMP + CUDA (a minimal sketch of this combination follows below)

Accelerator programming options: CUDA, OpenMP 4.0, OpenCL, OpenACC, MIC
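The following is a hedged sketch of the MPI + OpenMP + CUDA combination, not code from the deck. It maps each rank to a GPU via OMPI_COMM_WORLD_LOCAL_RANK, the same variable the HPL run script uses later; the kernel and the reduction are placeholders.

// Hybrid MPI + OpenMP + CUDA sketch: one MPI rank per GPU,
// a few OpenMP threads on the host cores owned by that rank.
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

__global__ void init(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 1.0f;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Map this rank to a GPU; OpenMPI exports the node-local rank here.
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local = lr ? atoi(lr) : 0;
    int ndev = 1;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(local % ndev);

    // CUDA part: each rank works on its own slice on its own device.
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    init<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // OpenMP part: host threads do CPU-side work in parallel.
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) sum += 1.0;

    // MPI part: combine per-rank results across the job.
    double total = 0.0;
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %.0f across %d ranks\n", total, size);

    cudaFree(d);
    MPI_Finalize();
    return 0;
}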

Locality and Affinity Display
Multi-GPU and IB affinity, NUMA locality (node r3596): CPU0 (12 cores) and CPU1 (12 cores) connected by QPI; four K80 cards (GPU0/GPU1, GPU2/GPU3, GPU4/GPU5, GPU6/GPU7) on PCIe-3 x16 links; IB FDR on an x8 link, with the 56 Gb/s InfiniBand connection to node r3597.

Execution Model of HPL

# job script
module load openmpi/1.6.5 cuda/6.5 ...
export OMP_NUM_THREADS=3
export MKL_NUM_THREADS=3
mpirun -np 16 --bind-to-none ... ./run_script

# run_script: bind each local rank to one GPU and three physical cores
export OMP_NUM_THREADS=3
export MKL_NUM_THREADS=3
case $OMPI_COMM_WORLD_LOCAL_RANK in
  [0]) export CUDA_VISIBLE_DEVICES=0
       numactl --physcpubind=0,2,4 ./xhpl ;;
  [1]) export CUDA_VISIBLE_DEVICES=1
       numactl --physcpubind=6,8,10 ./xhpl ;;
  ...
  [7]) export CUDA_VISIBLE_DEVICES=7
       numactl --physcpubind=19,21,23 ./xhpl ;;
esac

A Program View from a Computer Scientist
Resource utilisation of a program on a machine:
- Computation: CPU, ILP, parallelism
- Memory: caching, conflicts, locality
- Communication: bandwidth, latency
- I/O: I/O caching, granularity

PAPI CUDA Component
The papi/5.4.1-cuda module on Raijin supports CUDA counters. Sample output (process GPU results on 8 GPUs):
PAPI counter value 4432  --> cuda:::device:0:inst_executed
PAPI counter value 9977  --> cuda:::device:0:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:1:inst_executed
PAPI counter value 10228 --> cuda:::device:1:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:2:inst_executed
PAPI counter value 9961  --> cuda:::device:2:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:3:inst_executed
PAPI counter value 9885  --> cuda:::device:3:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:4:inst_executed
PAPI counter value 9942  --> cuda:::device:4:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:5:inst_executed
PAPI counter value 9852  --> cuda:::device:5:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:6:inst_executed
PAPI counter value 9836  --> cuda:::device:6:elapsed_cycles_sm
PAPI counter value 4432  --> cuda:::device:7:inst_executed
PAPI counter value 9757  --> cuda:::device:7:elapsed_cycles_sm
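A hedged sketch of how one counter of this kind can be collected through PAPI's named-event interface; the event name follows the output above, while the surrounding kernel and error handling are illustrative assumptions rather than the deck's code.

// Reading a CUDA component counter via PAPI's named-event API.
// Assumes a PAPI build with the CUDA component enabled (e.g. papi/5.4.1-cuda).
#include <cstdio>
#include <papi.h>
#include <cuda_runtime.h>

__global__ void busy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }

    int evset = PAPI_NULL;
    PAPI_create_eventset(&evset);

    // The CUDA component needs an active context on the device before
    // its events can be added, so touch the device first.
    cudaSetDevice(0);
    cudaFree(0);

    char event[] = "cuda:::device:0:inst_executed";
    if (PAPI_add_named_event(evset, event) != PAPI_OK)
        fprintf(stderr, "could not add %s\n", event);

    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    long long value = 0;
    PAPI_start(evset);
    busy<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    PAPI_stop(evset, &value);

    printf("PAPI counter value %lld --> %s\n", value, event);

    cudaFree(d);
    PAPI_shutdown();
    return 0;
}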

Performance Modelling
Performance modelling (or performance expectation):
- estimate baseline performance
- estimate potential benefit
- identify critical resources
Benchmarking is not performance modelling; combine performance tools with analytical methods.

Example profile breakdown:
- Compute: 61.3% of walltime (17.5% in scalar numeric ops, 2.5% in vector numeric ops, 80.0% in memory accesses)
- MPI: 31.8% of walltime (57.6% in collective calls at a per-process rate of 12.6 MB/s, 42.4% in point-to-point calls at 108 MB/s)
- I/O: 6.9% of walltime (0% in reads at a per-process read rate of 0 MB/s, 100% in writes at a per-process write rate of 28.9 MB/s)

Computational Intensity
Computational intensity (CI) = number of calculation operations per memory load/store.

Example loop                          CI     Key factor
A(:) = B(:) + C(:)                    0.33   Memory
A(:) = c * B(:)                       0.5    Memory
A(:) = B(:) * C(:) + D(:)             0.5    Memory
A(:) = B(:) * C(:) + D(:) * E(:)      0.6    Memory
A(:) = c * B(:) + d * C(:)            1.0    Still memory
A(:) = c + B(:) * (d + B(:) * e)      2.0    Calculation
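To illustrate how the last row's CI of 2.0 comes about, here is a small sketch in C (not from the slides): the loop performs 4 floating-point operations per iteration against 1 load and 1 store.

// CI example: A(:) = c + B(:) * (d + B(:) * e)
// Per iteration: 2 multiplies + 2 adds = 4 flops; 1 load of B[i] + 1 store of A[i]
// = 2 memory operations, so CI = 4 / 2 = 2.0 (calculation-bound, per the table).
void kernel_ci2(float *a, const float *b, float c, float d, float e, int n) {
    for (int i = 0; i < n; ++i) {
        float bi = b[i];              // single load of B(i), reused
        a[i] = c + bi * (d + bi * e);
    }
}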

Working and To Do
Profiling tools
- nvprof, nsight, etc.
- PAPI CUDA components
CUDA-aware MPI
- OpenMPI 1.10.0 built with CUDA awareness
- GPU Direct RDMA
PBS scheduling and GPUs
- resource utilisation
- nvidia-smi permissions

References
- Tesla K80 GPU Accelerator Board Specification, Jan 2015
- NVIDIA's CUDA Compute Architecture: Kepler GK110/210 (white paper)
- GPU Performance Analysis and Optimisation (NVIDIA), 2015
- OpenMPI with RDMA Support and CUDA
- GPU Hardware Execution Model and Overview, University of Utah, 2011
- NCI http://www.
- NVIDIA CUDA http://www.nvidia.com/object/cuda_home_new.html