HPC with Multicore and GPUs

HPC with Multicore and GPUs
Stan Tomov, Electrical Engineering and Computer Science Department, University of Tennessee, Knoxville
COSC 594 Lecture Notes, March 22, 2017
1/20

Outline
- Introduction: hardware trends
- Challenges of using multicore+GPUs
- How to code for GPUs and multicore: an approach that we will study
- Introduction to CUDA and the cs954 project/library
- Conclusions
2/20

Speeding up Computer Simulations
- Better numerical methods, e.g., a posteriori error analysis: solving for many fewer DOF while achieving the same accuracy (http://www.cs.utk.edu/~tomov/cflow/)
- Exploit advances in hardware:
  - Manage to use hardware efficiently for real-world HPC applications
  - Match the LU benchmark in performance!
3/20

Why multicore and GPUs? Hardware trends
- Power is the root cause of all this: multicore (Source: slide from Kathy Yelick)
- GPU accelerators: old GPUs delivered about 77 Gflop/s; compare with today's P100, reaching 4,700 Gflop/s in DP (Source: NVIDIA's GT200: Inside a Parallel Processor)
4/20

Main Issues
- Increase in parallelism (*1): How to code (programming model, language, productivity, etc.)? Despite these issues, high speedups on HPC applications are reported using GPUs (from the NVIDIA CUDA Zone homepage).
- Increase in communication cost vs computation (*2): How to redesign algorithms?
- Hybrid computing (*3): How to split and schedule the computation between hybrid hardware components?
Notes:
- *1: The CUDA architecture & programming model offer a data-parallel approach that scales; a comparable amount of effort spent on CPUs vs GPUs by domain scientists demonstrates the GPUs' potential.
- *2: Processor speed improves 59% / year, but memory bandwidth only by 23% and latency by 5.5%.
- *3: e.g., schedule small non-parallelizable tasks on the CPU, and large, parallelizable ones on the GPU.
5/20

Evolution of GPUs
- GPUs excel at graphics rendering: scene model -> streams of data -> graphics pipelined computation -> final image, repeated fast over and over (e.g., TV refresh rate is 30 fps; the limit is 60 fps).
- This type of computation:
  - Requires enormous computational power
  - Allows for high parallelism
  - Needs high bandwidth vs low latency (as low latencies can be compensated with a deep graphics pipeline)
- Currently, GPUs can be viewed as multithreaded multicore vector units.
- Obviously, this pattern of computation is common to many other applications.
6/20

Challenges of using multicore+GPUs
- Massive parallelism: many GPU cores, serial kernel execution [e.g., 240 cores in the GTX 280; up to 512 in Fermi, which adds concurrent kernel execution]
- Hybrid/heterogeneous architectures: match algorithmic requirements to architectural strengths [e.g., small, non-parallelizable tasks run on the CPU; large, parallelizable ones on the GPU]
- Compute vs communication gap: exponentially growing gap; persistent challenge [processor speed improves 59% / year, memory bandwidth 23%, latency 5.5%] [on all levels; e.g., a Tesla S1070 (4 x C1060) has a compute power of O(1,000) Gflop/s, but its GPUs communicate through the CPU over an O(1) GB/s connection]
7/20

How to Code for GPUs?
- Complex question: language, programming model, user productivity, etc.
- Recommendations:
  - Use CUDA / OpenCL [benefits already demonstrated in many areas; data-based parallelism; moving to support task-based parallelism]
  - Use GPU BLAS [high level; with the introduction of shared memory, data reuse became possible; leverage existing developments] (see the sketch below)
  - Use hybrid algorithms [currently GPUs offer massive parallelism but serial kernel execution; the hybrid approach runs small non-parallelizable tasks on the CPU and large parallelizable tasks on the GPU]
8/20
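To make the "Use GPU BLAS" recommendation concrete, here is a minimal cuBLAS sketch (not from the slides): it offloads a double-precision matrix multiply, a Level 3 BLAS operation, to the GPU. The matrix size, the constant test values, and the omitted error checking are placeholder choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void) {
        const int n = 1024;
        const double alpha = 1.0, beta = 0.0;
        double *A, *B, *C;        // host matrices (column-major)
        double *dA, *dB, *dC;     // device matrices

        A = (double*)malloc(n*n*sizeof(double));
        B = (double*)malloc(n*n*sizeof(double));
        C = (double*)malloc(n*n*sizeof(double));
        for (int i = 0; i < n*n; i++) { A[i] = 1.0; B[i] = 2.0; }

        cudaMalloc((void**)&dA, n*n*sizeof(double));
        cudaMalloc((void**)&dB, n*n*sizeof(double));
        cudaMalloc((void**)&dC, n*n*sizeof(double));

        cublasHandle_t handle;
        cublasCreate(&handle);

        // Copy the inputs to the GPU
        cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
        cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);

        // C = alpha*A*B + beta*C, entirely on the GPU (Level 3 BLAS)
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);

        // Copy the result back and spot-check it
        cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);
        printf("C[0] = %f (expected %d)\n", C[0], 2*n);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(A); free(B); free(C);
        return 0;
    }

The point of the recommendation is that such calls reuse tuned, high-level kernels rather than hand-written CUDA, which is exactly how MAGMA builds its GPU-side tasks.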

This is the programming model of MAGMA
TASK-BASED ALGORITHMS: MAGMA uses task-based algorithms where the computation is split into tasks of varying granularity and their execution is scheduled over the hardware components. Scheduling can be static or dynamic. In either case, small non-parallelizable tasks, often on the critical path, are scheduled on the CPU, and larger, more parallelizable ones, often Level 3 BLAS, are scheduled on the GPUs.
PERFORMANCE & ENERGY EFFICIENCY: [Figure: MAGMA LU factorization in double precision arithmetic; performance in Gflop/s vs matrix size N x N (2k to 36k) and energy efficiency in Gflop/s per Watt, using BLAS tasking + hybrid scheduling, comparing CPU (Intel Xeon E5-2650 v3 Haswell, 2x10 cores @ 2.30 GHz), one and two NVIDIA K40 GPUs (15 MP x 192 @ 0.88 GHz), and an NVIDIA Pascal P100 GPU (56 MP x 64 @ 1.19 GHz)]
9/20

MAGMA main thrusts and use
- MAGMA is a research vehicle for linear algebra algorithms and libraries for new architectures:
  - for architectures in { CPUs + Nvidia GPUs (CUDA), CPUs + AMD GPUs (OpenCL), CPUs + Intel Xeon Phis, manycore (native: GPU or KNL/CPU), embedded systems, combinations } and software stacks (e.g., since CUDA x)
  - for precisions in { s, d, c, z; adding half precision, real and complex, mixed, ... }
  - for interfaces in { heterogeneous CPU or GPU, native, ... }
  - functionality: LAPACK, BLAS, Batched LAPACK, Batched BLAS, Sparse
- MAGMA for CUDA: GPU Center of Excellence (GCOE) for the 8th year
- MAGMA for Xeon Phi: Intel Parallel Computing Center (IPCC), 5th year of collaboration with Intel on Xeon Phi
- MAGMA in OpenCL: collaboration with AMD
- Number of downloads for 2016 so far is 6,205 (around 3K per release; expected to be ~10K for 2016)
- MAGMA Forum: 2,727 + 312 (3,039) posts in 742 + 75 (817) topics, 1,038 + 317 (1,355) users
- MAGMA is incorporated in MATLAB (as of R2010b), with contributions in CUBLAS and MKL, AMD, Siemens (in NX Nastran 9.1), ArrayFire, ABINIT, Quantum-Espresso, R (in HiPLAR & CRAN), SIMULIA (Abaqus), MSC Software (Nastran and Marc), Cray (in LibSci for accelerators, libsci_acc), Nano-TCAD (Gordon Bell finalist), Numerical Template Toolbox (NumScale), and others.
10/20
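As a hedged illustration of the heterogeneous CPU interface listed above, the sketch below calls MAGMA's hybrid LU factorization on a host matrix. The names magma_dgetrf, magma_dmalloc_pinned, and the magma_v2.h header follow the MAGMA 2.x API as I understand it, and the matrix contents are arbitrary test data, so check the headers of your MAGMA release before relying on the exact calls.

    #include <stdio.h>
    #include <stdlib.h>
    #include "magma_v2.h"

    int main(void) {
        magma_init();

        magma_int_t n = 8192, lda = n, info = 0;
        double *A;
        magma_int_t *ipiv;

        // Pinned host memory speeds up the CPU<->GPU transfers MAGMA performs internally
        magma_dmalloc_pinned(&A, (size_t)lda * n);
        ipiv = (magma_int_t*)malloc(n * sizeof(magma_int_t));

        // Fill A (column-major) with diagonally dominant test data
        for (magma_int_t j = 0; j < n; j++)
            for (magma_int_t i = 0; i < n; i++)
                A[i + j*lda] = (i == j) ? (double)n : 1.0 / (1.0 + i + j);

        // Hybrid LU: panel factorizations on the CPU, trailing-matrix updates on the GPU
        magma_dgetrf(n, n, A, lda, ipiv, &info);
        printf("magma_dgetrf info = %lld\n", (long long)info);

        free(ipiv);
        magma_free_pinned(A);
        magma_finalize();
        return 0;
    }

The interface mirrors LAPACK's dgetrf, which is the design choice that lets MAGMA drop into existing LAPACK-based codes.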

An approach for multicore+GPUs
- Split algorithms into tasks and the dependencies between them, e.g., represented as DAGs
- Schedule the execution in parallel without violating data dependencies
- Algorithms as DAGs (small tasks/tiles for homogeneous multicore), e.g., in the PLASMA library for Dense Linear Algebra: http://icl.cs.utk.edu/plasma/
11/20

An approach for multicore+GPUs
- Split algorithms into tasks and the dependencies between them, e.g., represented as DAGs
- Schedule the execution in parallel without violating data dependencies
- Algorithms as DAGs (small tasks/tiles for homogeneous multicore), e.g., in the PLASMA library for Dense Linear Algebra: http://icl.cs.utk.edu/plasma/
- Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs), e.g., in the MAGMA library for Dense Linear Algebra: http://icl.cs.utk.edu/magma/
12/20
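To show what "algorithms as DAGs" can look like in code, here is a small sketch of a tile Cholesky factorization written with OpenMP task dependences over standard BLAS/LAPACK tile kernels. This is not PLASMA's or MAGMA's own implementation (they use their own runtimes and schedulers); the tile layout and kernel choices are illustrative assumptions. The runtime derives the DAG from the depend clauses and runs tasks in parallel without violating data dependencies.

    #include <cblas.h>
    #include <lapacke.h>
    #include <omp.h>

    // A is an NT x NT grid of column-major NB x NB tiles; tile (i,j) is A[i*NT + j].
    // Computes the lower Cholesky factor tile by tile.
    void tile_cholesky(double **A, int NT, int NB)
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < NT; k++) {
            // Factor the diagonal tile (small task, on the critical path)
            #pragma omp task depend(inout: A[k*NT+k][0])
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, A[k*NT+k], NB);

            // Panel: triangular solves against the diagonal factor
            for (int i = k+1; i < NT; i++) {
                #pragma omp task depend(in: A[k*NT+k][0]) depend(inout: A[i*NT+k][0])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, NB, NB, 1.0, A[k*NT+k], NB, A[i*NT+k], NB);
            }

            // Trailing-matrix updates: many independent Level 3 BLAS tasks
            for (int i = k+1; i < NT; i++) {
                #pragma omp task depend(in: A[i*NT+k][0]) depend(inout: A[i*NT+i][0])
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                            NB, NB, -1.0, A[i*NT+k], NB, 1.0, A[i*NT+i], NB);
                for (int j = k+1; j < i; j++) {
                    #pragma omp task depend(in: A[i*NT+k][0], A[j*NT+k][0]) \
                                     depend(inout: A[i*NT+j][0])
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, NB, NB, NB,
                                -1.0, A[i*NT+k], NB, A[j*NT+k], NB, 1.0, A[i*NT+j], NB);
                }
            }
        }
    }

In the hybrid CPU+GPU variant, the same DAG is used, but the small critical-path tasks (here the dpotrf) stay on the CPU while the large update tasks go to the GPU.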

Design algorithms to use mostly BLAS 3
[Figure: Level 1, 2, and 3 BLAS performance on an NVIDIA P100 (1.19 GHz, peak DP = 4,700 Gflop/s) vs matrix size N x N / vector size N, from 2k to 20k:
- dgemm (BLAS Level 3), C = C + A*B: 4,503 Gflop/s
- dgemv (BLAS Level 2), y = y + A*x: 145 Gflop/s
- daxpy (BLAS Level 1), y = alpha*x + y: 52 Gflop/s
Level 3 is about 31x faster than Level 2.]
13/20
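The 31x gap follows from arithmetic intensity (flops per word of data moved), a standard argument the slide leaves implicit: dgemm reuses each matrix entry O(N) times, while dgemv and daxpy do O(1) work per word and are therefore limited by memory bandwidth rather than by the GPU's compute peak. Roughly:

    % Approximate arithmetic intensities (flops per word touched)
    \[
      \text{dgemm: } \frac{2N^3}{3N^2} = \tfrac{2}{3}N, \qquad
      \text{dgemv: } \frac{2N^2}{N^2 + 2N} \approx 2, \qquad
      \text{daxpy: } \frac{2N}{3N} = \tfrac{2}{3}.
    \]

Hence the design rule: cast as much of the computation as possible into Level 3 BLAS.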

How to program in parallel?
- There are many parallel programming paradigms (to be covered with Prof. George Bosilca), e.g.: SPMD, master/worker, divide and conquer, pipeline, work pool, data parallel.
- In reality, applications usually combine different paradigms.
- CUDA and OpenCL have roots in the data-parallel approach (now adding support for task parallelism).
  http://developer.download.nvidia.com/compute/cuda/2_0/docs/nvidia_cuda_programming_guide_2.0.pdf
  http://developer.download.nvidia.com/opencl/nvidia_opencl_jumpstart_guide.pdf
14/20

Compute Unified Device Architecture (CUDA)
- Software stack: CUBLAS, CUFFT, MAGMA, ...
- C-like API
(Source: NVIDIA CUDA Programming Guide)
15/20

CUDA Memory Model (Source: NVIDIA CUDA Programming Guide) 16/20

CUDA Hardware Model (Source: NVIDIA CUDA Programming Guide) 17/20

CUDA Programming Model
- Grid of thread blocks (blocks of the same dimension, grouped together to execute the same kernel)
- Thread block (a batch of threads with fast shared memory) executes a kernel
- Sequential C code launches GPU kernels asynchronously:

    // set the grid and thread configuration
    dim3 dimBlock(3,5);
    dim3 dimGrid(2,3);
    // Launch the device computation
    MatVec<<<dimGrid, dimBlock>>>( ... );

    __global__ void MatVec( ... )
    {
        // Block index
        int bx = blockIdx.x;
        int by = blockIdx.y;
        // Thread index
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        ...
    }

(Source: NVIDIA CUDA Programming Guide)
18/20
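The slide's snippet is a fragment, so here is a self-contained sketch (not from the slides) of what a complete MatVec kernel and its launch might look like; the sizes, the one-thread-per-row mapping, and the use of managed memory are illustrative choices.

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Each thread computes one row of y = A*x; A is row-major M x N.
    __global__ void MatVec(const double *A, const double *x, double *y, int M, int N)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index = row
        if (row < M) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += A[row * N + j] * x[j];
            y[row] = sum;
        }
    }

    int main(void)
    {
        const int M = 1024, N = 1024;
        double *A, *x, *y;
        // Unified (managed) memory keeps the sketch short; explicit cudaMemcpy also works
        cudaMallocManaged(&A, (size_t)M * N * sizeof(double));
        cudaMallocManaged(&x, N * sizeof(double));
        cudaMallocManaged(&y, M * sizeof(double));
        for (int i = 0; i < M * N; i++) A[i] = 1.0;
        for (int j = 0; j < N; j++)     x[j] = 1.0;

        // One thread per row: same grid/block launch syntax as on the slide
        dim3 dimBlock(256);
        dim3 dimGrid((M + dimBlock.x - 1) / dimBlock.x);
        MatVec<<<dimGrid, dimBlock>>>(A, x, y, M, N);
        cudaDeviceSynchronize();                           // kernel launches are asynchronous

        printf("y[0] = %f (expected %d)\n", y[0], N);
        cudaFree(A); cudaFree(x); cudaFree(y);
        return 0;
    }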

Jetson TK1
- A full-featured platform for embedded applications: it allows you to unleash the power of 192 CUDA cores to develop solutions in computer vision, robotics, medicine, security, and automotive.
- You have accounts on astro.icl.utk.edu (for Ozgur Cekmer and Yasser Gandomi) and rudi.icl.utk.edu (for Yuping Lu and Eduardo Ponce).
- Board features:
  - Tegra K1 SOC: Kepler GPU with 192 CUDA cores; 4-Plus-1 quad-core ARM Cortex-A15 CPU
  - 2 GB x16 memory with 64-bit width
  - 16 GB eMMC 4.51 memory
  - https://developer.nvidia.com/jetson-tk1
19/20

Conclusions
Hybrid multicore+GPU computing:
- Architecture trends: towards heterogeneous/hybrid designs
- Can significantly accelerate linear algebra [vs just multicores]
- Can significantly accelerate algorithms that are slow on homogeneous architectures
20/20