CS 470 Spring 2016: Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)


Parallel Systems

Shared memory (uniform global address space)
- Primary story: make faster computers
- Programming paradigm: threads
- Technologies: Pthreads, OpenMP

Distributed (non-uniform memory access, NUMA)
- Primary story: add more computers
- Programming paradigm: message passing
- Technologies: MPI (OpenMPI/MPICH), SLURM

Where do we go from here?
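
To make the two paradigms above concrete, here is a minimal sketch in C (an illustration, not from the original slides): an OpenMP reduction for the shared-memory model next to an MPI send/receive for message passing. The toy sum and the two-rank exchange are assumptions for the example; build with something like mpicc -fopenmp.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        /* Shared memory: threads within one process split the loop */
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; i++)
            sum += i;

        /* Message passing: separate processes exchange explicit messages */
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs >= 2) {
            if (rank == 0) {
                MPI_Send(&sum, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank 1 received sum = %f\n", sum);
            }
        }
        MPI_Finalize();
        return 0;
    }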

A brief digression into gaming
- 1970s: arcades began using specialized graphics chips
- 1980s: increasingly sophisticated capabilities (e.g., sprites, blitters, scrolling)
- Early-to-mid 1990s: first 3D consoles (e.g., N64) and 3D accelerator cards for PCs
- Late 1990s: the classic wars begin: Nvidia vs. ATI and DirectX vs. OpenGL
- Early 2000s: creation of "shaders" (easier non-graphical use of accelerators)
- Late 2000s: rise of General-Purpose GPU (GPGPU) frameworks
  - 2007: Compute Unified Device Architecture (CUDA) released (newer library: Thrust)
  - 2009: OpenCL standard released
  - 2011: OpenACC standard released

GPU Programming "Kernels" run on a batch of threads Distributed onto many low-powered GPU cores Grouped into blocks of cores and grids of blocks Limited instruction set that operates on vector data Must copy data to/from main memory

GPU Programming (CUDA)

Low-level control of parallelism on the GPU:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    // (the block count rounds up so that all n elements are covered)
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

GPU Programming (CUDA)

Must micromanage memory usage and data movement:

    #include <stdio.h>
    #include <stdlib.h>

    // Kernel that executes on the CUDA device
    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = a[idx] * a[idx];
    }

    // main routine that executes on the host
    int main(void)
    {
        float *a_h, *a_d;  // Pointers to host & device arrays
        const int N = 10;  // Number of elements in arrays
        size_t size = N * sizeof(float);
        a_h = (float *)malloc(size);       // Allocate array on host
        cudaMalloc((void **) &a_d, size);  // Allocate array on device

        // Initialize host array and copy it to CUDA device
        for (int i = 0; i < N; i++) a_h[i] = (float)i;
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

        // Do calculation on device (rounding the block count up)
        int block_size = 4;
        int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
        square_array<<<n_blocks, block_size>>>(a_d, N);

        // Retrieve result from device and store it in host array
        cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

        // Print results and cleanup
        for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
        free(a_h);
        cudaFree(a_d);
        return 0;
    }
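
A usage note (not on the original slide): code like this is conventionally placed in a .cu file and compiled with NVIDIA's nvcc, e.g. nvcc square.cu -o square; the <<<...>>> kernel-launch syntax is a CUDA extension that ordinary C compilers will not accept.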

GPU Programming (OpenACC)

Fewer modifications required, but may not parallelize effectively:

    #pragma acc data copy(A) create(Anew)
    while (error > tol && iter < iter_max) {
        error = 0.0;

        #pragma acc kernels
        {
            #pragma acc loop
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < m-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                                       + A[j-1][i] + A[j+1][i]);
                    error = fmax(error, fabs(Anew[j][i] - A[j][i]));
                }
            }

            #pragma acc loop
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < m-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
        }

        if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }
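
A usage note (an assumption about typical toolchains, not from the slide): OpenACC pragmas take effect only under an OpenACC-capable compiler, e.g. the PGI/NVIDIA HPC compilers with pgcc -acc or nvc -acc; without such a flag the directives are ignored and the loops run serially on the host.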

Hybrid HPC architectures

Highly parallel on the node
- Hardware: CPU w/ accelerators (GPUs or manycore processors, e.g., Intel Phi and Sunway)
- Technologies: OpenMP, CUDA, OpenACC, OpenCL

Distributed between nodes
- Hardware: interconnect and distributed file system
- Technologies: InfiniBand, Lustre, HDFS

Top10 systems (Spring 2016)

Top10 systems (Spring 2017)

Top10 systems Spring 2016 Spring 2017

Cloud Computing
- Homogeneous centralized nodes
- Infrastructure as a Service (IaaS) and Software as a Service (SaaS)
- Hardware: large datacenters with thousands of servers and a high-speed internet connection
- Software: virtualized OS and custom software (Docker, etc.)

Grid Computing
- Heterogeneous nodes in disparate physical locations
- Solving problems or performing tasks of interest to a large number of diverse groups
- Hardware: different CPUs, GPUs, memory layouts, etc.
- Software: different OSes; middleware and projects such as Folding@Home, Condor, and GIMPS

Aside: linear algebra
- Many scientific phenomena can be modeled as matrix operations
  - Differential equations, mesh simulations, view transforms, etc.
- Very efficient on vector processors (including GPUs)
  - Data decomposition and SIMD parallelism
- Dense matrices vs. sparse matrices
- Popular packages: BLAS, LINPACK, LAPACK (see the example below)
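
To illustrate the "use a library" point, here is a sketch of the same SAXPY operation as the CUDA example, but through the CBLAS interface (an illustration, not from the slides; cblas.h is provided by whichever BLAS is installed, e.g. OpenBLAS, and linking flags such as -lopenblas vary):

    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        float x[4] = {1, 2, 3, 4};
        float y[4] = {10, 20, 30, 40};

        // y = a*x + y with a = 2.0, stride 1 through both arrays
        cblas_saxpy(4, 2.0f, x, 1, y, 1);

        for (int i = 0; i < 4; i++)
            printf("%f\n", y[i]);   // 12, 24, 36, 48
        return 0;
    }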

Dense vs. sparse matrices
- A sparse matrix is one in which most elements are zero
- Can lead to load imbalance
- Can be stored more efficiently, allowing for larger matrices (see the CSR sketch below)
- Dense matrix operations no longer work
- It is a challenge to make sparse operations as efficient as dense operations
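
To make the storage point concrete, here is a sketch of compressed sparse row (CSR) storage with a sparse matrix-vector product (an illustration, not from the slides; the 3x3 matrix and array names are assumptions for the example):

    #include <stdio.h>

    // CSR keeps only the nonzeros, their column indices, and the
    // offset where each row starts. For the 3x3 matrix
    //   [5 0 0]
    //   [0 8 3]
    //   [0 0 6]
    // the three arrays are:
    double vals[]   = {5.0, 8.0, 3.0, 6.0};  // nonzero values
    int    cols[]   = {0, 1, 2, 2};          // column of each value
    int    rowptr[] = {0, 1, 3, 4};          // row i is vals[rowptr[i] .. rowptr[i+1])

    // y = A*x for an n-row CSR matrix: each row touches only its nonzeros,
    // so rows with different nonzero counts do different amounts of work
    // (the load-imbalance issue noted above)
    void csr_matvec(int n, const double *vals, const int *cols,
                    const int *rowptr, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0;
            for (int k = rowptr[i]; k < rowptr[i+1]; k++)
                y[i] += vals[k] * x[cols[k]];
        }
    }

    int main(void)
    {
        double x[3] = {1.0, 1.0, 1.0}, y[3];
        csr_matvec(3, vals, cols, rowptr, x, y);
        for (int i = 0; i < 3; i++)
            printf("%f\n", y[i]);   // 5, 11, 6
        return 0;
    }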

HPL benchmark

HPL: LINPACK-based dense linear algebra benchmark
- Generates a linear system of equations (the solution is all 1.0s)
- Distributes data in a block-cyclic pattern
- LU factorization (similar to Gaussian elimination)
- Backward substitution to solve the system
- Error calculation to verify correctness

Compiled on the cluster; located in /shared/apps/hpl-2.1/bin/Linux_PII_CBLAS
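
A usage note (an assumption about the standard HPL distribution, not from the slide): HPL reads its run parameters (problem size N, block size NB, process grid P x Q) from an HPL.dat file in the working directory, and the xhpl binary is normally launched under MPI, e.g. mpirun -np 4 ./xhpl.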

P3 (OpenMP)

Similar to the HPL benchmark:
1) Random generation of a linear system (x is all 1's)
2) Gaussian elimination
3) Backward substitution (row- or column-oriented)

Non-random example:

    Original system (Ax = b), written as the augmented matrix [A b]:

         3.0   2.0  -1.0 |  1.0
         2.0  -2.0   4.0 | -2.0
        -1.0   0.5  -1.0 |  0.0

    Upper triangular system (after Gaussian elimination):

         3.0   2.0  -1.0 |  1.0
         0.0  -3.3   4.7 | -2.7
         0.0   0.0   0.3 | -0.6

    Solved system (after backward substitution):

         1.0   0.0   0.0 |  1.0
         0.0   1.0   0.0 | -2.0
         0.0   0.0   1.0 | -2.0

    Solution: x = (1.0, -2.0, -2.0)
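
For illustration only (a serial sketch, not the P3 solution): steps 2 and 3 in C, assuming no pivoting is needed. It reproduces the 3x3 example above.

    #include <stdio.h>

    #define N 3

    int main(void)
    {
        // Augmented matrix [A b] from the example above
        double m[N][N+1] = {
            { 3.0,  2.0, -1.0,  1.0},
            { 2.0, -2.0,  4.0, -2.0},
            {-1.0,  0.5, -1.0,  0.0}
        };
        double x[N];

        // Gaussian elimination: zero out entries below each pivot
        for (int p = 0; p < N; p++) {
            for (int r = p+1; r < N; r++) {
                double f = m[r][p] / m[p][p];  // assumes a nonzero pivot
                for (int c = p; c <= N; c++)
                    m[r][c] -= f * m[p][c];
            }
        }

        // Row-oriented backward substitution on the triangular system
        for (int r = N-1; r >= 0; r--) {
            x[r] = m[r][N];
            for (int c = r+1; c < N; c++)
                x[r] -= m[r][c] * x[c];
            x[r] /= m[r][r];
        }

        for (int r = 0; r < N; r++)
            printf("x[%d] = %f\n", r, x[r]);  // 1, -2, -2
        return 0;
    }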