GPU Cluster Computing. Advanced Computing Center for Research and Education


What is GPU Computing? The gaming industry and high-definition graphics drove the development of fast graphics processors. GPU computing is the use of GPUs to solve scientific computing problems. Thanks to their massively parallel architecture, GPUs can complete computationally intensive tasks much faster than conventional CPUs. In a hybrid CPU-GPU system, the GPU handles the computationally intensive (massively parallel) part and the CPU handles the sequential part.

GPU Performance [performance comparison images taken from the CUDA Programming Guide]

Why is the GPU So Fast? The GPU is specialized for compute-intensive, highly data-parallel computation (owing to its graphics-rendering origin). More transistors are devoted to data processing rather than to data caching and control flow. GPUs are good at workloads with high arithmetic intensity (the ratio of arithmetic operations to memory operations).

Generic Multicore Chip [diagram: processors with local memory above a shared global memory] A handful of processors, each supporting ~1 hardware thread. On-chip memory near the processors (registers, cache). A shared global memory space (external DRAM).

Generic Manycore Chip [diagram: processors with local memory above a shared global memory] Many processors, each supporting many hardware threads. On-chip memory near the processors (registers, cache, RAM). A shared global memory space (external DRAM).

Heterogeneous Typically, a GPU and a CPU coexist in a heterogeneous setting. The less computationally intensive part runs on the CPU (coarse-grained parallelism), while the more intensive parts run on the GPU (fine-grained parallelism). [diagram: multicore CPU alongside manycore GPU]

GPUs A GPU is typically a card installed into a PCI Express slot. Market leaders: NVIDIA, AMD (ATI). Example NVIDIA GPUs: GeForce GTX 480, Tesla 2070.

NVIDIA GPU SP: scalar processor ("CUDA core"); executes one thread. SM: streaming multiprocessor; 32 SPs (or 16, 48, or more); fast local shared memory (shared between the SPs), 16 KiB (or 64 KiB). [diagram: an SM containing SPs and shared memory, sitting above on-device global memory]

NVIDIA GPU A GPU contains multiple SMs: 30 SMs on GT200, 15 SMs on Fermi. For example, the GTX 480 has 15 SMs x 32 cores = 480 cores per GPU. Global memory is on-device GDDR memory, 512 MiB to 6 GiB. [diagram: SMs with shared memory above on-device global memory]

CUDA Programming Model The GPU is viewed as a compute device that: is a co-processor to the CPU, or host; has its own DRAM (device memory, or global memory in CUDA parlance); runs many threads in parallel. Data-parallel portions of an application run on the device as kernels, which are executed in parallel by many threads. Differences between GPU and CPU threads: GPU threads are extremely lightweight, with very little creation overhead; a GPU needs thousands of threads for full efficiency, while a multi-core CPU needs only a few heavyweight ones.

Software View A master process running on the CPU performs the following steps: 1. Initialize the card. 2. Allocate memory on the host and on the device. 3. Copy data from host to device memory. 4. Launch multiple blocks of the execution kernel on the device. 5. Copy data from device memory back to the host. 6. Repeat steps 3-5 as needed. 7. Deallocate all memory and terminate.
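The steps above can be sketched as a complete host program. This is a minimal illustration, not ACCRE-specific code: the kernel name `add`, the vector size, and the launch configuration are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel: adds two vectors element-wise.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard against extra threads in the last block
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Step 2: allocate memory on the host and on the device */
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
          *h_c = (float *)malloc(bytes);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Step 3: copy data from host to device memory */
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Step 4: launch multiple blocks of the kernel on the device */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    /* Step 5: copy results from device memory back to the host */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    /* Step 7: deallocate all memory */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Compile with nvcc (e.g. `nvcc add.cu`) and run on a GPU node.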

Execution Model [diagram: serial code runs on the host, then Kernel A runs on the device, then serial code on the host again, then Kernel B on the device]

Software View Threads launched for a parallel section are partitioned into thread blocks. A grid is all the blocks for a given launch. A thread block is a group of threads that can: synchronize their execution; communicate via shared memory.
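The two block-level abilities above can be sketched in a small kernel. This is an illustrative example (the kernel name, block size, and array contents are hypothetical): each block reverses its own chunk of an array, which requires the block's threads to communicate through shared memory and synchronize with `__syncthreads()`. Blocks cannot cooperate with each other this way.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define BLOCK 64   // hypothetical block size for this sketch

// Each block reverses its own BLOCK-element chunk of the array.
__global__ void reverse_chunks(float *data)
{
    __shared__ float tile[BLOCK];          // visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];           // stage this block's chunk
    __syncthreads();                       // wait until every thread has written

    data[i] = tile[blockDim.x - 1 - threadIdx.x];  // read a peer thread's value
}

int main(void)
{
    const int n = 4 * BLOCK;               // grid of 4 blocks
    float h[4 * BLOCK], *d;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    reverse_chunks<<<n / BLOCK, BLOCK>>>(d);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0] = %g\n", h[0]);  // first chunk reversed, so h[0] is 63
    return 0;
}
```

Without the `__syncthreads()` barrier, a thread could read `tile` entries that its peers have not written yet.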

Whole Grid Runs on GPU [diagram: many blocks of threads, each block with its own shared memory, all sharing global memory]

Software View - within GPU Each block of the execution kernel executes on an SM. If the number of blocks exceeds the number of SMs, more than one block will run at a time on each SM if there are enough registers and shared memory; the others wait in a queue and execute later. All threads within one block can access local shared memory, but cannot see what the other blocks are doing (even if they are on the same SM). There are no guarantees on the order in which the blocks execute.

Software View
void function( ) {
    // Allocate memory on the GPU
    // Transfer input data to the GPU
    // Launch kernel on the GPU
    // Transfer output data to the CPU
}
__global__ void kernel( ) {
    // Code executed on the GPU goes here
}
[diagram: CPU and CPU memory on the host side; GPU and GPU memory on the device side]

ACCRE GPU Nodes 48 compute nodes, each with: two quad-core 2.4 GHz Intel Nehalem processors and 48 GB of memory; 4 NVIDIA GTX 480 GPUs; 10 Gigabit Ethernet. NVIDIA GTX 480: 15 streaming multiprocessors per GPU; 480 compute cores per GPU; peak performance 1.35 TFLOPS single precision / 168 GFLOPS double precision; 1.5 GB of memory; 177.4 GB/sec memory bandwidth.

Software Stack Operating system: CentOS 6. Batch system: SLURM. All compilers and tools are available on the GPU gateway (vmps81); type: ssh vunetid@vmps81.accre.vanderbilt.edu, or rsh vmps81 if you are already logged into the cluster. GCC and the Intel compiler are available. Compute nodes share the same base OS and libraries.

CUDA Architecture Based on C with some extensions; C++ support is increasing steadily; FORTRAN support is provided by the PGI compiler. There is lots of example code and good documentation, and a large user community on the NVIDIA forums.

CUDA Components Driver: low-level software that controls the graphics card. Compute capability: refers to the GPU architecture and the features supported by the hardware. Toolkit (CUDA version): nvcc (the CUDA compiler), libraries, profiling/debugging tools. SDK (Software Development Kit): lots of demonstration examples and some error-checking utilities; not officially supported by NVIDIA, and almost no documentation.

CUDA Architecture [diagram]

CUDA Toolkit Installed at /usr/local/cuda. To compile a CUDA program:
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH"
Source files with CUDA language extensions (.cu) must be compiled with nvcc. nvcc is actually a compiler driver: it works by invoking all the necessary tools and compilers (gcc, cl, ...).

CUDA Toolkit
$ nvcc hello_world.c
$ ./a.out
Hello World!
$
The NVIDIA compiler (nvcc) can be used to compile programs with no device code.

CUDA Toolkit The CUDA C/C++ keyword __global__ indicates a function that: runs on the device; is called from host (CPU) code. nvcc separates source code into host and device components: device functions, e.g. mykernel(), are processed by the NVIDIA compiler; host functions, e.g. main(), are processed by a standard host compiler, e.g. gcc or cl.exe.
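A minimal sketch of this host/device split, in the style of the classic CUDA hello-world (the empty kernel body and the 1x1 launch configuration are illustrative):

```cuda
#include <stdio.h>

// Device code: __global__ marks this for nvcc's device compiler.
__global__ void mykernel(void)
{
    // An empty kernel: it runs on the GPU but does nothing.
}

// Host code: handed off to the standard host compiler (e.g. gcc).
int main(void)
{
    mykernel<<<1, 1>>>();        // launch 1 block of 1 thread on the device
    cudaDeviceSynchronize();     // wait for the kernel to finish
    printf("Hello World!\n");
    return 0;
}
```

Save as hello.cu and compile with `nvcc hello.cu`; nvcc splits the file, compiles the device part itself, and passes the host part to the host compiler.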

CUDA SDK Provides hundreds of code samples to help you get started writing software with CUDA C/C++. The code samples cover a wide range of applications and techniques. Installed at /opt/nvidia_gpu_computing_sdk on the GPU gateway.

Job Submission Do not ssh to a compute node; you must use SLURM to submit jobs, either as batch jobs or interactively. A job gets exclusive access to a node and all 4 of its GPUs. You must include partition and account information. If you need interactive access to a GPU for development or testing:
salloc --time=6:00:00 --partition=gpu --account=<mygroup>_gpu
For batch jobs, include these lines:
#SBATCH --partition=gpu
#SBATCH --account=<account>_gpu

GPU Jobs
#!/bin/bash
#SBATCH --mail-user=vunetid@vanderbilt.edu
#SBATCH --mail-type=all
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=2:00:00 # 2 hours
#SBATCH --mem=100m
#SBATCH --output=gpu-job.log
#SBATCH --partition=gpu
#SBATCH --account=accre_gpu # substitute appropriate group here
setpkgs -a hoomd
pwd
date
hoomd simple-script.py
date
You must be in a GPU group and include the --partition and --account options. A single job has exclusive rights to the allocated node (4 GPUs per node). See example-5 at https://github.com/accre/slurm.git

Getting Help FAQ and Getting Started pages on the ACCRE website: www.accre.vanderbilt.edu/?page_id=35. ACCRE Help Desk: www.accre.vanderbilt.edu/?page_id=369. The accre-forum mailing list. ACCRE office hours: open a help desk ticket and mention that you would like to work personally with one of us in the office; the person best able to help with your specific issue will contact you to schedule a date and time for your visit.