GPU Architecture. Alan Gray, EPCC, The University of Edinburgh


Outline
- Why do we want/need accelerators such as GPUs?
- Architectural reasons for accelerator performance advantages
- Latest GPU products from NVIDIA and AMD
- Accelerated systems

4 key performance factors
1. Amount of data processed at one time (parallel processing)
2. Processing speed on each data element (clock frequency)
3. Amount of data transferred at one time (memory bandwidth)
4. Time for each data element to be transferred (memory latency)
[Diagram: data flowing between memory and processor]

4 key performance factors
1. Parallel processing
2. Clock frequency
3. Memory bandwidth
4. Memory latency
Different computational problems are sensitive to these factors in different ways, and different architectures address them in different ways.

CPUs: 4 key factors
- Parallel processing: until relatively recently, each CPU had only a single core. Now CPUs have multiple cores, each of which can process multiple instructions per cycle.
- Clock frequency: CPUs aim to maximise clock frequency, but this has now hit a limit due to power restrictions (more later).
- Memory bandwidth: CPUs use regular DDR memory, which has limited bandwidth.
- Memory latency: latency from DDR is high, but CPUs strive to hide it through large on-chip low-latency caches to stage data, multithreading, and out-of-order execution.

The Problem with CPUs
- The power used by a CPU core is proportional to clock frequency x voltage^2.
- In the past, computers got faster by increasing the frequency, while voltage was decreased to keep power reasonable.
- Now, voltage cannot be decreased any further: 1s and 0s in a system are represented by different voltage levels, and reducing the overall voltage further would shrink this difference to the point where 0s and 1s could no longer be properly distinguished.

[Figure reproduced from http://queue.acm.org/detail.cfm?id=2181798]

The Problem with CPUs
- Instead, performance increases can be achieved by exploiting parallelism.
- We need a chip that can perform many parallel operations every clock cycle: many cores and/or many operations per core.
- We want to keep power per core as low as possible.
- Much of the power expended by CPU cores goes to functionality that is not generally useful for HPC, e.g. branch prediction.

Accelerators
- So, for HPC, we want chips with simple, low-power, number-crunching cores.
- But we need our machine to do other things as well as the number crunching: run an operating system, perform I/O, set up the calculation, etc.
- Solution: a hybrid system containing both CPU and accelerator chips.

Accelerators
- It costs a huge amount of money to design and fabricate new chips, which is not feasible for the relatively small HPC market.
- Luckily, over the last few years Graphics Processing Units (GPUs) have evolved for the highly lucrative gaming market, and they largely possess the right characteristics for HPC: many number-crunching cores.
- GPU vendors NVIDIA and AMD have tailored existing GPU architectures to the HPC market.
- GPUs are now firmly established in the HPC industry.

Intel Xeon Phi
- More recently, Intel have released a different type of accelerator to compete with GPUs for scientific computing: the Many Integrated Core (MIC) architecture, also known as Xeon Phi (codenames Larrabee, Knights Ferry, Knights Corner).
- It is used in conjunction with a regular Xeon CPU; Intel prefer the term coprocessor to accelerator.
- Essentially a many-core CPU: typically 50-100 cores per chip, each with wide vector units, so it again uses the concept of many simple low-power cores, each performing multiple operations per cycle.
- The latest Knights Landing (KNL), however, is not normally used as an accelerator but as a self-hosted CPU.

AMD 12-core CPU
- Not much space on the CPU is dedicated to compute.
- [Die image: each marked compute unit = one core]

NVIDIA Pascal GPU
- The GPU dedicates much more space to compute, at the expense of caches, controllers, sophistication, etc.
- [Die image: each marked compute unit = one SM = 64 CUDA cores]

Memory
- For many applications, performance is very sensitive to memory bandwidth.
- GPUs use high-bandwidth memory: Graphics DRAM (GDDR), or HBM2 stacked memory on the new Pascal P100 chips only.
- CPUs use regular DDR DRAM.

GPUs: 4 key factors
- Parallel processing: GPUs have a much higher degree of parallelism than CPUs: many more cores (high-end GPUs have thousands of cores).
- Clock frequency: GPUs typically have lower clock frequencies than CPUs, and instead get performance through parallelism.
- Memory bandwidth: GPUs use high-bandwidth GDDR or HBM2 memory.
- Memory latency: memory latency is similar to that of DDR; GPUs hide it through very high levels of multithreading.

Latest Technology
- NVIDIA Tesla: HPC-specific GPUs that have evolved from the GeForce series.
- AMD FirePro: HPC-specific GPUs that have evolved from the (ATI) Radeon series.

NVIDIA Tesla Series GPU
- The chip is partitioned into Streaming Multiprocessors (SMs) that act independently of each other.
- Multiple cores per SM. Groups of cores act in lock-step: they perform the same instruction on different data elements.
- The number of SMs, and cores per SM, varies across products. High-end GPUs have thousands of cores.

NVIDIA SM
[Diagram: internal layout of an SM]

Performance trends
- GPU performance has been increasing much more rapidly than CPU performance.

NVIDIA Roadmap
[Diagram: NVIDIA product roadmap]

AMD FirePro
- AMD acquired ATI in 2006.
- The AMD FirePro series is a derivative of the Radeon chips with HPC enhancements.
- Like NVIDIA's offerings, it combines high computational performance with high-bandwidth graphics memory.
- It is currently much less widely used for GPGPU than NVIDIA, because of programming-support issues.

Programming GPUs
- GPUs cannot be used instead of CPUs: they must be used together. The GPU acts as an accelerator, responsible for the computationally expensive parts of the code.
- CUDA: extensions to the C language which allow interfacing to the hardware (NVIDIA-specific).
- OpenCL: similar to CUDA, but cross-platform (including AMD and NVIDIA).
- Directives-based approach: directives help the compiler automatically create code for the GPU. OpenACC, and now also the relatively new OpenMP 4.0.

GPU Accelerated Systems
- CPUs and GPUs are used together, communicating over the PCIe bus, or, in the case of the newest Pascal P100 GPUs, NVLINK (more later).
- [Diagram: CPU with DRAM and GPU with GDRAM/HBM2, connected via PCIe]

Scaling to larger systems
- Multiple CPUs and GPUs can be combined within each workstation or shared-memory node, e.g. 2 CPUs + 2 GPUs.
- The CPUs share memory, but the GPUs do not.
- An interconnect allows multiple nodes to be connected.
- [Diagram: two nodes, each with CPU + DRAM and GPU + GDRAM/HBM2 linked by PCIe, joined by an interconnect]

GPU Accelerated Supercomputer
- [Diagram: many GPU+CPU nodes connected together]

DIY GPU Workstation
- Just slot the GPU card into a PCIe slot.
- Make sure there is enough space and power in the workstation.

GPU Servers
- Several vendors offer GPU servers, and multiple servers can be connected via an interconnect.
- Example configuration: 4 GPUs plus 2 (multicore) CPUs.

Cray XK7
- Each compute node contains 1 CPU + 1 GPU.
- Can scale up to thousands of nodes.

NVIDIA Pascal
- In 2016 the Pascal P100 GPU was released, with major improvements over previous generations:
- Adoption of stacked 3D HBM2 memory as an alternative to GDDR, with several times higher bandwidth.
- Introduction of NVLINK, an alternative to PCIe with several-fold performance benefits, to closely integrate a fast dedicated CPU with a fast dedicated GPU.
- The CPU must also support NVLINK: IBM Power series only at the moment.

Summary
- GPUs have higher compute and memory-bandwidth capabilities than CPUs: silicon is dedicated to many simplistic cores, and high-bandwidth graphics or HBM2 memory is used.
- Accelerators are typically not used alone, but work in tandem with CPUs.
- Most common are NVIDIA GPUs; AMD also have high-performance GPUs, but they are not as widely used due to programming support.
- GPU-accelerated systems scale from simple workstations to large-scale supercomputers.