CSCI 402: Computer Architectures
Parallel Processors (2)
Fengguang Song, Department of Computer & Information Science, IUPUI


Coverage: textbook sections 6.6 - End

Today's Contents:
- GPU clusters and their network topology
- The Roofline performance model
- Fallacies and pitfalls

History of GPUs

In addition to the multicore CPU, the GPU is another type of multiprocessor.
- Early video cards were simple: they used frame-buffer memory for video output.
- 3D graphics processing came later, but was available only on high-end computers (e.g., SGI workstations).
- Moore's Law => lower cost and higher density, so 3D graphics cards became standard on PCs.
- These are called Graphics Processing Units (GPUs): special processors oriented to 3D graphics tasks such as vertex and pixel processing, shading, texture mapping, and rasterization.

GPUs in the System (system diagram)

GPU Architectures

The GPU is highly data-parallel and highly multithreaded:
- It uses thread switching to hide memory latency, and therefore relies much less on multi-level caches.
- Graphics memory is wide and high-bandwidth.

Current trend: General-Purpose GPUs (GPGPU)
- We now use heterogeneous CPU/GPU systems: the CPU runs sequential code, the GPU runs parallel code.
- Programming interfaces: DirectX and OpenGL; C for Graphics (Cg) and High Level Shader Language (HLSL); Compute Unified Device Architecture (CUDA) and OpenCL.

GPGPU Example: NVIDIA Tesla
Each streaming multiprocessor is a SIMD processor with 8 streaming processors.

In 2018, the NVIDIA Volta V100 delivered very high floating-point and integer performance. Its peak computation rates (based on the GPU Boost clock rate) are:
- 7.8 TFLOP/s of double-precision floating-point (FP64) performance
- 15.7 TFLOP/s of single-precision (FP32) performance
- 125 Tensor-TFLOP/s of mixed-precision matrix-multiply-and-accumulate (for deep learning)

V100 Architecture

The V100 GPU is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. A full GV100 GPU consists of 6 GPCs, 42 TPCs (each including two SMs), 84 Volta SMs, and eight 512-bit memory controllers (4096 bits total). Each SM has 64 FP32 cores, 64 INT32 cores, 32 FP64 cores, and 8 new Tensor Cores. Each SM also includes four texture units.

With 84 SMs, a full GV100 GPU has a total of 5376 FP32 cores, 5376 INT32 cores, 2688 FP64 cores, 672 Tensor Cores, and 336 texture units. Each memory controller is attached to 768 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory controllers. The full GV100 GPU includes a total of 6144 KB of L2 cache.
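As a sanity check on the peak rates above (this arithmetic is ours, not from the slides): the Tesla V100 product enables 80 of the 84 SMs, and its GPU Boost clock is roughly 1.53 GHz. Counting a fused multiply-add (FMA) as 2 FLOPs:

Peak FP64 = 80 SMs x 32 FP64 cores x 2 FLOPs/FMA x 1.53 GHz = 7.8 TFLOP/s
Peak FP32 = 80 SMs x 64 FP32 cores x 2 FLOPs/FMA x 1.53 GHz = 15.7 TFLOP/s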

Multicore CPU vs GPU

Feature                                                Multicore with SIMD    GPU
Number of SIMD processors                              4-16                   16-60
SIMD lanes per processor                               4                      16
Hardware multithreading for SIMD threads (per cycle)   SMT = 2-4              32 threads
Typical single- to double-precision performance ratio  2:1                    2:1
Largest cache size                                     8 MB                   0.75 MB
Size of memory address                                 64-bit                 64-bit
Size of main memory                                    up to 256 GB           4 GB to 6 GB
Memory protection at level of page                     Yes                    Yes
Virtual memory (demand paging)                         Yes                    No
Integrated scalar processor with SIMD processor        Yes                    No
Cache coherent                                         Yes                    No

Distributed-Memory Systems and Message Passing

Each processor (a.k.a. compute node) has a private physical address space; this is different from an SMP. The hardware sends and receives messages between processors.
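To make the message-passing model concrete, here is a minimal MPI sketch (MPI is standard, but this particular example is supplementary and not from the slides): rank 0 sends an array to rank 1, which must receive it explicitly, since there is no shared address space.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {1.0, 2.0, 3.0, 4.0};
    if (rank == 0) {
        /* No shared memory: the data must be sent explicitly. */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f ... %f\n", buf[0], buf[3]);
    }

    MPI_Finalize();
    return 0;
}

Compile with mpicc and run with mpirun -np 2; the two ranks may live on different compute nodes connected by the interconnection network discussed next.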

Interconnection Networks

Networks have different topologies: an arrangement of processors, switches, and links. Examples:
- Bus
- Ring
- 2D mesh
- N-cube (N = 3)
- Fully connected

How to Model Performance: Common Concepts

Our performance metric of interest is GFLOPs/sec (Gflops), which can be measured using benchmarks or computational kernels.
- Every computational kernel has an arithmetic intensity: FLOPs per byte of accessed memory (see the example below).
- For a given computer, we can determine:
  - Peak GFLOPS (calculated from the CPU clock rate, or taken from its specification)
  - Peak memory bandwidth (bytes/sec), measured using the STREAM benchmark: https://www.cs.virginia.edu/stream
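As an example of arithmetic intensity, consider the DAXPY kernel below (a standard illustration, not from the slides). Each iteration performs 2 FLOPs (one multiply, one add) but moves 24 bytes (read x[i], read y[i], write y[i]), so its arithmetic intensity is 2/24 ≈ 0.083 FLOPs/byte, which makes it memory-bound on virtually any machine.

/* DAXPY: y = a*x + y, a classic low-arithmetic-intensity kernel. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* 2 FLOPs per 24 bytes of traffic */
}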

The Roofline Performance Model

Attainable GFLOPs/sec = min(Peak Memory Bandwidth x Arithmetic Intensity, Peak Floating-Point Performance)

The model has the shape y = min(kx, constant): the arithmetic intensity is program-specific, while the two peaks are machine-specific. (A small sketch of the model follows below.)

Comparing Two Systems: Opteron X2 CPU vs. Opteron X4 CPU
- 2 cores vs. 4 cores, and 2x the FP performance per core
- 2.2 GHz vs. 2.3 GHz
- Identical memory systems, i.e., the same memory bandwidth

Insight: to get higher performance on the X4 than on the X2, you need a high arithmetic intensity, or the working set must fit in the X4's 2 MB L3 cache (to achieve a higher cache hit rate).
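A minimal C sketch of the roofline formula (the peak numbers are made-up placeholders, not measurements from the slides). It also illustrates the X2-vs-X4 insight: with identical bandwidth, a low-intensity kernel such as DAXPY gains nothing from more peak compute.

#include <stdio.h>

/* Roofline: attainable GFLOP/s for a kernel of a given arithmetic intensity. */
double attainable_gflops(double peak_gflops, double peak_bw_gbs, double ai) {
    double memory_bound = peak_bw_gbs * ai;   /* the slanted part of the roof */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    /* Hypothetical machine: 100 GFLOP/s peak compute, 25 GB/s peak bandwidth. */
    double peak = 100.0, bw = 25.0;
    /* DAXPY (AI ~ 0.083) stays memory-bound; blocked DGEMM (AI ~ 4) hits the flat roof. */
    printf("DAXPY: %.1f GFLOP/s\n", attainable_gflops(peak, bw, 2.0 / 24.0));
    printf("DGEMM: %.1f GFLOP/s\n", attainable_gflops(peak, bw, 4.0));
    return 0;
}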

Benchmark Results (CPU vs. GPU)

The GPU is not always faster.

Multi-threaded DGEMM

Use OpenMP pragmas; the compiler will generate parallel code:

void dgemm(int n, double *A, double *B, double *C)
{
    #pragma omp parallel for
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}

(A sketch of the do_block helper appears below.)
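The dgemm above calls do_block, which was defined in an earlier chapter of the textbook. For reference, a cache-blocked kernel in that style (column-major indexing, and assuming BLOCKSIZE divides n) looks like this:

#define BLOCKSIZE 32

/* Compute one BLOCKSIZE x BLOCKSIZE block of C += A * B (column-major). */
static void do_block(int n, int si, int sj, int sk,
                     double *A, double *B, double *C)
{
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j * n];           /* running sum for C[i][j] */
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

Blocking keeps each BLOCKSIZE x BLOCKSIZE tile of A, B, and C resident in cache while it is reused, which raises the kernel's arithmetic intensity.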

Multi-threaded DGEMM using 1 to 16 threads (performance plot)

Fallacies

Fallacy: Amdahl's Law does not apply to parallel computers, because we are able to achieve linear speedup.
- It certainly applies (see the DGEMM plot on the next slide); e.g., strong scaling is bounded by the law.

Fallacy: peak CPU performance reflects observed/actual performance.
- Marketing people like this approach! In reality you must be aware of many bottlenecks: pipeline stalls, branch mispredictions, cache misses, memory bus contention, synchronization, etc.
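As a reminder of why the first fallacy fails (a standard worked example, not from the slides): if a fraction f of a program is parallelizable across n processors, Amdahl's Law gives

Speedup(n) = 1 / ((1 - f) + f/n)

Even with f = 0.95 and n = 100, the speedup is only 1 / (0.05 + 0.0095) ≈ 17, nowhere near linear.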

Multi-threaded DGEMM (performance plot)

Pitfalls

Pitfall: developing parallel software without taking the multiprocessor into account.
- Example: using a single lock for a shared resource serializes all accesses, even those that could proceed in parallel.
- Hence, you need finer-granularity locking or per-thread buffers to reduce contention.
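To make the pitfall concrete, here is an illustrative pthreads sketch (the names and the striping scheme are ours, not from the slides): protecting a shared counter array with one global lock would serialize every update, whereas striping the locks lets updates to different entries proceed in parallel.

#include <pthread.h>

#define NBUCKETS 1024
#define NLOCKS   16   /* lock stripes: far fewer locks than buckets */

static long counts[NBUCKETS];
static pthread_mutex_t stripe[NLOCKS];

void counters_init(void) {
    for (int i = 0; i < NLOCKS; i++)
        pthread_mutex_init(&stripe[i], NULL);
}

/* Striped locking: updates to buckets in different stripes run in parallel.
 * With a single global lock, every update would serialize instead. */
void counter_add(int bucket, long delta) {
    pthread_mutex_t *m = &stripe[bucket % NLOCKS];
    pthread_mutex_lock(m);
    counts[bucket] += delta;
    pthread_mutex_unlock(m);
}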

Concluding Remarks

The goal of multiprocessors is to achieve higher performance by using multiple processors. The difficulties are:
- Developing parallel software
- Devising appropriate architectures

SaaS is growing, and clusters are a good match for it; parallel multiprocessors will continue to be popular. Performance per dollar and performance per joule are driving both mobile and WSCs (Warehouse-Scale Computers).

In Summary: Looking Back at DGEMM on an Intel Core i7
- Data-level parallelism via AVX (subword parallelism, SIMD): 3.2x speedup
- Increased ILP by unrolling the loop four times, giving the CPU hardware more instructions to schedule: 2.0x more speedup
- Cache optimization via blocking: 2.4x more speedup
- Thread-level parallelism (TLP) on a 16-core machine: 14x more speedup

Note: you added just 24 lines of code, but the speedup is 200-300x (from Chapter 1 to Chapter 6)!
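As a quick arithmetic check (not on the slide), the individual factors multiply out to the overall figure: 3.2 x 2.0 x 2.4 x 14 ≈ 215x, consistent with the ~220x total quoted on the final slide.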

DGEMM (performance plot): combining all the techniques we have learned in CS402 so far makes it 220x faster!