Parallel computing and GPU introduction

Parallel computing and GPU introduction. 黃子桓 (tzhuan@gmail.com), National Taiwan University (國立台灣大學)

Agenda: Parallel computing, GPU introduction, Interconnection networks, Parallel benchmark, Parallel programming, GPU programming

Parallel computing

Goal of computing: faster, faster, and faster

Why parallel computing? Moore's law is dead (for CPU frequency): clock rates no longer scale up, so additional transistors go into more cores rather than a faster single core

Top500 (Nov 2013): 1. Tianhe-2 (NUDT), 3,120,000 cores (Intel Xeon E5, Intel Xeon Phi); 2. Titan (Cray), 560,640 cores (Opteron 6274, NVIDIA K20x); 3. Sequoia (IBM), 1,572,864 cores (Power BQC); 4. K computer (Fujitsu), 705,024 cores (SPARC64); 5. Mira (IBM), 786,432 cores (Power BQC)

Amdahl's law (diagram): the work splits into a serial part and a parallelizable part; with 2, 4, and many processors only the parallelizable part shrinks, while the serial part stays the same

Amdahl's law. Total speedup = 1 / ((1 - P) + P/S), where P is the fraction of the work that is parallelizable and S is the speedup of the parallelizable work. Example: P = 0.8 (80% of the work is parallelizable), S = 8 (8 processors); total speedup = 1 / (0.2 + 0.8/8) = 3.33x
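
The same calculation as a small C++ helper (a sketch, not from the slides; amdahl_speedup is an illustrative name):

    #include <cstdio>

    // Amdahl's law: speedup = 1 / ((1 - p) + p / s)
    double amdahl_speedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main() {
        std::printf("%.2f\n", amdahl_speedup(0.8, 8.0));  // prints 3.33
        return 0;
    }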

Amdahl's law

Scaling example. Workload: sum of 10 scalars, and 10 × 10 matrix sum. Single processor: Time = (10 + 100) Tadd = 110 Tadd. 10 processors: Time = 10 Tadd + (100/10) Tadd = 20 Tadd; Speedup = 110/20 = 5.5 (55% of potential). 100 processors: Time = 10 Tadd + (100/100) Tadd = 11 Tadd; Speedup = 110/11 = 10 (10% of potential)

Scaling example. What if the matrix size is 100 × 100? Single processor: Time = (10 + 10000) Tadd = 10010 Tadd. 10 processors: Time = 10 Tadd + (10000/10) Tadd = 1010 Tadd; Speedup = 10010/1010 = 9.9 (99% of potential). 100 processors: Time = 10 Tadd + (10000/100) Tadd = 110 Tadd; Speedup = 10010/110 = 91 (91% of potential)

Scalability: the ability of a system to handle a growing amount of work. Strong scaling: fixed total problem size; run a fixed problem faster. Weak scaling: fixed problem size per processor; run a bigger (or smaller) problem

Parallel computing systems: parallelization design for processors, hardware multithreading, multi-processor systems, cluster computing systems, grid computing systems

Parallelization design for processors.
Instruction-level parallelism:
    add $t0, $t1, $t2
    mul $t3, $t4, $t5
Data-level parallelism:
    add 0($t1), 0($t2), 0($t3)
    add 4($t1), 4($t2), 4($t3)
    add 8($t1), 8($t2), 8($t3)

Flynn's taxonomy: SISD, single instruction, single data (single-core processor); MISD, multiple instruction, single data (very rare); SIMD, single instruction, multiple data (superscalar, vector processor, GPU, etc.); MIMD, multiple instruction, multiple data (multi-core processor)

SIMD: operate element-wise on vectors of data, e.g. the MMX and SSE instructions in x86, which pack multiple data elements into 128-bit wide registers. All processing elements execute the same instruction at the same time, each with a different data address. This simplifies synchronization and reduces instruction control hardware; it works best for highly data-parallel applications

Example: dot product (x86 SSE assembly)
    mov esi, dword ptr [src]
    mov edi, dword ptr [dst]
    mov ecx, Count
start:
    movaps xmm0, [esi]      // a3, a2, a1, a0
    mulps  xmm0, [esi + 16] // a3*b3, a2*b2, a1*b1, a0*b0
    haddps xmm0, xmm0       // a3*b3+a2*b2, a1*b1+a0*b0, a3*b3+a2*b2, a1*b1+a0*b0
    movaps xmm1, xmm0
    psrldq xmm0, 8
    addss  xmm0, xmm1
    movss  [edi], xmm0
    add    esi, 32
    add    edi, 4
    sub    ecx, 1
    jnz    start
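
An equivalent sketch in C++ with SSE intrinsics (not from the slides; the data layout is inferred from the assembly above, with each 32-byte block of src holding four floats of a followed by four floats of b, and dot4_blocks is an illustrative name):

    #include <pmmintrin.h>  // SSE3: _mm_hadd_ps

    void dot4_blocks(const float* src, float* dst, int count) {
        for (int i = 0; i < count; ++i) {
            __m128 a = _mm_load_ps(src + 8 * i);      // a3, a2, a1, a0 (16-byte aligned, like movaps)
            __m128 b = _mm_load_ps(src + 8 * i + 4);  // b3, b2, b1, b0
            __m128 p = _mm_mul_ps(a, b);              // element-wise products
            p = _mm_hadd_ps(p, p);                    // pairwise sums
            p = _mm_hadd_ps(p, p);                    // full dot product in every lane
            _mm_store_ss(dst + i, p);                 // store one float result per block
        }
    }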

Vector processors: highly pipelined function units. Data streams between vector registers and the function units: operands are collected from memory into vector registers, and results are stored from registers back to memory. Example: the vector extension to MIPS has 32 registers of 64 elements (64-bit elements) and vector instructions such as lv, sv (load/store vector), addv.d (add vectors of double), and addvs.d (add a scalar to each element of a vector of double). This significantly reduces instruction-fetch bandwidth

Example: DAXPY (Y = a × X + Y)
Conventional MIPS code:
    l.d   $f0,a($sp)       ;load scalar a
    addiu r4,$s0,#512      ;upper bound of what to load
loop: l.d $f2,0($s0)       ;load x(i)
    mul.d $f2,$f2,$f0      ;a × x(i)
    l.d   $f4,0($s1)       ;load y(i)
    add.d $f4,$f4,$f2      ;a × x(i) + y(i)
    s.d   $f4,0($s1)       ;store into y(i)
    addiu $s0,$s0,#8       ;increment index to x
    addiu $s1,$s1,#8       ;increment index to y
    subu  $t0,r4,$s0       ;compute bound
    bne   $t0,$zero,loop   ;check if done
Vector MIPS code:
    l.d     $f0,a($sp)     ;load scalar a
    lv      $v1,0($s0)     ;load vector x
    mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
    lv      $v3,0($s1)     ;load vector y
    addv.d  $v4,$v2,$v3    ;add y to product
    sv      $v4,0($s1)     ;store the result
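
What both sequences compute, written as a plain C loop (a sketch; 64 iterations correspond to the 512-byte bound, i.e. 64 double-precision elements, in the code above):

    void daxpy(double a, const double* x, double* y) {
        for (int i = 0; i < 64; ++i)   // one 64-element vector register's worth of work
            y[i] = a * x[i] + y[i];
    }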

Hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. Coarse-grained multithreading: switches threads only on costly stalls. Fine-grained multithreading: interleaved execution of multiple threads. Simultaneous multithreading: a multiple-issue, dynamically scheduled processor also exploits thread-level parallelism

Hardware multithreading

Multi-processor system Shared memory multi-processor

Multi-processor system Non-uniform memory access multi-processor

Cluster computing system

Grid computing system

GPU introduction

Computer graphics rendering

History of computer graphics. 1960, Ivan Sutherland's Sketchpad: the beginning of computer graphics. 1992, OpenGL 1.0. 1996, Voodoo I: the first consumer 3D graphics card. 1996, DirectX 3.0: the first version including Direct3D

The history of computer graphics (continued). 2000, DirectX 8.0: the first version with programmable shaders. 2001, GeForce 3 (NV20): the first programmable consumer GPU. 2004, OpenGL 2.0: the first version supporting GLSL. 2006, GeForce 8 (G80): the first NVIDIA GPU supporting CUDA. 2008, OpenCL (Apple, AMD, IBM, Qualcomm, Intel, and others)

Graphics pipeline (diagram): instructions, state, and data enter an input processor; a geometry stage does transforms, lighting, etc.; rasterization and pixel shading do the per-pixel work; Z-buffering and transparency are resolved while accumulating the pixel result

Make it faster (diagram): replicate the units; the input processor feeds several parallel geometry units, which feed several parallel pixel units, whose outputs are accumulated into the pixel result

Add framebuffer support (diagram): the same replicated pipeline, with the pixel units reading and writing a framebuffer (FB) in memory while the pixel result is accumulated

Add programmability (diagram): a front end (get data, process data, output data) feeds programmable geometry shaders and pixel shaders, each with their own ALUs, followed by raster operations that write to the framebuffer (FB) in memory

Unified shader (diagram): the front end and a buffer feed a single pool of shader ALUs that handles both geometry and pixel work, followed by raster operations that write to the framebuffer (FB) in memory

Scaling it up again (diagram): many unified-shader ALUs run in parallel behind the front end and buffer, with raster operations and the framebuffer (FB) in memory

NVIDIA Tesla GPU (diagram): an array of processors, work sorting and distribution logic, memory, and I/O

Interconnection networks

Interconnection networks: bus, ring. Performance metrics: network bandwidth, the peak transfer rate of a network (best case); bisection bandwidth, the bandwidth between two equal parts of a multiprocessor (worst case)

Network topologies: bus, ring, 2D mesh, n-cube, fully connected
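
A sketch (not from the slides) of the two metrics defined above for three of these topologies, counted in units of one link's bandwidth; the textbook-style formulas are assumptions, not slide content:

    struct NetBandwidth { double total; double bisection; };

    NetBandwidth bus(int p)             { (void)p; return {1.0, 1.0}; }      // one shared link
    NetBandwidth ring(int p)            { return {1.0 * p, 2.0}; }           // p links; a cut crosses 2
    NetBandwidth fully_connected(int p) { return {p * (p - 1) / 2.0,         // one link per pair of nodes
                                                  (p / 2.0) * (p / 2.0)}; }  // (p/2)^2 links cross the cut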

Multistage networks Crossbar Omega network

Network characteristics. Performance: latency, throughput (link bandwidth, total network bandwidth, bisection bandwidth), congestion delays. Cost. Power. Routability in silicon

Parallel benchmark

Parallel benchmarks. Linpack: matrix linear algebra. SPECrate: parallel runs of SPEC CPU programs (job-level parallelism). SPLASH (Stanford Parallel Applications for Shared Memory): a mix of kernels and applications, strong scaling. NAS (NASA Advanced Supercomputing) suite: computational fluid dynamics kernels. PARSEC (Princeton Application Repository for Shared Memory Computers) suite: multithreaded applications using Pthreads and OpenMP

Modeling performance. What is the performance metric of interest? Attainable GFLOPs/second, measured using computational kernels from the Berkeley design patterns. Arithmetic intensity: FLOPs per byte of memory accessed. For a given computer, determine peak FLOPs and peak memory bytes/second

Roofline diagram
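
The roofline diagram plots attainable GFLOPs/second against arithmetic intensity; a minimal sketch of the underlying bound (not code from the slides):

    #include <algorithm>

    double attainable_gflops(double peak_gflops,       // peak floating-point performance
                             double peak_mem_gbs,      // peak memory bandwidth (GB/s)
                             double flops_per_byte) {  // arithmetic intensity
        // performance is capped by compute or by memory traffic, whichever is lower
        return std::min(peak_gflops, peak_mem_gbs * flops_per_byte);
    }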

Opteron X2 vs. X4: comparing systems with 2 cores vs. 4 cores, 2.2 GHz vs. 2.3 GHz, and the same memory system

Optimizing performance

Optimizing performance: the choice of optimizations depends on the arithmetic intensity of the code

Four example systems: Intel Xeon E5345, AMD Opteron X4 2356, Sun UltraSPARC T2 5140, IBM Cell QS20

Roofline diagrams

Conclusion. Goal: higher performance by using multiple processors. Difficulties: developing parallel software, devising appropriate architectures. Many reasons for optimism: a changing software and application environment, and chip-level multiprocessors with lower latency, higher bandwidth interconnect. An ongoing challenge for computer architects!

Parallel programming

Can a program be parallelized?
Matrix multiplication:
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                C[i][j] += A[i][k] * B[k][j];
Fibonacci sequence:
    A = 0, B = 1;
    for (int i = 0; i < N; ++i) {
        C = A + B;
        A = B;
        B = C;
    }
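
A sketch of why the first loop nest parallelizes and the second does not (assuming OpenMP, which the parallel benchmark slide mentions; matmul_parallel is an illustrative name). Each iteration of i touches its own row of C, so rows can be computed on different processors; the Fibonacci loop cannot be split this way because every iteration needs the previous iteration's result:

    #include <vector>

    void matmul_parallel(const std::vector<std::vector<double>>& A,
                         const std::vector<std::vector<double>>& B,
                         std::vector<std::vector<double>>& C,
                         int M, int N, int K) {
        #pragma omp parallel for   // distribute independent rows of C across processors
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j)
                for (int k = 0; k < K; ++k)
                    C[i][j] += A[i][k] * B[k][j];
    }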

Parallel programming: software and algorithms are the key. The aim is significant performance improvement; otherwise, just use a faster uniprocessor. Difficulties: partitioning, coordination, communication overhead

Parallel programming: job-level parallelism, a single program running on multiple processors, or a single program running on multiple computers

Job-level parallelism: the operating system already does it. How to improve throughput? Number of processors vs. number of jobs, memory usage, I/O statistics, scheduling and priority, ...

Process: a running instance of a program. Multi-process program example (diagram): a WWW server made up of multiple processes, each serving one browser

Multi-thread program. Thread: a light-weight process. Example (diagram): a WWW server made up of multiple threads, each serving one browser
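
A minimal multi-thread sketch in C++ (not from the slides; handle_request and the thread count are illustrative), with all threads sharing one process:

    #include <cstdio>
    #include <thread>
    #include <vector>

    void handle_request(int id) {
        std::printf("serving browser %d\n", id);   // all threads share the process's address space
    }

    int main() {
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back(handle_request, i);
        for (auto& t : workers)
            t.join();                              // wait for every thread to finish
        return 0;
    }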

Multi-process vs. multi-thread. Performance: launch time, context switch, kernel-aware scheduling. Communication: inter-process vs. inter-thread communication. Stability

Message Passing Interface (MPI) (diagram): a broadcast from one node to all other nodes
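
A minimal broadcast sketch using the MPI C API (not from the slides): rank 0 sends one value to every node, matching the diagram:

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, value = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) value = 42;                          // the root node owns the data
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);   // broadcast from rank 0 to all ranks
        std::printf("rank %d received %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }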

MapReduce/Hadoop

GPU programming

GPGPU

CUDA

Thanks!