Massively Parallel Architectures

Massively Parallel Architectures: A Take on Cell Processor and GPU Programming
Joel Falcou - LRI - joel.falcou@lri.fr
Building 490, Office 104
20 January 2009

Motivation: Harder, Better, Faster, Stronger (famous tune)
Scientific computation is hungry for computing power: faster computation means more results, now. Key consumers include biology and health care, oil and finance, and the video game industry.

The Silent Revolution
Computing power: 400 GFLOPS (GPU) vs 32 GFLOPS (CPU)
Memory bandwidth: 100-200 GB/s vs 10 GB/s
GPUs are in everyday PCs; the Cell went from server blades to the game industry (PS3).

Motivation: When Video Games Ruled the World
Game design has become ever more sophisticated: fast GPUs enable complex shaders for real-time effects. In turn, the demand for speed has driven ever-increasing innovation in card design. The gaming industry has overtaken the defense, finance, oil and healthcare industries as the main driving force behind high-performance processors. The NV40 architecture has 225 million transistors, compared to about 175 million for the 3.2 GHz Pentium 4 EE.


Objectives
Theory! The hardware architecture of the GPU and the Cell processor, and the pros and cons of those architectures.
... and Practice: tools and languages, sample code.

Motivation: Less is More
General-purpose CPUs keep growing in complexity while peak performance gains slow down. The alternative: build more, simpler processing units.

The CELL Processor
Heterogeneous multi-core
DSP-like coprocessors
High memory bandwidth (~200 GB/s)


The CELL Processor: Structure
1 PowerPC Processing Element (PPE)
8 Synergistic Processing Elements (SPEs)
1 XDRAM interface
1 4-way DMA bus

Sources of parallelism: TLP over the PPE, TLP over the SPEs, ILP inside each SPE.


Available Tools
... that work: GCC/G++ for the Cell, GFORTRAN for the Cell. Both use a dual-source compilation process.
... that don't work: OpenMP (bad scaling, huge executables), task-based MPI (huge latency, low bandwidth).

Separate Development
Specificities of the PPE: all the features of a PPC core, support for up to two hardware threads, and a full-fledged AltiVec SIMD extension (a tiny sketch follows).
Specificities of the SPEs: a specialized AltiVec-like SIMD instruction set, no scalar ALU, no cache and no branch predictor.
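As a taste of the PPE's AltiVec extension, a minimal sketch (assuming GCC with -maltivec; the function name is illustrative):

#include <altivec.h>

vector float vadd4(vector float a, vector float b)
{
    return vec_add(a, b);   /* one SIMD instruction, four float additions */
}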

Memory and Communications
Communicating between PPE and SPEs: the SPE Local Stores (LS) are virtually mapped into PPE memory; PPE and SPE code share the same process space; SPE code must be downloaded to the SPE when the application starts (see the sketch below).

Handling the SPE Local Store: each LS is only 256 KB for code + data; SPE memories aren't shared; explicit data-transfer primitives are needed.
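To illustrate this download-and-run model, here is a minimal PPE-side sketch using libspe2 (the handle name spe_program and the wrapper run_on_spe are illustrative; error checking omitted):

#include <libspe2.h>

extern spe_program_handle_t spe_program;   /* SPE ELF image embedded in the PPE binary */

void run_on_spe(void *argp)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    spe_program_load(ctx, &spe_program);               /* download code into the SPE LS */

    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run(ctx, &entry, 0, argp, NULL, NULL); /* blocks until the SPE stops */

    spe_context_destroy(ctx);
}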

Memory and Communications: Mailboxes and Signals
Mailboxes allow the transfer of small data (32 bits) between an SPE and the PPE. Each SPE has two mailboxes (inbound and outbound), usable in blocking or polling mode. They are useful for simple synchronization (e.g. a thread-pool pattern). Primitives: spe_in_mbox_write and spe_out_mbox_read (a minimal sketch follows).

Signals allow the transfer of small data (32 bits) between SPEs. Each SPE has two general-purpose signal slots. They are useful for emulating message passing on top of DMA transfers. Primitives: mfc_sndsig and spe_read_signal.
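A minimal mailbox sketch, assuming libspe2 on the PPE side and spu_mfcio.h on the SPU side (ctx is an already-created SPE context):

/* PPE side: post a 32-bit message to one SPE's inbound mailbox,
   blocking until a mailbox slot is free. */
unsigned int msg = 42;
spe_in_mbox_write(ctx, &msg, 1, SPE_MBOX_ALL_BLOCKING);

/* SPU side: block until a message arrives in the inbound mailbox. */
unsigned int received = spu_read_in_mbox();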

DMA Transfers
Principles: DMA frees the SPU from blocking during memory transfers; it is also used to download SPE code into the SPE LS. Up to 4 transfers can run in parallel over the SPE bus, and up to one upload plus one download in parallel over the PPE bus. Primitives: mfc_get, mfc_put and mfc_read_tag_status_all.

Traps and pitfalls: data to send/receive must be aligned on a 128-bit boundary; the data size should be 1, 2, 4, 8 or any multiple of 16 bytes; the number of DMA channels is limited; double buffering should be considered (a sketch follows).
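A minimal double-buffering sketch for the SPU side; CHUNK, the buffer layout and the function name are illustrative, and error handling is omitted:

#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer, a multiple of 16 */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

void process_stream(uint64_t ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* prefetch the first chunk */

    for (int i = 0; i < nchunks; ++i) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                            /* overlap: start the next transfer */
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);                   /* wait only for the current chunk */
        mfc_read_tag_status_all();

        /* ... compute on buf[cur] while buf[next] is in flight ... */
        cur = next;
    }
}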

Motivation: GPU beyond 3D Graphics
Data-parallel algorithms leverage GPU strengths: large data arrays and streaming throughput, fine-grain SIMD parallelism, low-latency floating-point (FP) computation.
Back in the days of OpenGL-based GPGPU: limited texture sizes/dimensions, limited outputs, no integers or bitwise operators, limited communications.

The NVIDIA Products
GeForce series: separate HW interface; works as an external MPM.
Tesla machines: 8-series GPUs, 200 GFLOPS, stand-alone or as a 1U rackable unit.

Inside a GPU
Hierarchical memory: global memory, shared memory, local memory.
Processors: high-density SMPs supporting 4-way SIMD.

Global View: Kernels
A GPGPU application is made of CPU computation and GPU kernels.
Grids and blocks: a kernel executes as a grid of thread blocks; all threads in the grid share the global data memory space; a thread block is a batch of threads that can cooperate.

Block and Thread IDs
Threads and blocks have IDs, from which each thread decides which data to process: block IDs are 1D or 2D, thread IDs are 1D, 2D or 3D.
The natural memory access pattern depends on the domain: images are 2D, physics simulations are 3D (a sketch of the usual indexing idiom follows).
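A sketch of the usual indexing idiom for a 2D (image-like) domain; the kernel name is illustrative:

__global__ void brighten(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   /* global column */
    int y = blockIdx.y * blockDim.y + threadIdx.y;   /* global row */

    if (x < width && y < height)                     /* guard against partial blocks */
        img[y * width + x] += 1.0f;
}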

Memory Access Patterns
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read per-grid constant memory (read-only)

The host can:
R/W constant memory
R/W texture memory
R/W global memory

(The sketch below shows how these spaces are spelled in CUDA C.)
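A sketch of the CUDA C qualifiers for these memory spaces (names are illustrative; assumes a 16x16 thread block):

__constant__ float coeff[16];                 /* per-grid constant memory, read-only on device */

__global__ void scale(float *global_data)     /* pointer into per-grid global memory */
{
    __shared__ float tile[16][16];            /* per-block shared memory */

    int tx = threadIdx.x, ty = threadIdx.y;   /* plain locals live in per-thread registers */
    tile[ty][tx] = global_data[ty * 16 + tx];
    __syncthreads();                          /* shared memory requires synchronization */

    global_data[ty * 16 + tx] = tile[ty][tx] * coeff[tx];
}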

Global, Constant, and Texture Memories
Global memory is the main means of communication between host and device; its contents are visible to all threads.
Texture and constant memories are initialized by the host; their contents are visible to all threads.
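For instance, the host can initialize constant memory by symbol (a sketch, reusing the coeff array from above):

float host_coeff[16];
/* ... fill host_coeff ... */
cudaMemcpyToSymbol(coeff, host_coeff, sizeof(host_coeff));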

CUDA Processing Flow

Copy Processing Data
Create data on the host:
cudaMallocHost() : allocate page-locked memory on the host
cudaMalloc() : allocate memory in the device global memory

Copy to the device:
cudaMemcpy() : copy memory between host and device
an asynchronous variant (cudaMemcpyAsync) exists since CUDA 1.1
works 4-way: (host, device) x (host, device)

Example:

float *host, *device;
cudaMallocHost((void**)&host, sizeof(float)*64*64);
cudaMalloc((void**)&device, sizeof(float)*64*64);
cudaMemcpy(device, host, sizeof(float)*64*64, cudaMemcpyHostToDevice);

Instruct the Processing
Define the device mapping: CUDA provides built-in types for dimensions; define the grid of blocks and the threads per block.
Run the kernel: CUDA provides a syntax extension for calling a given function over a given grid.

Example:

dim3 dimBlock(16, 16);
dim3 dimGrid(64 / dimBlock.x, 64 / dimBlock.y);
device_kernel<<<dimGrid, dimBlock>>>(device, 64);

Build a Parallel Kernel

kernel.cu:

#define BLOCK_SIZE 16

__global__ void device_kernel(float* data, size_t size)
{
    // Block index (bx is unused in this kernel)
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int begin = size * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int end = begin + size - 1;

    // Step size used to iterate through the sub-matrices of A
    int step = BLOCK_SIZE;

    for (int a = begin; a <= end; a += step)
        data[a + size * ty + tx] = 255 - data[a + size * ty + tx];
}
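Putting the previous three slides together, a minimal host driver might look like this (a sketch; error checking omitted):

int main(void)
{
    const size_t bytes = sizeof(float) * 64 * 64;
    float *host, *device;

    cudaMallocHost((void**)&host, bytes);     /* page-locked host memory */
    cudaMalloc((void**)&device, bytes);

    /* ... fill host[] with input data ... */
    cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(16, 16);
    dim3 dimGrid(64 / dimBlock.x, 64 / dimBlock.y);
    device_kernel<<<dimGrid, dimBlock>>>(device, 64);

    cudaMemcpy(host, device, bytes, cudaMemcpyDeviceToHost);  /* fetch results */

    cudaFree(device);
    cudaFreeHost(host);
    return 0;
}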

Sample Code: see mmul.*

Some research topics...
High-level tools are needed. Work in progress includes: algorithmic skeletons for the Cell, bulk synchronous parallelism for GPUs, and an architecture-independent algebra library.
Some untapped domains: operations research, cryptography/compression, artificial intelligence.