
COSC 6385 Computer Architecture - Multi-Processors (V)
The Intel Larrabee, Nvidia GT200 and Fermi processors
Fall 2012

References

Intel Larrabee:
[1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: Larrabee: a many-core x86 architecture for visual computing, ACM Trans. Graph., Vol. 27, No. 3 (August 2008), pp. 1-15, http://softwarecommunity.intel.com/userfiles/en-us/file/larrabee_manycore.pdf

Nvidia GT200:
[2] David Kanter: Nvidia GT200: Inside a Parallel Processor, 09/08/2008, http://www.realworldtech.com/page.cfm?articleid=rwt090808195242&p=1

Nvidia Fermi:
[3] David Kanter: Inside Fermi: Nvidia's HPC Push, 09/30/2009, http://www.realworldtech.com/page.cfm?articleid=rwt093009110932&p=1
[4] Peter N. Glaskowsky: Nvidia's Fermi: The First Complete GPU Architecture, http://www.nvidia.com/content/pdf/fermi_white_papers/p.glaskowsky_nvidia%27s_fermi-The_First_Complete_GPU_Architecture.pdf

Larrabee Motivation

A comparison of two architectures with the same number of transistors:
- the simplified in-order cores deliver half the performance on a single stream,
- but a 40x increase in throughput for multi-stream execution.

                     2 out-of-order cores   10 in-order cores
Instruction issue    4                      2
VPU per core         4-wide SSE             16-wide
L2 cache size        4 MB                   4 MB
Single stream        4 per clock            2 per clock
Vector throughput    8 per clock            160 per clock

Larrabee Overview

- Many-core visual computing architecture
- Based on x86 CPU cores
  - extended version of the regular x86 instruction set
  - supports subroutines and page faulting
- The number of x86 cores can vary depending on the implementation and processor version
- Fixed-function units for texture filtering
- Other graphical operations, such as rasterization or post-shader blending, are done in software

Larrabee Overview (II)
(Image source: [1])

Overview of a Larrabee Core (I)
(Image source: [1])

Overview of a Larrabee Core (II)

- x86 core derived from the Pentium processor
  - no out-of-order execution
  - standard Pentium instruction set with the addition of 64-bit instructions
  - instructions for pre-fetching data into the L1 and L2 caches
- Support for 4 simultaneous threads, with separate registers for each thread
- Each core is augmented with a wide vector processing unit (VPU)
- 32 KB L1 instruction cache, 32 KB L1 data cache
- 256 KB local subset of the L2 cache
- Coherent L2 cache across all cores

Vector Processing Unit in Larrabee

- 16-wide VPU executing integer, single-precision and double-precision floating-point operations
- The VPU supports gather-scatter operations: the 16 elements can be loaded from, or stored to, up to 16 different addresses
- Support for predicated instructions using a mask control register (if-then-else statements); the sketch below illustrates the semantics
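
A minimal C sketch of what a masked 16-wide gather computes; this is purely illustrative (the function name, the mask argument and the scalar loop are not Larrabee intrinsics; real code would use the VPU instructions emitted by the native compiler):

/* Scalar model of one 16-wide masked gather: each enabled lane loads
   from its own address; disabled lanes are left untouched. All names
   here are hypothetical, chosen only for illustration. */
void masked_gather16(float dst[16], const float *base,
                     const int index[16], unsigned mask)
{
    for (int lane = 0; lane < 16; lane++)
        if (mask & (1u << lane))            /* predication via mask register */
            dst[lane] = base[index[lane]];  /* lane-private address */
}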

Inter-Processor Ring Network

- Bi-directional ring network, 512 bits wide per direction
- Routing decisions are made before a message is injected into the network

Larrabee Programming Models

- Most applications can be executed without modification due to the full support of the x86 instruction set
- Support for POSIX threads to create multiple threads; the API is extended by thread-affinity parameters (a sketch follows below)
- Recompiling code with Larrabee's native compiler automatically generates the code that uses the VPUs
- Alternative parallel approaches:
  - Intel Threading Building Blocks
  - Larrabee-specific OpenMP directives
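
Because Larrabee runs ordinary x86 code, a standard POSIX-threads program such as the following minimal sketch would execute unmodified. The Larrabee-specific thread-affinity parameters mentioned above are not shown, since [1] does not spell out their API:

#include <pthread.h>
#include <stdio.h>

/* Plain pthread_create/pthread_join, one worker per hardware thread
   context; no Larrabee-specific affinity parameters are used. */
static void *worker(void *arg)
{
    printf("thread %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}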

Larrabee Performance
(Image source: [1])

Nvidia GT200

- The GT200 is a multi-core chip with a two-level hierarchy that focuses on high throughput for data-parallel workloads
- 1st level of hierarchy: 10 Thread Processing Clusters (TPCs)
- 2nd level of hierarchy: each TPC contains
  - 3 Streaming Multiprocessors (SMs); an SM corresponds to one core in a conventional processor
  - a texture pipeline (used for memory access)
- Global block scheduler: issues thread blocks to SMs with available capacity; a simple round-robin algorithm that takes resource availability (e.g. of shared memory) into account

Nvidia GT200
(Image source: [2])

Nvidia GT200 streaming multiprocessor (I)

- Instruction fetch, decode and issue logic
- 8 32-bit ALU units, often referred to as streaming processors (SPs) or, confusingly, called "cores" by Nvidia
- 8 branch units: a thread encountering a branch will stall until it is resolved (no speculation); branch delay: 4 cycles (a sketch of such a branch follows below)
- Two 64-bit special units for less frequent operations; 64-bit operations are 8-12 times slower than 32-bit operations!
- 1 special function unit for unusual instructions; transcendental functions, interpolations and reciprocal square roots take anywhere from 16 to 32 cycles to execute
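
A minimal CUDA sketch (illustrative, not from the slides) of the kind of data-dependent branch these units resolve; on the GT200 there is no speculation, and threads of the same SIMD group that take different sides of the branch execute the two paths one after the other:

__global__ void branchy(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (x[i] > 0.0f)          /* resolved by a branch unit, 4-cycle delay */
        x[i] = 2.0f * x[i];   /* threads taking this side ...            */
    else
        x[i] = -x[i];         /* ... and this side are serialized        */
}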

Nvidia GT200 streaming multiprocessor (II)

- Single issue, with SIMD capabilities
- Can execute up to 8 thread blocks / 1024 threads concurrently
- Does not support speculative execution or branch prediction
- Instructions are scoreboarded to reduce stalls
- Each SP has access to 2048 register file entries, each 32 bits wide
  - a double-precision number has to utilize two adjacent registers
  - the register file can be used by up to 128 threads concurrently

Nvidia GT200 streaming multiprocessor (III)
(Image source: [2])

Nvidia GT200 streaming multiprocessor (IV)

- The execution units of an SM run at twice the frequency of the fetch and issue logic, as well as of the memory and register file
- 64 KB register file that is partitioned across all SPs (16K 32-bit entries, i.e. the 2048 entries per SP noted above)
- 16 KB shared memory that can be used for communication between the threads running on the SPs of the same SM (a CUDA sketch follows below)
  - organized in 4096 entries across 16 banks (= 32-bit bank width)
  - accessing shared memory is as fast as accessing a register!

Load/Store operations

- Generated in the SMs, but handled by the SM controller in the TPC
- The load pipeline shares its hardware with the texture pipeline
  - shared by the three SMs of a TPC
  - mutually exclusive usage of the load and texture pipelines
- Effective address calculation + mapping of 40-bit virtual addresses to physical addresses by the MMU
- Texture cache: 2-D addressing
  - read-only caches without cache coherence; the entire cache hierarchy is invalidated if a data item is modified
  - texture caches are used to save bandwidth and power, they are not really faster than texture memory
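
The shared memory described above is what CUDA exposes as __shared__ arrays. In the following minimal sketch (an illustrative kernel, not taken from the slides), the threads of one block stage data in shared memory and then read values written by other threads; with stride-1 indexing, consecutive 32-bit words fall into consecutive banks, so the accesses are free of bank conflicts:

__global__ void reverseBlock(float *d)
{
    __shared__ float s[256];        /* resides in the SM's 16 KB shared memory */
    int t = threadIdx.x;
    s[t] = d[t];                    /* stride-1 store: one bank per thread */
    __syncthreads();                /* wait until every thread has written */
    d[t] = s[blockDim.x - 1 - t];   /* read a value another thread wrote  */
}

/* launched with one 256-thread block: reverseBlock<<<1, 256>>>(d_data); */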

Load/Store operations (II)
(Image source: [2])

Generalized Memory Model

CUDA Memory Model (II)

cudaError_t cudaMalloc(void** devPtr, size_t size)
- allocates size bytes of device (global) memory, pointed to by *devPtr
- returns cudaSuccess on no error

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
- dst = destination memory address
- src = source memory address
- count = bytes to copy
- kind = type of transfer (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice)

cudaError_t cudaFree(void* devPtr)
- frees memory allocated with cudaMalloc

Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo:
http://www.cse.buffalo.edu/faculty/miller/courses/cse710/heavner.pdf

Hello World: Vector Addition (II)

int main(int argc, char **argv)
{
    float a[N], b[N], c[N];   /* initialization of a and b omitted on the slide */
    float *d_a, *d_b, *d_c;

    cudaMalloc(&d_a, N*sizeof(float));
    cudaMalloc(&d_b, N*sizeof(float));
    cudaMalloc(&d_c, N*sizeof(float));

    cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(256);  /* 1-D array of threads */
    dim3 blocksPerGrid(N/256);  /* 1-D grid */
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

    cudaMemcpy(c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
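
The vecAdd kernel launched above appeared on the preceding "Vector Addition (I)" slide, which is not part of this excerpt. A minimal version consistent with the launch configuration would be (an assumed reconstruction, not the original slide's code):

__global__ void vecAdd(float *a, float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    c[i] = a[i] + b[i];   /* one element per thread; assumes N is a multiple of 256 */
}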

Availability of the GT200 processor
(Image source: [2])

Nvidia Fermi processor

- Next-generation processor from Nvidia
- Got rid of one level of hierarchy: it contains 16 SM processors, but has no notion of TPCs
- Each SM processor has
  - 32 ALU units (Nvidia "cores"), compared to 8 on the GT200, further subdivided into execution blocks of 16 units
  - 16 load/store units, compared to 1 for three SMs on the GT200
  - 64 KB of local SRAM that can be split into L1 cache and shared memory (16 KB/48 KB or 48 KB/16 KB; a sketch follows below)
  - 4 special function units, compared to 1 on the GT200
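
The L1/shared-memory split described above is selectable per kernel through the CUDA runtime. A minimal sketch, where myKernel is a hypothetical placeholder kernel:

__global__ void myKernel(void) { /* hypothetical empty kernel */ }

int main(void)
{
    /* request the 48 KB shared memory / 16 KB L1 configuration */
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    /* the opposite split would be cudaFuncCachePreferL1 */
    myKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}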

Nvidia Fermi SM processor
(Image source: [4])

Nvidia Fermi processor (II)

- Can manage up to 1,536 threads simultaneously per SM, compared to 1,024 per SM on the GT200
- Register file increased to 128 KB (32K entries)
- New: modified address space using 40-bit addresses; global, shared and local addresses are ranges within that single address space
- New: support for atomic read-modify-write operations (a sketch follows below)
- New: support for predicated instructions
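
A minimal CUDA sketch (illustrative, not from the slides) of the atomic read-modify-write support mentioned above: many threads increment a single global counter without losing updates.

__global__ void countPositive(const float *x, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (x[i] > 0.0f)
        atomicAdd(counter, 1);  /* hardware read-modify-write on global memory */
}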