COSC 6385 Computer Architecture - Multi-Processors (V)
The Intel Larrabee, Nvidia GT200 and Fermi processors
Fall 2012

References

Intel Larrabee:
[1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: Larrabee: A Many-Core x86 Architecture for Visual Computing, ACM Trans. Graph., Vol. 27, No. 3 (August 2008), pp. 1-15. http://softwarecommunity.intel.com/userfiles/en-us/file/larrabee_manycore.pdf

Nvidia GT200:
[2] David Kanter: Nvidia's GT200: Inside a Parallel Processor, 09/08/2008. http://www.realworldtech.com/page.cfm?articleid=rwt090808195242&p=1

Nvidia Fermi:
[3] David Kanter: Inside Fermi: Nvidia's HPC Push, 09/30/2009. http://www.realworldtech.com/page.cfm?articleid=rwt093009110932&p=1
[4] Peter N. Glaskowsky: Nvidia's Fermi: The First Complete GPU Architecture. http://www.nvidia.com/content/pdf/fermi_white_papers/p.glaskowsky_nvidia%27s_fermi-The_First_Complete_GPU_Architecture.pdf
Larrabee Motivation

Comparison of two architectures with the same number of transistors: the simplified in-order cores reach half the single-stream performance, but a 40x increase in throughput for multi-stream execution.

                     2 out-of-order cores   10 in-order cores
Instruction issue    4                      2
VPU per core         4-wide SSE             16-wide
L2 cache size        4 MB                   4 MB
Single stream        4 per clock            2 per clock
Vector throughput    8 per clock            160 per clock

Larrabee Overview

- Many-core visual computing architecture based on x86 CPU cores
- Extended version of the regular x86 instruction set
- Supports subroutines and page faulting
- Number of x86 cores can vary depending on the implementation and processor version
- Fixed-function units for texture filtering
- Other graphics operations such as rasterization or post-shader blending are done in software
Larrabee Overview (II)
Image Source: [1]

Overview of a Larrabee Core (I)
Image Source: [1]
Overview of a Larrabee Core (I)

- x86 core derived from the Pentium processor
- No out-of-order execution
- Standard Pentium instruction set with the addition of 64-bit instructions
- Instructions for pre-fetching data into the L1 and L2 caches
- Support for 4 simultaneous threads, with separate registers for each thread
- Each core is augmented with a wide vector processing unit (VPU)
- 32 KB L1 instruction cache, 32 KB L1 data cache
- 256 KB local subset of the L2 cache; the L2 cache is coherent across all cores

Vector Processing Unit in Larrabee

- 16-wide VPU executing integer, single- and double-precision floating point operations
- The VPU supports gather/scatter operations: the 16 elements can be loaded from or stored to up to 16 different addresses
- Support for predicated instructions using a mask control register (if-then-else statements)
Inter-Processor Ring Network

- Bi-directional ring network, 512 bits wide per direction
- Routing decisions are made before injecting a message into the network

Larrabee Programming Models

- Most applications can be executed without modification due to the full support of the x86 instruction set
- Support for POSIX threads to create multiple threads; the API is extended by thread affinity parameters
- Recompiling code with Larrabee's native compiler automatically generates code that uses the VPUs
- Alternative parallel approaches:
  - Intel Threading Building Blocks
  - Larrabee-specific OpenMP directives
Larrabee Performance
Image Source: [1]

Nvidia GT200

- The GT200 is a multi-core chip with a two-level hierarchy that focuses on high throughput for data-parallel workloads
- 1st level of hierarchy: 10 Thread Processing Clusters (TPCs)
- 2nd level of hierarchy: each TPC has
  - 3 Streaming Multiprocessors (SMs) (an SM corresponds to 1 core in a conventional processor)
  - a texture pipeline (used for memory access)
- Global block scheduler: issues thread blocks to SMs with available capacity, using a simple round-robin algorithm but taking resource availability (e.g. of shared memory) into account
Nvidia GT200
Image Source: [2]

Nvidia GT200 streaming multiprocessor (I)

- Instruction fetch, decode and issue logic
- 8 32-bit ALU units (often referred to as streaming processors (SPs), or, confusingly, called "cores" by Nvidia)
- 8 branch units
  - a thread encountering a branch will stall until it is resolved (no speculation), branch delay: 4 cycles
- two 64-bit special units for less frequent operations
  - 64-bit operations are 8-12 times slower than 32-bit operations!
- 1 special function unit for unusual instructions
  - transcendental functions, interpolations, reciprocal square roots
  - these take anywhere from 16 to 32 cycles to execute
Nvidia GT200 streaming multiprocessor (II)

- single issue with SIMD capabilities
- can execute up to 8 thread blocks / 1024 threads concurrently
- does not support speculative execution or branch prediction
- instructions are scoreboarded to reduce stalls
- each SP has access to 2048 register file entries of 32 bits each
  - a double-precision number has to utilize two adjacent registers
  - the register file can be used by up to 128 threads concurrently

Nvidia GT200 streaming multiprocessor (III)
Image Source: [2]
Nvidia GT200 streaming multiprocessor (IV)

- Execution units of an SM run at twice the frequency of the fetch and issue logic as well as the memory and register file
- 64 KB register file that is partitioned across all SPs
- 16 KB shared memory that can be used for communication between the threads running on the SPs of the same SM
  - organized in 4096 entries, 16 banks (= 32-bit bank width)
  - accessing shared memory is as fast as accessing a register!

Load/Store operations

- Generated in the SMs, but handled by the SM controller in the TPC
- The load pipeline shares hardware with the texture pipeline and is shared by the three SMs
  - mutually exclusive usage of the load and texture pipelines
- Effective address calculation + mapping of 40-bit virtual addresses to physical addresses by the MMU
- Texture cache: 2-D addressing
  - read-only caches without cache coherence; the entire cache hierarchy is invalidated if a data item is modified
  - texture caches are used to save bandwidth and power; they are not really faster than texture memory
Load/Store operations (II)
Image Source: [2]

Generalized Memory Model
CUDA Memory Model (II)

cudaError_t cudaMalloc(void** devPtr, size_t size)
- Allocates size bytes of device (global) memory pointed to by *devPtr
- Returns cudaSuccess for no error

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
- dst = destination memory address
- src = source memory address
- count = bytes to copy
- kind = type of transfer (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice)

cudaError_t cudaFree(void* devPtr)
- Frees memory allocated with cudaMalloc

Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo
http://www.cse.buffalo.edu/faculty/miller/courses/cse710/heavner.pdf

Hello World: Vector Addition (II)

int main(int argc, char **argv)
{
    float a[N], b[N], c[N];    /* N and the vecAdd kernel come from part (I) */
    float *d_a, *d_b, *d_c;

    cudaMalloc((void**)&d_a, N*sizeof(float));
    cudaMalloc((void**)&d_b, N*sizeof(float));
    cudaMalloc((void**)&d_c, N*sizeof(float));

    cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(256);   /* 1-D array of threads */
    dim3 blocksPerGrid(N/256);   /* 1-D grid */
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

    /* device-to-host: the host array c is the destination */
    cudaMemcpy(c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
Availability of the GT200 processor
Image Source: [2]

Nvidia Fermi processor

- Next-generation processor from Nvidia
- Got rid of one level of hierarchy: contains 16 SM processors, but no notion of TPCs
- Each SM processor has
  - 32 ALU units (Nvidia "cores") compared to 8 on the GT200, further subdivided into execution blocks of 16 units
  - 16 load/store units, compared to 1 for three SMs on the GT200
  - 64 KB local SRAM that can be split into L1 cache and shared memory (16 KB/48 KB or 48 KB/16 KB)
  - 4 special function units compared to 1 on the GT200
Nvidia Fermi SM processor
Image Source: [4]

Nvidia Fermi processor

- Can manage up to 1,536 threads simultaneously per SM, compared to 1,024 per SM on the GT200
- Register file increased to 128 KB (32K entries)
- New: modified address space using 40-bit addresses; global, shared and local addresses are ranges within that address space
- New: support for atomic read-modify-write operations
- New: support for predicated instructions