Multi-core Programming: Introduction
Timo Lilja
January 22, 2009

Contents

1 Practical Arrangements
2 Multi-core processors
  2.1 CPUs
  2.2 GPUs
  2.3 Open Problems
3 Topics

1 Practical Arrangements

- Meetings: A232 on Thursdays, 14-16 o'clock
- Presentation: extended slide sets are to be handed out before the presentation
- Programming topics and meeting times are decided after we review the questionnaire forms
- We might not have meetings every week!
- Course web page: http://www.cs.hut.fi/u/tlilja/multicore/

2 Multi-core processors

2.1 CPUs

Multi-core CPUs

- Combines two or more independent cores into a single chip
- Cores do not have to be identical
- Moore's law still holds, but increasing the clock frequency started to become problematic around 2004 for the x86 architectures
- Main problems:
  - memory wall
  - ILP wall
  - power wall

The memory wall is the gap between memory and processor speeds. The instruction-level parallelism (ILP) wall refers to the problem of finding enough parallelism to keep higher clock-speed CPUs busy. The power wall relates to the fact that increasing the clock speed increases CPU power consumption.

History

- Multi-core designs were originally used in DSPs; e.g., mobile phones have a general-purpose processor for the UI and a DSP for real-time processing
- IBM POWER4 was the first non-embedded dual-core processor in 2001
- HP PA-8800 in 2003
- Intel's and AMD's first dual-cores in 2005
- Intel and AMD came relatively late to the multi-core market; Intel had, however, hyper-threading/SMT in 2002
- Sun UltraSPARC T1 in 2005
- Lots of others: ARM MPCore, STI Cell (PlayStation 3), GPUs, network processors, DSPs, ...

Multi-core Advantages and Disadvantages

Advantages
- Cache coherency is more efficient since the signals have less distance to travel than between separate-chip CPUs
- Power consumption may be lower compared to independent chips, since some circuitry is shared, e.g. the L2 cache
- Improved response time for multiple CPU-intensive workloads

Disadvantages
- Applications perform better only if they are multi-threaded
- A multi-core design may not use silicon area as optimally as single-core CPUs
- The system bus and memory access may become bottlenecks

Programming multi-core CPUs

- Basically nothing new here: lessons learnt in separate-chip SMP programming are still valid
- Shared memory access: mutexes
- Conventional synchronization problems
- Shared memory vs. message passing
- Threads
- Operating system scheduling
- Programming language support vs. library support
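As a minimal sketch of the shared-memory/mutex point above (this example is not part of the original slides), two POSIX threads increment a shared counter; without the mutex the final value would be unpredictable. Compile with, e.g., gcc -pthread.

/* Illustrative sketch: protecting shared state with a POSIX mutex. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* always 2000000 with the mutex */
    return 0;
}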
What to gain from multi-cores

Amdahl's law

- The speedup of a program is limited by the time needed for the sequential fraction of the program
- For example: if a program needs 20 hours on a single core and 1 hour of the computation cannot be parallelized, then the minimal execution time is 1 hour regardless of the number of cores
- Not all computation can be parallelized
- Care must be taken when an application is parallelized
- If the software architecture was not written with concurrent execution in mind, then good luck with the parallelization!
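Written out explicitly (the slide only states it in words): if p is the fraction of the program that can be parallelized and n is the number of cores, Amdahl's law gives the speedup

    S(n) = \frac{1}{(1 - p) + \frac{p}{n}}

In the 20-hour example above, p = 19/20, so S(n) = 1 / (0.05 + 0.95/n), which approaches 20 as n grows: the sequential hour caps the achievable speedup no matter how many cores are added.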
Software technologies

- POSIX threads
- Separate processes
- Cilk
- OpenMP
- Intel Threading Building Blocks
- Various Java/C/C++ libraries/language support
- FP languages: Erlang, Concurrent ML/Haskell

Cilk is an extension of the C language which has some new concurrency primitives embedded in it. The primitives are spawn, sync, inlet and abort. spawn executes the call in parallel and sync makes the execution wait until all spawned calls have completed. Inlets (inlet, abort) provide a more advanced form of parallelism. An example of Fibonacci calculation in Cilk:

cilk int fib(int n)
{
    if (n < 2)
        return 1;
    else {
        int x, y;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return (x + y);
    }
}

OpenMP is a set of compiler directives and library routines which provide a concurrent execution framework for C/C++ and Fortran. An example of parallel for-loop execution in OpenMP:

int main(int argc, char *argv[])
{
    const int N = 100000;
    int i, a[N];

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2 * i;
    return 0;
}

Intel Threading Building Blocks is a C++ template library for multi-core programming. It provides primitives for parallel loop execution, synchronization, mutual exclusion, and so on.

Erlang is a concurrent programming language developed by Ericsson. It provides a few powerful concurrency primitives and a message-passing interface to concurrency. Erlang has lightweight, user-level, share-nothing processes.

% create a process and call the function
% web:start_server(Port, MaxConnections)
ServerProcess = spawn(web, start_server, [Port, MaxConnections]),

% create a remote process and call the function
% web:start_server(Port, MaxConnections) on machine RemoteNode
RemoteProcess = spawn(RemoteNode, web, start_server, [Port, MaxConnections]),

% send the {pause, 10} message
% (a tuple with an atom "pause" and a number "10")
% to ServerProcess (asynchronously)
ServerProcess ! {pause, 10},

% receive messages sent to this process
receive
    a_message -> do_something;
    {data, DataContent} -> handle(DataContent);
    {hello, Text} -> io:format("Got hello message: ~s", [Text]);
    {goodbye, Text} -> io:format("Got goodbye message: ~s", [Text])
end.

2.2 GPUs

Stream processing

- Based on the SIMD/MIMD paradigms
- Given a data stream and a kernel function which is applied to each element in the stream
- Stream processing is not "standard CPU + SIMD/MIMD":
  - stream processors are massively parallel (e.g. hundreds of GPU cores instead of the 1-10 cores of today's CPUs)
  - this imposes limits on kernel and stream size
- Kernels must be independent and use data locally to get performance gains from stream processing

SIMD is part of Flynn's taxonomy:

SISD Single Instruction, Single Data. There is no parallelism in the execution in either the instruction or the data stream.
SIMD Single Instruction, Multiple Data. The CPU operates on multiple data elements when executing a single instruction.
MISD Multiple Instruction, Single Data. Multiple instruction streams operate on a single data stream. Mainly used for redundancy or fault tolerance.
MIMD Multiple Instruction, Multiple Data. Multiple instruction streams perform operations on multiple data streams, e.g. some kind of distributed system.

An example: a traditional for-loop

for (i = 0; i < 100 * 4; i++)
    r[i] = a[i] + b[i];

in the SIMD paradigm

for (i = 0; i < 100; i++)
    vector_sum(r[i], a[i], b[i]);

and in the parallel stream paradigm (pseudo-code)

streamElements 100
streamElementFormat 4 numbers
elementKernel "@arg0 + @arg1"
result = kernel(source0, source1)
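As an illustrative aside (not in the original slides), the SIMD version of the loop above can be written concretely with x86 SSE intrinsics, here as a flat-array variant of vector_sum that adds four packed floats per instruction:

/* Sketch only: assumes an x86 CPU with SSE and n divisible by 4. */
#include <xmmintrin.h>

void vector_sum(float *r, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 floats from b */
        _mm_storeu_ps(&r[i], _mm_add_ps(va, vb));   /* store 4 sums into r  */
    }
}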
GPUs

- General-purpose computing on GPUs (GPGPU)
- Origins in programmable vertex and fragment shaders
- GPUs are suitable for problems that can be solved using stream processing
- Thus, data parallelism must be high and the computation independent
- arithmetic intensity = operations / words transferred
- Computations that benefit from GPUs have high arithmetic intensity

Vertex shaders allow the programmer to change the attributes of a vertex (i.e., its position, orientation, texture) and fragment shaders calculate the color per pixel. Originally shader programming capabilities were very limited, but modern graphics libraries (OpenGL, DirectX) provide very versatile programming environments.

Arithmetic intensity is the ratio of computation to bandwidth. The more data the program has to transfer, the less there is to gain from GPU stream processing.

Gather vs. Scatter

High arithmetic intensity requires that communication between stream elements is minimised.

Gather
- The kernel requests information from other parts of memory
- Corresponds to a random-access load capability

Scatter
- The kernel distributes information to other elements
- Corresponds to a random-access store capability
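In ordinary C terms (an illustrative addition, not from the slides), gather and scatter are simply indexed loads and stores:

/* Sketch: gather = random-access load, scatter = random-access store. */
void gather(float *out, const float *table, const int *idx, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = table[idx[i]];     /* read from arbitrary locations */
}

void scatter(float *out, const float *val, const int *idx, int n)
{
    for (int i = 0; i < n; i++)
        out[idx[i]] = val[i];       /* write to arbitrary locations  */
}

As described below, fragment processors can gather (via texture fetches) but cannot scatter natively, because their output address is fixed.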
GPU Resources

- Programmable processors
  - Vertex Processors
  - Fragment Processors
- Memory management
  - Rasterizer
  - Texture Unit
  - Render-to-Texture

A vertex is a set of position, color, normal vector and such. Vertex processors apply a vertex program (vertex shader) to transform a vertex relative to the camera; each set of three vertices is then used to compute fragments, which are needed to create shaded pixels in the final image. Vertex processors have hardware to process four-element vectors. The NVidia GeForce 6800 has six vertex processors.

Fragment/pixel processors are fully programmable and operate SIMD-style on input elements, processing four elements in parallel. Fragment processors can fetch data in parallel from textures, so they are capable of gather. The output address is fixed before the fragment is processed, so they cannot do scatter natively. For GPGPU, fragment processors are used more than vertex processors: there are more fragment than vertex processors, and the output of fragment processors goes directly to memory.

The rasterizer groups sets of three vertices and computes the triangles. From this stream of triangles the fragments are generated. The rasterizer can be seen as an address interpolator.

The texture unit is the way the fragment and vertex processors can access memory; it is, in a way, a read-only memory interface. Render-to-texture is the way to write the result to a texture instead of the graphics card's frame buffer (i.e. the screen): in other words, a write-only memory interface.

Data types

- Basic types: integers, floats, booleans
- Floating-point support is somewhat limited
- Some NVidia Tesla models support full double-precision floats
- Care must be taken when using GPU floats

CPU vs. GPU

Mapping between GPU and CPU concepts:

    GPU                           CPU
    textures (streams)            arrays
    fragment programs (kernels)   inner loops
    render-to-texture             feedback
    geometry rasterization        computation invocation
    texture coordinates           computational domain
    vertex coordinates            computational range

Software technologies

- ATI/AMD Stream SDK
- NVidia Cuda
- OpenCL
- BrookGPU
- GPU kernels and Haskell? Other FPs?
- Intel Larrabee and corresponding software?
NVidia Cuda (1/2)

- First beta in 2007
- C compiler with language extensions specific to GPU stream processing
- Low-level ISAs are closed; a proprietary driver compiles the code for the GPU (AMD/ATI have opened their ISAs)
- OS support: Windows XP/Vista, Linux, Mac OS X
  - On Linux, Red Hat/SUSE/Fedora/Ubuntu are supported, though there are no .debs, only a shell-script installer: http://www.nvidia.com/object/cuda_get.html
- PyCuda, a Python interface for Cuda: http://mathema.tician.de/software/pycuda

NVidia Cuda (2/2)

An example:

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

- The compiler is nvcc and the file extension is .cu
- See the CUDA 2.0 Programming Guide and Reference Manual at http://www.nvidia.com/object/cuda_develop.html

Unfortunately the actual implementation is a bit more complex than the slide above suggests. Care must be taken to allocate the necessary device buffers, copy data between them and the host, and release the allocated resources after execution. The code below creates the vector [1.0, 2.0, 3.0], adds it to itself producing another vector, and prints the result.

/* -*- c -*- */
#include <stdio.h>
#include <cuda.h>

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

#define SIZE 3

int main()
{
    float a_host[SIZE] = { 1.0, 2.0, 3.0 };
    float r_host[SIZE];
    float *a_gpu, *r_gpu;
    int i;

    /* allocate device buffers (sizes are given in bytes) */
    cudaMalloc((void **)&a_gpu, sizeof(float)*SIZE);
    cudaMalloc((void **)&r_gpu, sizeof(float)*SIZE);
    cudaMemcpy(a_gpu, a_host, sizeof(float)*SIZE, cudaMemcpyHostToDevice);

    // Kernel invocation
    vecAdd<<<1, SIZE>>>(a_gpu, a_gpu, r_gpu);

    cudaMemcpy(r_host, r_gpu, sizeof(float)*SIZE, cudaMemcpyDeviceToHost);
    for (i = 0; i < SIZE; i++)
        printf("%f\n", r_host[i]);

    cudaFree(a_gpu);
    cudaFree(r_gpu);
    return 0;
}

2.3 Open Problems

Open Problems

- How ready are current environments for multi-core/GPU? E.g., Java/JVM
- What tools are needed for developing concurrent software?
  - For multi-core CPUs and GPUs
  - E.g., debuggers for GPUs?
- Operating system support?
  - Schedulers
  - Device drivers? Totally proprietary, licensing issues?
- Lack of standards? Is OpenCL a solution?

3 Topics

Possible topics (1/2)

- Multi-core CPUs
  - Threads, OpenMP, UPC, Intel Threading Building Blocks
  - Intel's Tera-scale Computing Research Program
- GPUs
  - NVidia Cuda, AMD FireStream, Intel Larrabee, OpenCL
  - Stream processing
- Programming languages
  - FP languages: Haskell and GPUs, Concurrent ML, Erlang
  - Mainstream languages: Java/JVM/C#/C/C++
  - GPU/multi-core support in scripting languages (Python, Ruby, Perl)
  - Message passing vs. shared memory

Possible topics (2/2)

- Hardware overview
  - Multi-core CPUs, GPUs: what is available? How many cores?
  - Embedded CPUs, network hardware, other?
- Applications
  - What applications are (un)suitable for multi-core CPUs/GPUs?
  - Gaining performance in legacy applications: is it possible? How to do it? Problems? Personal experiences?