Multi-core Programming: Introduction

Timo Lilja
January 22, 2009

Contents

1 Practical Arrangements
2 Multi-core processors
  2.1 CPUs
  2.2 GPUs
  2.3 Open Problems
3 Topics

1 Practical Arrangements

- Meetings: A232 on Thursdays between 14 and 16 o'clock
- Presentation: extended slide sets are to be handed out before the presentation
- Programming topics and meeting times are decided after we review the questionnaire forms
- We might not have meetings every week!
- Course web page: http://www.cs.hut.fi/u/tlilja/multicore/

2 Multi-core processors

2.1 CPUs

Multi-core CPUs

- A multi-core CPU combines two or more independent cores on a single chip. The cores do not have to be identical.
- Moore's law still holds, but increasing the clock frequency started to become problematic around 2004 for the x86 architectures.
- The main problems are:

  - the memory wall
  - the ILP wall
  - the power wall

The memory wall is the gap between memory and processor speeds. The instruction-level parallelism (ILP) wall refers to the problem of finding enough parallelism to keep higher clock-speed CPUs busy. The power wall relates to the fact that increasing the clock speed increases CPU power consumption.

History

- Multi-core designs were originally used in DSPs. E.g., mobile phones have a general-purpose processor for the UI and a DSP for real-time processing.
- IBM POWER4 was the first non-embedded dual-core processor, in 2001
- HP PA-8800 in 2003
- Intel's and AMD's first dual-cores in 2005
- Intel and AMD came relatively late to the multi-core market; Intel did, however, have hyper-threading/SMT in 2002
- Sun UltraSPARC T1 in 2005
- Lots of others: ARM MPCore, STI Cell (PlayStation 3), GPUs, network processors, DSPs, ...

Multi-core Advantages and Disadvantages

Advantages
- Cache coherency is more efficient since the signals have less distance to travel than between separate-chip CPUs.
- Power consumption may be lower than with independent chips
- Some circuitry is shared, e.g. the L2 cache
- Improved response time for multiple CPU-intensive workloads

Disadvantages
- Applications perform better only if they are multi-threaded
- A multi-core design may not use silicon area as optimally as single-core CPUs
- The system bus and memory access may become bottlenecks

Programming multi-core CPUs

- Basically nothing new here: the lessons learnt in independent-chip SMP programming are still valid
- Shared memory access: mutexes (a small POSIX threads sketch follows the software technologies list below)
- Conventional synchronization problems
- Shared memory vs. message passing
- Threads
- Operating system scheduling
- Programming language support vs. library support

What to gain from multi-cores

- Amdahl's law: the speedup of a program is limited by the time needed for the sequential fraction of the program. With parallel fraction p and N cores the speedup is at most 1 / ((1 - p) + p/N), which approaches 1/(1 - p) as N grows.
- For example: if a program needs 20 hours on a single core and 1 hour of the computation cannot be parallelized, then the minimal execution time is 1 hour regardless of the number of cores.
- Not all computation can be parallelized
- Care must be taken when an application is parallelized
- If the software architecture was not written with concurrent execution in mind, then good luck with the parallelization!

Software technologies

- POSIX threads
- Separate processes
- Cilk
- OpenMP
- Intel Threading Building Blocks
- Various Java/C/C++ libraries and language support
- FP languages: Erlang, Concurrent ML/Haskell
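The mutex item above can be made concrete with a short POSIX threads sketch (not from the original slides; the counter, thread count and iteration count are made up for illustration): several threads increment a shared counter and the mutex serializes the updates.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                        /* shared state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);              /* serialize access to counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);         /* expect NTHREADS * 100000 */
    return 0;
}

Built with a pthreads-aware compiler (e.g. gcc -pthread), the program prints 400000; removing the lock/unlock pair turns the increment into a data race and the result becomes unpredictable.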

Cilk is an extension of the C language with some new concurrency primitives embedded in it. The primitives are spawn, sync, inlet and abort. spawn executes a call in parallel and sync makes execution wait until all spawned calls have completed. Inlets (inlet, abort) provide a more advanced form of parallelism. An example of Fibonacci calculation in Cilk:

cilk int fib (int n)
{
    if (n < 2)
        return 1;
    else {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
    }
}

OpenMP is a set of compiler directives and library routines which provide a concurrent execution framework for C/C++ and Fortran. An example of parallel for-loop execution in OpenMP:

int main(int argc, char *argv[])
{
    const int N = 100000;
    int i, a[N];

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2*i;

    return 0;
}
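As a further hedged OpenMP sketch (not from the original slides): when loop iterations accumulate into a shared variable, a reduction clause lets OpenMP handle the synchronization instead of an explicit mutex or critical section.

#include <stdio.h>

int main(void)
{
    const int N = 100000;
    double sum = 0.0;
    int i;

    /* each thread accumulates a private partial sum;
       the partial sums are combined at the end of the loop */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += 2.0 * i;

    printf("sum = %f\n", sum);
    return 0;
}

An OpenMP-aware compiler (e.g. gcc -fopenmp) parallelizes the loop; without OpenMP the pragma is ignored and the loop runs serially, producing the same result.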

Intel Threading Building Blocks is a C++ template library for multi-core programming. It provides primitives for parallel loop execution, synchronization, mutual exclusion, and such.

Erlang is a concurrent programming language developed by Ericsson. It provides a few powerful concurrency primitives and a message-passing approach to concurrency. Erlang has lightweight, user-level, share-nothing processes.

% create a process and call the function
% web:start_server(Port, MaxConnections)
ServerProcess = spawn(web, start_server, [Port, MaxConnections]),

% create a remote process and call the function
% web:start_server(Port, MaxConnections) on machine RemoteNode
RemoteProcess = spawn(RemoteNode, web, start_server, [Port, MaxConnections]),

% send the {pause, 10} message
% (a tuple with the atom "pause" and the number "10")
% to ServerProcess (asynchronously)
ServerProcess ! {pause, 10},

% receive messages sent to this process
receive
    a_message -> do_something;
    {data, DataContent} -> handle(DataContent);
    {hello, Text} -> io:format("Got hello message: ~s", [Text]);
    {goodbye, Text} -> io:format("Got goodbye message: ~s", [Text])
end.

2.2 GPUs

Stream processing

- Based on the SIMD/MIMD paradigms
- Given a data stream and a function, the kernel, which is applied to each element in the stream
- Stream processing is not standard CPU + SIMD/MIMD:
  - stream processors are massively parallel (e.g. hundreds of GPU cores instead of today's 1-10 CPU cores)
  - it imposes limits on kernel and stream size
- The kernel must be independent and use data locally to get performance gains from stream processing

SIMD is part of Flynn's taxonomy:

SISD  Single Instruction, Single Data. There is no parallelism in either the instruction or the data stream.
SIMD  Single Instruction, Multiple Data. The CPU operates on multiple data elements when executing a single instruction.
MISD  Multiple Instruction, Single Data. Multiple instructions operate on a single data stream. Mainly used for redundancy or fault tolerance.
MIMD  Multiple Instruction, Multiple Data. Multiple instruction streams perform operations on multiple data streams; some kind of distributed system.

An example: a traditional for-loop

for (i = 0; i < 100 * 4; i++)
    r[i] = a[i] + b[i];

in the SIMD paradigm

for (i = 0; i < 100; i++)
    vector_sum(r[i], a[i], b[i]);

and in the parallel stream paradigm

streamelements 100
streamelementformat 4 numbers
elementkernel "@arg0 + @arg1"
result = kernel(source0, source1)
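To restate the stream idea in plain C (a sketch, not from the original slides; stream_map and add_kernel are made-up names): the kernel is simply a function applied independently to each element of the input streams, and it is exactly this independence that lets a GPU spread the iterations over its many cores.

#include <stddef.h>

/* the kernel: applied independently to each stream element */
static float add_kernel(float a, float b)
{
    return a + b;
}

/* apply the kernel over whole streams; on a GPU the iterations
   would run in parallel, one (or more) per core/thread */
static void stream_map(float *r, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = add_kernel(a[i], b[i]);
}

Because add_kernel touches only its own element, no iteration depends on another, which is the property stream processing requires.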

GPUs

- General-purpose computing on GPUs (GPGPU)
- Origins in programmable vertex and fragment shaders
- GPUs are suitable for problems that can be solved using stream processing
- Thus, data parallelism must be high and the computation independent
- arithmetic intensity = operations / words transferred
- Computations that benefit from GPUs have high arithmetic intensity

Vertex shaders allow the programmer to change the attributes of a vertex (i.e., its position, orientation, texture) and fragment shaders calculate the color per pixel. Originally shader programming capabilities were very limited, but modern graphics libraries (OpenGL, DirectX) provide very versatile programming environments. Arithmetic intensity is the ratio of computation to bandwidth: the more data the program has to transfer, the less there is to gain from GPU stream processing.

Gather vs. Scatter

High arithmetic intensity requires that communication between stream elements is minimized.

- Gather: the kernel requests information from other parts of memory. Corresponds to random-access load capability.
- Scatter: the kernel distributes information to other elements. Corresponds to random-access store capability.
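In CPU terms the two access patterns look like the following C sketch (not from the original slides; idx is a hypothetical index array):

#include <stddef.h>

/* gather: each output element reads from a computed address
   (random-access load) */
static void gather(float *r, const float *a, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = a[idx[i]];
}

/* scatter: each input element writes to a computed address
   (random-access store) */
static void scatter(float *r, const float *a, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[idx[i]] = a[i];
}

Fragment processors can perform the gather pattern through texture fetches, but their fixed output address rules out the scatter pattern, as described below.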

GPU Resources

- Programmable processors: vertex processors, fragment processors
- Memory management: rasterizer, texture unit, render-to-texture

A vertex is a set of position, color, normal vector and such. Vertex processors apply a vertex program (vertex shader) to transform a vertex relative to the camera; each set of three vertices is then used to compute a fragment, which is needed to create shaded pixels in the final image. Vertex processors have hardware to process four-element vectors. The NVidia GeForce 6800 has six vertex processors.

Fragment/pixel processors are fully programmable and operate SIMD-style on input elements, processing four elements in parallel. Fragment processors can fetch data in parallel from textures, so they are capable of gather. The output address is fixed before the fragment is processed, so they cannot do scatter natively. For GPGPU, fragment processors are used more than vertex processors: there are more fragment than vertex processors, and the output of fragment processors goes directly to memory.

The rasterizer groups sets of three vertices and computes the triangles; from this stream of triangles the fragments are generated. The rasterizer can be seen as an address interpolator.

The texture unit is the way the fragment and vertex processors can access memory; it is, in a way, a read-only memory interface. Render-to-texture is the way to write the result to a texture instead of the graphics card's frame buffer (i.e. the screen); in other words, a write-only memory interface.

Data types

- Basic types: integers, floats, booleans
- Floating-point support is somewhat limited
- Some NVidia Tesla models support full double-precision floats
- Care must be taken when using GPU floats
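A small CPU-side illustration of why single-precision floats need care (generic C, not GPU-specific, and not from the original slides): once a running single-precision sum reaches 2^24, adding 1.0f no longer changes it.

#include <stdio.h>

int main(void)
{
    /* sum many small values in single and double precision */
    float fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 20000000; i++) {
        fsum += 1.0f;
        dsum += 1.0;
    }
    printf("float sum  = %.1f\n", fsum);   /* stalls at 16777216.0 (2^24) */
    printf("double sum = %.1f\n", dsum);   /* 20000000.0 */
    return 0;
}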

CPU vs. GPU

The mapping between CPU and GPU concepts:

GPU                          CPU
textures (streams)           arrays
fragment programs (kernels)  inner loops
render-to-texture            feedback
geometry rasterization       computation invocation
texture coordinates          computational domain
vertex coordinates           computational range

Software technologies

- ATI/AMD Stream SDK
- NVidia Cuda
- OpenCL
- BrookGPU
- GPU kernels and Haskell? Other FPs?
- Intel Larrabee and corresponding software?

NVidia Cuda (1/2)

- First beta in 2007
- A C compiler with language extensions specific to GPU stream processing
- Low-level ISAs are closed; the proprietary driver compiles the code for the GPU (AMD/ATI have opened their ISAs)
- OS support: Windows XP/Vista, Linux, Mac OS X
- On Linux, Red Hat/SUSE/Fedora/Ubuntu are supported; there are no .debs, but a shell-script installer is available: http://www.nvidia.com/object/cuda_get.html
- PyCuda, a Python interface for Cuda: http://mathema.tician.de/software/pycuda

NVidia Cuda (2/2)

An example:

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

The compiler is nvcc and the file extension is .cu. See the CUDA 2.0 Programming Guide and Reference Manual at http://www.nvidia.com/object/cuda_develop.html.

Unfortunately the actual implementation is a bit more complex than the above slide suggests. Care must be taken to allocate the necessary device buffers, copy data between them and the host, and release the allocated resources after execution. The code below creates a vector [1.0, 2.0, 3.0], adds it to itself to produce another vector, and prints the result.

/* -*- c -*- */
#include <stdio.h>
#include <cuda.h>

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

#define SIZE 3

int main()
{
    float a_host[SIZE] = { 1.0, 2.0, 3.0 };
    float r_host[SIZE];
    float *a_gpu, *r_gpu;
    int i;

    cudaMalloc((void **)&a_gpu, SIZE * sizeof(float));
    cudaMalloc((void **)&r_gpu, SIZE * sizeof(float));
    cudaMemcpy(a_gpu, a_host, sizeof(float)*SIZE, cudaMemcpyHostToDevice);

    // Kernel invocation
    vecAdd<<<1, SIZE>>>(a_gpu, a_gpu, r_gpu);

    cudaMemcpy(r_host, r_gpu, sizeof(float)*SIZE, cudaMemcpyDeviceToHost);

    for (i = 0; i < SIZE; i++)
        printf("%f\n", r_host[i]);

    cudaFree(a_gpu);
    cudaFree(r_gpu);
    return 0;
}
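The CUDA runtime calls above return status codes that the example ignores; a common pattern (a sketch, not part of the original slides; the helper name check is made up) is to wrap each call in a small checking function, compiled with nvcc as a .cu file like the example above.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

/* abort with a readable message if a CUDA runtime call fails */
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

/* usage, e.g.:
   check(cudaMalloc((void **)&a_gpu, SIZE * sizeof(float)), "cudaMalloc");
   check(cudaMemcpy(a_gpu, a_host, SIZE * sizeof(float),
                    cudaMemcpyHostToDevice), "cudaMemcpy"); */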

2.3 Open Problems

- How ready are current environments for multi-core/GPU? E.g., Java/JVM
- What tools are needed for developing concurrent software on multi-core CPUs and GPUs? E.g., debuggers for GPUs?
- Operating system support? Schedulers, device drivers?
- Totally proprietary, licensing issues? Lack of standards? Is OpenCL a solution?

3 Topics

Possible topics (1/2)

- Multi-core CPUs
  - Threads, OpenMP, UPC, Intel Threading Building Blocks
  - Intel's Tera-scale Computing Research Program
- GPUs
  - NVidia Cuda, AMD FireStream, Intel Larrabee, OpenCL
  - Stream processing
- Programming languages
  - FP languages: Haskell and GPUs, Concurrent ML, Erlang
  - Mainstream languages: Java/JVM/C#/C/C++
  - GPU/multi-core support in scripting languages (Python, Ruby, Perl)
  - Message passing vs. shared memory

Possible topics (2/2)

- Hardware overview
  - Multi-core CPUs, GPUs: What is available? How many cores?
  - Embedded CPUs, network hardware, other?
- Applications
  - What applications are (un)suitable for multi-core CPUs/GPUs?
  - Gaining performance in legacy applications: Is it possible? How to do it? Problems? Personal experiences?