Parallel programming languages:


1 Parallel programming languages: A new renaissance or a return to the dark ages? Simon McIntosh-Smith, University of Bristol Microelectronics Research Group, simonm@cs.bris.ac.uk

2 The Microelectronics Group at the University of Bristol

3 The team: Simon McIntosh-Smith (Head of Group), Dr Jose Nunez-Yanez, Prof David May, Dr Kerstin Eder, Prof Dhiraj Pradhan, Dr Simon Hollis

4 Group expertise. Energy Aware COmputing (EACO): multi-core and many-core computer architectures; High Performance Computing (GPUs, OpenCL); Network on Chip (NoC); reconfigurable architectures; design verification (formal and simulation-based); formal specification and analysis; silicon process variation; fault tolerant design (hardware and software)

5 Heterogeneous many-core

6 Heterogeneous many-core: XMOS

7 Didn't parallel computing use to be a niche?

8 When I were a lad

9 But now parallelism is mainstream: quad-core ARM Cortex A9 CPU, quad-core SGX543MP4+ Imagination GPU

10 HPC stronger than ever: 548,352 processor cores delivering 8.16 PetaFLOPS (8.16 x 10^15)

11 Big computing is mainstream too. Report: Google Uses About 900,000 Servers (Aug 1st 2011) report-google-uses-about servers/

12 A renaissance in parallel programming: OpenMP, OpenCL, Erlang, Unified Parallel C, Fortress, Cilk, CUDA, Chapel, X10, Pthreads, XC, Go, HMPP, CHARM++, Co-Array Fortran, Linda, MPI, C++ AMP

13 Groupings of languages
  Partitioned Global Address Space (PGAS): Fortress, X10, Chapel, Co-array Fortran, Unified Parallel C
  CSP: XC
  Message passing: MPI
  Shared memory: OpenMP
  GPU languages: OpenCL, CUDA, HMPP
  Object oriented: C++ AMP, CHARM++
  Multi-threaded: Cilk, Go

14 2nd fastest computer in the world

15 Emerging GPGPU standards: OpenCL, DirectCompute, C++ AMP, and NVIDIA's de facto standard, CUDA

16 OpenCL

17 It's a heterogeneous world. A modern system includes: one or more CPUs; one or more GPUs; DSP processors; ...other? (Diagram: two CPUs, GPU, GMCH, ICH, DRAM.) OpenCL lets programmers write a single portable program that uses ALL resources in the heterogeneous platform. GMCH = graphics memory control hub, ICH = input/output control hub.

18 Industry standards for programming heterogeneous platforms. CPUs: multiple cores driving performance increases; multiprocessor programming, e.g. OpenMP. GPUs: increasingly general-purpose data-parallel computing; graphics APIs and shading languages. Emerging intersection: heterogeneous computing. OpenCL (Open Computing Language): an open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors.

19 The origins of OpenCL. AMD and ATI: merged, needed commonality across products. Nvidia: GPU vendor, wants to steal market share from CPU. Intel: CPU vendor, wants to steal market share from GPU. Apple: was tired of recoding for many-core and GPUs; pushed vendors to standardize. Wrote a rough draft straw man API. Khronos Compute group formed: Ericsson, Nokia, IBM, Sony, IMG Tech, Freescale, TI + many more. Dec. Third party names are the property of their owners.

20 System-level architecture of OpenCL: OpenCL programs; OpenCL dynamic libraries; vendor-supplied drivers. (Could also be multi-core CPUs.)

21 OpenCL Platform Model. One Host + one or more Compute Devices. Each Compute Device is composed of one or more Compute Units. Each Compute Unit is further divided into one or more Processing Elements.

22 The BIG idea behind OpenCL. Replace loops with functions (a kernel) executing at each point in a problem domain (index space). E.g., process a 1024 x 1024 image with one kernel invocation per pixel, or 1024 x 1024 = 1,048,576 kernel executions.

Traditional loops:

    void trad_mul(const int n, const float *a, const float *b, float *c)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }

Data Parallel OpenCL:

    kernel void dp_mul(global const float *a, global const float *b, global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }   // execute over n work-items
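The equivalence between the loop and the kernel can be checked on an ordinary CPU. The sketch below is plain C, not OpenCL: `dp_mul_item` and `run_1d_range` are hypothetical names introduced for illustration, and the serial loop inside `run_1d_range` only stands in for an OpenCL runtime, which would be free to run the work-items in any order or in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Loop version from the slide: one call processes all n elements. */
void trad_mul(const int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Kernel-style version: one call processes exactly one index.
   The id parameter plays the role of get_global_id(0). */
void dp_mul_item(size_t id, const float *a, const float *b, float *c)
{
    c[id] = a[id] * b[id];
}

/* Hypothetical stand-in for the runtime: invoke the kernel body once
   per point of a 1-D index space of size n. */
void run_1d_range(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t id = 0; id < n; id++)
        dp_mul_item(id, a, b, c);
}
```

Because each call to `dp_mul_item` touches only its own index, the calls are independent, which is the property that lets a real device execute them concurrently.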

23 An N-dimensional domain of work-items. Global dimensions: 1024 x 1024 (whole problem space). Local dimensions: 128 x 128 (work-group, executes together). Synchronization between work-items is possible only within work-groups, using barriers and memory fences. Work-items cannot synchronize outside of a work-group.
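The relationship between global ids, work-group ids, and local ids is simple integer arithmetic along each dimension. The following is a minimal plain-C sketch, assuming a zero global offset and a global size that is an exact multiple of the local size; the function names are hypothetical, chosen to mirror what `get_group_id` and `get_local_id` report inside a kernel. Real devices also cap the work-group size (queryable as `CL_DEVICE_MAX_WORK_GROUP_SIZE`), so the slide's dimensions are best read as illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups along one dimension.
   Assumes the global size divides evenly by the local size. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Work-group that a global id falls into along one dimension. */
size_t group_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id / local_size;
}

/* Position of a global id within its work-group. */
size_t local_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id % local_size;
}

/* Inverse mapping: rebuild the global id from group id and local id. */
size_t global_from(size_t group_id, size_t local_id, size_t local_size)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

With the slide's numbers, each dimension has 1024 / 128 = 8 work-groups, giving 8 x 8 = 64 work-groups in total; only work-items that share a group id in both dimensions can synchronize with each other.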

24 OpenCL Memory Model. Private memory: per work-item. Local memory: shared within a work-group. Global / constant memories: visible to all work-groups. Host memory: on the CPU. Memory management is explicit: you must move data from host -> global -> local and back.
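The explicit data movement can be pictured with a small CPU-only emulation. This is a sketch in plain C under stated assumptions, not OpenCL code: ordinary arrays stand in for global and local memory, `GROUP_SIZE` and `MAX_N` are arbitrary illustrative constants, and `group_sum` and `sum_by_group` are hypothetical names. Each emulated work-group stages its slice of global memory into a small local array, reduces it there, and writes one result back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GROUP_SIZE 4    /* illustrative work-group size */
#define MAX_N      64   /* illustrative capacity of the "global" arrays */

/* Emulates one work-group: global -> local, compute, local -> global. */
void group_sum(const float *global_in, float *global_out, size_t group_id)
{
    float local_buf[GROUP_SIZE];               /* stands in for local memory */
    memcpy(local_buf, global_in + group_id * GROUP_SIZE, sizeof local_buf);

    float acc = 0.0f;                          /* stands in for private memory */
    for (size_t i = 0; i < GROUP_SIZE; i++)
        acc += local_buf[i];

    global_out[group_id] = acc;                /* result back to global memory */
}

/* Emulates the host side: host -> global, run every group, global -> host. */
void sum_by_group(const float *host_in, size_t n, float *host_out)
{
    assert(n % GROUP_SIZE == 0 && n <= MAX_N);
    float global_in[MAX_N];
    float global_out[MAX_N / GROUP_SIZE];
    size_t groups = n / GROUP_SIZE;

    memcpy(global_in, host_in, n * sizeof(float));         /* host -> global */
    for (size_t g = 0; g < groups; g++)
        group_sum(global_in, global_out, g);
    memcpy(host_out, global_out, groups * sizeof(float));  /* global -> host */
}
```

In real OpenCL the two outer copies would be buffer transfers issued by the host, and the staging into local memory would be written inside the kernel itself.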

25 OpenCL C Language. Derived from ISO C99: a "supersubset" (Tim Mattson, Intel). No standard C99 headers, function pointers, recursion, variable length arrays, or bit fields. Additions to the language for parallelism: work-items and work-groups; vector types; synchronization. Address space qualifiers, e.g. global, local. Optimized image access. Built-in functions.

26 Contexts and Queues. Contexts are used to contain and manage the state of the "world". Contexts are defined and manipulated by the host and contain: devices; kernels (OpenCL functions); program objects (kernel sources and executables); memory objects. Command-queue: coordinates execution of kernels. Its commands are: kernel execution commands; memory commands (transfer or mapping of memory object data); synchronization commands (constrain the order of commands).
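The idea that the host only records commands, and that an in-order queue later runs them in submission order, can be sketched without any OpenCL at all. Everything below is a toy emulation in plain C: `toy_queue`, `toy_enqueue`, and `toy_finish` are hypothetical names, the "kernel" simply doubles each element, and execution is deferred entirely until `toy_finish`, whereas a real runtime may start commands as soon as they are enqueued.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_LEN   4   /* elements in the toy memory object */
#define QUEUE_CAP 8   /* maximum number of pending commands */

typedef enum { CMD_WRITE, CMD_KERNEL, CMD_READ } cmd_kind;

typedef struct {
    cmd_kind kind;
    float *host;      /* host array for CMD_WRITE / CMD_READ, NULL for CMD_KERNEL */
} command;

typedef struct {
    command cmds[QUEUE_CAP];
    size_t count;
    float device_buf[BUF_LEN];   /* stands in for a device memory object */
} toy_queue;

/* Record a command; nothing executes yet. Returns 0 on success, -1 if full. */
int toy_enqueue(toy_queue *q, cmd_kind kind, float *host)
{
    assert(q != NULL);
    if (q->count == QUEUE_CAP)
        return -1;
    q->cmds[q->count].kind = kind;
    q->cmds[q->count].host = host;
    q->count++;
    return 0;
}

/* Execute all pending commands strictly in submission order. */
void toy_finish(toy_queue *q)
{
    assert(q != NULL);
    for (size_t i = 0; i < q->count; i++) {
        command *c = &q->cmds[i];
        switch (c->kind) {
        case CMD_WRITE:    /* memory command: host -> device */
            memcpy(q->device_buf, c->host, sizeof q->device_buf);
            break;
        case CMD_KERNEL:   /* kernel execution command: double every element */
            for (size_t j = 0; j < BUF_LEN; j++)
                q->device_buf[j] *= 2.0f;
            break;
        case CMD_READ:     /* memory command: device -> host */
            memcpy(c->host, q->device_buf, sizeof q->device_buf);
            break;
        }
    }
    q->count = 0;
}
```

Enqueuing a write, a kernel, and a read, then calling `toy_finish`, leaves the doubled data in the host output array; swapping the order of the enqueue calls changes the result, which is why ordering constraints matter for out-of-order queues.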

27 OpenCL architecture (diagram). A context spanning a CPU and a GPU contains programs, kernels, memory objects, and command queues. Programs: the dp_mul source, built into a dp_mul CPU program binary and a dp_mul GPU program binary. Kernels: dp_mul with arg[0], arg[1], and arg[2] values. Memory objects: buffers and images. Command queues: an in-order queue and an out-of-order queue feeding the compute device (GPU). The dp_mul source shown:

    kernel void dp_mul(global const float *a, global const float *b, global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }

28 Example: vector addition. The "hello world" program of data parallel programming is a program to add two vectors: C[i] = A[i] + B[i] for i = 0 to N-1. For the OpenCL solution, there are two parts: kernel code and host code.

29 Vector Addition - Kernel

    kernel void vec_add(global const float *a, global const float *b, global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }

30 Vector Addition - Host Program

    // create the OpenCL context on a GPU device
    cl_context context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // get the list of GPU devices associated with context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srca, NULL);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srcb, NULL);
    memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(cl_float)*n, NULL, NULL);

    // create the program
    program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

    // build the program
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the kernel
    kernel = clCreateKernel(program, "vec_add", NULL);

    // set the args values (argument size comes before the pointer to the value)
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    err = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

    // set work-item dimensions
    global_work_size[0] = n;

    // execute kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size,
                                 NULL, 0, NULL, NULL);

    // read output array (blocking read issued on the command-queue)
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                              n*sizeof(cl_float), dst, 0, NULL, NULL);

31 Vector Addition - Host Program (annotated). The same host program as on slide 30, divided into its steps:
  Define platform and queues: clCreateContextFromType, clGetContextInfo, clCreateCommandQueue
  Define memory objects: clCreateBuffer
  Create the program: clCreateProgramWithSource
  Build the program: clBuildProgram
  Create and set up kernel: clCreateKernel, clSetKernelArg
  Execute the kernel: clEnqueueNDRangeKernel
  Read results back to the host: clEnqueueReadBuffer
It's complicated, but most of this is boilerplate and not as bad as it looks.

32 C++ to the rescue

    #include <CL/cl.hpp>
    #include <functional>
    #include <iostream>
    #include <string>

    const std::string hwstring("hello World\n");

    using namespace cl;

    int main(void)
    {
        char *outh = new char[hwstring.length()+1];
        Buffer outcl(CL_MEM_WRITE_ONLY, hwstring.length()+1);

        // loadprogram is a helper that is not shown on the slide
        Program program(loadprogram("helloworld.cl"));
        program.build();

        std::function<Event (const EnqueueArgs&, Buffer)> hello =
            make_kernel<Buffer>(program, "hello");

        hello(EnqueueArgs(NDRange(hwstring.length()+1)), outcl);

        enqueueReadBuffer(outcl, CL_TRUE, 0, hwstring.length()+1, outh);
        std::cout << outh << std::endl;

        delete[] outh;
        return 0;
    }

C++ API published on Khronos website.

33 OpenCL case study

34 Bristol GPU-based drug docking success story: BUDE. Collaborators: Richard Sessions, Amaurys Avila Ibarra

35 Supermicro GPU server

36 Relative performance: 162 seconds per simulation versus 1,120 seconds per simulation. Less than 2 days to screen a library of 1 million drug candidates on 1000 GPUs.
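The two claims on this slide can be checked with a few lines of arithmetic. The sketch below is plain C with hypothetical function names; it assumes that 162 seconds is the GPU time (the transcript does not label the two figures, but 162 seconds is the one consistent with the "less than 2 days" claim) and that the workload scales perfectly across 1000 GPUs.

```c
#include <assert.h>

/* Speedup of the faster run over the slower one. */
double speedup(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}

/* Wall-clock days to run `simulations` jobs spread evenly over `devices`,
   assuming perfect scaling and no scheduling overhead. */
double screening_days(double simulations, double seconds_each, double devices)
{
    assert(devices > 0.0);
    const double seconds_per_day = 86400.0;
    return simulations * seconds_each / devices / seconds_per_day;
}
```

Under those assumptions the speedup is 1,120 / 162, roughly 6.9x, and one million simulations at 162 seconds each on 1000 GPUs take 162,000 seconds, or 1.875 days.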

37 Relative energy efficiency: 0.034 kWh per simulation versus 0.011 kWh per simulation. 0.[...] kWh = 0.16 pence per simulation. 1 million simulations: £1,600 on energy for one experiment.
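The energy figures follow from simple arithmetic, sketched here in plain C with hypothetical function names. The assumptions are that the two per-simulation energy figures are 0.034 kWh and 0.011 kWh, that the 1,600 figure is in pounds, and that the per-simulation cost of 0.16 pence applies uniformly to all one million simulations.

```c
#include <assert.h>

/* How many times more energy one configuration uses than the other. */
double energy_ratio(double kwh_high, double kwh_low)
{
    assert(kwh_low > 0.0);
    return kwh_high / kwh_low;
}

/* Total energy cost in pounds, given a per-simulation cost in pence. */
double campaign_cost_pounds(double simulations, double pence_each)
{
    const double pence_per_pound = 100.0;
    return simulations * pence_each / pence_per_pound;
}
```

One million simulations at 0.16 pence each come to 160,000 pence, which matches the 1,600 figure when read as pounds, and 0.034 / 0.011 gives an energy ratio of roughly 3.1x between the two configurations.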

38 Summary. Parallel languages are going through a renaissance. Not just for the niche high end any more. No silver bullets, lots of wheel reinventing. In HPC, GPUs being adopted quickly at the high end. In embedded computing, OpenCL gaining ground. Everything going heterogeneous many-core.
