Parallel programming languages:


1 Parallel programming languages: A new renaissance or a return to the dark ages?
Simon McIntosh-Smith, University of Bristol Microelectronics Research Group, simonm@cs.bris.ac.uk

2 The Microelectronics Group at the University of Bristol

3 The team
- Simon McIntosh-Smith (Head of Group)
- Dr Jose Nunez-Yanez
- Prof David May
- Dr Kerstin Eder
- Prof Dhiraj Pradhan
- Dr Simon Hollis

4 Group expertise
Energy Aware COmputing (EACO):
- Multi-core and many-core computer architectures
- High Performance Computing (GPUs, OpenCL)
- Network on Chip (NoC)
- Reconfigurable architectures
- Design verification (formal and simulation-based)
- Formal specification and analysis
- Silicon process variation
- Fault tolerant design (hardware and software)

5 Heterogeneous many-core

6 Heterogeneous many-core: XMOS

7 Didn't parallel computing use to be a niche?

8 When I were a lad

9 But now parallelism is mainstream
- Quad-core ARM Cortex A9 CPU
- Quad-core SGX543MP4+ Imagination GPU

10 HPC stronger than ever
548,352 processor cores delivering 8.16 PetaFLOPS (8.16 x 10^15 FLOPS)

11 Big computing is mainstream too
Report: Google Uses About 900,000 Servers (Aug 1st 2011) report-google-uses-about servers/

12 A renaissance in parallel programming
OpenMP, OpenCL, Erlang, Unified Parallel C, Fortress, Cilk, CUDA, Chapel, X10, Pthreads, XC, Go, HMPP, CHARM++, Co-Array Fortran, Linda, MPI, C++ AMP

13 Groupings of languages
- Partitioned Global Address Space (PGAS): Fortress, X10, Chapel, Co-array Fortran, Unified Parallel C
- CSP: XC
- Message passing: MPI
- Shared memory: OpenMP
- GPU languages: OpenCL, CUDA, HMPP
- Object oriented: C++ AMP, CHARM++
- Multi-threaded: Cilk, Go

14 2nd fastest computer in the world

15 Emerging GPGPU standards
OpenCL, DirectCompute, C++ AMP, and NVIDIA's de facto standard, CUDA

16 OpenCL

17 It's a heterogeneous world
A modern system includes:
- One or more CPUs
- One or more GPUs
- DSP processors
- ...other?
[Diagram labels: GPU, CPU, CPU, GMCH, ICH, DRAM]
OpenCL lets programmers write a single portable program that uses ALL resources in the heterogeneous platform.
GMCH = graphics memory control hub, ICH = input/output control hub

18 Industry Standards for Programming Heterogeneous Platforms
- CPUs: multiple cores driving performance increases; multiprocessor programming, e.g. OpenMP
- GPUs: increasingly general purpose data-parallel computing; graphics APIs and shading languages
- Emerging intersection: heterogeneous computing
OpenCL (Open Computing Language): an open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors

19 The origins of OpenCL
- AMD, ATI: merged, needed commonality across products
- Nvidia: GPU vendor, wants to steal market share from CPU
- Intel: CPU vendor, wants to steal market share from GPU
- Apple: was tired of recoding for many-core and GPUs; pushed vendors to standardize
- Wrote a rough draft "straw man" API
- Khronos Compute group formed: Ericsson, Nokia, IBM, Sony, IMG Tech, Freescale, TI + many more
Dec
Third party names are the property of their owners.

20 System level architecture of OpenCL
- OpenCL programs
- OpenCL dynamic libraries
- Vendor-supplied drivers
- Devices (could also be multi-core CPUs)

21 OpenCL Platform Model
- One Host + one or more Compute Devices
- Each Compute Device is composed of one or more Compute Units
- Each Compute Unit is further divided into one or more Processing Elements

22 The BIG idea behind OpenCL
Replace loops with functions (a kernel) executing at each point in a problem domain (index space). E.g., process a 1024 x 1024 image with one kernel invocation per pixel, or 1024 x 1024 = 1,048,576 kernel executions.
Traditional loops:
    void trad_mul(const int n, const float *a, const float *b, float *c)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }
Data parallel OpenCL:
    kernel void dp_mul(global const float *a, global const float *b, global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }  // execute over n work-items
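
The loop-to-kernel idea on this slide can be sketched in plain C without an OpenCL runtime. This is only an illustration under stated assumptions: `dp_mul_item` and `run_1d` are hypothetical names, the kernel body takes its global ID as a parameter instead of calling `get_global_id(0)`, and the "runtime" is a serial loop, whereas a real device is free to run work-items in parallel and in any order.

```c
#include <stddef.h>

/* Hypothetical stand-in for the dp_mul kernel body: it receives its global
 * ID as an argument instead of calling get_global_id(0). */
static void dp_mul_item(size_t id, const float *a, const float *b, float *c)
{
    c[id] = a[id] * b[id];
}

/* Hypothetical serial "runtime": invoke the kernel body once per point of a
 * one-dimensional index space of n work-items. */
static void run_1d(size_t n, const float *a, const float *b, float *c)
{
    for (size_t id = 0; id < n; id++)
        dp_mul_item(id, a, b, c);
}
```

Calling `run_1d(n, a, b, c)` produces the same result as `trad_mul` on the slide; the difference is that the loop now lives in the runtime rather than in the user's function.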

23 An N-dimensional domain of work-items
- Global dimensions: 1024 x 1024 (whole problem space)
- Local dimensions: 128 x 128 (work-group, executes together)
- Synchronization between work-items is possible only within work-groups: barriers and memory fences
- Cannot synchronize outside of a work-group
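
How a global ID relates to a work-group and a position inside it can be sketched along one dimension as follows. This is a minimal illustration, assuming the global size is an exact multiple of the local size and that there is no global offset; `item_pos`, `split_global_id` and `num_groups` are hypothetical names, not OpenCL functions.

```c
#include <stddef.h>

typedef struct {
    size_t group_id;  /* which work-group the work-item belongs to */
    size_t local_id;  /* position of the work-item inside its work-group */
} item_pos;

/* Split a global ID along one dimension, assuming the global size is an
 * exact multiple of the local (work-group) size and no global offset. */
static item_pos split_global_id(size_t global_id, size_t local_size)
{
    item_pos p;
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}

/* Number of work-groups along one dimension, under the same assumption. */
static size_t num_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}
```

With the slide's numbers, `num_groups(1024, 128)` gives 8 work-groups per dimension, so the 1024 x 1024 domain is covered by 8 x 8 = 64 work-groups, and only work-items within the same one of those 64 groups can synchronize with each other.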

24 OpenCL Memory Model
- Private memory: per work-item
- Local memory: shared within a work-group
- Global / constant memories: visible to all work-groups
- Host memory: on the CPU
Memory management is explicit: you must move data from host to global to local and back
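
The explicit host, global, local, private staging described on this slide can be imitated in plain C. The sketch below is an analogy rather than OpenCL code: ordinary arrays stand in for each memory region, `memcpy` stands in for the transfers, and `LOCAL_SIZE`, `group_sum` and `sum_groups` are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <string.h>

#define LOCAL_SIZE 4  /* illustrative work-group size */

/* Sum one work-group's tile: stage it from "global" memory into a small
 * "local" scratch array, then reduce it there. */
static float group_sum(const float *global_in, size_t group_id)
{
    float local_tile[LOCAL_SIZE];  /* stands in for local memory */
    memcpy(local_tile, global_in + group_id * LOCAL_SIZE, sizeof local_tile);

    float acc = 0.0f;              /* stands in for private memory */
    for (size_t i = 0; i < LOCAL_SIZE; i++)
        acc += local_tile[i];
    return acc;
}

/* Host-side driver: copy host -> global, do the per-group work, and write
 * one result per work-group back to a host array. */
static void sum_groups(const float *host_in, size_t n_groups,
                       float *global_buf, float *host_out)
{
    memcpy(global_buf, host_in, n_groups * LOCAL_SIZE * sizeof(float));
    for (size_t g = 0; g < n_groups; g++)
        host_out[g] = group_sum(global_buf, g);
}
```

The point of the sketch is that every hop is written out by the programmer; nothing moves between regions implicitly.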

25 OpenCL C Language
Derived from ISO C99
- A "supersubset" (Tim Mattson, Intel)
- No standard C99 headers, function pointers, recursion, variable length arrays, or bit fields
Additions to the language for parallelism
- Work-items and work-groups
- Vector types
- Synchronization
Address space qualifiers, e.g. global, local
Optimized image access
Built-in functions

26 Contexts and Queues
Contexts are used to contain and manage the state of the "world"
Contexts are defined and manipulated by the host and contain:
- Devices
- Kernels: OpenCL functions
- Program objects: kernel sources and executables
- Memory objects
Command-queue: coordinates execution of kernels
- Kernel execution commands
- Memory commands: transfer or mapping of memory object data
- Synchronization commands: constrain the order of commands
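
The role of a command-queue (memory commands and kernel commands submitted by the host, then executed in order) can be pictured with a toy model. This is not the OpenCL API: `cmd_queue`, `enqueue`, `finish` and the `CMD_*` constants are hypothetical, the "device" is a single integer, and the "kernel" simply doubles it.

```c
#include <stddef.h>

typedef enum { CMD_WRITE, CMD_KERNEL, CMD_READ } cmd_kind;

#define QUEUE_CAP 8

typedef struct {
    cmd_kind cmds[QUEUE_CAP];
    size_t   count;
} cmd_queue;

/* Append a command; returns 0 on success, -1 if the queue is full. */
static int enqueue(cmd_queue *q, cmd_kind c)
{
    if (q->count == QUEUE_CAP)
        return -1;
    q->cmds[q->count++] = c;
    return 0;
}

/* Drain the queue in submission order, applying each command to a toy
 * "device": WRITE copies host -> device, KERNEL doubles the device value,
 * READ copies device -> host. */
static void finish(cmd_queue *q, int *host, int *device)
{
    for (size_t i = 0; i < q->count; i++) {
        switch (q->cmds[i]) {
        case CMD_WRITE:  *device = *host; break;
        case CMD_KERNEL: *device *= 2;    break;
        case CMD_READ:   *host = *device; break;
        }
    }
    q->count = 0;
}
```

Enqueuing WRITE, KERNEL, READ and then calling `finish` mirrors the write-buffer, run-kernel, read-buffer sequence of a typical host program; nothing happens on the toy device until the queue is drained.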

27 OpenCL architecture
[Diagram: a context spanning a CPU and a GPU, holding programs, kernels, memory objects and command queues]
- Programs: the dp_mul source below, built into a dp_mul CPU program binary and a dp_mul GPU program binary
- Kernels: dp_mul with argument values arg[0], arg[1], arg[2]
- Memory objects: buffers, images
- Command queues: in-order queue, out-of-order queue
- Compute device: GPU
    kernel void dp_mul(global const float *a, global const float *b, global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }

28 Example: vector addition
The "hello world" program of data parallel programming is a program to add two vectors: C[i] = A[i] + B[i] for i = 0 to N-1
For the OpenCL solution, there are two parts:
- Kernel code
- Host code

29 Vector Addition - Kernel
    kernel void vec_add(global const float *a, global const float *b, global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }
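
When trying a kernel like this, it can help to keep a serial host-side reference to compare against. The following is a sketch of that idea and is not part of the original slides: `vec_add_ref` and `count_mismatches` are hypothetical helper names, and the tolerance parameter is an assumption for cases where device and host arithmetic might differ slightly.

```c
#include <stddef.h>

/* Serial reference for the vec_add kernel: c[i] = a[i] + b[i]. */
static void vec_add_ref(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Count elements of `got` (e.g. data read back from the device) that differ
 * from the serial reference by more than `tol`. */
static size_t count_mismatches(size_t n, const float *a, const float *b,
                               const float *got, float tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float diff = got[i] - (a[i] + b[i]);
        if (diff < 0.0f)
            diff = -diff;
        if (diff > tol)
            bad++;
    }
    return bad;
}
```

A host program would call `count_mismatches` on the buffer it reads back and treat any non-zero count as a failed run.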

30 Vector Addition - Host Program
    // create the OpenCL context on a GPU device
    context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
    // get the list of GPU devices associated with context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
    // create a command-queue
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);
    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srca, NULL);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcb, NULL);
    memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float)*n, NULL, NULL);
    // create the program
    program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);
    // build the program
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    // create the kernel
    kernel = clCreateKernel(program, "vec_add", NULL);
    // set the args values (argument size first, then a pointer to the value)
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    err = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);
    // set work-item dimensions
    global_work_size[0] = n;
    // execute kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
    // read output array (blocking read, issued on the command-queue)
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, n*sizeof(cl_float), dst, 0, NULL, NULL);

31 Vector Addition - Host Program (annotated)
The same host code as on the previous slide, labelled by stage:
- Define platform and queues
- Define memory objects
- Create the program
- Build the program
- Create and set up the kernel
- Execute the kernel
- Read results back to the host
It's complicated, but most of this is boilerplate and not as bad as it looks.

32 C++ to the rescue
    #include <CL/cl.hpp>
    #include <functional>
    #include <iostream>
    #include <string>
    const std::string hwstring("hello World\n");
    using namespace cl;
    int main(void)
    {
        char *outh = new char[hwstring.length()+1];
        Buffer outcl(CL_MEM_WRITE_ONLY, hwstring.length()+1);
        // loadprogram() is a helper used on the slide; its definition is not shown
        Program program(loadprogram("helloworld.cl"));
        program.build();
        std::function<Event (const EnqueueArgs&, Buffer)> hello = make_kernel<Buffer>(program, "hello");
        hello(EnqueueArgs(NDRange(hwstring.length()+1)), outcl);
        enqueueReadBuffer(outcl, CL_TRUE, 0, hwstring.length()+1, outh);
        std::cout << outh << std::endl;
        delete[] outh;
    }
C++ API published on Khronos website

33 OpenCL case study

34 Bristol GPU-based drug docking success story: BUDE
Collaborators: Richard Sessions, Amaurys Avila Ibarra

35 Supermicro GPU server

36 Relative performance
- 162 seconds per simulation
- 1,120 seconds per simulation
Less than 2 days to screen a library of 1 million drug candidates on 1000 GPUs
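
The "less than 2 days" figure can be checked with a rough calculation. It assumes that the 162 seconds per simulation is the per-GPU time (the transcript does not say which device each timing belongs to) and that the 1 million simulations divide evenly across the 1000 GPUs with no scheduling overhead.

```latex
\[
\frac{10^{6}\ \text{simulations}}{1000\ \text{GPUs}}
\times 162\ \frac{\text{s}}{\text{simulation}}
= 1.62 \times 10^{5}\ \text{s}
= 45\ \text{h}
\approx 1.9\ \text{days}
\]
```

Under those assumptions each GPU runs 1000 simulations back to back, which is consistent with the slide's claim.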

37 Relative energy efficiency
kWh per simulation
kWh per simulation
kWh = 0.16 pence per simulation
1 million simulations: £1,600 on energy for one experiment
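
The total energy cost follows directly from the per-simulation cost, reading the slide's 1,600 as pounds sterling (an assumption, since the currency symbol did not survive in the transcript, though it is the only reading consistent with 0.16 pence per simulation):

```latex
\[
10^{6}\ \text{simulations} \times 0.16\ \text{p per simulation}
= 160{,}000\ \text{p}
= \pounds 1{,}600
\]
```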

38 Summary
- Parallel languages are going through a renaissance
- Not just for the niche high end any more
- No silver bullets, lots of wheel reinventing
- In HPC, GPUs being adopted quickly at the high end
- In embedded computing, OpenCL gaining ground
- Everything going heterogeneous many-core


The resurgence of parallel programming languages

The resurgence of parallel programming languages The resurgence of parallel programming languages Jamie Hanlon & Simon McIntosh-Smith University of Bristol Microelectronics Research Group hanlon@cs.bris.ac.uk 1 The Microelectronics Research Group at

More information

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011 GPGPU COMPUTE ON AMD Udeepta Bordoloi April 6, 2011 WHY USE GPU COMPUTE CPU: scalar processing + Latency + Optimized for sequential and branching algorithms + Runs existing applications very well - Throughput

More information

OpenCL and the quest for portable performance. Tim Mattson Intel Labs

OpenCL and the quest for portable performance. Tim Mattson Intel Labs OpenCL and the quest for portable performance Tim Mattson Intel Labs Disclaimer The views expressed in this talk are those of the speaker and not his employer. I am in a research group and know very little

More information

Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President. Copyright Khronos Group, Page 1

Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President. Copyright Khronos Group, Page 1 Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President Copyright Khronos Group, 2009 - Page 1 Introduction and aims of OpenCL - Neil Trevett, NVIDIA OpenCL Specification walkthrough - Mike

More information

OpenCL The Open Standard for Heterogeneous Parallel Programming

OpenCL The Open Standard for Heterogeneous Parallel Programming OpenCL The Open Standard for Heterogeneous Parallel Programming March 2009 Copyright Khronos Group, 2009 - Page 1 Close-to-the-Silicon Standards Khronos creates Foundation-Level acceleration APIs - Needed

More information

OPENCL C++ Lee Howes AMD Senior Member of Technical Staff, Stream Computing

OPENCL C++ Lee Howes AMD Senior Member of Technical Staff, Stream Computing OPENCL C++ Lee Howes AMD Senior Member of Technical Staff, Stream Computing Benedict Gaster AMD Principle Member of Technical Staff, AMD Research (now at Qualcomm) OPENCL TODAY WHAT WORKS, WHAT DOESN T

More information

OpenCL. An Introduction for HPC programmers. Benedict Gaster, AMD Tim Mattson, Intel. - Page 1

OpenCL. An Introduction for HPC programmers. Benedict Gaster, AMD Tim Mattson, Intel. - Page 1 OpenCL An Introduction for HPC programmers Benedict Gaster, AMD Tim Mattson, Intel - Page 1 Preliminaries: Disclosures - The views expressed in this tutorial are those of the people delivering the tutorial.

More information

Programming with CUDA and OpenCL. Dana Schaa and Byunghyun Jang Northeastern University

Programming with CUDA and OpenCL. Dana Schaa and Byunghyun Jang Northeastern University Programming with CUDA and OpenCL Dana Schaa and Byunghyun Jang Northeastern University Tutorial Overview CUDA - Architecture and programming model - Strengths and limitations of the GPU - Example applications

More information

Making OpenCL Simple with Haskell. Benedict R. Gaster January, 2011

Making OpenCL Simple with Haskell. Benedict R. Gaster January, 2011 Making OpenCL Simple with Haskell Benedict R. Gaster January, 2011 Attribution and WARNING The ideas and work presented here are in collaboration with: Garrett Morris (AMD intern 2010 & PhD student Portland

More information

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group Copyright Khronos Group, 2012 - Page 1 OpenCL Overview Shanghai March 2012 Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group Copyright Khronos Group, 2012 - Page 2 Processor

More information

Introduction to OpenCL. Benedict R. Gaster October, 2010

Introduction to OpenCL. Benedict R. Gaster October, 2010 Introduction to OpenCL Benedict R. Gaster October, 2010 OpenCL With OpenCL you can Leverage CPUs and GPUs to accelerate parallel computation Get dramatic speedups for computationally intensive applications

More information

Supporting Computer Vision through High Performance GPU Programming. Dong Ping Zhang 12 Jan, 2013 AMD Research

Supporting Computer Vision through High Performance GPU Programming. Dong Ping Zhang 12 Jan, 2013 AMD Research Supporting Computer Vision through High Performance GPU Programming Dong Ping Zhang 12 Jan, 2013 AMD Research Most parallel code runs on CPUs designed for scalar workloads 2 Supporting Computer Vision

More information

Introduction to OpenCL (utilisation des transparents de Cliff Woolley, NVIDIA) intégré depuis à l initiative Kite

Introduction to OpenCL (utilisation des transparents de Cliff Woolley, NVIDIA) intégré depuis à l initiative Kite Introduction to OpenCL (utilisation des transparents de Cliff Woolley, NVIDIA) intégré depuis à l initiative Kite riveill@unice.fr http://www.i3s.unice.fr/~riveill What is OpenCL good for Anything that

More information

Masterpraktikum Scientific Computing

Masterpraktikum Scientific Computing Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany Outline Intel Cilk Plus OpenCL Übung, October 7, 2012 2 Intel Cilk

More information

OpenCL Overview Benedict R. Gaster, AMD

OpenCL Overview Benedict R. Gaster, AMD Copyright Khronos Group, 2011 - Page 1 OpenCL Overview Benedict R. Gaster, AMD March 2010 The BIG Idea behind OpenCL OpenCL execution model - Define N-dimensional computation domain - Execute a kernel

More information

Copyright Khronos Group 2012 Page 1. OpenCL 1.2. August 2012

Copyright Khronos Group 2012 Page 1. OpenCL 1.2. August 2012 Copyright Khronos Group 2012 Page 1 OpenCL 1.2 August 2012 Copyright Khronos Group 2012 Page 2 Khronos - Connecting Software to Silicon Khronos defines open, royalty-free standards to access graphics,

More information

Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President

Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President 4 th Annual Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President Copyright Khronos Group, 2009 - Page 1 CPUs Multiple cores driving performance increases Emerging Intersection GPUs Increasingly

More information

OpenCL. Dr. David Brayford, LRZ, PRACE PATC: Intel MIC & GPU Programming Workshop

OpenCL. Dr. David Brayford, LRZ, PRACE PATC: Intel MIC & GPU Programming Workshop OpenCL Dr. David Brayford, LRZ, brayford@lrz.de PRACE PATC: Intel MIC & GPU Programming Workshop 1 Open Computing Language Open, royalty-free standard C-language extension For cross-platform, parallel

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

A hands-on Introduction to OpenCL

A hands-on Introduction to OpenCL A hands-on Introduction to OpenCL Tim Mattson Acknowledgements: Alice Koniges of Berkeley Lab/NERSC and Simon McIntosh-Smith, James Price, and Tom Deakin of the University of Bristol OpenCL Learning progression

More information

The Future of Many Core Computing: Software for many core processors. Tim Mattson Intel Labs January 2010

The Future of Many Core Computing: Software for many core processors. Tim Mattson Intel Labs January 2010 The Future of Many Core Computing: Software for many core processors Tim Mattson Intel Labs January 2010 1 Disclosure The views expressed in this talk are those of the speaker and not his employer. I am

More information

Advanced OpenMP. Other threading APIs

Advanced OpenMP. Other threading APIs Advanced OpenMP Other threading APIs What s wrong with OpenMP? OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming CPU cycles. - cannot

More information

OpenCL on the GPU. San Jose, CA September 30, Neil Trevett and Cyril Zeller, NVIDIA

OpenCL on the GPU. San Jose, CA September 30, Neil Trevett and Cyril Zeller, NVIDIA OpenCL on the GPU San Jose, CA September 30, 2009 Neil Trevett and Cyril Zeller, NVIDIA Welcome to the OpenCL Tutorial! Khronos and industry perspective on OpenCL Neil Trevett Khronos Group President OpenCL

More information

OpenCL. Computation on HybriLIT Brief introduction and getting started

OpenCL. Computation on HybriLIT Brief introduction and getting started OpenCL Computation on HybriLIT Brief introduction and getting started Alexander Ayriyan Laboratory of Information Technologies Joint Institute for Nuclear Research 05.09.2014 (Friday) Tutorial in frame

More information

Heterogeneous Computing

Heterogeneous Computing OpenCL Hwansoo Han Heterogeneous Computing Multiple, but heterogeneous multicores Use all available computing resources in system [AMD APU (Fusion)] Single core CPU, multicore CPU GPUs, DSPs Parallel programming

More information

Colin Riddell GPU Compiler Developer Codeplay Visit us at

Colin Riddell GPU Compiler Developer Codeplay Visit us at OpenCL Colin Riddell GPU Compiler Developer Codeplay Visit us at www.codeplay.com 2 nd Floor 45 York Place Edinburgh EH1 3HP United Kingdom Codeplay Overview of OpenCL Codeplay + OpenCL Our technology

More information

CS/EE 217 GPU Architecture and Parallel Programming. Lecture 22: Introduction to OpenCL

CS/EE 217 GPU Architecture and Parallel Programming. Lecture 22: Introduction to OpenCL CS/EE 217 GPU Architecture and Parallel Programming Lecture 22: Introduction to OpenCL Objective To Understand the OpenCL programming model basic concepts and data types OpenCL application programming

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Design Patterns and the Quest for General Purpose Parallel Programming. Tim Mattson Intel Labs

Design Patterns and the Quest for General Purpose Parallel Programming. Tim Mattson Intel Labs Design Patterns and the Quest for General Purpose Parallel Programming g Tim Mattson Intel Labs timothy.g.mattson@intel.com 1 Disclosure The views expressed in this talk are those of the speaker and not

More information

Other Parallel Programming Environments

Other Parallel Programming Environments Other Parallel Programming Environments Tim Mattson Intel Corp. timothy.g.mattson@ intel.com 1 1 The Big Three In HPC, three programming environments dominate covering the major classes of hardware. OpenMP:

More information

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different

More information

OpenCL Press Conference

OpenCL Press Conference Copyright Khronos Group, 2011 - Page 1 OpenCL Press Conference Tokyo, November 2011 Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group Copyright Khronos Group, 2011 - Page

More information

Copyright Khronos Group, Page 1. OpenCL Overview. February 2010

Copyright Khronos Group, Page 1. OpenCL Overview. February 2010 Copyright Khronos Group, 2011 - Page 1 OpenCL Overview February 2010 Copyright Khronos Group, 2011 - Page 2 Khronos Vision Billions of devices increasing graphics, compute, video, imaging and audio capabilities

More information

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels

More information

OpenACC (Open Accelerators - Introduced in 2012)

OpenACC (Open Accelerators - Introduced in 2012) OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in

More information

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast

More information

OpenCL Introduction. Acknowledgements. Frédéric Desprez. Simon Mc Intosh-Smith (Univ. of Bristol) Tom Deakin (Univ. of Bristol)

OpenCL Introduction. Acknowledgements. Frédéric Desprez. Simon Mc Intosh-Smith (Univ. of Bristol) Tom Deakin (Univ. of Bristol) OpenCL Introduction Frédéric Desprez INRIA Grenoble Rhône-Alpes/LIG Corse team Simulation par ordinateur des ondes gravitationnelles produites lors de la fusion de deux trous noirs. Werner Benger, CC BY-SA

More information

The Role of Standards in Heterogeneous Programming

The Role of Standards in Heterogeneous Programming The Role of Standards in Heterogeneous Programming Multi-core Challenge Bristol UWE 45 York Place, Edinburgh EH1 3HP June 12th, 2013 Codeplay Software Ltd. Incorporated in 1999 Based in Edinburgh, Scotland

More information

OpenCL in Action. Ofer Rosenberg

OpenCL in Action. Ofer Rosenberg pencl in Action fer Rosenberg Working with pencl API pencl Boot Platform Devices Context Queue Platform Query int GetPlatform (cl_platform_id &platform, char* requestedplatformname) { cl_uint numplatforms;

More information

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example Optimizing Previous Example Alternative Architectures 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator for desktop

More information

OpenCL: History & Future. November 20, 2017

OpenCL: History & Future. November 20, 2017 Mitglied der Helmholtz-Gemeinschaft OpenCL: History & Future November 20, 2017 OpenCL Portable Heterogeneous Computing 2 APIs and 2 kernel languages C Platform Layer API OpenCL C and C++ kernel language

More information

Exotic Methods in Parallel Computing [GPU Computing]

Exotic Methods in Parallel Computing [GPU Computing] Exotic Methods in Parallel Computing [GPU Computing] Frank Feinbube Exotic Methods in Parallel Computing Dr. Peter Tröger Exotic Methods in Parallel Computing FF 2012 Architectural Shift 2 Exotic Methods

More information

Parallel and Distributed Computing

Parallel and Distributed Computing Parallel and Distributed Computing NUMA; OpenCL; MapReduce José Monteiro MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer Science and Engineering

More information

GPU Architecture and Programming with OpenCL. OpenCL. GPU Architecture: Why? Today s s Topic. GPUs: : Architectures for Drawing Triangles Fast

GPU Architecture and Programming with OpenCL. OpenCL. GPU Architecture: Why? Today s s Topic. GPUs: : Architectures for Drawing Triangles Fast Today s s Topic GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 GPU architecture What and why The good The bad Compute Models

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 6 April 2017 HW#8 will be posted Announcements HW#7 Power outage Pi Cluster Runaway jobs (tried

More information

Parallel Programming Concepts. GPU Computing with OpenCL

Parallel Programming Concepts. GPU Computing with OpenCL Parallel Programming Concepts GPU Computing with OpenCL Frank Feinbube Operating Systems and Middleware Prof. Dr. Andreas Polze Agenda / Quicklinks 2 Recapitulation Motivation History of GPU Computing

More information

GPU Architecture and Programming with OpenCL

GPU Architecture and Programming with OpenCL GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 Today s s Topic GPU architecture What and why The good The bad Compute Models

More information

Threaded Programming. Lecture 9: Alternatives to OpenMP

Threaded Programming. Lecture 9: Alternatives to OpenMP Threaded Programming Lecture 9: Alternatives to OpenMP What s wrong with OpenMP? OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming

More information

Data Parallelism. CSCI 5828: Foundations of Software Engineering Lecture 28 12/01/2016

Data Parallelism. CSCI 5828: Foundations of Software Engineering Lecture 28 12/01/2016 Data Parallelism CSCI 5828: Foundations of Software Engineering Lecture 28 12/01/2016 1 Goals Cover the material in Chapter 7 of Seven Concurrency Models in Seven Weeks by Paul Butcher Data Parallelism

More information

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Architecture Parallel computing for heterogenous devices CPUs, GPUs, other processors (Cell, DSPs, etc) Portable accelerated code Defined

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

PARLab Parallel Boot Camp

PARLab Parallel Boot Camp PARLab Parallel Boot Camp Introduction to OpenCL Tim Mattson Microprocessor and Programming Research Lab Intel Corp. Heterogeneous computing A modern platform has: Multi-core CPU(s) A GPU DSP processors

More information

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige How GPUs can find your next hit: Accelerating virtual screening with OpenCL Simon Krige ACS 2013 Agenda > Background > About blazev10 > What is a GPU? > Heterogeneous computing > OpenCL: a framework for

More information

INTRODUCING OPENCL TM

INTRODUCING OPENCL TM INTRODUCING OPENCL TM The open standard for parallel programming across heterogeneous processors 1 PPAM 2011 Tutorial IT S A MEANY-CORE HETEROGENEOUS WORLD Multi-core, heterogeneous computing The new normal

More information

Getting Started with Intel SDK for OpenCL Applications

Getting Started with Intel SDK for OpenCL Applications Getting Started with Intel SDK for OpenCL Applications Webinar #1 in the Three-part OpenCL Webinar Series July 11, 2012 Register Now for All Webinars in the Series Welcome to Getting Started with Intel

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example (and its Optimization) Alternative Frameworks Most Recent Innovations 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant


OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

Course Overview. This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming


Copyright Khronos Group, Page 1 SYCL. SG14, February 2016

Copyright Khronos Group, 2014. SYCL, SG14, February 2016. BOARD OF PROMOTERS: Over 100 members worldwide; any company is welcome to join. SYCL: 1. What is SYCL for and what


Parallelization using the GPU

Scientific Computing, Winter 2016/2017, Lecture 29. Jürgen Fuhrmann, juergen.fuhrmann@wias-berlin.de. Made with pandoc. Recap. MIMD Hardware: Distributed memory


GPUs and Emerging Architectures

Mike Giles, mike.giles@maths.ox.ac.uk. Mathematical Institute, Oxford University; e-Infrastructure South Consortium; Oxford e-Research Centre. CPUs


Accelerating molecular docking on multi- and manycore computer architectures

Simon McIntosh-Smith, University of Bristol, UK, simonm@cs.bris.ac.uk. Power-limited regimes: Processor power consumption now


Copyright Khronos Group, Page 1. OpenCL. GDC, March 2010

Copyright Khronos Group, 2011. OpenCL, GDC, March 2010. Authoring and accessibility; Application Acceleration; System Integration. Khronos Family of Standards


Copyright Khronos Group Page 1. OpenCL BOF SIGGRAPH 2013

Copyright Khronos Group 2013. OpenCL BOF, SIGGRAPH 2013. OpenCL Roadmap. OpenCL-HLM (High Level Model): High-level programming model, unifying host and device


GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

sylvain.collange@inria.fr. This lecture: CUDA programming. We have seen some GPU architecture. Now how to program it? Outline


CS 677: Parallel Programming for Many-core Processors Lecture 12

Instructor: Philippos Mordohai. Webpage: www.cs.stevens.edu/~mordohai. E-mail: Philippos.Mordohai@stevens.edu. CS Department Project Poster


GPU programming. Dr. Bernhard Kainz

Overview: About myself; Motivation; GPU hardware and system architecture; GPU programming languages; GPU programming paradigms; Pitfalls and best practice; Reduction and tiling


COMP 322: Fundamentals of Parallel Programming. Flynn's Taxonomy for Parallel Computers

Lecture 37: General-Purpose GPU (GPGPU) Computing. Max Grossman, Vivek Sarkar, Department of Computer Science, Rice University. max.grossman@rice.edu, vsarkar@rice.edu


GPGPU A Supercomputer on Your Desktop

GPUs. Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering. They became very very popular with videogamers,


Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Partition given graph G=(V,E) in k subgraphs of nearly equal


Lecture Topic: An Overview of OpenCL on Xeon Phi

C-DAC Four Days Technology Workshop on Hybrid Computing, Coprocessors/Accelerators, Power-Aware Computing, Performance of Applications Kernels. hypack-2013 (Mode-4: GPUs). Venue


An Alternative to GPU Acceleration For Mobile Platforms

Inventing the Future of Computing. Andreas Olofsson, andreas@adapteva.com. 50th DAC, June 5th, Austin, TX. Adapteva Achieves 3 World Firsts: 1. First


TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

Eric Kelmelis, 28 March 2018. OVERVIEW. BACKGROUND: Evolution of processing hardware. CROSS-PLATFORM KERNEL DEVELOPMENT: Write once, target multiple hardware


Evaluating OpenMP's Effectiveness in the Many-Core Era

Prof Simon McIntosh-Smith, HPC Research Group, simonm@cs.bris.ac.uk. Bristol, UK: 10th largest city in UK; aero, finance, chip design; HQ for Cray EMEA


Parallel Hybrid Computing Stéphane Bihan, CAPS

Introduction. Main stream applications will rely on new multicore / manycore architectures. It is about performance not parallelism. Various heterogeneous hardware


Hello World: Vector Addition

// Compute sum of length N vectors: C = A + B
void vecadd (float* a, float* b, float* c, int N) {
  for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
}
int main () { int N =...


AMath 483/583, Lecture 24, May 20. Notes: What's a GPU? Notes: Some GPU application areas

AMath 483/583, Lecture 24, May 20, 2011. Today: The Graphical Processing Unit (GPU); GPU Programming. Today's lecture developed and presented by Grady Lemoine. References: Andreas Kloeckner's High Performance


Introduction to Multicore Programming

Minsoo Ryu, Department of Computer Science and Engineering. 1. Multithreaded Programming; 2. Synchronization; 3. Automatic Parallelization and OpenMP; 4. GPGPU; 5. Q&A. Multithreaded


HPC future trends from a science perspective

Simon McIntosh-Smith, University of Bristol HPC Research Group, simonm@cs.bris.ac.uk. Business as usual? We've all got used to new machines being relatively


It's the end of the world as we know it

Simon McIntosh-Smith, University of Bristol HPC Research Group, simonm@cs.bris.ac.uk. Background: Graduated as Valedictorian in Computer Science from Cardiff University


Performing Reductions in OpenCL

Mike Bailey, mjb@cs.oregonstate.edu. opencl.reduction.pptx. Recall the OpenCL Model [diagram: Kernel; Global, Constant, and Local memory; Work-Items]. Here's the


Parallel Programming Recipes

San Jose State University, SJSU ScholarWorks. Master's Projects, Master's Theses and Graduate Research, 2010. Thuy C. Nguyenphuc, San Jose State University. Follow this and additional


HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

Haibo Xie, Ph.D., Chief HSA Evangelist, AMD China. OUTLINE: The Challenges with Computing Today; Introducing Heterogeneous System Architecture (HSA)


Programming Models for Multi-Threading. Brian Marshall, Advanced Research Computing

Why Do Parallel Computing? Limits of single CPU computing: performance, available memory, I/O rates. Parallel computing allows


GPU Programming with Ateji PX. June 8th. Ateji. All rights reserved.

June 8th, 2010. Goals: Write once, run everywhere, even on a GPU. Target heterogeneous architectures from Java. GPU accelerators. OpenCL standard. Get


Accelerating sequential computer vision algorithms using commodity parallel hardware

Platform Parallel Netherlands, GPGPU-day, 28 June 2012. Jaap van de Loosdrecht, NHL Centre of Expertise in Computer Vision


CS 677: Parallel Programming for Many-core Processors Lecture 11

Instructor: Philippos Mordohai. Webpage: www.cs.stevens.edu/~mordohai. E-mail: Philippos.Mordohai@stevens.edu. Project Status Update Due


Instructions for setting up OpenCL Lab

This document describes the procedure to set up OpenCL Lab for Linux and Windows machines. Specifically, if you have a limited number of graphics cards and a number of users


Introduction to Multicore Programming

Minsoo Ryu, Department of Computer Science and Engineering. 1. Multithreaded Programming; 2. Automatic Parallelization and OpenMP; 3. GPGPU. Multithreaded Programming


Parallel Programming. Libraries and implementations

Reusing this material: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us


APARAPI: Java platform's Write Once Run Anywhere now includes the GPU. Gary Frost, AMD PMTS, Java Runtime Team

AGENDA: The age of heterogeneous computing is here. The supercomputer in your desktop/laptop. Why


Trends and Challenges in Multicore Programming

Eva Burrows, Bergen Language Design Laboratory (BLDL), Department of Informatics, University of Bergen. Bergen, March 17, 2010. Outline: The Roadmap of Multicores


PATC Training. OmpSs and GPUs support. Xavier Martorell Programming Models / Computer Science Dept. BSC

May 23rd-24th, 2012. Outline: Motivation; OmpSs; Examples; BlackScholes; Perlin noise; Julia Set; Hands-on


Threads. Threads (continued)

A thread is an alternative model of program execution. A process creates a thread through a system call. Thread operates within process context. Use of threads effectively splits the process state


Parallel Programming Environments. Presented By: Anand Saoji Yogesh Patel

Outline: Introduction; How?; Parallel Architectures; Parallel Programming Models; Conclusion; References. Introduction: Recent advancements


INTRODUCTION TO OPENCL. HAIBO XIE, PH.D.

haibo.xie@amd.com. AGENDA: What's OpenCL; Fundamentals for OpenCL programming; OpenCL programming basics; OpenCL programming tools; Examples & demos. WHAT IS OPENCL: Open


Lecture 11: GPU programming

David Bindel, 4 Oct 2011. Logistics: Matrix multiply results are ready. Summary on assignments page. My version (and writeup) on CMS. HW 2 due Thursday. Still working on project 2!


Introduction to OpenCL (Introduction à OpenCL)

UDS/IRMA, GPU Day (Journée GPU), Strasbourg, February 2010. Contents: 1. OpenCL. GPU architecture: A modern Graphics Processing Unit (GPU) is made of: global memory (typically 1 Gb); compute units (typically 27)
