COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE

1 COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE VANYA YANEVA Ajitha Rajan, Christophe Dubach ISSTA July 2017 Santa Barbara, CA

2 EMBEDDED SOFTWARE IS EVERYWHERE
ITS SAFETY AND CORRECTNESS ARE CRUCIAL
FUNCTIONAL TESTING IS CRITICAL

5 FUNCTIONAL TESTING CAN BE EXTREMELY TIME CONSUMING
[Diagram: test suite (test cases 1 to n), application, expected results 1 to n]
TESTING IS AN IDEAL CANDIDATE FOR PARALLELISATION

7 CPU SERVERS
- Expensive
- Do not scale easily as test suites grow
- Can be extremely underutilised
GPUS
- Cheap and widely available
- Large-scale parallelism, thousands of threads
- SIMD architecture suited to functional testing

9 EXECUTE TESTS IN PARALLEL ON THE GPU THREADS
[Diagram: test suite (test cases 1 to n) with expected results 1 to n]
1. Read test cases: INPUT[] = {test case 1, ..., test case n}
2. Transfer INPUT[] to GPU memory
3. Build and launch tested program on the GPU threads (th_id 0 to n-1): OUTPUT[th_id] = program( INPUT[th_id] )
4. Transfer OUTPUT[] to CPU memory
CHALLENGES
- Usability
- Scope
- Performance?
A. Rajan, S. Sharma, P. Schrammel, D. Kroening. Accelerated test execution using GPUs. In Proceedings of ASE 2014, Sweden, Nov 2014.
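The one-test-per-thread mapping on this slide can be illustrated in plain C. This is only a sequential sketch: the loop stands in for the GPU launch, and `program` and `run_tests` are hypothetical names introduced for illustration, not part of the authors' tooling.

```c
#include <stddef.h>

/* Hypothetical stand-in for the tested program: one int in, one int out. */
static int program(int input) {
    return input * 2 + 1;
}

/* Sequential emulation of the launch step: each loop iteration plays the
 * role of one GPU thread, identified by th_id, and handles exactly one
 * test case. On a real GPU the iterations would run in parallel. */
static void run_tests(const int *input, int *output, size_t n) {
    for (size_t th_id = 0; th_id < n; ++th_id) {
        output[th_id] = program(input[th_id]);
    }
}
```

Because each iteration reads only `input[th_id]` and writes only `output[th_id]`, no test case depends on another, which is what makes the workload a natural fit for parallel threads.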

10 INTRODUCING PARTECL
Inputs: test cases (CSV format), unmodified source files, config file
Pipeline: ParTeCL CodeGen -> OpenCL -> ParTeCL Runtime -> execution on the GPU

11 INPUTS
Example:
    #include <stdio.h>
    #include <stdlib.h>

    int c;

    int addc(int a, int b){
        return a + b + c;
    }

    int main(int argc, char* argv[]){
        int a = atoi(argv[1]);
        int b = atoi(argv[2]);
        c = 3;
        int sum = addc(a, b);
        printf("%d + %d + %d = %d\n", a, b, c, sum);
    }
Configuration:
    input: int a 1
    input: int b 2
    result: int sum variable: sum
Test cases: (CSV format)

12 PARTECL CODEGEN
Example:
    #include <stdio.h>
    #include <stdlib.h>

    int c;

    int addc(int a, int b){
        return a + b + c;
    }

    int main(int argc, char* argv[]){
        int a = atoi(argv[1]);
        int b = atoi(argv[2]);
        c = 3;
        int sum = addc(a, b);
        printf("%d + %d + %d = %d\n", a, b, c, sum);
    }
OpenCL:
    #include "structs.h"
    //#include <stdio.h>
    //#include <stdlib.h>

    /*int c;*/

    int addc(int a, int b, int *c){
        return a + b + (*c);
    }

    kernel void main_kernel(
        global struct test_input* inputs,
        global struct test_result* results){
        int idx = get_global_id(0);
        struct test_input input_gen = inputs[idx];
        global struct test_result *result_gen = &results[idx];
        int argc = input_gen.argc;
        result_gen->test_case_num = input_gen.test_case_num;
        int c;
        int a = input_gen.a;
        int b = input_gen.b;
        c = 3;
        int sum = addc(a, b, &c);
        /*printf("%d + %d + %d = %d\n", a, b, c, sum);*/
        result_gen->sum = sum;
    }
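The generated kernel includes "structs.h", which the slide does not show. A plausible layout can be sketched from the fields the kernel reads and writes. The field names below come from the kernel on this slide. Their order and types, and the helper `main_kernel_emulated`, are assumptions made for illustration. The helper replays the kernel body for one index in plain C so the transformation can be checked without a GPU.

```c
/* Assumed contents of structs.h: field names are taken from the kernel
 * on the slide; field order and types are guesses. */
struct test_input {
    int test_case_num;
    int argc;
    int a;
    int b;
};

struct test_result {
    int test_case_num;
    int sum;
};

/* Transformed addc: the former global c is passed in by pointer. */
static int addc(int a, int b, int *c) {
    return a + b + (*c);
}

/* Hypothetical plain-C replay of main_kernel for a single index idx.
 * On the GPU, idx would come from get_global_id(0). */
static void main_kernel_emulated(const struct test_input *inputs,
                                 struct test_result *results,
                                 int idx) {
    struct test_input input_gen = inputs[idx];
    struct test_result *result_gen = &results[idx];
    result_gen->test_case_num = input_gen.test_case_num;
    int c;                      /* former global, now local to the kernel */
    int a = input_gen.a;        /* replaces atoi(argv[1]) */
    int b = input_gen.b;        /* replaces atoi(argv[2]) */
    c = 3;
    int sum = addc(a, b, &c);
    result_gen->sum = sum;      /* replaces the printf of the result */
}
```

Making `c` local and passing it by pointer gives every thread its own copy of what used to be shared global state, so test cases running side by side cannot interfere with one another.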

13 CODE TRANSFORMATIONS
- global scope variables
- command line arguments
- standard in/out
- standard library (partial support): clclibc

14 PARTECL RUNTIME
1. Read test cases: INPUT[] = {test case 1, ..., test case n}
2. Transfer INPUT[] to GPU memory
3. Build and launch tested program (automatically generated OpenCL) on the GPU threads (th_id 0 to n-1): OUTPUT[th_id] = program( INPUT[th_id] )
4. Transfer OUTPUT[] to CPU memory
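The "read test cases" step could look roughly like the sketch below. The slides say only that test cases are in CSV format, so the column layout used here (test case number, a, b) is an assumption, and `parse_test_case` and `read_test_cases` are hypothetical helper names rather than functions of the ParTeCL runtime.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified input record for the addc example; layout is assumed. */
struct test_input {
    int test_case_num;
    int a;
    int b;
};

/* Parse one CSV line of the assumed form "test_case_num,a,b".
 * Returns 0 on success and -1 if the line does not match. */
static int parse_test_case(const char *line, struct test_input *out) {
    int fields = sscanf(line, "%d,%d,%d",
                        &out->test_case_num, &out->a, &out->b);
    return fields == 3 ? 0 : -1;
}

/* Fill input[] from n text lines, skipping malformed ones.
 * Returns the number of test cases stored. */
static size_t read_test_cases(const char *const *lines, size_t n,
                              struct test_input *input) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (parse_test_case(lines[i], &input[count]) == 0) {
            ++count;
        }
    }
    return count;
}
```

Packing all test cases into one contiguous array of structs means the whole suite can be copied to the device in a single transfer.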

15 CHALLENGES
- Usability
- Scope
- Performance?
Inputs: test cases (CSV format), unmodified source files, config file
Pipeline: ParTeCL CodeGen -> OpenCL -> ParTeCL Runtime -> execution on the GPU

16 EVALUATION
1. Speedup against CPU
2. Data transfer overhead
3. Comparison to a multi-core CPU
4. Correctness

17 EXPERIMENT
Subjects: EEMBC, an industry-standard benchmark suite for embedded software
Hardware: GPU: NVIDIA Tesla K40m; CPU: Intel Xeon, 8 cores
Test suite size: 130K

18 SPEEDUP AGAINST CPU

19 DATA TRANSFER OVERHEAD
[Figure: nine panels, one per benchmark (viterb00, fbital00, a2time01, autcor00, tblook01, conven00, fft00, puwmod01, rspeed01). Each panel plots execution time [ms] against number of tests (log base 2 scale), broken down into input transfer, output transfer and kernel execution.]

20 DATA TRANSFER OVERHEAD

21 COMPARISON TO A MULTI-CORE CPU

22 CHALLENGES Usability Scope Performance

23 CORRECTNESS For all 9 benchmarks, testing results from the GPU are an exact match to the testing results from the CPU.
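An exact-match check of this kind can be sketched as below. The `test_result` layout follows the fields written by the generated kernel on slide 12, while `count_mismatches` is a hypothetical helper, not the authors' comparison code.

```c
#include <stddef.h>

/* Result record; fields follow those written by the generated kernel. */
struct test_result {
    int test_case_num;
    int sum;
};

/* Compare CPU and GPU results test case by test case.
 * Returns how many entries differ; 0 means an exact match. */
static size_t count_mismatches(const struct test_result *cpu,
                               const struct test_result *gpu,
                               size_t n) {
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (cpu[i].test_case_num != gpu[i].test_case_num ||
            cpu[i].sum != gpu[i].sum) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Comparing field by field rather than with a byte-wise comparison avoids false mismatches caused by padding bytes in larger result structs.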

25 SUMMARY
- Automatic GPU code generation
- Automatic test execution on the GPU threads
- Speedup of up to 53x (avg 16x) on EEMBC benchmarks
- Correct testing results
FUTURE WORK
- Extend evaluation & scope
- Analyse & improve performance

26 THANKS
ParTeCL CodeGen: github.com/wyaneva/partecl-codegen
ParTeCL Runtime: github.com/wyaneva/partecl-runtime
clclibc: github.com/wyaneva/clclibc

28 C FEATURES
Out of the box: pure functions, function calls, double precision (for OpenCL 1.2)
With transformations: standard in/out, global scope variables, standard library calls (partial support)
Unsupported (yet): dynamic memory allocation, file I/O, recursion


More information

Lecture Topic: An Overview of OpenCL on Xeon Phi

Lecture Topic: An Overview of OpenCL on Xeon Phi C-DAC Four Days Technology Workshop ON Hybrid Computing Coprocessors/Accelerators Power-Aware Computing Performance of Applications Kernels hypack-2013 (Mode-4 : GPUs) Lecture Topic: on Xeon Phi Venue

More information

CS16 Midterm Exam 2 E02, 10W, Phill Conrad, UC Santa Barbara Tuesday, 03/02/2010

CS16 Midterm Exam 2 E02, 10W, Phill Conrad, UC Santa Barbara Tuesday, 03/02/2010 CS16 Midterm Exam 2 E02, 10W, Phill Conrad, UC Santa Barbara Tuesday, 03/02/2010 Name: Umail Address: @ umail.ucsb.edu Circle Lab section: 3PM 4PM 5PM Link to Printer Friendly PDF Version Please write

More information

PRACE Autumn School Basic Programming Models

PRACE Autumn School Basic Programming Models PRACE Autumn School 2010 Basic Programming Models Basic Programming Models - Outline Introduction Key concepts Architectures Programming models Programming languages Compilers Operating system & libraries

More information

The following program computes a Calculus value, the "trapezoidal approximation of

The following program computes a Calculus value, the trapezoidal approximation of Multicore machines and shared memory Multicore CPUs have more than one core processor that can execute instructions at the same time. The cores share main memory. In the next few activities, we will learn

More information

TR On Using Multiple CPU Threads to Manage Multiple GPUs under CUDA

TR On Using Multiple CPU Threads to Manage Multiple GPUs under CUDA TR-2008-04 On Using Multiple CPU Threads to Manage Multiple GPUs under CUDA Hammad Mazhar Simulation Based Engineering Lab University of Wisconsin Madison August 1, 2008 Abstract Presented here is a short

More information

An introduction to Halide. Jonathan Ragan-Kelley (Stanford) Andrew Adams (Google) Dillon Sharlet (Google)

An introduction to Halide. Jonathan Ragan-Kelley (Stanford) Andrew Adams (Google) Dillon Sharlet (Google) An introduction to Halide Jonathan Ragan-Kelley (Stanford) Andrew Adams (Google) Dillon Sharlet (Google) Today s agenda Now: the big ideas in Halide Later: writing & optimizing real code Hello world (brightness)

More information

Ruud van der Pas. Senior Principal So2ware Engineer SPARC Microelectronics. Santa Clara, CA, USA

Ruud van der Pas. Senior Principal So2ware Engineer SPARC Microelectronics. Santa Clara, CA, USA Senior Principal So2ware Engineer SPARC Microelectronics Santa Clara, CA, USA SC 13 Talk at OpenMP Booth Wednesday, November 20, 2013 1 What Was Missing? 2 Before OpenMP 3.0 n Constructs worked well for

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Colin Riddell GPU Compiler Developer Codeplay Visit us at

Colin Riddell GPU Compiler Developer Codeplay Visit us at OpenCL Colin Riddell GPU Compiler Developer Codeplay Visit us at www.codeplay.com 2 nd Floor 45 York Place Edinburgh EH1 3HP United Kingdom Codeplay Overview of OpenCL Codeplay + OpenCL Our technology

More information

INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies

INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC Jeff Larkin, NVIDIA Developer Technologies AGENDA Accelerated Computing Basics What are Compiler Directives? Accelerating Applications with OpenACC Identifying

More information

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Memory spaces and memory access Shared memory Examples Lecture questions: 1. Suggest two significant

More information

OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances

OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances OpenCL implementation of PSO: a comparison between multi-core CPU and GPU performances Stefano Cagnoni 1, Alessandro Bacchini 1,2, Luca Mussi 1 1 Dept. of Information Engineering, University of Parma,

More information

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Håkan Sundell School of Business and Informatics University of Borås, 50 90 Borås E-mail: Hakan.Sundell@hb.se Philippas

More information

INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC

INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES 3 APPROACHES TO GPU PROGRAMMING Applications Libraries Compiler Directives

More information

REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS

REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS BeBeC-2014-08 REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS Steffen Schmidt GFaI ev Volmerstraße 3, 12489, Berlin, Germany ABSTRACT Beamforming algorithms make high demands on the

More information

Heterogeneous Computing and OpenCL

Heterogeneous Computing and OpenCL Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi

More information

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015 OUTLINE PART I - Tue 20.10 10-12 What is GPU computing? What is CUDA? Running GPU jobs on Triton PART II - Thu 22.10 10-12 Using libraries Different

More information

Threaded Programming. Lecture 9: Alternatives to OpenMP

Threaded Programming. Lecture 9: Alternatives to OpenMP Threaded Programming Lecture 9: Alternatives to OpenMP What s wrong with OpenMP? OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming

More information

SPOC : GPGPU programming through Stream Processing with OCaml

SPOC : GPGPU programming through Stream Processing with OCaml SPOC : GPGPU programming through Stream Processing with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte January 23rd, 2012 GPGPU Programming Two main frameworks Cuda OpenCL Different Languages

More information

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging Saoni Mukherjee, Nicholas Moore, James Brock and Miriam Leeser September 12, 2012 1 Outline Introduction to CT Scan, 3D reconstruction

More information

Supporting Class / C++ Lecture Notes

Supporting Class / C++ Lecture Notes Goal Supporting Class / C++ Lecture Notes You started with an understanding of how to write Java programs. This course is about explaining the path from Java to executing programs. We proceeded in a mostly

More information

CS 33. Architecture and the OS. CS33 Intro to Computer Systems XIX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 33. Architecture and the OS. CS33 Intro to Computer Systems XIX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 33 Architecture and the OS CS33 Intro to Computer Systems XIX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. The Operating System My Program Mary s Program Bob s Program OS CS33 Intro to

More information