Easy to adapt C code to kernel code


The language of OpenCL kernels

- A simplified version of C
- No recursion
- No pointers to functions
- Kernels have no return values
- Easy to adapt C code to kernel code

Tal Ben-Nun, HUJI. All rights reserved.

The following kernel performs vector addition:

    kernel void VecAdd(const global float *veca, const global float *vecb,
                       global float *result)
    {
        int id = get_global_id(0);
        result[id] = veca[id] + vecb[id];
    }
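To make the execution model concrete, the following plain C sketch emulates on the host what happens when VecAdd is launched over a one-dimensional range: the kernel body runs once per work-item, each with its own global ID. The names `vec_add_work_item` and `launch_vec_add` are hypothetical and not part of OpenCL, and the sequential loop is an assumption standing in for parallel execution on a device.

```c
#include <stddef.h>

/* Body of the VecAdd kernel for a single work-item. The parameter `id`
   plays the role of get_global_id(0). Hypothetical helper, not OpenCL. */
static void vec_add_work_item(size_t id, const float *veca,
                              const float *vecb, float *result)
{
    result[id] = veca[id] + vecb[id];
}

/* Emulates a 1D launch with `global_size` work-items. A real device may
   run these iterations in parallel and in any order. */
static void launch_vec_add(size_t global_size, const float *veca,
                           const float *vecb, float *result)
{
    for (size_t id = 0; id < global_size; id++)
        vec_add_work_item(id, veca, vecb, result);
}
```

Calling `launch_vec_add(n, a, b, r)` fills `r` with the element-wise sums of `a` and `b`. Because no iteration depends on another, the loop order does not matter, which is what allows the device to run the work-items concurrently.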

Work-items are distinguished by their IDs:

- Identified globally using the global ID
- Identified within a work-group using the local ID

Use the built-in functions:

- get_global_id(uint dim): returns the global ID
- get_local_id(uint dim): returns the local ID
- get_group_id(uint dim): returns the work-group's ID

dim specifies the requested dimension.

Other helpful functions:

- get_work_dim(): number of dimensions
- get_global_size(uint dim): total number of work-items
- get_local_size(uint dim): size of a work-group
- get_num_groups(uint dim): number of work-groups
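The relationship between these IDs can be sketched in plain C. This is an illustration only: `work_item_ids`, `ids_for` and `num_groups` are hypothetical names, and the arithmetic assumes one dimension, a zero global offset, and a global size that is a multiple of the work-group size.

```c
#include <stddef.h>

/* Hypothetical record of the IDs one work-item would observe in 1D. */
typedef struct {
    size_t global_id;  /* what get_global_id(0) would return */
    size_t local_id;   /* what get_local_id(0) would return  */
    size_t group_id;   /* what get_group_id(0) would return  */
} work_item_ids;

/* Derives the local and group IDs from a global ID, assuming a zero
   global offset and work-groups of exactly `local_size` items. */
static work_item_ids ids_for(size_t global_id, size_t local_size)
{
    work_item_ids ids;
    ids.global_id = global_id;
    ids.local_id  = global_id % local_size;
    ids.group_id  = global_id / local_size;
    return ids;
}

/* What get_num_groups(0) would return under the same assumptions. */
static size_t num_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}
```

Under these assumptions the identity global ID = group ID * local size + local ID always holds, so a kernel can use whichever ID suits the access pattern.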

All variables have annotations that identify their address space:

- global: global compute device memory
- constant: specialized constant global memory
- local: shared work-group memory
- private (default): work-item memory

read_only/write_only are access qualifiers, used for images only.

OpenCL C provides two kinds of data types: scalars and vectors.

- Scalar data types operate just like in C
  - Examples: char, int, float, half (16-bit FP)
  - Unsigned counterparts: uchar, uint
- Vector data types are new and use the vectorization capabilities of compute devices

Vectors are defined as typeN:

- type is one of the scalar data types
- N is one of the following: 2, 3, 4, 8, 16
- Example: float4 vf = (float4)(1.0f, 6.5f, 0.9f, -1.0f);
- Using vectors, one operation replaces N operations

Miscellaneous data types include image2d_t, image3d_t, sampler_t and event_t.

Using vector data types:

    kernel void VectorizedCopy(const global float16 *src, global float16 *dest)
    {
        // Kernel dimension is (length / 16)
        int id = get_global_id(0);

        // Note: Operates on 16 values concurrently
        dest[id] = src[id];
    }


This can also be written for a regular array:

    kernel void ComplicatedCopy(const global float *src, global float *dest)
    {
        // Kernel dimension is still (length / 16)
        int id = get_global_id(0);

        // The following automatically loads the correct position
        float16 val = vload16(id, src);
        vstore16(val, id, dest);
    }
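The "correct position" comes from the offset being counted in whole vectors: vload16(id, src) reads the 16 floats starting at src + id * 16, and vstore16 writes the same way. The plain C sketch below emulates that addressing on the host. The `_emu` names and the `float16_emu` struct are hypothetical stand-ins, not OpenCL types, and the loop replaces the (length / 16) work-items.

```c
#include <stddef.h>
#include <string.h>

#define LANES 16

/* Hypothetical stand-in for the OpenCL float16 vector type. */
typedef struct { float s[LANES]; } float16_emu;

/* Emulates vload16(offset, p): reads 16 floats starting at p + offset * 16. */
static float16_emu vload16_emu(size_t offset, const float *p)
{
    float16_emu v;
    memcpy(v.s, p + offset * LANES, sizeof v.s);
    return v;
}

/* Emulates vstore16(data, offset, p): writes 16 floats at p + offset * 16. */
static void vstore16_emu(float16_emu data, size_t offset, float *p)
{
    memcpy(p + offset * LANES, data.s, sizeof data.s);
}

/* Emulates launching ComplicatedCopy with (length / 16) work-items.
   Assumes length is a multiple of 16. */
static void complicated_copy_emu(const float *src, float *dest, size_t length)
{
    for (size_t id = 0; id < length / LANES; id++)
        vstore16_emu(vload16_emu(id, src), id, dest);
}
```

With a 32-element array the emulation runs two iterations: the first handles elements 0 to 15 and the second elements 16 to 31.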

Vectors can be accessed via three notations:

- XYZW: f.x is the first component, etc.
- High/low: f.hi, f.lo, f.even, f.odd
- S-notation: f.s0 is the first, f.s3 is the fourth, etc.

Vectors can be reshaped:

- Into sub-vectors: float4 a; float2 b = a.xz;
- By shuffling: f.xyzw = f.wzyx; f.s0123 = f.s1320;

In kernel code, the most common functions from <math.h> are provided automatically.

- Examples: sin(), cos(), asin(), pow()
- Most functions operate on vectors too
- Floating-point comparison: isequal()
- More functions: min(), max(), clamp(), clz() (count leading zeros)

    kernel void VectorSine(global float4 *vec) {
        ...
        // Operates on 4 values concurrently
        vec[id] = ...

Working with work-groups is very important:

- Sometimes information must be shared
- Redundant computations can be avoided
- To avoid memory conflicts, work-items in the same work-group can wait for each other
- Work-groups are completely independent from one another

Local memory can be allocated in two ways:

- Statically. Example: local int data[20];
- Dynamically, with its size specified in host code
  - Use clSetKernelArg(kernel, index, size, NULL);
  - The local array is then one of the kernel arguments

Synchronization is achieved with barriers and memory fences:

- Barrier: blocks work-items until the entire work-group reaches the barrier function
  - Use barrier(CLK_LOCAL_MEM_FENCE);
- Memory fence: ensures correct ordering of memory read/write operations (advanced)
  - Use mem_fence() with the same argument as barrier()


    kernel void Reverse32(const global int *vector, global int *result)
    {
        int id = get_global_id(0), lid = get_local_id(0);
        local int share[32];

        // Load values to shared memory
        share[lid] = vector[id];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Valid indices are 0..31, so the mirrored slot is 31 - lid;
        // the output goes to result because vector is const
        result[id] = share[31 - lid];
    }

Without the barrier, a work-item could read a slot of share before the work-item responsible for that slot has written it.
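One way to picture the barrier is as a split of the kernel into two phases: nobody starts the second phase until everybody has finished the first. The plain C sketch below emulates one work-group of a Reverse32-style kernel with two loops, one per phase. `reverse32_group_emu` is a hypothetical name, the local array stands in for local memory, and the mirrored index is taken as 31 - lid so that it stays inside the 32-element array.

```c
#include <stddef.h>

#define GROUP_SIZE 32

/* Emulates one work-group of a Reverse32-style kernel on the host.
   The boundary between the two loops plays the role of the barrier. */
static void reverse32_group_emu(const int *vector, int *result,
                                size_t group_id)
{
    int share[GROUP_SIZE];                 /* stands in for local memory */
    size_t base = group_id * GROUP_SIZE;   /* first global ID of the group */

    /* Phase 1 (before the barrier): every work-item stores its element. */
    for (size_t lid = 0; lid < GROUP_SIZE; lid++)
        share[lid] = vector[base + lid];

    /* Phase 2 (after the barrier): every work-item reads the mirrored slot,
       which is guaranteed to be filled because phase 1 has completed. */
    for (size_t lid = 0; lid < GROUP_SIZE; lid++)
        result[base + lid] = share[GROUP_SIZE - 1 - lid];
}
```

Merging the two loops into one would model a kernel without the barrier: the iteration for lid 0 would read share[31] before it was written, which is the hazard the barrier prevents.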


    kernel void Sum32(const global int *vector, global int *result)
    {
        int id = get_global_id(0), lid = get_local_id(0);
        local int share[32], sum;

        // Load values to shared memory
        share[lid] = vector[id];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Only the first group item performs summation
        if (lid == 0) {
            sum = 0;
            for (int i = 0; i < 32; i++)
                sum += share[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // All group items use the same value;
        // the output goes to result because vector is const
        result[id] = sum;
    }

Work-items can copy global memory to local memory (or vice versa) asynchronously:

    event_t async_work_group_copy(local gentype *dst,
                                  const global gentype *src,
                                  size_t num_elements,
                                  event_t event)

The returned event_t is passed to wait_group_events to wait for the copy.


    kernel void CopyWhileComputing(const global int *vector, global int *error)
    {
        local int localvec[20], computedblock[20];
        event_t ev;

        // Start the copy. It must be called by all work-items of the group
        // with the same arguments, so the source offset uses the group ID.
        ev = async_work_group_copy(localvec, vector + get_group_id(0) * 20, 20, 0);

        dosomethingcomplicated(computedblock); // May take a while

        wait_group_events(1, &ev); // Waits for the copy to finish

        for (int i = 0; i < 20; i++)
            if (localvec[i] != computedblock[i])
                *error = 1;
    }

Images are a type of memory object, optimized for pixel-wise access.

- In kernels: image2d_t and image3d_t
- Images can only be read-only or write-only
  - Specified with the read_only/write_only access qualifiers

- Data type specifies the size and structure of pixel data
  - Examples: CL_FLOAT, CL_UNSIGNED_INT32
- Channel order specifies the number and ordering of channels in a pixel
  - Examples: CL_INTENSITY, CL_RGBA, CL_ARGB

- Images are opaque handles that can only be accessed using samplers
- Samplers define the way an image is accessed and what happens when accessing beyond its borders
- For instance, a normalized-coordinate sampler takes an (x,y,z) coordinate and accesses the image in the range [0,0,0] to [1,1,1]

Samplers contain 3 distinct properties:

- Coordinate normalization: CLK_NORMALIZED_COORDS_TRUE/FALSE
- Filtering (accessing data between pixels):
  - CLK_FILTER_NEAREST: nearest pixel value
  - CLK_FILTER_LINEAR: linear interpolation

- Addressing (accessing out-of-range coordinates):
  - CLK_ADDRESS_NONE: out-of-range coordinates must not be accessed (the result is undefined)
  - CLK_ADDRESS_CLAMP: returns a default value (0,0,0,0)
  - CLK_ADDRESS_CLAMP_TO_EDGE: returns the value of the nearest edge pixel
  - CLK_ADDRESS_REPEAT: wraps around the image
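Two of the addressing modes can be sketched as plain C index arithmetic. This is an illustration under assumptions, not the device implementation: it covers one dimension and nearest filtering only, and `clamp_to_edge`, `repeat_nearest` and `floor_emu` are hypothetical names. Note that CLK_ADDRESS_REPEAT is only meaningful with normalized coordinates, so `repeat_nearest` takes a float.

```c
/* Floor for floats without needing the math library. Assumes the value
   fits in a long. Hypothetical helper. */
static float floor_emu(float s)
{
    float f = (float)(long)s;      /* truncates toward zero */
    return (f > s) ? f - 1.0f : f; /* step down for negative fractions */
}

/* CLK_ADDRESS_CLAMP_TO_EDGE on an integer pixel coordinate: out-of-range
   coordinates resolve to the nearest edge pixel. */
static int clamp_to_edge(int i, int size)
{
    if (i < 0) return 0;
    if (i >= size) return size - 1;
    return i;
}

/* CLK_ADDRESS_REPEAT on a normalized coordinate with nearest filtering:
   only the fractional part of s selects the pixel. */
static int repeat_nearest(float s, int size)
{
    float frac = s - floor_emu(s);           /* in [0, 1) */
    int i = (int)floor_emu(frac * (float)size);
    if (i > size - 1) i -= size;             /* guard against rounding up */
    return i;
}
```

For an image 8 pixels wide, a coordinate one pixel to the left of the image resolves to pixel 0 under clamp-to-edge, while a normalized coordinate of 1.25 under repeat lands on the same pixel as 0.25.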

Samplers are declared in kernels as sampler_t.

Note: Samplers may also be initialized in the host using clCreateSampler.

Example:

    sampler_t sampler = CLK_NORMALIZED_COORDS_TRUE |
                        CLK_ADDRESS_REPEAT |
                        CLK_FILTER_NEAREST;

    kernel void GaussianFilter(read_only image2d_t image,
                               write_only image2d_t result)
    {
        int x = get_global_id(0), y = get_global_id(1);
        float4 color = (float4)0.0f;

        // Since we want to blur the image correctly on the edges,
        // we use CLAMP_TO_EDGE
        sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                            CLK_ADDRESS_CLAMP_TO_EDGE |
                            CLK_FILTER_NEAREST;

        color += 1.0f * read_imagef(image, sampler, (int2)(x - 1, y - 1));
        color += 2.0f * read_imagef(image, sampler, (int2)(x,     y - 1));
        color += 1.0f * read_imagef(image, sampler, (int2)(x + 1, y - 1));
        color += 2.0f * read_imagef(image, sampler, (int2)(x - 1, y    ));
        color += 4.0f * read_imagef(image, sampler, (int2)(x,     y    ));
        color += 2.0f * read_imagef(image, sampler, (int2)(x + 1, y    ));
        color += 1.0f * read_imagef(image, sampler, (int2)(x - 1, y + 1));
        color += 2.0f * read_imagef(image, sampler, (int2)(x,     y + 1));
        color += 1.0f * read_imagef(image, sampler, (int2)(x + 1, y + 1));
        color /= 16.0f;

        write_imagef(result, (int2)(x, y), color);
    }
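The same 3x3 weighting can be tried on the host with a plain C emulation. The sketch below makes several simplifying assumptions: the image is a single-channel float array in row-major order rather than an image2d_t of float4 pixels, the nested loops replace the 2D range of work-items, and `pixel_clamped` imitates clamp-to-edge addressing. Both function names are hypothetical.

```c
#include <stddef.h>

/* Reads one pixel with clamp-to-edge behavior: coordinates outside the
   image resolve to the nearest edge pixel. Hypothetical helper. */
static float pixel_clamped(const float *image, int width, int height,
                           int x, int y)
{
    if (x < 0) x = 0; else if (x >= width) x = width - 1;
    if (y < 0) y = 0; else if (y >= height) y = height - 1;
    return image[(size_t)y * (size_t)width + (size_t)x];
}

/* Emulates a GaussianFilter-style kernel over a width x height image.
   Each (x, y) iteration corresponds to one work-item. */
static void gaussian_filter_emu(const float *image, float *result,
                                int width, int height)
{
    /* The weights sum to 16, hence the final division by 16. */
    static const float weights[3][3] = {
        {1.0f, 2.0f, 1.0f},
        {2.0f, 4.0f, 2.0f},
        {1.0f, 2.0f, 1.0f},
    };

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float color = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    color += weights[dy + 1][dx + 1] *
                             pixel_clamped(image, width, height,
                                           x + dx, y + dy);
            result[(size_t)y * (size_t)width + (size_t)x] = color / 16.0f;
        }
    }
}
```

Because out-of-range reads return the nearest edge pixel, a uniformly colored image passes through the filter unchanged, including at its borders, which is the reason for choosing clamp-to-edge here.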

To read from images, use read_image{f,i,ui}. Examples:

    float4 read_imagef(image2d_t image, sampler_t sampler, int2/float2 coord)
    uint4 read_imageui(image3d_t image, sampler_t sampler, float4 coord)


In the GaussianFilter kernel, the nine read_imagef calls read all 9 neighboring pixels (target pixel included).

To write to images, use write_image{f,i,ui}. Examples:

    void write_imagef(image2d_t image, int2 coord, float4 color)
    void write_imagei(image2d_t image, int2 coord, int4 color)

- Notice there are no samplers: actual pixels have to be written
- 3D images can be written only with the cl_khr_3d_image_writes extension


- Exercise 1 contains both OpenCL host and kernel programming
- It is strongly recommended to test the code on an actual GPU
- Optimized code is considered a bonus and graded accordingly

Good luck!
