OpenCL Device Fission Benedict R. Gaster, AMD


1 Fission
Benedict R. Gaster, AMD
March 2011
Copyright Khronos Group, 2011


2 Fission (cl_ext_device_fission)

Provides an interface for subdividing a device into multiple sub-devices.

Typically used to:
- Reserve a part of the device for high-priority/latency-sensitive tasks
- Subdivide compute devices along some shared hardware feature

Supported by CPU and Cell Broadband devices:
- Multicore CPU devices (AMD and Intel)
- IBM Cell Broadband

In the future, support for the GPU too?

Developed by:
- AMD, Apple, IBM, and Intel

3 A portable threading library

Cross platform:
- Windows, Linux, and OS X

Threading features:
- Asynchronous queues (i.e. back-ends for devices)
- Events, dependency control

Example class Parallel with public interface:

    class Parallel {
    public:
        Parallel();
        static unsigned int atomicAdd(unsigned int, volatile unsigned int *);
        bool parallelFor(int range, std::function<void (int i)>);
    };

Simple with Fission

4 [Diagram: 6-core x86 CPU with six L2 caches, a shared 6MB L3, memory controller, IO controller (HT), and DDR3 system memory (n GB)]

    #define USE_CL_DEVICE_FISSION 1
    #include <CL/cl.hpp>

    class Parallel {
    private:
        cl::Context context_;
        std::vector<cl::Device> subDevices_;
        cl::CommandQueue queue_;
        std::vector<cl::CommandQueue> queues_;
        static void CL_CALLBACK wrapper(void * a);
    public:
        Parallel() {

5 [Diagram: 6-core x86 CPU, as on the previous slide]

            std::vector<cl::Platform> platforms;
            cl::Platform::get(&platforms);

            cl_context_properties properties[] = {
                CL_CONTEXT_PLATFORM,
                (cl_context_properties)(platforms[1])(),
                0
            };

            context_ = cl::Context(CL_DEVICE_TYPE_CPU, properties);


6 1 device mapped to 6 cores

[Diagram: 6-core x86 CPU, as on the previous slides]

            std::vector<cl::Device> devices =
                context_.getInfo<CL_CONTEXT_DEVICES>();

We know that this will always return just a single device for the CPU, i.e. all cores are used to execute work-groups on that device.

7 6 devices mapped to 6 cores

1. Check device fission supported

            if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find(
                    "cl_ext_device_fission") == std::string::npos) {
                exit(-1);
            }
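The check above uses a substring search on the extension string. Because CL_DEVICE_EXTENSIONS is a space-separated list of names, a stricter variant can compare whole tokens. The following is a minimal sketch using only the standard library; `hasExtension` is a hypothetical helper, not part of OpenCL or cl.hpp.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not an OpenCL or cl.hpp function): returns true only
// if `name` appears as a complete, whitespace-delimited token in `extensions`.
bool hasExtension(const std::string& extensions, const std::string& name) {
    std::istringstream tokens(extensions);
    std::string token;
    while (tokens >> token) {      // operator>> splits on whitespace
        if (token == name) {
            return true;
        }
    }
    return false;
}
```

In the constructor this could be called as `hasExtension(devices[0].getInfo<CL_DEVICE_EXTENSIONS>(), "cl_ext_device_fission")`, so that a longer extension name sharing the same prefix would not be mistaken for a match.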

8 6 devices mapped to 6 cores

2. Create sub-devices

            cl_device_partition_property_ext props[] = {
                CL_DEVICE_PARTITION_EQUALLY_EXT, 1,
                CL_PROPERTIES_LIST_END_EXT, 0
            };

            std::vector<cl::Device> sdevices;
            devices[0].createSubDevices(props, &sdevices);
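The property list above asks for equal partitions of one compute unit each, which on the 6-core CPU yields six sub-devices. The arithmetic can be sketched as follows, assuming (from my reading of the extension) that equal partitioning creates as many sub-devices as fit and leaves any remaining compute units unused. `partitionEqually` and `EqualPartition` are hypothetical names for illustration; they make no OpenCL calls.

```cpp
// Hypothetical model of CL_DEVICE_PARTITION_EQUALLY_EXT arithmetic.
// Assumption: leftover compute units are simply not assigned to any sub-device.
struct EqualPartition {
    unsigned int subDevices;   // number of sub-devices created
    unsigned int unusedUnits;  // compute units left out of every sub-device
};

// Assumes unitsPerSubDevice > 0.
EqualPartition partitionEqually(unsigned int computeUnits,
                                unsigned int unitsPerSubDevice) {
    EqualPartition p;
    p.subDevices  = computeUnits / unitsPerSubDevice;  // whole partitions only
    p.unusedUnits = computeUnits % unitsPerSubDevice;  // remainder is idle
    return p;
}
```

For the slide's case, `partitionEqually(6, 1)` gives six sub-devices with nothing left over, while a partition size of 4 on the same CPU would give one sub-device and two idle cores.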

9 6 devices mapped to 6 cores

3. Create command queues

            for (auto i = sdevices.begin(); i != sdevices.end(); i++) {
                queues_.push_back(cl::CommandQueue(context_, *i));
            }
        } // end of Parallel()

10 6 CmdQueues mapped to 6 devices

[Diagram: one context containing six CPU sub-devices, each with its own command queue]

11 6 CmdQueues mapped to 6 devices

Each command queue is asynchronous in execution!

12 Native Kernels

Enqueue C++ functions, compiled by the host compiler, to execute from within a command queue:

    cl_int clEnqueueNativeKernel(cl_command_queue command_queue,
                                 void (*user_func)(void *),
                                 void *args,
                                 size_t cb_args,
                                 cl_uint num_mem_objects,
                                 const cl_mem *mem_list,
                                 const void **args_mem_loc,
                                 cl_uint num_events_in_wait_list,
                                 const cl_event *event_wait_list,
                                 cl_event *event)

There is no guarantee that the function will execute in the same thread that performed the enqueue, so be careful about thread-local storage usage.

13 parallelFor

    bool parallelFor(int range, std::function<void (int i)> f) {
        std::vector<cl::Event> events;
        size_t args[2];
        args[0] = reinterpret_cast<size_t>(&f);
        int index = 0;
        for (int x = 0; x < range; x++) {
            int numqueues = range - x > queues_.size() ? queues_.size() : range - x;
            cl::Event event;
            while (numqueues > 0) {
                // loop body is filled in on slide 14
            }
        }
        cl::Event::waitForEvents(events);
        return true;
    }


14 parallelFor

    bool parallelFor(int range, std::function<void (int i)> f) {
        std::vector<cl::Event> events;
        size_t args[2];
        // args[0] carries the address of the user's function object
        args[0] = reinterpret_cast<size_t>(&f);
        int index = 0;
        // x counts dispatched iterations; note that the for header also
        // increments it once per batch
        for (int x = 0; x < range; x++) {
            int numqueues = range - x > queues_.size() ? queues_.size() : range - x;
            cl::Event event;
            while (numqueues > 0) {
                // args[1] carries the loop index for this invocation
                args[1] = static_cast<size_t>(index++);
                queues_[numqueues-1].enqueueNativeKernel(
                    wrapper,
                    std::make_pair(
                        static_cast<void *>(args),
                        sizeof(size_t)*2),
                    NULL,
                    NULL,
                    NULL,
                    &event);
                events.push_back(event);
                numqueues--;
                x++;
            }
        }
        cl::Event::waitForEvents(events);
        return true;
    }
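The batching in parallelFor can be modelled on the host without any OpenCL calls. The sketch below is only an illustration: `dispatchPlan` is a hypothetical helper that returns the (index, queue) pairs in the order they would be enqueued, filling queues from the highest slot down as the slide does with `queues_[numqueues-1]`. Unlike the slide's loop, where x is incremented both in the for header and in the inner loop, the sketch advances a single counter once per dispatched index.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical host-only model of the batching loop. Returns (index, queue)
// pairs in dispatch order; nothing is actually enqueued.
std::vector<std::pair<int, int> > dispatchPlan(int range, int numQueues) {
    std::vector<std::pair<int, int> > plan;
    if (range <= 0 || numQueues <= 0) {
        return plan;  // nothing to dispatch
    }
    int index = 0;
    while (index < range) {
        // A batch uses every queue, or fewer if fewer indices remain.
        int batch = std::min(numQueues, range - index);
        for (int q = batch - 1; q >= 0; --q) {
            plan.push_back(std::make_pair(index, q));
            ++index;
        }
    }
    return plan;
}
```

With `dispatchPlan(8, 6)` the first batch places indices 0 to 5 on queues 5 down to 0, and the second batch places indices 6 and 7 on queues 1 and 0, so every index in the range is dispatched exactly once.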


15 parallelFor - wrapper

    private:
        static void CL_CALLBACK wrapper(void * a) {
            size_t * args = static_cast<size_t *>(a);
            std::function<void (int i)> * f =
                reinterpret_cast<std::function<void (int i)>*>(args[0]);
            (*f)(static_cast<int>(args[1]));
        }
    }; // class Parallel
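The packing trick used by parallelFor and its wrapper can be exercised without OpenCL. This is a minimal sketch under stated assumptions: `wrapperSketch` mirrors the slide's wrapper minus the CL_CALLBACK macro, and `runPacked` is a hypothetical stand-in for the enqueue, using `memcpy` to model the fact that clEnqueueNativeKernel copies the `cb_args` bytes before the callback runs. It also assumes, as the slide does, that `size_t` is wide enough to hold a pointer.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>

// Same shape as the slide's wrapper, without CL_CALLBACK so that it builds
// without the OpenCL headers.
static void wrapperSketch(void* a) {
    std::size_t* args = static_cast<std::size_t*>(a);
    std::function<void (int)>* f =
        reinterpret_cast<std::function<void (int)>*>(args[0]);
    (*f)(static_cast<int>(args[1]));
}

// Hypothetical stand-in for the enqueue: packs {function address, index}
// into two size_t slots, copies the block, and invokes the callback on the copy.
void runPacked(std::function<void (int)>& f, int index) {
    std::size_t args[2];
    args[0] = reinterpret_cast<std::size_t>(&f);   // slot 0: address of f
    args[1] = static_cast<std::size_t>(index);     // slot 1: loop index
    std::size_t copy[2];
    std::memcpy(copy, args, sizeof(args));         // models the runtime's copy
    wrapperSketch(copy);
}
```

Because only the address of the function object is copied, the object itself must stay alive until the callback has run, which is why parallelFor waits on all events before returning.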

16 How many primes

    const unsigned int numnumbers = 1024 * 1024;

    int main(void) {
        volatile unsigned int numprimes = 0;
        int * numbers = new int[numnumbers];
        Parallel parallel;
        parallel.parallelFor(numnumbers, [numbers, &numprimes] (int x) {
            auto isprime = [] (unsigned int n) -> bool {
                if (n == 1 || n == 2) {
                    return true;
                }
                if (n % 2 == 0) {
                    return false;
                }
                // trial division by odd candidates up to sqrt(n)
                for (unsigned int odd = 3;
                     odd <= static_cast<unsigned int>(sqrtf(static_cast<float>(n)));
                     odd += 2) {
                    if (n % odd == 0) {
                        return false;
                    }
                }
                return true;
            }; // isprime
            if (isprime(numbers[x])) {
                Parallel::atomicAdd(1, &numprimes);
            }
        }); // parallelFor
    } // main
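A serial reference count is useful for checking the result of the parallel version. The sketch below is an assumption-laden illustration, not the slide's code: `isPrime` and `countPrimes` are hypothetical names, the counter is a `std::atomic` rather than the slide's `volatile` variable with `Parallel::atomicAdd`, the loop bound uses integer division instead of `sqrtf`, and 1 is not counted as prime, which differs from the slide's lambda.

```cpp
#include <atomic>

// Standard definition: 0 and 1 are not prime (the slide's lambda returns
// true for 1).
bool isPrime(unsigned int n) {
    if (n < 2) return false;
    if (n == 2) return true;
    if (n % 2 == 0) return false;
    // odd <= n / odd is equivalent to odd * odd <= n without overflow
    for (unsigned int odd = 3; odd <= n / odd; odd += 2) {
        if (n % odd == 0) return false;
    }
    return true;
}

// Hypothetical serial reference: counts primes in [1, limit] through an
// atomic counter, the same update a parallel loop body would perform.
// Assumes limit is smaller than the maximum unsigned int value.
unsigned int countPrimes(unsigned int limit) {
    std::atomic<unsigned int> numPrimes(0);
    for (unsigned int n = 1; n <= limit; ++n) {
        if (isPrime(n)) {
            numPrimes.fetch_add(1);
        }
    }
    return numPrimes.load();
}
```

Running the same predicate inside parallelFor over the values 1 to N and comparing against `countPrimes(N)` gives a quick consistency check on the dispatch logic.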
