QUASAR: UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE
Website: gepura.io
Bart Goossens



2 The benefits of GPUs

NVIDIA: 1000X by 2025. Growing application domain. See:

- Exploit massive parallelism, e.g. to process large amounts of data in parallel
- Speed-ups of 10x to 100x
- Energy efficient (calculations/watt)
- Applied to many applications: multimedia, finance, big data, scientific computing, ...
- Commodity hardware: standard in desktops and laptops
- New trends: integration in embedded applications and mobile devices (smartphones: OnePlus, LG, HTC, ...)

3 Number 3 supercomputer: TITAN

Oct 2012, Oak Ridge (USA), 17.59 petaflops (LINPACK benchmark)

4 The drawbacks of GPU programming, solved by Quasar

- Low-level coding: experts needed
- Strong coupling between algorithm development and implementation
- Long development lead times
- Each hardware platform requires new optimizations

5 Quasar: overview

Inputs: algorithm (Quasar code), data, data characteristics, kernel characteristics, hardware characteristics

Programming language:
- Compact code, easy to learn, easy to use
- High level of abstraction
- Hardware-agnostic programming

Code analysis:
- Kernel generation
- Loop parallelization
- Target-specific optimization
- Developer feedback

Compilation: .NET, OpenMP & SIMD, OpenCL, CUDA

Runtime:
- Memory management
- Load balancing
- Scheduling
- Kernel parameter optimization

Target hardware: CPU, multi-core CPU, many-core accelerator, GPU, embedded device

6 Quasar - Scripting language

Same abstraction level as Python and Matlab:

    numElev=10
    for ielev = 0..numElev-1
        % Define the vector that points at this azimuth and elevation from the array phase center
        pointing_vectors[0,:] = cos(az*pi/180)*cos(el[ielev]*pi/180)
        pointing_vectors[1,:] = sin(az*pi/180)*cos(el[ielev]*pi/180)
        pointing_vectors[2,:] = sin(el[ielev]*pi/180)
    endfor
    % ^ Matrix slices!

    % Compute the actual focus point (meters)
    focus_points = pointing_vectors * diagm(focus_range)
    % ^ Matrix multiplication!

    % Compute the difference in range to each element in the array with respect to the array phase center
    focus_points_matrix = reshape(kron(focus_points,ones(numels,1)), [numels,3,numaz])
    delta_range = sqrt((reshape(sum((p_array_matrix - _
        focus_points_matrix).^2,1), [numels, numaz]))) - ones(numels,1)*focus_range

    % Compute the true array response vectors to the source (azimuth,elevation,focus range)
    for ifrq = 0..numFreqs-1
        v[:,:,ielev,ifrq] = exp(1j*2*pi*delta_range*freqs[ifrq]/soundspeed)
    endfor

7 Example: code written in CUDA vs. Quasar

CUDA:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>

    // Kernel that executes on the CUDA device
    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = a[idx] * a[idx];
    }

    // main routine that executes on the host
    int main(void)
    {
        float *a_h, *a_d;   // Pointer to host & device arrays
        const int N = 10;   // Number of elements in arrays
        size_t size = N * sizeof(float);
        a_h = (float *)malloc(size);        // Allocate array on host
        cudaMalloc((void **) &a_d, size);   // Allocate array on device

        // Initialize host array and copy it to CUDA device
        for (int i = 0; i < N; i++) a_h[i] = (float)i;
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

        // Do calculation on device: round the number of blocks up
        int block_size = 4;
        int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
        square_array <<< n_blocks, block_size >>> (a_d, N);

        // Retrieve result from device and store it in host array
        cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

        // Print results
        for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);

        // Cleanup
        free(a_h);
        cudaFree(a_d);
    }

Quasar:

    a_d = 0..9
    print a_d.^2

8 Example: Nvidia Thrust vs. Quasar

CUDA (Nvidia Thrust library):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/sequence.h>
    #include <thrust/copy.h>
    #include <thrust/functional.h>
    #include <iostream>
    #include <iterator>

    template <typename T>
    struct square
    {
        __host__ __device__
        T operator()(const T& x) const { return x * x; }
    };

    int main(void)
    {
        // allocate two device_vectors with 10 elements
        thrust::device_vector<int> X(10);
        thrust::device_vector<int> Y(10);

        // initialize X to 0,1,2,3,...
        thrust::sequence(X.begin(), X.end());

        // compute Y = X.^2
        thrust::transform(X.begin(), X.end(), Y.begin(), square<int>());

        // print Y
        thrust::copy(Y.begin(), Y.end(), std::ostream_iterator<int>(std::cout, "\n"));
        return 0;
    }

Quasar:

    a_d = 0..9
    print a_d.^2

9 Quasar - IDE


10 [Overview diagram, repeated: algorithm (Quasar code), data, data characteristics, kernel characteristics, hardware characteristics; code analysis, compilation, runtime; CPU, multi-core CPU, many-core accelerator, GPU, embedded device]

11 Quasar: smart parallelization

CPU vs. GPU: different kinds of parallelism (coarse-grained threads vs. fine-grained threads)!

Identifying parallelism is not enough to guarantee optimal execution on a given device:
- Multiple parallelization options are available
- Trade-offs need to be made
- Often, multiple parallelizations need to be developed for different target platforms: compiler support needed!

Quasar smart parallelization focuses on:
1) Multidimensional perfect loops
2) Multidimensional imperfect loops
3) Complex loop nests

12 Quasar: smart parallelization

1) Multidimensional perfect loops: the iterator domains describe axis-aligned boxes.

Example (template matching) [figure; labels: cnt_iy, cnt_ix]

13 Quasar: smart parallelization - developer feedback

14 Quasar: smart parallelization

2) Multidimensional imperfect loops

Convert to perfect loops by adding branches. The loop nest

    for m=0..M-1
        for n=0..N-1-m*2
            for p=0..2
                im_out[M-1-m,N-1-n,p]=im[m,n,p]
            endfor
        endfor
    endfor

becomes

    for m=0..M-1
        for n=0..N-1
            for p=0..2
                if n<N-m*2
                    im_out[M-1-m,N-1-n,p]=im[m,n,p]
                endif
            endfor
        endfor
    endfor

Affine loop transformations (polyhedral framework):

    for m=0..M-1
        for n=0..N-1
            for p=0..2
                im[m,n,p]=0.3*im[m-1,n-1,p] + 0.4*im[m-2,n-2,p]+0.3*im[m,n,p]
            endfor
        endfor
    endfor

The original loop is not parallelizable. After the change of variables n' = n - m it is parallelizable along 2 dimensions!
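The branch-based conversion can be checked with a small sketch. It is written in Python purely for illustration; the image size and the helper names are assumptions, not taken from the slides. Both loop nests write exactly the same elements, but the second one has rectangular bounds, so each (m, n, p) iteration could be mapped to its own thread.

```python
# Hypothetical image size, chosen so that N - 2*m stays positive for all m.
M, N, P = 4, 12, 3

def make_image():
    # Distinct value per element so that misplaced writes are detectable.
    return [[[100 * m + 10 * n + p for p in range(P)]
             for n in range(N)] for m in range(M)]

def zeros():
    return [[[0] * P for _ in range(N)] for _ in range(M)]

def flip_imperfect(im):
    out = zeros()
    for m in range(M):
        for n in range(N - 2 * m):      # inner bound depends on m
            for p in range(P):
                out[M - 1 - m][N - 1 - n][p] = im[m][n][p]
    return out

def flip_perfect(im):
    out = zeros()
    for m in range(M):
        for n in range(N):              # rectangular (axis-aligned) bounds
            for p in range(P):
                if n < N - 2 * m:       # branch replaces the m-dependent bound
                    out[M - 1 - m][N - 1 - n][p] = im[m][n][p]
    return out

im = make_image()
same = flip_imperfect(im) == flip_perfect(im)
print(same)
```

The price of the rectangular domain is that some iterations do nothing; on a GPU those threads simply fall through the branch.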

15 Quasar: smart parallelization

3) Loop nests

Advantage of a higher-level language: the compiler has a lot more information for parallelizing the code!

Example: beam forming (e.g., radar applications)

    for ielev = 0..numElev-1
        % Define the vector that points at this azimuth and elevation from the array phase center
        pointing_vectors[0,:] = cos(az*pi/180)*cos(el[ielev]*pi/180)
        pointing_vectors[1,:] = sin(az*pi/180)*cos(el[ielev]*pi/180)
        pointing_vectors[2,:] = sin(el[ielev]*pi/180)
    endfor
    % ^ 2D parallelization!

    % Compute the actual focus point (meters)
    focus_points = pointing_vectors * diagm(focus_range)

    % Compute the difference in range to each element in the array with respect to the array phase center
    focus_points_matrix : cube = reshape(kron(focus_points,ones(numels,1)), [numels,3,numaz])
    delta_range = sqrt((reshape(sum((p_array_matrix - _
        focus_points_matrix).^2,1), [numels, numaz]))) - ones(numels,1)*focus_range

    % Compute the true array response vectors to the source (azimuth,elevation,focus range)
    for ifrq = 0..numFreqs-1
        v[:,:,ielev,ifrq] = exp(1j*2*pi*delta_range*freqs[ifrq]/soundspeed)
    endfor
    % ^ 4D parallelization!

16 [Overview diagram, repeated: algorithm (Quasar code), data, data characteristics, kernel characteristics, hardware characteristics; code analysis, compilation, runtime; CPU, multi-core CPU, many-core accelerator, GPU, embedded device]

17 Compilation: (mis)conceptions about high-level code

High-level code is not necessarily slower than low-level code!

    Y = repmat(ones(10,10)*8 .* eye(10), [4,1])

- Naive compilation would evaluate the operations one by one (resulting in unnecessary memory allocations and bad data locality)
- Smart compilation (Quasar) can deduce a function that directly generates the output:

    for all m,n : y(m,n) = 8 * ((m % 10) == n ? 1 : 0)

High-level operations recognized by the compiler:
1. zeros(), ones(), eye() - matrix constructors
2. transpose(), shuffledims(), repmat() - matrix shaping functions
3. k..n, linspace() - slicing functions
4. sum(), prod(), cumsum(), cumprod() - aggregation functions
5. Operators (+, -, .*, ./, *)
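The difference between the two compilation strategies can be sketched as follows. This is an illustration in plain Python, not Quasar's compiler output; it assumes that the product with eye(10) is element-wise, and all helper names are made up for the example. The naive version allocates one temporary per operation, while the fused version computes each output element directly from its indices.

```python
def ones(rows, cols):
    return [[1] * cols for _ in range(rows)]

def eye(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def scale(a, s):
    return [[v * s for v in row] for row in a]

def elementwise_mul(a, b):
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def repmat_rows(a, times):
    # Stack `times` copies of the matrix vertically, like repmat(a, [times, 1]).
    return [row[:] for _ in range(times) for row in a]

def naive():
    t1 = ones(10, 10)             # temporary 1
    t2 = scale(t1, 8)             # temporary 2
    t3 = eye(10)                  # temporary 3
    t4 = elementwise_mul(t2, t3)  # temporary 4
    return repmat_rows(t4, 4)     # final 40 x 10 result

def fused():
    # One pass, no temporaries: y(m, n) = 8 * ((m % 10) == n ? 1 : 0)
    return [[8 * (1 if (m % 10) == n else 0) for n in range(10)]
            for m in range(40)]

identical = naive() == fused()
print(identical)
```

Every element of the fused version depends only on (m, n), which is also what makes it straightforward to run as a parallel kernel.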

18 Compilation: reductions (rewriting rules)

Reductions make it possible to provide an alternative implementation for certain operations (e.g., BLAS: Basic Linear Algebra Subprograms; cuBLAS):

    reduction (alpha : scalar, x : vec, y : vec) -> alpha * x + y = blas_sscal(alpha, x, y) % import blas.q
    reduction (x) -> real(ifft2(x)) = irealfft2(x)

Define trivial optimizations:

    reduction (x:mat) -> real(x) = x
    reduction (x:mat) -> imag(x) = zeros(size(x))
    reduction (x:mat) -> transpose(transpose(x)) = x
    reduction (x:mat) -> x[:,:] = x

Shorthands:

    reduction (x : cube) -> x[:] = reshape(x,[1,numel(x)])
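How a rule of this kind acts on an expression can be sketched with a tiny term rewriter. This is an illustrative Python model, not Quasar's implementation: expressions are represented as nested tuples `(function_name, argument)`, which is an assumption made for the example, and only two of the rules listed on the slide are modelled.

```python
def rule_double_transpose(expr):
    # transpose(transpose(x)) -> x
    if (isinstance(expr, tuple) and expr[0] == "transpose"
            and isinstance(expr[1], tuple) and expr[1][0] == "transpose"):
        return expr[1][1]
    return None

def rule_real_ifft2(expr):
    # real(ifft2(x)) -> irealfft2(x)
    if (isinstance(expr, tuple) and expr[0] == "real"
            and isinstance(expr[1], tuple) and expr[1][0] == "ifft2"):
        return ("irealfft2", expr[1][1])
    return None

RULES = [rule_double_transpose, rule_real_ifft2]

def rewrite(expr):
    # Rewrite the arguments first, then try each rule on the node itself.
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(rewrite(arg) for arg in expr[1:])
    for rule in RULES:
        result = rule(expr)
        if result is not None:
            return rewrite(result)  # a rewritten node may match another rule
    return expr

expression = ("real", ("ifft2", ("transpose", ("transpose", "x"))))
simplified = rewrite(expression)
print(simplified)
```

Both rules shrink the expression, so the rewriting is guaranteed to terminate; a rule set without that property would need an explicit iteration limit.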

19 [Overview diagram, repeated: algorithm (Quasar code), data, data characteristics, kernel characteristics, hardware characteristics; code analysis, compilation, runtime; CPU, multi-core CPU, many-core accelerator, GPU, embedded device]


20 Runtime execution: memory management

Automatic memory management:
- Allocation/disposal
- Transparent marshalling
- Transfer between CPU and GPU

[State diagram]
- State 1: CPU: non-dirty, GPU: non-dirty
- State 2: CPU: dirty, GPU: non-dirty
- State 3: CPU: non-dirty, GPU: dirty
- State 4: CPU, GPU: NA
- State 5: CPU: NA, GPU
- Transition labels: "GPU out of memory", "CPU out of memory"
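The dirty/non-dirty states can be illustrated with a small simulation. This is a sketch under stated assumptions, not the Quasar runtime: "dirty" is taken to mean "modified on this side, so the other copy is stale", the class and method names are hypothetical, the GPU copy is just a second Python list, and the out-of-memory states (4 and 5) are not modelled.

```python
class ManagedObject:
    """Keeps a CPU copy and a simulated GPU copy coherent on demand."""

    def __init__(self, data):
        self.cpu = list(data)
        self.gpu = None          # no device copy allocated yet
        self.cpu_dirty = False   # CPU copy modified, GPU copy stale
        self.gpu_dirty = False   # GPU copy modified, CPU copy stale
        self.transfers = 0       # number of simulated CPU<->GPU copies

    def _sync_to(self, device):
        # Copy only when the requested side is missing or stale.
        if device == "gpu" and (self.gpu is None or self.cpu_dirty):
            self.gpu = list(self.cpu)
            self.cpu_dirty = False
            self.transfers += 1
        elif device == "cpu" and self.gpu_dirty:
            self.cpu = list(self.gpu)
            self.gpu_dirty = False
            self.transfers += 1

    def read(self, device):
        self._sync_to(device)
        return list(self.cpu if device == "cpu" else self.gpu)

    def write(self, device, index, value):
        self._sync_to(device)
        if device == "cpu":
            self.cpu[index] = value
            self.cpu_dirty = True
        else:
            self.gpu[index] = value
            self.gpu_dirty = True

obj = ManagedObject([1, 2, 3])
obj.write("gpu", 0, 10)        # allocates the GPU copy, then marks GPU dirty
on_gpu = obj.read("gpu")       # no transfer needed
on_cpu = obj.read("cpu")       # GPU dirty, so copy back once
on_cpu_again = obj.read("cpu") # both sides non-dirty, no transfer
print(on_cpu, obj.transfers)
```

The point of tracking the flags is that repeated accesses from the same side cost no transfers; data only moves when the other side actually needs it.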

21 Runtime execution: optimization to hardware

Automated parameter optimization:
- Block size
- Grid size
- Number of threads
- Number of warps
- Shared memory


22 Runtime execution: load balancing

[Figure: execution time (μs) for CPU, NVIDIA GeForce GTX770 and NVIDIA GeForce GTX440; region labels: "time CPU > time GPU", "time GPU1 > time", "time CPU < time GPU"]

Automated load balancing based on:
- Data
- Hardware characteristics
- Kernel characteristics
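The idea behind the "time CPU > time GPU" and "time CPU < time GPU" regions can be sketched with a toy cost model. The model and all numbers below are invented for illustration and are not measurements or Quasar internals: each device has a fixed overhead (launch and transfer) plus a per-element cost, and the device with the lowest estimate is selected.

```python
# Hypothetical device parameters: fixed overhead and per-element cost, in microseconds.
DEVICES = {
    "cpu": {"overhead_us": 1.0, "us_per_element": 0.01},
    "gpu": {"overhead_us": 50.0, "us_per_element": 0.0005},
}

def estimated_time_us(device, num_elements):
    params = DEVICES[device]
    return params["overhead_us"] + params["us_per_element"] * num_elements

def choose_device(num_elements):
    # Pick the device with the smallest estimated execution time.
    return min(DEVICES, key=lambda name: estimated_time_us(name, num_elements))

small = choose_device(1_000)     # overhead dominates
large = choose_device(100_000)   # throughput dominates
print(small, large)
```

With these made-up parameters the break-even point lies at roughly 5,000 elements: below it the fixed GPU overhead outweighs its higher throughput, above it the GPU wins.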


23 Runtime execution: scheduler

Inter-object dependencies are automatically detected at runtime!

[Figure: sequential vs. concurrent execution]

Automated scheduling:
- Reduced memory transfer times
- Concurrent kernel execution (CUDA streams)
- Multi-GPU!
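Dependency detection between kernels can be sketched as follows. This is a simplified illustration, not the Quasar scheduler: each kernel is described by a hypothetical `(name, reads, writes)` triple in program order, and kernels are grouped into levels such that all kernels in one level are mutually independent and could run concurrently (for example on separate CUDA streams).

```python
def schedule(kernels):
    """Group kernels into levels; kernels within a level have no dependencies."""
    level_of = {}               # kernel index -> level
    last_writer = {}            # object name -> index of the last kernel writing it
    readers_since_write = {}    # object name -> kernels that read it since that write
    levels = []
    for i, (name, reads, writes) in enumerate(kernels):
        deps = set()
        for obj in reads:
            if obj in last_writer:                         # read after write
                deps.add(last_writer[obj])
        for obj in writes:
            if obj in last_writer:                         # write after write
                deps.add(last_writer[obj])
            deps.update(readers_since_write.get(obj, []))  # write after read
        level = 1 + max((level_of[d] for d in deps), default=-1)
        level_of[i] = level
        while len(levels) <= level:
            levels.append([])
        levels[level].append(name)
        for obj in reads:
            readers_since_write.setdefault(obj, []).append(i)
        for obj in writes:
            last_writer[obj] = i
            readers_since_write[obj] = []
    return levels

# Hypothetical kernel launches in program order.
kernels = [
    ("blur_a", {"a"}, {"a_blur"}),
    ("blur_b", {"b"}, {"b_blur"}),
    ("add", {"a_blur", "b_blur"}, {"c"}),
    ("scale_a", {"a"}, {"a"}),
]
plan = schedule(kernels)
print(plan)
```

Here `blur_a` and `blur_b` touch disjoint objects and can overlap, `add` must wait for both, and `scale_a` must wait for `blur_a` because it overwrites an object that `blur_a` reads.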

24 Runtime execution: multi-GPU programming

    {!sched gpu_index=0}
    accelerations1 = zeros(3,blocksize)
    distsums1 = zeros(blocksize)
    {!sched gpu_index=1}
    accelerations2 = zeros(3,blocksize)
    distsums2 = zeros(blocksize)
    tree_end = [size(trees[0],1)-blocksize, size(trees[1],1)-blocksize]

    % calculate each of the tree's influence
    for tree = 0..K-1
        {!sched gpu_index=0}
        parallel_do(blocksize,pointblocks[0],orders[0],calculate_update)
    end
    {!sched gpu_index=1}
    parallel_do(blocksize,pointblocks[1],orders[1],calculate_update)

In Quasar, we simply use {!sched gpu_index=#} to select which GPU the kernel runs on. The Quasar runtime takes care of everything else!

25 Quasar: results

- Faster development: 2 weeks vs. 3 months for a CUDA implementation of an MRI reconstruction algorithm (with equal performance)
- Faster execution using the GPU: 64 fps vs. 2.91 fps for a template matching algorithm
- More efficient code: 300 lines of Quasar code vs lines of C++ code for a registration algorithm

26 Quasar: benchmarks

[Figure: benchmark processing times (ms), Matlab vs. Quasar, over several input sizes, for: anisotropic diffusion, bilateral filter, BLS-GSM, Gaussian filter, non-local means denoising, non-local means deconvolution, wavelet thresholding]

27 Key takeaways

GPUs: performance revolution

High-level code:
- Gives the compiler more flexibility for code generation and optimization
- Required when targeting heterogeneous systems (GPUs)

Quasar:
- New programming paradigm for heterogeneous systems (CPU, GPU, ...)
- Low barrier to entry (Matlab-like)
- Reduces development time
- Future proof!

Redshift IDE: powerful code editing, debugging, profiling!

Available for tryout:

28 Questions?
