Astrosim, Les «Houches» the 7th ;-) GPU: the disruptive technology of the 21st century. A little comparison between CPU, MIC, GPU.


1 Astrosim 2017, Les «Houches» the 7th ;-) GPU: the disruptive technology of the 21st century. A little comparison between CPU, MIC, GPU. Emmanuel Quémener

2 A (human) generation before... A serial B film in 1984: The Last Starfighter. 27 minutes of synthetic images, 10⁹ operations per image. Use of a Cray X-MP (13 kW): 66 days (in fact, 1 year needed). 2016: on a GTX 1080Ti (250 W), 3.4 seconds. Comparison GTX 1080Ti / Cray: performance 10⁶! Consumption ~9!

3 «The beginning of all things»: «Nvidia Launches Tesla Personal Supercomputer». When: on 19/11/2008. Where: on Tom's Hardware. What: a PCIe card, the Tesla C1060, with 240 cores. How much: 933 Gflops SP (but 78 Gflops DP). Who: Nvidia

4 What position for GPUs? The other «accelerators». Accelerator, or the old history of coprocessors: the 8087 (on 8086/8088) for floating-point operations; 1989: the 80387 (on 80386) and the respect of IEEE; the 80486DX and the integration of the FPU inside the CPU; 1997: K6 3DNow! & Pentium MMX, SIMD inside the CPU; 1999: SSE functions and the birth of a long series (SSE4 & AVX). When chips stand out from the CPU: 1998: DSPs in the style of the TMS320C67x as tools; 2008: the Cell inside the PS3 and inside IBM's Roadrunner, #1 of the Top500. Business of compilers & a very accurate programming model.

5 Why is a GPU so powerful? To construct a 3D scene! 2 approaches: raytracing (POV-Ray) and shading (3 operations). Raytracing: starting from the eye towards the objects of the scene. Shading: Model2World, vectorial objects placed in the scene; World2View, projection of the objects onto a view plane; View2Projection, pixelization of the vectorial view plane.

6 Why is a GPU so powerful? Shading & matrix computing. Model2World (M2W): 3 matrix products (rotation, translation, scaling). World2View (W2V): 2 matrix products (camera position, direction of the point looked at). View2Projection (V2P): pixellization. A GPU: a «huge» matrix multiplier.
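To make the matrix pipeline above concrete, here is a minimal NumPy sketch (not taken from the slides; names and values are illustrative) chaining scaling, rotation and translation into a single Model2World matrix in homogeneous coordinates:

import numpy as np

def scaling(sx, sy, sz):
    return np.diag([sx, sy, sz, 1.0])

def rotation_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0, 0.0],
                     [s,  c, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def translation(tx, ty, tz):
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

# Model2World = Translation . Rotation . Scaling : 3 matrix products
M2W = translation(1.0, 2.0, 0.0) @ rotation_z(np.pi / 4) @ scaling(2.0, 2.0, 2.0)

# One vertex in homogeneous coordinates, placed into the scene
vertex = np.array([1.0, 0.0, 0.0, 1.0])
print(M2W @ vertex)

Applied to millions of vertices per frame, these small 4x4 products are exactly the workload a GPU is built to pipeline.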

7 Is the GPU really so powerful? With my best MyriALUs. [Chart: xGEMM performance for OpenBLAS, clBLAS, Thunking and CuBLAS, in SP and DP, on a CPU (2x E5-2680v4), Tesla P100 («Tesla for Pros»), GTX 1080Ti («GTX for Gamers»), R9 Fury («AMD for Alt*»), Xeon Phi and Ryzen7 1800.]

8 BLAS match, CPU vs GPU: xGEMM when x=D(P) or S(P). [Chart: OpenBLAS, clBLAS, Thunking and CuBLAS in SP and DP across Tesla boards («Tesla for Pros»), GeForce GTX boards («GTX for Gamers»), AMD boards («AMD for Alt*»), Quadro, Xeon Phi and CPUs (Xeon E5 v3/v4, FX, Ryzen7).]

9 What Nvidia never broadcasts... xGEMM on a large domain of sizes. [Chart: performance vs size for Tesla P100 and GTX 1080.] Yes, the performance exists, but in specific conditions!

10 How a GPU must be considered: a laboratory of parallelism. You've got dozens of thousands of Equivalent PUs! You have the choice between: CPU, MIC, GPU.

11 How many elements inside? Difference between GPU & CPU. [Tesla P100] Operations: matrix multiply, vectorization, pipelining, shader (multi)processors. Programming: 1993. Genericity: 2002. OpenGL, Glide, Direct3D, Cg Toolkit, CUDA, OpenCL.

12 Why is the GPU so powerful? The comeback of generic processing. Inside the GPU: specialized processing units, pipelining efficiency. Adaptation loss: changing scenes, different nature of details. The solution: more generalist processing units!

13 Why is the GPU so powerful? All the «M» of Flynn's taxonomy included. Vectorial: SIMD (Single Instruction Multiple Data), the addition of 2 positions (x,y,z) in 1 single command. Parallel: MIMD (Multiple Instructions Multiple Data), several executions in parallel with (almost) the same data. In fact, SIMT: Single Instruction Multiple Threads. All processing units share the Threads; each processing unit can work independently from the others; need to synchronize the Threads.
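A minimal NumPy sketch (illustrative, not from the slides) of the «one instruction, many data» idea: the same addition applied to a million (x,y,z) positions as a single vectorized statement instead of an explicit loop.

import numpy as np

# One million positions (x, y, z)
a = np.random.rand(1000000, 3).astype(np.float32)
b = np.random.rand(1000000, 3).astype(np.float32)

# Scalar view: one addition per position, one after the other
# (slow in pure Python: that is precisely the point)
c = np.empty_like(a)
for n in range(a.shape[0]):
    c[n] = a[n] + b[n]

# SIMD/SIMT view: the same work expressed as a single command over all the data
c = a + b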

14 Important dates: OpenGL and the birth of a standard; OpenGL 1.2 and its interesting functions; the Cg Toolkit (Nvidia) and the extensions of the language, with wrappers for all languages (Python); 2007-06: CUDA (Nvidia), or the arrival of a real language; 2008-06: Snow Leopard (Apple) integrates OpenCL (the will to make the best use of one's machine?); OpenCL 1.0 and its first specifications; WebCL and its first version by Nokia.

15 Who uses OpenCL? As applications. OS: Apple in MacOSX. «Big» applications: LibreOffice, FFmpeg. Known graphical applications: Photoscan, ... Scientific ones: OpenMM, Gromacs, Lammps, Amber, ... Hundreds of software packages and libraries! But checking is essential.

16 Why the GPU is a disruptive technology: how it reshuffles the cards in Scientific Computing. In a conference in February 2006, this x1 in 5 years. Between 2000 and 2015, GeForce 2 Go / GTX 980Ti: from 286 to 2816 MOperations/s: x1. Classical CPU: x1.

17 Just a reminder of last Friday! Parallel programming models:

Model    | Cluster | Node CPU | Node GPU | Node Nvidia | Accelerator
MPI      | Yes     | Yes      | No       | No          | Yes*
PVM      | Yes     | Yes      | No       | No          | Yes*
OpenMP   | No      | Yes      | No       | No          | Yes*
Pthreads | No      | Yes      | No       | No          | Yes*
OpenCL   | No      | Yes      | Yes      | Yes         | Yes
CUDA     | No      | No       | No       | Yes         | No
TBB      | No      | Yes      | No       | No          | Yes*

OpenCL is the most versatile one. CUDA can ONLY be used with Nvidia GPUs. The Accelerator seems to be the most universal, but...

18 Act as an integrator: do not reinvent the wheel! Parallel programming libraries:

Library | Cluster        | Node CPU      | Node GPU | Node Nvidia | Accelerator
BLAS    | BLACS, MKL     | OpenBLAS, MKL | clBLAS   | CuBLAS      | OpenBLAS, MKL
LAPACK  | ScaLAPACK, MKL | Atlas, MKL    | clMAGMA  | MAGMA       | MagmaMIC
FFT     | FFTw3          | FFTw3         | clFFT    | CuFFT       | FFTw3

Classic libraries can be used on the GPU. Nvidia provides lots of implementations. OpenCL provides several of them, but they are not as complete.
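As a minimal sketch of the integrator approach (assuming a NumPy build linked against an optimized BLAS such as OpenBLAS or MKL), the xGEMM and FFT operations listed above are reached from Python without writing any kernel:

import numpy as np

# DGEMM: dispatched to whatever BLAS this NumPy build is linked against (OpenBLAS, MKL, ...)
A = np.random.rand(2048, 2048)
B = np.random.rand(2048, 2048)
C = A @ B

# FFT: NumPy's built-in FFT here; FFTw3, clFFT or CuFFT expose the same operation elsewhere
signal = np.random.rand(1 << 20)
spectrum = np.fft.rfft(signal)
print(C.shape, spectrum.shape)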

19 Why OpenCL? A non politically correct history. In the middle of the 2000s, Apple needs power for MacOSX. Some processing can be done by Nvidia chipsets. CUDA exists as a good successor of the Cg Toolkit. But Apple did not want to be jailed (as their users :-) ). Apple promotes the creation of the Khronos consortium. AMD, IBM and ARM came quickly; Nvidia, Intel, Altera, ... came along more reluctantly...

20 What OpenCL offers: about ten implementations on x86. 3 implementations for CPU: the AMD one (the original), the Intel one (efficient for specific parallel rates), PortableCL (POCL, open source). 4 implementations for GPU: the AMD one (for AMD/ATI graphics boards), the Nvidia one (for Nvidia graphics boards), «Beignet» (open source, for Intel graphics chipsets), «Mesa» (open source, for all open source drivers in the Linux kernel). 1 implementation for Accelerator: the Intel one for Xeon Phi. 1 implementation for FPGA: the Intel one for Altera FPGAs. On other platforms, it seems to be possible.

21 Goal: replace «loop» by «distribute». 2 types of distribution: Blocks/WorkItems on the «global» domain, a large but slow amount of memory available; Threads on the «local» domain, a small but quick amount of memory available, needed for the synchronisation of processes. Different access to memory. «clinfo» gets the properties of the CL devices, e.g. Max work items: 1024x1024x1024 for a CPU, so about 1.1 billion distributions.
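A minimal PyOpenCL sketch (assuming the pyopencl module and at least one OpenCL implementation are installed) that lists the same properties clinfo reports:

import pyopencl as cl

# Rough "clinfo" in PyOpenCL: list platforms, devices and their work-item limits
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(platform.name, "/", device.name)
        print("  Max work-group size :", device.max_work_group_size)
        print("  Max work-item sizes :", device.max_work_item_sizes)
        print("  Global memory (MB)  :", device.global_mem_size // 2**20)
        print("  Local memory (KB)   :", device.local_mem_size // 2**10)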

22 How to do it? A little example: a «Hello World» in OpenCL... Define two vectors in ASCII. Duplicate them into 2 huge vectors. Add them using an OpenCL kernel. Print them on screen. Addition of 2 vectors, a+b=c: for each n, c[n] = a[n] + b[n]. The End.

23 Kernel code, build and call: add vectors in OpenCL.

__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)
{
    // Index of the elements to add
    unsigned int n = get_global_id(0);
    // Sum the n'th element of vectors a and b and store in c
    c[n] = a[n] + b[n];
}

OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None, GPUOutputVector, GPUVector1, GPUVector2)

24 How to OpenCL? Write «Hello World!» in C:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// OpenCL kernel
const char* OpenCLSource[] = {
"__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)",
"{",
"  // Index of the elements to add \n",
"  unsigned int n = get_global_id(0);",
"  // Sum the n'th element of vectors a and b and store in c \n",
"  c[n] = a[n] + b[n];",
"}"
};

int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};

#define SIZE 2048

int main(int argc, char **argv)
{
  int HostVector1[SIZE], HostVector2[SIZE];
  for(int c = 0; c < SIZE; c++) {
    HostVector1[c] = InitialData1[c%20];
    HostVector2[c] = InitialData2[c%20];
  }

  cl_platform_id cpPlatform;
  clGetPlatformIDs(1, &cpPlatform, NULL);

  cl_int ciErr1;
  cl_device_id cdDevice;
  ciErr1 = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &cdDevice, NULL);

  cl_context GPUContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, &ciErr1);
  cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, cdDevice, 0, NULL);

  cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(int) * SIZE, HostVector1, NULL);
  cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(int) * SIZE, HostVector2, NULL);
  cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
                                          sizeof(int) * SIZE, NULL, NULL);

  cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);
  clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
  cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);

  clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector);
  clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
  clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);

  size_t WorkSize[1] = {SIZE}; // one dimensional Range
  // Kernel Call
  clEnqueueNDRangeKernel(cqCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, NULL, 0, NULL, NULL);

  int HostOutputVector[SIZE];
  clEnqueueReadBuffer(cqCommandQueue, GPUOutputVector, CL_TRUE, 0, SIZE * sizeof(int),
                      HostOutputVector, 0, NULL, NULL);

  clReleaseKernel(OpenCLVectorAdd);
  clReleaseProgram(OpenCLProgram);
  clReleaseCommandQueue(cqCommandQueue);
  clReleaseContext(GPUContext);
  clReleaseMemObject(GPUVector1);
  clReleaseMemObject(GPUVector2);
  clReleaseMemObject(GPUOutputVector);

  for (int Rows = 0; Rows < (SIZE/20); Rows++) {
    printf("\t");
    for(int c = 0; c < 20; c++) {
      printf("%c", (char)HostOutputVector[Rows * 20 + c]);
    }
  }
  printf("\n\nThe End\n\n");
  return 0;
}

25 How to OpenCL? Write «Hello World!» in Python:

import pyopencl as cl
import numpy
import numpy.linalg as la
import sys

# OpenCL kernel
OpenCLSource = """
__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)
{
  // Index of the elements to add
  unsigned int n = get_global_id(0);
  // Sum the n'th element of vectors a and b and store in c
  c[n] = a[n] + b[n];
}
"""

InitialData1=[37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17]
InitialData2=[35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15]

SIZE=2048

HostVector1=numpy.zeros(SIZE).astype(numpy.int32)
HostVector2=numpy.zeros(SIZE).astype(numpy.int32)
for c in range(SIZE):
    HostVector1[c] = InitialData1[c%20]
    HostVector2[c] = InitialData2[c%20]

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
GPUVector1 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector1)
GPUVector2 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector2)
GPUOutputVector = cl.Buffer(ctx, mf.WRITE_ONLY, HostVector1.nbytes)

OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
# Kernel Call
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None, GPUOutputVector, GPUVector1, GPUVector2)

HostOutputVector = numpy.empty_like(HostVector1)
cl.enqueue_copy(queue, HostOutputVector, GPUOutputVector)

GPUVector1.release()
GPUVector2.release()
GPUOutputVector.release()

OutputString=''
for rows in range(SIZE/20):
    OutputString+='\t'
    for c in range(20):
        OutputString+=chr(HostOutputVector[rows*20+c])
print OutputString
sys.stdout.write("\nThe End\n\n")

26 How to OpenCL? «Hello World» in C and Python: weighing them. On the previous OpenCL implementations: in C, 75 lines, 262 words, 2848 bytes; in Python, 51 lines, 137 words, 1551 bytes. Factors: 0.68, 0.52, 0.54 in lines, words and bytes. Programming OpenCL: a difficult programming context («to open the box is more difficult than to assemble the furniture unit!»); a requirement to simplify the calls through an API; no compatibility between the SDKs of AMD and Nvidia; everyone rewrites their own API. One solution, however «the» solution: Python.

27 Which simple code to implement in //? PiMC: Pi by the method of the target (dartboard). A historical example of the Monte Carlo method. Parallel implementation: equivalent distribution of the iterations. From 2 to 4 parameters: number of iterations, Parallel Rate (PR), (type of variable: INT32, INT64, FP32, FP64), (RNG: MWC, CONG, SHR3, KISS). 2 simple observables: the estimation of Pi (just indicative, Pi not being rational :-)) and the Elapsed Time (individual or global).
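A serial NumPy reference for the same computation (a sketch, not the code benchmarked in the slides): the parallel versions on the next slide split the iterations evenly between the PR work-items and use lighter RNGs (MWC, CONG, SHR3, KISS).

import numpy as np

def pi_monte_carlo(iterations, seed=0):
    # Draw points uniformly in the unit square, count those inside the quarter disk
    rng = np.random.default_rng(seed)
    x = rng.random(iterations, dtype=np.float32)
    y = rng.random(iterations, dtype=np.float32)
    inside = np.count_nonzero(x*x + y*y <= 1.0)
    return 4.0 * inside / iterations

print(pi_monte_carlo(10000000))   # ~3.141..., just indicative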

28 Differences between CUDA and OpenCL: the same Monte Carlo loop.

// Common device function (prefixed with __device__ on the CUDA side)
ulong MainLoop(ulong iterations, uint seed_w, uint seed_z, size_t work)
{
  uint jcong = seed_z + work;
  ulong total = 0;
  for (ulong i = 0; i < iterations; i++) {
    float x = CONGfp;
    float y = CONGfp;
    ulong inside = ((x*x + y*y) <= THEONE) ? 1 : 0;
    total += inside;
  }
  return(total);
}

// CUDA kernel: indexed by blockIdx.x
__global__ void MainLoopBlocks(ulong *s, ulong iterations, uint seed_w, uint seed_z)
{
  ulong total = MainLoop(iterations, seed_z, seed_w, blockIdx.x);
  s[blockIdx.x] = total;
  __syncthreads();
}

// OpenCL kernel: indexed by get_global_id(0)
__kernel void MainLoopGlobal(__global ulong *s, ulong iterations, uint seed_w, uint seed_z)
{
  ulong total = MainLoop(iterations, seed_z, seed_w, get_global_id(0));
  barrier(CLK_GLOBAL_MEM_FENCE);
  s[get_global_id(0)] = total;
}

29 You think Python is not efficient? Let's have a quick comparison... [Chart: ITOPS reached by MPI+PyOpenCL HT, MPI+PyOpenCL NoHT, MPI+OpenMP/C and MPI/C.]

30 Pi Dart Board match (yesterday): Python vs C & CPU vs GPU vs Phi. [Chart: ITOPS for a 64-node/8-core cluster (MPI/C, MPI+OpenMP/C, MPI+PyCL HT, MPI+PyCL NoHT), CPUs from single Pentiums up to 2x Xeon E5-2680v4, AMD FX-9590 and Ryzen7 1800, MIC (Xeon Phi 7120P), AMD/ATI GPUs (HD, R7/R9, Fury, Nano, Kaveri APU), and Nvidia GPUs (GT/GTX/Quadro/Tesla, «Tesla for Pros», «GTX for Gamers»), some with 2 chips on one board or 2 boards in one host.]

31 With 1 QPU, low «regime»... The CPU explodes the GPU! From 2 to 5x slower! [Chart: PiOpenCL/PiOCL at base and boost clocks for GTX 1080, GTX 980Ti, Tesla K40c, Quadro K4000, GTX Titan, R9-290, R9-295X2 and Xeon Phi versus Xeon E5 and X5550 CPUs; note the orders of magnitude on the axis.]

32 Small comparison between *PUs: the MIC Phi emerges, the GPU lies in ambush. From 1 to 1024 PR.

33 Deep exploration of Parallel Rates: Parallel Rate from 1024 upwards. Xeon Phi with 60 cores: 60x27, 60x41, 60x64. AMD HD7970 with 2048 ALUs and AMD R9-290X with 2816 ALUs: 2048x4, 2816x4. Nvidia GTX Titan & Tesla K40: all prime numbers > ...

34 Deep exploration of Parallel Rates, from 1024 upwards. AMD HD7970 & R9-290: long period, 4x the number of ALUs. Nvidia GTX Titan & Tesla K40: short period, the number of SMX units, i.e. a period of 14 (14 SMX) and a period of 15 (15 SMX).

35 And the latest Nvidia GTX 1080: always affected by oddities? Single precision, double precision: variability without real structure.

36 Behaviour of 2 (GP)GPUs vs PR.

37 Is the Parallel Envelope reproducible? For 2 specific boards: Nvidia GT 620 and AMD HD 6450.

38 Variability: a distinctive criterion! What a strange property :-/... Mode Blocks vs Mode Threads.

39 A «memory-bound» test: the Splutter code on OpenCL devices. Intel IvyBridge & Haswell, Accelerator Xeon Phi, GPU AMD/ATI HD 7970 & R9 290X, GPU Nvidia GT, GTX, Tesla.

40 "Back to the Physics" A Newtonian N-body code... Pass on a code "fine grain" Code N-body very simply integrated Newton's second law, autogravitating system Differential integration methods: Euler Implicite, Euler Explicit, Heun, Runge Kutta Settings : Physical: number of particles Numeric: no integration, number of steps Then: influence FP32, FP64 4/48

41 "Back to the Physics" A Newtonian N-body code... Pass on a code "fine grain" Code N-body very simply integrated Newton's second law, autogravitating system Differential integration methods: Euler Implicite, Euler Explicit, Heun, Runge Kutta Settings : Physical: number of particles Numeric: no integration, number of steps Then: influence FP32, FP64 41/48

42 Very different performances: Nvidia, AMD, Intel, single/double. [Chart: D²/T_FP32 and D²/T_FP64 across GTX boards («GTX for Gamers»), Tesla boards («Tesla for Pros»), AMD/ATI boards («AMD/ATI for Alt*»), Xeon Phi and CPUs.]

43 For high-powered *PUs: from «load» to saturation... [Charts: single and double precision, results expressed in Size²/Elapsed, for GTX 1080, Nano, Xeon Phi, FX-9590 and 2x E5-2687v3.]

44 But if the GPU is so powerful, what about its original goal? [Charts: performance with Xorg on/off, SP and DP.] When using a GPU for compute, beware of Xorg! For AMD/ATI, annoying when running as a non-root user...

45 Performances normalized to TDP: Nvidia, AMD, Intel, single/double. [Chart: Ratio32/W and Ratio64/W across GTX, Tesla and AMD/ATI boards, Xeon Phi and CPUs.]

46 Performances normalized to price: Nvidia, AMD, Intel, single/double. [Chart: Ratio32/Price and Ratio64/Price across the same set of devices.]

47 Performances normalized to Cores.MHz: Nvidia, AMD, Intel, single/double. [Chart: Ratio32 and Ratio64 across the same set of devices.]

48 But what's the use of all this? Characterize & avoid regrets... Now, the «brute» power is inside GPUs. But a QPU (PR=1) of a GPU is 5x slower than a CPU: to exploit a GPU, the parallel rate MUST be over 1. To program a GPU, prefer: the developer approach, Python with PyOpenCL (or PyCUDA); or the integrator approach, optimized external libraries. In all cases, instrument your launches!
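«Instrument your launches» can be as simple as the sketch below (illustrative: compute() is a hypothetical placeholder for whatever kernel or library call you benchmark): time each call over a range of problem sizes and report a size-normalized rate, in the spirit of the Size²/Elapsed unit used on slide 43.

import time
import numpy as np

def compute(size):
    # Placeholder workload: a matrix product through the local BLAS
    a = np.random.rand(size, size)
    return a @ a

for size in (256, 512, 1024, 2048):
    start = time.time()
    compute(size)
    elapsed = time.time() - start
    print("size=%5d  elapsed=%8.3f s  size^2/elapsed=%.3e" % (size, elapsed, size * size / elapsed))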
