Astrosim, Les «Houches» the 7th ;-) GPU: the disruptive technology of the 21st century. A little comparison between CPU, MIC, GPU.


1 Astrosim 2017, Les «Houches» the 7th ;-) GPU: the disruptive technology of the 21st century. A little comparison between CPU, MIC, GPU. Emmanuel Quémener

2 A (human) generation before... A serial B film in 1984: The Last Starfighter. 27 minutes of synthetic images, 10⁹ operations per image. Use of a Cray X-MP (13 kW): 66 days (in fact, 1 year needed). 2016: on a GTX 1080Ti (250 W), 3.4 seconds. Comparison GTX 1080Ti / Cray: performance 10⁶! Consumption ~9!

3 «The beginning of all things»: «Nvidia Launches Tesla Personal Supercomputer». When: on 19/11/2008. Where: on Tom's Hardware. What: a PCIe card, the Tesla C1060, with 240 cores. How much: 933 Gflops SP (but 78 Gflops DP). Who: Nvidia

4 What position for GPUs? The other «accelerators». Accelerator, or the old history of coprocessors: the 8087 (on 8086/8088) for floating-point operations; 1989: the 80387 (on 80386) and the respect of IEEE; the 80486DX and the integration of the FPU inside the CPU; 1997: K6 3DNow! & Pentium MMX, SIMD inside the CPU; 1999: SSE functions and the birth of a long series (SSE4 & AVX). When chips stand out from the CPU: 1998: DSPs in the style of the TMS320C67x as tools; 2008: the Cell inside the PS3 and inside IBM's Roadrunner, #1 of the Top500. Business of compilers & a very accurate programming model.

5 Why is a GPU so powerful? To construct a 3D scene! 2 approaches: raytracing (POV-Ray) and shading (3 operations). Raytracing: starting from the eye towards the objects of the scene. Shading: Model2World, vectorial objects placed in the scene; World2View, projection of the objects onto a view plane; View2Projection, pixelization of the vectorial view plane.

6 Why is a GPU so powerful? Shading & matrix computing. Model2World (M2W): 3 matrix products (rotation, translation, scaling). World2View (W2V): 2 matrix products (camera position, direction of the point looked at). View2Projection (V2P): pixellization. A GPU: a «huge» matrix multiplier.
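To make the matrix pipeline above concrete, here is a minimal NumPy sketch (not taken from the slides; names and values are illustrative) chaining scaling, rotation and translation into a single Model2World matrix in homogeneous coordinates:

import numpy as np

def scaling(sx, sy, sz):
    return np.diag([sx, sy, sz, 1.0])

def rotation_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0, 0.0],
                     [s,  c, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def translation(tx, ty, tz):
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

# Model2World = Translation . Rotation . Scaling : 3 matrix products
M2W = translation(1.0, 2.0, 0.0) @ rotation_z(np.pi / 4) @ scaling(2.0, 2.0, 2.0)

# One vertex in homogeneous coordinates, placed into the scene
vertex = np.array([1.0, 0.0, 0.0, 1.0])
print(M2W @ vertex)

Applied to millions of vertices per frame, these small 4x4 products are exactly the workload a GPU is built to pipeline.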

7 Is the GPU really so powerful? With my best MyriALUs. [Chart: xGEMM performance for OpenBLAS, clBLAS, Thunking and CuBLAS, in SP and DP, on a CPU (2x E5-2680v4), Tesla P100 («Tesla for Pros»), GTX 1080Ti («GTX for Gamers»), R9 Fury («AMD for Alt*»), Xeon Phi and Ryzen7 1800.]

8 BLAS match, CPU vs GPU: xGEMM when x=D(P) or S(P). [Chart: OpenBLAS, clBLAS, Thunking and CuBLAS in SP and DP across Tesla boards («Tesla for Pros»), GeForce GTX boards («GTX for Gamers»), AMD boards («AMD for Alt*»), Quadro, Xeon Phi and CPUs (Xeon E5 v3/v4, FX, Ryzen7).]

9 What Nvidia never broadcasts... xGEMM on a large domain of sizes. [Chart: performance vs size for Tesla P100 and GTX 1080.] Yes, the performance exists, but in specific conditions!

10 How a GPU must be considered: a laboratory of parallelism. You've got dozens of thousands of Equivalent PUs! You have the choice between: CPU, MIC, GPU.

11 How many elements inside? Difference between GPU & CPU. [Tesla P100] Operations: matrix multiply, vectorization, pipelining, shader (multi)processors. Programming: 1993. Genericity: 2002. OpenGL, Glide, Direct3D, Cg Toolkit, CUDA, OpenCL.

12 Why is the GPU so powerful? The comeback of generic processing. Inside the GPU: specialized processing units, pipelining efficiency. Adaptation loss: changing scenes, different nature of details. The solution: more generalist processing units!

13 Why is the GPU so powerful? All the «M» of Flynn's taxonomy included. Vectorial: SIMD (Single Instruction Multiple Data), the addition of 2 positions (x,y,z) in 1 single command. Parallel: MIMD (Multiple Instructions Multiple Data), several executions in parallel with (almost) the same data. In fact, SIMT: Single Instruction Multiple Threads. All processing units share the Threads; each processing unit can work independently from the others; need to synchronize the Threads.
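A minimal NumPy sketch (illustrative, not from the slides) of the «one instruction, many data» idea: the same addition applied to a million (x,y,z) positions as a single vectorized statement instead of an explicit loop.

import numpy as np

# One million positions (x, y, z)
a = np.random.rand(1000000, 3).astype(np.float32)
b = np.random.rand(1000000, 3).astype(np.float32)

# Scalar view: one addition per position, one after the other
# (slow in pure Python: that is precisely the point)
c = np.empty_like(a)
for n in range(a.shape[0]):
    c[n] = a[n] + b[n]

# SIMD/SIMT view: the same work expressed as a single command over all the data
c = a + b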

14 Important dates: OpenGL and the birth of a standard; OpenGL 1.2 and its interesting functions; the Cg Toolkit (Nvidia) and the extensions of the language, with wrappers for all languages (Python); 2007-06: CUDA (Nvidia), or the arrival of a real language; 2008-06: Snow Leopard (Apple) integrates OpenCL (the will to make the best use of one's machine?); OpenCL 1.0 and its first specifications; WebCL and its first version by Nokia.

15 Who uses OpenCL? As applications. OS: Apple in MacOSX. «Big» applications: LibreOffice, FFmpeg. Known graphical applications: Photoscan, ... Scientific ones: OpenMM, Gromacs, Lammps, Amber, ... Hundreds of software packages and libraries! But checking is essential.

16 Why the GPU is a disruptive technology: how it reshuffles the cards in Scientific Computing. In a conference in February 2006, this x1 in 5 years. Between 2000 and 2015, GeForce 2 Go / GTX 980Ti: from 286 to 2816 MOperations/s: x1. Classical CPU: x1.

17 Just a reminder of last Friday! Parallel programming models:

Model    | Cluster | Node CPU | Node GPU | Node Nvidia | Accelerator
MPI      | Yes     | Yes      | No       | No          | Yes*
PVM      | Yes     | Yes      | No       | No          | Yes*
OpenMP   | No      | Yes      | No       | No          | Yes*
Pthreads | No      | Yes      | No       | No          | Yes*
OpenCL   | No      | Yes      | Yes      | Yes         | Yes
CUDA     | No      | No       | No       | Yes         | No
TBB      | No      | Yes      | No       | No          | Yes*

OpenCL is the most versatile one. CUDA can ONLY be used with Nvidia GPUs. The Accelerator seems to be the most universal, but...

18 Act as an integrator: do not reinvent the wheel! Parallel programming libraries:

Library | Cluster        | Node CPU      | Node GPU | Node Nvidia | Accelerator
BLAS    | BLACS, MKL     | OpenBLAS, MKL | clBLAS   | CuBLAS      | OpenBLAS, MKL
LAPACK  | ScaLAPACK, MKL | Atlas, MKL    | clMAGMA  | MAGMA       | MagmaMIC
FFT     | FFTw3          | FFTw3         | clFFT    | CuFFT       | FFTw3

Classic libraries can be used on the GPU. Nvidia provides lots of implementations. OpenCL provides several of them, but they are not as complete.
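As a minimal sketch of the integrator approach (assuming a NumPy build linked against an optimized BLAS such as OpenBLAS or MKL), the xGEMM and FFT operations listed above are reached from Python without writing any kernel:

import numpy as np

# DGEMM: dispatched to whatever BLAS this NumPy build is linked against (OpenBLAS, MKL, ...)
A = np.random.rand(2048, 2048)
B = np.random.rand(2048, 2048)
C = A @ B

# FFT: NumPy's built-in FFT here; FFTw3, clFFT or CuFFT expose the same operation elsewhere
signal = np.random.rand(1 << 20)
spectrum = np.fft.rfft(signal)
print(C.shape, spectrum.shape)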

19 Why OpenCL? A non politically correct history. In the middle of the 2000s, Apple needs power for MacOSX. Some processing can be done by Nvidia chipsets. CUDA exists as a good successor of the Cg Toolkit. But Apple did not want to be jailed (as their users :-) ). Apple promotes the creation of the Khronos consortium. AMD, IBM and ARM came quickly; Nvidia, Intel, Altera, ... came along more reluctantly...

20 What OpenCL offers: about ten implementations on x86. 3 implementations for CPU: the AMD one (the original), the Intel one (efficient for specific parallel rates), PortableCL (POCL, open source). 4 implementations for GPU: the AMD one (for AMD/ATI graphics boards), the Nvidia one (for Nvidia graphics boards), «Beignet» (open source, for Intel graphics chipsets), «Mesa» (open source, for all open source drivers in the Linux kernel). 1 implementation for Accelerator: the Intel one for Xeon Phi. 1 implementation for FPGA: the Intel one for Altera FPGAs. On other platforms, it seems to be possible.

21 Goal: replace «loop» by «distribute». 2 types of distribution: Blocks/WorkItems on the «global» domain, a large but slow amount of memory available; Threads on the «local» domain, a small but quick amount of memory available, needed for the synchronisation of processes. Different access to memory. «clinfo» gets the properties of the CL devices, e.g. Max work items: 1024x1024x1024 for a CPU, so about 1.1 billion distributions.
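A minimal PyOpenCL sketch (assuming the pyopencl module and at least one OpenCL implementation are installed) that lists the same properties clinfo reports:

import pyopencl as cl

# Rough "clinfo" in PyOpenCL: list platforms, devices and their work-item limits
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(platform.name, "/", device.name)
        print("  Max work-group size :", device.max_work_group_size)
        print("  Max work-item sizes :", device.max_work_item_sizes)
        print("  Global memory (MB)  :", device.global_mem_size // 2**20)
        print("  Local memory (KB)   :", device.local_mem_size // 2**10)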

22 How to do it? A little example: a «Hello World» in OpenCL... Define two vectors in ASCII. Duplicate them into 2 huge vectors. Add them using an OpenCL kernel. Print them on screen. Addition of 2 vectors, a+b=c: for each n, c[n] = a[n] + b[n]. The End.

23 Kernel code, build and call: add vectors in OpenCL.

__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)
{
    // Index of the elements to add
    unsigned int n = get_global_id(0);
    // Sum the n'th element of vectors a and b and store in c
    c[n] = a[n] + b[n];
}

OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None, GPUOutputVector, GPUVector1, GPUVector2)

24 How to OpenCL? Write «Hello World!» in C:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// OpenCL kernel
const char* OpenCLSource[] = {
"__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)",
"{",
"  // Index of the elements to add \n",
"  unsigned int n = get_global_id(0);",
"  // Sum the n'th element of vectors a and b and store in c \n",
"  c[n] = a[n] + b[n];",
"}"
};

int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};

#define SIZE 2048

int main(int argc, char **argv)
{
  int HostVector1[SIZE], HostVector2[SIZE];
  for(int c = 0; c < SIZE; c++) {
    HostVector1[c] = InitialData1[c%20];
    HostVector2[c] = InitialData2[c%20];
  }

  cl_platform_id cpPlatform;
  clGetPlatformIDs(1, &cpPlatform, NULL);

  cl_int ciErr1;
  cl_device_id cdDevice;
  ciErr1 = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &cdDevice, NULL);

  cl_context GPUContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, &ciErr1);
  cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, cdDevice, 0, NULL);

  cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(int) * SIZE, HostVector1, NULL);
  cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(int) * SIZE, HostVector2, NULL);
  cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
                                          sizeof(int) * SIZE, NULL, NULL);

  cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);
  clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
  cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);

  clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector);
  clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
  clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);

  size_t WorkSize[1] = {SIZE}; // one dimensional Range
  // Kernel Call
  clEnqueueNDRangeKernel(cqCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, NULL, 0, NULL, NULL);

  int HostOutputVector[SIZE];
  clEnqueueReadBuffer(cqCommandQueue, GPUOutputVector, CL_TRUE, 0, SIZE * sizeof(int),
                      HostOutputVector, 0, NULL, NULL);

  clReleaseKernel(OpenCLVectorAdd);
  clReleaseProgram(OpenCLProgram);
  clReleaseCommandQueue(cqCommandQueue);
  clReleaseContext(GPUContext);
  clReleaseMemObject(GPUVector1);
  clReleaseMemObject(GPUVector2);
  clReleaseMemObject(GPUOutputVector);

  for (int Rows = 0; Rows < (SIZE/20); Rows++) {
    printf("\t");
    for(int c = 0; c < 20; c++) {
      printf("%c", (char)HostOutputVector[Rows * 20 + c]);
    }
  }
  printf("\n\nThe End\n\n");
  return 0;
}

25 How to OpenCL? Write «Hello World!» in Python:

import pyopencl as cl
import numpy
import numpy.linalg as la
import sys

# OpenCL kernel
OpenCLSource = """
__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)
{
  // Index of the elements to add
  unsigned int n = get_global_id(0);
  // Sum the n'th element of vectors a and b and store in c
  c[n] = a[n] + b[n];
}
"""

InitialData1=[37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17]
InitialData2=[35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15]

SIZE=2048

HostVector1=numpy.zeros(SIZE).astype(numpy.int32)
HostVector2=numpy.zeros(SIZE).astype(numpy.int32)
for c in range(SIZE):
    HostVector1[c] = InitialData1[c%20]
    HostVector2[c] = InitialData2[c%20]

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
GPUVector1 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector1)
GPUVector2 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector2)
GPUOutputVector = cl.Buffer(ctx, mf.WRITE_ONLY, HostVector1.nbytes)

OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
# Kernel Call
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None, GPUOutputVector, GPUVector1, GPUVector2)

HostOutputVector = numpy.empty_like(HostVector1)
cl.enqueue_copy(queue, HostOutputVector, GPUOutputVector)

GPUVector1.release()
GPUVector2.release()
GPUOutputVector.release()

OutputString=''
for rows in range(SIZE/20):
    OutputString+='\t'
    for c in range(20):
        OutputString+=chr(HostOutputVector[rows*20+c])
print OutputString
sys.stdout.write("\nThe End\n\n")

26 How to OpenCL? «Hello World» in C and Python: weighing them. On the previous OpenCL implementations: in C, 75 lines, 262 words, 2848 bytes; in Python, 51 lines, 137 words, 1551 bytes. Factors: 0.68, 0.52, 0.54 in lines, words and bytes. Programming OpenCL: a difficult programming context («to open the box is more difficult than to assemble the furniture unit!»); a requirement to simplify the calls through an API; no compatibility between the SDKs of AMD and Nvidia; everyone rewrites their own API. One solution, however «the» solution: Python.

27 Which simple code to implement in //? PiMC: Pi by the method of the target (dartboard). A historical example of the Monte Carlo method. Parallel implementation: equivalent distribution of the iterations. From 2 to 4 parameters: number of iterations, Parallel Rate (PR), (type of variable: INT32, INT64, FP32, FP64), (RNG: MWC, CONG, SHR3, KISS). 2 simple observables: the estimation of Pi (just indicative, Pi not being rational :-)) and the Elapsed Time (individual or global).
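A serial NumPy reference for the same computation (a sketch, not the code benchmarked in the slides): the parallel versions on the next slide split the iterations evenly between the PR work-items and use lighter RNGs (MWC, CONG, SHR3, KISS).

import numpy as np

def pi_monte_carlo(iterations, seed=0):
    # Draw points uniformly in the unit square, count those inside the quarter disk
    rng = np.random.default_rng(seed)
    x = rng.random(iterations, dtype=np.float32)
    y = rng.random(iterations, dtype=np.float32)
    inside = np.count_nonzero(x*x + y*y <= 1.0)
    return 4.0 * inside / iterations

print(pi_monte_carlo(10000000))   # ~3.141..., just indicative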

28 Differences between CUDA and OpenCL: the same Monte Carlo loop.

// Common device function (prefixed with __device__ on the CUDA side)
ulong MainLoop(ulong iterations, uint seed_w, uint seed_z, size_t work)
{
  uint jcong = seed_z + work;
  ulong total = 0;
  for (ulong i = 0; i < iterations; i++) {
    float x = CONGfp;
    float y = CONGfp;
    ulong inside = ((x*x + y*y) <= THEONE) ? 1 : 0;
    total += inside;
  }
  return(total);
}

// CUDA kernel: indexed by blockIdx.x
__global__ void MainLoopBlocks(ulong *s, ulong iterations, uint seed_w, uint seed_z)
{
  ulong total = MainLoop(iterations, seed_z, seed_w, blockIdx.x);
  s[blockIdx.x] = total;
  __syncthreads();
}

// OpenCL kernel: indexed by get_global_id(0)
__kernel void MainLoopGlobal(__global ulong *s, ulong iterations, uint seed_w, uint seed_z)
{
  ulong total = MainLoop(iterations, seed_z, seed_w, get_global_id(0));
  barrier(CLK_GLOBAL_MEM_FENCE);
  s[get_global_id(0)] = total;
}

29 You think Python is not efficient? Let's have a quick comparison... [Chart: ITOPS reached by MPI+PyOpenCL HT, MPI+PyOpenCL NoHT, MPI+OpenMP/C and MPI/C.]

30 Pi Dart Board match (yesterday): Python vs C & CPU vs GPU vs Phi. [Chart: ITOPS for a 64-node/8-core cluster (MPI/C, MPI+OpenMP/C, MPI+PyCL HT, MPI+PyCL NoHT), CPUs from single Pentiums up to 2x Xeon E5-2680v4, AMD FX-9590 and Ryzen7 1800, MIC (Xeon Phi 7120P), AMD/ATI GPUs (HD, R7/R9, Fury, Nano, Kaveri APU), and Nvidia GPUs (GT/GTX/Quadro/Tesla, «Tesla for Pros», «GTX for Gamers»), some with 2 chips on one board or 2 boards in one host.]

31 With 1 QPU, low «regime»... The CPU explodes the GPU! From 2 to 5x slower! [Chart: PiOpenCL/PiOCL at base and boost clocks for GTX 1080, GTX 980Ti, Tesla K40c, Quadro K4000, GTX Titan, R9-290, R9-295X2 and Xeon Phi versus Xeon E5 and X5550 CPUs; note the orders of magnitude on the axis.]

32 Small comparison between *PUs: the MIC Phi emerges, the GPU lies in ambush. From 1 to 1024 PR.

33 Deep exploration of Parallel Rates: Parallel Rate from 1024 upwards. Xeon Phi with 60 cores: 60x27, 60x41, 60x64. AMD HD7970 with 2048 ALUs and AMD R9-290X with 2816 ALUs: 2048x4, 2816x4. Nvidia GTX Titan & Tesla K40: all prime numbers > ...

34 Deep exploration of Parallel Rates, from 1024 upwards. AMD HD7970 & R9-290: long period, 4x the number of ALUs. Nvidia GTX Titan & Tesla K40: short period, the number of SMX units, i.e. a period of 14 (14 SMX) and a period of 15 (15 SMX).

35 And the latest Nvidia GTX 1080: always affected by oddities? Single precision, double precision: variability without real structure.

36 Behaviour of 2 (GP)GPUs vs PR.

37 Is the Parallel Envelope reproducible? For 2 specific boards: Nvidia GT 620 and AMD HD 6450.

38 Variability: a distinctive criterion! What a strange property :-/... Mode Blocks vs Mode Threads.

39 A «memory-bound» test: the Splutter code on OpenCL devices. Intel IvyBridge & Haswell, Accelerator Xeon Phi, GPU AMD/ATI HD 7970 & R9 290X, GPU Nvidia GT, GTX, Tesla.

40 "Back to the Physics" A Newtonian N-body code... Pass on a code "fine grain" Code N-body very simply integrated Newton's second law, autogravitating system Differential integration methods: Euler Implicite, Euler Explicit, Heun, Runge Kutta Settings : Physical: number of particles Numeric: no integration, number of steps Then: influence FP32, FP64 4/48

41 "Back to the Physics" A Newtonian N-body code... Pass on a code "fine grain" Code N-body very simply integrated Newton's second law, autogravitating system Differential integration methods: Euler Implicite, Euler Explicit, Heun, Runge Kutta Settings : Physical: number of particles Numeric: no integration, number of steps Then: influence FP32, FP64 41/48

42 Very different performances: Nvidia, AMD, Intel, single/double. [Chart: D²/T_FP32 and D²/T_FP64 across GTX boards («GTX for Gamers»), Tesla boards («Tesla for Pros»), AMD/ATI boards («AMD/ATI for Alt*»), Xeon Phi and CPUs.]

43 For high-powered *PUs: from «load» to saturation... [Charts: single and double precision, results expressed in Size²/Elapsed, for GTX 1080, Nano, Xeon Phi, FX-9590 and 2x E5-2687v3.]

44 But if the GPU is so powerful, what about its original goal? [Charts: performance with Xorg on/off, SP and DP.] When using a GPU for compute, beware of Xorg! For AMD/ATI, annoying when running as a non-root user...

45 Performances normalized to TDP: Nvidia, AMD, Intel, single/double. [Chart: Ratio32/W and Ratio64/W across GTX, Tesla and AMD/ATI boards, Xeon Phi and CPUs.]

46 Performances normalized to price: Nvidia, AMD, Intel, single/double. [Chart: Ratio32/Price and Ratio64/Price across the same set of devices.]

47 Performances normalized to Cores.MHz: Nvidia, AMD, Intel, single/double. [Chart: Ratio32 and Ratio64 across the same set of devices.]

48 But what's the use of all this? Characterize & avoid regrets... Now, the «brute» power is inside GPUs. But a QPU (PR=1) of a GPU is 5x slower than a CPU: to exploit a GPU, the parallel rate MUST be over 1. To program a GPU, prefer: the developer approach, Python with PyOpenCL (or PyCUDA); or the integrator approach, optimized external libraries. In all cases, instrument your launches!
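«Instrument your launches» can be as simple as the sketch below (illustrative: compute() is a hypothetical placeholder for whatever kernel or library call you benchmark): time each call over a range of problem sizes and report a size-normalized rate, in the spirit of the Size²/Elapsed unit used on slide 43.

import time
import numpy as np

def compute(size):
    # Placeholder workload: a matrix product through the local BLAS
    a = np.random.rand(size, size)
    return a @ a

for size in (256, 512, 1024, 2048):
    start = time.time()
    compute(size)
    elapsed = time.time() - start
    print("size=%5d  elapsed=%8.3f s  size^2/elapsed=%.3e" % (size, elapsed, size * size / elapsed))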
