From modelisation house To test center. Emmanuel Quémener

Size: px

Start display at page:

Download "From modelisation house To test center. Emmanuel Quémener"

Oscar Maxwell
6 years ago
Views:

1 Presentation of IT resources provided by Blaise Pascal Center From modelisation house To test center Emmanuel Quémener

2 Blaise Pascal Center «House of Modelisation» &... Conferences Trainings Projects Hotel June, /3

3 Centre Blaise Pascal ~ Dryden FR Small illustrative example Nasa X-9 Cell from F- Engine from F- ears from F-6 Studies Flap wings Incidence > «Fly-By-Wire» To recycle, to reuse and explore from new domains June, 3/3

4 From engines to parallel envelope A necessary exploration? Flight Envelope Parallel Envelope Software Hardware Speed/altitude Parallelization/Memory/CPU/PU Load factor Input/Output/Contexts/... June, 4/3

5 CBP also a Test Center Experimental platform with technical benchs Multi-nodes: clusters from 4 to 64 nodes, Nodes/Cores : 4/4, /64, /64, 64/ Multi-cores : more than machines from to cores, Nodes/Servers: from to cores, Workstations: from to 6 cores, types of CPU différents PU & Accelerator : 46 different models of PU (AMD & Nvidia), Intel MIC PPU : 9 ; PU Nvidia : ; PU AMD/ATI : types ; Xeon Phi P Integration : 4 virtual machines : Debian from Lenny to Sid both 3 & 64 bits,... Exotic hardware : 3 machines ARMv under Debian Jessie or Ubuntu 3D Vizualisation: stations, videoprojectors, monitors, 4 glasses 3D Virtual Desktop : up to 3 hosts with xgo/virtuall COMOD with SIDUS : «Compute On My Own Device» for laboratories users SIDUS is «Single Instance Distributing Universal System» alaxy prototype for intensive processing of biological data June, /3

6 Some «Digital Benchs» permanent or ephemeral! Schools in Les Houches about «computational physics» : «standalone» complete virtual machine SIDUS virtual machine Digital Humanities : eography : virtual machines and already operational ones Vizualisation workstations for graphical processing Biology : alaxy gateway (with cluster nodes processing) Processing and display stations for MorphoraphX Server provisioning to study & process under Repeat* with lusterfs, TMPFS & BRD Emmanuel Quemener June, 6/3

7 To what does it serve? Example : Repeat* lusterfs/tmpfs Progression «Input Database Coverage» Progression «Family Refinement» Less is Better! More is Better!. 3 R RamDisk. R3 lusterfs R Ramdisk R3 lusterfs Accélération %.3%.9%.9% Emmanuel Quemener June,.4% /

8 An example of interactions Between Laboratory/CBP/PSMN ive U t ap P d a nt AMD y e u B i pm u Eq Preproduction Laboratory Evaluation Test center Nvidia Sy Kno st wem H s & ow Ne SID tw US or ks Production Hardware CBP / Sidus OS CBP Compute Center Influence the choice of hardware background Hardware LBMC / Sidus OS CBP Emmanuel Quemener June, Hardware PSMN / Sidus OS PSMN /3

9 How to manage + machines? SIDUS! Single Instance Distributing Universal System June, 9/3

10 Exemples of studies : predictions... Overshoot Amdahl Law with Mylq One From T=T+p/N to T=T+c*N+p/N PiMC : parallel execution with C/MPI for x x Mesures 4x Itops (Iterative Operations Per Second) 3.x 3x.x x.x x x Amdahl & Mylq Laws Mylq Law Fit on range [-] Mylq Law Mylq Law Fit on range [-6] Amdahl Law Mylq Law Fit on range [-] Mylq Law Amdahl Law 4x 3x x x Mylq Law Fit on range [-64] Itops (Iterative Operations Per Second) 4.x distributed iterations Mesures PiMC : parallel execution with C/MPI for distributed iterations Parallel Rate (NP in MPI) Emmanuel Quemener June, Parallel Rate (NP in MPI) /3 3 4

11 From (multi many)-cores to myri-alus? A parallel laboratory on a chip! You got from to dozen of thousands of equivalent PU! You got the choice between : CPU MIC PU How to choose and distinguish them for its use? June, /3

12 A long long time ago, between 9... and, variable is frequency! Power vs RPM x in years & stop (and even a drop) June, /3

13 The "engine", source of "power" Forget frequency, change RPM to PR 4. Horse Power vs RPM Pthreads OpenMP 3. MPI OpenCL Intel 3. OpenCL AMD. Theory What "behaviors" for these engines at "load-bearing"? The "engine", a system entangling hardware, OS and software June, 3/3

14 Why is the PU so powerful? Shadering & Matrix Calculation Model World: 3 matrix products Rotation Translation Scalinge MW World View: matrix products Positioning the camera Location Direction View Projection WV VP Pixelisation So a PU: it's a "big" Matrix Multiplier June, 4/3

15 BLAS comparaison : CPU vs PU xemm when x=d(p) or S(P). OpenBLAS DP 9. clblas DP.. 6. Tesla for Pros Thunking DP TX for amers CuBLAS DP OpenBLAS SP clblas SP AMD/ATI For Alt*. 4. Thunking SP CuBLAS SP 3. CPUs AMD. Intel.. m n Ti Ti Ti Ti 69 Tita K P TX TX X k C M la K s la la TX X X X o a TX T T T T TX s l es la Tes Te Tes dr e a T T u Q June, 9 HD r y # no Fu x Na 9 R - 9 R9 Xe on i Ph 9 9 X F n ze Ry /3 3 x x 3 x 3 x 4 x 4 x E6 66 v v v v x E E E E E

16 What Nvidia never broadcasts... xemm on a large domain... Performance Tesla P TX Size Yes, performance is there, but in specific conditions! June, 6/3

17 Which simple code to implement in //? PiMC: Pi by the method of the target Historical example of the Monte Carlo method Parallel implementation: Equivalent distribution of iterations From to 4 parameters Number of iterations Parallel Rate (PR) (Type of variable: INT3, INT64, FP3, FP64) (RN: MWC, CON, SHR3, KISS) simple observables: Estimation of Pi (just indicative, Pi not being rational :-)) Elapsed Time (Individual or lobal) June, /3

18 With QPU, low "regime"... The CPU explodes the PU! From to x slower! TX PiOpenCL PiOCL Base PiOpenCL Boost PiOpenCL TX 9Ti PiOCL Base Tesla K4c PiOpenCL Boost Quadro K4 TX Titan R9-9 R9-9X Xeon Phi E-6v3 E-63v i i 66 v v v3 Ph 9X 9 itan 4c T 3 X eon 9- R9 TX T K4 la K X 9 X E R T T o X s dr E E E Te a u Q E-6v E-66 X E. + E. + E E E E June,. order of magnitude... /3

19 Is Python effective? A quick comparison... MPI+PyOpenCL HT MPI+PyOpenCL NoHT MPI+OpenMP/C MPI/C + E. Itops E+. E+. 4 E+. 6 E+. E+. E+. + E 4. ITOPS 64 nodes with cores 3 implementations ( hybrid) June, 9/3

20 Pi Monte Carlo: the match... Python vs C & CPU vs PU vs Phi E+ E+ E+ E+ Tesla for Pros TX for amers AMD/ATI E+ E+ 6E+ 4E+ E+ CPU E+ i i i n i i i 6 9 m m # ds T 6 # ita T T T T T 3 4 ea 6 X 69 X X T X X X X 9 X S o 4 K K4 C M M la K la K K hr la P T V T T T T T a T N adr dr o dr o TX TX T TX TX TX TX T TX sl sla sla es es sla p es u Te Te Te T T Te + T Q Qua Qua K a l s e T June, T T T T T H C c c c T T 9 U 4 ur y ur y 9 # /C /C P 3 I+ 4 4 oh L H PI P P o F 9 F X M E6 MP v3 v4 M nm X- n D D K R an R R R 9 i L N PyC c e h F m m H H e c X C p P N y I+ D yz 9 tiu ti u 4 el 6 6 4n +O R - R IC en en Int tel tel +P P AM D r6 I M PI c M l P el P X x x In x In te MP e M ia M s t t l r A c 4n lu c In In te ve C 4n n 6 x x x In Ka 6 64 ter er er lus t t s s C lu lu C C D /3

21 Small comparison between * PU Low pararegime: the reign of the CPU à PR June, /3

22 Small comparison between *PU MIC Phi Out of Wood, PU De à 4 de PR June, /3

23 Deep Exploration of PR ParaRégime from 4 to 634 Xeon Phi with 6 cœurs 6x 6x4 6x64 6x64 PR from 4 to 634 AMD HD9 with 4 ALU AMD R9-9X with 6 ALU 4x4 6x4 Nvidia TX Titan & Tesla K4 Prime numbers > 4 June, 3/3

24 Very deep exploration of PRs Regime of //ism from 4 to 636 PR from 4 to 636 AMD HD9 & R9-9 Large Period : 4x number of ALU Period of 4 (4 SMX) Nvidia TX Titan & Tesla K4 Small Period : number of SMX units Period of ( SMX) June, 4/3

25 "Back to the Physics" A Newtonian N-body code... Pass on a code "fine grain" Code N-body very simply integrated Newton's second law, autogravitating system Differential integration methods: Euler Implicite, Euler Explicit, Heun, Runge Kutta Settings : Physical: number of particles Numeric: no integration, number of steps Then: influence FP3, FP64 June, /3

26 Very different performances Nvidia, AMD, Intel, Single/Double 3 D²/T_FP3 TX for amers Tesla for Pros D²/T_FP64 AMD/ATI for Alt* CPUs AMD Intel m m Ti Ti Ti Ti K P 6 X6 X9 9 C M la K la K s la la TX T TX T TX TX TX a s l es la Tes Tes Te Tes e T T i X o U x x x x x x x x x x x 9 PU Ph X # Nan P 6 v v v v 3 v 3 v 4 v 4 n 9 K C - o D D D K H H H R9-9 FX en Xe M E6 X z E R y E E E E E E E R A A June, 6/3

27 Performances normalized to cores Nvidia, AMD, Intel, Single / Double Ratio3 Ratio i i i i U hi 6 9 m m K 4 9 9X # ano PU x x x x x x x x x x x T 6 T 9 T T P 9 CP 6 X 4 6 v v v v3 v3 v4 v4 X 6 X 9 P 9 N n K a K C M l a o K TX T TX T TX TX TX HD HD HD R9-9 FX K en Xe la sla sla sla Tes esl M E6 X s z e e T R E y Te Te T T E E E E E E E - R A A June, /3

28 Performances normalized to TDP Nvidia, AMD, Intel, Single / Double Ratio3W Ratio64W 6 4 m 6 K K a C sl a a sl Te sl e e T T 6 9 X X T T TX X HD R9 June, no Na K A U CP 3 6 E /3 x x x x x 3 4 v v v E E E E

29 Performances normalized to Price Nvidia, AMD, Intel, Single/Double 4 Ratio3Pr 4 Ratio64Pr 3 3 i i i i i m m U 4 9 9X # ano PU x x x x x x x x x x x T 6 T 9 T T Ph K P 9 CP 6 X 6 v v v v3 v3 v4 v4 X 6 X 9 N 9 n K K a T T X C M a a sl la o K TX TX TX T TX HD HD HD R9-9 FX K en a l l Xe M E6 X sl esla Tes Tes Te Tes z e R E y T T E E E E E E E - R A A June, 9/3

30 But what's the use of all this? Characterize & avoid regrets... What you must remember : In a standard machine, in 6, the "raw" power is provided by the PU For a low parallelism, the CPU is much more efficient than the PU A PU is greater than the CPU only for PR> The PU achieves its optimum ONLY for some PRs Without a deep characterization, disappointments to foresee... OpenCL is a good compromise like language * PU Python remains the fastest approach of OpenCL What to do: experiment! June, 3/3

Astrosim Les «Houches» the 7th ;-) GPU : the disruptive technology of 21th century. Little comparison between CPU, MIC, GPU.

Astrosim Les «Houches» the 7th ;-) GPU : the disruptive technology of 21th century. Little comparison between CPU, MIC, GPU. Astrosim 217 Les «Houches» the 7th ;-) PU : the disruptive technology of 21th century Little comparison between CPU, MIC, PU Emmanuel Quémener A (human) generation before... A serial B film in 1984 1984