Computing and energy performance

1 IMS Team and INRIA AlGorille Project Team. Computing and energy performance optimization of a multi-algorithm PDE solver on CPU and GPU clusters. Stéphane Vialle, Sylvain Contassot-Vivier, Thomas Jost. 13/01/2011

2 1 First experiments on GPU clusters

3 First experiments on GPU clusters: Experimental testbed
2 x 16-node CPU+GPU clusters, forming a heterogeneous 32-node cluster:
- an older 16-node CPU+GPU cluster and a state-of-the-art 16-node CPU+GPU cluster
- node configurations shown: Xeon dual core + GT8800, Xeon dual core + GT285, Nehalem quad core + GT285, Nehalem quad core + GT480 (regular upgrade of the system)
- 2 Gigabit Ethernet switches
- some external energy monitoring devices: Raritan (Dominion PX)

4 First experiments on GPU clusters: Collection of experiments
3 benchmarks with different features:
1. European option pricer: embarrassingly parallel Monte Carlo computations; the parallel random number generator has been ported to the GPU.
2. PDE solver: intensive computations; regular communications between nodes; some computations (must) remain on the CPU.
3. 2D Jacobi relaxation: repetitive light computations; frequent communications between neighbor nodes.

5 First experiments on GPU clusters: Collection of experiments
[Figure: execution time (s) and energy (Wh) versus number of nodes, on logarithmic scales, for the three benchmarks (Pricer parallel MC, PDE solver synchronous, Jacobi relaxation) on three clusters: monocore CPU cluster, multicore CPU cluster, manycore GPU cluster.]

6 First experiments on GPU clusters: Computational & energy model design
Temporal gain (speedup) and energy gain of the GPU cluster vs the multicore CPU cluster.
[Figure: speedup and energy gain of the GPU cluster vs the multicore CPU cluster, versus number of nodes, for Pricer parallel MC (annotated «OK»), PDE solver synchronous (annotated «Hum») and Jacobi relaxation (annotated «Beyond? Predictions?»).]
Up to 16 nodes this GPU cluster is more attractive than our CPU cluster, but its advantage decreases. Why? And what happens beyond 16 nodes?

8 First experiments on GPU clusters: Computational & energy model design
Computations:
- CPU cluster: $T_{calc}^{CPU}$
- GPU cluster: if the algorithm is adapted to the GPU architecture, $T_{calc}^{GPU} \ll T_{calc}^{CPU}$; else: do not use GPUs!
Communications:
- CPU cluster: $T_{comm}^{CPU} = T_{comm}^{MPI}$
- GPU cluster: $T_{comm}^{GPU} = T_{transfer}^{GPU \to CPU} + T_{comm}^{MPI} + T_{transfer}^{CPU \to GPU}$, hence $T_{comm}^{GPU} \geq T_{comm}^{CPU}$
Total time: $T^{GPU}$ <?> $T^{CPU}$
For a fixed problem size: when the number of nodes increases, $T_{comm}$ becomes dominant and the advantage of the GPU cluster decreases.
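
The time decomposition of this slide can be illustrated numerically. The following is a minimal sketch, not the authors' model: it assumes that computation time divides perfectly by the number of nodes and that the MPI and transfer costs do not depend on the number of nodes. The function names and every numeric value are hypothetical.

```python
def total_time(t_calc_one_node, n_nodes, t_comm_mpi, t_transfer_one_way=0.0):
    """Total time = computation + communication (assumed, simplified form).

    Assumptions (not from the slides): computation scales as 1/n_nodes and
    the communication terms are independent of n_nodes.
    """
    t_calc = t_calc_one_node / n_nodes
    # On a GPU cluster each exchange adds a GPU->CPU and a CPU->GPU transfer.
    t_comm = t_comm_mpi + 2.0 * t_transfer_one_way
    return t_calc + t_comm


def gpu_vs_cpu_speedup(n_nodes):
    """Speedup of the GPU cluster over the CPU cluster (hypothetical timings, in s)."""
    t_cpu = total_time(1000.0, n_nodes, t_comm_mpi=2.0)
    t_gpu = total_time(50.0, n_nodes, t_comm_mpi=2.0, t_transfer_one_way=1.0)
    return t_cpu / t_gpu


if __name__ == "__main__":
    for n in (1, 16, 256, 1000):
        print(n, round(gpu_vs_cpu_speedup(n), 2))
```

With these made-up values the speedup shrinks as nodes are added, and eventually drops below 1 once the communication terms dominate.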

9 2 A first performance model

10 A first performance model: First modelling approach
Observation of the first experimental performances: there is a «scalable area», in which the performance curves of the CPU and GPU clusters have different slopes.
Modelling of the «scalable area»: a simple model, assuming that some experimental measurements of the real application are possible.
[Figure: PDE solver synchronous: execution time (s), energy (Wh), and GPU cluster vs multicore CPU cluster gains versus number of nodes, with the scalable area highlighted.]

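11 A first performance model: First modelling approach
We model the «scalable area».
[Figure: $T$ and $E$ versus $N$ (nodes) for the CPU and GPU clusters, with slopes $\sigma_T^{CPU}$, $\sigma_T^{GPU}$, $\sigma_E^{CPU}$ and $\sigma_E^{GPU}$.]
We observe:
$$T(N) = \frac{T(1)}{N^{\sigma_T}}, \qquad E(N) = E(1) \cdot N^{\sigma_E}, \qquad \text{with } \sigma_T^{CPU} > \sigma_T^{GPU}$$
We consider the electrical power dissipated by the nodes and the switch:
$$E(N) = P(1) \cdot T(N) \cdot N + P_{switch} \cdot \frac{N}{N_{max}} \cdot T(N)$$
with $P$: electrical power (Watts).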
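
A minimal sketch of how these relations could be used, assuming that $\sigma_T$ is fitted from two measured execution times (on 1 node and on a hypothetical $N_0$ nodes). The function names and the numeric values are illustrative, not taken from the slides. Energies come out in joules when $P$ is in watts and $T$ in seconds.

```python
import math


def fit_sigma_t(t_1, t_n0, n0):
    """Exponent sigma_T such that T(N) = T(1) / N**sigma_T matches both measurements."""
    return math.log(t_1 / t_n0) / math.log(n0)


def predict_time(t_1, sigma_t, n):
    """T(N) = T(1) / N**sigma_T, valid only inside the scalable area."""
    return t_1 / n ** sigma_t


def predict_energy(p_node, p_switch, n_max, t_1, sigma_t, n):
    """E(N) = P(1).T(N).N + Pswitch.(N/Nmax).T(N), in joules (W x s)."""
    t_n = predict_time(t_1, sigma_t, n)
    return p_node * t_n * n + p_switch * (n / n_max) * t_n


if __name__ == "__main__":
    sigma = fit_sigma_t(t_1=100.0, t_n0=25.0, n0=4)  # hypothetical timings (s)
    print(sigma, predict_energy(200.0, 50.0, 16, 100.0, sigma, 8) / 3600.0, "Wh")
```

Since both terms of $E(N)$ are proportional to $N \cdot T(N)$, this sketch gives $E(N)/E(1) = N^{1-\sigma_T}$.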

12 A first performance model: First modelling approach
We obtain:
$$\sigma_E = 1 - \sigma_T$$
$$SU^{G/C}(N) = SU^{G/C}(1) \cdot N^{\sigma_T^{GPU} - \sigma_T^{CPU}}, \qquad EG^{G/C}(N) = EG^{G/C}(1) \cdot N^{\sigma_T^{GPU} - \sigma_T^{CPU}}$$
We compute the 2 threshold numbers of nodes:
$$N_T^{G/C} = SU^{G/C}(1)^{1/(\sigma_T^{CPU} - \sigma_T^{GPU})}, \quad \text{where } SU^{G/C}(N_T^{G/C}) = 1$$
$$N_E^{G/C} = EG^{G/C}(1)^{1/(\sigma_T^{CPU} - \sigma_T^{GPU})}, \quad \text{where } EG^{G/C}(N_E^{G/C}) = 1$$
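
The two thresholds can be computed with a few lines of code. This is a sketch under the model's own assumption that a GPU vs CPU gain evolves as $gain(1) \cdot N^{\sigma_T^{GPU} - \sigma_T^{CPU}}$ with $\sigma_T^{CPU} > \sigma_T^{GPU}$. The function names and the example values are hypothetical.

```python
def gain(gain_1, sigma_t_cpu, sigma_t_gpu, n):
    """GPU vs CPU gain (speedup or energy gain) on n nodes."""
    return gain_1 * n ** (sigma_t_gpu - sigma_t_cpu)


def threshold_nodes(gain_1, sigma_t_cpu, sigma_t_gpu):
    """Number of nodes at which a gain equal to gain_1 on one node falls to 1."""
    if sigma_t_cpu <= sigma_t_gpu:
        # The gain never decreases, so no finite threshold exists in this model.
        raise ValueError("the model requires sigma_t_cpu > sigma_t_gpu")
    return gain_1 ** (1.0 / (sigma_t_cpu - sigma_t_gpu))


if __name__ == "__main__":
    # Hypothetical values: SU(1) = 16, EG(1) = 8, sigma_T = 0.9 (CPU) and 0.4 (GPU).
    print(threshold_nodes(16.0, 0.9, 0.4), threshold_nodes(8.0, 0.9, 0.4))
```

Both thresholds share the same exponent, so in this model they differ only through the single-node speedup and the single-node energy gain.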

13 A first performance model: First modelling approach
3 areas appear when increasing the number of nodes $N$:
- from 1 to $\min(N_T^{G/C}, N_E^{G/C})$: the GPU cluster is more efficient (for both T and E). Choose the GPU cluster.
- between $\min(N_T^{G/C}, N_E^{G/C})$ and $\max(N_T^{G/C}, N_E^{G/C})$: the GPU cluster is faster OR less energy consuming. A strategy and a heuristic are required to choose the GPU or the CPU cluster.
- beyond $\max(N_T^{G/C}, N_E^{G/C})$: the CPU cluster is more efficient (for both T and E). Choose the CPU cluster.
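
The three areas translate directly into a small decision function. This is an illustrative sketch: the function name, the returned labels, and the treatment of the exact boundary values (assigned here to the middle area) are assumptions.

```python
def choose_cluster(n_nodes, n_t, n_e):
    """Classify a node count against the time (n_t) and energy (n_e) thresholds."""
    low, high = min(n_t, n_e), max(n_t, n_e)
    if n_nodes < low:
        return "GPU"        # GPU cluster wins on both time and energy
    if n_nodes > high:
        return "CPU"        # CPU cluster wins on both time and energy
    return "heuristic"      # only one criterion favors the GPU cluster


if __name__ == "__main__":
    for n in (8, 100, 300):
        print(n, choose_cluster(n, n_t=256.0, n_e=64.0))
```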

14 3 Improving performances with asynchronous algorithms: Investigation with our PDE solver

15 Improving performances with asynchronous algorithms: Asynchronous parallel computing
Asynchronous algorithms provide implicit overlapping of communications and computations, and communications are important on GPU clusters, so they should improve executions on GPU clusters.
- Some iterative algorithms (not all) can be turned into asynchronous algorithms.
- A strong mathematical theory supports this approach.
But:
- The convergence detection of the algorithm is more complex and requires more communications than with synchronous algorithms.
- Some extra iterations are required to achieve the same accuracy.
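
The effect of iterating on stale data can be illustrated with a small deterministic simulation. This is not the authors' solver: it is a sketch that applies Jacobi sweeps to a hypothetical strictly diagonally dominant system, tridiag(-1, 4, -1) x = (1, ..., 1), split between two simulated nodes. Each node sees the other's boundary value only as last published, and values are published every `delay` sweeps (`delay=1` mimics a synchronous exchange). Convergence is checked with a global residual, which sidesteps the harder asynchronous convergence detection mentioned on the slide.

```python
def residual(x):
    """Max-norm residual of tridiag(-1, 4, -1) x = (1, ..., 1)."""
    n = len(x)
    worst = 0.0
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        worst = max(worst, abs(4.0 * x[i] - left - right - 1.0))
    return worst


def relax(n=8, delay=1, tol=1e-8, max_sweeps=10_000):
    """Return (sweeps, solution) for Jacobi sweeps with delayed boundary exchange."""
    half = n // 2
    x = [0.0] * n
    seen_by_node1 = x[half - 1]  # node 0's last value, as known to node 1
    seen_by_node0 = x[half]      # node 1's first value, as known to node 0
    for sweep in range(1, max_sweeps + 1):
        new = []
        for i in range(n):
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            if i == half - 1:    # node 0 reads node 1's (possibly stale) value
                right = seen_by_node0
            elif i == half:      # node 1 reads node 0's (possibly stale) value
                left = seen_by_node1
            new.append((1.0 + left + right) / 4.0)
        x = new
        if sweep % delay == 0:   # publish fresh boundary values
            seen_by_node1, seen_by_node0 = x[half - 1], x[half]
        if residual(x) < tol:
            return sweep, x
    raise RuntimeError("did not converge")


if __name__ == "__main__":
    print("sweeps with delay=1:", relax(delay=1)[0])
    print("sweeps with delay=3:", relax(delay=3)[0])
```

Both runs reach the same solution within the tolerance. Comparing the two sweep counts gives a feel for the extra iterations that stale data can cost.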

16 Improving performances with asynchronous algorithms Parallel iterative PDE solver

17 Improving performances with asynchronous algorithms Inner linear solver

18 Improving performances with asynchronous algorithms: Asynchronous version, and a much more complex parallel implementation!

19 Improving performances with asynchronous algorithms Asynchronous version

20 Improving performances with asynchronous algorithms: Performances on a heterogeneous cluster
Execution time using both GPU clusters of Supelec (to minimize).
Testbed: 17 nodes Xeon dual core + GT[...] and [...] nodes Nehalem quad core + GT285, with 2 interconnected Gigabit switches. Remark: the two clusters are managed by one OAR server.
[Figure: execution time T exec (s) versus number of fast nodes, for GPUs & synchronous and for GPUs & asynchronous.]

21 Improving performances with asynchronous algorithms: Performances on a heterogeneous cluster
Speedup vs 1 GPU (to maximize):
- the asynchronous version achieves a more regular speedup
- the asynchronous version achieves a better speedup on a high number of nodes
[Figure: speedup vs 1 GPU versus number of fast nodes, for GPU cluster & synchronous and for GPU cluster & asynchronous.]

22 Improving performances with asynchronous algorithms: Performances on a heterogeneous cluster
Energy consumption (to minimize):
- measurement errors become important
- the sync. and async. energy consumption curves are (just) «different»
[Figure: energy consumption (Wh) versus number of fast nodes, for GPU cluster & synchronous and for GPU cluster & asynchronous.]

23 Improving performances with asynchronous algorithms: Performances on a heterogeneous cluster
Energy overhead factor vs 1 GPU (to minimize):
- the overhead curves are (just) «different»
- there is no longer a globally attractive solution!
[Figure: energy overhead factor vs 1 GPU versus number of fast nodes, for GPU cluster & synchronous and for GPU cluster & asynchronous.]

24 4 Need for a new performance model and an auto adaptive solution

25 Need for a new performance model and an auto adaptive solution: Relative async. vs sync. performances
The relative async. vs sync. speedup and energy gain exhibit some similarities:
- they can be used to choose the version to run
- they need a fine model (region frontiers are complex)
- they need a heuristic when only one gain is greater than 1
[Figure: regions where async. is better, for the speedup and for the energy gain.]

26 Need for a new performance model and an auto adaptive solution: Relative async. vs sync. performances
Energy Delay Product (EDP) (to minimize):
- used to track a global optimum, considering both T and E
- parallel runs on many nodes seem better
- no large differences between the sync. and async. versions
[Figure: EDP for GPU cluster & synchronous and for GPU cluster & asynchronous.]

27 Need for a new performance model and an auto adaptive solution: Relative async. vs sync. performances
Async. vs sync. relative Energy Delay Product ratio:
We compute $EDP_{sync} / EDP_{async}$, which can be used to make a choice inside ambiguous regions.
In the ambiguous region (where relative SU > 1 and relative EG < 1): compute the EDP and choose the sync. or async. version.
[Figure: regions labelled «Choose async. version» and «Choose sync. version».]
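
A minimal sketch of this selection rule, taking the EDP as energy multiplied by execution time. The function name is hypothetical, and applying the EDP ratio to every case where the two gains disagree (not only SU > 1 with EG < 1) is an assumption of this sketch.

```python
def choose_version(t_sync, e_sync, t_async, e_async):
    """Pick 'sync' or 'async' from measured or predicted times and energies."""
    speedup = t_sync / t_async        # relative SU of async. vs sync.
    energy_gain = e_sync / e_async    # relative EG of async. vs sync.
    if speedup > 1.0 and energy_gain > 1.0:
        return "async"
    if speedup < 1.0 and energy_gain < 1.0:
        return "sync"
    # Ambiguous region: fall back on the EDP ratio EDP_sync / EDP_async.
    edp_ratio = (e_sync * t_sync) / (e_async * t_async)
    return "async" if edp_ratio > 1.0 else "sync"


if __name__ == "__main__":
    # Hypothetical values: async. is faster but consumes more energy.
    print(choose_version(t_sync=100.0, e_sync=50.0, t_async=80.0, e_async=55.0))
```

The EDP ratio equals the product of the relative speedup and the relative energy gain, so it arbitrates between the two when they disagree.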

28 Need for a new performance model and an auto adaptive solution: Automatic choice criteria
Automatic selection of the «best» version to run: synchronous or asynchronous algorithm, on the CPU cluster or on the GPU cluster.
Criteria:
- relative speedup: tracking HPC performances
- relative energy gain: tracking low energy consumption
- relative energy delay product: tracking a compromise
But we need a model to automate this choice:
- a fine model: criteria variations are small and region frontiers are complex
- a model not requiring long and large experiments
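
Once a model predicts the time and energy of each candidate, the selection itself is a minimization over the chosen criterion. The sketch below assumes such predictions are available as `(time_s, energy_wh)` pairs. The configuration names, the criterion keys, and all numbers are hypothetical.

```python
CRITERIA = {
    "speed": lambda t, e: t,       # fastest execution
    "energy": lambda t, e: e,      # lowest energy consumption
    "edp": lambda t, e: t * e,     # energy delay product compromise
}


def select_version(predictions, criterion):
    """Return the configuration name that minimizes the requested criterion."""
    score = CRITERIA[criterion]
    return min(predictions, key=lambda name: score(*predictions[name]))


if __name__ == "__main__":
    predictions = {                # hypothetical (time in s, energy in Wh)
        "sync/CPU": (400.0, 90.0),
        "async/CPU": (380.0, 95.0),
        "sync/GPU": (60.0, 40.0),
        "async/GPU": (50.0, 46.0),
    }
    for criterion in CRITERIA:
        print(criterion, "->", select_version(predictions, criterion))
```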

29 Need for a new performance model and an auto adaptive solution: Fine model required
First model limitations:
- requires/assumes a «scalable area»
- approximate model (not fine)
- requires 4 executions of the entire application: on 1 and $N_0$ nodes, running the 2 versions to compare, measuring both T and E
- suited to optimizing the execution of a long-life application scaling on a parallel machine
[Figure: $T$ versus $N$ (nodes), with slopes $\sigma_T^{CPU}$ and $\sigma_T^{GPU}$.]
To achieve an automatic selection on a short-life application, we need:
- a model requiring only small elementary benchmarks to fix the model parameter values on the hardware used,
- a fine model that does not require the application to exhibit perfect scalability on the architecture.

30 Need for a new performance model and an auto adaptive solution: Fine model required
A first version of this fine model exists. It takes into account:
- the different power dissipation of the different «identical» nodes of the cluster
- when starting computations on the GPU, the power dissipation increases in 2 steps: when the GPU starts to compute, then when the fan of the GPU starts and/or when the GPU increases its frequency
- when stopping computations on the GPU, the power dissipation decreases in several steps, but not immediately!
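
One way to picture such a model is a piecewise-constant power profile per node, integrated over time. This is only a sketch of the idea, not the authors' fine model: the phase durations, the power levels, and the per-node offsets are hypothetical.

```python
def energy_wh(phases):
    """Energy in Wh of a piecewise-constant profile [(duration_s, power_w), ...]."""
    return sum(duration * power for duration, power in phases) / 3600.0


def cluster_energy_wh(node_offsets_w, phases):
    """Sum over nodes, each node adding its own constant power offset (W)."""
    return sum(
        energy_wh([(duration, power + offset) for duration, power in phases])
        for offset in node_offsets_w
    )


if __name__ == "__main__":
    phases = [                 # hypothetical single-node profile
        (10.0, 150.0),         # idle before the run
        (5.0, 230.0),          # first step: the GPU starts to compute
        (60.0, 260.0),         # second step: fan starts and/or frequency increases
        (8.0, 200.0),          # after the stop: power decreases in steps
        (12.0, 170.0),
    ]
    print(energy_wh(phases), cluster_energy_wh([0.0, 4.0, -3.0], phases))
```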

31 Need for a new performance model and an auto adaptive solution: Fine model required
[Figure: power dissipation (Watts) versus time (s).]

32 Need for a new performance model and an auto adaptive solution: Fine model required
First evaluation, on our PDE solver execution and on our heterogeneous GPU cluster:
- error (model vs observation): 6%
- after some bias corrections: 1%
A stronger evaluation is planned: on different hardware and on different applications.
Then: a heuristic for auto-adaptation/auto-selection of the right algorithm will be implemented.
To be continued

33 5 Conclusion and perspectives

34 Long term objectives
[Diagram elements: elementary hardware benchmarks; performance model (energy and computations); heuristic; kernel versions (Kernel 1 v1, Kernel 1 v2, Kernel 1 v3, Kernel 2 v1, Kernel 2 v2); Parallel algorithm 1 and Parallel algorithm 2; user and scheduler; a.out O [speed, energy, edp, ]; auto-adaptation of a.out; three possible outcomes: end after fast execution, end after EDP compromise execution, end after low energy consumption execution.]

35 IMS Team and INRIA AlGorille Project Team. Computing and energy performance optimization of a multi-algorithm PDE solver on CPU and GPU clusters. Questions?
