Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms

1 Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms
Javier Cuenca, Luis-Pedro García, Domingo Giménez, Francisco J. Herrera
Scientific Computing and Parallel Programming

2 Motivation
WHY: Scientific software must be optimized, by hand or automatically.
WHERE: Current hybrid computational systems: a CPU plus several GPUs.
WHAT: Improving the performance of linear algebra routines through:
- a balanced assignment of the work to the processing units
- the best number of CPU threads
HOW: Auto-tuning technique: execution time modeling.
WHEN: Mainly at installation time.

3 Outline
- Routine: hybrid matrix multiplication kernel
- Auto-tuning method: empirical modeling
  - Design phase
  - Installation phase
  - Execution phase
- Experimental results
- Conclusion and future work

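5 Routine: hybrid matrix multiplication kernel
$C = \alpha A B + \beta C$
With $B$ and $C$ split by columns into two blocks:
$C = \alpha (A B_1 \mid A B_2) + \beta (C_1 \mid C_2)$
- $\alpha A B_1 + \beta C_1$ is assigned to the GPU
- $\alpha A B_2 + \beta C_2$ is assigned to the CPU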

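6 Routine: hybrid matrix multiplication kernel
$C = \alpha A B + \beta C$
With $c$ GPUs, $B$ and $C$ are split by columns into $c + 1$ blocks:
$C = \alpha (A B_1 \mid \cdots \mid A B_c \mid A B_{cpu}) + \beta (C_1 \mid \cdots \mid C_c \mid C_{cpu})$
- $\alpha A B_i + \beta C_i$ is assigned to GPU $i$
- $\alpha A B_{cpu} + \beta C_{cpu}$ is assigned to the CPU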
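
Assuming that B and C are partitioned by columns, as the per-unit terms above suggest, the decomposition can be checked with a small sketch. It is plain Python on list-of-lists matrices, not the authors' implementation. The helper names (`gemm`, `split_columns`, `hybrid_gemm`) and the `widths` argument are hypothetical, and the blocks are computed one after another instead of being sent to GPUs and the CPU.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha*A*B + beta*C for list-of-lists matrices."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(inner)) + beta * C[i][j]
             for j in range(cols)]
            for i in range(rows)]


def split_columns(M, widths):
    """Split M by columns into consecutive blocks of the given widths."""
    blocks, start = [], 0
    for w in widths:
        blocks.append([row[start:start + w] for row in M])
        start += w
    return blocks


def hybrid_gemm(alpha, A, B, beta, C, widths):
    """Compute alpha*A*B + beta*C block by block (hypothetical helper).

    widths[i] is the number of columns of B and C given to processing
    unit i; in the slides these would be the GPUs followed by the CPU.
    """
    if sum(widths) != len(B[0]):
        raise ValueError("block widths must cover all columns of B")
    parts = [gemm(alpha, A, Bi, beta, Ci)
             for Bi, Ci in zip(split_columns(B, widths), split_columns(C, widths))]
    # Concatenate the column blocks back into the full result, row by row.
    return [sum((p[i] for p in parts), []) for i in range(len(A))]
```

Because each output column depends only on the matching column of B and C, the block results can be computed independently and glued together, which is what makes the work distribution possible.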

8 Auto-tuning method: empirical modeling
The auto-tuning process has three main phases:
- Design phase: building a theoretical model of the routine, $T_{execution} = f(n, \text{System Parameters}, \text{Algorithmic Parameters})$
- Installation phase: including the computing and communication capacities of the current machine in the model (System Parameters)
- Execution phase: deciding the values for a set of configurable parameters in the model (Algorithmic Parameters)

10 Auto-tuning method: empirical modeling
Design phase: building a theoretical model of the routine
Computing time in each GPU $i$:
$T_{comp\_gpu\_i}(n, n_{gpu\_i}) = 2 n^2 \, n_{gpu\_i} \, k_{comp\_gpu\_i}$
- $n$: dimension of the matrices
- $n_{gpu\_i}$: portion of matrix $B$ assigned to GPU $i$ (Algorithmic Parameter)
- $k_{comp\_gpu\_i}$: average time to perform a basic operation (a double precision multiplication or addition) in GPU $i$ (Computing System Parameter)
Communication time between the CPU and GPU $i$:
$T_{comm\_gpu\_i}(n, n_{gpu\_i}) = 3 t_{s\_i} + (n^2 + 2 n \, n_{gpu\_i}) \, t_{w\_i}$
- $t_{s\_i}$ and $t_{w\_i}$ are Communication System Parameters
- $t_{s\_i}$: average time for starting a communication
- $t_{w\_i}$: average time for sending or receiving a double precision number to or from the CPU
Time to perform the work assigned to GPU $i$:
$T_{gpu\_i}(n, n_{gpu\_i}) = T_{comp\_gpu\_i}(n, n_{gpu\_i}) + T_{comm\_gpu\_i}(n, n_{gpu\_i})$
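
The GPU part of the model translates directly into code. The following is a minimal sketch of the three formulas on this slide. The function and argument names are chosen here for illustration, and any parameter values passed in are the caller's own measurements, not figures from the presentation.

```python
def t_comp_gpu(n, n_gpu, k_comp):
    """Modeled computing time: 2 * n^2 * n_gpu basic operations at k_comp seconds each."""
    return 2 * n**2 * n_gpu * k_comp


def t_comm_gpu(n, n_gpu, t_s, t_w):
    """Modeled transfer time: three start-ups plus n^2 + 2*n*n_gpu doubles moved."""
    return 3 * t_s + (n**2 + 2 * n * n_gpu) * t_w


def t_gpu(n, n_gpu, k_comp, t_s, t_w):
    """Total modeled time for the work assigned to one GPU."""
    return t_comp_gpu(n, n_gpu, k_comp) + t_comm_gpu(n, n_gpu, t_s, t_w)
```

Note that the communication term grows with n^2 even for a small n_gpu, because the whole of A is transferred regardless of how many columns of B the GPU receives.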

11 Auto-tuning method: empirical modeling
Design phase: building a theoretical model of the routine
Computing time in the CPU:
$T_{comp\_cpu}(n, n_{cpu}, t) = 2 n^2 \, n_{cpu} \, k_{comp\_cpu\_t}$
- $n$: dimension of the matrices
- $n_{cpu}$: portion of matrix $B$ assigned to the CPU (Algorithmic Parameter)
- $t$: number of CPU threads (Algorithmic Parameter)
- $k_{comp\_cpu\_t}$: average time to perform a basic operation in the CPU with $t$ threads (Computing System Parameter)
Time to perform the work assigned to the CPU:
$T_{cpu}(n, n_{cpu}, t) = T_{comp\_cpu}(n, n_{cpu}, t)$

12 Auto-tuning method: empirical modeling
Design phase: building a theoretical model of the routine
- A problem of size $n_r \times n_r$ is to be solved
- Communications with the GPUs overlap with CPU computing
- The total execution time is that of the processing unit with the largest time:
$T_{total}(n_r, n_{gpu\_1}, \ldots, n_{gpu\_c}, n_{cpu}, t) = \max \{ T_{gpu\_1}(n_r, n_{gpu\_1}), \ldots, T_{gpu\_c}(n_r, n_{gpu\_c}), T_{cpu}(n_r, n_{cpu}, t) \}$
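
Putting the GPU and CPU terms together, the total modeled time is a maximum over the processing units. The sketch below assumes that the System Parameters of each GPU are passed as a tuple `(k_comp, t_s, t_w)` and that the CPU constant for the chosen thread count is passed as a single number. Both conventions, and all names, are choices made here for illustration.

```python
def t_gpu(n, n_gpu, k_comp, t_s, t_w):
    """Modeled time of one GPU: computing plus CPU-GPU communication."""
    return 2 * n**2 * n_gpu * k_comp + 3 * t_s + (n**2 + 2 * n * n_gpu) * t_w


def t_cpu(n, n_cpu, k_comp_cpu_t):
    """Modeled time of the CPU part for a given thread count (computing only)."""
    return 2 * n**2 * n_cpu * k_comp_cpu_t


def t_total(n, n_gpus, n_cpu, gpu_params, k_comp_cpu_t):
    """Modeled total time: the slowest processing unit determines the result.

    n_gpus      -- columns of B assigned to each GPU
    gpu_params  -- one (k_comp, t_s, t_w) tuple per GPU (assumed layout)
    """
    if len(n_gpus) != len(gpu_params):
        raise ValueError("one parameter tuple is needed per GPU")
    times = [t_gpu(n, ng, *params) for ng, params in zip(n_gpus, gpu_params)]
    times.append(t_cpu(n, n_cpu, k_comp_cpu_t))
    return max(times)
```

Since the result is a maximum, lowering the load of any unit other than the slowest one does not reduce the modeled time, which is why a balanced distribution is the target.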

14 Auto-tuning method: empirical modeling
Installation phase
The installation phase has two main tasks:
1. Measuring the System Parameter values
2. Searching for the best Algorithmic Parameter values for a fixed set of problem sizes

15 Auto-tuning method: empirical modeling
Installation phase
1. Measuring System Parameter values
System Parameter values depend on the quantity of data.
In GPU $i$:
- For the Computing System Parameter, $k_{comp\_gpu\_i}$:
  - a benchmark applied to a set of local subproblems gives a set of execution times
  - the set of execution times plus the model gives a set of values for $k_{comp\_gpu\_i}$
- For the Communication System Parameters, $t_{s\_i}$ and $t_{w\_i}$:
  - a benchmark applied to a set of local subproblems gives a set of execution times
  - the set of execution times plus the model gives a set of values for $t_{s\_i}$ and $t_{w\_i}$
In the CPU with $t$ threads ($t = 1, \ldots,$ number of cores):
- For the Computing System Parameter, $k_{comp\_cpu\_t}$:
  - a benchmark applied to a set of local subproblems gives a set of execution times
  - the set of execution times plus the model gives a set of values for $k_{comp\_cpu\_t}$
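
One way to turn benchmark timings into System Parameter values is to invert the model. The sketch below does this under two assumptions that the slides do not spell out. First, each computing benchmark yields one value of k per subproblem size by dividing the measured time by 2 n^2 n_part. Second, the communication parameters are obtained by an ordinary least-squares line fit of time against the number of doubles transferred. The function names and the sample layout `(n, n_part, seconds)` are hypothetical.

```python
def estimate_k_comp(samples):
    """Map each benchmarked subproblem (n, n_part) to its own k_comp value.

    samples -- iterable of (n, n_part, seconds) from a computing benchmark
    """
    return {(n, n_part): seconds / (2 * n**2 * n_part)
            for n, n_part, seconds in samples}


def estimate_comm(samples):
    """Fit T = 3*t_s + words*t_w by least squares and return (t_s, t_w).

    samples -- iterable of (n, n_part, seconds) from a transfer benchmark,
               where words = n^2 + 2*n*n_part doubles are moved
    """
    samples = list(samples)
    xs = [n**2 + 2 * n * n_part for n, n_part, _ in samples]
    ys = [seconds for _, _, seconds in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("at least two different transfer sizes are needed")
    t_w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    # The intercept of the fitted line equals 3 * t_s in the model.
    t_s = (mean_y - t_w * mean_x) / 3
    return t_s, t_w
```

Keeping one k value per subproblem size, instead of a single average, matches the remark that System Parameter values depend on the quantity of data.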

16 Auto-tuning method: empirical modeling
Installation phase
2. Searching for the best Algorithmic Parameter values for a fixed set of problem sizes
- Global_Problem_Size_Set: a set of data sizes for the whole problem (decided by the System Manager)
- For each problem size in Global_Problem_Size_Set:
  - a complete search for the best Algorithmic Parameter values is carried out, according to the model with the measured System Parameter values
  - the selected Algorithmic Parameter values are added to Global_Best_AP_Installation_Set

17 Auto-tuning method: empirical modeling
Installation phase
2. Searching for the best Algorithmic Parameter values for a fixed set of problem sizes
Almost the whole installation time is devoted to this second task. This time depends on:
- the dimensions of the search space (the number of GPUs and the number of CPU threads)
- the Distribution Grain Size, DGS (decided by the System Manager)
The DGS value determines:
- the installation time
- the quality of the decisions taken later
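
The search can be pictured as an exhaustive enumeration of work distributions in steps of the grain size. The sketch below is one possible reading, not the authors' code. It assumes the DGS is given as a number of columns that divides n, that an idle GPU contributes no time, and that the CPU constants are stored in a dictionary keyed by thread count. All names are hypothetical.

```python
from itertools import product


def model_gpu(n, n_gpu, k_comp, t_s, t_w):
    """Modeled GPU time; an idle GPU (n_gpu == 0) is assumed to cost nothing."""
    if n_gpu == 0:
        return 0.0
    return 2 * n**2 * n_gpu * k_comp + 3 * t_s + (n**2 + 2 * n * n_gpu) * t_w


def distributions(n, units, grain):
    """Yield every split of n columns among `units` in multiples of `grain`."""
    steps = n // grain  # assumes grain divides n
    for combo in product(range(steps + 1), repeat=units - 1):
        if sum(combo) <= steps:
            # The last unit (the CPU) receives whatever is left.
            yield tuple(s * grain for s in combo) + ((steps - sum(combo)) * grain,)


def best_parameters(n, grain, gpu_params, k_comp_cpu):
    """Return (modeled time, GPU columns, CPU columns, threads) with minimal time.

    gpu_params  -- one (k_comp, t_s, t_w) tuple per GPU
    k_comp_cpu  -- dict mapping a thread count to its k_comp_cpu_t value
    """
    best = None
    for *n_gpus, n_cpu in distributions(n, len(gpu_params) + 1, grain):
        gpu_times = [model_gpu(n, ng, *p) for ng, p in zip(n_gpus, gpu_params)]
        for threads, k in k_comp_cpu.items():
            total = max(gpu_times + [2 * n**2 * n_cpu * k])
            if best is None or total < best[0]:
                best = (total, tuple(n_gpus), n_cpu, threads)
    return best
```

With s = n / DGS steps and u processing units, there are C(s + u - 1, u - 1) distributions to evaluate for each thread count, so a finer grain or an extra GPU enlarges the search quickly. This is the trade-off between installation time and decision quality mentioned above.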

19 Auto-tuning method: empirical modeling
Execution phase
- At runtime, a problem of size $n_r \times n_r$ is to be solved
- The Algorithmic Parameter values used are those stored in Global_Best_AP_Installation_Set for the $n$ closest to $n_r$
- No significant overhead is added to the total execution time
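
The runtime decision amounts to a nearest-neighbour lookup, which is why it adds so little to the execution time. In the sketch below the installation set is assumed to be a dictionary from problem size to a pair `(shares, threads)`, where `shares` holds the fraction of columns for each GPU followed by the CPU. This storage format is an assumption made here so that a stored distribution can be rescaled to n_r; the slides do not specify it.

```python
def select_parameters(n_r, installation_set):
    """Return the stored parameters for the installed size closest to n_r."""
    if not installation_set:
        raise ValueError("the installation set is empty")
    closest = min(installation_set, key=lambda n: abs(n - n_r))
    return installation_set[closest]


def columns_for(n_r, shares):
    """Turn fractional shares (GPUs first, CPU last) into column counts.

    The CPU takes the remainder so that the counts always sum to n_r.
    """
    counts = [round(n_r * share) for share in shares[:-1]]
    counts.append(n_r - sum(counts))
    return counts
```

A call such as `select_parameters(1400, table)` followed by `columns_for(1400, shares)` would give the number of columns of B for each processing unit, with the entries of `table` coming from the installation phase.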

21 Experimental results
Platform: 12CtwoC2075fourGTX590
A shared-memory system with:
- 2 hexa-cores (12 cores) Intel Xeon E
- 2 GPU devices Nvidia Fermi Tesla C2075, each with:
  - 5375 MBytes of global memory
  - 448 cores: 14 Streaming Multiprocessors, 32 Streaming Processors per Multiprocessor
- 4 GPU devices Nvidia GeForce GTX 590, each with:
  - 1536 MBytes of global memory
  - 512 cores: 16 Streaming Multiprocessors, 32 Streaming Processors per Multiprocessor
The different system configurations are named 12CxC2075yGTX590, to indicate:
- a CPU with 12 cores
- x Fermi Tesla C2075
- y GeForce GTX 590

22 Experimental results
Platform: 12ConeC2075twoGTX590
Model vs. experimental execution time. CPU threads = 10.
[Figure: four panels, for n = 1000, 2000, 3000 and 4000. Horizontal axis: work distributions. Vertical axis: time (seconds). Series: modeling time and execution time.]

23 Experimental results
Platform: 12CtwoC2075fourGTX590
Auto-tuning vs. optimal. CPU threads = 10, DGS = 10% of n.

24 Experimental results
Platform: 12CtwoC2075fourGTX590
GFLOPS and installation time with different Distribution Grain Sizes (DGS).

25 Experimental results
Different platforms
GFLOPS with different work distribution methods.

28 Conclusion and future work
- Auto-tuning methodology: empirical modeling
- Adaptation to multicore+multiGPU systems, to obtain:
  - balanced distributions of the work
  - the best number of CPU threads
- Linear algebra routines. Example: a matrix-matrix multiplication
- Satisfactory results are reported for these complex, heterogeneous systems
- More experiments:
  - in more systems
  - with larger numbers of coprocessors of different architectures
  - with other linear algebra routines
- More complex systems should be considered, for example multicore+multiGPU+multiMIC
