Automatic Tuning of the High Performance Linpack Benchmark

1 Automatic Tuning of the High Performance Linpack Benchmark
Ruowei Chen
Supervisor: Dr. Peter Strazdins
The Australian National University

2 What is the HPL Benchmark? It is the benchmark used to rank the world's Top 500 supercomputers.

Table 1. Top 5 in November 2010

Rank  Country        Machine                   Processor                        Cores
1     China          Tianhe-1A (NUDT TH MPP)   Intel X5670 6C 2.93 GHz          186,368
2     United States  Jaguar (Cray XT5-HE)      AMD Opteron 6C 2.6 GHz           224,162
3     China          Dawning TC3600            Intel X5650 + NVIDIA Tesla GPU   120,640
4     Japan          TSUBAME 2.0               Intel Xeon 6C 2.93 GHz           73,278
5     United States  Hopper (Cray XE6)         12-core 2.1 GHz                  153,408

3 HPL is doing simple things: solving a dense linear system A x = b with n unknowns (the slide's example uses n = 3; in practice n is around 1,000 to 100,000+), and using the run to determine the performance of the computer system:

  Performance = N_cal / Time

where
  N_cal = number of effective floating-point operations
  Time  = overall time = T_calculation + T_communication
  (the communication overhead may be overlapped by calculations)
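
A two-line sketch makes the formula concrete. The flop count used here, 2/3 n^3 + 3/2 n^2, is my assumption insofar as this slide does not spell it out; it is the full-factorization case (i = n) of the early-termination formula on slide 13:

```python
# A minimal sketch of the performance formula above.
def hpl_gflops(n, seconds):
    n_cal = (2.0 / 3.0) * n**3 + 1.5 * n**2   # effective flop count N_cal
    return n_cal / seconds / 1e9              # Gflop/s

# e.g. a hypothetical N=5000 run that took 4.3 s -> ~19.4 Gflop/s
print(hpl_gflops(5000, 4.3))
```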

4 ... in a complex manner. HPL's algorithm:
- Two-dimensional block-cyclic data distribution
- Right-looking variant of the LU factorization with row partial pivoting, featuring multiple lookahead depths
- Recursive panel factorization with pivot search and column broadcast combined
- Various virtual panel broadcast topologies
- Bandwidth-reducing swap-broadcast algorithm
- Backward substitution with lookahead of depth 1
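
As a concrete illustration of the first bullet, here is a tiny sketch (an illustrative helper, not HPL's own code) of how a two-dimensional block-cyclic distribution maps matrix entries onto a P x Q process grid:

```python
# Which process in a P x Q grid owns global matrix entry (i, j) under a
# two-dimensional block-cyclic distribution with block size nb: blocks of
# nb consecutive rows/columns are dealt out cyclically across the grid.
def owner(i, j, nb, p, q):
    return ((i // nb) % p, (j // nb) % q)   # (process row, process column)

# e.g. with nb=2 on a 2x2 grid, entry (5, 2) lands on process (0, 1):
print(owner(5, 2, nb=2, p=2, q=2))
```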

5 Automatic Tuning of the High Performance Linpack Benchmark
So what to tune on? HPL's algorithm is changeable via 17 parameters:

N -- Problem size
NB -- Blocking factor
P -- Rows in process grid
Q -- Columns in process grid
Depth -- Lookahead depth
Bcasts -- Panel broadcasting method
Pfacts -- Panel factorization method
Rfacts -- Recursive factorization method
Pmap -- Process mapping
threshold -- for matrix validity test
Ndiv -- Panels in recursion
Nbmin -- Recursion stopping criteria
Swap -- Swap algorithm
L1, U -- how to store the triangle of the panel
Align -- Memory alignment
Equilibration
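
These parameters are set in HPL's input file. A representative HPL.dat is sketched below for orientation; the values are illustrative, not the tuned ones from this work:

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
5000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```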

6 Automatic Tuning of the High Performance Linpack Benchmark
So what to tune on? HPL's algorithm is changeable via 17 parameters (the list on slide 5).

Considering all parameters (except N, P, Q), an exhaustive search means:
  number of combinations = 1,045,094,400
  running time on Jaguar (2nd fastest): 10.3 hours at N = 5,000, or 10,300 hours at N = 50,000
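
A back-of-the-envelope check of those figures, assuming each trial runs at roughly Jaguar's ~2.33 Pflop/s peak (the rate is my assumption; the slide does not state it):

```python
# Rough sanity check: combinations x (flops per N=5000 run / machine rate).
n, combos = 5000, 1_045_094_400
rate = 2.33e15                                   # assumed ~Jaguar peak, flop/s
flops_per_run = (2.0 / 3.0) * n**3 + 1.5 * n**2
print(combos * flops_per_run / rate / 3600)      # ~10.4 hours, vs 10.3 on the slide
```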

7 Automatic Tuning of the High Performance Linpack Benchmark
So what to tune on? The 17 parameters divide into important and not-so-important ones.

Considering only the important parameters (except N, P, Q), an exhaustive search means:
  number of combinations = 50,400
  running time on Jaguar (2nd fastest): 1.79 seconds at N = 5,000, but 497,222 hours at N = 5,000,000

8 Automatic Tuning of the High Performance Linpack Benchmark
Out: exhaustive search
In:
- Random
- Linear (Linear-Random Hybrid)
- Nelder-Mead

9 Nelder-Mead
One iteration works on a simplex with best (B), good (G), and worst (W) vertices: reflect W through the centroid M to get R; if the reflection is accepted, try expansion (E); otherwise contract (C1 or C2) or shrink (S) toward B, then start the next iteration.
Fig 1. One iteration of Nelder-Mead
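
A minimal sketch of that iteration, assuming standard Nelder-Mead over a continuous parameter vector (for HPL the tuner would have to round to valid discrete parameter values, which the sketch omits):

```python
import numpy as np

def nelder_mead_step(simplex, f, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5):
    # Sort vertices from best (B) to worst (W); for HPL tuning, f would be
    # the negated benchmark performance, so smaller is better.
    simplex.sort(key=f)
    best, worst = simplex[0], simplex[-1]
    centroid = np.mean(simplex[:-1], axis=0)           # M: centroid of all but W

    reflected = centroid + alpha * (centroid - worst)  # R
    if f(reflected) < f(best):
        expanded = centroid + gamma * (centroid - worst)   # E
        simplex[-1] = min(expanded, reflected, key=f)      # keep the better of E, R
    elif f(reflected) < f(simplex[-2]):                # better than G: accept R
        simplex[-1] = reflected
    else:
        contracted = centroid + rho * (worst - centroid)   # C1/C2
        if f(contracted) < f(worst):
            simplex[-1] = contracted                   # contraction
        else:
            # S: shrink every vertex toward the best one.
            simplex[:] = [best + sigma * (v - best) for v in simplex]
    return simplex
```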

10 Automatic Tuning of the High Performance Linpack Benchmark
Automatic, but how?
Fig 2. Structure of the Autotune Program
* This program was written by Sotirios Diamand, Christopher Frazer, and Samuel Rathmanner, who started the project; I have made a few changes to it.
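
At its core, such a tuner repeatedly (1) writes an HPL.dat for a candidate configuration, (2) runs the benchmark, and (3) parses the achieved Gflops. A minimal sketch of that measure step, assuming a hypothetical HPL.dat.in template with {N}/{NB}/{P}/{Q} placeholders and an xhpl binary launched via mpirun (all names are assumptions, not the autotune program's actual interface):

```python
import subprocess

def run_hpl(n, nb, p, q, template="HPL.dat.in"):
    # 1) Instantiate the input file for this candidate configuration.
    with open(template) as f:
        hpl_dat = f.read().format(N=n, NB=nb, P=p, Q=q)
    with open("HPL.dat", "w") as f:
        f.write(hpl_dat)
    # 2) Run the benchmark on a P x Q process grid.
    out = subprocess.run(["mpirun", "-np", str(p * q), "./xhpl"],
                         capture_output=True, text=True).stdout
    # 3) Parse the result rows, e.g.
    #    "WR11C2R4   5000  128  2  2  4.30  1.939e+01" (Gflops is the last field).
    rows = [line.split() for line in out.splitlines() if line.startswith("WR")]
    return max(float(r[-1]) for r in rows)
```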

11 Goals
- Achieve the highest performance
- Cut down tuning time
  - saves CPU time for other, more important jobs
  - makes it possible to achieve higher performance

12 Tuning by Random
Fig 3. Performance of the random tuner (Gflops vs. number of configurations)
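
A minimal sketch of a random tuner, reusing the hypothetical run_hpl helper from slide 10; the candidate value sets below are illustrative assumptions, not the ranges the project actually searched:

```python
import random

# Illustrative candidate values for a few of the tunables.
SPACE = {
    "nb":    [32, 64, 96, 128, 160, 192, 224, 256],
    "bcast": list(range(6)),     # panel broadcasting method
    "pfact": list(range(3)),     # panel factorization method
    "rfact": list(range(3)),     # recursive factorization method
}

def random_tune(n, p, q, trials=20):
    best_gflops, best_cfg = 0.0, None
    for _ in range(trials):
        cfg = {k: random.choice(v) for k, v in SPACE.items()}
        # run_hpl: the measure step sketched on slide 10; the remaining
        # knobs would be passed into the HPL.dat template the same way.
        gflops = run_hpl(n, cfg["nb"], p, q)
        if gflops > best_gflops:
            best_gflops, best_cfg = gflops, cfg
    return best_gflops, best_cfg
```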

13 Cut Down Tuning Time: Early Termination
Terminate a run after eliminating only i of the n columns, leaving the trailing (n - i) x (n - i) block unfactorized, and estimate performance from the work actually done:

  GFlops = ( 2/3 (n^3 - (n-i)^3) + 3/2 (n^2 - (n-i)^2) ) / T_e

Fig 4. The shape of the problem matrix when terminated early
Fig 5. Accuracy of the early termination (non-ET run vs. ET at 5%, 10%, and 20%)
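
The same estimate as a tiny sketch; note it reduces to the slide-3 flop count when i = n:

```python
# Effective Gflop/s from an early-terminated run: flops performed after
# eliminating i of n columns, divided by the elapsed time T_e.
def early_termination_gflops(n, i, t_e):
    flops = (2.0 / 3.0) * (n**3 - (n - i)**3) + 1.5 * (n**2 - (n - i)**2)
    return flops / t_e / 1e9

# e.g. stopping a hypothetical N=50,000 run after 10% of its columns:
print(early_termination_gflops(50_000, 5_000, t_e=120.0))
```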

14 Cut Down Tuning Time: Use a Reasonable Problem Size
Fig 6. Effect of the problem size N on performance (GFLOPS) and tuning time (sec)

15 Profiling Block Size NB
Fig 7. Effect of the problem size N on the profiling block size NB (GFLOPS vs. NB, with curves for N = 1000, 2000, 3000, ...)

16 Tuning Result
Table 2. Tuning results of the different tuners (Gflops, tuning time in sec, and efficiency in %, for Linear Hybrid, Nelder-Mead (best), and Random with 5, 10, 15, 20, and 25 configurations)

17 Conclusion
- Early termination helps to cut down tuning time.
- We don't need to use the maximal problem size in tuning.
- Linear search performs poorly, which suggests that there are interdependencies between the parameters.
- Nelder-Mead performs well on HPL tuning.
- The random tuner also works well, which suggests the number of truly important parameters may be even fewer.

Future Work
