1 Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS
*Yuta SAWA and Reiji SUDA, The University of Tokyo
iWAPT 2009, October 1-2
*Now in Central Research Laboratory, Hitachi, Ltd.
2 Background: multi-core CPUs in personal computers
3 CPUs on personal computers
[Table: year, fastest Intel CPU for PC (Pentium, Pentium 4, Pentium D, Core i7), cores (threads, up to 4 (8)), and GFLOPS; plus the dimension of matrix multiplication that can be solved in a second]
The solvable dimension grew about 5 times, which is about 125 times in FLOPS.
4 Users' paradigm shift in computer usage
A decade ago, users ran applications sequentially, or the performance of each application decreased when they shared one CPU. Today, users can run a few applications concurrently, each on its own core, without much overhead or disturbing each other.
5 BLAS (Basic Linear Algebra Subprograms)
BLAS is the interface for multiplication routines on vectors and matrices. BLAS routines are basic building blocks of numerical calculations and are called from many other libraries, such as LAPACK, ARPACK, ScaLAPACK, and PBLAS.
6 DGEMM routine in BLAS
DGEMM is the simplest double-precision matrix multiplication routine, written as C := αAB + βC, where A is m-by-k, B is k-by-n, and C is m-by-n. The computational complexity of the DGEMM routine is O(mnk). This DGEMM routine is our target problem in this paper.
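As a reference for the definition above, here is a minimal, unoptimized sketch of what DGEMM computes (plain Python with list-of-lists matrices; illustrative only, not the paper's implementation):

```python
def dgemm(alpha, A, B, beta, C):
    """Compute C := alpha*A*B + beta*C for A (m x k), B (k x n), C (m x n)."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for l in range(k):          # O(m*n*k) multiply-adds in total
                acc += A[i][l] * B[l][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

Optimized implementations such as GotoBLAS and ATLAS compute exactly this, but reorganize the loops into cache-friendly blocks.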
7 BLAS on personal computers
The DGEMM routine has much parallelism. On personal computers, however, the number of available CPU cores changes depending on the other applications. This gives two parameters about the number of CPU cores: the number of physical cores (p = 2 in the figure) and the number of cores available for BLAS (1 or 2, changing over time as App 2 starts and stops).
8 High-performance BLAS implementations
GotoBLAS is the fastest BLAS implementation in the world today; users set the number of threads by hand (e.g., four BLAS sub-tasks on four cores with NUM_THREAD=4). ATLAS is a BLAS implementation that uses auto-tuning; it decides the number of threads automatically depending on the problem size (few threads for small problems, all cores for large problems).
9 Importance of dynamic load balancing
In traditional BLAS implementations, every sub-task has almost the same size. If other tasks are running concurrently, the core they occupy finishes its equal-sized sub-task late, so using dynamic load balancing is better.
10 Dynamically Load-Balanced BLAS (DL-BLAS)
11 What is DL-BLAS?
DL-BLAS is a BLAS implementation that stays well parallelized even if other applications are running concurrently.
12 DL-BLAS parallelization algorithm
DL-BLAS splits a calculation into sub-tasks and assigns them to CPU cores using dynamic load balancing (this is the part DL-BLAS provides). Each core then calculates its sub-tasks using an existing BLAS implementation such as ATLAS.
13 DL-BLAS overview
STEP 1. Problem: static scheduling is inefficient on personal computers. Our solution: tile the matrices into sub-matrices with some block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Our solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Our solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.
15 DL-BLAS task splitting
The DGEMM calculation can be written as C := αAB + βC. C is partitioned into (n_b, n_b) sub-matrices, A into (n_b, k_b) sub-matrices, and B into (k_b, n_b) sub-matrices; the pair (n_b, k_b) is the block size. One task updates one (n_b, n_b) block of C with an (n_b, k_b) by (k_b, n_b) product.
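The tiling above can be sketched as a task generator (a hypothetical helper for illustration; the real DL-BLAS task structure may differ):

```python
def split_tasks(m, n, n_b):
    """Enumerate DL-BLAS-style tasks: one task per (n_b x n_b) block of C.

    Each task later loops over the k dimension in slices of size k_b,
    multiplying an (n_b x k_b) block of A by a (k_b x n_b) block of B
    and accumulating into its own block of C (so no two tasks write the
    same block).
    """
    return [(i0, j0) for i0 in range(0, m, n_b)
                     for j0 in range(0, n, n_b)]
```

The number of tasks is ⌈m/n_b⌉ times ⌈n/n_b⌉, which is exactly the count the Parameter Selection step later compares against its threshold.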
16 Parameter Tuning Algorithms in DL-BLAS
17 DL-BLAS overview
STEP 1. Problem: static scheduling is inefficient on personal computers. Our solution: tile the matrices into sub-matrices with some block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Our solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Our solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.
18 DL-BLAS performance modeling
DL-BLAS performance = single-thread performance × multi-thread speed-up. We modeled the lower bound of the multi-thread speed-up, and we created an algorithm to find a quasi-maximum value of the single-thread performance.
19 Multi-thread speed-up and the number of sub-tasks
Example 1 (sub-tasks are large): if there are 3 available CPU cores and 4 sub-tasks, they need 2 unit times. The speed-up rate is 4 / 2 = 2, and the parallel efficiency is 2 / 3 = .667.
Example 2 (sub-tasks are small): if there are 3 available CPU cores and 10 sub-tasks, they need 4 unit times. The speed-up rate is 10 / 4 = 2.5, and the parallel efficiency is 2.5 / 3 = .833.
If there is only a small number of tasks, the speed-up rate can be small.
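The unit-time model in these examples can be written down directly (a sketch: s equal sub-tasks on i available cores finish in ⌈s/i⌉ unit times):

```python
from math import ceil

def speedup(s, i):
    # s sub-tasks of one unit time each, i available cores:
    # the makespan is ceil(s / i) unit times.
    return s / ceil(s / i)

def efficiency(s, i):
    # parallel efficiency = speed-up per available core
    return speedup(s, i) / i
```

With s = 4 and i = 3 this reproduces Example 1 (speed-up 2, efficiency .667); with s = 10 it reproduces Example 2 (speed-up 2.5, efficiency .833).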
20 Trade-off between multi-thread speed-up and single-thread performance
Generally there are the following trends: a larger block size gives higher single-thread calculation performance, but a larger block size also makes the number of tasks smaller, and then the load-balancing mechanism does not work well. So single-thread performance favors a large block size with few sub-tasks, while the multi-thread speed-up rate favors a small block size with many sub-tasks.
21 Multi-thread speed-up rate (1/2)
We denote two parameters, p and i: p is the number of physical cores (it does not change within a machine), and i is the number of available CPU cores (it may be changed by other applications). Clearly, i ≤ p. Letting the number of sub-tasks be s, we define the speed-up rate h(i, s) and the parallel efficiency h'(i, s) by
h(i, s) = i · h'(i, s) = (GFLOPS of the multi-thread calculation using i cores) / (GFLOPS of the single-thread calculation).
22 Multi-thread speed-up rate (2/2)
h(i, s) = i · h'(i, s) = (GFLOPS of the multi-thread calculation using i cores) / (GFLOPS of the single-thread calculation).
[Graph: h'(i, s) as a function of s, for i = 2, 3, 4] Can we find the lower bound of these graphs?
23 Theorem about speed-up rate
We proved the following theorem: for every 1 < i ≤ p,
h'(i, s) ≥ f(s) = t / (t + (p − 1)/p), where t = ⌈s/p⌉,
p is the number of physical CPU cores, s is the number of tasks, and i is the number of CPU cores which can be used at that time.
24 Example of speed-up rate
Case p = 4 (in the evaluations we used machines which have 4 physical CPU cores or 4 CPU threads). The bound h'(i, s) ≥ f(s) = t / (t + (p − 1)/p), t = ⌈s/p⌉, holds for every 1 < i ≤ p; for p = 4 it reaches .87 at s = 17. We therefore created 17 or more tasks (we considered 85% to be well parallelized).
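The bound can be evaluated numerically; this is a sketch assuming the reconstruction f(s) = t / (t + (p − 1)/p) with t = ⌈s/p⌉ (the exact form on the original slide may differ slightly):

```python
from math import ceil

def f(s, p=4):
    # Lower bound on the parallel efficiency h'(i, s) for 1 < i <= p,
    # with t = ceil(s / p) tasks per core in the worst case.
    t = ceil(s / p)
    return t / (t + (p - 1) / p)
```

For p = 4 this gives f(16) ≈ 0.84 but f(17) ≈ 0.87, so 17 is the smallest task count that clears the 85% threshold used in the slides.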
25 DL-BLAS overview
STEP 1. Problem: static scheduling is inefficient on personal computers. Our solution: tile the matrices into sub-matrices with some block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Our solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Our solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.
26 Searching space of block size
The basic idea is as follows: inside the well-parallelized area (17 tasks or more), the multi-thread speed-up rates are not so different, so we only search block size parameters in the range s_min ≤ n_b, k_b ≤ s_max, taking s_min = 0.6 × s_max. How do we search within this range?
27 Relation between block sizes and single-thread performance
[Heat map: exhaustive search results of a 1000 by 1000 square matrix multiplication on a Q6600 (Intel Core 2 Quad), single-thread performance over (n_b, k_b), with the highest-performance block size marked]
The actual trend matches the expectation: large block sizes (few tasks) give higher single-thread performance than small block sizes (many tasks).
28 Diagonal Searching Algorithm
Evaluating all parameters s_min ≤ n_b, k_b ≤ s_max needs much time. The Diagonal Searching Algorithm decreases the calculation complexity:
1. Evaluate all parameters on the diagonal s_min ≤ n_b = k_b ≤ s_max, and let j be the highest-performance block size.
2. Search the parameters with n_b = j or k_b = j.
3. Choose the highest-performance block size found in steps 1 and 2.
This algorithm returned the parameters which gave us the highest performance on many architectures (Intel Core 2 Extreme, Intel Core i7, AMD Phenom, AMD Phenom II).
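The three steps can be sketched like this (benchmark is a stand-in for timing a single-thread DGEMM with a given block size; this is an illustration, not the paper's code):

```python
def diagonal_search(s_min, s_max, benchmark):
    # Step 1: evaluate only the diagonal n_b == k_b and take the best j.
    j = max(range(s_min, s_max + 1), key=lambda b: benchmark(b, b))
    best, best_score = (j, j), benchmark(j, j)
    # Step 2: evaluate the cross n_b == j or k_b == j.
    for b in range(s_min, s_max + 1):
        for cand in ((j, b), (b, j)):
            score = benchmark(*cand)
            if score > best_score:
                best, best_score = cand, score
    # Step 3: return the best block size seen in steps 1 and 2.
    return best
```

This needs O(s_max − s_min) benchmark runs instead of the O((s_max − s_min)²) runs of an exhaustive search over the square.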
29 DL-BLAS overview
STEP 1. Problem: static scheduling is inefficient on personal computers. Our solution: tile the matrices into sub-matrices with some block size and use dynamic load balancing.
STEP 2. Problem: we have to decide the block size when the problem size is fixed. Our solution: use the Diagonal Searching Algorithm.
STEP 3. Problem: we want efficient parameters for every problem size. Our solution: use the Reductive Searching Algorithm and the Parameter Selection Algorithm.
30 Relation between block size and problem size
The multi-thread speed-up rate changes when the problem size changes: the well-parallelized area for a larger problem extends to larger block sizes than the well-parallelized area for a smaller problem, while the single-thread performance curve stays the same.
31 Reductive Searching Algorithm
We want to get multiple [n_b, k_b] pairs, from a large parameter pair [n_1, k_1] down to small ones. Algorithm:
1. Initialize [s_min, s_max] to [150, 250], and a = 1.
2. In the range [s_min, s_max], run the Diagonal Searching Algorithm, and let the solution be [n_a, k_a].
3. If the single-thread calculation is faster than the multi-thread calculation, halt the algorithm.
4. Let [s_min, s_max] = [0.5 × n_a, 0.9 × n_a] and add [n_a, k_a] to the result-list.
5. a := a + 1 and go to step 2.
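The loop above can be sketched as follows (run_diagonal and single_thread_is_faster are stand-ins for the measurements described in the steps; the initial range [150, 250] follows the slide):

```python
def reductive_search(run_diagonal, single_thread_is_faster,
                     s_min=150, s_max=250):
    # run_diagonal(s_min, s_max) -> (n_a, k_a) is a stand-in for the
    # Diagonal Searching Algorithm; single_thread_is_faster(n_a, k_a)
    # is a stand-in for the stopping test in step 3.
    result_list = []
    while True:
        n_a, k_a = run_diagonal(s_min, s_max)
        if single_thread_is_faster(n_a, k_a):
            return result_list          # largest block sizes first
        result_list.append((n_a, k_a))
        s_min, s_max = int(0.5 * n_a), int(0.9 * n_a)
```

Because each new search window is [0.5 n_a, 0.9 n_a], the recorded block sizes shrink geometrically, so the loop terminates after a modest number of rounds.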
32 Parameter Selection
For each problem C := αAB + βC (C is m-by-n), we select one pair of parameters (n_a, k_a) from the result-list, which is sorted from the largest pair to the smallest, with the following algorithm:
for i = 1 to length(result-list):
  num-tasks = ⌈n / n_i⌉ × ⌈m / n_i⌉
  if num-tasks > 16: return (n_i, k_i)
return single-thread calculation
That is, we select the largest pair of parameters which makes more than 16 tasks.
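The selection loop, as a runnable sketch (result_list is assumed sorted largest-first, as produced by Reductive Searching; the names are illustrative):

```python
from math import ceil

def select_parameters(result_list, m, n):
    # Walk from the largest block size down and pick the first one that
    # still yields more than 16 tasks; None means "fall back to a
    # single-threaded calculation".
    for n_i, k_i in result_list:
        num_tasks = ceil(n / n_i) * ceil(m / n_i)
        if num_tasks > 16:
            return (n_i, k_i)
    return None
```

With a QX9650-style result-list, a 1000 by 1000 problem keeps the largest pair, while a 300 by 300 problem drops to a much smaller block size, matching the per-dimension tables shown later.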
33 Algorithms work-flow
Installation time: prepare the hardware and the BLAS routine, tune DL-BLAS using the Diagonal Searching Algorithm and the Reductive Searching Algorithm, and store the result-list on the disk.
Calculation time: when the DL-BLAS routines are called, get the result-list from the disk, decide the block size using the Parameter Selection Algorithm, and calculate.
34 Performance Evaluation Results
35 Algorithms work-flow (recap)
Installation time: prepare the hardware and the BLAS routine, tune DL-BLAS using the Diagonal Searching Algorithm and the Reductive Searching Algorithm, and store the result-list on the disk.
Calculation time: when the DL-BLAS routines are called, get the result-list from the disk, decide the block size using the Parameter Selection Algorithm, and calculate.
36 Hardware and problems
We used these CPU models: Intel Core 2 Extreme QX9650, Intel Core i7 965, Intel Atom 330, AMD Phenom 9600, AMD Phenom II X4 940.
First, we evaluate the Diagonal Searching Algorithm using 1000 by 1000 square matrix multiplications.
37 Results of Diagonal Searching Algorithm
Intel Core 2 Extreme (QX9650): [heat map of single-thread GFLOPS over (n_b, k_b)] the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
Intel Atom 330: the result of the Diagonal Searching Algorithm gives 1.69 GFLOPS, while the highest-performance parameters give 1.7 GFLOPS.
38 Results of Diagonal Searching Algorithm
Intel Core i7 965 with hyper-threading: the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
Intel Core i7 965 without hyper-threading: the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
39 Results of Diagonal Searching Algorithm
AMD Phenom 9600: the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
AMD Phenom II X4 940: the result of the Diagonal Searching Algorithm equals the highest-performance parameters.
40 Results of Reductive Searching Algorithm
Intel Core 2 Extreme (QX9650). The result-list, from large to small, is: (224, 168), (168, 168), (112, 98), (56, 56), (48, 50), (24, 24), (16, 20), (12, 13). This calculation needs less than half an hour.
41 Results of Parameter Selection Algorithm
Intel Core 2 Extreme (QX9650). In square matrix multiplication, the following parameters (n_b, k_b) are used per dimension:
up to 48: single thread; 49~64: (12, 13); 65~96: (16, 20); 97~192: (24, 24); 193~224: (48, 50); 225~448: (56, 56); 449~672: (112, 98); 673~896: (168, 168); 897 and up: (224, 168).
42 Results of Reductive Searching Algorithm
AMD Phenom II X4 940. The result-list, from large to small, is: (206, 200), (172, 120), (132, 120), (104, 80), (92, 80), (64, 40), (56, 56), (41, 40). This calculation needs less than half an hour.
43 Results of Parameter Selection Algorithm
AMD Phenom II X4 940. In square matrix multiplication, the following parameters (n_b, k_b) are used per dimension:
up to 164: single thread; 165~224: (41, 40); 225~256: (56, 56); 257~368: (64, 40); 369~416: (92, 80); 417~528: (104, 80); 529~684: (132, 120); 685~824: (172, 120); 825 and up: (206, 200).
44 Evaluation machines
We solved square matrix multiplication problems with the following hardware and software. CPU models: Intel Core 2 Extreme QX9650, Intel Core i7 965, Intel Atom 330, AMD Phenom 9600, AMD Phenom II X4 940. BLAS implementations: DL-BLAS, ATLAS, and GotoBLAS (version 1.26).
45 Evaluation circumstances
We make the following three circumstances:
1. No other tasks are running (no-task case).
2. A busy-loop program runs concurrently with the DL-BLAS calculation (busy-loop case): while(true) { busy loop }.
3. An inner-product (dot-product) program, which accesses memory, runs concurrently (inner-product case): while(true) { ip = 0; for i = 1 to n: ip += a[i] * b[i] }.
46 DGEMM calculation on QX9650 (Intel Core 2 Extreme)
GotoBLAS is the fastest in the no-task case, but the slowest in the busy-loop and inner-product cases. DL-BLAS (with ATLAS) is faster than ATLAS in all cases, and is the fastest in the busy-loop and inner-product cases. [Graphs: performance (GFLOPS) vs. matrix size for the no-task, busy-loop, and inner-product cases]
47 DGEMM calculation on 965 (Intel Core i7) with hyper-threading
GotoBLAS is not so fast on this architecture; K. Goto, the developer of GotoBLAS, said that hyper-threading is harmful. ATLAS is the fastest in some cases: middle-sized problems in the busy-loop and inner-product cases. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
49 DGEMM calculation on 965 (Intel Core i7) without hyper-threading
GotoBLAS is the fastest in the no-task case. In the other cases the performance of GotoBLAS is unstable: the other tasks disturb GotoBLAS. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
50 DGEMM calculation on Intel Atom 330 with hyper-threading
DL-BLAS is faster in the busy-loop and inner-product cases. The performance of DL-BLAS is not faster than ATLAS in the no-task case; we consider that DL-BLAS is disturbed by hyper-threading. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
51 DGEMM calculation on AMD Phenom 9600
GotoBLAS is the fastest in the no-task case. DL-BLAS is the fastest for small problems in the busy-loop and inner-product cases; for large problems, the performances are not so different. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
52 DGEMM calculation on X4 940 (AMD Phenom II)
GotoBLAS is a little faster than ATLAS and DL-BLAS in the no-task case. The performance of DL-BLAS is similar to that of ATLAS. [Graphs: performance (GFLOPS) vs. matrix size for the three cases]
53 Conclusion
We implemented a BLAS implementation which uses dynamic load balancing; we call it DL-BLAS. Our implementation needs auto-tuning techniques to get the block size parameters; we call these techniques the Diagonal Searching Algorithm, the Reductive Searching Algorithm, and Parameter Selection. In some cases the performance of DL-BLAS was better than that of ATLAS and GotoBLAS, and the performance of DL-BLAS is consistently reasonable.
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationParallel Algorithms. 6/12/2012 6:42 PM gdeepak.com 1
Parallel Algorithms 1 Deliverables Parallel Programming Paradigm Matrix Multiplication Merge Sort 2 Introduction It involves using the power of multiple processors in parallel to increase the performance
More informationA GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang
A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang University of Massachusetts Amherst Introduction Singular Value Decomposition (SVD) A: m n matrix (m n) U, V: orthogonal
More informationAchieve Better Performance with PEAK on XSEDE Resources
Achieve Better Performance with PEAK on XSEDE Resources Haihang You, Bilel Hadri, Shirley Moore XSEDE 12 July 18 th 2012 Motivations FACTS ALTD ( Automatic Tracking Library Database ) ref Fahey, Jones,
More informationIssues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM
Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.
More informationDynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection
Numerical Libraries in the DOE ACTS Collection The DOE ACTS Collection SIAM Parallel Processing for Scientific Computing, Savannah, Georgia Feb 15, 2012 Tony Drummond Computational Research Division Lawrence
More informationA MATLAB Interface to the GPU
Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further
More informationLevel-3 BLAS on the TI C6678 multi-core DSP
Level-3 BLAS on the TI C6678 multi-core DSP Murtaza Ali, Eric Stotzer Texas Instruments {mali,estotzer}@ti.com Francisco D. Igual Dept. Arquitectura de Computadores y Automática Univ. Complutense de Madrid
More informationTechnical Report, CALDGEMM and HPL
Technical Report, CALDGEMM and HPL David Rohr, Matthias Kretz, Matthias Bach Frankfurt Institute for Advanced Studies, University of Frankfurt, Germany December 9, 21 Abstract The LOEWE-CSC cluster at
More informationBlocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.
Blocked Schur Algorithms for Computing the Matrix Square Root Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui 2013 MIMS EPrint: 2012.26 Manchester Institute for Mathematical Sciences School of Mathematics
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationLet s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow.
Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Big problems and Very Big problems in Science How do we live Protein
More informationPerformance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors
Performance Evaluation of BLAS on a Cluster of Multi-Core Intel Processors Mostafa I. Soliman and Fatma S. Ahmed Computers and Systems Section, Electrical Engineering Department Aswan Faculty of Engineering,
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationWorkshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview
Workshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview Stefano Cozzini CNR/INFM Democritos and SISSA/eLab cozzini@democritos.it Agenda Tools for
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationHigh-Performance Libraries and Tools. HPC Fall 2012 Prof. Robert van Engelen
High-Performance Libraries and Tools HPC Fall 2012 Prof. Robert van Engelen Overview Dense matrix BLAS (serial) ATLAS (serial/threaded) LAPACK (serial) Vendor-tuned LAPACK (shared memory parallel) ScaLAPACK/PLAPACK
More informationSome notes on efficient computing and high performance computing environments
Some notes on efficient computing and high performance computing environments Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public
More informationDealing with Asymmetry for Performance and Energy Efficiency
Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationNEW ADVANCES IN GPU LINEAR ALGEBRA
GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear
More informationContinuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting
Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars International Symposium on Microarchitecture
More informationAn Overview of High Performance Computing and Challenges for the Future
An Overview of High Performance Computing and Challenges for the Future Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 6/15/2009 1 H. Meuer, H. Simon, E. Strohmaier,
More informationA GPU Sparse Direct Solver for AX=B
1 / 25 A GPU Sparse Direct Solver for AX=B Jonathan Hogg, Evgueni Ovtchinnikov, Jennifer Scott* STFC Rutherford Appleton Laboratory 26 March 2014 GPU Technology Conference San Jose, California * Thanks
More informationDAG-Scheduled Linear Algebra Using Template-Based Building Blocks
DAG-Scheduled Linear Algebra Using Template-Based Building Blocks Jonathan Hogg STFC Rutherford Appleton Laboratory 1 / 20 19 March 2015 GPU Technology Conference San Jose, California * Thanks also to
More informationFaster Code for Free: Linear Algebra Libraries. Advanced Research Compu;ng 22 Feb 2017
Faster Code for Free: Linear Algebra Libraries Advanced Research Compu;ng 22 Feb 2017 Outline Introduc;on Implementa;ons Using them Use on ARC systems Hands on session Conclusions Introduc;on 3 BLAS Level
More informationThe Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.
TITLE Basic Linear Algebra Subprograms BYLINE Robert A. van de Geijn Department of Computer Science The University of Texas at Austin Austin, TX USA rvdg@cs.utexas.edu Kazushige Goto Texas Advanced Computing
More informationImplementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS
Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS Jianyu Huang, Leslie Rice Joint work with Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn BLIS Retreat 2016 *Overlook of
More informationPerformance impact of dynamic parallelism on different clustering algorithms
Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationMICROPROCESSOR ARCHITECTURE
MICROPROCESSOR ARCHITECTURE UOP S.E.COMP (SEM-I) MULTICORE DESIGN Prof.P.C.Patil Department of Computer Engg Matoshri College of Engg.Nasik pcpatil18@gmail.com. History 2 History The most important part
More information1. INTRODUCTION. ACM Transactions on Embedded Computing Systems, Vol. -, No. -, Article -, Publication date: 2011.
- Exploiting Parallelism in Matrix-Computation Kernels for Symmetric Multiprocessor Systems Matrix-Multiplication and Matrix-Addition Algorithm Optimizations by Software Pipelining and Threads Allocation
More informationCOSC6365. Introduction to HPC. Lecture 21. Lennart Johnsson Department of Computer Science
Introduction to HPC Lecture 21 Department of Computer Science Most slides from UC Berkeley CS 267 Spring 2011, Lecture 12, Dense Linear Algebra (part 2), Parallel Gaussian Elimination. Jim Demmel Dense
More informationSummer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics
Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics Moysey Brio & Paul Dostert July 4, 2009 1 / 18 Sparse Matrices In many areas of applied mathematics and modeling, one
More informationCache Oblivious Dense and Sparse Matrix Multiplication Based on Peano Curves
Cache Oblivious Dense and Sparse Matrix Multiplication Based on eano Curves Michael Bader and Alexander Heinecke Institut für Informatik, Technische Universitüt München, Germany Abstract. Cache oblivious
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationTowards ABFT for BLIS GEMM
Towards ABFT for BLIS GEMM FLAME Working Note #76 Tyler M. Smith Robert A. van de Geijn Mikhail Smelyanskiy Enrique S. Quintana-Ortí Originally published June 13, 2015 Revised November 5, 2015 Abstract
More informationResources for parallel computing
Resources for parallel computing BLAS Basic linear algebra subprograms. Originally published in ACM Toms (1979) (Linpack Blas + Lapack). Implement matrix operations upto matrix-matrix multiplication and
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationHypercubes. (Chapter Nine)
Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is
More informationConcurrent Programming with OpenMP
Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed
More informationAlgorithms for Dense Matrices II
Algorithms for Dense Matrices II Stefan Lang Interdisciplinary Center for Scientific Computing (IWR) University of Heidelberg INF 368, Room 532 D-69120 Heidelberg phone: 06221/54-8264 email:stefan.lang@iwr.uni-heidelberg.de
More information