CSCI-580 Advanced High Performance Computing

1 CSCI-580 Advanced High Performance Computing
Performance Hacking: Matrix Multiplication
Bo Wu, Colorado School of Mines
Most content of the slides is from Saman Amarasinghe (MIT)

2 Square-Matrix Multiplication

3 An Intel machine used at MIT

6 Triply nested loop in Python

    n = 4096
    A = initmat(n)
    B = initmat(n)
    C = initmat(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
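The loop on this slide can be tried out with a self-contained sketch like the one below. The slide does not show `initmat`, so the version here (random floats in a list of lists) is an assumption, as are the function name `matmul_naive`, the zero-initialized C, and the small `n` chosen to keep a pure-Python run short.

```python
import random
import time


def initmat(n):
    """Hypothetical stand-in for the slide's helper: n-by-n list of lists of random floats."""
    return [[random.random() for _ in range(n)] for _ in range(n)]


def matmul_naive(A, B):
    """Triply nested loop in i, j, k order; returns a new matrix C = A * B."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


if __name__ == "__main__":
    n = 64  # the slide uses n = 4096; a small n keeps the run short
    A, B = initmat(n), initmat(n)
    start = time.perf_counter()
    C = matmul_naive(A, B)
    elapsed = time.perf_counter() - start
    # Each innermost step does one multiply and one add: 2 * n**3 operations in total.
    print(f"n={n}: {elapsed:.3f} s, {2 * n**3 / elapsed / 1e9:.4f} GFLOPS")
```

Timing the same function for a few values of `n` gives a rough feel for the cubic growth before moving to the compiled versions.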

10 Maybe Java can be faster

13 Why I love C

14 Where we stand so far

16 Interpreter and JIT (Just-In-Time compilation)
o An interpreter interprets one statement at a time
o JIT compilers are everywhere
  When execution hits a new method, check whether it is already compiled.
  If it is already jitted, execute it directly.
  If not, compile it and generate a list of machine instructions.
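The check-then-compile rule described above can be modeled with a small cache. This is only a toy sketch: the names `compiled_cache`, `jit_compile`, and `call_method` are made up for illustration, Python's `exec` stands in for machine-code generation, and real JIT compilers typically add policies (such as compiling only hot methods) that are not shown here.

```python
compiled_cache = {}  # method name -> compiled callable (stand-in for machine code)
compile_count = 0    # how many times the "compiler" actually ran


def jit_compile(name, source):
    """Stand-in for a JIT back end: turn source text into a callable once."""
    global compile_count
    compile_count += 1
    namespace = {}
    exec(source, namespace)  # plays the role of code generation in this toy model
    return namespace[name]


def call_method(name, source, *args):
    """Reuse the compiled version if one exists; otherwise compile and cache it."""
    if name not in compiled_cache:  # not yet jitted
        compiled_cache[name] = jit_compile(name, source)
    return compiled_cache[name](*args)  # already jitted: execute directly


SQUARE_SRC = "def square(x):\n    return x * x\n"
first = call_method("square", SQUARE_SRC, 7)   # compiles, then runs
second = call_method("square", SQUARE_SRC, 9)  # runs the cached version
```

After both calls, `compile_count` is still 1: the second call pays no compilation cost, which is the point of caching jitted methods.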

17 Optimization switches

18 The question to ask

19 Performance counters tell more
o Performance counters: a set of special-purpose registers built into modern microprocessors to store the counts of hardware-related activities within computer systems
  Cache misses, committed instructions, memory bandwidth, branch misses, etc.
o For the C version of matrix multiplication
  # of L3 references: 34,320,418,733
  # of L3 misses: 34,042,409,392
  L3 hit ratio: 0.81%
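The hit ratio follows directly from the two counter values on the slide: hits are references minus misses, and the ratio is hits divided by references. A short check of that arithmetic (the variable names are illustrative):

```python
l3_references = 34_320_418_733  # from the slide
l3_misses = 34_042_409_392      # from the slide

l3_hits = l3_references - l3_misses
hit_ratio = l3_hits / l3_references
miss_ratio = l3_misses / l3_references

print(f"L3 hits:       {l3_hits:,}")
print(f"L3 hit ratio:  {hit_ratio:.2%}")
print(f"L3 miss ratio: {miss_ratio:.2%}")
```

In other words, almost every L3 reference goes on to main memory, which is what motivates the locality discussion that follows.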

20 Poor locality

21 Data transpose
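The body of this slide did not survive extraction, so the following is a sketch of the general idea only, not the slide's code. In the i, j, k loop, `B[k][j]` walks down a column of B; multiplying against a transposed copy lets the inner loop read two rows instead. The function names are assumptions, and Python lists of lists are not laid out like C arrays, so this shows the access pattern but not the cache effect.

```python
def transpose(M):
    """Return the transpose of a square list-of-lists matrix."""
    n = len(M)
    return [[M[j][i] for j in range(n)] for i in range(n)]


def matmul_transposed(A, B):
    """Multiply A by B using a transposed copy of B, so the inner loop reads rows only."""
    n = len(A)
    Bt = transpose(B)  # one-time O(n^2) cost, amortized over O(n^3) work
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_a = A[i]
        for j in range(n):
            row_bt = Bt[j]  # row j of Bt is column j of B
            acc = 0.0
            for k in range(n):
                acc += row_a[k] * row_bt[k]
            C[i][j] = acc
    return C
```

In a row-major C array the same change turns the stride-n accesses to B into unit-stride accesses.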

23 Warning: math is coming

24 Data reuse

25 Data reuse (continued)

26 Further decomposing tile computation
C(1,1) = C(1,1) + A(1,1) * B(1,1)
C(1,1) = C(1,1) + A(1,2) * B(2,1)
C(1,1) = C(1,1) + A(1,3) * B(3,1)

27 Further decomposing tile computation
C(1,2) = C(1,2) + A(1,1) * B(1,2)
C(1,2) = C(1,2) + A(1,2) * B(2,2)
C(1,2) = C(1,2) + A(1,3) * B(3,2)
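Each tile of C is built up by a sequence of tile products, one per tile along the shared dimension. A minimal sketch of that scheme in loop form follows; the function name `matmul_tiled` and the `tile` parameter are assumptions for illustration, and a real implementation would pick the tile size to fit the cache.

```python
def matmul_tiled(A, B, tile=2):
    """Blocked multiply: C(I,J) += A(I,K) * B(K,J) for every tile triple (I, J, K)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # tile row of C and A
        for jj in range(0, n, tile):      # tile column of C and B
            for kk in range(0, n, tile):  # tile along the shared dimension
                # Multiply one tile of A by one tile of B into one tile of C.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

The `min(...)` bounds let the sketch handle sizes that are not a multiple of the tile size; with n divisible by `tile` they reduce to plain `ii + tile` style limits.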

28 Tiling

30 Performance of tiling

31 Divide and conquer

32 Divide and conquer (continued)

33 Performance of D&C

37 Function-call overhead
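The slide images are missing here, so what follows is a sketch of the general technique rather than the slide's code. A recursive multiply splits each matrix into four quadrants and makes eight recursive calls; recursing all the way to 1-by-1 blocks spends most of the time on calls, so the base case is coarsened to a loop over small blocks. All names and the `threshold` parameter are assumptions, and n is assumed to be a power of two with `threshold >= 1`.

```python
def matmul_base(C, A, B, ci, cj, ai, aj, bi, bj, n):
    """Plain loops on an n-by-n block: C block += A block * B block."""
    for i in range(n):
        for j in range(n):
            acc = C[ci + i][cj + j]
            for k in range(n):
                acc += A[ai + i][aj + k] * B[bi + k][bj + j]
            C[ci + i][cj + j] = acc


def matmul_dac(C, A, B, ci, cj, ai, aj, bi, bj, n, threshold=2):
    """Recursive C += A * B on n-by-n blocks (n a power of two).

    Blocks of size <= threshold go to the loop-based base case, so the
    cost of each function call is spread over more arithmetic.
    """
    if n <= threshold:
        matmul_base(C, A, B, ci, cj, ai, aj, bi, bj, n)
        return
    h = n // 2
    for di in (0, h):          # row half of C and A
        for dj in (0, h):      # column half of C and B
            for dk in (0, h):  # half of the shared inner dimension
                matmul_dac(C, A, B, ci + di, cj + dj,
                           ai + di, aj + dk, bi + dk, bj + dj, h, threshold)


def matmul_recursive(A, B, threshold=2):
    """Allocate a zero matrix C and fill it with A * B by divide and conquer."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    matmul_dac(C, A, B, 0, 0, 0, 0, 0, 0, n, threshold)
    return C
```

With `threshold=1` the recursion bottoms out at single elements, which is the uncoarsened case; raising the threshold trades recursion depth for longer loops.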

38 Performance of coarsening + transpose

40 Vectorization
o Each core of the computer has 8 vector units, which can initiate 8 floating-point operations on each cycle using a single vector instruction
o Interchange these two loops
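The code that the "interchange these two loops" note pointed at is not in the transcript. Assuming it refers to the j and k loops of the i, j, k baseline, the interchanged order looks like the sketch below: the innermost loop then applies the same multiply-add to consecutive elements of one row of B and one row of C, which is the pattern a C compiler can map onto vector instructions. Pure Python will not vectorize this; the sketch only shows the loop order, and the name `matmul_ikj` is illustrative.

```python
def matmul_ikj(A, B):
    """Loop order i, k, j: the innermost loop walks row k of B and row i of C."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_c = C[i]
        for k in range(n):
            a_ik = A[i][k]  # invariant across the inner loop
            row_b = B[k]
            for j in range(n):
                row_c[j] += a_ik * row_b[j]  # unit-stride accesses, same operation for every j
    return C
```

The result is the same as with the i, j, k order up to floating-point rounding, since only the order of the additions changes.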

42 Vectorization

43 Parallel loops

44 Recursive parallel matrix multiply

45 Parallel-loops performance

46 Unportable performance

47 Final reckoning

48 Programming language popularity

49 Faster Python
o NumPy: an extension to Python for fast mathematical operations

50 Faster Python
o PyPy: an interpreter and JIT written in Python

51 JIT language is slower?
