Various optimization and performance tips for processors


Slide 1: Various optimization and performance tips for processors
Kazushige Goto, Texas Advanced Computing Center
2006/12/7

Slide 2: Contents
- Introducing myself
- Merits and demerits of optimization
- How to avoid losing performance (rather than improving it)
- Explanation of each tip
- GotoBLAS tutorial (Monday 10:30 to 11:30)

Slide 3: Four years ago
- I was a patent examiner at the JPO and got a chance to study abroad
- I had to find research groups, but no one except UT responded to my request
- I developed DGEMM for the Pentium 4, and it was used at the University at Buffalo, NY

Slide 4: One regret
- Wrong naming: I never thought it would become one of the major BLAS libraries
- Over 4,500 users have registered by now; I don't know the number of downloads
- No one can pronounce my name

Slide 5: My questions about R
- Do you need high-precision floating point operations?
- 64-bit? Or do you need 128-bit? Or are integer operations enough?
- An 80-bit FP BLAS?

Slide 6: Standard optimization
1. Compiler optimization is good enough if the data are in the L1 cache
2. Your job is to manage the data and move them into the L1 cache
3. Please don't expect too much: neither good nor bad performance, and the L1 cache is very small
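As a rough illustration of point 2 above (not from the slides), here is a minimal C sketch that walks a matrix-vector product through x in L1-sized blocks, so each chunk of x is loaded from memory once and then reused by every row; BLOCK is an assumed tile size, not a number from the talk.

```c
/* Hedged sketch (not from the slides): blocking a matrix-vector product
 * so that a chunk of x stays resident in the L1 cache while every row
 * reuses it.  BLOCK is a hypothetical tile size. */
#include <stddef.h>

#define BLOCK 256   /* 256 doubles = 2 KB, assumed to fit easily in L1 */

void dgemv_blocked(size_t m, size_t n, const double *a, size_t lda,
                   const double *x, double *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;

    /* Walk x in L1-sized chunks; each chunk is reused by all m rows. */
    for (size_t j0 = 0; j0 < n; j0 += BLOCK) {
        size_t jmax = (j0 + BLOCK < n) ? j0 + BLOCK : n;
        for (size_t i = 0; i < m; i++) {
            double sum = 0.0;
            for (size_t j = j0; j < jmax; j++)
                sum += a[i * lda + j] * x[j];
            y[i] += sum;
        }
    }
}
```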

Slide 7: Advanced optimization
1. Bandwidth-aware programming
2. Separate the important functions (we call them kernels)
3. Write the kernel in assembler
4. The compiler's code is not good enough
- Much better performance; the L2 cache is very large

Slide 8: Who determines performance?
- Everyone asks me how to improve performance
- Actually, performance works like a demerit-point system:
1. Start from 100%
2. Something gets in the way
3. Performance drops
- We have to get rid of all of the problems

Slide 9: Can I improve performance?
It depends on your bottleneck:
- I/O: no hope
- Main memory: up to 2x
- Cache memory: up to 6x
- Instruction scheduling: up to 10x
- Integer operations in particular can be improved by up to 100x!
- Of course, it's very, very difficult

Slide 10: Side effects of optimization
- The effort comes to nothing if the algorithm is changed
- Performance comparisons: optimization can reverse the ranking, so people will misjudge which approach is better
- We need fair optimization and fair comparison:
- Good algorithm + non-optimized coding
- Bad algorithm + optimized coding

Slide 11: Bunch of mines
- Operating system
- Memory
- Instruction scheduling
- Floating point exceptions
- Synchronization cost on SMP

Slide 12: Biggest bottleneck: the human
- The algorithm is always the first priority!
- Humans should understand that the computer is far from perfect:
- The computer loves simple work
- The computer hates exceptions and interrupts
- Optimization is the last resort

Slide 13: Operating system - process scheduling
- Generally a process can't use 100% of the CPU cycles:
- Interrupt handling
- Process scheduling, especially with many active processes
- Timer frequency problem: the Linux default is 1000 Hz; you may change it to 100 Hz

Slide 14: Operating system - memory management
- Very important for performance
- Have you ever seen performance variations? Slow, fast, slow...
- They are caused by physically non-contiguous memory mappings
- Average performance is nonsense
- The user can't control it

Slide 15: Memory mapping
[Diagram: pages of contiguous virtual memory mapped onto non-contiguous physical memory]
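The latency and bandwidth plots that follow compare HugeTLB-backed and plain mmap allocations. Below is a minimal Linux sketch of asking for huge-page-backed memory; note that MAP_HUGETLB is a later kernel flag (in 2006 the same effect was reached through hugetlbfs), so treat this purely as an illustration.

```c
/* Hedged sketch: requesting huge-page-backed memory on Linux, which gives
 * larger physically contiguous regions and fewer TLB misses.  Falls back
 * to ordinary pages when huge pages are unavailable. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

void *alloc_buffer(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* No huge pages configured: fall back to normal pages. */
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? NULL : p;
}
```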

Slide 16: Performance variations
[Chart: "Performance Variations (PPC970)" - MFlops over successive iterations, comparing normal performance, performance with HugeTLB, and the number of L2 conflicts]

Slide 17: Operating system - frequency throttling
- Recent CPUs can scale their frequency to reduce power consumption
- Very slow at the beginning of a benchmark
- You can check the sysfs entries: /sys/devices/system/cpu/cpu?/cpufreq/scaling_min_freq
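A small, hypothetical helper (not from the slides) for checking those cpufreq files before timing anything, assuming a Linux system that exposes them:

```c
/* Hedged sketch: read the cpufreq limits the slide points at, so a
 * benchmark can warn that its first iterations may run throttled. */
#include <stdio.h>

static long read_khz(int cpu, const char *file)
{
    char path[128];
    long khz = -1;
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/%s", cpu, file);
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &khz) != 1)
            khz = -1;
        fclose(f);
    }
    return khz;
}

int main(void)
{
    long min = read_khz(0, "scaling_min_freq");
    long max = read_khz(0, "scaling_max_freq");
    if (min > 0 && max > 0 && min < max)
        printf("cpu0 may start throttled: %ld kHz (min) vs %ld kHz (max)\n",
               min, max);
    return 0;
}
```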

Slide 18: Throttle performance
[Chart: "CPU Frequency Throttling" - MFlops vs. matrix order, normal vs. throttled]

Slide 19: Memory issues
- Page faults (low amount of memory)
- TLB misses
- Narrow bandwidth
- Cache misses
- Large latency
- Cache bank conflicts
- Unaligned traps

Slide 20: Memory latency
[Chart: "Memory Latency on Opteron" - cycles vs. vector size (KB), HugeTLB vs. mmap]

Slide 21: Memory bandwidth
[Chart: "Memory Bandwidth on Opteron" - doubles/cycle vs. matrix order, HugeTLB vs. mmap]

Slide 22: Instruction cache
- Simple sin() benchmark
- The first call costs far too much
- More than 3 calls are needed before performance is good
[Table: cost per call over successive iterations on Itanium and Pentium 4]

Slide 23: Instruction scheduling (skipped)
- Decoding bottleneck
- Scheduling rules
- Complex dependencies
- Integer divide and remainder
- Each architecture has its own characteristics: a deep world

Slide 24: Floating point exceptions
- Subnormal
- Overflow
- Underflow
- +Infinity, -Infinity
- NaN (Not a Number)
- Division by zero

Slide 25: Strange initialization by a "great" user
- Inf * 0 is actually NaN, not zero
- Some users call SCAL (one of the BLAS functions) with alpha = 0 to initialize a matrix
- BLAS (only my BLAS?) doesn't take the special cases of IEEE 754 into account
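A minimal sketch of the pitfall (the scal loop here is illustrative, not GotoBLAS code): with alpha = 0, any Inf or NaN already sitting in the buffer comes out as NaN instead of zero.

```c
/* Hedged sketch: "initializing" memory by scaling it with alpha = 0 only
 * works if every entry is already finite, because 0.0 * Inf and 0.0 * NaN
 * are both NaN under IEEE 754. */
#include <math.h>
#include <stdio.h>

static void scal_naive(int n, double alpha, double *x)
{
    for (int i = 0; i < n; i++)
        x[i] *= alpha;                      /* does NOT force zero */
}

int main(void)
{
    double x[3] = { 1.0, INFINITY, NAN };   /* e.g. uninitialized garbage */
    scal_naive(3, 0.0, x);
    printf("%g %g %g\n", x[0], x[1], x[2]); /* prints 0 followed by two NaNs */
    return 0;
}
```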

Slide 26: Floating point exception cost
Relative cost (normal operand = 1), per architecture (Pentium 4, Core, Opteron, Itanium, POWER5):
- Subnormal operands: very expensive (e.g. Opteron 41x, POWER5 9x)
- Infinity, NaN, overflow, underflow: roughly 1x

Slide 27: Calculation order
- The associativity problem: (A + B) + C != A + (B + C) in floating point
- Optimization requires changing the order of calculation
- The order depends on the architecture
- It is really difficult to get the same result across architectures
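A tiny, self-contained demonstration (not from the slides) of the non-associativity the slide refers to; when cancellation is involved, regrouping changes the result by far more than one last bit.

```c
/* Hedged sketch: floating point addition is not associative, so an
 * optimization that regroups a sum can change the answer. */
#include <stdio.h>

int main(void)
{
    double a = 1.0e16, b = -1.0e16, c = 1.0;
    printf("(a + b) + c = %.17g\n", (a + b) + c);  /* prints 1 */
    printf("a + (b + c) = %.17g\n", a + (b + c));  /* prints 0: c is absorbed */
    return 0;
}
```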

Slide 28: Function call overhead
- Spill operations (saving/restoring register values)
- A big hidden bottleneck
- Try a static inline function if the function is very small and doesn't contain other function calls
- Use -fno-inline if you use a profiling option
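A hedged sketch of the suggestion: a tiny leaf routine marked static inline so the compiler can drop the call and the register spills around it (the names are made up for illustration).

```c
/* Hedged sketch: a small leaf function with no further calls is a good
 * static inline candidate; the hot loop then pays no call overhead. */
static inline double axpy_elem(double alpha, double x, double y)
{
    return alpha * x + y;
}

void axpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = axpy_elem(alpha, x[i], y[i]);   /* expanded in place */
}
```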

Slide 29: System call overhead
- Different from a normal function call
- System calls: mmap/munmap, shared memory, writing to/reading from files, signaling
- malloc is not a system call
- Output to stderr is unbuffered!

Slide 30: Example: DDOT (double precision dot product)
- I won't explain how to optimize it
- Please understand that the calculation order inside the ddot function, and thus the result, may vary with:
- The unrolling type
- SSE or other SIMD operations
- Aligned/unaligned handling
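To make the calculation-order point concrete, here is a hedged sketch (not the GotoBLAS kernel) of a reference ddot next to a 4-way unrolled one: the unrolled version keeps four partial sums and combines them at the end, so its additions are grouped differently and the result can differ in the last bits.

```c
/* Hedged sketch: two mathematically identical dot products whose results
 * can differ because the additions are grouped differently. */
double ddot_ref(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];                   /* one accumulator, in order */
    return s;
}

double ddot_unroll4(int n, const double *x, const double *y)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {       /* four independent accumulators
                                               hide the FP add latency */
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i] * y[i];                  /* leftover elements */
    return (s0 + s1) + (s2 + s3);           /* grouped differently from ddot_ref */
}
```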

Slide 31: DDOT in R
- The test suite failed with my BLAS
- Actually, my BLAS was sane; the problem was:
1. R has its own ddot function
2. It uses the x87 FP stack (80-bit precision)
3. My BLAS uses SSE2 (64-bit precision)
4. The results are fairly different

Slide 32: Reason
- An intermediate result was close to zero
- 80-bit FP can hold that small value; 64-bit FP can't
- BLAS can't avoid this: BLAS changes the calculation order to get better performance

Slide 33: DDOT data in R
[Listing: the ten input pairs X[0..9], Y[0..9] from the failing test, values of magnitude around 1e+00 and 1e-01]
- In total there are 10! = 3,628,800 possible orderings for the add operations

Slide 34: How the results vary
Minimum and maximum DDOT results by precision (32-bit, 64-bit, 80-bit, and 64-bit with sort(*)): the spread between min and max shrinks from roughly 1e-07 in 32-bit down to roughly 1e-15 for 64-bit with sorting.
(*) Sort in ascending order of absolute value, then add.

Slide 35: Why is the calculation order different?
We have to hide instruction latency:
- Itanium2: 8-way unrolling
- POWER5: 16-way unrolling
- Pentium 4 with SSE2: 8-way unrolling
- Pentium 4 with x87: 4-way unrolling
- SPARC: 4-way unrolling
The calculation order is completely different on each.

Slide 36: Aligned vs. unaligned
Alignment is a property of the data's address: the offset should be a multiple of the data size.
- 16-bit: 0x082 aligned, 0x083 unaligned
- 32-bit: 0x084 aligned, 0x086 unaligned
- 64-bit: 0x088 aligned, 0x08a unaligned
- 128-bit: 0x090 aligned, 0x098 unaligned
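A small sketch (the names are illustrative) of how a kernel can test for the 16-byte alignment the next slides care about and pick a code path accordingly:

```c
/* Hedged sketch: classify the X/Y pointers into the four alignment
 * scenarios of the following slide. */
#include <stdint.h>

static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;       /* address is a multiple of 16 */
}

static int alignment_case(const double *x, const double *y)
{
    /* 0: both aligned, 1: X unaligned, 2: Y unaligned, 3: both unaligned */
    return (is_aligned16(x) ? 0 : 1) + (is_aligned16(y) ? 0 : 2);
}
```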

Slide 37: Aligned data
- Some architectures need 128-bit alignment to move data efficiently:
- Intel SSE/SSE2
- Intel IA64
- IBM VMX (AltiVec)
- The X and Y arguments passed by the user are not always aligned

Slide 38: Four scenarios
1. X aligned, Y aligned
2. X unaligned, Y aligned
3. X aligned, Y unaligned
4. X unaligned, Y unaligned
- Cases 1 and 2 may give the same result
- Cases 1 and 3 may give different results even when the data are exactly the same!!

Slide 39: Synchronization cost
- It always reduces efficiency
- Two ways to synchronize:
- Via the kernel: other threads/processes can use the CPU, but response is bad
- Busy wait (different from a spin loop): other threads/processes can't use the CPU, but response is pretty good
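A hedged sketch of the two styles (assuming POSIX threads and GCC-style atomic builtins; this is not GotoBLAS code): waiting through the kernel gives the CPU away but wakes up slowly, while a busy wait holds on to the CPU and reacts almost immediately.

```c
/* Hedged sketch: kernel-assisted wait vs. busy wait on a shared flag. */
#include <pthread.h>

static volatile int ready = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

void wait_kernel(void)                       /* frees the CPU, slow wakeup */
{
    pthread_mutex_lock(&m);
    while (!ready)
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
}

void wait_busy(void)                         /* burns the CPU, fast wakeup */
{
    while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
        ;                                    /* spin until the flag flips */
}

void signal_ready(void)
{
    pthread_mutex_lock(&m);
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
    pthread_cond_broadcast(&c);
    pthread_mutex_unlock(&m);
}
```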

Slide 40: Threaded operation
- It's important to divide the work equally
- Accessing a queue takes a long time: many threads try to access the same queue at once
- Waking up and suspending threads has a cost
- If we can get rid of the costs above, what happens?

Slide 41: Pthread overhead (Level 2)
[Chart: "Thread Overhead (DGEMV on Itanium2)" - MFlops vs. matrix order, comparing single-threaded, multithreaded, and busy-wait versions]

Slide 42: Imagine how the data move!
[Diagram: modified data in CPU 0's cache must be written back through memory before CPU 1 can use it]
- The cache holds the modified data back from memory
- All data have to go through main memory!

Slide 43: Pthread overhead (Level 3)
[Chart: "Thread Overhead (DGEMM on Itanium2)" - MFlops vs. matrix order, comparing single-threaded, multithreaded, and busy-wait versions]

Slide 44: 80-bit FP BLAS
- 128-bit FP is really good, but slow due to software emulation
- 80-bit FP is less precise than that, but more precise than 64-bit FP
- No penalty except for the load/store operations
- GCC can handle it through long double
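A minimal sketch of what "GCC can handle it through long double" can look like: accumulating a 64-bit dot product in 80-bit extended precision (this is x86-specific; on other targets long double may be a different format entirely).

```c
/* Hedged sketch: 64-bit inputs, 80-bit accumulator.  Only the loads of
 * x[i] and y[i] stay 64-bit; the running sum keeps the extra precision
 * that a pure 64-bit accumulator would lose. */
double ddot_extended(int n, const double *x, const double *y)
{
    long double s = 0.0L;
    for (int i = 0; i < n; i++)
        s += (long double)x[i] * (long double)y[i];
    return (double)s;                       /* round once, at the end */
}
```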

Slide 45: The problem
- 80-bit FP is not compatible with 128-bit FP
- Intel x86 / x86_64: bad performance
- Intel IA64: good performance (92% of peak)
- I don't know how useful it is; is anyone interested?

Slide 46: QGEMM on Itanium2
[Chart: "QGEMM Performance on Itanium2" - MFlops vs. matrix order, DGEMM vs. QGEMM]

Slide 47: QGEMM on Opteron
[Chart: "QGEMM Performance on Opteron" - MFlops vs. matrix order, DGEMM vs. QGEMM]

Slide 48: Conclusion
The performance of your application depends on many factors:
- The operating system
- Your algorithm
- Function overhead
- Data alignment
- Data types
- etc.

Slide 49: Please do not
- Do easy optimizations the compiler can do: loop unrolling, easy blocking, simple hand optimization
- Fixate on the cache size; bandwidth is more important
- Use excessive threading

Slide 50: Please do
- Improve your algorithm
- Be aware of the limited bandwidth
- Avoid using subnormal values
- Separate the important functions
- Divide the work equally across threads

Slide 51: Any questions?
Then please join the tutorial on Monday!
