LLVM Performance Improvements and Headroom

Size: px

Start display at page:

Download "LLVM Performance Improvements and Headroom"

Peregrine Wade
5 years ago
Views:

1 LLVM Performance Improvements and Headroom Gerolf Hoflehner Apple LLVM Developers Meeting 2015 San Jose, CA

2 Messages Tuning and focused local optimizations Advancing optimization technology Getting inspired by heroic optimizations Exposing performance opportunities to developers

3 Benchmarks SPEC CINT2006 1) Kernels LLVM Tests benchmarking-only SPEC CFP2006 (7 C/C++ benchmarks) 1) SPEC is a registered trademark of the Standard Performance Evaluation Cooperation

4 Setup Clang-600 (~LLVM 3.5) vs Clang-700 (~LLVM 3.7) -O3 -FLTO -PGO -O3 -FLTO ARM64

5 Some Performance Gains SPEC CINT2006: +6.5% Kernels: up to 70%

6 Acknowledgements Adam Nemet, Arnold Schwaighofer, Chad Rossier, Chandler Carruth, James Molloy, Michael Zolotukhin, Tyler Nowicki, Yi Jiang and many other contributors of the LLVM community

SPEC CINT2006 25% 22% 18.35% 19.3% 11.7% 8% 5.05% 4.4% 5.

7 SPEC CINT % 22% 18.35% 19.3% 11.7% 8% 5.05% 4.4% 5.3% 6% 6.5% 3.2% 2.5% 2.6% -1.6% 400.perlbench 401.bzip2 0.9% 403.gcc 429.mcf 1.3% 445.gobmk 456.hmmer 458.sjeng 462.libquantum 464.h264ref -1.6% 471.omnetpp 473.astar 483.xalancbmk Geomean

8 Some Reasons For Gains Condition folding: ((c) >= 'A' && (c) <= 'Z') ((c) >= 'a' && (c) <= z ) cmp + br -> tbnz Unrolling of loops with conditional stores Register pressure aware loop rotation Local (narrow) optimizations

9 Hot Loop: 456.hmmer for (k = 1; k <= M; k++) { mc[k] = mpp[k - 1] + t0[k - 1]; ; d[k] = d[k - 1] + t1[k - 1]; if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc; ;

10 Hot Loop: 456.hmmer Loop Distribution for (k = 1; k <= M; k++) { mc[k] = mpp[k - 1] + t0[k - 1]; ; for (k = 1; k <= M; k++) { // split d[k] = d[k - 1] + t1[k - 1]; if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc; ; - Improves cache efficiency - Partial vectorization

11 Hot Loop: 456.hmmer Loop Distribution + Store To Load Forwarding for (k = 1; k <= M; k++) { mc[k] = mpp[k - 1] + t0[k - 1]; ; = d[0]; Tk-1 for (k = 1; k <= M; k++) { // split // d[k]= d[k-1] + t1[k-1] d[k] = T k = + t1[k - 1]; Tk-1 if ((sc = mc[k - 1] + t2[k - 1]) > T k ) T k = d[k] = sc; - Critical Path Shortening

12 Reflection Many local narrowly focused optimizations Loop Distribution advances capabilities of Loop Transformation Framework

13 Kernel Performance 70% 70% 52.5% 53% 35% 17.5% 0% -10% -8% -6% -17.5% SHA Sobel Filter Compress BlackScholes Convolution

14 Convolution : Estimate Loop Unrolling Benefits No Unrolling static const int k[] = { 0, 1, 5 ; for (v = 0; v < size; v++) { r += src[v] * k[v];

15 Convolution : Estimate Loop Unrolling Benefits With Unrolling static const int k[] = { 0, 1, 5 ; r += src[0] * k[0]; r += src[1] * k[1]; r += src[2] * k[2];

16 Convolution : Estimate Loop Unrolling Benefits With Unrolling static const int k[] = { 0, 1, 5 ; r += src[0] * k[0]; r += src[1] * k[1]; r += src[2] * k[2];

17 Convolution : Estimate Loop Unrolling Benefits With Unrolling static const int k[] = { 0, 1, 5 ; r += src[0] * 0; r += src[1] * 1;a r += src[2] * 5;

18 Convolution : Estimate Loop Unrolling Benefits With Unrolling static const int k[] = { 0, 1, 5 ; r += 0; r += src[1];a saves mul + add saves mul r += src[2] * 5;

19 Kernel Regressions SHA (-10%) Aggressive load hoisting Tune Scheduler Sobel Filter (-8%) LSR expression normalization Avoid GEP in base address calculation Compress (-6%) No CCMP optimization due to tbnz Generalize CCMP

20 Performance Changes In LLVM Tests 70.00% 50.00% Gains %Gain 30.00% 10.00% % Losses %

21 Headroom

22 CINT2006 Headroom +5%? +15% Inlining+GlobalOpts +LoopFusion libquantum( 10X ) +10% AoS->SoA libquantum (2X) +2% Tuning? CINT2006 2% 10% 15% 5% 0% 10% 20% 30% 40% Tuning Advanced Heroic Unknown

23 CFP2006 Headroom +5%? +25% AoS->SoA milc (3X) Value Profiling povray(10%) +2% Tuning CFP2006 2% 25% 5% 0% 10% 20% 30% 40% Tuning Advanced Heroic Unknown

24 Array of Structs (AoS) to Structs of Array (SoA)

25 Libquantum: AoS->SoA hot_code(, quantum_reg_node *reg) { int i; for (i = 0; i < reg->size; i++) { if (reg->node[i].state & C) { reg->node[i].state ^= T; struct quantum_reg_node { COMPLEX_FLOAT amplitude; int state; ; struct quantum_reg_node { int width; int size; int hashw; quantum_reg_node *node; int *hash; ; Hot loop uses only some fields of a structure

26 struct quantum_reg_node { COMPLEX_FLOAT amplitude; int state; ; quantum_reg_node N[] for(i=0; i<reg->size; i++){ if(reg->n[i].state & C) { reg->n[i].state ^= T);

27 quantum_reg_node N[] N[0] N[1] N[2] N[0].state N[1].state N[2].state 20 Cache: b 128b 192b fictitious cache line size!

28 struct quantum_reg_node { COMPLEX_FLOAT amplitude; int state; ; quantum_reg_node N[] for(i=0; i<reg->size; i++){ if(reg->n[i].state & C) { reg->n[i].state ^= T);

29 COMPLEX_FLOAT *amplitude; int *state; quantum_reg_node N[] for(i=0; i<reg->size; i++){ if(reg->n[i].state & C) { reg->n[i].state ^= T);

30 COMPLEX_FLOAT *amplitude; int *state; COMLEX_FLOAT amplitude[] MAX_UNSIGNED state[] for(i=0; i<reg->size; i++){ if(reg->n[i].state & C) { reg->n[i].state ^= T);

31 COMPLEX_FLOAT *amplitude; int *state; COMLEX_FLOAT amplitude[] MAX_UNSIGNED state[] for(i=0; i<reg->size; i++){ if(reg->state[i] & C) { reg->state[i] ^= T);

32 COMPLEX_FLOAT *amplitude; int *state; COMLEX_FLOAT amplitude[] MAX_UNSIGNED state[] Cache (after AoS):

33 AoS to SoA speeds up benchmarks and applications that are memory-bandwidth bound and/or cache bound

34 AoS->SoA: Challenges Legality Casts to/from struct type, escaped types, address taken of individual fields, parameter and return values, semantic of constants Transformations Data accesses, memory allocation Usability Debugging

35 AoS->SoA: Changing Structure Definition and Accesses struct quantum_reg_node { COMPLEX_FLOAT amplitude; MAX_UNSIGNED state; ; struct quantum_reg_node { int width; int size; int hashw; quantum_reg_node *node; int *hash; ; struct quantum_reg_node { int width; int size; int hashw; COMLEX_FLOAT *amplitude; int *state; int *hash; ; reg->n[i].state to reg->state[i]

36 Constants if (!reg.state) reg.node = calloc(r; eg.size, sizeof(quantum_reg_node)) %call10 = call %conv, i64 16);

37 Parameter Passing j = quantum_get_state(reg1->node) j = quantum_get_state(reg1->amplitude, reg1->state);

38 References Implementing Data Layout Optimizations in LLVM Framework, Prashantha NR et al., 2014 LLVM Developer Meeting G. Chakrabarti, F. Chow, Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements. Gautam Chakrabarti, Open64 Workshop at CGO, 2008 O. Golovanevsky et al, Workshops/compiler2007/present/data-layout-optimizations-ingcc.pdf, GCC Workshop, 2007 R. Hundt et al, Practical Structure Layout Optimization and Advice, CGO, 2006

39 More Headroom: Heroics

40 Libquantum: Heroics test_sum() {...; cnot(2 * width - 1, width - 1, reg); sigma_x(2 * width - 1, reg);...; 2 similar functions Hot loops Whole program visibility Alias analysis GlobalModRef Cost Model Inline + Fuse

41 Interprocedural Loop Fusion void cnot(int C, int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B1; else { if (foo(x)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) decohere(reg); void sigma(int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B2; else { if (foo(y)) return; for (i = 0; i < reg->size; i++) { decohere(reg);

42 Interprocedural Loop Fusion void cnot(int C, int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B1; else { if (foo(x)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) decohere(reg); void sigma(int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B2; else { if (foo(y)) return; for (i = 0; i < reg->size; i++) { decohere(reg);

43 Interprocedural Loop Fusion void void status(&qec, status(&qec, B1; decohere(reg); B2 decohere(reg); decohere(reg);

44 Interprocedural Loop Fusion void void status(&qec, void decohere(q_reg) { status(&qec, Global B1; Value Prop if (status) { B2; do_something; return; decohere(reg); decohere(reg);

45 Interprocedural Loop Fusion void void status(&qec, void decohere(q_reg) { status(&qec, Global B1; Value Prop if (0) { B2 do_something; return; decohere(reg); decohere(reg);

46 Interprocedural Loop Fusion void cnot(int C, int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B1; else { if (foo(x)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) decohere(reg); void sigma(int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (qec) B2; else { if (foo(y)) return; for (i = 0; i < reg->size; i++) { decohere(reg);

47 Interprocedural Loop Fusion void void status(&qec, NULL); if (qec) B1; status(&qec, NULL); if (qec) B2;

48 Interprocedural Loop Fusion void void Inlining status(&qec, B1; status(&qec, NULL); if (qec) status(&qec, B2;

49 Interprocedural Loop Fusion void void Inlining status(&qec, if (globalvar) status(&qec, B1; B2;

50 Interprocedural Loop Fusion void cnot(int C, int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (globalvar) B1; else { if (foo(x)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) decohere(reg); void sigma(int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (globalvar) B2; else { if (foo(y)) return; for (i = 0; i < reg->size; i++) { decohere(reg);

51 Interprocedural Loop Fusion void GlobalModRef void status(&qec, NULL); if (globalvar) B1; else { status(&qec, NULL); if (globalvar) B2; else {

52 Interprocedural Loop Fusion void GlobalModRef void status(&qec, B1; if (globalvar) { B1; B2; else { status(&qec, if (foo(x)) return; for (i B2= 0; i < reg->size; i++) { if (reg->state[i] & C) reg->state[i] ^= T; if (foo(y)) return; for (i = 0; reg->state[i] i < reg->size; ^= T; i++) {

53 Interprocedural Loop Fusion void GlobalModRef void status(&qec, if (globalvar) { B1; B2; B1; else { status(&qec, if (foo(x)) return; for (i B2= 0; i < reg->size; i++) { if (reg->state[i] & C) reg->state[i] ^= T; if (foo(y)) return; for (i = 0; reg->state[i] i < reg->size; ^= T; i++) {

54 Interprocedural Loop Fusion void GlobalModRef void status(&qec, if (globalvar) { B1; B2; B1; else { status(&qec, if (foo(x)) return; B2 if (foo(y)) return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) for (i = 0; reg->state[i] i < reg->size; ^= T; i++) {

55 Interprocedural Loop Fusion void GlobalModRef void status(&qec, if (globalvar) { B1; B2; B1; else { Fuse bool V1 = foo(x); status(&qec, bool V2 = foo(y); if (V1 V2) B2 return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) for (i = 0; reg->state[i] i < reg->size; ^= T; i++) {

56 Interprocedural Loop Fusion void GlobalModRef void status(&qec, if (globalvar) { B1; B2; B1; else { Fused Loops bool V1 = foo(x); status(&qec, bool V2 = foo(y); if (V1 V2) B2 return; for (i = 0; i < reg->size; i++) { if (reg->state[i] & C)

57 What can we learn? Optimization Scope: Call Chain Concept: Function Similarity Challenge: Hoist statements across loops Techniques: Global Value Prop, Partial Inlining, GlobalModRef, Loop Fusion,

58 Take Aways Techniques needed for heroics can be generalized to advance optimization technology Cost Model?

59 How to expose performance opportunities to developers?

60 builtin_nontemporal_store void scaledcpy(float * restrict a, float * restrict b, float S, int N) { for (int i = 0; i < N; i++) b[i] = S * a[i];

61 builtin_nontemporal_store void scaledcpy(float * restrict a, float * restrict b, float S, int N) { for (int i = 0; i < N; i++) // b[i] = S * a[i]; builtin_nontemporal_store(s * a[i], &b[i]);

62 Vectorizer: Hints and Diagnostics while (good) { for (i = 0; i < N; i++) { DW[i] = A[i - 3] + B[i - 2] + C[i - 1] + D[i]; UW[i] = A[i] + B[i + 1] + C[i + 2] + D[i + 3]; remark: loop not vectorized: Avoid runtime pointer checking when you know the arrays will always be independent by specifying '#pragma clang loop vectorize(assume_safety) before the loop or by specifying 'restrict' on the array arguments. Erroneous results will occur if these options are incorrectly applied!

63 Conclusions The best days for LLVM performance are ahead of us

64 Questions?

A Fast Instruction Set Simulator for RISC-V

A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.