Matheus Serpa, Eduardo Cruz and Philippe Navaux

1 Matheus Serpa, Eduardo Cruz and Philippe Navaux
Informatics Institute, Federal University of Rio Grande do Sul (UFRGS)
Hillsboro, September 27, IXPUG Annual Fall Conference 2018

2 Goal of this work

3 Methodology: Application
Oil and gas exploration application provided by Petrobras
Cooperation project ITA-UFRGS-PETROBRAS, coordinated by Prof. Philippe O. A. Navaux
Performance evaluation and optimization of geophysics models
Application: 3D wave propagation model, written in C + OpenMP

5 Methodology: Platforms
Evaluation on 2 real machines

Broadwell: 2 x Intel Xeon E v4 (22 cores, 2-SMT, 256-bit VPU); 32 KB L1, 256 KB L2, 55 MB shared L3; 88 threads; Q1'16
Knights Landing: Intel Xeon Phi 7250 (68 cores, 4-SMT, 512-bit VPU); 32 KB L1, 512 KB L2; 272 threads; Q2'16

6 Memory Access Pattern to Improve Data Locality

9 Memory Access Pattern to Improve Data Locality
Cache performance
[Figure: cache hit rate (%) by cache level. Broadwell: L1, L2, L3. Knights Landing: L1, L2.]

13 Memory Access Pattern to Improve Data Locality
Application main loop:

    for(x = 0; x < nnoi; x++)
      for(y = 0; y < nnoj; y++)
        for(z = 0; z < k1 - k0; z++)
          index = (y * nnoi + x) + (k0 + z) * arm;

[Figure: index and stride of the accesses made by the innermost loop]

Spatial locality in caches: if a memory position is referenced, it is likely that nearby memory positions will be referenced in the near future.
Loop interchange to improve data locality: by exchanging the order of two or more loops, we reduce the stride size.

16 Memory Access Pattern to Improve Data Locality
Application main loop (original order, x-y-z):

    for(x = 0; x < nnoi; x++)
      for(y = 0; y < nnoj; y++)
        for(z = 0; z < k1 - k0; z++)
          index = (y * nnoi + x) + (k0 + z) * arm;

Application main loop after loop interchange (y-z-x):

    for(y = 0; y < nnoj; y++)
      for(z = 0; z < k1 - k0; z++)
        for(x = 0; x < nnoi; x++)
          index = (y * nnoi + x) + (k0 + z) * arm;

[Figure: index and stride of the innermost-loop accesses for each loop order]

20 Memory Access Pattern to Improve Data Locality
Cache performance
[Figure: cache hit rate (%) by cache level, before and after loop interchange. Broadwell: L1, L2, L3. Knights Landing: L1, L2. Annotated gains: +62% and +15.8% (Broadwell), +89.5% (Knights Landing).]

21 Memory Access Pattern to Improve Data Locality
Performance gain
[Figure: performance gain over the original xyz loop order, by architecture (Broadwell, Knights Landing). Surviving data label: 3.94.]

22 Exploiting SIMD for Floating-point Computations

26 Exploiting SIMD for Floating-point Computations
Many compilers feature auto-vectorization. However, some vectorizations cannot be fully checked at compile time.
We looked at the number of AVX instructions and at the compiler remarks.
Manual vectorization: directives to manually vectorize the code.

Application main loop:

    for(y = 0; y < nnoj; y++)
      for(z = 0; z < k1 - k0; z++)
        #pragma simd
        for(x = 0; x < nnoi; x++)
          index = (y * nnoi + x) + (k0 + z) * arm;

27 Exploiting SIMD for Floating-point Computations
Performance gain
[Figure: performance gain over the non-vectorized code, by architecture (Broadwell, Knights Landing)]

28 Optimizing Memory Affinity

31 Optimizing Memory Affinity
[Figure: interchip traffic (GB) by architecture (Broadwell, Knights Landing)]
Thread and data mapping to reduce latency

34 Optimizing Memory Affinity
We used thread and data mapping to reduce latency.
Thread mapping: map threads that communicate with each other to cores that are close in the memory hierarchy.
Data mapping: map pages to the NUMA node closest to the threads that perform the most memory accesses to them.

35 Optimizing Memory Affinity
[Figure: interchip traffic (GB) before and after the mapping, by architecture (Broadwell, Knights Landing)]

36 Optimizing Memory Affinity
Performance gain
[Figure: performance gain by architecture (Broadwell, Knights Landing)]

38 Performance Comparison
We applied all optimizations together.
[Figure: performance gain over the naive version, by architecture (Broadwell, Knights Landing)]

40 Performance Comparison
Optimized application performance
[Figure: GFLOPS by architecture (Broadwell, Knights Landing)]
1.2x faster on Broadwell

41 Conclusion
We optimized an oil and gas application and used hardware counters to measure the impact of each strategy.
Loop interchange (up to 5.3x on Broadwell)
Explicit vectorization (up to 6.5x on Knights Landing)
Thread and data mapping (up to 4.4x on Knights Landing)
Broadwell performance: 1.2x over Knights Landing

42 This research received funding from Intel under the Modern Code Project, and Petrobras under 2016/
