Progress Porting and Tuning NWP Physics Packages on Intel Xeon Phi

Size: px

Start display at page:

Download "Progress Porting and Tuning NWP Physics Packages on Intel Xeon Phi"

Michael Lee
6 years ago
Views:

gov Mark Govett, Jacques Middlecoff James Rosinski NOAA

1 Progress Porting and Tuning NWP Physics Packages on Intel Xeon Phi Tom Henderson Mark Govett, Jacques Middlecoff James Rosinski NOAA Global Systems Division Indraneil Gokhale, Ashish Jha, Ruchira Sasanka Intel Corp.

2 WSM6 Microphysics n Microphysics parameterization used in WRF n Water vapor, cloud water, cloud ice, rain, snow, graupel n Also used by MPAS (NCAR), NIM (NOAA), GRIMS (YSU/KIAPS,KMA), etc. n More about NIM in Jim Rosinski s talk n Double-precision in MPAS, NIM, and GRIMS, single-precision in WRF 2

3 Why WSM6? n Slowest single routine for expected NIM configuration n Re-use WSM5 tuning for Xeon Phi already done by John Michalakes where possible n Some divergence from John s approach n NOTE: This is work in-progress n NOTE: Xeon Phi = KNC = MIC 3

4 Source Code Requirements n Must maintain single source code for all desired execution modes n Single and multiple CPU/GPU/Xeon Phi n Prefer Fortran + directives n Use F2C-ACC (Govett) and commercial OpenACC compilers for GPU n Use OpenMP plus Intel directives for CPU and Xeon Phi n Avoid architecture-specific code transformations n Automatic transformations OK 4

5 Port Validation n Good bitwise-exact solutions for NIM dynamics debugging n Slow-but-exact Intel library for Xeon & Xeon Phi, Thanks Intel! n Optionally push rare math library calls back to CPU for NVIDIA GPU n Rudimentary validation for WSM6 thus far n Idealized input data set n Will extend to more rigorous validation of real input data case(s) shortly 5

6 What Makes Good Code for Xeon and Xeon Phi? n Vectorizable n Fixed inner dimension n Literal constants known at compile time n Adjustable length of inner dimension n Optimal = vector width n Aligned memory n Begin arrays on vector boundaries n Intel compiler warns of inefficient behavior n Loops that cannot be vectorized n partial, peel, and remainder loops n Unaligned access 6

7 Code Modifications: Threading n Add single OpenMP loop for all physics n Minimize OpenMP overhead n Split arrays into chunks with fixed inner dimension n Pick inner chunk size at compile time n Allow large chunk size for GPU n Modify loops that transfer arrays between dynamics and physics to handle chunks n Very little impact on existing code n Use Intel Inspector to find race conditions n It really works 7

8 Code Modifications: Threading n MPAS and NIM dynamics: (k,icell) n k = vertical index within a single column n icell = single horizontal index over all columns n Physics: (i,k,j) n i = horizontal index over columns in a single chunk n k = vertical index within a single column n j = index over chunks n Use OpenMP to thread j loop 8

9 Example: Chunk Width = 4 Dynamics (k,icell) k icell Replicate last column* Physics (i,k,j) k i j=1 j=2 j=3 j=4 * Replication avoids adding if blocks to all physics i loops 9

10 Code Modifications: Vectorization n Add compiler flag for alignment n Split/fuse loops per Intel compiler complaints n Add Intel compiler directives n Alignment n Compiler cannot always tell if memory is aligned n Vectorization n Compiler cannot always tell if a loop can be safely vectorized n Intel added two of these missed by me 10

11 Automatic Code Modifications n Compiler wants to see literal constants for memory and loop bounds n WRF index conventions n Memory: n real x(ims:ime,kms:kme,jms:jme) n Loop bounds: n do i=its,ite n do k=kts,kte n Amenable to automatic translation n Manually push WRF index conventions down into a few lower-level routines 11

12 Automatic Code Modifications n Optionally translate i to literal constants n Select chunk size at build time n Optionally translate k to literal constants n Select vertical size at build time n Not done by John Michalakes real :: y(ims:ime,kms:kme) real :: x(kms:kme) do k=kts,kte do i=its,ite real :: y(1:8,1:32) real :: x(1:32) do k=1,32 do i=1,8 n Optional + automatic = very flexible 12

13 Idealized Test Case n NIM aqua-planet test case n Single-node test n 225km global resolution (10242 columns) n 32 vertical levels n Time-step = 900 seconds n 72 time steps n WSM6 called every time step n Mimic expected per-node workload for target resolution <3km 13

14 Devices and Compilers n SNB 2 sockets (on loan from Intel) n E5-2670, 2.6GHz, 16 cores/node n ifort 14 n IVB-EP 2 sockets (Intel endeavor) n E5-2697v2, 2.7GHz, 24 cores/node n ifort 15 beta n HSW-EP 2 sockets (Intel endeavor) n E5-2697v3, 2.6 GHz, 28 cores/node n ifort 15 beta n KNC 1 socket (on loan from Intel) n 7120A, 1.238GHz n ifort 14 14

15 WSM6 Best Run Times Device Threads Chunk Width (DP words) Time Time with Intel Optimizations SNB KNC IVB HSW-EP n Intel optimizations reduce precision and make assumptions about padding, streaming stores, etc. n Defensible because WSM6 uses single precision in WRF 15

16 Effect of Automatic Code Transformations Device Threads Baseline Time Time With Literal k Time With Literal i and k KNC IVB n 1.4x speedup on KNC n 1.3x speedup on IVB n Compiler directives instead of user-guided code translation? 16

17 Effect of Vector Length Device 2 DP Words 4 DP Words 8 DP Words 16 DP Words 32 DP Words KNC IVB

18 Summary n WSM6 performance n KNC competitive with SNB despite slower clock n KNL will need to catch up with IVB/HSW n Optimizations sped up both Xeon and Xeon Phi 18

19 Near-Future Directions n WSM6 n Real-data test case n Add OpenMP loops (2) to MPAS physics n Repeat WSM6 tests in MPAS n More rigorous validation n Investigate load imbalance n Fractions of peak performance n FLOPS, memory bw & latency n Target other WRF physics packages used by NIM 19

20 Thanks to n Intel: Mike Greenfield, Ruchira Sasanka, Ashish Jha, Indraneil Gokhale, Richard Mills n Provision of loaner system and access to endeavor n Consultations regarding code optimization n Work-arounds for compiler issues n Aggressive optimization n John Michalakes n Consultation regarding WSM5 work n Code re-use 20

21 Thank You 2/22/12 21

22 Compiler Options n Xeon baseline optimization flags n -O3 ftz -qopt-report-phase=loop,vec -qoptreport=4 -align array64byte -xavx n Xeon aggressive optimization flags n -fp-model fast=1 -no-prec-div -no-prec-sqrt - fimf-precision=low -fimf-domain-exclusion=15 - opt-assume-safe-padding n Xeon Phi baseline optimization flags n -O3 ftz -vec-report6 -align array64byte n Xeon Phi aggressive optimization flags n -fp-model fast=1 -no-prec-div -no-prec-sqrt - fimf-precision=low -fimf-domain-exclusion=15 - opt-assume-safe-padding -opt-streamingstores always -opt-streaming-cache-evict=0 22

Porting and Tuning WRF Physics Packages on Intel Xeon and Xeon Phi and NVIDIA GPU

Porting and Tuning WRF Physics Packages on Intel Xeon and Xeon Phi and NVIDIA GPU Tom Henderson Thomas.B.Henderson@noaa.gov Mark Govett, James Rosinski, Jacques Middlecoff NOAA Global Systems Division