Progress Report on QDP-JIT
F. T. Winter
Thomas Jefferson National Accelerator Facility
USQCD Software Meeting 2014, April 16-17, 2014, at Jefferson Lab
F. Winter (Jefferson Lab), QDP-JIT, USQCD-Software 2014

QDP-JIT/PTX: A Framework for Lattice QCD Calculations on GPUs
QDP-JIT/PTX provides a reimplementation of QDP++ for NVIDIA GPUs:
  Automatic off-loading of expressions to the accelerators
  Multi-GPU support
  Dynamic code generation
  Additional Just-In-Time (JIT) compilation step with the NVIDIA driver
  Data layout optimized for coalesced memory accesses
  Automatic H2D and D2H memory transfers via a software cache
  Automatic tuning of CUDA kernels
[Figure: trajectory time [s] versus number of XE sockets / XK nodes for V = 40^3 x 256, 2+1 anisotropic clover, m_π ~ 230 MeV, τ = 0.2. Curves: CPU only (XE nodes), CPU+QUDA, QDP-JIT+QUDA.]
F. T. Winter, M. A. Clark, R. G. Edwards, B. Joo, in IPDPS'14. Paper accepted for publication in the IEEE International Parallel & Distributed Processing Symposium 2014.
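
The slide lists automatic H2D/D2H transfers via a software cache but does not show how it works. The following is a minimal sketch of the general idea, not QDP-JIT's implementation: each field tracks which copy (host or device) is current, transfers happen only when a stale copy is requested, and the least recently used field is spilled when device memory is full. All names (`Field`, `SoftwareCache`, `onDevice`, `onHost`) are hypothetical, and "device memory" is simulated with a second host buffer so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical field record: two copies plus validity flags.
struct Field {
    std::vector<float> host;
    std::vector<float> device;   // stand-in for a GPU allocation; empty = not resident
    bool hostValid = true;
    bool deviceValid = false;
};

class SoftwareCache {
public:
    int h2d = 0;  // number of host-to-device copies performed
    int d2h = 0;  // number of device-to-host copies performed

    explicit SoftwareCache(std::size_t capacity) : capacity_(capacity) {}

    // Register a field of n elements; register all fields before taking references.
    int add(std::size_t n) {
        assert(n > 0 && n <= capacity_);
        Field f;
        f.host.assign(n, 0.0f);
        fields_.push_back(f);
        return static_cast<int>(fields_.size()) - 1;
    }

    // Called before a kernel uses the field; 'write' marks the host copy stale.
    std::vector<float>& onDevice(int id, bool write) {
        Field& f = fields_[id];
        lru_.remove(id);
        if (f.device.empty()) {
            while (used_ + f.host.size() > capacity_) evictOldest();
            f.device.resize(f.host.size());
            used_ += f.host.size();
        }
        lru_.push_back(id);  // most recently used goes to the back
        if (!f.deviceValid) { f.device = f.host; ++h2d; f.deviceValid = true; }
        if (write) f.hostValid = false;
        return f.device;
    }

    // Called before host code uses the field; 'write' marks the device copy stale.
    std::vector<float>& onHost(int id, bool write) {
        Field& f = fields_[id];
        if (!f.hostValid) { f.host = f.device; ++d2h; f.hostValid = true; }
        if (write) f.deviceValid = false;
        return f.host;
    }

private:
    // Spill the least recently used resident field, copying back only if needed.
    void evictOldest() {
        assert(!lru_.empty());
        int victim = lru_.front();
        lru_.pop_front();
        Field& v = fields_[victim];
        if (!v.hostValid) { v.host = v.device; ++d2h; v.hostValid = true; }
        used_ -= v.device.size();
        v.device.clear();
        v.deviceValid = false;
    }

    std::size_t capacity_;
    std::size_t used_ = 0;
    std::vector<Field> fields_;
    std::list<int> lru_;
};
```

In this sketch, repeated reads on the device cost a single H2D copy, and a field that was only read on the device is evicted without any D2H traffic.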

QDP-JIT/LLVM Motivation
Code maintainability
  No template specializations (SSE, AVX, etc.) for each architecture
  No heavy usage of #ifdef constructs
Performance portability
  Efficient code generation for all relevant targets
  No dependence on the compiler's ability to deal with templated code
  Support for vector units, memory prefetchers, etc.
  Efficient code: threading, scheduling, cache blocking, etc.
QDP-JIT/LLVM (LLVM IR): architecture-independent implementation of QDP++
LLVM is a framework worth targeting
  LLVM IR is architecture independent
  LLVM is embraced by the HPC industry, e.g. NVIDIA, IBM, Intel, ...

QDP-JIT/LLVM Overview
[Diagram: QDP-JIT/PTX targets GPUs. QDP-JIT/LLVM generates LLVM IR, which is lowered via nvptx to GPUs, via libnvvm to GPUs, via x86-64 to Intel/AMD CPUs, via ppc64+qpx to Blue Gene/Q, and potentially via further backends to other targets.]
QDP-JIT/PTX is limited to GPUs. To target a broader range of architectures, a new LLVM IR code generator was implemented.
The GPU route is still available via two approaches: the open-source NVPTX backend or the closed-source libnvvm library.
libnvvm has been part of CUDA since version 5.5 and includes PTX-specific optimizations.
x86 code is generated with LLVM's mature x86 backend (great SSE/AVX support).
PowerPC 64 code is generated for Blue Gene/Q. Some support for QPX (work in progress).
New architectures can be supported, provided that the target supports JIT compilation.
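
To make the idea of an architecture-independent code generator concrete, here is a toy sketch that walks a small expression tree and prints a function in textual LLVM IR. It is only an illustration under stated assumptions: it builds strings and does not use the LLVM libraries, it handles scalar `float` operands only, and the names `Expr`, `var`, `add`, `mul`, and `emitFunction` are hypothetical. The slides do not show how QDP-JIT/LLVM's generator is structured.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical expression node: a named leaf ('v') or a binary '+' / '*'.
struct Expr {
    char op;
    std::string name;
    std::shared_ptr<Expr> lhs, rhs;
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr var(const std::string& n) { return std::make_shared<Expr>(Expr{'v', n, nullptr, nullptr}); }
ExprPtr add(ExprPtr a, ExprPtr b) { return std::make_shared<Expr>(Expr{'+', "", a, b}); }
ExprPtr mul(ExprPtr a, ExprPtr b) { return std::make_shared<Expr>(Expr{'*', "", a, b}); }

// Post-order walk: emit operands first, then one SSA instruction per node.
// Returns the name of the SSA value that holds this node's result.
std::string emit(const ExprPtr& e, std::ostringstream& body, int& next) {
    assert(e != nullptr);
    if (e->op == 'v') return "%" + e->name;
    std::string l = emit(e->lhs, body, next);
    std::string r = emit(e->rhs, body, next);
    std::string t = "%t" + std::to_string(next++);
    body << "  " << t << " = " << (e->op == '+' ? "fadd" : "fmul")
         << " float " << l << ", " << r << "\n";
    return t;
}

// Wrap the instruction stream in a function definition taking float arguments.
std::string emitFunction(const std::string& fname,
                         const std::vector<std::string>& args,
                         const ExprPtr& root) {
    std::ostringstream out, body;
    int next = 0;
    std::string result = emit(root, body, next);
    out << "define float @" << fname << "(";
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (i) out << ", ";
        out << "float %" << args[i];
    }
    out << ") {\nentry:\n" << body.str() << "  ret float " << result << "\n}\n";
    return out.str();
}
```

For example, `emitFunction("axpb", {"a", "b", "c"}, add(mul(var("a"), var("b")), var("c")))` yields an `fmul` followed by an `fadd`. The same IR text could then be handed to any of the backends named in the overview, which is the point of generating IR rather than target-specific code.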

Optimization: Custom Data Layout
QDP++ specifies the data layout through the nesting order of templated data types:
  Outer< Spin< Color< Reality< float > > > >
QDP-JIT splits the outer loop by an optional inner vector length I:
  Outer< Spin< Color< Reality< Inner< float > > > > >
The code generation step intercepts and changes the data layout:
  Spin< Color< Reality< Outer< Inner< float > > > > >    (GPUs, I = 1)
  Outer< Spin< Color< Reality< Inner< float > > > > >    (CPUs with SSE/AVX, I = 2/4/8)
  Outer< Spin< Color< Inner< Reality< float > > > > >    (BG/Q, I = 2 (DP))
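
The effect of the nesting order on memory addresses can be sketched as follows. The function below computes the linear element offset for a layout given as a string of letters, leftmost slowest-varying, in the style of the `layout=oscri` label used in the Blue Gene/Q benchmark (o = outer, s = spin, c = color, r = reality, i = inner). This is an illustrative sketch, not QDP-JIT code: `Shape`, `Index`, and `offset` are hypothetical names, and it assumes the site index is split as outer = site / I and inner = site % I.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical extents of one lattice field; the reality extent is always 2 (re, im).
struct Shape { std::size_t sites, spins, colors, inner; };

// Hypothetical component index of one real number inside the field.
struct Index { std::size_t site, spin, color, reality; };

// Linear offset for a nesting order such as "oscri"; leftmost letter varies slowest.
std::size_t offset(const std::string& layout, const Shape& sh, const Index& ix) {
    assert(sh.inner > 0 && sh.sites % sh.inner == 0);
    std::size_t off = 0;
    for (char c : layout) {
        std::size_t extent = 1, idx = 0;
        switch (c) {
            case 'o': extent = sh.sites / sh.inner; idx = ix.site / sh.inner; break;
            case 'i': extent = sh.inner;            idx = ix.site % sh.inner; break;
            case 's': extent = sh.spins;            idx = ix.spin;            break;
            case 'c': extent = sh.colors;           idx = ix.color;           break;
            case 'r': extent = 2;                   idx = ix.reality;         break;
            default:  assert(false && "unknown layout letter");
        }
        assert(idx < extent);
        off = off * extent + idx;  // standard row-major accumulation
    }
    return off;
}
```

With the GPU order `"scroi"` and I = 1, neighbouring sites of the same spin, color, and reality component are adjacent in memory, which is what coalesced access requires. With the CPU order `"oscri"` and I = 4, four consecutive sites form one contiguous vector, while the BG/Q order `"oscir"` keeps the real and imaginary parts of a site next to each other.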

Benchmark on Intel Sandy Bridge
[Figure: t_linalg (single precision), QDP++ (SSE) vs. QDP-JIT/LLVM, performance versus local problem size. Operations: M=M*M, M=adj(M)*M, M=M*adj(M), M=adj(M)*adj(M), M+=M*M, M+=adj(M)*M, M+=M*adj(M), M+=adj(M)*adj(M), M-=M*M, M-=adj(M)*M, M-=M*adj(M), M-=adj(M)*adj(M), V=M*V, V=adj(M)*V, V=V+V, D=M*D, D=adj(M)*D, H=M*H, H=adj(M)*H.]
Out of L2 cache for local problem sizes larger than L = 4.
Within cache the code achieves up to 78% of peak on an E5-2650 at 2.0 GHz (256 GFLOPS SP peak).

Benchmark on Intel Sandy Bridge
[Figure: t_linalg (double precision), QDP++ (SSE) vs. QDP-JIT/LLVM, performance versus local problem size. Operations: M=M*M, M=adj(M)*M, M=M*adj(M), M=adj(M)*adj(M), M+=M*M, M+=adj(M)*M, M+=M*adj(M), M+=adj(M)*adj(M), M-=M*M, M-=adj(M)*M, M-=M*adj(M), M-=adj(M)*adj(M), V=M*V, V=adj(M)*V, V=V+V, D=M*D, D=adj(M)*D, H=M*H, H=adj(M)*H.]
Out of L2 cache for local problem sizes larger than L = 16^4.
Within cache the code achieves up to 78% of peak on an E5-2650 at 2.0 GHz (128 GFLOPS DP peak).
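
The peak figures quoted on the two Sandy Bridge slides (256 for single precision, 128 for double precision) are consistent with the usual estimate peak = cores x clock x flops per cycle per core. The sketch below works through that arithmetic. Its inputs are assumptions not stated on the slides: 8 cores at 2.0 GHz, and 16 SP or 8 DP flops per cycle per core (one AVX add and one AVX multiply per cycle). It also assumes that M in the t_linalg labels denotes a 3x3 complex color matrix. All function names are hypothetical.

```cpp
#include <cassert>

// Theoretical peak in GFLOPS = cores * clock [GHz] * flops per cycle per core.
double peakGflops(int cores, double clockGHz, int flopsPerCyclePerCore) {
    assert(cores > 0 && clockGHz > 0.0 && flopsPerCyclePerCore > 0);
    return cores * clockGHz * flopsPerCyclePerCore;
}

// Flops of one 3x3 complex matrix product (assumed to be the M=M*M kernel):
// each of the 9 output entries needs 3 complex multiplies (6 flops each)
// and 2 complex additions (2 flops each).
int su3MatMulFlops() {
    return 9 * (3 * 6 + 2 * 2);
}

// Achieved fraction of peak, given flops per site, site count, and runtime.
double fractionOfPeak(double flopsPerSite, double sites, double seconds, double peak) {
    assert(seconds > 0.0 && peak > 0.0);
    double achievedGflops = flopsPerSite * sites / seconds / 1e9;
    return achievedGflops / peak;
}
```

Under these assumptions `peakGflops(8, 2.0, 16)` gives 256 and `peakGflops(8, 2.0, 8)` gives 128, matching the slides, and one matrix product costs 198 flops per site.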

Benchmark on Blue Gene/Q (single node, preliminary)
[Figure: t_linalg DP, 1 node, threads=32, inner=4, layout=oscri, performance versus local problem size. Operations: M=M*M, M=adj(M)*M, M=M*adj(M), M=adj(M)*adj(M), M+=M*M, M+=adj(M)*M, M+=M*adj(M), M+=adj(M)*adj(M), M-=M*M, M-=adj(M)*M, M-=M*adj(M), M-=adj(M)*adj(M), V=M*V, V=adj(M)*V, V=V+V, D=M*D, D=adj(M)*D, H=M*H, H=adj(M)*H.]
QPX instructions are generated; however, there are still alignment issues.
Out of L2 cache for local problem sizes larger than L = 16^4.

Benchmark on Blue Gene/Q (single node)
[Figure: t_linalg DP, 1 node, threads=64, QDP++, OMP, gcc -O3, performance versus local problem size. Operations: M=M*M, M=adj(M)*M, M=M*adj(M), M=adj(M)*adj(M), M+=M*M, M+=adj(M)*M, M+=M*adj(M), M+=adj(M)*adj(M), M-=M*M, M-=adj(M)*M, M-=M*adj(M), M-=adj(M)*adj(M), V=M*V, V=adj(M)*V, V=V+V, D=M*D, D=adj(M)*D, H=M*H, H=adj(M)*H.]
GCC on vanilla QDP++ currently does better on the linear algebra than QDP-JIT/LLVM, mainly because the LLVM BG/Q backend lacks essential performance features.

Benchmark on Blue Gene/Q, preliminary
Rb2 Wilson DSlash, local volume V =12 1 4, DP, 1 MPI rank/node
[Figure: performance versus number of BG/Q nodes. Curves: QDP-JIT, 32 threads; QDP++, 16 threads.]
Optimizations:
  Shifting of sub-lattices
  Overlapping of computation and off-node communication
For rb2 Wilson DSlash, preliminary measurements show a speedup factor of 12.4.
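
Overlapping computation with off-node communication is commonly done by splitting the local sites into an interior set, whose nearest neighbours are all on-node, and a face set, which needs data from neighbouring nodes. The sketch below shows only that partitioning step for a 4D local lattice. It is an assumption-laden illustration rather than QDP-JIT's scheme: it assumes nearest-neighbour shifts in all four directions, that every direction is split across nodes, and lexicographic site ordering. The names `SitePartition` and `partitionSites` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result: site indices grouped by whether they touch the local boundary.
struct SitePartition {
    std::vector<std::size_t> interior;  // all nearest neighbours are on-node
    std::vector<std::size_t> face;      // at least one neighbour lives off-node
};

// Classify every site of a local lattice with extents L[0..3].
// Sites are numbered lexicographically with direction 0 running fastest.
SitePartition partitionSites(const std::array<std::size_t, 4>& L) {
    for (std::size_t extent : L) assert(extent >= 1);
    SitePartition p;
    const std::size_t volume = L[0] * L[1] * L[2] * L[3];
    for (std::size_t site = 0; site < volume; ++site) {
        std::size_t rest = site;
        bool onFace = false;
        for (std::size_t d = 0; d < 4; ++d) {
            const std::size_t x = rest % L[d];  // coordinate in direction d
            rest /= L[d];
            if (x == 0 || x == L[d] - 1) onFace = true;
        }
        (onFace ? p.face : p.interior).push_back(site);
    }
    return p;
}

// Intended execution order (communication calls omitted in this sketch):
//   1. start sending/receiving face data
//   2. compute on p.interior while messages are in flight
//   3. wait for the messages to complete
//   4. compute on p.face
```

For a 4^4 local lattice this gives 16 interior and 240 face sites, which illustrates why small local volumes leave little interior work to hide communication behind.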

Summary & Outlook
QDP-JIT/LLVM provides an architecture-independent implementation of QDP++
Runs Chroma HMC (Wilson Clover) on GPUs, x86, and BG/Q
Optimizations:
  Custom data layout to support vectorization
  Multi-threading
  Sub-lattice shifting
Improve performance on BG/Q (QPX, SPI)
Intel Xeon Phi (KNL)
Apply advanced optimizations:
  Polyhedral model
  Cache blocking
  Memory prefetching
  Overlapping MPI and compute