OPENMP GPU OFFLOAD IN FLANG AND LLVM. Guray Ozen, Simone Atzeni, Michael Wolfe Annemarie Southwell, Gary Klimowicz

Size: px

Start display at page:

Download "OPENMP GPU OFFLOAD IN FLANG AND LLVM. Guray Ozen, Simone Atzeni, Michael Wolfe Annemarie Southwell, Gary Klimowicz"

Alexandrina Stewart
5 years ago
Views:

1 OPENMP GPU OFFLOAD IN FLANG AND LLVM Guray Ozen, Simone Atzeni, Michael Wolfe Annemarie Southwell, Gary Klimowicz

2 MOTIVATION What does HPC programmer need today? Performance à GPUs, multi-cores, other accelerators. Continuity à Fortran! Current and future large codes. Quick Porting à OpenMP, OpenACC, MPI etc. Cross platform compiler à LLVM, PGI, GNU, etc. Conclusion, Fortran programmers needs OpenMP GPU Offload in Flang 2

3 Motivation Background Integration of OpenMP GPU Offload into Flang AGENDA Experimental Evaluation Conclusion 3

4 Programming GPU-Accelerated Systems Separate CPU System and GPU Memories GPU Developer View PCIe System Memory GPU Memory 4

5 OPENMP 4.5 Device Offload Model!$OMP TARGET ENTER DATA MAP(to:x, y) copy array x, y to the default device!$omp TARGET TEAMS DISTRIBUTE PARALLEL DO DO i=1, n y(i) = a * x(i) + y(i) ENDDO!$OMP TARGET EXIT DATA MAP(from: y) copy array y from the default device target: teams: distribute: parallel: do: Offload region Create teams of threads Distribute the iteration across the master threads of all teams Create threads Distribute the iteration across the threads that already exist in the team 5

ILM CPU target LLVM-IR flang-driver flang1 flang2 LLVM ld asm

project: NNSA Labs, NVIDIA/PGI ILM -> ILI Optimize ILI ILI to

6 example.f90 THE FLANG PROJECT An open source Fortran front-end for LLVM ILM CPU target LLVM-IR flang-driver flang1 flang2 LLVM ld asm Parse argument Scan Parse Build AST AST -> ILM Multi-year project: NNSA Labs, NVIDIA/PGI ILM -> ILI Optimize ILI ILI to LLVM-IR Runtime libraries libomp.so libpgmath.so libflang.so libflangrt.so Based on the PGI Fortran front-end Re-engineering for integration with LLVM Develop CLANG-quality Fortran front end Glossary ILM: PGI-IR generated by frontend ILI: PGI-IR generated by backend 6

7 OPENMP GPU OFFLOAD IN FLANG Design Goals Provide an implementation of OpenMP 4.5 GPU Offload targeting NVIDIA GPUs in Flang Aim to keep the same design as Clang OpenMP 4.5 Leverage the Clang/LLVM driver Take advantage of LLVM s OpenMP runtimes Compatibility with Flang and Clang OpenMP GPU Offload 7

COMPILATION PIPELINE $ flang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda example.f90 example.f90 tmp1.

implemented here. Flang2 generates device specific code ptx libomptarget-nvptx.

8 COMPILATION PIPELINE $ flang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda example.f90 example.f90 tmp1.f90 ILM CPU target LLVM-IR flang-driver flang1 flang2 LLVM ld asm Driver invokes the compiler for each OpenMP target tmp2.f90 flang1 generates the same ILM ILM NVPTX target LLVM-IR flang1 flang2 LLVM OpenMP device codegen is implemented here. Flang2 generates device specific code ptx libomptarget-nvptx.bc ptxas Two ways of adding libomptarget-nvptx 1. Inline *.bc (precompiled cuda-clang) 2. Link *.a by nvlink sass nvlink libomptarget-nvptx.a Runtime libraries libomp.so libpgmath.so libflang.so libflangrt.so libomptarget.so LLVM Offload Runtime 8

9 CODE TRANSFORMATION Example with OpenMP CPU!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO DO i=1, n y(i) = a * x(i) + y(i) ENDDO Use libomp(kmpc) as CPU OpenMP runtime libomp) Outline for each OpenMP construct Generated LLVM-IR for Host target define () { call %0,...) define internal kmpc_fork_teams (..., %teamsfunc) define internal call kmpc_for_static_init_4 (...) call kmpc_fork_call(..., %parallelfunc) call kmpc_for_static_fini (...) define internal call kmpc_for_static_init_4 (...) ; <The loop body > call kmpc_for_static_fini (...) Outlining for $target Outlining for $teams & fork teams Distribute loop across $teams Outline for $parallel & fork threads Distribute loop across threads 9

10 CODE TRANSFORMATION Example with OpenMP GPU Offload!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO MAP(to:x) MAP(tofrom:y) DO i=1, n y(i) = a * x(i) + y(i) ENDDO Offload target region to the device New code transformation for device Use libomptarget for GPU offload Use libomptarget-nvptx in device OpenMP RT Generated LLVM-IR for Host target define () { call tgt_target_teams(i8* %kernelid,...) Generated LLVM-IR for NVPTX target define ptx_kernel teamsfunc (...) define internal call kmpc_for_static_init_4 (...) call call kmpc_for_static_fini (...) define internal call kmpc_for_static_init_4 (...) ; <The loop body > call kmpc_for_static_fini (...) Offload $target to GPU Outlining for $teams & call func Distribute loop across $teams Outline for $parallel & call func Distribute loop across threads 10

11 OUTLINER OPTIMIZATION Reduce number of outlining for device code Calling function in the device is not efficient Omit outlining for combined constructs in device! Generated LLVM-IR - Default define ptx_kernel teamsfunc (...) define internal call kmpc_for_static_init_4 (...)!$distribute call call kmpc_for_static_fini (...) define internal call kmpc_for_static_init_4 (...)!$do ; <The loop body > call kmpc_for_static_fini (...) Generated LLVM-IR - Reduced number of outlining in device define ptx_kernel call kmpc_for_static_init_4 (...) ;!$distribute call kmpc_for_static_init_4 (...) ;!$do ; <The loop body > call kmpc_for_static_fini (...) call kmpc_for_static_fini (...) 11

12 EXPERIMENTAL EVALUATION Compilers studied in our experiments We run our experiments on IBM POWER8NVL and NVIDIA Pascal GPUs with CUDA 8.0 Compilers Flang OpenMP GPU (this work) Flags -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Mpreprocess Clang 7.0 -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda PGI Mpreprocess -fast -ta=tesla,cc60,cuda8.0 -mp IBM XLF qsmp -qoffload -qpreprocessor -O3 12

13 EXPERIMENTAL EVALUATION (II) Compilers studied in our experiments We experiment flang with different runtimes Compiler Target Offload Runtime Device OpenMP Runtime Flang Flang Flang LLVM Offload RT - libomptarget IBM XL Offload RT - libxlsmp IBM XL Offload RT - libxlsmp Compile device RT with nvcc Link libomptarget-nvxptx Compile device RT with nvcc Link libomptarget-nvxptx Compile device RT clang-cuda Inline libomptarget-nvxptx.bc 13

14 DAXPY y = y *ax PGI Fortran SPEEDUP VS SINGLE-CORE FLANG XLF Flang + Link libomptarget-nvptx.a + libomptarget Flang + Link libomptarget-nvptx.a + libxlsmp Flang + Inline libomptarget-nvptx.bc + libxlsmp Clang 2 Clang + static(schedule,1) 0 32MB Vector Size 14

15 360.ILBDC Spec Accel PGI Fortran 140 SPEEDUP VS SINGLE-CORE FLANG XLF Flang + Link libomptarget-nvptx.a Flang + Link libomptarget-nvptx.a + libomptarget + libxlsmp Flang + Inline libomptarget-nvptx.bc + libxlsmp 0 32MB Vector Size Performances are considered ESTIMATES per SPEC run and reporting rules. SPEC is a registered trademark of the Standard Performance Evaluation Corporation ( 15

16 CONCLUSION Flang OpenMP GPU Offload Introduced OpenMP GPU Offload to Flang Support for combined-constructs Proposed compatible implementation with Clang Flang performance is in line with Clang Obtained better performance with IBM XL s target offload runtime 16

OpenMP GPU Offload in Flang and LLVM

OpenMP GPU Offload in Flang and LLVM Güray Özen, Simone Atzeni, Michael Wolfe Annemarie Southwell, Gary Klimowicz NVIDIA Corp. {gozen, satzeni, mwolfe, asouthwell, gklimowicz}@nvidia.com Abstract Graphics