Porting MILC to GPU: Lessons learned


Porting MILC to GPU: Lessons learned Dylan Roeh Jonathan Troup Guochun Shi Volodymyr Kindratenko Innovative Systems Laboratory National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

GPU Clusters at NCSA

Lincoln (production system, available via TeraGrid HPC allocation):
    Nodes: 192
    CPU cores: 1536
    Accelerator Units (S1070): 96
    Total GPUs: 384
    CPU cores/GPU ratio: 4

AC (experimental system managed by ISL):
    Nodes: 32
    CPU cores: 128
    Accelerator Units (S1070): 32
    Total GPUs: 128
    CPU cores/GPU ratio: 1

New Experimental Cluster Node

[Diagram: Core i7 host connected through four PCIe x16 interfaces to two Tesla S1070 units]

    CPU cores (Intel Core i7): 4
    Accelerator Units (S1070): 2
    Total GPUs: 8
    CPU cores/GPU ratio: 0.5
    Theoretical peak PCIe bandwidth: 18 GB/s
    CPU memory: 24 GB
    GPU memory: 32 GB

Goals/Approach

Goals:
    Understand what is involved in implementing MILC on the GPU
    Understand GPU cluster hardware requirements for MILC

Approach:
    Port a few kernels using the PGI x64+gpu compiler
    Port a few kernels using the native NVIDIA CUDA C compiler

Portland Group PGI x64+gpu compiler

    Uses OpenMP-like directives to identify portions of code to move to the GPU
    No need to program in CUDA C
    The programmer is still responsible for parallelizing the code
    Supports C, C++, and Fortran(!)
    Supports 64-bit Linux and the latest NVIDIA GPUs
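As a minimal illustration of the directive model (a generic example under my own assumptions, not MILC code): the compiler offloads the loop following the directive to the GPU and generates the kernel and data transfers itself.

    /* Hypothetical example: the PGI Accelerator directive marks the
     * following loop for offload to the GPU; the compiler generates the
     * kernel launch and the host<->device data movement it can analyze. */
    void scale_add(int n, float s, float *restrict x, float *restrict y)
    {
    #pragma acc region
        for (int i = 0; i < n; i++)
            y[i] = y[i] + s * x[i];
    }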

Target code

    for (i = 0; i < sites_on_node; i++) {
        mult_su3_na(..., ..., ...);
    }

    void mult_su3_na(su3_matrix *a, su3_matrix *b, su3_matrix *c)
    {
        register int i, j, k;
        register complex x, y;
        for (i = 0; i < 3; i++) for (j = 0; j < 3; j++) {
            x.real = x.imag = 0.0;
            for (k = 0; k < 3; k++) {
                CMUL_J(a->e[i][k], b->e[j][k], y);
                CSUM(x, y);
            }
            c->e[i][j] = x;
        }
    }

Target code

    #pragma acc region
    for (i = 0; i < sites_on_node; i++) {
        mult_su3_na(..., ..., ...);
    }

The PGI compiler does not like structures:

    void mult_su3_na(su3_matrix *a, su3_matrix *b, su3_matrix *c)
    {
        register int i, j, k;
        register complex x, y;
        for (i = 0; i < 3; i++) for (j = 0; j < 3; j++) {
            x.real = x.imag = 0.0;
            for (k = 0; k < 3; k++) {
                CMUL_J(a->e[i][k], b->e[j][k], y);
                CSUM(x, y);
            }
            c->e[i][j] = x;
        }
    }

Modified code

Remove the structures and use plain pointers instead. The PGI compiler does not like pointers either.

    #pragma acc region
    for (i = 0; i < sites_on_node; i++) {
        mult_su3_na(..., ..., ...);
    }

    inline void mult_su3_na_plain(float *a, float *b, float *c)
    {
        int i, j, k;
        for (i = 0; i < 3; i++) for (j = 0; j < 3; j++) {
            float real_part = 0;
            float imag_part = 0;
            for (k = 0; k < 3; k++) {
                float real_a = *(a + i*6 + k*2);
                float imag_a = *(a + i*6 + k*2 + 1);
                float real_b = *(b + j*6 + k*2);
                float imag_b = *(b + j*6 + k*2 + 1);
                real_part += real_a*real_b - imag_a*imag_b;
                imag_part += real_a*imag_b + imag_a*real_b;
            }
            float *real_p = c + i*6 + j*2;
            float *imag_p = c + i*6 + j*2 + 1;
            *real_p = real_part;
            *imag_p = imag_part;
        }
    }

Modified code

Copy the data into a contiguous memory region and fully unroll the function.

    float matrix_buf_a[max_float_count];
    float matrix_buf_b[max_float_count];
    float matrix_buf_c[max_float_count];

    for (i = 0; i < sites_on_node; i++) {
        memcpy(&matrix_buf_a[18*i], &oprod_along_path[length-ilink][i], sizeof(su3_matrix));
        memcpy(&matrix_buf_b[18*i], &mats_along_path[ilink][i], sizeof(su3_matrix));
        memcpy(&matrix_buf_c[18*i], &mat_tmp0[i], sizeof(su3_matrix));
    }

    #pragma acc region
    for (i = 0; i < sites_on_node; i++) {
        // unroll completely
        float real_a, imag_a, real_b, imag_b;
        float real_part = 0.0f, imag_part = 0.0f;
        int ii = 0;
        int jj = 0;

        real_a = matrix_buf_a[ii*6     + 18*i];
        imag_a = matrix_buf_a[ii*6 + 1 + 18*i];
        real_b = matrix_buf_b[jj*6     + 18*i];
        imag_b = matrix_buf_b[jj*6 + 1 + 18*i];
        real_part += real_a*real_b - imag_a*imag_b;
        imag_part += real_a*imag_b + imag_a*real_b;

        real_a = matrix_buf_a[ii*6 + 2 + 18*i];
        imag_a = matrix_buf_a[ii*6 + 3 + 18*i];
        real_b = matrix_buf_b[jj*6 + 2 + 18*i];
        imag_b = matrix_buf_b[jj*6 + 3 + 18*i];
        real_part += real_a*real_b - imag_a*imag_b;
        imag_part += real_a*imag_b + imag_a*real_b;

        real_a = matrix_buf_a[ii*6 + 4 + 18*i];
        imag_a = matrix_buf_a[ii*6 + 5 + 18*i];
        real_b = matrix_buf_b[jj*6 + 4 + 18*i];
        imag_b = matrix_buf_b[jj*6 + 5 + 18*i];
        real_part += real_a*real_b - imag_a*imag_b;
        imag_part += real_a*imag_b + imag_a*real_b;

        matrix_buf_c[ii*6 + jj*2     + 18*i] = real_part;
        matrix_buf_c[ii*6 + jj*2 + 1 + 18*i] = imag_part;

        ii = 0; jj = 1;
        ...
    }

Conclusions with the PGI x64+gpu compiler

    Significant code modifications are necessary:
        to work around compiler limitations
        to parallelize the code
    The benefits are not entirely clear:
        even though there is no need to write CUDA C, the code modifications are very similar
        compiler transformations are not always visible/obvious

Implementing kernels in CUDA C

    Use part of the fermion force kernel as an example (fermion_force_fn_multi.c)
    Two approaches:
        Port the entire kernel
        Implement parallel SU(3) operations (scalar multiply-add, multiply, adjoint, projector) with the end goal of replacing many of the FOREVEN, FORODD, and FORALL inner loops (see the sketch below)
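As an illustration of the second approach, here is a sketch of one such parallel SU(3) operation under my own assumptions (the kernel name, layout, and launch parameters are hypothetical, not the actual NCSA code): a per-site scalar multiply-add, c[i] = a[i] + s*b[i], with one thread per lattice site and each matrix stored as 18 consecutive floats.

    __global__ void scalar_mult_add_su3_kernel(const float *a, const float *b,
                                               float *c, float s,
                                               int sites_on_node)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= sites_on_node) return;

        const float *pa = a + 18 * site;   /* matrix at this site in a */
        const float *pb = b + 18 * site;   /* matrix at this site in b */
        float       *pc = c + 18 * site;   /* output matrix */

        for (int k = 0; k < 18; k++)       /* 9 complex = 18 real entries */
            pc[k] = pa[k] + s * pb[k];
    }

    /* Host-side launch, assuming d_a, d_b, d_c already live in GPU memory:
     *   int threads = 128;
     *   int blocks  = (sites_on_node + threads - 1) / threads;
     *   scalar_mult_add_su3_kernel<<<blocks, threads>>>(d_a, d_b, d_c, s, sites_on_node);
     */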

Implementing kernels in CUDA C

    First approach: port the entire kernel
        Does not particularly increase the FLOPS/byte ratio
        Involves a significant possibility of branching
    Second approach: implement parallel SU(3) operations
        May be an easier strategy for moving the code to the GPU, but
        one has to deal with the low FLOPS/byte ratio of many of the operations
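As a rough back-of-the-envelope estimate (my own numbers, not from the slides): a single-precision SU(3) matrix-matrix multiply performs about 9 x (3 complex multiplies + 2 complex adds) = 9 x 22 = 198 floating-point operations, while it reads two 72-byte matrices and writes one, i.e. 216 bytes of memory traffic. That is under 1 FLOP per byte, so such a kernel is memory-bandwidth bound and leaves most of the GPU's arithmetic units idle.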

Possible ways forward

    Consider packed SU(3) representations rather than the 3x3 complex matrix representation
        Given the first two rows of the matrix (6 complex numbers, or 12 real numbers), one can reconstruct the third by taking the cross product of the conjugates of those rows. FLOPS are cheap on the GPU! (see the sketch below)
        They can also be packed as 8 real numbers, using a method described in the paper by Bunk and Sommer.
    This will require adapting MILC to use packed SU(3) matrices all the way through, and offloading any significant loop of SU(3) operations to the GPU.
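A minimal sketch of the 12-real-number reconstruction described above (my own illustration; the layout m[row][col][0/1] for real/imaginary parts is an assumption, not MILC's): the third row is the cross product of the complex conjugates of the first two rows, which equals the conjugate of their cross product.

    /* Reconstruct row 2 of an SU(3) matrix from rows 0 and 1.
     * m[r][c][0] is the real part, m[r][c][1] the imaginary part. */
    static void reconstruct_third_row(float m[3][3][2])
    {
        for (int c = 0; c < 3; c++) {
            int c1 = (c + 1) % 3, c2 = (c + 2) % 3;
            /* z = row0[c1]*row1[c2] - row0[c2]*row1[c1] (complex) */
            float re = m[0][c1][0]*m[1][c2][0] - m[0][c1][1]*m[1][c2][1]
                     - (m[0][c2][0]*m[1][c1][0] - m[0][c2][1]*m[1][c1][1]);
            float im = m[0][c1][0]*m[1][c2][1] + m[0][c1][1]*m[1][c2][0]
                     - (m[0][c2][0]*m[1][c1][1] + m[0][c2][1]*m[1][c1][0]);
            /* row2[c] = conj(z): conjugation flips the imaginary sign */
            m[2][c][0] =  re;
            m[2][c][1] = -im;
        }
    }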

Possible ways forward

    Switch storage of SU(3) matrices from an array-of-structs to a struct-of-arrays (see the sketch below)
        To allow easy coalesced global memory reads/writes
        To eliminate additional data manipulation on the CPU before the data is shipped to GPU memory
    But converting the data inside fermion_force_fn_multi will reduce or eliminate the performance advantage of offloading some of the loops to the GPU
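A minimal sketch of the two layouts (illustrative type names only, not MILC's): in the struct-of-arrays layout, consecutive threads reading the same matrix component for consecutive sites touch consecutive addresses, which is what coalesced global memory access requires.

    /* Array-of-structs: one su3_matrix (18 floats) per site, sites contiguous.
     * Thread t reading component 0 of site t strides by 18 floats. */
    typedef struct { float e[3][3][2]; } su3_matrix;
    su3_matrix *links_aos;                 /* links_aos[site]          */

    /* Struct-of-arrays: each of the 18 real components stored contiguously
     * across sites. Thread t reading component k of site t strides by
     * 1 float, so a warp's loads coalesce. */
    typedef struct { float *comp[18]; } su3_field_soa;   /* comp[k][site] */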

Conclusions

    Porting MILC as-is to the GPU is not the right strategy
        because of the low FLOPS/byte ratio throughout many MILC kernels
        and the sheer number of small kernels!
    A more fruitful approach is to start redesigning the essential algorithms in a GPU-friendly manner
        Issues to consider include data layout and algorithm selection to maximize the FLOPS/byte ratio