War Stories: Graph Algorithms in GPUs


SAND2014-18323PE
War Stories: Graph Algorithms in GPUs
Siva Rajamanickam (SNL), George Slota, Kamesh Madduri (PSU)
FASTMath Meeting
"Exceptional service in the national interest." Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP

Increasingly Complex Heterogeneous Future; Future-Proof Performance-Portable Code?
- Memory spaces: bulk non-volatile (Flash? NVRAM), standard DDR (DDR4), fast memory (HBM/HMC), (segmented) scratch-pad on die
- Execution spaces: throughput cores (GPU), latency-optimized cores (CPU), processing in memory
- Special hardware: non-caching loads, read-only cache, atomics
- Programming models: GPU: CUDA-ish; CPU: OpenMP; PIM: ??

Outline
- What is Kokkos (slides from Kokkos developers: Carter Edwards, Christian Trott, Dan Sunderland)
  - A layered collection of C++ libraries
  - A thread-parallel programming model that manages data access patterns
- Graph Algorithms with OpenMP
- Graph Algorithms with Kokkos
- Conclusion

Kokkos: A Layered Collection of Libraries
- Standard C++, not a language extension
  - In the spirit of Intel's TBB, NVIDIA's Thrust & CUSP, MS C++ AMP, ...
  - Not a language extension like OpenMP, OpenACC, OpenCL, or CUDA
- Uses C++ template meta-programming
  - Currently relies on the C++1998 standard (everywhere except IBM's xlc)
  - Would prefer to require C++2011 for lambda syntax
  - Needs CUDA with C++2011 language compliance
- Layers: Application & Library Domain Layer; Kokkos Sparse Linear Algebra; Kokkos Containers; Kokkos Core; back-ends (OpenMP, pthreads, CUDA, vendor libraries, ...)

Kokkos' Layered Libraries
- Core
  - Multidimensional arrays and subarrays in memory spaces
  - parallel_for, parallel_reduce, parallel_scan on execution spaces
  - Atomic operations: compare-and-swap, add, bitwise-or, bitwise-and
- Containers
  - UnorderedMap: fast lookup and thread-scalable insert/delete
  - Vector: a subset of std::vector functionality to ease porting
  - Compressed Row Storage (CRS) graph
  - Host-mirrored and synchronized device-resident arrays
- Sparse Linear Algebra
  - Sparse matrices and linear algebra operations
  - Wrappers for vendors' libraries
  - Portability layer for Trilinos manycore solvers
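The Compressed Row Storage (CRS) graph container above can be sketched in a few lines of plain C++. This is an illustration of the storage scheme only, not the Kokkos API; the struct and member names are made up for this example.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of a CRS graph: one offsets array and one packed
// adjacency array. The neighbors of vertex v occupy
// adjacency[row_offsets[v] .. row_offsets[v+1]).
struct CrsGraph {
    std::vector<std::size_t> row_offsets; // size = num_vertices + 1
    std::vector<int>         adjacency;   // size = num_edges

    std::size_t num_vertices() const { return row_offsets.size() - 1; }
    std::size_t degree(std::size_t v) const {
        return row_offsets[v + 1] - row_offsets[v];
    }
    const int* neighbors_begin(std::size_t v) const {
        return adjacency.data() + row_offsets[v];
    }
};
```

Two flat arrays make the structure trivially copyable between host and device memory spaces, which is why graph codes on GPUs favor this layout over pointer-based adjacency lists.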

Kokkos Core: Managing Data Access
- Performance-portability challenge: devices require device-dependent memory access patterns
  - CPUs (and Xeon Phi)
    - Core-data affinity: consistent NUMA access (first touch)
    - Hyperthreads' cooperative use of the L1 cache
    - Alignment for cache lines and vector units
  - GPUs
    - Thread-data affinity: coalesced access with cache-line alignment
    - Temporal locality and special hardware (texture cache)
- "Array of Structures" vs. "Structure of Arrays"?
  - This is, and has been, the wrong question
  - Right question: what abstractions give performance portability?

Kokkos Core: Fundamental Abstractions
- Devices have execution spaces and memory spaces
  - Execution spaces: subset of CPU cores, GPU, ...
  - Memory spaces: host memory, host pinned memory, GPU global memory, GPU shared memory, GPU UVM memory, ...
  - Dispatch computation to an execution space accessing data in memory spaces
- Multidimensional arrays, with a twist
  - Map a multi-index (i,j,k,...) → a memory location in a memory space
  - The map is derived from an array layout
  - Choose the layout for the device-specific memory access pattern
  - Layout changes are transparent to user code, IF the user code honors the simple API: a(i,j,k,...)
  - Separates the user's index space from the memory layout

Kokkos Core: Multidimensional Array Layout and Access Attributes
- Override the device's default array layout
  - View<double**[3][8], Layout, Device> a("a",n,m);
  - E.g., force row-major or column-major
  - Multi-index access is unchanged in user code
  - Layout is an extension point for blocking, tiling, etc.
- Example: tiled layout
  - View<double**, TileLeft<8,8>, Device> b("b",n,m);
  - Layout changes are transparent to user code, IF the user code honors the a(i,j,k,...) API
- Data access attributes express the user's intent
  - View<const double**[3][8], Device, RandomRead> x = a;
  - Constant + RandomRead + GPU → read through the GPU texture cache
  - Transparent to user code
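The layout idea above can be sketched without Kokkos: the user always writes a(i,j), and the layout type alone decides the memory offset. LayoutRight and LayoutLeft here are simplified stand-ins for the Kokkos layouts, and View2D is an illustrative toy, not the real Kokkos::View.

```cpp
#include <cstddef>

struct LayoutRight { // row-major: consecutive j in one cache line (CPU-friendly)
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t /*n*/, std::size_t m) {
        return i * m + j;
    }
};

struct LayoutLeft {  // column-major: threads i, i+1 touch adjacent words (GPU coalescing)
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t n, std::size_t /*m*/) {
        return j * n + i;
    }
};

template <class Layout>
struct View2D {
    double*     data;
    std::size_t n, m; // extents

    // User code writes a(i,j) regardless of layout; only the
    // template parameter changes between CPU and GPU builds.
    double& operator()(std::size_t i, std::size_t j) {
        return data[Layout::offset(i, j, n, m)];
    }
};
```

Swapping the layout changes every memory access pattern in the program without touching a single line of user code, which is the portability argument the slide is making.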

Kokkos Core: Dispatch Data Parallel Functors
- 'NW' units of data-parallel work
- parallel_for( NW, functor )
  - Calls functor( iw ) with iw ∈ [0, NW) and #threads ≤ NW
- parallel_reduce( NW, functor )
  - Calls functor( iw, value ), which contributes to the reduction 'value'
  - Inter-thread reduction via functor.init(value) and functor.join(value, input)
  - Kokkos manages the inter-thread reduction algorithms and scratch space
- parallel_scan( NW, functor )
  - Calls functor( iw, value, final_flag ), possibly multiple times
  - If final_flag == true, then 'value' is the prefix sum for 'iw'
  - Inter-thread reduction via functor.init(value) and functor.join(value, input)
  - Kokkos manages the inter-thread reduction algorithms and scratch space
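The dispatch contract above can be mocked serially to make the calling convention concrete: the runtime invokes functor(iw) for each unit of work, and for reductions it owns the init/join protocol. This sketch runs everything on one host thread; real Kokkos distributes the iterations across the chosen execution space.

```cpp
#include <cstddef>

// Serial stand-in for parallel_for: call functor(iw) for iw in [0, NW).
template <class Functor>
void mock_parallel_for(std::size_t nw, const Functor& f) {
    for (std::size_t iw = 0; iw < nw; ++iw) f(iw);
}

// Serial stand-in for parallel_reduce. Two "workers" each accumulate a
// partial value initialized by functor.init, and the partials are merged
// with functor.join -- the same protocol Kokkos drives across threads.
template <class Functor, class Value>
void mock_parallel_reduce(std::size_t nw, const Functor& f, Value& result) {
    Value partial[2];
    f.init(partial[0]);
    f.init(partial[1]);
    for (std::size_t iw = 0; iw < nw; ++iw)
        f(iw, partial[iw % 2]);   // round-robin work to the two "workers"
    f.init(result);
    f.join(result, partial[0]);
    f.join(result, partial[1]);
}
```

Because the user supplies only operator(), init, and join, the same functor runs unchanged whether the runtime uses one thread, a CPU thread pool, or a GPU grid.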

Kokkos Core: Dispatch Data Parallel Functors
- League of thread teams (like a grid of thread blocks)
- parallel_for( { #teams, #threads/team }, functor )
  - Calls functor( teaminfo ); teaminfo = { #teams, team-id, #threads/team, thread-in-team-id }
- parallel_reduce( { #teams, #threads/team }, functor )
  - Calls functor( teaminfo, value )
- parallel_scan( { #teams, #threads/team }, functor )
  - Calls functor( teaminfo, value, final_flag )
- A thread team has
  - Concurrent execution with intra-team collectives (barrier, reduce, scan)
  - Team-shared scratch memory
  - Exclusive use of CPU and Xeon Phi cores while executing

Outline
- What is Kokkos
- Graph Algorithms with OpenMP
- Graph Algorithms with Kokkos
- Conclusion

Computing Strongly Connected Components
- Problem: given a graph, find all of its strongly connected components
- Multistep method: multithreaded with OpenMP
  - Optimized for the best CPU performance; state-of-the-art code
  - Scales to millions of vertices and billions of edges
  - Data-parallel code with minimal synchronization
  - A good baseline for porting to Kokkos
  - Scales to 16-32 threads
- Presented in FASTMath sessions at PP14 and IPDPS 14

Multistep Method with OpenMP
- Different steps of the algorithm use different types of parallelism
  - Per-vertex for-loop
  - Level-synchronous BFS algorithm
[Figure: speedup of the Multistep and Hong methods vs. Tarjan's serial algorithm on the Twitter, Wiki Links, LiveJournal, R-MAT 24, GNP 1, and GNP 10 graphs, at 1-16 threads.]

Outline
- What is Kokkos
- Graph Algorithms with OpenMP
- Graph Algorithms with Kokkos
- Conclusion

Thread Parallel vs. Thread Scalable
- A common construct in OpenMP programming: allocate thread-local data, then do parallel work:

  #pragma omp parallel
  {
      // allocate a thread-local array
      // parallel for: do work
  }

  → A non-starter on GPUs for large arrays
- Count, allocate, fill paradigms
  → A non-starter for graph algorithms
- Need algorithms that use tiny thread-local data and synchronize through global memory (tiny == 16-32 edges)
  → Expensive: too many synchronizations
- Need algorithms that use thread teams and shared memory (scratch space) among a team of threads
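The "count, allocate, fill" paradigm mentioned above can be sketched with a simple filter. Each pass is a flat loop (a parallel_for in practice) plus one prefix scan, so no thread ever needs more than O(1) private state; the function below is an illustrative serial version, not code from the talk.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Keep the even elements of `in` without any per-thread buffers:
//   pass 1: each element writes a 0/1 count        (parallel_for)
//   scan:   exclusive prefix sum gives each element its output slot
//   pass 2: each kept element writes to its slot   (parallel_for)
std::vector<int> filter_evens(const std::vector<int>& in) {
    std::size_t n = in.size();
    std::vector<std::size_t> counts(n), offsets(n + 1, 0);
    for (std::size_t i = 0; i < n; ++i)            // count
        counts[i] = (in[i] % 2 == 0) ? 1 : 0;
    std::partial_sum(counts.begin(), counts.end(), // scan: offsets[i] =
                     offsets.begin() + 1);         //   sum of counts[0..i)
    std::vector<int> out(offsets[n]);              // allocate exactly once
    for (std::size_t i = 0; i < n; ++i)            // fill
        if (counts[i]) out[offsets[i]] = in[i];
    return out;
}
```

The single up-front allocation is what makes the pattern viable on a GPU, where growing a per-thread std::vector inside a kernel is not an option.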

Thread Teams in GPUs
- Multiprocessor (up to about 15 per GPU)
  - Multiple groups of stream processors (12 x 16)
- Warps of threads all execute SIMT on a single group of stream processors (32 threads/warp, two cycles per instruction)
  - Irregular computation (high-degree vertices, if/else, etc.) can leave most threads in a warp doing NOOPs
- Kokkos model:
  - Thread team: multiple warps on the same multiprocessor, but all still SIMT on the GPU
  - Thread league: multiple thread teams
  - Work is statically partitioned to teams before the parallel code is called

Challenges in Thread-Scalable Codes
- Goal: a fast Kokkos-based thread-scalable algorithm for CPU/GPU/Phi
- Challenges → Solutions:
  - No persistent thread-local storage → very small static thread-owned arrays
  - Minimize serial portions for GPU/Phi → no more Tarjan's; minimize possible per-thread work
  - Mitigate the effect of high-degree vertices and irregular graphs → implement new algorithmic tweaks, for load balancing, in the current methods
  - Mitigate algorithmic differences across architectures → HOPE this doesn't happen!!!!

Handling Imbalance in GPU Threads
- Chunking: transform the vertex queue into an edge queue
  - Each thread explores only a few edges and chunks the rest of the edges for later stages
- Delayed exploration of high-degree vertices:
  - When a single thread in a team encounters a high-degree vertex, its exploration is delayed
  - The vertex is placed in a shared-memory queue accessible only by the thread team (just a template parameter in Kokkos)
  - Once the team finishes its original work (minus the high-degree vertices), the team explores all delayed vertices via inner-loop parallelism
  - On the CPU, the thread-team size is usually 1, so this algorithm defaults back to the standard approach on that architecture
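The chunking step above can be sketched as follows: split each frontier vertex's edge range into fixed-size pieces so that no thread owns more than a small edge budget (the slides suggest threads should own only ~16-32 edges). The struct and function names are illustrative, not taken from the talk's code.

```cpp
#include <cstddef>
#include <vector>

// One unit of edge work: explore up to chunk_size edges of `vertex`,
// starting at edge index `first_edge` within its adjacency list.
struct EdgeChunk {
    std::size_t vertex;
    std::size_t first_edge;
};

// Turn a vertex frontier (given here just by vertex degrees) into an
// edge-chunk queue. A degree-1,000,000 vertex becomes many chunks that
// different threads can process, instead of one thread's burden.
std::vector<EdgeChunk> chunk_frontier(const std::vector<std::size_t>& degrees,
                                      std::size_t chunk_size) {
    std::vector<EdgeChunk> chunks;
    for (std::size_t v = 0; v < degrees.size(); ++v)
        for (std::size_t e = 0; e < degrees[v]; e += chunk_size)
            chunks.push_back({v, e});
    return chunks;
}
```

In a real GPU implementation the chunk queue would itself be built with a count/scan/fill pass rather than push_back, but the work decomposition is the same.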

Results: Algorithms
- Multistep (M): simple trimming; direction-optimizing BFS; coloring until fewer than 100k vertices remain, with single-thread exploration on the backward stage; serial Tarjan's algorithm
- Multistep in Kokkos (MK): simple trimming; direction-optimizing BFS with small thread-owned queues; coloring with fully parallel forward and backward stages; no Tarjan
- GPU in Kokkos (GK): simple trimming; direction-optimizing BFS with chunking; coloring with delayed exploration
- GPU min-memory in Kokkos (GKM): only utilizes out-edges; simple trimming; forward BFS with chunking and a fully bottom-up BFS on the backward stage; forward coloring with delayed exploration and a fully bottom-up reverse search
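All four variants begin with "simple trimming": a vertex with zero in-degree or zero out-degree cannot lie on a cycle, so it is a strongly connected component by itself and can be removed before the expensive BFS/coloring phases. A serial sketch of that step (the loops would be parallel_for passes in practice; the function name is made up for this example):

```cpp
#include <cstddef>
#include <vector>

// Repeatedly remove vertices whose remaining in- or out-degree is zero;
// each removed vertex is a trivial SCC. Returns a mask of trimmed vertices.
std::vector<bool> trim_trivial_sccs(const std::vector<std::vector<int>>& out_adj,
                                    const std::vector<std::vector<int>>& in_adj) {
    std::size_t n = out_adj.size();
    std::vector<std::size_t> outd(n), ind(n);
    std::vector<bool> trimmed(n, false);
    for (std::size_t v = 0; v < n; ++v) {
        outd[v] = out_adj[v].size();
        ind[v]  = in_adj[v].size();
    }
    bool changed = true;
    while (changed) {                  // iterate to a fixpoint
        changed = false;
        for (std::size_t v = 0; v < n; ++v) {
            if (!trimmed[v] && (outd[v] == 0 || ind[v] == 0)) {
                trimmed[v] = true;
                changed = true;
                for (int u : out_adj[v]) --ind[u];  // detach v from graph
                for (int u : in_adj[v])  --outd[u];
            }
        }
    }
    return trimmed;
}
```

On real-world small-world graphs this cheap pass can eliminate a large fraction of the vertices, which is why every variant keeps it even when Tarjan's algorithm is dropped.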

Results with Kokkos Versions
[Figure: runtimes of each algorithm/environment combination on the sk-2005 and Twitter graphs.]
- Algorithm: 'M-' = Multistep, 'MK-' = Multistep Kokkos, 'GK-' = GPU Kokkos, 'GKM-' = GPU Kokkos Mem
- Environment: '-C' = CPU, '-G' = GPU, '-P' = Xeon Phi
- Multistep (M) is the fastest algorithm on the CPU
- GK is the fastest algorithm on the GPU
- The Phi is (a lot) slower

Conclusion
- Kokkos provides a path forward for refactoring codes for different architectures
  - Handles data layout
  - Portable, thread-scalable performance
- Algorithmic challenges remain:
  - Different algorithms still perform best on different architectures
  - It is hard to see a single refactor serving the algorithms on all architectures
- Kokkos programming is C++ programming, not CUDA programming