The Parallel Boost Graph Library spawn(Active Pebbles)

The Parallel Boost Graph Library spawn(active Pebbles) Nicholas Edmonds and Andrew Lumsdaine Center for Research in Extreme Scale Technologies Indiana University

Origins: the Boost Graph Library (1999). Generic programming with templates; good sequential performance with high-level abstractions; algorithm composition, visitors, property maps, etc. [Lie-Quan Lee, Jeremy Siek, and Andrew Lumsdaine. Generic Graph Algorithms for Sparse Matrix Ordering. ISCOPE, Lecture Notes in Computer Science 1732, 1999.]
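As a point of reference for the sequential-BGL style described here, the following is a minimal sketch (not code from the talk) using Boost.Graph's breadth_first_search with a small visitor that records discovery order; the example graph and the discovery_recorder visitor are illustrative.

    // Minimal sketch of sequential BGL usage: a generic BFS with a visitor.
    #include <boost/graph/adjacency_list.hpp>
    #include <boost/graph/breadth_first_search.hpp>
    #include <iostream>
    #include <vector>

    using Graph  = boost::adjacency_list<boost::vecS, boost::vecS, boost::undirectedS>;
    using Vertex = boost::graph_traits<Graph>::vertex_descriptor;

    // Visitor that records vertices in the order BFS discovers them.
    struct discovery_recorder : boost::default_bfs_visitor {
      std::vector<Vertex>* order = nullptr;
      void discover_vertex(Vertex v, const Graph&) const { order->push_back(v); }
    };

    int main() {
      Graph g(5);
      boost::add_edge(0, 1, g);
      boost::add_edge(0, 2, g);
      boost::add_edge(1, 3, g);
      boost::add_edge(2, 4, g);

      std::vector<Vertex> order;
      discovery_recorder vis;
      vis.order = &order;

      // The generic algorithm; visitors and property maps are the
      // composition hooks the slide refers to.
      boost::breadth_first_search(g, boost::vertex(0, g), boost::visitor(vis));

      for (Vertex v : order) std::cout << v << ' ';
      std::cout << '\n';
    }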

Origins: the Parallel BGL (2005). Lifting abstraction: make what we already have parallel, keeping the same interfaces as the sequential BGL; coarse-grained, BSP execution. [Douglas Gregor and Andrew Lumsdaine. The Parallel BGL: A Generic Library for Distributed Graph Computation. Workshop on Parallel Object-Oriented Scientific Computing, July 2005. Douglas Gregor and Andrew Lumsdaine. Lifting Sequential Graph Algorithms for Distributed-Memory Parallel Computation. Object-Oriented Programming, Systems, Languages, and Applications, October 2005.]

Case Study: Breadth-First Search

    ENQUEUE(Q, s)
    while (Q ≠ ∅)
        u ← DEQUEUE(Q)
        for each v ∈ Adj[u]
            if (color[v] = WHITE)
                color[v] ← GRAY
                ENQUEUE(Q, v)
        color[u] ← BLACK

SPMD Breadth-First Search [Figure: an example graph with vertices a-i partitioned across Processes 0-2; successive frames show each process's queue Q as the search advances, with newly discovered remote vertices (e.g., Q.push(d)) appended to the owning process's queue]

SPMD Breadth-First Search [Figure: Ranks 0-3 execute each level in lockstep: get neighbors, redistribute queues, combine received queues]

SPMD Breadth-First Search (Strong Scaling) [Plot: time in seconds vs. number of nodes (2-128) for SPMD/BSP breadth-first search on 2^27 vertices and 2^29 edges] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (four cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

SPMD Breadth-First Search (Weak Scaling) [Plot: time in seconds vs. number of nodes (1-128) for SPMD/BSP breadth-first search with 2^25 vertices and 2^27 edges per node] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (four cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

Find the Sequential Trap

    ENQUEUE(Q, s)
    while (Q ≠ ∅)
        u ← DEQUEUE(Q)
        for each v ∈ Adj[u]
            if (color[v] = WHITE)
                color[v] ← GRAY
                ENQUEUE(Q, v)
        color[u] ← BLACK

Find the Synchronization Trap

    ENQUEUE(Q, s)
    while (Q ≠ ∅)
        u ← DEQUEUE(Q)
        for each v ∈ Adj[u]
            if (color[v] = WHITE)
                color[v] ← GRAY
                ENQUEUE(Q, v)
        color[u] ← BLACK

    for i in ranks: start receiving in_queue[i] from rank i
    for j in ranks: start sending out_queue[j] to rank j
    synchronize and finish communications
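For concreteness, here is a hedged sketch of the per-level queue exchange the pseudocode above implies, written with plain MPI collectives rather than the PBGL's actual communication layer; the int-encoded vertices and the use of MPI_Alltoall/MPI_Alltoallv are illustrative assumptions.

    // Hedged sketch (not PBGL code): one BSP superstep of queue exchange.
    // Each rank sends out_queue[j] (vertices owned by rank j) to rank j and
    // gathers the incoming frontier for the next level.
    #include <mpi.h>
    #include <algorithm>
    #include <vector>

    std::vector<int> exchange_frontier(const std::vector<std::vector<int>>& out_queue,
                                       MPI_Comm comm) {
      int nranks;  MPI_Comm_size(comm, &nranks);

      // 1. Exchange counts so every rank knows how much it will receive.
      std::vector<int> send_counts(nranks), recv_counts(nranks);
      for (int j = 0; j < nranks; ++j) send_counts[j] = (int)out_queue[j].size();
      MPI_Alltoall(send_counts.data(), 1, MPI_INT,
                   recv_counts.data(), 1, MPI_INT, comm);

      // 2. Flatten send buffers and compute displacements.
      std::vector<int> send_displs(nranks, 0), recv_displs(nranks, 0);
      for (int j = 1; j < nranks; ++j) {
        send_displs[j] = send_displs[j - 1] + send_counts[j - 1];
        recv_displs[j] = recv_displs[j - 1] + recv_counts[j - 1];
      }
      std::vector<int> send_buf(send_displs[nranks - 1] + send_counts[nranks - 1]);
      for (int j = 0; j < nranks; ++j)
        std::copy(out_queue[j].begin(), out_queue[j].end(),
                  send_buf.begin() + send_displs[j]);

      // 3. Exchange the frontier itself; the collective's implicit barrier is
      //    exactly the per-level synchronization the slide calls a trap.
      std::vector<int> recv_buf(recv_displs[nranks - 1] + recv_counts[nranks - 1]);
      MPI_Alltoallv(send_buf.data(), send_counts.data(), send_displs.data(), MPI_INT,
                    recv_buf.data(), recv_counts.data(), recv_displs.data(), MPI_INT,
                    comm);
      return recv_buf;  // combined in_queue for the next BFS level
    }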

What's Wrong Here? SPMD fits traditional scientific applications: coarse-grained, static, balanced, bandwidth-bound with large messages, communicating data. Data-driven applications are the opposite: fine-grained dependencies, dynamic and data-dependent, irregular, latency-bound with small messages, communicating control flow.

Messaging Models. Two-sided (MPI): explicit sends and receives. One-sided (MPI-2 one-sided, ARMCI, PGAS languages): remote put and get operations, plus a limited set of atomic updates into remote memory. Active messages (GASNet, DCMF, LAPI, Charm++, X10, etc.): explicit sends, implicit receives; a user-defined handler is called on the receiver for each message.
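To make the one-sided category concrete, here is a small hedged sketch of the MPI-2 RMA style (a remote put into a peer's exposed window); the neighbor pattern is illustrative and not drawn from the talk.

    // Hedged sketch of the one-sided model: each rank exposes a window and
    // peers write into it with MPI_Put, with no matching receive.
    #include <mpi.h>
    #include <vector>

    void one_sided_example(MPI_Comm comm) {
      int rank, nranks;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nranks);

      std::vector<int> local(nranks, 0);          // exposed memory, one slot per peer
      MPI_Win win;
      MPI_Win_create(local.data(), nranks * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, comm, &win);

      MPI_Win_fence(0, win);                      // open an access epoch
      int value  = rank;
      int target = (rank + 1) % nranks;
      // Write our rank into slot `rank` of the right-hand neighbor's window.
      MPI_Put(&value, 1, MPI_INT, target, /*target_disp=*/rank, 1, MPI_INT, win);
      MPI_Win_fence(0, win);                      // close the epoch: puts are visible

      MPI_Win_free(&win);
    }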

Active Messages. Created by von Eicken et al. for Split-C (1992). Messages are sent explicitly; receivers register handlers but are not involved with individual messages; messages are often asynchronous for higher throughput. [Figure: timeline between Process 1 and Process 2; a message triggers the message handler on the receiver, whose reply triggers the reply handler on the sender]

Active Messages and AM++. Active messages move control flow to the data: fine-grained, asynchronous, with uniformity of access. AM++ is generic and user-level, flexible/modular, supports user-defined reductions, and sends to targets rather than processors. [Diagram: AM++ stack; user message types with coalescing sit above the AM++ transport, alongside termination detection (TD level) and epochs, on top of MPI or a vendor communication library]

Low-Level vs. High-Level AM Systems. Active messaging systems sit (loosely) on a spectrum of features vs. performance. Low-level systems typically have restrictions on message handler behavior, explicit buffer management, etc. High-level systems often provide dynamic load balancing, service discovery, authentication/security, etc. [Spectrum, low to high: DCMF, GASNet, Charm++/X10, Java RMI]

The AM++ Framework. AM++ provides a middle ground between low- and high-level systems: it gets its performance from low-level techniques and its programmability from high-level ones, and high-level features can be built on top of it. [Spectrum, low to high: DCMF, GASNet, Charm++/X10, Java RMI, with AM++ positioned between the two ends]

Messages Can Send Messages. [Diagram: a call to send<T>(t) checks whether owner(t) is local; if so, handler<T>(t) executes immediately, otherwise the message is forwarded toward the owner, and handlers may themselves send further messages. Pluggable termination detection detects network quiescence before send returns control at the end of an epoch]

Active Pebbles. We need to separate what the programmer expresses from what is actually executed: a programming model and an execution model.

Active Pebbles Features. Programming model: active messages (pebbles) and fine-grained addressing (targets). Execution model: flexible message coalescing, message reductions, active routing, and termination detection. The features are synergistic. AM++ is our implementation of the Active Pebbles model.

Programming Model. Program with natural granularity: no need to artificially coarsen object granularity. Transparent addressing, both static and dynamic, local and remote. Bulk, anonymous handling of messages and targets. Epoch model: enforce message delivery with controlled, relaxed consistency. [The slide's embedded code example and rank diagram are not legible in this transcription.]
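In place of the illegible listing, here is a purely illustrative, single-process stand-in for a pebble-style interface. The names (explore_msg, message_type, epoch_guard) are hypothetical and are not the AM++ API; the sketch only shows the shape of the model: a handler bound to a message type, sends addressed to targets, and an epoch that bounds delivery.

    // Illustrative toy only, NOT the AM++ interface.
    #include <cstdint>
    #include <iostream>

    struct explore_msg { std::uint64_t target_vertex; std::uint32_t level; };

    // User-defined handler: in a real runtime this would execute on whichever
    // process owns target_vertex; here it simply runs locally.
    struct explore_handler {
      void operator()(const explore_msg& m) const {
        std::cout << "visit vertex " << m.target_vertex
                  << " at level " << m.level << "\n";
        // ...a real handler would mark the vertex and send further pebbles...
      }
    };

    // Toy message type: immediate local execution stands in for coalescing,
    // routing to the owning process, and remote handler invocation.
    template <typename Msg, typename Handler>
    struct message_type {
      void send(const Msg& m) { Handler()(m); }
    };

    // Toy epoch: a real one would run termination detection in the destructor,
    // returning only once the network is quiescent.
    struct epoch_guard { ~epoch_guard() {} };

    int main() {
      message_type<explore_msg, explore_handler> explore;
      epoch_guard epoch;        // begin epoch
      explore.send({3, 0});     // handlers may themselves call send()
    }                           // end epoch: all pebbles have been handled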

Execution Model. Message coalescing amortizes latency. Message reductions eliminate redundant computation and distribute computation into the network. Active routing exploits the physical topology. Termination detection: because handlers send messages, the runtime must detect quiescence. [Diagram: table.insert(...) pebbles from several processes flow through hypercube routing, coalescing, and single- and multi-source reductions before reaching the owning processes]

Routing + Message Coalescing. Coalescing buffers limit scalability because communication is typically all-to-all, so impose a limited topology with fewer neighbors: better scalability at the cost of higher latency. [Diagram: a 4x4 process grid (P0-P15); with routing, messages from multiple sources to the same destination share buffers at intermediate hops (multi-source coalescing)]
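A minimal sketch of per-destination coalescing as described above, assuming a fixed-capacity buffer and a user-supplied batch-send callback; this is illustrative, not the AM++ implementation.

    // Messages bound for the same neighbor are packed into a buffer and sent
    // as one unit once the buffer fills or is explicitly flushed.
    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <vector>

    template <typename Msg>
    class coalescing_buffer {
    public:
      coalescing_buffer(std::size_t capacity,
                        std::function<void(const std::vector<Msg>&)> send_batch)
          : capacity_(capacity), send_batch_(std::move(send_batch)) {}

      // Amortizes per-message latency: only every capacity_-th call pays the
      // cost of an actual network send.
      void send(const Msg& m) {
        pending_.push_back(m);
        if (pending_.size() >= capacity_) flush();
      }

      void flush() {                      // also called at epoch end
        if (!pending_.empty()) {
          send_batch_(pending_);
          pending_.clear();
        }
      }

    private:
      std::size_t capacity_;
      std::function<void(const std::vector<Msg>&)> send_batch_;
      std::vector<Msg> pending_;
    };

    int main() {
      coalescing_buffer<int> buf(4, [](const std::vector<int>& batch) {
        // stand-in for one network send carrying the whole batch
        std::printf("sending batch of %zu messages\n", batch.size());
      });
      for (int v = 0; v < 10; ++v) buf.send(v);
      buf.flush();
    }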

Message Reductions + Routing. Messages to the same target can often be combined; reductions are application-specific and user-defined. Routing allows cache hits at intermediate hops, automatically synthesizes optimized collectives, and distributes computation into the network. [Diagram: 4x4 process grid showing combined messages converging toward their target]
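A hedged sketch of a message-reduction cache: messages to the same target are combined with a user-defined operation before being forwarded. The class and its interface are illustrative assumptions, not PBGL or AM++ code.

    #include <cstdio>
    #include <functional>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    template <typename Key, typename Value>
    class reduction_cache {
    public:
      using combine_fn = std::function<Value(const Value&, const Value&)>;
      explicit reduction_cache(combine_fn combine) : combine_(std::move(combine)) {}

      // Messages to the same target are merged; duplicates never reach the wire.
      void send(const Key& target, const Value& v) {
        auto it = cache_.find(target);
        if (it == cache_.end()) cache_.emplace(target, v);
        else it->second = combine_(it->second, v);
      }

      // Flush at epoch end (or when the cache is full) and forward one
      // combined message per distinct target.
      std::vector<std::pair<Key, Value>> flush() {
        std::vector<std::pair<Key, Value>> out(cache_.begin(), cache_.end());
        cache_.clear();
        return out;
      }

    private:
      combine_fn combine_;
      std::unordered_map<Key, Value> cache_;
    };

    int main() {
      reduction_cache<int, double> relax([](const double& a, const double& b) {
        return a < b ? a : b;   // SSSP-style: keep the smaller tentative distance
      });
      relax.send(7, 3.5);
      relax.send(7, 2.0);       // combined with the previous message to target 7
      for (const auto& kv : relax.flush())
        std::printf("target %d gets distance %g\n", kv.first, kv.second);
    }

Applied at intermediate routing hops, the same cache is what lets duplicate or combinable messages disappear before they reach the target process, which is how routing plus reduction can behave like a synthesized reduction tree.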

Active Message Abstraction. Pebbles are agnostic as to where they execute; they operate on targets, independent of how messages are processed: network communication (MPI, GASNet, DCMF, IB Verbs, ...), work stealing (Cilk++, task parallelism), OpenMP (over coalescing buffers), or immediate execution in the caller of send(). Thread-safe metadata allows weakening the message order; updates to targets must be atomic, and algorithms may have to tolerate weak consistency.
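One way to read "updates to targets must be atomic": with handlers running concurrently, the WHITE-to-GRAY transition should be claimed with a compare-and-swap so each vertex is discovered exactly once. A hedged sketch, using std::atomic colors rather than the PBGL's property maps.

    #include <atomic>
    #include <cstdint>
    #include <vector>

    enum : std::uint8_t { WHITE = 0, GRAY = 1, BLACK = 2 };

    struct bfs_state {
      std::vector<std::atomic<std::uint8_t>> color;
      explicit bfs_state(std::size_t n) : color(n) {
        for (auto& c : color) c.store(WHITE, std::memory_order_relaxed);
      }
    };

    // Handler body for an explore(v) pebble: returns true if this call won the
    // race and should push v into the next frontier / send further pebbles.
    inline bool try_discover(bfs_state& st, std::size_t v) {
      std::uint8_t expected = WHITE;
      return st.color[v].compare_exchange_strong(expected, GRAY,
                                                 std::memory_order_acq_rel);
    }

    int main() {
      bfs_state st(4);
      bool first  = try_discover(st, 2);   // true: this call claimed vertex 2
      bool second = try_discover(st, 2);   // false: vertex 2 is already GRAY
      return (first && !second) ? 0 : 1;
    }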

Evaluation: Message Latency [plot]

Evaluation: Message Bandwidth [plot]

Active Pebbles is meant to support all kinds of parallelism. We started by optimizing distributed-memory communication; the same features allow integration of fine-grained parallelism.

Active Messages for Work Decomposition. The key idea is to find the natural granularity: each pebble represents an independent computation that can be executed in parallel. Applied to BFS, the one-at-a-time dequeue becomes a parallel loop over the frontier, and queue pushes become message sends:

    ENQUEUE(Q, s)
    while (Q ≠ ∅)
        #pragma omp parallel for
        for each u ∈ Q                  (replaces u ← DEQUEUE(Q))
            for each v ∈ Adj[u]
                if (color[v] = WHITE)
                    color[v] ← GRAY
                    msg send()          (replaces ENQUEUE(Q, v))
            color[u] ← BLACK
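A hedged shared-memory rendering of that transformation in C++ with OpenMP; the adjacency-list representation and the per-thread local queues are illustrative choices, not the PBGL's data structures.

    // Level-synchronous BFS: process the whole current frontier in parallel and
    // claim each neighbor with a CAS so it enters the next frontier exactly once.
    #include <atomic>
    #include <cstdio>
    #include <vector>

    std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int source) {
      const int n = (int)adj.size();
      std::vector<std::atomic<int>> level(n);
      for (auto& l : level) l.store(-1, std::memory_order_relaxed);

      std::vector<int> frontier{source};
      level[source].store(0, std::memory_order_relaxed);

      for (int depth = 0; !frontier.empty(); ++depth) {
        std::vector<int> next;
        #pragma omp parallel
        {
          std::vector<int> local_next;              // per-thread output queue
          #pragma omp for nowait
          for (int i = 0; i < (int)frontier.size(); ++i) {
            for (int w : adj[frontier[i]]) {
              int expected = -1;                    // "WHITE": not yet discovered
              if (level[w].compare_exchange_strong(expected, depth + 1))
                local_next.push_back(w);            // the pebble send becomes a push here
            }
          }
          #pragma omp critical
          next.insert(next.end(), local_next.begin(), local_next.end());
        }
        frontier.swap(next);
      }

      std::vector<int> out(n);
      for (int v = 0; v < n; ++v) out[v] = level[v].load(std::memory_order_relaxed);
      return out;
    }

    int main() {
      std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
      for (int l : bfs_levels(adj, 0)) std::printf("%d ", l);
      std::printf("\n");
    }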

What's Thread-Safe in AP? Thread-safe: messaging, epoch begin/end, and termination detection. Not thread-safe: message creation; modifying message features (routing, coalescing, reductions); modifying termination detection; modifying the number of threads.

Parallel BGL Architecture. AP makes messaging thread-safe; property maps make metadata manipulation thread-safe and allow messages to be processed in arbitrary order. Just as transactional memory generalizes processor atomics to arbitrary transactions, Active Pebbles generalizes one-sided operations to user-defined handlers. [Diagram: threads T0-T3 enter an algorithm class through its operator() entry point; shared state, a LockMap, and graph property maps sit over message types with object-based addressing, reductions, and coalescing, which sit over the AM++ transport with termination detection (TD level) and epochs, over MPI or a vendor communication library; the graph itself is static, for now]
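The LockMap mentioned above can be pictured as a striped lock table: a fixed pool of mutexes indexed by a hash of the vertex, so compound per-vertex metadata updates can be made atomic without one lock per vertex. The sketch below is illustrative and makes no claim about the PBGL's actual LockMap; the vertex_data struct and relax helper are hypothetical.

    #include <array>
    #include <cstddef>
    #include <functional>
    #include <mutex>

    template <typename Key, std::size_t NumStripes = 256>
    class lock_map {
    public:
      std::mutex& lock_for(const Key& k) {
        return stripes_[std::hash<Key>{}(k) % NumStripes];
      }
    private:
      std::array<std::mutex, NumStripes> stripes_;
    };

    // Usage inside a handler: update predecessor and distance together.
    struct vertex_data { int predecessor; double distance; };

    inline void relax(lock_map<int>& locks, vertex_data& d, int v,
                      int new_pred, double new_dist) {
      std::lock_guard<std::mutex> g(locks.lock_for(v));
      if (new_dist < d.distance) { d.distance = new_dist; d.predecessor = new_pred; }
    }

    int main() {
      lock_map<int> locks;
      vertex_data d{-1, 1e9};
      relax(locks, d, /*v=*/5, /*new_pred=*/2, /*new_dist=*/3.0);
      return d.predecessor == 2 ? 0 : 1;
    }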

Active Pebbles Breadth-First Search. [Figure: the example graph from before, partitioned across Processes 0-2; instead of exchanging whole queues per level, processes send explore(vertex, level) pebbles, e.g. explore(b, 1), explore(h, 2), explore(h, 3), which are handled asynchronously on the owning process]

BFS: Strong Scaling. [Plot: time in seconds vs. number of nodes (2-128) for breadth-first search on 2^27 vertices and 2^29 edges, comparing PBGL with AM++ at t=1, 2, and 4 threads] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (two cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

BFS: Weak Scaling. [Plot: time in seconds vs. number of nodes (1-128) for breadth-first search with 2^25 vertices and 2^27 edges per node, comparing PBGL with AM++ at t=1, 2, and 4 threads] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (two cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

Delta-Stepping: Strong Scaling. [Plot: time in seconds (log scale) vs. number of nodes (2-128) for delta-stepping SSSP on 2^27 vertices and 2^29 edges, comparing PBGL with AM++ at t=1, 2, and 4 threads] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (two cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

Delta-Stepping: Weak Scaling. [Plot: time in seconds (log scale) vs. number of nodes (1-128) for delta-stepping SSSP with 2^24 vertices and 2^26 edges per node, comparing PBGL with AM++ at t=1, 2, and 4 threads] Results were measured on Erdős-Rényi graphs using a cluster of 128 nodes with 2.0 GHz Opteron 270 processors (two cores and 8 GB of PC2700 DDR DRAM per node), connected via SDR InfiniBand.

Transactional Metadata Updates. Sometimes we want to update non-contiguous, dependent data atomically: the predecessor and BFS level together, or arbitrary visitor code supplied by users. Could we simply write std::atomic<std::pair<T1, T2>>, or std::atomic<Foo<T>> for an arbitrary user-defined struct Foo? These are limited-scope transactions; are they any easier than general ones? [Diagram: the Parallel BGL stack from before, with the atomic-pair question marked at the shared-state and LockMap layers]
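std::atomic does accept any trivially copyable type, so std::atomic<std::pair<T1, T2>> is legal, but it only helps when the hardware can make it lock-free. A portable fallback when the two fields fit in one machine word is to pack them into a single 64-bit atomic; the following is a hedged sketch, not the PBGL's approach.

    // Pack predecessor and BFS level into one 64-bit word so both fields are
    // updated together, or not at all, with a single compare-and-swap.
    #include <atomic>
    #include <cstdint>

    struct packed_meta {                    // 32-bit predecessor + 32-bit level
      static std::uint64_t pack(std::uint32_t pred, std::uint32_t level) {
        return (std::uint64_t(pred) << 32) | level;
      }
      static std::uint32_t pred(std::uint64_t p)  { return std::uint32_t(p >> 32); }
      static std::uint32_t level(std::uint64_t p) { return std::uint32_t(p); }
    };

    constexpr std::uint64_t UNSET = ~std::uint64_t(0);

    // Claim a vertex: set predecessor and level atomically, but only if it has
    // not been discovered yet.
    inline bool try_set_meta(std::atomic<std::uint64_t>& meta,
                             std::uint32_t pred, std::uint32_t level) {
      std::uint64_t expected = UNSET;
      return meta.compare_exchange_strong(expected, packed_meta::pack(pred, level));
    }

    int main() {
      std::atomic<std::uint64_t> meta(UNSET);
      bool won = try_set_meta(meta, /*pred=*/4, /*level=*/2);
      std::uint64_t p = meta.load();
      return (won && packed_meta::pred(p) == 4 && packed_meta::level(p) == 2) ? 0 : 1;
    }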

Summary. Active messages express fine-grained, asynchronous operations elegantly; they are well matched to data-driven problems, they enable fine-grained parallelism (threads, GPUs, FPGAs, ...), and their asynchrony allows latency hiding. Concise expression and efficient execution come from separating the programming and execution models. Active Pebbles is a simple programming model whose execution model maps programs to hardware efficiently; AM++ is our implementation and could be targeted as a runtime by languages (X10, Chapel, ...).

Future Work. Constrained parallelism in shared memory: not entirely work-queue-based, not entirely recursion-based. Acceleration: coalesced message buffers offer great opportunities here if getting to the accelerator is cheap. Optimizing local memory transactions: Intel's TSX in Haswell CPUs, and a declarative language for transaction code generation.

Questions? More info on Active Pebbles: Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. Active Pebbles: Parallel Programming for Data-Driven Applications. ICS '11. More info on AM++: Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. AM++: A Generalized Active Message Framework. PACT '10. More info on the Parallel Boost Graph Library and graph applications: http://www.osl.iu.edu/research/pbgl and http://www.boost.org. Watch for a new release of PBGL based on Active Pebbles, running on AM++, soon! (Ask if you'd like access to a pre-release, very alpha copy.) ngedmond@cs.indiana.edu