The Parallel Boost Graph Library spawn(active Pebbles)


1 The Parallel Boost Graph Library spawn(active Pebbles) Nicholas Edmonds and Andrew Lumsdaine Center for Research in Extreme Scale Technologies Indiana University

2 Origins: Boost Graph Library (1999). Generic programming with templates; good sequential performance with high-level abstractions; algorithm composition, visitors, property maps, etc. Reference: Lie-Quan Lee, Jeremy Siek, and Andrew Lumsdaine. Generic Graph Algorithms for Sparse Matrix Ordering. ISCOPE, Lecture Notes in Computer Science.

3 Origins: Parallel BGL (2005). Lifting abstraction: make what we already have parallel; same interfaces as the sequential BGL; coarse-grained, BSP. References: Douglas Gregor and Andrew Lumsdaine. The Parallel BGL: A Generic Library for Distributed Graph Computation. Workshop on Parallel Object-Oriented Scientific Computing, July 2005. Douglas Gregor and Andrew Lumsdaine. Lifting Sequential Graph Algorithms for Distributed-Memory Parallel Computation. Object-Oriented Programming, Systems, Languages, and Applications, October 2005.

4 Case Study: Breadth-First Search
  ENQUEUE(Q, s)
  while (Q ≠ ∅)
      u ← DEQUEUE(Q)
      for (each v ∈ Adj[u])
          if (color[v] = WHITE)
              color[v] ← GRAY
              ENQUEUE(Q, v)
      color[u] ← BLACK
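As a concrete point of reference, a minimal sequential version of this loop in C++ might look like the sketch below; it uses a plain adjacency-list vector rather than the BGL graph types, and records BFS levels as it goes.

#include <cstddef>
#include <queue>
#include <vector>

enum class Color { White, Gray, Black };

// Minimal sequential BFS over an adjacency list; 's' is the source vertex.
std::vector<std::size_t> bfs_levels(const std::vector<std::vector<std::size_t>>& adj,
                                    std::size_t s) {
  std::vector<Color> color(adj.size(), Color::White);
  std::vector<std::size_t> level(adj.size(), 0);
  std::queue<std::size_t> Q;

  color[s] = Color::Gray;
  Q.push(s);
  while (!Q.empty()) {
    std::size_t u = Q.front();
    Q.pop();
    for (std::size_t v : adj[u]) {
      if (color[v] == Color::White) {   // first discovery of v
        color[v] = Color::Gray;
        level[v] = level[u] + 1;
        Q.push(v);
      }
    }
    color[u] = Color::Black;            // all neighbors of u have been examined
  }
  return level;
}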

5 SPMD Breadth-First Search [Figure: a nine-vertex example graph (a through i) partitioned across Process 0, Process 1, and Process 2, each holding its own queue Q; successive panels show the frontier being expanded level by level, with discovered vertices pushed onto the owning process's queue, e.g. Q.push(d).]

6 SPMD Breadth-First Search [Figure: Ranks 0 through 3 repeat three phases per level: get neighbors, redistribute queues, combine received queues.]

7 SPMD Breadth-First Search (Strong Scaling) [Plot: time [s] for breadth-first search (2^27 vertices, 2^29 edges) vs. number of nodes, SPMD/BSP.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with four cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

8 SPMD Breadth-First Search (Weak Scaling) [Plot: time [s] for breadth-first search (2^25 vertices, 2^27 edges per node) vs. number of nodes, SPMD/BSP.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with four cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

9 Find the Sequential Trap
  ENQUEUE(Q, s)
  while (Q ≠ ∅)
      u ← DEQUEUE(Q)
      for (each v ∈ Adj[u])
          if (color[v] = WHITE)
              color[v] ← GRAY
              ENQUEUE(Q, v)
      color[u] ← BLACK

10 Find the Synchronization Trap
  ENQUEUE(Q, s)
  while (Q ≠ ∅)
      u ← DEQUEUE(Q)
      for (each v ∈ Adj[u])
          if (color[v] = WHITE)
              color[v] ← GRAY
              ENQUEUE(Q, v)
      color[u] ← BLACK
      for i in ranks: start receiving in_queue[i] from rank i
      for j in ranks: start sending out_queue[j] to rank j
      synchronize and finish communications
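The added communication loop is the synchronization trap: every rank must finish the exchange before any rank can start the next level. As a rough illustration (not the PBGL implementation), the per-level queue exchange corresponds to an MPI all-to-all of counts followed by an all-to-all of payloads, and every rank blocks in it:

#include <mpi.h>
#include <vector>

// One BSP superstep of the distributed BFS queue exchange: each rank sends the
// vertices it discovered that live on other ranks and receives its own share.
// 'out_queue[r]' holds vertex ids owned by rank r; the result is this rank's
// combined in-queue for the next level.  (Sketch only.)
std::vector<long> exchange_queues(const std::vector<std::vector<long>>& out_queue,
                                  MPI_Comm comm) {
  int nranks;
  MPI_Comm_size(comm, &nranks);

  std::vector<int> send_counts(nranks), recv_counts(nranks);
  for (int r = 0; r < nranks; ++r)
    send_counts[r] = static_cast<int>(out_queue[r].size());
  MPI_Alltoall(send_counts.data(), 1, MPI_INT, recv_counts.data(), 1, MPI_INT, comm);

  std::vector<int> send_displs(nranks, 0), recv_displs(nranks, 0);
  std::vector<long> send_buf;
  for (int r = 0; r < nranks; ++r) {
    send_displs[r] = static_cast<int>(send_buf.size());
    send_buf.insert(send_buf.end(), out_queue[r].begin(), out_queue[r].end());
  }
  int total_recv = 0;
  for (int r = 0; r < nranks; ++r) { recv_displs[r] = total_recv; total_recv += recv_counts[r]; }

  std::vector<long> in_queue(total_recv);
  MPI_Alltoallv(send_buf.data(), send_counts.data(), send_displs.data(), MPI_LONG,
                in_queue.data(), recv_counts.data(), recv_displs.data(), MPI_LONG, comm);
  return in_queue;  // every rank blocks here: this is the synchronization point
}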

11 What's Wrong Here? (SPMD) Traditional scientific apps: coarse-grained; static; balanced; bandwidth-bound with large messages; communicate data. Data-driven apps: fine-grained dependencies; dynamic, data-dependent; irregular; latency-bound with small messages; communicate control flow.

12 Messaging Models. Two-sided (MPI): explicit sends and receives. One-sided (MPI-2 one-sided, ARMCI, PGAS languages): remote put and get operations; limited set of atomic updates into remote memory. Active messages (GASNet, DCMF, LAPI, Charm++, X10, etc.): explicit sends, implicit receives; a user-defined handler is called on the receiver for each message.

13 Active Messages. Created by von Eicken et al. for Split-C (1992). Messages are sent explicitly; receivers register handlers but are not involved with individual messages; messages are often asynchronous for higher throughput. [Figure: timeline between Process 1 and Process 2: a send triggers the message handler on Process 2, whose reply triggers the reply handler on Process 1.]
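The defining mechanism is that a handler is registered once per message type and is then invoked implicitly for every arriving message. The toy, library-agnostic C++ sketch below shows only that dispatch shape; it is not the GASNet, DCMF, or AM++ API, and it "delivers" locally instead of over a network.

#include <cstddef>
#include <functional>
#include <vector>

// Toy single-process model of active-message dispatch: handlers are registered
// once per message type; "delivering" a message looks up and runs its handler.
struct AMEngine {
  using Handler = std::function<void(const std::vector<char>&)>;

  std::size_t register_handler(Handler h) {       // receiver side, done up front
    handlers.push_back(std::move(h));
    return handlers.size() - 1;                   // message-type id
  }

  // Sender side: a real system would enqueue the bytes on the network; here we
  // deliver immediately to show the implicit-receive behavior.
  void send(std::size_t handler_id, const std::vector<char>& payload) {
    handlers.at(handler_id)(payload);
  }

 private:
  std::vector<Handler> handlers;
};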

14 Active Messages: move control flow to data; fine-grained; asynchronous; uniformity of access; user reductions. AM++: generic; user-level; flexible/modular; send to targets, not processors. [Figure: AM++ software stack: user message types with coalescing layered over the AM++ transport with termination detection, TD levels, and epochs, on top of MPI or a vendor communication library.]

15 Low-Level vs. High-Level AM Systems. Active messaging systems fall (loosely) on a spectrum of features vs. performance. Low-level systems typically have restrictions on message handler behavior, explicit buffer management, etc. High-level systems often provide dynamic load balancing, service discovery, authentication/security, etc. [Spectrum, low to high: DCMF, GASNet, Charm++/X10, Java RMI.]

16 The AM++ Framework. AM++ provides a middle ground between low- and high-level systems: it gets performance from low-level systems and programmability from high-level systems, and high-level features can be built on top of AM++. [Spectrum, low to high: DCMF, GASNet, Charm++/X10, Java RMI, with AM++ occupying the middle ground.]

17 Messages Can Send Messages. A handler may itself call send<T>(t): if owner(t) is local, handler<T>(t) is executed immediately; otherwise the message is forwarded to the owner. Termination detection (pluggable) detects network quiescence, i.e. the point at which no handler will send any further messages.
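Because handlers may send further messages, the runtime cannot count a fixed number of expected receives; it has to detect quiescence. Below is a deliberately simplified sketch of the counting idea: globally, everything sent has been handled. Real detectors (for example four-counter schemes) must also rule out messages in flight across the check; this sketch ignores that race.

#include <mpi.h>
#include <cstdint>

// Simplified quiescence check: the network is quiet when, summed over all
// ranks, every message that was sent has been handled.  This single reduction
// is NOT a correct termination detector on its own; it only illustrates the
// counters a real algorithm maintains.
bool globally_quiescent(std::uint64_t local_sent, std::uint64_t local_handled,
                        MPI_Comm comm) {
  std::uint64_t counts[2] = {local_sent, local_handled};
  std::uint64_t totals[2] = {0, 0};
  MPI_Allreduce(counts, totals, 2, MPI_UINT64_T, MPI_SUM, comm);
  return totals[0] == totals[1];
}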

18 Active Pebbles Need to separate what the programmer expresses from what is actually executed A programming model and an execution model

19 Active Pebbles Features. Programming model: active messages (pebbles); fine-grained addressing (targets). Execution model: flexible message coalescing; message reductions; active routing; termination detection. The features are synergistic. AM++ is our implementation of the Active Pebbles model.

20 Programming Model. Program with natural granularity: no need to artificially coarsen object granularity. Transparent addressing: static and dynamic, local and remote. Bulk, anonymous handling of messages and targets. Epoch model: enforce message delivery; controlled, relaxed consistency (see the epoch sketch below). [Figure: example code distributing messages across several ranks; the code is not legible in this transcription.]
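The toy below illustrates only the epoch delivery guarantee: sends made inside an epoch are buffered, and their handlers are guaranteed to have run by the time the epoch is exited. The class and method names are invented for the sketch and are not the AM++ spelling.

#include <functional>
#include <iostream>
#include <vector>

// Toy illustration of epoch semantics: sends inside an epoch are buffered, and
// exiting the epoch flushes them, so epoch completion is the point at which
// delivery (i.e. handler execution) of all messages is guaranteed.
struct ToyTransport {
  std::vector<std::function<void()>> pending;          // "in-flight" handler calls
  void send(std::function<void()> handler_call) { pending.push_back(std::move(handler_call)); }
  void flush() { for (auto& h : pending) h(); pending.clear(); }
};

struct ScopedEpoch {                                   // RAII: exit == delivery barrier
  explicit ScopedEpoch(ToyTransport& t) : t_(t) {}
  ~ScopedEpoch() { t_.flush(); }
  ToyTransport& t_;
};

int main() {
  ToyTransport transport;
  {
    ScopedEpoch epoch(transport);
    transport.send([] { std::cout << "handler ran\n"; });
    // ... more sends; handlers are only guaranteed to have run at epoch exit
  }                                                    // all handlers have run here
  return 0;
}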

21 Execution Model. Message coalescing: amortize latency. Message reduction: eliminate redundant computation; distribute computation into the network. Active routing: exploit physical topology. Termination detection: handlers send messages; detect quiescence. [Figure: a distributed hash table example on ranks P0 through P3; table.insert pebbles are coalesced, routed over a hypercube, and combined by multi-source and single-source reductions before reaching the owning rank.]

22 Routing + Message Coalescing. Coalescing buffers limit scalability because communication is typically all-to-all. Imposing a limited topology with fewer neighbors gives better scalability at the cost of higher latency. [Figure: a 4x4 grid of processes P0 through P15; with routing, each process keeps coalescing buffers only for its neighbors, and multi-source coalescing merges messages at intermediate hops.]
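For example, if the limited topology is a hypercube on a power-of-two number of ranks (hypercube routing is the example labeled in the execution-model figure above), each rank keeps coalescing buffers only for the log2(P) neighbors that differ from it in one bit, and a message moves one bit-flip closer to its destination per hop. A small sketch of that next-hop rule:

#include <cassert>

// Next hop for hypercube routing: flip the lowest bit in which the current
// rank and the destination differ.  Assumes a power-of-two rank count.
int hypercube_next_hop(int current, int dest) {
  assert(current != dest);
  unsigned diff = static_cast<unsigned>(current ^ dest);
  unsigned lowest_set = diff & (~diff + 1u);   // isolate the lowest differing bit
  return current ^ static_cast<int>(lowest_set);
}
// e.g. routing 0 -> 11 (binary 1011) visits ranks 1, 3, 11: at most log2(P)
// hops, so each rank needs buffers for only log2(P) neighbors instead of P-1.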

23 Message Reductions + Routing. Messages to the same target can often be combined; reductions are application-specific and user-defined. Routing allows cache hits at intermediate hops. This automatically synthesizes optimized collectives and distributes computation into the network. [Figure: 4x4 process grid P0 through P15 showing duplicate messages being eliminated along the routing paths.]
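As an illustration, in a shortest-paths-style computation the messages to a given target carry tentative distances and can be combined by keeping only the minimum. The following sketch of a sender-side (or intermediate-hop) buffer with a user-defined min-reduction shows the idea; it is not the AM++ reduction interface.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// One outgoing coalescing buffer with a min-reduction: messages addressed to
// the same target vertex are merged by keeping the smaller distance, so only
// one message per target survives until the buffer is flushed.
struct DistanceMsg { std::uint64_t target; double distance; };

class ReducingBuffer {
 public:
  void send(const DistanceMsg& m) {
    auto it = cache_.find(m.target);
    if (it == cache_.end()) {
      cache_.emplace(m.target, buffer_.size());
      buffer_.push_back(m);
    } else if (m.distance < buffer_[it->second].distance) {
      buffer_[it->second].distance = m.distance;   // combine: keep the minimum
    }                                              // else: redundant, drop it
  }

  std::vector<DistanceMsg> flush() {               // hand the survivors to the transport
    std::vector<DistanceMsg> out;
    out.swap(buffer_);
    cache_.clear();
    return out;
  }

 private:
  std::unordered_map<std::uint64_t, std::size_t> cache_;  // target -> slot in buffer_
  std::vector<DistanceMsg> buffer_;
};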

24 Active Message Abstraction. Pebbles are agnostic as to where they execute and operate on targets, independent of how messages are processed: network communication (MPI, GASNet, DCMF, IB Verbs), work stealing (Cilk++, task parallelism), OpenMP (over coalescing buffers), or immediate execution in the caller of send(). Thread-safe metadata allows weakening the message order; updates to targets must be atomic, and algorithms may have to tolerate weak consistency.

25 Evaluation: Message Latency [plot not reproduced in the transcription]

26 Evaluation: Message Bandwidth [plot not reproduced in the transcription]

27 Active Pebbles is meant to support all kinds of parallelism. We started with optimizing distributed-memory communication; the same features allow integration of fine-grained parallelism.

28 Active Messages for Work Decomposition. The key idea is to find the natural granularity: each pebble represents an independent computation that can be executed in parallel. In the BFS loop, the serial dequeue (u ← DEQUEUE(Q)) becomes a parallel loop over the current queue, and the queue push becomes a message send (see the sketch after the pseudocode):
  ENQUEUE(Q, s)
  while (Q ≠ ∅)
      #pragma omp parallel for
      for (each u ∈ Q)
          for (each v ∈ Adj[u])
              if (color[v] = WHITE)
                  color[v] ← GRAY
                  ENQUEUE(Q, v)      // queue push becomes msg send()
          color[u] ← BLACK
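A hedged C++/OpenMP sketch of that decomposition for one BFS level: the thread team expands the current frontier, locally owned discoveries go into the next frontier, and remotely owned ones become explore messages. The is_local and send_explore helpers are hypothetical stand-ins (stubbed here so the sketch is self-contained), not PBGL or AM++ calls.

#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the distributed pieces: ownership test and a
// remote "explore" pebble send (stubbed so the sketch compiles).
static bool is_local(std::size_t /*v*/) { return true; }
static void send_explore(std::size_t /*v*/, std::size_t /*level*/) {}

// Expand one BFS level in parallel.  'visited' uses atomic flags so that two
// threads discovering the same vertex race benignly: exactly one wins.
void expand_level(const std::vector<std::vector<std::size_t>>& adj,
                  const std::vector<std::size_t>& frontier,
                  std::vector<std::atomic<bool>>& visited,
                  std::vector<std::size_t>& next_frontier,
                  std::size_t level) {
  #pragma omp parallel
  {
    std::vector<std::size_t> local_next;           // per-thread buffer, merged later
    #pragma omp for nowait
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(frontier.size()); ++i) {
      for (std::size_t v : adj[frontier[i]]) {
        if (!is_local(v)) { send_explore(v, level + 1); continue; }  // pebble to owner
        bool expected = false;
        if (visited[v].compare_exchange_strong(expected, true))
          local_next.push_back(v);                 // first local discovery of v
      }
    }
    #pragma omp critical
    next_frontier.insert(next_frontier.end(), local_next.begin(), local_next.end());
  }
}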

29 What's Thread-safe in AP? Thread-safe: messaging; epoch begin/end; termination detection. NOT thread-safe: message creation; modifying message features (routing, coalescing, reductions); modifying termination detection; modifying the number of threads.

30 Parallel BGL Architecture. AP makes messaging thread-safe; property maps make metadata manipulation thread-safe and allow messages to be processed in arbitrary order. Just as transactional memory generalizes processor atomics to arbitrary transactions, Active Pebbles generalizes one-sided operations to user-defined handlers. [Figure: architecture diagram: threads T0 through T3 drive an algorithm class (entry point, operator(), shared state) over a graph (static, for now), property maps, and a LockMap; message types with object-based addressing, reductions, and coalescing sit on the AM++ transport with collectives, termination detection, TD levels, and epochs, over MPI or a vendor communication library.]

31 Active Pebbles Breadth-First Search [Figure: the nine-vertex example graph again, distributed across Process 0, Process 1, and Process 2; discovery is expressed as explore(vertex, level) pebbles, e.g. explore(b, 1), explore(h, 2), explore(h, 3), sent directly to the owning process.]

32 BFS: Strong Scaling [Plot: time [s] for breadth-first search (2^27 vertices, 2^29 edges) vs. number of nodes, comparing PBGL with AM++ at t=1, 2, and 4 threads.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

33 BFS: Weak Scaling [Plot: time [s] for breadth-first search (2^25 vertices, 2^27 edges per node) vs. number of nodes, comparing PBGL with AM++ at t=1, 2, and 4 threads.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

34 Delta-Stepping: Strong Scaling [Plot: time [s] for delta-stepping SSSP (2^27 vertices, 2^29 edges) vs. number of nodes, comparing PBGL with AM++ at t=1, 2, and 4 threads.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

35 Delta-Stepping: Weak Scaling [Plot: time [s] for delta-stepping SSSP (2^24 vertices, 2^26 edges per node) vs. number of nodes, comparing PBGL with AM++ at t=1, 2, and 4 threads.] Results were run on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR DRAM per node, connected via SDR InfiniBand.

36 Transactional Metadata Updates. Sometimes we want to update non-contiguous, dependent data atomically: the predecessor and BFS level together, or arbitrary visitor code supplied by users. Are limited-scope transactions any easier? Can std::atomic<std::pair<T1, T2>> work? What about std::atomic<Foo<T>> for an arbitrary template <typename T> struct Foo? [Figure: the PBGL architecture diagram again, highlighting the algorithm's shared state, LockMap, graph, and property maps.]
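Short of transactions, one narrow trick for the predecessor-plus-level case is to pack both fields into a single 64-bit word and update it with one compare-and-swap; this only works when the fields fit in a machine word and says nothing about arbitrary visitor code. An illustrative sketch of that workaround (not what PBGL does):

#include <atomic>
#include <cstdint>

// Pack (predecessor, level) into one 64-bit word so both can be updated by a
// single lock-free CAS.  Only works because each field fits in 32 bits; wider
// or truly non-contiguous metadata is exactly where transactions get tempting.
struct VertexMeta {
  static std::uint64_t pack(std::uint32_t pred, std::uint32_t level) {
    return (static_cast<std::uint64_t>(pred) << 32) | level;
  }

  // Claim the vertex if it is still unvisited; returns true for the winner.
  bool try_set(std::uint32_t pred, std::uint32_t level) {
    std::uint64_t expected = pack(UINT32_MAX, UINT32_MAX);   // "unvisited"
    return packed.compare_exchange_strong(expected, pack(pred, level));
  }

  std::atomic<std::uint64_t> packed{pack(UINT32_MAX, UINT32_MAX)};
};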

37 Summary. Active messages express fine-grained, asynchronous operations elegantly; they are well matched to data-driven problems, enable fine-grained parallelism (threads, GPUs, FPGAs, ...), and their asynchrony allows latency hiding. Concise expression and efficient execution come from separating the programming and execution models. Active Pebbles: a simple programming model whose execution model maps programs to hardware efficiently; AM++ is our implementation and could be targeted as a runtime by languages (X10, Chapel, ...).

38 Future Work. Constrained parallelism in shared memory: not entirely work-queue-based, not entirely recursion-based. Acceleration: coalesced message buffers offer great opportunities here if getting to the accelerator is cheap. Optimizing local memory transactions: Intel's TSX in Haswell CPUs; a declarative language for transaction code generation.

39 Questions? More info on Active Pebbles: Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. Active Pebbles: Parallel Programming for Data-Driven Applications. ICS '11. More info on AM++: Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. AM++: A Generalized Active Message Framework. PACT '10. More info on the Parallel Boost Graph Library and graph applications: watch for a new release of PBGL based on Active Pebbles, running on AM++, soon! (Ask if you'd like access to a pre-release, very alpha copy.) ngedmond@cs.indiana.edu
