The Parallel Boost Graph Library spawn(active Pebbles)
1 The Parallel Boost Graph Library: spawn(active Pebbles)
Nicholas Edmonds and Andrew Lumsdaine, Center for Research in Extreme Scale Technologies, Indiana University
2 Origins: Boost Graph Library (1999)
- Generic programming with templates
- Good sequential performance with high-level abstractions
- Algorithm composition, visitors, property maps, etc.
Lie-Quan Lee, Jeremy Siek, and Andrew Lumsdaine. Generic Graph Algorithms for Sparse Matrix Ordering. ISCOPE, Lecture Notes in Computer Science.
3 Origins: Parallel BGL (2005)
- Lifting abstraction: make what we already have parallel
- Same interfaces as sequential BGL
- Coarse-grained, BSP
Douglas Gregor and Andrew Lumsdaine. The Parallel BGL: A Generic Library for Distributed Graph Computation. Workshop on Parallel Object-Oriented Scientific Computing, July 2005.
Douglas Gregor and Andrew Lumsdaine. Lifting Sequential Graph Algorithms for Distributed-Memory Parallel Computation. Object-Oriented Programming, Systems, Languages, and Applications, October 2005.
4 Case Study: Breadth-First Search
    ENQUEUE(Q, s)
    while Q ≠ ∅
      u ← DEQUEUE(Q)
      for each v ∈ Adj[u]
        if color[v] = WHITE
          color[v] ← GRAY
          ENQUEUE(Q, v)
      color[u] ← BLACK
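For reference, here is a minimal sequential version of that loop in C++ (an illustrative sketch over a plain adjacency list, not BGL's actual interface):

    #include <cstdio>
    #include <queue>
    #include <vector>

    enum Color { WHITE, GRAY, BLACK };

    // Breadth-first search over an adjacency list, mirroring the slide's pseudocode.
    void bfs(const std::vector<std::vector<int>>& adj, int s) {
        std::vector<Color> color(adj.size(), WHITE);
        std::queue<int> q;
        color[s] = GRAY;
        q.push(s);
        while (!q.empty()) {
            int u = q.front();
            q.pop();
            for (int v : adj[u]) {
                if (color[v] == WHITE) {   // first time we see v
                    color[v] = GRAY;
                    q.push(v);
                }
            }
            color[u] = BLACK;              // all of u's neighbors discovered
            std::printf("finished vertex %d\n", u);
        }
    }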
5 SPMD Breadth-First Search
[Figure: a nine-vertex graph (a through i) partitioned across Process 0, Process 1, and Process 2, each with a local queue Q; the animation steps the BFS wavefront through the partitions, e.g. Q.push(d) when an edge crossing a process boundary discovers d.]
6 SPMD Breadth-First Search
Each BFS level proceeds in three phases across Rank 0 through Rank 3: get neighbors, redistribute queues, combine received queues.
7 SPMD Breadth-First Search (Strong Scaling)
[Plot: time [s] for breadth-first search (2^27 vertices, 2^29 edges) versus number of nodes, SPMD/BSP.]
Results were obtained on Erdős-Rényi graphs using a cluster of Opteron 270 processors with four cores and 8 GB of PC2700 DDR-DRAM per node, connected via SDR InfiniBand.
8 SPMD Breadth-First Search (Weak Scaling)
[Plot: time [s] for breadth-first search (2^25 vertices, 2^27 edges per node) versus number of nodes, SPMD/BSP.]
Results were obtained on Erdős-Rényi graphs using a cluster of Opteron 270 processors with four cores and 8 GB of PC2700 DDR-DRAM per node, connected via SDR InfiniBand.
9 Find the Sequential Trap
    ENQUEUE(Q, s)
    while Q ≠ ∅
      u ← DEQUEUE(Q)
      for each v ∈ Adj[u]
        if color[v] = WHITE
          color[v] ← GRAY
          ENQUEUE(Q, v)
      color[u] ← BLACK
10 Find the Synchronization Trap
    ENQUEUE(Q, s)
    while Q ≠ ∅
      u ← DEQUEUE(Q)
      for each v ∈ Adj[u]
        if color[v] = WHITE
          color[v] ← GRAY
          ENQUEUE(Q, v)
      color[u] ← BLACK
    (at the end of each level:)
    for i in ranks: start receiving in_queue[i] from rank i
    for j in ranks: start sending out_queue[j] to rank j
    synchronize and finish communications
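The per-level exchange at the bottom is the synchronization trap: every rank blocks until the slowest rank's queue arrives. A minimal sketch of that exchange with MPI (illustrative only; PBGL's actual implementation differs):

    #include <mpi.h>
    #include <numeric>
    #include <vector>

    // Exchange per-destination vertex queues at the end of a BFS level.
    // out[r] holds vertices discovered locally that rank r owns.
    std::vector<int> exchange_queues(const std::vector<std::vector<int>>& out,
                                     MPI_Comm comm) {
        int nranks;
        MPI_Comm_size(comm, &nranks);

        // Flatten outgoing queues and record counts/offsets per destination.
        std::vector<int> sendcounts(nranks), sdispls(nranks), sendbuf;
        for (int r = 0; r < nranks; ++r) {
            sdispls[r] = (int)sendbuf.size();
            sendbuf.insert(sendbuf.end(), out[r].begin(), out[r].end());
            sendcounts[r] = (int)out[r].size();
        }

        // Everyone learns how much it will receive from everyone else...
        std::vector<int> recvcounts(nranks), rdispls(nranks);
        MPI_Alltoall(sendcounts.data(), 1, MPI_INT,
                     recvcounts.data(), 1, MPI_INT, comm);
        std::partial_sum(recvcounts.begin(), recvcounts.end() - 1,
                         rdispls.begin() + 1);

        // ...then the whole machine synchronizes on the actual exchange.
        std::vector<int> in(rdispls[nranks - 1] + recvcounts[nranks - 1]);
        MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_INT,
                      in.data(), recvcounts.data(), rdispls.data(), MPI_INT, comm);
        return in;  // the combined next-level queue for this rank
    }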
11 What's Wrong Here? (SPMD)
Traditional scientific apps:
- Coarse-grained
- Static
- Balanced
- Bandwidth-bound with large messages
- Communicates data
Data-driven apps:
- Fine-grained dependencies
- Dynamic, data-dependent
- Irregular
- Latency-bound with small messages
- Communicates control flow
12 Messaging Models
Two-sided (MPI):
- Explicit sends and receives
One-sided (MPI-2 one-sided, ARMCI, PGAS languages):
- Remote put and get operations
- Limited set of atomic updates into remote memory
Active messages (GASNet, DCMF, LAPI, Charm++, X10, etc.):
- Explicit sends, implicit receives
- User-defined handler called on the receiver for each message
13 Active Messages
- Created by von Eicken et al. for Split-C (1992)
- Messages sent explicitly
- Receivers register handlers but are not involved with individual messages
- Messages often asynchronous for higher throughput
[Diagram: Process 1 sends a message to Process 2; the message handler runs on Process 2 and may send a reply, whose reply handler runs back on Process 1.]
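In code, the essence of the model is "register a handler once, then fire messages at it". A self-contained sketch of that idea (a toy in-process dispatcher, not GASNet's or AM++'s actual API):

    #include <cstdio>
    #include <functional>
    #include <vector>

    // Toy active-message layer: handlers are registered once per message type;
    // send() delivers the payload to the handler "on the receiver".
    struct ToyAM {
        using Handler = std::function<void(int src, int payload)>;
        std::vector<Handler> handlers;

        int register_handler(Handler h) {          // done once, at setup time
            handlers.push_back(std::move(h));
            return (int)handlers.size() - 1;
        }
        void send(int handler_id, int src, int payload) {
            handlers[handler_id](src, payload);    // implicit "receive"
        }
    };

    int main() {
        ToyAM am;
        int visit = am.register_handler([](int src, int v) {
            std::printf("handler: visit vertex %d (sent by %d)\n", v, src);
        });
        am.send(visit, /*src=*/0, /*payload=*/42);  // no matching recv anywhere
    }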
14 Active Messages
- Move control flow to data
- Fine-grained
- Asynchronous
- Uniformity of access
- User-defined reductions
AM++:
- Generic, user-level
- Flexible/modular
- Send to targets, not processors
[Diagram: the AM++ stack, with message types, coalescing, and reductions layered over the AM++ transport, termination detection, and epochs, on top of MPI or a vendor communication library.]
15 Low-Level vs. High-Level AM Systems
- Active messaging systems lie (loosely) on a spectrum of features vs. performance
- Low-level systems typically have restrictions on message handler behavior, explicit buffer management, etc.
- High-level systems often provide dynamic load balancing, service discovery, authentication/security, etc.
[Spectrum, low to high: DCMF, GASNet ... Charm++/X10, Java RMI]
16 The AM++ Framework
- AM++ provides a middle ground between low- and high-level systems
- Gets its performance from low-level systems
- Gets its programmability from high-level systems
- High-level features can be built on top of AM++
[Spectrum, low to high: DCMF, GASNet ... AM++ ... Charm++/X10, Java RMI]
17 Messages Can Send Messages
[Flowchart: send<T>(t) checks whether owner(t) is local. If yes, handler<T>(t) executes immediately and may itself call send<T>(t); if not, a message is sent and the call returns. Termination detection (pluggable) detects network quiescence once no messages remain in flight.]
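One common way to detect quiescence when handlers can send further messages is message counting: an epoch ends only when every message sent has also been handled. A single-process sketch of that invariant (illustrative only; AM++'s termination detection is distributed and pluggable):

    #include <cstdio>
    #include <deque>
    #include <functional>

    // Count sends and handles; the "network" is quiescent when they balance
    // and nothing is waiting in flight.
    struct CountingEpoch {
        std::deque<std::function<void()>> in_flight;
        long sent = 0, handled = 0;

        void send(std::function<void()> handler_call) {
            ++sent;
            in_flight.push_back(std::move(handler_call));
        }
        void run_to_quiescence() {        // handlers may call send() again
            while (!in_flight.empty()) {
                auto h = std::move(in_flight.front());
                in_flight.pop_front();
                h();
                ++handled;
            }
            std::printf("epoch done: sent=%ld handled=%ld\n", sent, handled);
        }
    };

    int main() {
        CountingEpoch ep;
        // A handler that re-sends until a counter hits zero, like a BFS wavefront.
        std::function<void(int)> explore = [&](int depth) {
            if (depth > 0)
                ep.send([&, depth] { explore(depth - 1); });
        };
        ep.send([&] { explore(3); });
        ep.run_to_quiescence();           // terminates: 4 sent, 4 handled
    }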
18 Active Pebbles
We need to separate what the programmer expresses from what is actually executed: a programming model and an execution model.
19 Active Pebbles Features
Programming model:
- Active messages (pebbles)
- Fine-grained addressing (targets)
Execution model:
- Flexible message coalescing
- Message reductions
- Active routing
- Termination detection
These features are synergistic. AM++ is our implementation of the Active Pebbles model.
20 Programming Model
- Program with natural granularity; no need to artificially coarsen object granularity
- Transparent addressing: static and dynamic, local and remote
- Bulk, anonymous handling of messages and targets
- Epoch model: enforces message delivery; controlled, relaxed consistency
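A sketch of how the epoch model reads in code (the names Transport and ScopedEpoch are illustrative stand-ins, not AM++'s verbatim API): within an epoch, sends may be delivered at any time and in any order; only at the epoch's end is delivery guaranteed.

    #include <cstdio>
    #include <functional>
    #include <vector>

    // Illustrative stand-in for an AM++-style transport.
    struct Transport {
        std::vector<std::function<void()>> pending;
        void send(std::function<void()> handler) {
            pending.push_back(std::move(handler));
        }
    };

    struct ScopedEpoch {                // RAII: destruction = end of epoch
        Transport& t;
        explicit ScopedEpoch(Transport& t) : t(t) {}
        ~ScopedEpoch() {                // all sends are delivered by epoch end
            while (!t.pending.empty()) {
                auto h = std::move(t.pending.back());
                t.pending.pop_back();
                h();                    // handlers may send more within the epoch
            }
        }
    };

    int main() {
        Transport t;
        {
            ScopedEpoch e(t);           // begin epoch
            t.send([] { std::printf("pebble handled\n"); });
            // Between send() and the end of the epoch, delivery order and
            // timing are unspecified: controlled, relaxed consistency.
        }                               // end epoch: delivery now guaranteed
        std::printf("after epoch: all messages delivered\n");
    }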
21 Execution Model
- Message coalescing: amortize latency
- Message reduction: eliminate redundant computation; distribute computation into the network
- Active routing: exploit physical topology
- Termination detection: handlers send messages; detect quiescence
[Figure: table.insert pebbles such as table.insert(6, A) and table.insert(7, B) flow from P0-P3 through hypercube routing; coalescing packs them into shared buffers, and multi-source and single-source reductions combine inserts to the same key along the way.]
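A minimal sketch of coalescing (illustrative, not AM++'s implementation): messages to the same destination accumulate in a buffer that is flushed when full, trading a little latency for far fewer, larger sends.

    #include <cstdio>
    #include <vector>

    // Coalesce small messages per destination; flush when a buffer fills.
    struct Coalescer {
        size_t capacity;
        std::vector<std::vector<int>> buffers;   // one buffer per destination rank

        Coalescer(int nranks, size_t capacity)
            : capacity(capacity), buffers(nranks) {}

        void send(int dest, int payload) {
            buffers[dest].push_back(payload);
            if (buffers[dest].size() == capacity) flush(dest);
        }
        void flush(int dest) {                   // one big send instead of many tiny ones
            std::printf("send %zu coalesced messages to rank %d\n",
                        buffers[dest].size(), dest);
            buffers[dest].clear();
        }
        void flush_all() {                       // e.g., at the end of an epoch
            for (int d = 0; d < (int)buffers.size(); ++d)
                if (!buffers[d].empty()) flush(d);
        }
    };

    int main() {
        Coalescer c(/*nranks=*/4, /*capacity=*/3);
        for (int i = 0; i < 7; ++i) c.send(i % 2, i);  // mostly ranks 0 and 1
        c.flush_all();
    }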
22 Routing + Message Coalescing
- Coalescing buffers limit scalability: communication is typically all-to-all
- Impose a limited topology with fewer neighbors
- Better scalability, higher latency
[Figure: a 4x4 grid of ranks P0-P15; with direct all-to-all each rank keeps a coalescing buffer per peer, while routed, multi-source coalescing funnels traffic through a few neighbors.]
23 Message Reductions + Routing
- Messages to the same target can often be combined
- Reductions are application-specific and user-defined
- Routing allows cache hits at intermediate hops
- Automatically synthesize optimized collectives
- Distribute computation into the network
[Figure: the same P0-P15 grid, with duplicate messages eliminated at intermediate ranks along the routes.]
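A sketch of a user-defined reduction (illustrative): a small cache of recently seen targets drops or combines redundant updates before they reach the wire, and the same filter can run again at intermediate routing hops.

    #include <cstdio>
    #include <unordered_map>

    // Combine messages addressed to the same target with a user-defined
    // reduction; here, keep only the minimum distance per vertex (as in SSSP).
    struct MinReducer {
        std::unordered_map<int, int> best;   // target vertex -> best distance seen

        // Returns true if the message should still be sent onward.
        bool offer(int target, int dist) {
            auto it = best.find(target);
            if (it != best.end() && it->second <= dist)
                return false;                // redundant: a better update already passed
            best[target] = dist;
            return true;
        }
    };

    int main() {
        MinReducer r;
        std::printf("%d\n", r.offer(6, 10));  // 1: first update to vertex 6
        std::printf("%d\n", r.offer(6, 12));  // 0: worse, eliminated locally
        std::printf("%d\n", r.offer(6, 7));   // 1: improvement, forwarded
    }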
24 Active Message Abstraction
- Pebbles are agnostic as to where they execute; they operate on targets
- Independent of how messages are processed:
  - Network communication (MPI, GASNet, DCMF, IB Verbs, ...)
  - Work stealing (Cilk++, task parallelism)
  - OpenMP (over coalescing buffers)
  - Immediate execution in the caller of send()
- Thread-safe metadata allows weakening message order
- Updates to targets must be atomic
- Algorithms may have to tolerate weak consistency
25 Evaluation: Message Latency [plot]
26 Evaluation: Message Bandwidth [plot]
27 Active Pebbles
- Meant to support all kinds of parallelism
- Started with optimizing distributed-memory communication
- The same features allow integration of fine-grained parallelism
28 Active Messages for Work Decomposition
- The key idea is to find the natural granularity
- Each pebble represents an independent computation that can be executed in parallel
    ENQUEUE(Q, s)
    while Q ≠ ∅
      #pragma omp parallel for
      for each u ∈ Q            (replaces u ← DEQUEUE(Q))
        for each v ∈ Adj[u]
          if color[v] = WHITE
            color[v] ← GRAY
            ENQUEUE(Q, v)       (queue push becomes msg send())
        color[u] ← BLACK
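A compilable sketch of that transformation (illustrative; PBGL's actual code differs): the current level's queue is traversed with OpenMP, and each discovery is pushed to a next-level queue, standing in for msg send().

    #include <atomic>
    #include <cstdio>
    #include <vector>

    // Level-synchronous BFS: each frontier vertex is an independent piece of
    // work; claiming a vertex uses an atomic flag, standing in for an atomic
    // update to a pebble's target.
    void bfs_levels(const std::vector<std::vector<int>>& adj, int s) {
        std::vector<std::atomic<bool>> visited(adj.size());
        for (auto& f : visited) f.store(false);
        visited[s] = true;
        std::vector<int> frontier{s};
        for (int level = 0; !frontier.empty(); ++level) {
            std::vector<int> next;
            #pragma omp parallel
            {
                std::vector<int> local;               // per-thread "coalescing buffer"
                #pragma omp for nowait
                for (int i = 0; i < (int)frontier.size(); ++i)
                    for (int v : adj[frontier[i]]) {
                        bool expected = false;        // claim v exactly once
                        if (visited[v].compare_exchange_strong(expected, true))
                            local.push_back(v);       // "msg send()" to v's owner
                    }
                #pragma omp critical
                next.insert(next.end(), local.begin(), local.end());
            }
            std::printf("level %d: %zu vertices\n", level, frontier.size());
            frontier = std::move(next);
        }
    }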
29 What's Thread-Safe in AP?
Thread-safe:
- Messaging
- Epoch begin/end
- Termination detection
Not thread-safe:
- Message creation
- Modifying message features: routing, coalescing, reductions
- Modifying termination detection
- Modifying the number of threads
30 Parallel BGL Architecture
- AP makes messaging thread-safe
- Property maps make metadata manipulation thread-safe and allow messages to be processed in arbitrary order
- Just as transactional memory generalizes processor atomics to arbitrary transactions, Active Pebbles generalizes one-sided operations to user-defined handlers
[Diagram: threads T0-T3 enter an algorithm class through its operator() entry point and shared state; message types with object-based addressing, reductions, and coalescing sit on the AM++ transport with termination detection and epochs, over MPI or a vendor communication library; the graph (static, for now) and its property maps sit alongside a LockMap.]
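A sketch of the property-map side of this (illustrative): a distance map whose update is an atomic fetch-min, so handlers can apply relaxations in any order, from any thread.

    #include <atomic>
    #include <cstdio>
    #include <vector>

    // A thread-safe "property map": one atomic slot per vertex, updated with a
    // CAS loop so updates can arrive in arbitrary order.
    struct AtomicDistanceMap {
        std::vector<std::atomic<int>> dist;

        explicit AtomicDistanceMap(size_t n) : dist(n) {
            for (auto& d : dist) d.store(1 << 30);     // "infinity"
        }
        // Returns true if v's distance improved (i.e., the message did real work).
        bool relax(int v, int candidate) {
            int cur = dist[v].load();
            while (candidate < cur)
                if (dist[v].compare_exchange_weak(cur, candidate))
                    return true;                        // cur is reloaded on failure
            return false;
        }
    };

    int main() {
        AtomicDistanceMap dmap(8);
        std::printf("%d\n", dmap.relax(3, 5));  // 1: improved
        std::printf("%d\n", dmap.relax(3, 9));  // 0: stale update, ignored
    }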
31 Active Pebbles Breadth-First Search
[Figure: the same nine-vertex graph (a through i) across Process 0, Process 1, and Process 2; instead of bulk queue exchanges, pebbles such as explore(b, 1), explore(h, 2), and explore(h, 3) are sent to the owners of the target vertices as they are discovered.]
32 BFS: Strong Scaling
[Plot: time [s] for breadth-first search (2^27 vertices, 2^29 edges) versus number of nodes; PBGL vs. AM++ with t=1, 2, and 4 threads.]
Results were obtained on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR-DRAM per node, connected via SDR InfiniBand.
33 BFS: Weak Scaling
[Plot: time [s] for breadth-first search (2^25 vertices, 2^27 edges per node) versus number of nodes; PBGL vs. AM++ with t=1, 2, and 4 threads.]
Results were obtained on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR-DRAM per node, connected via SDR InfiniBand.
34 Delta-Stepping: Strong Scaling
[Plot: time [s] for delta-stepping SSSP (2^27 vertices, 2^29 edges) versus number of nodes; PBGL vs. AM++ with t=1, 2, and 4 threads.]
Results were obtained on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR-DRAM per node, connected via SDR InfiniBand.
35 Delta-Stepping: Weak Scaling
[Plot: time [s] for delta-stepping SSSP (2^24 vertices, 2^26 edges per node) versus number of nodes; PBGL vs. AM++ with t=1, 2, and 4 threads.]
Results were obtained on Erdős-Rényi graphs using a cluster of Opteron 270 processors with two cores and 8 GB of PC2700 DDR-DRAM per node, connected via SDR InfiniBand.
36 Transactional Metadata Updates
- Sometimes we want to update non-contiguous, dependent data atomically
  - Predecessor and BFS level
  - Or arbitrary visitor code supplied by users
- Limited-scope transactions: are these any easier?
    std::atomic<std::pair<T1, T2>>?
    template <typename T> struct Foo;
    std::atomic<Foo<T>>?
[Diagram: threads T0-T3, the algorithm class entry point and shared state, a LockMap, and the graph with its property maps.]
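Where both fields fit in one machine word, the "transaction" can be faked by packing; the general case of an arbitrary Foo<T> is exactly what does not fit. A hypothetical sketch that packs the BFS level and predecessor into a single 64-bit atomic and CASes the pair:

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    // Pack (level, predecessor) into one 64-bit word so both update atomically.
    struct LevelPred {
        std::atomic<uint64_t> word{pack(UINT32_MAX, UINT32_MAX)};  // unvisited

        static uint64_t pack(uint32_t level, uint32_t pred) {
            return (uint64_t)level << 32 | pred;
        }
        // Claim the vertex: succeeds only for the first discoverer.
        bool try_set(uint32_t level, uint32_t pred) {
            uint64_t unvisited = pack(UINT32_MAX, UINT32_MAX);
            return word.compare_exchange_strong(unvisited, pack(level, pred));
        }
        uint32_t level() const { return (uint32_t)(word.load() >> 32); }
        uint32_t pred() const { return (uint32_t)word.load(); }
    };

    int main() {
        LevelPred v;
        std::printf("%d\n", v.try_set(2, 7));   // 1: discovered at level 2 via 7
        std::printf("%d\n", v.try_set(3, 4));   // 0: later discovery loses
        std::printf("level=%u pred=%u\n", v.level(), v.pred());
    }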
37 Summary
Active messages:
- Express fine-grained, asynchronous operations elegantly
- Well-matched to data-driven problems
- Enable fine-grained parallelism (threads, GPUs, FPGAs, ...)
- Asynchrony allows latency hiding
Concise expression and efficient execution:
- Separate programming and execution models
Active Pebbles:
- Simple programming model
- Execution model maps programs to hardware efficiently
- AM++ is our implementation
- Could be targeted as a runtime by languages (X10, Chapel, ...)
38 Future Work
- Constrained parallelism in shared memory: not entirely work-queue-based, not entirely recursion-based
- Acceleration: coalesced message buffers offer great opportunities here, if getting to the accelerator is cheap
- Optimizing local memory transactions: Intel's TSX in Haswell CPUs
- Declarative language for transaction code generation
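For the TSX point, a minimal sketch of how such a transaction looks with the RTM intrinsics from <immintrin.h> (a sketch, not the talk's code; requires a TSX-capable CPU and -mrtm, and a fallback path is mandatory because any transaction may abort):

    #include <atomic>
    #include <cstdio>
    #include <immintrin.h>

    static std::atomic<bool> locked{false};   // fallback spin lock
    static int level = -1, pred = -1;         // dependent metadata to update together

    // Update both fields atomically: try an RTM transaction, fall back to the
    // lock if it aborts.
    void set_level_pred(int new_level, int new_pred) {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (locked.load()) _xabort(0xff); // lock held: abort, take slow path
            level = new_level;                // both writes commit or neither does
            pred = new_pred;
            _xend();
            return;
        }
        while (locked.exchange(true)) {}      // aborted: spin on the fallback lock
        level = new_level;
        pred = new_pred;
        locked.store(false);
    }

    int main() {
        set_level_pred(2, 7);
        std::printf("level=%d pred=%d\n", level, pred);
    }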
39 Questions?
More info on Active Pebbles:
Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. Active Pebbles: Parallel Programming for Data-Driven Applications. ICS '11.
More info on AM++:
Jeremiah Willcock, Torsten Hoefler, Nicholas Edmonds, and Andrew Lumsdaine. AM++: A Generalized Active Message Framework. PACT '10.
More info on the Parallel Boost Graph Library and graph applications:
Watch for a new release of PBGL based on Active Pebbles, running on AM++, soon! (Ask if you'd like access to a pre-release, very alpha copy.)
ngedmond@cs.indiana.edu
More information