Structured Parallel Programming


Structured Parallel Programming
Patterns for Efficient Computation

Michael McCool
Arch D. Robison
James Reinders

ELSEVIER
Amsterdam, Boston, Heidelberg, London, New York, Oxford, Paris, San Diego, San Francisco, Singapore, Sydney, Tokyo

Morgan Kaufmann Publishers is an imprint of Elsevier

Contents

Listings
Preface
Preliminaries

CHAPTER 1 Introduction
1.1 Think Parallel
1.2 Performance
1.3 Motivation: Pervasive Parallelism
  1.3.1 Hardware Trends Encouraging Parallelism
  1.3.2 Observed Historical Trends in Parallelism
  1.3.3 Need for Explicit Parallel Programming
1.4 Structured Pattern-Based Programming
1.5 Parallel Programming Models
  1.5.1 Desired Properties
  1.5.2 Abstractions Instead of Mechanisms
  1.5.3 Expression of Regular Data Parallelism
  1.5.4 Composability
  1.5.5 Portability of Functionality
  1.5.6 Performance Portability
  1.5.7 Safety, Determinism, and Maintainability
  1.5.8 Overview of Programming Models Used
  1.5.9 When to Use Which Model?
1.6 Organization of this Book
1.7 Summary

CHAPTER 2 Background
2.1 Vocabulary and Notation
2.2 Strategies
2.3 Mechanisms
2.4 Machine Models
  2.4.1 Machine Model
  2.4.2 Key Features for Performance
  2.4.3 Flynn's Characterization
  2.4.4 Evolution
2.5 Performance Theory
  2.5.1 Latency and Throughput
  2.5.2 Speedup, Efficiency, and Scalability
  2.5.3 Power
  2.5.4 Amdahl's Law
  2.5.5 Gustafson-Barsis' Law
  2.5.6 Work-Span Model
  2.5.7 Asymptotic Complexity
  2.5.8 Asymptotic Speedup and Efficiency
  2.5.9 Little's Formula
2.6 Pitfalls
  2.6.1 Race Conditions
  2.6.2 Mutual Exclusion and Locks
  2.6.3 Deadlock
  2.6.4 Strangled Scaling
  2.6.5 Lack of Locality
  2.6.6 Load Imbalance
  2.6.7 Overhead
2.7 Summary

PART I PATTERNS

CHAPTER 3 Patterns
3.1 Nesting Pattern
3.2 Structured Serial Control Flow Patterns
  3.2.1 Sequence
  3.2.2 Selection
  3.2.3 Iteration
  3.2.4 Recursion
3.3 Parallel Control Patterns
  3.3.1 Fork-Join
  3.3.2 Map
  3.3.3 Stencil
  3.3.4 Reduction
  3.3.5 Scan
  3.3.6 Recurrence
3.4 Serial Data Management Patterns
  3.4.1 Random Read and Write
  3.4.2 Stack Allocation
  3.4.3 Heap Allocation
  3.4.4 Closures
  3.4.5 Objects
3.5 Parallel Data Management Patterns
  3.5.1 Pack
  3.5.2 Pipeline
  3.5.3 Geometric Decomposition
  3.5.4 Gather
  3.5.5 Scatter
3.6 Other Parallel Patterns
  3.6.1 Superscalar Sequences
  3.6.2 Futures
  3.6.3 Speculative Selection
  3.6.4 Workpile
  3.6.5 Search
  3.6.6 Segmentation
  3.6.7 Expand
  3.6.8 Category Reduction
  3.6.9 Term Graph Rewriting
3.7 Non-Deterministic Patterns
  3.7.1 Branch and Bound
  3.7.2 Transactions
3.8 Programming Model Support for Patterns
  3.8.1 Cilk Plus
  3.8.2 Threading Building Blocks
  3.8.3 OpenMP
  3.8.4 Array Building Blocks
  3.8.5 OpenCL
3.9 Summary

CHAPTER 4 Map
4.1 Map
4.2 Scaled Vector Addition (SAXPY)
  4.2.1 Description of the Problem
  4.2.2 Serial Implementation
  4.2.3 TBB
  4.2.4 Cilk Plus
  4.2.5 Cilk Plus with Array Notation
  4.2.6 OpenMP
  4.2.7 ArBB Using Vector Operations
  4.2.8 ArBB Using Elemental Functions
  4.2.9 OpenCL
4.3 Mandelbrot
  4.3.1 Description of the Problem
  4.3.2 Serial Implementation
  4.3.3 TBB
  4.3.4 Cilk Plus
  4.3.5 Cilk Plus with Array Notations
  4.3.6 OpenMP
  4.3.7 ArBB
  4.3.8 OpenCL
4.4 Sequence of Maps versus Map of Sequence
4.5 Comparison of Parallel Models
4.6 Related Patterns
  4.6.1 Stencil
  4.6.2 Workpile
  4.6.3 Divide-and-conquer
4.7 Summary

CHAPTER 5 Collectives
5.1 Reduce
  5.1.1 Reordering Computations
  5.1.2 Vectorization
  5.1.3 Tiling
  5.1.4 Precision
  5.1.5 Implementation
5.2 Fusing Map and Reduce
  5.2.1 Explicit Fusion in TBB
  5.2.2 Explicit Fusion in Cilk Plus
  5.2.3 Automatic Fusion in ArBB
5.3 Dot Product
  5.3.1 Description of the Problem
  5.3.2 Serial Implementation
  5.3.3 SSE Intrinsics
  5.3.4 TBB
  5.3.5 Cilk Plus
  5.3.6 OpenMP
  5.3.7 ArBB
5.4 Scan
  5.4.1 Cilk Plus
  5.4.2 TBB
  5.4.3 ArBB
  5.4.4 OpenMP
5.5 Fusing Map and Scan
5.6 Integration
  5.6.1 Description of the Problem
  5.6.2 Serial Implementation
  5.6.3 Cilk Plus
  5.6.4 OpenMP
  5.6.5 TBB
  5.6.6 ArBB
5.7 Summary

CHAPTER 6 Data Reorganization
6.1 Gather
  6.1.1 General Gather
  6.1.2 Shift
  6.1.3 Zip
6.2 Scatter
  6.2.1 Atomic Scatter
  6.2.2 Permutation Scatter
  6.2.3 Merge Scatter
  6.2.4 Priority Scatter
6.3 Converting Scatter to Gather
6.4 Pack
6.5 Fusing Map and Pack
6.6 Geometric Decomposition and Partition
6.7 Array of Structures vs. Structures of Arrays
6.8 Summary

CHAPTER 7 Stencil and Recurrence
7.1 Stencil
7.2 Implementing Stencil with Shift
7.3 Tiling Stencils for Cache
7.4 Optimizing Stencils for Communication
7.5 Recurrence
7.6 Summary

CHAPTER 8 Fork-Join
8.1 Definition
8.2 Programming Model Support for Fork-Join
  8.2.1 Cilk Plus Support for Fork-Join
  8.2.2 TBB Support for Fork-Join
  8.2.3 OpenMP Support for Fork-Join
8.3 Recursive Implementation of Map
8.4 Choosing Base Cases
8.5 Load Balancing
8.6 Complexity of Parallel Divide-and-Conquer
8.7 Karatsuba Multiplication of Polynomials
  8.7.1 Note on Allocating Scratch Space
8.8 Cache Locality and Cache-Oblivious Algorithms
8.9 Quicksort
  8.9.1 Cilk Quicksort
  8.9.2 TBB Quicksort
  8.9.3 Work and Span for Quicksort
8.10 Reductions and Hyperobjects
8.11 Implementing Scan with Fork-Join
8.12 Applying Fork-Join to Recurrences
  8.12.1 Analysis
  8.12.2 Flat Fork-Join
8.13 Summary

CHAPTER 9 Pipeline
9.1 Basic Pipeline
9.2 Pipeline with Parallel Stages
9.3 Implementation of a Pipeline
9.4 Programming Model Support for Pipelines
  9.4.1 Pipeline in TBB
  9.4.2 Pipeline in Cilk Plus
9.5 More General Topologies
9.6 Mandatory versus Optional Parallelism
9.7 Summary

PART II EXAMPLES

CHAPTER 10 Forward Seismic Simulation
10.1 Background
10.2 Stencil Computation
10.3 Impact of Caches on Arithmetic Intensity
10.4 Raising Arithmetic Intensity with Space-Time Tiling
10.5 Cilk Plus Code
10.6 ArBB Implementation
10.7 Summary

CHAPTER 11 K-Means Clustering
11.1 Algorithm
11.2 K-Means with Cilk Plus
  11.2.1 Hyperobjects
11.3 K-Means with TBB
11.4 Summary

CHAPTER 12 Bzip2 Data Compression
12.1 The Bzip2 Algorithm
12.2 Three-Stage Pipeline Using TBB
12.3 Four-Stage Pipeline Using TBB
12.4 Three-Stage Pipeline Using Cilk Plus
12.5 Summary

CHAPTER 13 Merge Sort
13.1 Parallel Merge
  13.1.1 TBB Parallel Merge
  13.1.2 Work and Span of Parallel Merge
13.2 Parallel Merge Sort
  13.2.1 Work and Span of Merge Sort
13.3 Summary

CHAPTER 14 Sample Sort
14.1 Overall Structure
14.2 Choosing the Number of Bins
14.3 Binning
  14.3.1 TBB Implementation
14.4 Repacking and Subsorting
14.5 Performance Analysis of Sample Sort
14.6 For C++ Experts
14.7 Summary

CHAPTER 15 Cholesky Factorization
15.1 Fortran Rules!
15.2 Recursive Cholesky Decomposition
15.3 Triangular Solve
15.4 Symmetric Rank Update
15.5 Where Is the Time Spent?
15.6 Summary

APPENDICES

APPENDIX A Further Reading
A.1 Parallel Algorithms and Patterns
A.2 Computer Architecture Including Parallel Systems
A.3 Parallel Programming Models

APPENDIX B Cilk Plus
B.1 Shared Design Principles with TBB
B.2 Unique Characteristics
B.3 Borrowing Components from TBB
B.4 Keyword Spelling
B.5 cilk_for
B.6 cilk_spawn and cilk_sync
B.7 Reducers (Hyperobjects)
  B.7.1 C++ Syntax
  B.7.2 C Syntax
B.8 Array Notation
  B.8.1 Specifying Array Sections
  B.8.2 Operations on Array Sections
  B.8.3 Reductions on Array Sections
  B.8.4 Implicit Index
  B.8.5 Avoid Partial Overlap of Array Sections
B.9 #pragma simd
B.10 Elemental Functions
  B.10.1 Attribute Syntax
B.11 Note on C++11
B.12 Notes on Setup
B.13 History
B.14 Summary

APPENDIX C TBB
C.1 Unique Characteristics
C.2 Using TBB
C.3 parallel_for
  C.3.1 blocked_range
  C.3.2 Partitioners
C.4 parallel_reduce
C.5 parallel_deterministic_reduce
C.6 parallel_pipeline
C.7 parallel_invoke
C.8 task_group
C.9 task
  C.9.1 empty_task
C.10 atomic
C.11 enumerable_thread_specific
C.12 Notes on C++11
C.13 History
C.14 Summary

APPENDIX D C++11
D.1 Declaring with auto
D.2 Lambda Expressions
D.3 std::move

APPENDIX E Glossary

Bibliography
Index