EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 14 Parallelism in Software V


EE382N (20): Computer Architecture - Parallelism and Locality, Fall 2011
Lecture 14: Parallelism in Software V
Mattan Erez, The University of Texas at Austin
EE382N: Parallelism and Locality, Fall 2011 -- Lecture 14 (c) Rodric Rabbah, Mattan Erez

Quick recap
Decomposition: high-level and fairly abstract; consider machine scale for the most part; task, data, pipeline; find dependencies.
Algorithm structure: still abstract, but a bit less so; consider communication, sync, and bookkeeping; task (collection/recursive), data (geometric/recursive), dataflow (pipeline/event-based coordination).
Supporting structures: loop, master/worker, fork/join, SPMD, MapReduce.

Algorithm Structure and Organization (my view)
Fit between supporting structures and algorithm structures (columns: task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination; rated * to ****):
SPMD: ****, **, ****, **, ****, *
Loop Parallelism: **** (when no dependencies), *, ****, *, **** (SWP to hide comm.)
Master/Worker: ****, ***, ***, ***, **, ****
Fork/Join: ****, ****, **, ****, *
Patterns can be hierarchically composed so that a program uses more than one pattern.

Patterns for Parallelizing Programs: 4 Design Spaces
Algorithm expression:
Finding Concurrency -- expose concurrent tasks.
Algorithm Structure -- map tasks to processes to exploit the parallel architecture.
Software construction:
Supporting Structures -- code and data structuring patterns.
Implementation Mechanisms -- low-level mechanisms used to write parallel programs.
From Patterns for Parallel Programming, Mattson, Sanders, and Massingill (2005).

ILP, DLP, and TLP in SW and HW
In hardware: ILP -- OOO, dataflow, VLIW; DLP -- SIMD, vector; TLP -- essentially multiple cores with multiple sequencers.
In software: ILP -- within straight-line code; DLP -- parallel loops, tasks operating on disjoint data, no dependencies within a parallelism phase; TLP -- all of DLP plus producer-consumer chains.

ILP, DLP, and TLP and Supporting Patterns
Task parallelism: ILP -- inline/unroll; DLP -- natural or local-conditions; TLP -- natural.
Divide and conquer: ILP -- inline; DLP -- after enough divisions; TLP -- natural.
Geometric decomposition: ILP -- unroll; DLP -- natural; TLP -- natural.
Recursive data: ILP -- inline; DLP -- after enough branches; TLP -- natural.
Pipeline: ILP -- inline/unroll; DLP -- difficult; TLP -- natural.
Event-based coordination: ILP -- inline; DLP -- local-conditions; TLP -- natural.


ILP, DLP, and TLP and Implementation Patterns
SPMD: ILP -- pipeline; DLP -- natural or local-conditional; TLP -- natural.
Loop Parallelism: ILP -- unroll; DLP -- natural; TLP -- natural.
Master/Worker: ILP -- inline; DLP -- local-conditional; TLP -- natural.
Fork/Join: ILP -- inline; DLP -- after enough divisions + local-conditional; TLP -- natural.


Outline
Molecular dynamics example: problem description; steps to solution -- build data structures, compute forces, integrate for new positions, check the global solution, repeat.
Finding concurrency: scans, data decomposition, reductions.
Algorithm structure.
Supporting structures.

Credits
Parallel scan slides courtesy David Kirk (NVIDIA) and Wen-mei Hwu (UIUC), taken from EE493-AI taught at UIUC in Spring 2007.
Reduction slides courtesy Dr. Rodric Rabbah (IBM), taken from 6.189 IAP taught at MIT in 2007.

GROMACS
Highly optimized molecular-dynamics package. Popular code. Specifically tuned for protein folding. Hand-optimized loops for SSE3 (and other extensions).

GROMACS Components
Non-bonded forces: water-water with cutoff; protein-protein tabulated; water-water tabulated; protein-water tabulated.
Bonded forces: angles; dihedrals.
Boundary conditions.
Verlet integrator.
Constraints: SHAKE; SETTLE.
Other: temperature/pressure coupling; virial calculation.

GROMACS Water-Water Force Calculation
Non-bonded long-range interactions: Coulomb and Lennard-Jones; 234 operations per interaction.
[Figure: two water molecules, showing the O-O Lennard-Jones interaction and the electrostatic interactions between partial charges.]
The water-water interaction is ~75% of GROMACS run-time.


GROMACS Uses a Non-trivial Neighbor-list Algorithm
Full non-bonded force calculation is O(n^2). GROMACS approximates with a cutoff: molecules located more than r_c apart do not interact, giving O(n * r_c^3).
Separate neighbor-list for each molecule; neighbor-lists have a variable number of elements.
[Figure: a central molecule and its neighbor molecules within the cutoff radius r_c.]
The efficient algorithm leads to variable-rate input streams.
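To make the variable-rate structure concrete, here is a minimal CUDA-style sketch (not GROMACS code) in which one thread handles one central molecule and walks its own variable-length neighbor list stored in a CSR-like layout. All names (pos, nbr_start, nbr_idx, force) are illustrative, and only a simplified Coulomb-like term is shown; Lennard-Jones and physical constants are omitted.

```cuda
// Hypothetical sketch: per-molecule neighbor lists of variable length.
__global__ void nonbonded_forces(const float4 *pos,        // x, y, z, charge
                                 const int   *nbr_start,   // CSR offsets, length n+1
                                 const int   *nbr_idx,     // concatenated neighbor indices
                                 float4      *force,       // accumulated force per molecule
                                 int n, float cutoff2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float3 fi = make_float3(0.f, 0.f, 0.f);

    // Variable-rate input: each molecule has a different number of neighbors.
    for (int k = nbr_start[i]; k < nbr_start[i + 1]; ++k) {
        float4 pj = pos[nbr_idx[k]];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 > cutoff2) continue;                       // beyond r_c: no interaction
        float inv_r2 = 1.f / r2;
        float coul = pi.w * pj.w * sqrtf(inv_r2) * inv_r2; // ~ q_i q_j / r^3
        fi.x += coul * dx; fi.y += coul * dy; fi.z += coul * dz;
    }
    force[i] = make_float4(fi.x, fi.y, fi.z, 0.f);
}
```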

Parallel Prefix Sum (Scan)
Definition: the all-prefix-sums operation takes a binary associative operator ⊕ with identity I, and an array of n elements [a_0, a_1, ..., a_{n-1}], and returns the ordered set [I, a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-2})].
Example: if ⊕ is addition, then scan on the set [3 1 7 0 4 1 6 3] returns the set [0 3 4 11 11 15 16 22].
Exclusive scan: the last input element is not included in the result.
(From Blelloch, 1990, Prefix Sums and Their Applications.)
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign.

Applications of Scan
Scan is a simple and useful parallel building block. It converts recurrences from sequential:
    for (j = 1; j < n; j++) out[j] = out[j-1] + f(j);
into parallel:
    forall (j) { temp[j] = f(j); }
    scan(out, temp);
Useful for many parallel algorithms: radix sort, quicksort, string comparison, lexical analysis, stream compaction, polynomial evaluation, solving recurrences, tree operations, building data structures, etc.
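As a concrete, hypothetical version of the slide's recurrence-to-scan conversion, the sketch below fills temp[j] = f(j) in parallel and then performs the scan with Thrust's exclusive_scan; f_op is a stand-in for the slide's f(j), and the result follows the exclusive-scan convention (out[0] = 0, out[j] = temp[0] + ... + temp[j-1]).

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

struct f_op {
    __host__ __device__ float operator()(int j) const { return 0.5f * j; } // example f(j)
};

void recurrence_via_scan(thrust::device_vector<float>& out, int n)
{
    // Sequential reference (from the slide):
    //   for (j = 1; j < n; j++) out[j] = out[j-1] + f(j);
    out.resize(n);
    thrust::device_vector<int>   idx(n);
    thrust::device_vector<float> temp(n);
    thrust::sequence(idx.begin(), idx.end());                        // idx[j] = j    (parallel)
    thrust::transform(idx.begin(), idx.end(), temp.begin(), f_op()); // temp[j] = f(j) (parallel)
    thrust::exclusive_scan(temp.begin(), temp.end(), out.begin());   // out = scan(temp)
}
```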

Building Data Structures with Scans
Fun on the board.

Scan on a serial CPU

void scan(float* scanned, float* input, int length)
{
    scanned[0] = 0;
    for (int i = 1; i < length; ++i) {
        scanned[i] = input[i-1] + scanned[i-1];
    }
}

Just add each element to the sum of the elements before it. Trivial, but sequential. Exactly n adds: optimal.

A First-Attempt Parallel Scan Algorithm
In: 3 1 7 0 4 1 6 3
T0: 0 3 1 7 0 4 1 6
Each UE reads one value from the input array in device memory into shared memory array T0; UE 0 writes 0 into shared memory.
1. Read input to shared memory. Set the first element to zero and shift the others right by one.

A First-Attempt Parallel Scan Algorithm (iteration 1, stride 1)
In: 3 1 7 0 4 1 6 3
T0: 0 3 1 7 0 4 1 6
T1: 0 3 4 8 7 4 5 7
2. Iterate log(n) times: UEs stride to n: add pairs of elements stride elements apart; double the stride at each iteration. (Note: must double-buffer the shared memory arrays.)
Iteration #1, stride = 1. Active UEs: stride to n-1 (n - stride UEs). UE j adds elements j and j - stride from T0 and writes the result into shared memory buffer T1 (ping-pong).

A First-Attempt Parallel Scan Algorithm (iteration 2, stride 2)
In: 3 1 7 0 4 1 6 3
T0: 0 3 1 7 0 4 1 6
T1: 0 3 4 8 7 4 5 7
T0: 0 3 4 11 11 12 12 11
Same step 2 as before -- add pairs of elements stride elements apart -- now with stride = 2.

A First-Attempt Parallel Scan Algorithm (iteration 3, stride 4)
In: 3 1 7 0 4 1 6 3
T0: 0 3 1 7 0 4 1 6
T1: 0 3 4 8 7 4 5 7
T0: 0 3 4 11 11 12 12 11
T1: 0 3 4 11 11 15 16 22
Same step 2 as before, now with stride = 4.

A First-Attempt Parallel Scan Algorithm (final)
In: 3 1 7 0 4 1 6 3
T0: 0 3 1 7 0 4 1 6
T1: 0 3 4 8 7 4 5 7
T0: 0 3 4 11 11 12 12 11
T1: 0 3 4 11 11 15 16 22
Out: 0 3 4 11 11 15 16 22
3. Write output.
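A minimal single-block CUDA sketch of this first-attempt scan, assuming n is a power of two and the array fits in one block; the double-buffered shared array plays the role of T0/T1 on the slides, and the kernel name and launch configuration are illustrative.

```cuda
__global__ void naive_scan(float *out, const float *in, int n)
{
    extern __shared__ float temp[];      // double buffer: 2*n floats
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    // Step 1: load input shifted right by one; first element becomes 0 (exclusive scan).
    temp[pout * n + tid] = (tid > 0) ? in[tid - 1] : 0.0f;
    __syncthreads();

    // Step 2: log2(n) iterations, doubling the stride; ping-pong between buffers.
    for (int stride = 1; stride < n; stride *= 2) {
        pout = 1 - pout;                 // swap buffers
        pin  = 1 - pout;
        if (tid >= stride)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - stride];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }

    // Step 3: write output.
    out[tid] = temp[pout * n + tid];
}
```

A possible launch for one block of n UEs: naive_scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);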

What is wrong with our first-attempt parallel scan?
Work efficiency: a parallel algorithm is work-efficient if it does the same amount of work as an optimal sequential algorithm.
This scan executes log(n) parallel iterations; the steps do n-1, n-2, n-4, ..., n/2 adds each, for a total of n * (log(n) - 1) + 1 adds: O(n*log(n)) work.
This scan algorithm is NOT work-efficient: the sequential scan algorithm does n adds, and a factor of log(n) hurts -- 20x for 10^6 elements!
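Spelling out the slide's count (assuming n is a power of two): the iteration with stride 2^k performs n - 2^k adds, so the total over the log2(n) iterations is

$$\sum_{k=0}^{\log_2 n - 1}\left(n - 2^k\right) \;=\; n\log_2 n - (n - 1) \;=\; n\,(\log_2 n - 1) + 1,$$

which is O(n log n), compared to the n adds of the sequential scan.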

Improving Efficiency
A common parallel algorithm pattern: balanced trees. Build a balanced binary tree on the input data and sweep it to and from the root. The tree is not an actual data structure, but a concept to determine what each UE does at each step.
For scan:
Traverse from the leaves to the root, building partial sums at internal nodes of the tree; the root holds the sum of all leaves.
Traverse back up the tree, building the scan from the partial sums.

Build the Sum Tree
T: 3 1 7 0 4 1 6 3
Assume the array is already in shared memory.

Build the Sum Tree (iteration 1, stride 1, n/2 UEs)
T: 3 1 7 0 4 1 6 3
T: 3 4 7 7 4 5 6 9
Each ⊕ corresponds to a single UE. Iterate log(n) times; each UE adds the value stride elements away to its own value.

Build the Sum Tree (iteration 2, stride 2, n/4 UEs)
T: 3 4 7 7 4 5 6 9
T: 3 4 7 11 4 5 6 14
Each ⊕ corresponds to a single UE. Iterate log(n) times; each UE adds the value stride elements away to its own value.

Build the Sum Tree (iteration log(n), stride 4, 1 UE)
T: 3 4 7 11 4 5 6 14
T: 3 4 7 11 4 5 6 25
Each ⊕ corresponds to a single UE. Iterate log(n) times; each UE adds the value stride elements away to its own value. Note that this algorithm operates in place: no need for double buffering.

Zero the Last Element
T: 3 4 7 11 4 5 6 0
We now have an array of partial sums. Since this is an exclusive scan, set the last element to zero. It will propagate back to the first element.

Build Scan From Partial Sums
T: 3 4 7 11 4 5 6 0

Build Scan From Partial Sums (iteration 1, stride 4, 1 UE)
T: 3 4 7 11 4 5 6 0
T: 3 4 7 0 4 5 6 11
Each ⊕ corresponds to a single UE. Iterate log(n) times; each UE adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value.

Build Scan From Partial Sums (iteration 2, stride 2, 2 UEs)
T: 3 4 7 0 4 5 6 11
T: 3 0 7 4 4 11 6 16
Each ⊕ corresponds to a single UE. Iterate log(n) times; each UE adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value.

Build Scan From Partial Sums (iteration log(n), stride 1, n/2 UEs)
T: 3 0 7 4 4 11 6 16
T: 0 3 4 11 11 15 16 22
Done! We now have a completed scan that we can write out to device memory.
Total steps: 2 * log(n). Total work: 2 * (n-1) adds = O(n). Work-efficient!
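A single-block CUDA sketch of the work-efficient scan just described (up-sweep to build the sum tree, zero the root, down-sweep to build the scan), in the spirit of the Blelloch scan. It assumes n is a power of two, uses n/2 threads with two elements per thread, and omits bank-conflict avoidance for clarity; names and launch configuration are illustrative.

```cuda
__global__ void work_efficient_scan(float *out, const float *in, int n)
{
    extern __shared__ float temp[];              // n floats
    int tid = threadIdx.x;

    // Load two elements per thread into shared memory.
    temp[2 * tid]     = in[2 * tid];
    temp[2 * tid + 1] = in[2 * tid + 1];

    // Up-sweep (reduce): build partial sums in place.
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    // Zero the last element (root of the sum tree) for the exclusive scan.
    if (tid == 0) temp[n - 1] = 0.0f;

    // Down-sweep: distribute partial sums back down the tree.
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    // Write results: two elements per thread.
    out[2 * tid]     = temp[2 * tid];
    out[2 * tid + 1] = temp[2 * tid + 1];
}
```

A possible launch: work_efficient_scan<<<1, n / 2, n * sizeof(float)>>>(d_out, d_in, n);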

Reductions
Many to one.
Many to many: simply multiple reductions; also known as scatter-add, and a subset of parallel prefix sums.
Uses: histograms, superposition, physical properties.

Serial Reduction
[Figure: sequential chain combining A[0], A[1], A[2], A[3] into A[0:1], A[0:2], A[0:3].]
Used when the reduction operator is not associative. Usually followed by a broadcast of the result.

Tree-based Reduction
[Figure: A[0] and A[1] combine into A[0:1], A[2] and A[3] combine into A[2:3], and these combine into A[0:3].]
n steps for 2^n units of execution. Used when the reduction operator is associative. Especially attractive when only one task needs the result.
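A minimal CUDA sketch of a tree-based sum reduction within one block, assuming blockDim.x is a power of two; the number of active UEs halves each step, and only thread 0 of each block writes the (per-block) result, matching the "only one task needs the result" case. Names are illustrative.

```cuda
__global__ void tree_reduce(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];             // blockDim.x floats
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Pairwise combine: halve the number of active UEs each step (log2(blockDim.x) steps).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Only one UE per block needs the result.
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```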

Recursive-doubling Reduction
[Figure: in each step every unit of execution combines with a partner, so after n steps all 2^n units hold A[0:3].]
n steps for 2^n units of execution. Used when all units of execution need the result of the reduction.

Recursive-doubling Reduction (continued)
Better than the tree-based approach with broadcast: each unit of execution has a copy of the reduced value at the end of n steps.
In the tree-based approach with broadcast, the reduction takes n steps, the broadcast cannot begin until the reduction is complete, and the broadcast can take n steps (architecture dependent).

Other Examples
More patterns: reductions; scans; building a data structure.
More examples: search; sort; FFT as divide and conquer; structured meshes and grids; sparse algebra; unstructured meshes and graphs; trees; collections; particles; rays.