Lecture 11: Parallel Computation Patterns - Reduction Trees


ECE408 Applied Parallel Programming
Lecture 11: Parallel Computation Patterns - Reduction Trees

Objective
To master reduction trees, arguably the most widely used parallel computation pattern: the basic concept, the class of computations it covers, memory coalescing, control divergence, and thread utilization.

Partition and Summarize
A commonly used strategy for processing large input data sets when there is no required order for processing the elements (the reduction operation is associative and commutative):
Partition the data set into smaller chunks.
Have each thread process a chunk.
Use a reduction tree to summarize the results from each chunk into the final answer.
We will focus on the reduction tree step for now; a sketch of the partition step follows below. The Google and Hadoop MapReduce frameworks are examples of this pattern.
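
A minimal sketch of the partition step, with illustrative names (partialSums, threadPartial, chunkSize) that are not from the slides: each thread sequentially reduces its own contiguous chunk, and the per-thread partials are then combined by the reduction tree discussed in the rest of the lecture.

// Sketch only: each thread sums one contiguous chunk of the input.
__global__ void partialSums(const float *in, float *threadPartial,
                            unsigned int n, unsigned int chunkSize) {
    unsigned int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int begin = tid * chunkSize;
    unsigned int end   = min(begin + chunkSize, n);

    float sum = 0.0f;                  // identity value for a sum reduction
    for (unsigned int i = begin; i < end; ++i)
        sum += in[i];                  // order within a chunk does not matter
    threadPartial[tid] = sum;          // combined later by a reduction tree
}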

Reduction enables other techniques
Reduction is also needed to clean up after some commonly used parallelizing transformations, such as privatization:
Multiple threads write into an output location.
Replicate the output location so that each thread has a private output location.
Use a reduction tree to combine the values of the private locations into the original output location.
A minimal sketch of this transformation follows below.
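
A minimal sketch of privatization, assuming illustrative names (privateAccumulate, privateOut) that are not from the slides: instead of all threads atomically updating one shared output location, each thread writes its contribution to a private slot, and a reduction tree later combines the slots.

// Without privatization every thread would contend on one location:
//     atomicAdd(&out[0], in[tid] * in[tid]);
// With privatization each thread writes to its own slot; a reduction tree
// over privateOut then produces the final sum.
__global__ void privateAccumulate(const float *in, float *privateOut,
                                  unsigned int n) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        privateOut[tid] = in[tid] * in[tid];   // private output location
}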

What is a reduction computation?
Summarize a set of input values into one value using a reduction operation: max, min, sum, product, or a user-defined reduction function, as long as the operation is associative and commutative and has a well-defined identity value (e.g., 0 for sum).

An efficient sequential reduction algorithm performs N operations: O(N)
Initialize the result to the identity value for the reduction operation: the smallest possible value for max, the largest possible value for min, 0 for sum, 1 for product.
Scan through the input and perform the reduction operation between the result value and the current input value.
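
A minimal host-side sketch of this sequential algorithm for a sum reduction (illustrative function name):

// Sequential sum reduction: N operations, O(N).
float sequentialSum(const float *in, unsigned int n) {
    float result = 0.0f;               // identity value for sum
    for (unsigned int i = 0; i < n; ++i)
        result += in[i];               // reduce result with the current input
    return result;
}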

A parallel reduction tree algorithm performs N-1 operations in log(N) steps
Example with max as the reduction operation:
Input:  3 1 7 0 4 1 6 3
Step 1: 3 7 4 6
Step 2: 7 6
Step 3: 7

A tournament is a reduction tree with the max operation.
[Figure: a tournament bracket, a more artful rendition of the reduction tree.]

A Quick Analysis
For N input values, the reduction tree performs (1/2)N + (1/4)N + (1/8)N + ... + (1/N)N = (1 - 1/N)N = N-1 operations in log(N) steps.
1,000,000 input values take about 20 steps, assuming we have enough execution resources.
Average parallelism is (N-1)/log(N); for N = 1,000,000 that is about 50,000, but the peak resource requirement is 500,000 operations in the first step!
This is a work-efficient parallel algorithm: the amount of work done is comparable to the sequential algorithm. Many parallel algorithms are not work-efficient.

A Sum Reduction Example
Parallel implementation: recursively halve the number of threads, with each thread adding two values in each step. This takes log(n) steps for n elements and requires n/2 threads.
Assume an in-place reduction using shared memory:
The original vector is in device global memory.
The shared memory is used to hold a partial sum vector.
Each step brings the partial sum vector closer to the final sum.
The final sum will be in element 0.
This reduces global memory traffic due to partial sum values.

Vector Reduction with Branch Divergence
Example with data elements 0-11 and threads 0-5 (partial sum elements in shared memory):
Step 1: each thread adds an adjacent pair: 0+1, 2+3, 4+5, 6+7, 8+9, 10+11.
Step 2: the surviving threads combine partial sums 0..3, 4..7, 8..11.
Step 3: partial sums 0..7 and 8..15 are combined.
After each step, half of the remaining threads become idle, causing branch divergence within warps.

A Sum Example
Threads 0-3 reduce the data 3 1 7 0 4 1 6 3 in three steps (number of active partial sum elements in parentheses):
Step 1 (4 active): 4 7 5 9
Step 2 (2 active): 11 14
Step 3 (1 active): 25

Simple Thread Index to Data Mapping
Each thread is responsible for an even-index location of the partial sum vector; one of its inputs is its location of responsibility.
After each step, half of the threads are no longer needed.
In each step, the other input comes from an increasing distance away.
A sketch of the step loop implied by this mapping follows below.
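
A sketch of the step loop this mapping implies, assuming the shared-memory partial sum vector partialSum[] and thread index t = threadIdx.x introduced on the next slide:

// Thread t owns even index 2*t; its second input comes from a stride
// that doubles every step. Half of the threads drop out after each step.
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
    __syncthreads();                        // wait for the previous step to finish
    if (t % stride == 0)
        partialSum[2*t] += partialSum[2*t + stride];
}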

A Simple Thread Block Design
Each thread block takes 2*blockDim.x input elements.
Each thread loads 2 elements into shared memory:

__shared__ float partialSum[2*BLOCK_SIZE];
unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;
partialSum[t] = input[start + t];
partialSum[blockDim.x + t] = input[start + blockDim.x + t];
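
Putting the pieces together, a hedged sketch of how a block can finish: after the loads above and the step loop sketched earlier, element 0 of the shared partial sum vector holds the block's result. The per-block output array name is an assumption, not from the slides.

// After the step loop, partialSum[0] holds this block's sum.
if (t == 0)
    output[blockIdx.x] = partialSum[0];     // one partial sum per block
// The gridDim.x per-block partial sums are then reduced again, e.g. by
// running the same kernel on "output" or by summing them on the host.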

ANY MORE QUESTIONS?