Parallel Programming. Easy Cases: Data Parallelism

Parallel Programming

The preferred parallel algorithm is generally different from the preferred sequential algorithm. Compilers cannot transform a sequential algorithm into a parallel one with adequate consistency, so legacy code must be rewritten to use parallelism. Your knowledge of sequential algorithms is not that useful for parallel programming. There is no silver bullet.

Easy Cases: Data Parallelism

When the iteration body for a given index is independent of all the others, the iterations can be performed in parallel:

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; i++) {
            C[i] = A[i] + B[i];
        }
    }

The standard programming abstraction would be

    for_all i in [0..length-1] { C[i] = A[i] + B[i]; }
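The for_all notation above is an abstraction, not standard C. As a minimal sketch of how this data-parallel loop might be expressed in practice, here is an OpenMP version (the pragma-based approach and the function name array_add_parallel are illustrative assumptions, not from the slides):

    #include <omp.h>

    void array_add_parallel(int A[], int B[], int C[], int length) {
        /* every iteration is independent, so the runtime may split
           the index range across threads in any order */
        #pragma omp parallel for
        for (int i = 0; i < length; i++) {
            C[i] = A[i] + B[i];
        }
    }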

Is it always that easy? Not always. A more challenging example:

    unsigned sum_array(unsigned *array, int length) {
        unsigned total = 0;
        for (int i = 0; i < length; i++) {
            total += array[i];
        }
        return total;
    }

Is there parallelism here? Every iteration adds into total, so the loop carries a dependence. We first need to restructure the code so that four partial sums are accumulated independently:

    unsigned sum_array2(unsigned *array, int length) {
        unsigned total, i;
        unsigned temp[4] = {0, 0, 0, 0};
        for (i = 0; i < (length & ~0x3); i += 4) {
            temp[0] += array[i];
            temp[1] += array[i + 1];
            temp[2] += array[i + 2];
            temp[3] += array[i + 3];
        }
        total = temp[0] + temp[1] + temp[2] + temp[3];
        for (; i < length; i++)      /* leftover elements when length % 4 != 0 */
            total += array[i];
        return total;
    }

Then generate SIMD code for the hot part (the inner loop of sum_array2 above). SIMD == Single Instruction, Multiple Data.

Intel SSE/SSE2 as an example of SIMD. SSE == x86 Streaming SIMD Extensions. It added new 128-bit registers (XMM0 through XMM7), each of which can store:

    4 single-precision FP values (SSE)     4 * 32b
    2 double-precision FP values (SSE2)    2 * 64b
    16 byte values (SSE2)                  16 * 8b
    8 word values (SSE2)                   8 * 16b
    4 double-word values (SSE2)            4 * 32b
    1 128-bit integer value (SSE2)         1 * 128b
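The slides do not show the generated SIMD instructions, but as a hedged sketch, the hot loop of sum_array2 might be written by hand with SSE2 intrinsics roughly as follows (the name sum_array_sse2 and the use of unaligned loads are assumptions):

    #include <emmintrin.h>   /* SSE2 intrinsics */

    unsigned sum_array_sse2(const unsigned *array, int length) {
        __m128i vtotal = _mm_setzero_si128();   /* four packed 32-bit partial sums */
        int i;
        for (i = 0; i < (length & ~0x3); i += 4) {
            /* one load and one add handle four elements at a time */
            __m128i v = _mm_loadu_si128((const __m128i *)&array[i]);
            vtotal = _mm_add_epi32(vtotal, v);
        }
        unsigned temp[4];
        _mm_storeu_si128((__m128i *)temp, vtotal);
        unsigned total = temp[0] + temp[1] + temp[2] + temp[3];
        for (; i < length; i++)                  /* leftover elements */
            total += array[i];
        return total;
    }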

Many Computations Have Dependences: Aggregate or Reduction Operations

Consider sum_array above: every iteration accumulates into total, a loop-carried dependence. [Figure: a reduction tree combining the elements pairwise into the total.] The standard abstraction is

    total = sum(array);

which allows a parallel solution.

Overcoming Sequential Control

Many computations on a data sequence seem to be essentially sequential. Prefix sum is an example: for n inputs, the i-th output is the sum of the first i items.

    Input:  2 1 5  3  7
    Output: 2 3 8 11 18

Given x_1, x_2, ..., x_n, find y_1, y_2, ..., y_n such that y_i = Σ_{j ≤ i} x_j.

Sequential Computation

Consider computing the prefix sums sequentially, in place:

    for (i = 1; i < n + 1; i++) {
        A[i] += A[i-1];
    }

Semantics (with 1-based indexing):

    A[1] is unchanged
    A[2] = A[2] + A[1]
    A[3] = A[3] + (A[2] + A[1])
    ...
    A[n] = A[n] + (A[n-1] + (... + (A[2] + A[1]) ... ))

A[i] is the sum of the first i elements. What advantage can parallelism give?

Illustrating the Semantics

The computation of the items can be described pictorially. [Figure: a chain of additions, one per element.] The picture illustrates the dependences of the sequential code. Addition, of course, is associative.

Restructuring the Computation

Express the computation as a tree. [Figure: the same additions arranged as a balanced binary tree over the eight inputs.] The dependence chain is shallower, and therefore faster: for 8 elements, the sequential chain has depth 7 while the tree has depth 3. The operation count is unchanged: 7 additions either way.
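A minimal C sketch of this tree-shaped reduction, assuming n is a power of two (the name tree_sum is illustrative); within each of the log2(n) passes the additions are mutually independent, so each pass could run in parallel:

    unsigned tree_sum(unsigned *a, int n) {
        /* combine pairs at increasing strides: depth log2(n), n-1 additions total */
        for (int stride = 1; stride < n; stride *= 2)
            for (int i = 0; i + stride < n; i += 2 * stride)
                a[i] += a[i + stride];   /* independent within a pass */
        return a[0];
    }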

Naïve Use of Parallelism

For any y_i, a tree of height log i finds that prefix. But computing all n prefixes this way involves much redundant computation and requires O(n²) parallelism for the n prefixes. It may be parallel, but it is unrealistic.

Look closer at the meaning of the tree's intermediate sums. [Figure: the reduction tree with its internal nodes labeled.] Each root summarizes its leaves: every internal node already holds the sum of the elements beneath it.

Speeding Up Prefix Calculations

Putting the observations together: one pass over the data computes the global sum, saving the intermediate values; a second pass over the data uses those intermediate sums to compute the prefixes. Each pass will be logarithmic for n = P. The solution is called the parallel prefix algorithm.

Parallel Prefix Algorithm

Compute the sums going up; figure the prefixes going down. Introduce a virtual parent of the root holding 0, the sum of the values to the tree's left. Invariant: a parent's value is the sum of the elements to the left of its subtree.

[Figures: a sequence of tree diagrams showing the down-sweep. Each left child inherits its parent's value; each right child receives its parent's value plus its left sibling's subtree sum; at a leaf, adding the leaf's own value to the inherited left-sum yields its prefix.]

Each prefix is computed in 2 log n time, if P = n.
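A sequential C sketch of the two-pass algorithm over an implicit binary tree, assuming n is a power of two (the array layout and the name parallel_prefix are assumptions; in a parallel implementation each tree level's nodes would be processed concurrently):

    void parallel_prefix(const unsigned *x, unsigned *y, int n) {
        unsigned tree[2 * n];   /* tree[1] is the root; node k has children 2k, 2k+1 */
        unsigned left[2 * n];   /* left[k] = sum of elements to the left of k's subtree */
        for (int i = 0; i < n; i++)
            tree[n + i] = x[i];                        /* leaves sit at tree[n .. 2n-1] */
        /* pass 1, up-sweep: compute and save the subtree sums */
        for (int k = n - 1; k >= 1; k--)
            tree[k] = tree[2 * k] + tree[2 * k + 1];
        /* pass 2, down-sweep: the virtual parent of the root holds 0 */
        left[1] = 0;
        for (int k = 1; k < n; k++) {
            left[2 * k] = left[k];                     /* left child inherits */
            left[2 * k + 1] = left[k] + tree[2 * k];   /* right child adds its left sibling */
        }
        for (int i = 0; i < n; i++)
            y[i] = left[n + i] + x[i];                 /* inclusive prefix sums */
    }

For x = {2, 1, 5, 3} and n = 4 this yields y = {2, 3, 8, 11}, matching the first four outputs of the earlier example.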

Fundamental Tool of Parallel Programming

The original research on the parallel prefix algorithm was published by R. E. Ladner and M. J. Fischer, "Parallel Prefix Computation," Journal of the ACM 27(4):831-838, 1980. The Ladner-Fischer algorithm requires 2 log n time, twice as much as a simple tournament global sum, but not linear time. It applies to a wide class of operations.

Available Prefix Operators

Most languages have reduce and scan (parallel prefix) built in for +, *, min, max, &&, and ||. A few languages allow users to define prefix operations themselves, which makes parallel prefix MUCH more useful. Among the problems user-defined scans and reductions solve (an example follows the list):

    Length of Longest Run of x
    Length of Longest Increasing Run
    Number of Occurrences of x
    Binary String Space Compression
    Histogram
    Run Length Encoding
    Mode and Average
    Balanced Parentheses
    Count Words
    Skyline
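As one concrete instance from the list above, counting words can be phrased as a data-parallel flagging step followed by a +-reduction; this sketch is illustrative only (the name count_words and the space-only word separator are assumptions):

    int count_words(const char *s, int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            /* flag positions where a word starts; every flag is computable
               independently, so this step is data parallel */
            int start = (s[i] != ' ') && (i == 0 || s[i - 1] == ' ');
            total += start;   /* the combine step is an ordinary +-reduction */
        }
        return total;
    }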

Summary

Sequential computation is a special case of parallel computation (P == 1). Generalizing from sequential computations usually arrives at the wrong solution; rethinking the problem to develop a parallel algorithm is the only real solution. It's a good time to start acquiring parallel knowledge.