Structured Parallel Programming with Deterministic Patterns

1 Structured Parallel Programming with Deterministic Patterns
May 14, 2010
USENIX HotPar 2010, Berkeley, California
Michael McCool, Software Architect, Ct Technology
Software and Services Group, Intel Corporation
Software & Services Group, Developer Products Division
*Other brands and names are the property of their respective owners.

2 Patterns
A parallel pattern is a commonly occurring combination of task distribution and data access.
Many common programming models support either only a small number of patterns, or only low-level hardware mechanisms.
  So common patterns are often implemented only as conventions.
Observation: a small number of patterns, most of them deterministic, can support a wide range of applications.
Thesis: a system that directly supports these deterministic patterns and allows their composition can generate efficient implementations on a variety of hardware architectures.

3 Motivation for Pattern-based Design
Deterministic patterns: higher maintainability
  No need to debug race conditions if it is not possible to create them
  Allow introduction of races only where necessary, and limit their scope
  Determinism and consistency with a single serial execution order simplify user understanding, debugging and testing
Application-oriented patterns: higher productivity
  Patterns derived from common use cases in applications
  A subset of patterns is universal: gives wide applicability
  Patterns can also target specific domains: makes simple things simple
Patterns encourage high-level reasoning
  Focus users on what really matters: parallelism and data locality
  Simplifies learning how to write efficient programs

4 Serial Patterns
The following patterns are the basis of structured programming for serial computation:
  Sequence, Selection, Iteration, Recursion, Random read, Random write, Stack allocation, Heap allocation, Objects/closures
Compositions of control flow patterns can be used in place of unstructured mechanisms such as goto.

5 Parallel Patterns
The following additional parallel patterns can be used for structured parallel programming:
  Superscalar sequence, Speculative selection, Map, Recurrence/scan, Reduce, Pack/expand, Nest, Pipeline, Partition, Stencil, Search/match, Gather, *Permutation scatter, *Merge scatter, !Atomic scatter, Priority scatter

6 Sequence
A serial sequence is executed in the exact order given:
  B = f(A); C = g(B); E = p(C); F = q(A);

7 Superscalar Sequence
Developer writes serial code:
  B = f(A); C = g(B); E = f(C); F = h(C); G = g(E,F); P = p(B); Q = q(B); R = r(G,P,Q);
However, tasks only need to be ordered by data dependencies.
Depends on limiting the scope of data dependencies.
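
As a rough sketch of the idea, the serial code above can be scheduled so that only data dependencies order the tasks. The bodies of f, g, h, p, q and r are made-up stand-ins (the slide does not define them), and a thread pool is just one possible way to run independent tasks.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task bodies; the slide only names them.
def f(x): return x + 1
def g(*xs): return sum(xs) * 2
def h(x): return x - 3
def p(x): return x * x
def q(x): return -x
def r(a, b, c): return a + b + c

def serial(A):
    B = f(A); C = g(B); E = f(C); F = h(C)
    G = g(E, F); P = p(B); Q = q(B)
    return r(G, P, Q)

def superscalar(A):
    with ThreadPoolExecutor() as pool:
        B = f(A)
        # P and Q depend only on B, so they can start right away.
        fut_P = pool.submit(p, B)
        fut_Q = pool.submit(q, B)
        C = g(B)
        # E and F depend only on C and are independent of each other.
        fut_E = pool.submit(f, C)
        fut_F = pool.submit(h, C)
        G = g(fut_E.result(), fut_F.result())
        return r(G, fut_P.result(), fut_Q.result())
```

Both functions return the same value because the reordering respects every data dependency in the serial version.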

8 Selection
The condition is evaluated first, then one of two tasks is executed based on the result.
  IF (c) { f } ELSE { g }

9 Speculative Selection
Both sides of a conditional and the condition are evaluated in parallel, then the unused branch is cancelled.
  SELECT (c) { f } ELSE { g }
Effort in the cancelled task is wasted.
Use only when a computational resource would otherwise be idle, or when tasks are on the critical path.
Examples: collision culling; ray tracing; clipping; discrete event simulation; search.
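
A minimal sketch of speculative selection, assuming the condition and both branches are side-effect-free callables. The name speculative_select is hypothetical. Python futures can only cancel work that has not started, so cancellation here is best effort and the unused branch may still run to completion.

```python
from concurrent.futures import ThreadPoolExecutor

def speculative_select(cond, then_task, else_task):
    """Start cond and both branches together, keep only the needed result."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        fut_c = pool.submit(cond)
        fut_then = pool.submit(then_task)
        fut_else = pool.submit(else_task)
        taken, unused = (fut_then, fut_else) if fut_c.result() else (fut_else, fut_then)
        unused.cancel()  # best effort: has no effect once the task is running
        return taken.result()
```

The work done by the unused branch is the wasted effort the slide warns about.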

10 Map
Map replicates a function over every element of an index set (which may be abstract or associated with the elements of an array).
  A = map(f,B);
This replaces one specific usage of iteration in serial programs: processing every element of a collection with an independent operation.
Examples: gamma correction and thresholding in images; color space conversions; Monte Carlo sampling; ray tracing.
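
For illustration, here is a small map over pixel intensities using gamma correction, one of the examples listed. The helper names parallel_map and gamma_correct are assumptions, and a thread pool stands in for whatever parallel runtime is actually used.

```python
from concurrent.futures import ThreadPoolExecutor

def gamma_correct(v, gamma=2.0):
    """Independent per-element operation on an intensity in [0, 1]."""
    return v ** (1.0 / gamma)

def parallel_map(fn, xs):
    """Apply fn to every element; results come back in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fn, xs))
```

Because each call touches only its own element, the elements can be processed in any order or all at once.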

11 Reduction
Reduce combines every element in a collection into one element using an associative operator.
  b = reduce(f,B);
For example, reduce can be used to find the sum or maximum of an array.
There are some variants that arise from combination with partition and search.
Examples: averaging of Monte Carlo samples; convergence testing; image comparison metrics; sub-task in matrix operations.
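
Associativity is what allows the combination to be regrouped into a tree. The sketch below (tree_reduce is a hypothetical name) combines neighbouring pairs level by level; the pairs within one level are independent, so each level could run in parallel.

```python
from operator import add

def tree_reduce(op, xs):
    """Reduce a non-empty sequence with an associative op, pairing neighbours."""
    xs = list(xs)
    if not xs:
        raise ValueError("tree_reduce needs at least one element")
    while len(xs) > 1:
        # Every pair at this level can be combined independently.
        level = [op(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:
            level.append(xs[-1])  # odd element carries over unchanged
        xs = level
    return xs[0]
```

Left-to-right order is preserved, so the operator needs to be associative but not commutative.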

12 Scan
Scan computes all partial reductions.
Allows parallelization of many 1D recurrences.
Requires an associative operator.
Requires 2n work over serial execution, but lg n steps.
Examples: integration; sequential decision simulations in financial engineering; can also be used to implement pack.
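
The slide does not name an algorithm. One scan that fits the "about 2n work, logarithmic number of steps" description is the up-sweep/down-sweep exclusive scan usually attributed to Blelloch, sketched below. The function name and the explicit identity argument are assumptions; the inner loops are written serially but their iterations are independent.

```python
from operator import add

def exclusive_scan(op, identity, xs):
    """Exclusive scan: out[i] combines xs[0..i-1]; out[0] is the identity."""
    n = len(xs)
    size = 1
    while size < n:
        size *= 2
    a = list(xs) + [identity] * (size - n)  # pad to a power of two

    # Up-sweep: build partial reductions in a tree.
    d = 1
    while d < size:
        for i in range(2 * d - 1, size, 2 * d):  # independent iterations
            a[i] = op(a[i - d], a[i])
        d *= 2

    # Down-sweep: push prefixes back down the tree.
    a[size - 1] = identity
    d = size // 2
    while d >= 1:
        for i in range(2 * d - 1, size, 2 * d):  # independent iterations
            left = a[i - d]
            a[i - d] = a[i]
            a[i] = op(a[i], left)
        d //= 2
    return a[:n]
```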

13 Recurrences
Recurrences arise from the data dependency pattern given by nested loop-carried dependencies.
n-dimensional recurrences can always be parallelized over n-1 dimensions by Lamport's hyperplane theorem.
Execution of parallel slices can be performed either via iterative map or via wavefront parallelism.
Examples: infinite impulse response filters; sequence alignment (Smith-Waterman dynamic programming); matrix factorization.
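
As an illustrative case, take a 2D recurrence in which each cell depends on its upper and left neighbours (here a made-up lattice path count, not one of the listed examples). All cells on the anti-diagonal i + j = k are independent, so each diagonal is a parallel slice.

```python
def wavefront_paths(rows, cols):
    """Fill c[i][j] = c[i-1][j] + c[i][j-1] one anti-diagonal at a time."""
    c = [[0] * cols for _ in range(rows)]
    for k in range(rows + cols - 1):          # serial sweep over diagonals
        lo = max(0, k - cols + 1)
        hi = min(rows, k + 1)
        for i in range(lo, hi):               # cells on one diagonal: independent
            j = k - i
            if i == 0 or j == 0:
                c[i][j] = 1
            else:
                c[i][j] = c[i - 1][j] + c[i][j - 1]
    return c
```

The outer loop is the one remaining serial dimension; the inner loop is the part that could be run as a map.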

14 Recurrences: Implementation Note
Implementation can use blocking for higher performance.
When combined with the pipeline pattern, recurrences implement wavefront computation.
Can also be combined with superscalar execution (see recent ICS paper...).

15 Partition
Partition breaks an input collection into a collection of collections.
Useful for divide-and-conquer algorithms.
Variants:
  Uniform: dice
  Non-uniform: segment
  Overlapping: tile
Issues: how to deal with boundary conditions?
Partitions don't move data, they just provide an alternative view of its organization.
Examples: JPG and other macroblock compression; divide-and-conquer matrix multiplication; coherency optimization for cone-beam reconstruction.
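
To illustrate the "view, not copy" point, here is a small sketch of the uniform (dice) variant using Python memoryview slices, which alias the underlying buffer. The function name dice follows the slide's terminology; the signature is an assumption.

```python
def dice(buf, size):
    """Split a buffer into uniform chunks that are views, not copies."""
    view = memoryview(buf)
    return [view[i:i + size] for i in range(0, len(view), size)]
```

Writing through one of the returned chunks changes the original buffer, which shows that no data was moved. The last chunk may be shorter, which is one simple answer to the boundary question.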

16 Stencil
Apply a function to all neighbourhoods of an array.
Neighbourhoods are given by a set of relative offsets.
Optimized implementation requires blocking and sliding windows.
Boundary modes on array accesses are useful.
Examples: image filtering including convolution, median, anisotropic diffusion; simulation including fluid flow, electromagnetic, and financial PDE solvers; lattice QCD.
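
A minimal 1D sketch, assuming a clamp-to-edge boundary mode (one of several possible modes; the slide does not pick one). The function name and argument order are hypothetical, and none of the blocking or sliding-window optimizations are attempted.

```python
def stencil(fn, xs, offsets):
    """Apply fn to the neighbourhood of every element of xs."""
    n = len(xs)

    def at(i):
        # Clamp boundary mode: out-of-range reads use the nearest edge value.
        return xs[min(max(i, 0), n - 1)]

    return [fn([at(i + o) for o in offsets]) for i in range(n)]
```

With offsets (-1, 0, 1), passing sum gives an unnormalized box filter and passing a median function gives a 3-tap median filter.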

17 Pipeline
Tasks can be organized in a chain with local state.
Useful for serially dependent tasks like codecs.
Whole chain applied like map to a collection or stream.
Implementation of many sub-patterns may be optimized for pipeline execution when inside this pattern.
Examples: codecs with variable-rate compression; video processing; spam filtering.
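
The structure can be sketched with Python generators: each stage keeps its own local state and the chain is applied to a stream. The delta codec stages are made-up examples. This version runs the stages in one thread; a real implementation could run each stage concurrently.

```python
def delta_encode(stream):
    prev = 0                     # local state carried from item to item
    for x in stream:
        yield x - prev
        prev = x

def delta_decode(stream):
    total = 0                    # local state of the second stage
    for d in stream:
        total += d
        yield total

def pipeline(source, *stages):
    """Chain stages so each consumes the previous stage's output stream."""
    for stage in stages:
        source = stage(source)
    return source
```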

18 Pack
Pack allows deletion of elements from a collection and elimination of unused space.
Useful when used with map and other patterns to avoid unnecessary output.
Examples: narrow-phase collision detection pair testing (only want to report valid collisions); peak detection for template matching.
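
One common way to build pack is from a map, a scan and a conflict-free write, sketched below. The function name and the predicate argument are assumptions; itertools.accumulate stands in for a parallel inclusive scan.

```python
from itertools import accumulate

def pack(xs, keep):
    """Keep elements for which keep(x) is true, preserving order."""
    flags = [1 if keep(x) else 0 for x in xs]      # map: mark survivors
    positions = list(accumulate(flags))            # inclusive scan of flags
    out = [None] * (positions[-1] if positions else 0)
    for x, flag, pos in zip(xs, flags, positions):
        if flag:
            out[pos - 1] = x                       # each survivor has a unique slot
    return out
```

Because the scan gives every surviving element a distinct output index, the final writes never collide.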

19 Expand
Expand allows each element of a map operation to insert any number of elements (including none) into its output stream.
Useful when used with map and other patterns to support variable-rate output.
Examples: broad-phase collision detection pair testing (want to report potentially colliding pairs); compression and decompression.
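
A small sketch of expand, using run-length decoding as a decompression-style example. The function names are hypothetical; the per-element outputs are concatenated in input order.

```python
from itertools import chain

def expand(fn, xs):
    """fn returns zero or more outputs per element; results are concatenated."""
    return list(chain.from_iterable(fn(x) for x in xs))

def run_length_decode(pair):
    symbol, count = pair
    return [symbol] * count      # may be empty when count is 0
```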

20 Fused Patterns
Programs are built from combinations of patterns.
Should be able to fuse patterns for performance.
May be useful to explicitly support specific combinations.
Examples:
  Gather = map + random read
  Scatter = map + random write
  Map + reduce for preprocessing before reduction
  Map + pack/expand for culling operations
  Partition + reduce for multidimensional reduction

21 Search/Match
Searching and matching are fundamental capabilities.
Use to select data for another operation, by creating a (virtual) collection or partitioned collection.
Example: category reduction reduces all elements in an array with the same label, and is the form used in Google's map-reduce.
Examples: computation of metrics on segmented regions in vision; computation of web analytics.
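
Category reduction can be sketched as a reduction keyed by label. The function name and the dictionary result are assumptions; the loop is serial here, but with an associative operator each label's group could be reduced independently.

```python
from operator import add

def category_reduce(op, labels, values):
    """Combine all values that share a label using op."""
    out = {}
    for label, value in zip(labels, values):
        out[label] = op(out[label], value) if label in out else value
    return out
```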

22 Gather
Map + Random Read
Read from a random (computed) location in an array.
When used inside a map or as a collective, becomes a parallel operation.
Views into arrays, but no global pointers.
Write-after-read semantics for kernels to avoid races.
[Figure: arrays A B C D E F G and B F A C C E]
Examples: sparse matrix operations; ray tracing; proximity queries; collision detection.
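
Gather as a collective is short to write down. In the sketch below the index list is my own choice, picked so that a seven-letter source produces the six letters that appear on the slide.

```python
def gather(src, indices):
    """Read src at each computed index; reads never conflict."""
    return [src[i] for i in indices]
```

Duplicate indices are harmless because gather only reads, which is why it is deterministic without any extra rule.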

23 *!Scatter
Map + Random Write
Write into a random (computed) location in an array.
When used inside a map, becomes a parallel operation.
Race conditions are possible when there are duplicate write addresses ("collisions").
To obtain a deterministic scatter, need a deterministic rule to resolve collisions.
[Figure: arrays A B C D E F and C A ? F B]
Examples: marking pairs in collision detection; handling database update transactions.

24 *Permutation Scatter
Make collisions illegal.
Only guaranteed to work if there are no duplicate addresses.
Danger is that the programmer will use it when addresses do in fact have collisions, and will then depend on undefined behaviour.
Similar safety issue as with out-of-bounds array accesses.
Can test for collisions in debug mode.
[Figure: arrays A B C D E F and C A E D F B]
Examples: FFT scrambling; matrix/image transpose; unpacking.
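
A sketch of permutation scatter with the debug-mode collision test mentioned above. The function name, the debug flag and the choice of ValueError are assumptions; the addresses in the usage note are chosen to reproduce the letters shown on the slide.

```python
def permutation_scatter(values, addrs, size, debug=True):
    """Write values[i] to out[addrs[i]]; duplicate addresses are illegal."""
    if debug and len(set(addrs)) != len(addrs):
        raise ValueError("duplicate write address in permutation scatter")
    out = [None] * size
    for value, addr in zip(values, addrs):   # writes are independent
        out[addr] = value
    return out
```

For example, scattering A B C D E F to addresses 1, 5, 0, 3, 2, 4 yields C A E D F B.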

25 *Merge Scatter
Use an associative operator to combine values upon collision.
Problem: as with reduce, depends on the programmer to define an associative operator.
Gives non-deterministic read-modify-write when used with non-associative operators.
Due to the structured nature of the other patterns, can still provide a tool to check for race conditions.
Examples: histogram; mutual information and entropy; database updates.
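
The histogram example can be sketched as a merge scatter in which colliding writes are combined with addition. The function names and the explicit identity argument are assumptions, and the loop is serial here.

```python
from operator import add

def merge_scatter(op, values, addrs, size, identity):
    """Combine values that land on the same address using op."""
    out = [identity] * size
    for value, addr in zip(values, addrs):
        out[addr] = op(out[addr], value)
    return out

def histogram(bins, num_bins):
    """Each sample scatters a 1 into its bin; collisions are summed."""
    return merge_scatter(add, [1] * len(bins), bins, num_bins, 0)
```

Addition is both associative and commutative, so the result does not depend on the order in which colliding writes are merged.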

26 !Atomic Scatter
Resolve collisions atomically but non-deterministically.
Use of this pattern will result in non-deterministic programs.
Structured nature of the rest of the patterns makes it possible to test for race conditions.
[Figure: arrays A B C D E F and C A D F B or E]
Examples: marking pairs in collision detection; computing set intersection or union (used in text databases).

27 Priority Scatter
Assign every parallel element a priority.
  NOTE: need the hierarchical structure of the other patterns to do this.
Deterministically determine the winner based on priority.
When converting from serial code, priority can be based on the original ordering, giving results consistent with the serial program.
Efficient implementation is similar to a hierarchical z-buffer.
[Figure: arrays A B C D E F and C A E F B]
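
A sketch of priority scatter in which, by default, the priority is the element's position in the original serial order, so the later write wins just as it would in a serial loop. The function name and the optional priorities argument are assumptions, and this is not the hierarchical z-buffer style implementation the slide alludes to.

```python
def priority_scatter(values, addrs, size, priorities=None):
    """On collision, the element with the highest priority wins."""
    if priorities is None:
        priorities = range(len(values))      # serial order: later write wins
    out = [None] * size
    best = [None] * size                     # priority of the current winner
    for value, addr, prio in zip(values, addrs, priorities):
        if best[addr] is None or prio > best[addr]:
            best[addr] = prio
            out[addr] = value
    return out
```

The winner depends only on the priorities, so processing the elements in a different order gives the same output.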

28 Nesting

29 Conclusion
Patterns can be used to reason about and organize development of parallel algorithms and programming models.
Integrating these patterns into Ct for heterogeneous computing.
Many useful patterns are deterministic.
Compositions of deterministic patterns lead to deterministic programs.
Discussion:
  Is there a smaller number of primitive patterns?
  Are any important patterns missing?
  Can "structured" be well-defined?
  How important are non-deterministic patterns? Can any of these be considered structured?

30 BACKUP

31 Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference
Intel, Intel Core and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright Intel Corporation.

32 Challenge: Multiple Parallelism Mechanisms
Modern processors have many kinds of parallelism:
  Pipelining
  SIMD within a register (SWAR) vectorization
  Superscalar instruction issue or VLIW
  Overlapping memory access with computation (prefetch)
  Simultaneous multithreading (hyperthreading) per core
  Multiple cores
  Multiple processors
  Asynchronous host and accelerator execution
HPC adds: clusters, distributed memory, grid

33 General Factors Affecting Algorithm Performance
1. Parallelism
  Choose or design a good parallel algorithm
  Large amount of latent parallelism, low serial overhead
  Asymptotically efficient
  Should scale to a large number of processing elements
2. Locality
  Efficient use of the memory hierarchy
  More frequent use of faster local memory
  Coherent use of memory and data transfer
  Good alignment, predictable memory access; blocking
  High arithmetic intensity
