Parallel Programming Patterns
1 Parallel Programming Patterns Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna
2 Copyright 2013, 2017, 2018 Moreno Marzolla, Università di Bologna, Italy ( This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA. 2
3 What is a pattern? A design pattern is a general solution to a recurring engineering problem A design pattern is not a ready-made solution to a given problem......rather, it is a description of how a certain kind of problem can be solved 3
4 Architectural patterns The term architectural pattern was first used by architect Christopher Alexander to denote common design decisions that have been used by architects and engineers to realize buildings and constructions in general Christopher Alexander (1936--), A Pattern Language: Towns, Buildings, Construction 4
5 Example Building a bridge across a river You do not invent a brand new type of bridge each time Instead, you adapt an already existing type of bridge 5
6 Example 6
9 Patterns covered: Embarrassingly Parallel, Partition, Master-Worker, Stencil, Reduce, Scan 9
10 Parallel programming patterns: Embarrassingly parallel 10
11 Embarrassingly Parallel Applies when the computation can be decomposed into independent tasks that require little or no communication Examples: vector sum, Mandelbrot set, 3D rendering, brute-force password cracking, ... (figure: vector sum c[] = a[] + b[], with the array elements split among processors 0, 1, 2) 11
12 Parallel programming patterns: Partition 12
13 Partition The input data space (in short, domain) is split in disjoint regions called partitions Each processor operates on one partition This pattern is particularly useful when the application exhibits locality of reference i.e., when processors can refer to their own partition only and need little or no communication with other processors 13
14 Example Matrix-vector product Ax = b Matrix A[][] is partitioned into P horizontal blocks Each processor operates on one block of A[][] and on a full copy of x[], and computes a portion of the result b[] (figure: A[][] times x[] = b[], with the rows of A and the elements of b split among cores 0-3) 14
15 Partition Types of partition Regular: the domain is split into partitions of roughly the same size and shape, e.g., matrix-vector product. Irregular: partitions do not necessarily have the same size or shape, e.g., heat transfer on irregular solids. Size of partitions (granularity) Fine-grained: a large number of small partitions. Coarse-grained: a few large partitions 15
16 1-D Partitioning (figure: block partitioning assigns one contiguous chunk of the array to each of cores 0-3; cyclic partitioning assigns elements to the cores in round-robin order) 16
17 2-D Block Partitioning Block, * *, Block Block, Block Core 0 Core 1 Core 2 Core 3 17
18 2-D Cyclic Partitioning Cyclic, * *, Cyclic 18
19 2-D Cyclic Partitioning Cyclic-cyclic 19
20 Irregular partitioning example A lake surface is approximated with a triangular mesh Colors indicate the mapping of mesh elements to processors Source: 20
21 Computation Fine-grained vs coarse-grained partitioning. Fine-grained partitioning: better load balancing, especially if combined with the master-worker pattern (see later); however, if granularity is too fine, the computation/communication ratio might become too low (communication dominates computation). Coarse-grained partitioning: in general improves the computation/communication ratio; however, it might cause load imbalance. The "optimal" granularity is sometimes problem-dependent; in other cases the user must choose which granularity to use 21
22 Example: Mandelbrot set The Mandelbrot set is the set of points c on the complex plane such that the sequence z_n(c), defined as z_0(c) = 0 and z_n(c) = z_{n-1}(c)^2 + c for n > 0, does not diverge as n goes to +∞ 22
23 Mandelbrot set in color If the modulus of z_n(c) does not exceed 2 after nmax iterations, the pixel is black (the point is assumed to be part of the Mandelbrot set) Otherwise, the color depends on the number of iterations required for the modulus of z_n(c) to become > 2 23
24 Pseudocode Embarrassingly parallel structure: the color of each pixel can be computed independently from the other pixels; (cx, cy) are the complex coordinates of pixel (x0, y0)
maxit = 1000
for each pixel (x0, y0) {
    x = 0; y = 0; it = 0;
    while ( it < maxit AND x*x + y*y <= 2*2 ) {
        xnew = x*x - y*y + cx;
        ynew = 2*x*y + cy;
        x = xnew; y = ynew;
        it = it + 1;
    }
    plot(x0, y0, it);
}
Source: 24
25 Mandelbrot set A regular partitioning can result in uneven load distribution Black pixels require maxit iterations Other pixels require fewer iterations 25
26 Load balancing Ideally, each processor should perform the same amount of work If the tasks synchronize at the end of the computation, the execution time will be that of the slowest task (figure: tasks 0-3 with different busy times; faster tasks sit idle until the barrier synchronization) 26
27 Load balancing howto The workload is balanced if each processor performs more or less the same amount of work Ways to achieve load balancing: use fine-grained partitioning, but beware of the possible communication overhead if the tasks need to communicate; use dynamic task allocation (master-worker paradigm), but beware that dynamic task allocation might incur higher overhead than static task allocation 27
28 Master-worker paradigm (process farm, work pool) Apply a fine-grained partitioning: number of tasks >> number of cores The master assigns a task to the first available worker (figure: master distributing a bag of tasks of possibly different duration to workers 0, 1, ..., P-1) 28
29 Choosing the partition size The optimal partition size is in general system- and application-dependent; it might be estimated by measurement Too small = higher scheduling overhead; too large = unbalanced workload (figure: wall-clock time as a function of partition size, with a minimum at the optimal partition size) 29
30 (figure: Mandelbrot image rows mapped to processors P0-P3 under three schemes: coarse-grained static task assignment; static task assignment with block size 64; dynamic master-worker task assignment with block size 64) 30
31 Example omp-mandelbrot.c
Coarse-grained partitioning: OMP_SCHEDULE="static" ./omp-mandelbrot
Cyclic, fine-grained partitioning (64 rows per block): OMP_SCHEDULE="static,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (64 rows per block): OMP_SCHEDULE="dynamic,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (1 row per block): OMP_SCHEDULE="dynamic" ./omp-mandelbrot
31
32 Parallel programming patterns: Stencil 32
33 Stencils Stencil computations involve a grid whose values are updated according to a fixed pattern called stencil Example: the Gaussian smoothing of an image updates the color of each pixel with the weighted average of the previous colors of the 5×5 neighborhood 33
34 2D Stencils 5-point 2-axis 2D stencil 9-point 2-axis 2D stencil 9-point 1-plane 2D stencil 34
35 3D Stencils 13-point 3-axis 3D stencil 7-point 3-axis 3D stencil 35
36 3D Stencils 72-point 3-plane 3D stencil 36
37 2D Stencils 2D stencil computations usually employ two grids to keep the current and next values Values are read from the current grid; new values are written to the next grid; the current and next grids are exchanged at the end of each phase 37
38 Ghost Cells How do we handle cells on the border of the domain? We might assume that cells outside the border have some fixed, application-dependent value, or we may assume periodic boundary conditions, where sides are glued together to form a torus In either case, we extend the domain with ghost cells, so that cells on the border do not require any special treatment (figure: domain surrounded by a frame of ghost cells) 38
39 Periodic boundary conditions: How to fill ghost cells 39
40 2D Stencil Example: Game of Life 2D cyclic domain; each cell has two possible states (0 = dead, 1 = alive) The state of a cell at time t + 1 depends on the state of that cell at time t and on the number of alive cells at time t among its 8 neighbors Rules: an alive cell with fewer than two alive neighbors dies; an alive cell with two or three alive neighbors lives; an alive cell with more than three alive neighbors dies; a dead cell with exactly three alive neighbors becomes alive 40
41 Example: Game of Life See game-of-life.c 41
42 Periodic boundary conditions: Another way to fill ghost cells 42
48 Parallelizing stencil computations Computing the next grid from the current one has embarrassingly parallel structure:
Initialize current grid
while (!terminated) {
    Fill ghost cells
    Compute next grid
    Exchange current and next grids
}
However, domain partitioning on distributed-memory architectures requires special care 48
49 Ghost cells Partitions are again augmented with ghost cells (halo); they contain a copy of logically adjacent cells The width of the halo depends on the shape of the stencil (figure: two partitions, each surrounded by its halo) 49
50 Example: 2D partitioning with 5P stencil Periodic boundary P0 P1 P2 P3 P4 P5 P6 P7 P8 50
55 Example: 2D partitioning with 9P stencil 55
57 Example: 2D (Block, *) partitioning with 5P stencil Periodic boundary P0 P1 P2 57
61 Parallelizing 2D stencil computations on distributed-memory architectures Let us consider a 2D domain of size N×N subject to a 5P-2D stencil, on a distributed-memory machine with P = 4 processors Compare the following types of decomposition: (Block, *), where the first N/P rows are assigned to the first processor, the next N/P rows to the second processor, and so on; and (Block, Block), where the domain is decomposed into four square subdomains Consider both periodic and non-periodic boundary conditions Goal: minimize the number of ghost cells that must be exchanged among processors 61
62 Choosing a decomposition (Block, *) (Block, Block) P0 P0 P1 P2 P3 P1 P2 P3 62
63 Choosing a decomposition (Block, *), periodic boundary conditions N P0 P1 The ghost cells at the sides are not exchanged across processors, so they do not contribute to the total messages size 8 N ghost cells P2 P3 63
64 Choosing a decomposition (Block, *), non-periodic boundary conditions N P0 P1 6 N ghost cells P2 P3 64
65 Choosing a decomposition (Block, Block), periodic boundary conditions N/2 N/2 P0 P1 8 N ghost cells P2 P3 65
66 Choosing a decomposition (Block, Block), non-periodic boundary conditions N/2 N/2 P0 P1 4 N ghost cells P2 P3 66
67 Recap
                (Block, *)   (Block, Block)
Periodic           8N            8N
Non-periodic       6N            4N
67
68 1D Stencil Example: Rule 30 Cellular Automaton The state of a cell at time t + 1 depends on the states of the red cells (the cell itself and its two neighbors) at time t (figure: space-time diagram of the automaton at times t, t+1, t+2) 68
69 Example Rule 30 cellular automaton Initial configuration Configuration at time 1 Configuration at time 2 69
70 Rule 30 cellular automaton Conus textile shell Rule 30 CA 70
71 1D Cellular Automata On distributed-memory architectures, care must be taken to properly handle cells on the border Again, we use ghost cells to augment each subdomain P0 P1 P2 Cur Next 71
72 Example Rule 30 cellular automaton Processor 0 Processor 1 Processor 2 Communication Compute next step Communication Compute next step Communication 72
73 Note In the Rule 30 example, with one ghost cell per side it is possible to compute one step of the CA After that, it is necessary to fill the ghost cells with the new values from the neighbors If we use two ghost cells per side, we can compute two steps of the CA before communicating 73
74 Example Rule 30 cellular automaton Processor 0 Processor 1 Processor 2 74
75 Why? Using more ghost cells means fewer communication operations, but each communication involves more data; overall, the number of bytes exchanged remains more or less the same However, data transfers of large blocks are usually handled more efficiently than small blocks 75
76 Parallel programming patterns: Reduce 76
77 Reduce A reduction is the application of an associative binary operator (e.g., sum, product, min, max, ...) to the elements of an array [x0, x1, ..., xn-1]: sum-reduce( [x0, x1, ..., xn-1] ) = x0 + x1 + ... + xn-1; min-reduce( [x0, x1, ..., xn-1] ) = min { x0, x1, ..., xn-1 } A reduction can be realized in O(log2 n) parallel steps 77
78 Example: sum
83 Example: sum (see reduction.c)
int d, i;
/* compute largest power of two < n */
for (d=1; 2*d < n; d *= 2)
    ;
/* do reduction */
for ( ; d > 0; d /= 2) {
    for (i=0; i<d; i++) {
        if (i+d < n) x[i] += x[i+d];
    }
}
return x[0];
83
84 Work efficiency How many sums are computed by the parallel reduction algorithm? n/2 sums at the first level, n/4 sums at the second level, ..., n/2^j sums at the j-th level, ..., 1 sum at the (log2 n)-th level Total: O(n) sums The tree-structured reduction algorithm is work-efficient, which means that it performs the same amount of work as the optimal serial algorithm 84
85 Parallel programming patterns: Scan 85
86 Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator op (e.g., sum, product, min, max, ...): [y0, y1, ..., yn-1] = inclusive-scan( op, [x0, x1, ..., xn-1] ), where
y0   = x0
y1   = x0 op x1
y2   = x0 op x1 op x2
...
yn-1 = x0 op x1 op ... op xn-1
86
87 Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator op (e.g., sum, product, min, max, ...): [y0, y1, ..., yn-1] = exclusive-scan( op, [x0, x1, ..., xn-1] ), where
y0   = 0   (the neutral element of the binary operator: zero for sum, one for product, ...)
y1   = x0
y2   = x0 op x1
...
yn-1 = x0 op x1 op ... op xn-2
87
88 Blelloch Scan 88
89 Exclusive scan: Up-sweep (figure: tree of partial sums x[0..1], x[2..3], ..., built bottom-up from x[0], ..., x[7])
for ( d=1; d<n/2; d *= 2 ) {
    for ( k=0; k<n; k += 2*d ) {
        x[k+2*d-1] = x[k+d-1] + x[k+2*d-1];
    }
}
O(n) additions 89
90 Exclusive scan: Down-sweep (figure: the last element is zeroed, then partial sums are pushed back down the tree, yielding 0, x[0], x[0..1], ..., x[0..6])
x[n-1] = 0;
for ( ; d > 0; d >>= 1 ) {
    for ( k=0; k<n; k += 2*d ) {
        float t = x[k+d-1];
        x[k+d-1] = x[k+2*d-1];
        x[k+2*d-1] = t + x[k+2*d-1];
    }
}
O(n) additions. See prefix-sum.c 90
91 Example: Line of Sight n peaks of heights h[0], ..., h[n-1]; the distance between consecutive peaks is one Which peaks are visible from peak 0? (figure: skyline of peaks h[0]..h[7], some visible from peak 0, others hidden) 91
92 Line of sight Source: Guy E. Blelloch, Prefix Sums and Their Applications 92
102 Serial algorithm For each i = 0, ..., n-1: let a[i] be the slope of the line connecting peak 0 to peak i: a[0] = -∞, and a[i] = arctan( ( h[i] - h[0] ) / i ) if i > 0 Let amax[0] = -∞, and amax[i] = max { a[0], a[1], ..., a[i-1] } if i > 0 For each i = 0, ..., n-1: if a[i] ≥ amax[i] then peak i is visible, otherwise peak i is not visible 102
103 Serial algorithm
bool[0..n-1] Line-of-sight( double h[0..n-1] )
    bool v[0..n-1]
    double a[0..n-1], amax[0..n-1]
    a[0] ← -∞
    for i ← 1 to n-1 do
        a[i] ← arctan( ( h[i] - h[0] ) / i )
    endfor
    amax[0] ← -∞
    for i ← 1 to n-1 do
        amax[i] ← max{ a[i-1], amax[i-1] }
    endfor
    for i ← 0 to n-1 do
        v[i] ← ( a[i] ≥ amax[i] )
    endfor
    return v
103
104 Serial algorithm The same pseudocode, annotated: the loop computing a[] and the loop computing v[] are embarrassingly parallel; the amax[] loop carries a sequential dependency (it is a max-scan) 104
105 Parallel algorithm
bool[0..n-1] Parallel-line-of-sight( double h[0..n-1] )
    bool v[0..n-1]
    double a[0..n-1], amax[0..n-1]
    a[0] ← -∞
    for i ← 1 to n-1 do in parallel
        a[i] ← arctan( ( h[i] - h[0] ) / i )
    endfor
    amax ← exclusive-scan( max, a )
    for i ← 0 to n-1 do in parallel
        v[i] ← ( a[i] ≥ amax[i] )
    endfor
    return v
105
106 Conclusions A parallel programming pattern defines: a partitioning of the input data and a communication structure among parallel tasks Parallel programming patterns can help to define efficient algorithms Many problems can be solved using one or more known patterns 106
More informationTransform & Conquer. Presorting
Transform & Conquer Definition Transform & Conquer is a general algorithm design technique which works in two stages. STAGE : (Transformation stage): The problem s instance is modified, more amenable to
More informationProgramming as Successive Refinement. Partitioning for Performance
Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday
More informationAlgorithms PART I: Embarrassingly Parallel. HPC Fall 2012 Prof. Robert van Engelen
Algorithms PART I: Embarrassingly Parallel HPC Fall 2012 Prof. Robert van Engelen Overview Ideal parallelism Master-worker paradigm Processor farms Examples Geometrical transformations of images Mandelbrot
More informationConcurrent Programming with OpenMP
Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed
More informationDigital Computer Arithmetic
Digital Computer Arithmetic Part 6 High-Speed Multiplication Soo-Ik Chae Spring 2010 Koren Chap.6.1 Speeding Up Multiplication Multiplication involves 2 basic operations generation of partial products
More informationLecture 18 Representation and description I. 2. Boundary descriptors
Lecture 18 Representation and description I 1. Boundary representation 2. Boundary descriptors What is representation What is representation After segmentation, we obtain binary image with interested regions
More informationSC12 HPC Educators session: Unveiling parallelization strategies at undergraduate level
SC12 HPC Educators session: Unveiling parallelization strategies at undergraduate level E. Ayguadé, R. M. Badia, D. Jiménez, J. Labarta and V. Subotic August 31, 2012 Index Index 1 1 The infrastructure:
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:
More informationScanning Real World Objects without Worries 3D Reconstruction
Scanning Real World Objects without Worries 3D Reconstruction 1. Overview Feng Li 308262 Kuan Tian 308263 This document is written for the 3D reconstruction part in the course Scanning real world objects
More informationCSC630/COS781: Parallel & Distributed Computing
CSC630/COS781: Parallel & Distributed Computing Algorithm Design Chapter 3 (3.1-3.3) 1 Contents Preliminaries of parallel algorithm design Decomposition Task dependency Task dependency graph Granularity
More informationMPI Case Study. Fabio Affinito. April 24, 2012
MPI Case Study Fabio Affinito April 24, 2012 In this case study you will (hopefully..) learn how to Use a master-slave model Perform a domain decomposition using ghost-zones Implementing a message passing
More informationCS 677: Parallel Programming for Many-core Processors Lecture 6
1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11
More informationParallel Computing. Parallel Algorithm Design
Parallel Computing Parallel Algorithm Design Task/Channel Model Parallel computation = set of tasks Task Program Local memory Collection of I/O ports Tasks interact by sending messages through channels
More informationWeek 3: MPI. Day 04 :: Domain decomposition, load balancing, hybrid particlemesh
Week 3: MPI Day 04 :: Domain decomposition, load balancing, hybrid particlemesh methods Domain decompositon Goals of parallel computing Solve a bigger problem Operate on more data (grid points, particles,
More informationData parallel algorithms 1
Data parallel algorithms (Guy Steele): The data-parallel programming style is an approach to organizing programs suitable for execution on massively parallel computers. In this lecture, we will characterize
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationDetermining Line Segment Visibility with MPI
Determining Line Segment Visibility with MPI CSE 633: Parallel Algorithms Fall 2012 Jayan Patel Problem Definition Computational Geometry From Algorithms Sequential and Parallel: Given a set of n pair-wise
More informationHigh Performance Computing in C and C++
High Performance Computing in C and C++ Rita Borgo Computer Science Department, Swansea University Announcement No change in lecture schedule: Timetable remains the same: Monday 1 to 2 Glyndwr C Friday
More informationCS 664 Segmentation. Daniel Huttenlocher
CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationGPU-accelerated data expansion for the Marching Cubes algorithm
GPU-accelerated data expansion for the Marching Cubes algorithm San Jose (CA) September 23rd, 2010 Christopher Dyken, SINTEF Norway Gernot Ziegler, NVIDIA UK Agenda Motivation & Background Data Compaction
More informationEE382N (20): Computer Architecture - Parallelism and Locality Lecture 10 Parallelism in Software I
EE382 (20): Computer Architecture - Parallelism and Locality Lecture 10 Parallelism in Software I Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality (c) Rodric Rabbah, Mattan
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationDesign of Parallel Programs Algoritmi e Calcolo Parallelo. Daniele Loiacono
Design of Parallel Programs Algoritmi e Calcolo Parallelo Web: home.dei.polimi.it/loiacono Email: loiacono@elet.polimi.it References q The material in this set of slide is taken from two tutorials by Blaise
More informationDynamic load balancing in OSIRIS
Dynamic load balancing in OSIRIS R. A. Fonseca 1,2 1 GoLP/IPFN, Instituto Superior Técnico, Lisboa, Portugal 2 DCTI, ISCTE-Instituto Universitário de Lisboa, Portugal Maintaining parallel load balance
More informationProblem 3. (12 points):
Problem 3. (12 points): This problem tests your understanding of basic cache operations. Harry Q. Bovik has written the mother of all game-of-life programs. The Game-of-life is a computer game that was
More informationBasic Communication Operations (Chapter 4)
Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:
More informationParallel Techniques. Embarrassingly Parallel Computations. Partitioning and Divide-and-Conquer Strategies
slides3-1 Parallel Techniques Embarrassingly Parallel Computations Partitioning and Divide-and-Conquer Strategies Pipelined Computations Synchronous Computations Asynchronous Computations Load Balancing
More informationCost-Effective Parallel Computational Electromagnetic Modeling
Cost-Effective Parallel Computational Electromagnetic Modeling, Tom Cwik {Daniel.S.Katz, cwik}@jpl.nasa.gov Beowulf System at PL (Hyglac) l 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory,
More informationOutline: Embarrassingly Parallel Problems. Example#1: Computation of the Mandelbrot Set. Embarrassingly Parallel Problems. The Mandelbrot Set
Outline: Embarrassingly Parallel Problems Example#1: Computation of the Mandelbrot Set what they are Mandelbrot Set computation cost considerations static parallelization dynamic parallelizations and its
More informationMultidimensional Indexes [14]
CMSC 661, Principles of Database Systems Multidimensional Indexes [14] Dr. Kalpakis http://www.csee.umbc.edu/~kalpakis/courses/661 Motivation Examined indexes when search keys are in 1-D space Many interesting
More informationGhost Cell Pattern. Fredrik Berg Kjolstad. January 26, 2010
Ghost Cell Pattern Fredrik Berg Kjolstad University of Illinois Urbana-Champaign, USA kjolsta1@illinois.edu Marc Snir University of Illinois Urbana-Champaign, USA snir@illinois.edu January 26, 2010 Problem
More informationCPS343 Parallel and High Performance Computing Project 1 Spring 2018
CPS343 Parallel and High Performance Computing Project 1 Spring 2018 Assignment Write a program using OpenMP to compute the estimate of the dominant eigenvalue of a matrix Due: Wednesday March 21 The program
More informationWhy Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends
Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer
More informationKevin J. Barker. Scott Pakin and Darren J. Kerbyson
Experiences in Performance Modeling: The Krak Hydrodynamics Application Kevin J. Barker Scott Pakin and Darren J. Kerbyson Performance and Architecture Laboratory (PAL) http://www.c3.lanl.gov/pal/ Computer,
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming David Lifka lifka@cac.cornell.edu May 23, 2011 5/23/2011 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor or computer to complete
More informationWhat are Cellular Automata?
What are Cellular Automata? It is a model that can be used to show how the elements of a system interact with each other. Each element of the system is assigned a cell. The cells can be 2-dimensional squares,
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationCoE4TN4 Image Processing
CoE4TN4 Image Processing Chapter 11 Image Representation & Description Image Representation & Description After an image is segmented into regions, the regions are represented and described in a form suitable
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationCS535 Fall Department of Computer Science Purdue University
Spatial Data Structures and Hierarchies CS535 Fall 2010 Daniel G Aliaga Daniel G. Aliaga Department of Computer Science Purdue University Spatial Data Structures Store geometric information Organize geometric
More information