Parallel Programming Patterns


1 Parallel Programming Patterns Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna

2 Copyright 2013, 2017, 2018 Moreno Marzolla, Università di Bologna, Italy. This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA. 2

3 What is a pattern? A design pattern is a general solution to a recurring engineering problem. A design pattern is not a ready-made solution to a given problem... rather, it is a description of how a certain kind of problem can be solved. 3

4 Architectural patterns The term architectural pattern was first used by the architect Christopher Alexander to denote common design decisions that have been used by architects and engineers to realize buildings and constructions in general. Christopher Alexander (1936--), A Pattern Language: Towns, Buildings, Construction 4

5 Example Building a bridge across a river: you do not invent a brand new type of bridge each time; instead, you adapt an already existing type of bridge. 5

6 Example 6

7 Example 7

8 Example 8

9 Embarrassingly Parallel, Partition, Master-Worker, Stencil, Reduce, Scan 9

10 Parallel programming patterns: Embarrassingly parallel 10

11 Embarrassingly Parallel Applies when the computation can be decomposed into independent tasks that require little or no communication. Examples: vector sum, Mandelbrot set, 3D rendering, brute force password cracking... (Figure: vector sum c[] = a[] + b[], with different slices of the arrays a[], b[], c[] assigned to Processor 0, 1, 2.) 11
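To make the pattern concrete, here is a minimal sketch of the vector sum example using OpenMP; the array names a, b, c and the length n are illustrative and not taken from any specific program in these slides.

    /* embarrassingly parallel vector sum: every iteration is independent,
       so the loop can be split among threads with no communication */
    void vec_sum(const double *a, const double *b, double *c, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }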

12 Parallel programming patterns: Partition 12

13 Partition The input data space (in short, domain) is split into disjoint regions called partitions. Each processor operates on one partition. This pattern is particularly useful when the application exhibits locality of reference, i.e., when processors can refer to their own partition only and need little or no communication with other processors. 13

14 Example Matrix-vector product Ax = b. The matrix A[][] is partitioned into P horizontal blocks. Each processor operates on one block of A[][] and on a full copy of x[], and computes a portion of the result b[]. (Figure: the row blocks of A[][] assigned to Core 0..Core 3; each core produces the corresponding block of b[].) 14
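A minimal sketch of this row-block partitioning, assuming a row-major n x n matrix stored in a 1-D array; with schedule(static), OpenMP assigns one contiguous block of rows to each thread, which corresponds to the (Block, *) decomposition discussed below. Function name and parameters are illustrative.

    void mat_vec(const double *A, const double *x, double *b, int n)
    {
        /* one contiguous block of rows per thread */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++) {
                s += A[i*n + j] * x[j];   /* each thread reads the full copy of x[] */
            }
            b[i] = s;
        }
    }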

15 Partition Types of partition. Regular: the domain is split into partitions of roughly the same size and shape, e.g., matrix-vector product. Irregular: partitions do not necessarily have the same size or shape, e.g., heat transfer on irregular solids. Size of partitions (granularity): Fine-Grained = a large number of small partitions; Coarse-Grained = a few large partitions. 15

16 1-D Partitioning (Figure: a 1-D domain assigned to Core 0..Core 3, with a Block partitioning, where each core gets one contiguous chunk, and with a Cyclic partitioning, where elements are assigned to cores in round-robin order.) 16
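The two assignments can be expressed directly as index arithmetic. The following sketch shows which elements of an n-element 1-D domain core p (out of P cores) would own under a block and under a cyclic partitioning; function names and parameters are illustrative.

    void visit_block_partition(int p, int P, int n)
    {
        int chunk = (n + P - 1) / P;               /* ceil(n/P) elements per core */
        int start = p * chunk;
        int end = (start + chunk < n) ? start + chunk : n;
        for (int i = start; i < end; i++) {
            /* ... work on element i ... */
        }
    }

    void visit_cyclic_partition(int p, int P, int n)
    {
        for (int i = p; i < n; i += P) {           /* every P-th element, starting at p */
            /* ... work on element i ... */
        }
    }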

17 2-D Block Partitioning (Figure: the (Block, *), (*, Block) and (Block, Block) decompositions of a 2-D domain among Core 0..Core 3.) 17

18 2-D Cyclic Partitioning (Figure: the (Cyclic, *) and (*, Cyclic) decompositions.) 18

19 2-D Cyclic Partitioning (Figure: the (Cyclic, Cyclic) decomposition.) 19

20 Irregular partitioning example A lake surface is approximated with a triangular mesh Colors indicate the mapping of mesh elements to processors Source: 20

21 Fine-grained vs coarse-grained partitioning. Fine-grained partitioning: better load balancing, especially if combined with the master-worker pattern (see later); however, if granularity is too fine, the computation / communication ratio might become too low (communication dominates over computation). Coarse-grained partitioning: in general improves the computation / communication ratio; however, it might cause load imbalance. The "optimal" granularity is sometimes problem-dependent; in other cases the user must choose which granularity to use. (Figure: computation vs communication time for the two cases.) 21

22 Example: Mandelbrot set The Mandelbrot set is the set of points c on the complex plane such that the sequence z_n(c), defined as z_0(c) = 0 and z_n(c) = z_{n-1}(c)^2 + c for n > 0, does not diverge when n → +∞. 22

23 Mandelbrot set in color If the modulus of z_n(c) does not exceed 2 after nmax iterations, the pixel is black (the point is assumed to be part of the Mandelbrot set). Otherwise, the color depends on the number of iterations required for the modulus of z_n(c) to become > 2. 23

24 Pseudocode Embarrassingly parallel structure: the color of each pixel can be computed independently from the other pixels.

    maxit = 1000
    for each pixel (x0, y0) {
        (cx, cy) = coordinates of the pixel on the complex plane
        x = 0; y = 0; it = 0;
        while ( it < maxit AND x*x + y*y <= 2*2 ) {
            xnew = x*x - y*y + cx;
            ynew = 2*x*y + cy;
            x = xnew; y = ynew;
            it = it + 1;
        }
        plot(x0, y0, it);
    }

Source: 24

25 Mandelbrot set A regular partitioning can result in uneven load distribution Black pixels require maxit iterations Other pixels require fewer iterations 25

26 Load balancing Ideally, each processor should perform the same amount of work. If the tasks synchronize at the end of the computation, the execution time will be that of the slowest task. (Figure: Tasks 0..3 with different busy and idle times before a barrier synchronization.) 26

27 Load balancing howto The workload is balanced if each processor performs more or less the same amount of work. Ways to achieve load balancing: use fine-grained partitioning... but beware of the possible communication overhead if the tasks need to communicate; use dynamic task allocation (master-worker paradigm)... but beware that dynamic task allocation might incur higher overhead than static task allocation. 27

28 Master-worker paradigm (process farm, work pool) Apply a fine-grained partitioning: number of tasks >> number of cores. The master holds a bag of tasks of possibly different duration and assigns a task to the first available worker. (Figure: Master dispatching tasks to Worker 0, Worker 1, ..., Worker P-1.) 28
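On shared memory, the master-worker assignment can be approximated with an OpenMP dynamic schedule, where each idle thread grabs the next task from the shared bag. A minimal sketch, assuming an illustrative do_task() function of unpredictable duration:

    extern void do_task(int t);   /* illustrative: a task of unpredictable duration */

    void work_pool(int ntasks)
    {
        /* schedule(dynamic,1): each idle thread fetches the next task from
           the shared bag of tasks, emulating the master-worker assignment */
        #pragma omp parallel for schedule(dynamic, 1)
        for (int t = 0; t < ntasks; t++) {
            do_task(t);
        }
    }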

29 Choosing the partition size The optimal partition size is in general system- and application-dependent; it might be estimated by measurement. Too small = higher scheduling overhead; too large = unbalanced workload. (Figure: wall-clock time vs partition size, with a minimum at the optimal partition size.) 29

30 (Figure: three row-to-processor assignments for the Mandelbrot image: coarse-grained decomposition with static task assignment; block size = 64 with static task assignment; block size = 64 with dynamic, master-worker task assignment.) 30

31 Example omp-mandelbrot.c
Coarse-grained partitioning: OMP_SCHEDULE="static" ./omp-mandelbrot
Cyclic, fine-grained partitioning (64 rows per block): OMP_SCHEDULE="static,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (64 rows per block): OMP_SCHEDULE="dynamic,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (1 row per block): OMP_SCHEDULE="dynamic" ./omp-mandelbrot
31
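The OMP_SCHEDULE variable only takes effect if the parallel loop uses schedule(runtime). The following sketch shows the presumed structure of the per-row loop; draw_row() and height are illustrative names, see omp-mandelbrot.c for the actual code.

    extern void draw_row(int y);   /* illustrative: computes all pixels of image row y */

    void render(int height)
    {
        /* schedule(runtime) defers the choice of schedule type and chunk
           size to the OMP_SCHEDULE environment variable */
        #pragma omp parallel for schedule(runtime)
        for (int y = 0; y < height; y++) {
            draw_row(y);
        }
    }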

32 Parallel programming patterns: Stencil 32

33 Stencils Stencil computations involve a grid whose values are updated according to a fixed pattern called a stencil. Example: the Gaussian smoothing of an image updates the color of each pixel with the weighted average of the previous colors of the 5×5 neighborhood.

34 2D Stencils 5-point 2-axis 2D stencil 9-point 2-axis 2D stencil 9-point 1-plane 2D stencil 34

35 3D Stencils 13-point 3-axis 3D stencil 7-point 3-axis 3D stencil 35

36 3D Stencils 72-point 3-plane 3D stencil 36

37 2D Stencils 2D stencil computations usually employ two grids to keep the current and next values. Values are read from the current grid; new values are written to the next grid; the current and next grids are exchanged at the end of each phase. 37

38 Ghost Cells How do we handle cells on the border of the domain? We might assume that cells outside the border have some fixed, application-dependent value, or we may assume periodic boundary conditions, where opposite sides are glued together to form a torus. In either case, we extend the domain with ghost cells, so that cells on the border do not require any special treatment. (Figure: domain surrounded by a frame of ghost cells.)
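A minimal sketch of filling a one-cell halo with periodic boundary conditions, assuming an (N+2) x (N+2) grid stored row-major with the domain in rows and columns 1..N; names and layout are illustrative.

    void fill_ghost_cells(int *grid, int N)
    {
        const int w = N + 2;                  /* row width including the halo */
        /* top and bottom ghost rows copy the opposite border rows */
        for (int j = 1; j <= N; j++) {
            grid[0*w + j]     = grid[N*w + j];
            grid[(N+1)*w + j] = grid[1*w + j];
        }
        /* left and right ghost columns (corners included) */
        for (int i = 0; i <= N+1; i++) {
            grid[i*w + 0]     = grid[i*w + N];
            grid[i*w + (N+1)] = grid[i*w + 1];
        }
    }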

39 Periodic boundary conditions: How to fill ghost cells 39

40 2D Stencil Example: Game of Life 2D periodic (cyclic) domain; each cell has two possible states: 0 = dead, 1 = alive. The state of a cell at time t + 1 depends on the state of that cell at time t and on the number of alive cells at time t among its 8 neighbors. Rules: an alive cell with fewer than two alive neighbors dies; an alive cell with two or three alive neighbors lives; an alive cell with more than three alive neighbors dies; a dead cell with exactly three alive neighbors becomes alive. 40

41 Example: Game of Life See game-of-life.c 41
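A sketch of one Game of Life step under the rules above, assuming the same (N+2) x (N+2) layout with the one-cell halo already filled; this is only an illustration, not the code of game-of-life.c.

    void life_step(const int *cur, int *next, int N)
    {
        const int w = N + 2;
        #pragma omp parallel for
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= N; j++) {
                int alive = 0;
                /* count alive cells among the 8 neighbors */
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        if (di != 0 || dj != 0)
                            alive += cur[(i+di)*w + (j+dj)];
                if (cur[i*w + j])
                    next[i*w + j] = (alive == 2 || alive == 3);
                else
                    next[i*w + j] = (alive == 3);
            }
        }
    }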

42 Periodic boundary conditions: Another way to fill ghost cells 42

43 Periodic boundary conditions: Another way to fill ghost cells 43

44 Periodic boundary conditions: Another way to fill ghost cells 44

45 Periodic boundary conditions: Another way to fill ghost cells 45

46 Periodic boundary conditions: Another way to fill ghost cells 46

47 Periodic boundary conditions: Another way to fill ghost cells 47

48 Parallelizing stencil computations Computing the next grid from the current one has embarrassingly parallel structure.

    Initialize current grid
    while (!terminated) {
        Fill ghost cells
        Compute next grid              (embarrassingly parallel)
        Exchange current and next grids
    }

However, domain partitioning on distributed-memory architectures requires special care. 48
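The loop above maps directly to code. A sketch of the driver, reusing the illustrative fill_ghost_cells() from before and an illustrative compute_next() grid update (for example, the Game of Life step):

    extern void fill_ghost_cells(int *grid, int N);              /* as sketched earlier */
    extern void compute_next(const int *cur, int *next, int N);  /* illustrative grid update */

    void stencil_run(int *cur, int *next, int N, int nsteps)
    {
        for (int s = 0; s < nsteps; s++) {
            fill_ghost_cells(cur, N);
            compute_next(cur, next, N);               /* embarrassingly parallel step */
            int *tmp = cur; cur = next; next = tmp;   /* exchange current and next grids */
        }
    }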

49 Ghost cells Partitions are again augmented with ghost cells (halo). They contain a copy of logically adjacent cells. The width of the halo depends on the shape of the stencil. (Figure: two partitions, each surrounded by its halo.) 49

50 Example: 2D partitioning with 5P stencil Periodic boundary P0 P1 P2 P3 P4 P5 P6 P7 P8 50

51 Example: 2D partitioning with 5P stencil Periodic boundary 51

52 Example: 2D partitioning with 5P stencil Periodic boundary 52

53 Example: 2D partitioning with 5P stencil Periodic boundary 53

54 Example: 2D partitioning with 5P stencil Periodic boundary 54

55 Example: 2D partitioning with 9P stencil 55

56 Example: 2D partitioning with 9P stencil 56

57 Example: 2D (Block, *) partitioning with 5P stencil Periodic boundary P0 P1 P2 57

58 Example: 2D (Block, *) partitioning with 5P stencil Periodic boundary 58

59 Example: 2D (Block, *) partitioning with 5P stencil Periodic boundary 59

60 Example: 2D (Block, *) partitioning with 5P stencil Periodic boundary 60

61 Parallelizing 2D stencil computations on distributed-memory architectures Let us consider a 2D domain of size N × N subject to a 5P-2D stencil, on a distributed-memory machine with P = 4 processors. Compare the following types of decomposition: (Block, *), where the first N/P rows are assigned to the first processor, the next N/P rows to the second processor, and so on; and (Block, Block), where the domain is decomposed into four square subdomains. Consider both periodic and non-periodic boundary conditions. Goal: minimize the number of ghost cells that must be exchanged among processors. 61

62 Choosing a decomposition (Block, *) (Block, Block) P0 P0 P1 P2 P3 P1 P2 P3 62

63 Choosing a decomposition (Block, *), periodic boundary conditions. The ghost cells at the left and right sides are not exchanged across processors, so they do not contribute to the total message size. Total: 8 N ghost cells exchanged (P0..P3 each exchange one row of N cells with the processor above and one with the processor below). 63

64 Choosing a decomposition (Block, *), non-periodic boundary conditions N P0 P1 6 N ghost cells P2 P3 64

65 Choosing a decomposition (Block, Block), periodic boundary conditions N/2 N/2 P0 P1 8 N ghost cells P2 P3 65

66 Choosing a decomposition (Block, Block), non-periodic boundary conditions N/2 N/2 P0 P1 4 N ghost cells P2 P3 66

67 Recap Number of ghost cells exchanged:
                  (Block, *)   (Block, Block)
    Periodic         8 N           8 N
    Non-periodic     6 N           4 N
67

68 1D Stencil Example: Rule 30 Cellular Automaton The state of a cell at time t + 1 depends on the states of the cell itself and of its left and right neighbors at time t (the red cells in the figure). (Figure: the dependency pattern across times t, t+1, t+2, and the Rule 30 transition table.) 68

69 Example Rule 30 cellular automaton Initial configuration Configuration at time 1 Configuration at time 2 69

70 Rule 30 cellular automaton Conus textile shell Rule 30 CA 70

71 1D Cellular Automata On distributed-memory architectures, care must be taken to properly handle cells on the border of each subdomain. Again, we use ghost cells to augment each subdomain. (Figure: subdomains on P0, P1, P2, each with Cur and Next arrays extended with ghost cells.) 71
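A sketch of one Rule 30 step on a subdomain of n cells with one ghost cell per side, assuming cur[0] and cur[n+1] have already been filled by communication with the neighbors; names are illustrative. Rule 30 can be written as new = left XOR (center OR right).

    void rule30_step(const int *cur, int *next, int n)
    {
        for (int i = 1; i <= n; i++) {
            /* Rule 30: new state = left XOR (center OR right) */
            next[i] = cur[i-1] ^ (cur[i] | cur[i+1]);
        }
    }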

72 Example Rule 30 cellular automaton Processor 0 Processor 1 Processor 2 Communication Compute next step Communication Compute next step Communication 72

73 Note In the Rule 30 example, using one ghost cell per side it is possible to compute one step of the CA. After that, it is necessary to fill the ghost cells with the new values from the neighbors. If we use two ghost cells per side, we can compute two steps of the CA. 73

74 Example Rule 30 cellular automaton Processor 0 Processor 1 Processor 2 74

75 Why? Using more ghost cells means fewer communication operations, but each communication involves more data; overall, the number of bytes exchanged remains more or less the same. However, data transfers of large blocks are usually handled more efficiently than small blocks. 75

76 Parallel programming patterns: Reduce 76

77 Reduce A reduction is the application of an associative binary operator (e.g., sum, product, min, max...) to the elements of an array [x0, x1, ..., xn-1]: sum-reduce( [x0, x1, ..., xn-1] ) = x0 + x1 + ... + xn-1; min-reduce( [x0, x1, ..., xn-1] ) = min { x0, x1, ..., xn-1 }. A reduction can be realized in O(log2 n) parallel steps. 77
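In shared-memory code the pattern is usually expressed with a reduction clause. A minimal sketch for a sum-reduction, with illustrative names:

    double sum_reduce(const double *x, int n)
    {
        double s = 0.0;
        /* each thread accumulates a private partial sum; the partial sums
           are then combined, which is exactly the reduce pattern */
        #pragma omp parallel for reduction(+:s)
        for (int i = 0; i < n; i++) {
            s += x[i];
        }
        return s;
    }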

78 Example: sum

79 Example: sum

80 Example: sum

81 Example: sum

82 Example: sum

83 Example: sum

    int d, i;
    /* compute largest power of two < n */
    for (d = 1; 2*d < n; d *= 2)
        ;
    /* do reduction */
    for ( ; d > 0; d /= 2 ) {
        for (i = 0; i < d; i++) {
            if (i + d < n) x[i] += x[i+d];
        }
    }
    return x[0];

See reduction.c 83

84 Work efficiency How many sums are computed by the parallel reduction algorithm? n/2 sums at the first level, n/4 sums at the second level, ..., n/2^j sums at the j-th level, ..., 1 sum at the (log2 n)-th level. Total: O(n) sums. The tree-structured reduction algorithm is work-efficient, which means that it performs the same amount of work as the optimal serial algorithm. 84

85 Parallel programming patterns: Scan 85

86 Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator op (e.g., sum, product, min, max...): [y0, y1, ..., yn-1] = inclusive-scan( op, [x0, x1, ..., xn-1] ), where
    y0 = x0
    y1 = x0 op x1
    y2 = x0 op x1 op x2
    ...
    yn-1 = x0 op x1 op ... op xn-1

87 Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator op (e.g., sum, product, min, max...): [y0, y1, ..., yn-1] = exclusive-scan( op, [x0, x1, ..., xn-1] ), where
    y0 = 0   (the neutral element of the binary operator: zero for sum, 1 for product, ...)
    y1 = x0
    y2 = x0 op x1
    ...
    yn-1 = x0 op x1 op ... op xn-2
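For reference, a serial exclusive scan with the + operator is a one-line recurrence (names illustrative); the Blelloch scan below computes the same result with a tree-structured parallel algorithm.

    void exclusive_scan_sum(const double *x, double *y, int n)
    {
        y[0] = 0.0;                  /* neutral element of + */
        for (int i = 1; i < n; i++) {
            y[i] = y[i-1] + x[i-1];  /* y[i] = x[0] + ... + x[i-1] */
        }
    }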

88 Blelloch Scan 88

89 Exclusive scan: Up-sweep (Figure: the up-sweep tree on x[0..7]; at each level, partial sums such as x[0..1], x[2..3], x[0..3] are accumulated in place.)

    for ( d = 1; d < n/2; d *= 2 ) {
        for ( k = 0; k < n; k += 2*d ) {
            x[k+2*d-1] = x[k+d-1] + x[k+2*d-1];
        }
    }

O(n) additions 89

90 Exclusive scan: Down-sweep (Figure: the down-sweep tree; the last element is zeroed, then partial sums are pushed down until x[i] holds x[0..i-1].)

    x[n-1] = 0;
    for ( ; d > 0; d >>= 1 ) {
        for ( k = 0; k < n; k += 2*d ) {
            float t = x[k+d-1];
            x[k+d-1] = x[k+2*d-1];
            x[k+2*d-1] = t + x[k+2*d-1];
        }
    }

O(n) additions. See prefix-sum.c 90

91 Example: Line of Sight n peaks of heights h[0], ..., h[n-1]; the distance between consecutive peaks is one. Which peaks are visible from peak 0? (Figure: peaks h[0]..h[7], some visible and some not visible from peak 0.) 91

92 Line of sight Source: Guy E. Blelloch, Prefix Sums and Their Applications 92

93 Line of sight h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7] 93

94 Line of sight h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7] 94

95 Line of sight h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7] 95

96 Line of sight h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7] 96

97 Line of sight h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7] 97

98 Line of sight h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7] 98

99 Line of sight h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7] 99

100 Line of sight h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7] 100

101 Line of sight h[0] h[1] h[2] h[3] h[4] h[5] h[6] h[7] 101

102 Serial algorithm For each i = 0, ..., n-1, let a[i] be the slope of the line connecting peak 0 to peak i:
    a[0] ← -∞;   a[i] ← arctan( ( h[i] - h[0] ) / i ), if i > 0
For each i = 0, ..., n-1:
    amax[0] ← -∞;   amax[i] ← max { a[0], a[1], ..., a[i-1] }, if i > 0
For each i = 0, ..., n-1: if a[i] ≥ amax[i] then peak i is visible, otherwise peak i is not visible. 102

103 Serial algorithm

    bool[0..n-1] Line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do
            a[i] ← arctan( ( h[i] - h[0] ) / i )
        endfor
        amax[0] ← -∞
        for i ← 1 to n-1 do
            amax[i] ← max{ a[i-1], amax[i-1] }
        endfor
        for i ← 0 to n-1 do
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

103

104 Serial algorithm

    bool[0..n-1] Line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do                  (embarrassingly parallel)
            a[i] ← arctan( ( h[i] - h[0] ) / i )
        endfor
        amax[0] ← -∞
        for i ← 1 to n-1 do
            amax[i] ← max{ a[i-1], amax[i-1] }
        endfor
        for i ← 0 to n-1 do                  (embarrassingly parallel)
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

104

105 Parallel algorithm

    bool[0..n-1] Parallel-line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do in parallel
            a[i] ← arctan( ( h[i] - h[0] ) / i )
        endfor
        amax ← exclusive-scan( max, a )
        for i ← 0 to n-1 do in parallel
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

105
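A C sketch of the algorithm above: the two loops are embarrassingly parallel, while the exclusive max-scan is shown serially here as a stand-in for a parallel scan. All names are illustrative.

    #include <math.h>
    #include <float.h>
    #include <stdbool.h>
    #include <stdlib.h>

    void line_of_sight(const double *h, bool *v, int n)
    {
        double *a = malloc(n * sizeof(*a));
        double *amax = malloc(n * sizeof(*amax));
        int i;

        a[0] = -DBL_MAX;
        #pragma omp parallel for
        for (i = 1; i < n; i++)
            a[i] = atan((h[i] - h[0]) / i);

        /* exclusive max-scan of a[] (serial stand-in for a parallel scan) */
        amax[0] = -DBL_MAX;
        for (i = 1; i < n; i++)
            amax[i] = (a[i-1] > amax[i-1]) ? a[i-1] : amax[i-1];

        #pragma omp parallel for
        for (i = 0; i < n; i++)
            v[i] = (a[i] >= amax[i]);

        free(a);
        free(amax);
    }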

106 Conclusions A parallel programming pattern defines a partitioning of the input data and a communication structure among parallel tasks. Parallel programming patterns can help to define efficient algorithms. Many problems can be solved using one or more known patterns. 106
