Parallel Computing. Parallel Algorithm Design
- Eileen Welch
1 Parallel Computing Parallel Algorithm Design
2 Task/Channel Model
Parallel computation = a set of tasks
- Task: a program, its local memory, and a collection of I/O ports
- Tasks interact by sending messages through channels
2010@FEUP Parallel Algorithm Design 2
3 Task/Channel Model
[Figure: tasks connected by channels; labels: Task, Channel]
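To make the task/channel model concrete, here is a minimal Python sketch (not from the slides) in which threads stand in for tasks and a `queue.Queue` stands in for a one-way channel. The function names and the `None` end-of-stream marker are assumptions of this illustration; each task touches only its own local variables plus the channel.

```python
import queue
import threading

# End-of-stream marker: a hypothetical convention for this sketch.
END = None

def producer(out_port, values):
    """Task whose local memory is `values`; it has one output port."""
    for v in values:
        out_port.put(v)          # send a message through the channel
    out_port.put(END)

def consumer(in_port, result):
    """Task with one input port; accumulates received values in local memory."""
    total = 0
    while True:
        v = in_port.get()        # blocks until a message arrives
        if v is END:
            break
        total += v
    result.append(total)         # hand the result back to the caller

def run_two_tasks(values):
    channel = queue.Queue()      # one-way channel connecting the two tasks
    result = []
    tasks = [threading.Thread(target=producer, args=(channel, values)),
             threading.Thread(target=consumer, args=(channel, result))]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return result[0]
```

Because `get()` blocks, the consumer waits for data exactly as a task waits on an incoming channel.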
4 Foster's Design Methodology
1. Partitioning
2. Communication
3. Agglomeration
4. Mapping
[Figure: a problem passing through partitioning, communication, agglomeration, and mapping]
5 1. Partitioning
Dividing computation and data into pieces
- Domain decomposition: divide the data into pieces, e.g.,
  - an array into sub-arrays (reduction)
  - a loop into sub-loops (matrix multiplication)
  - a search space into sub-spaces (chess)
- Functional decomposition: divide the computation into pieces, e.g.,
  - pipelines (floating-point multiplication)
  - workflows (payroll processing)
- Then determine how to associate data with computations
6 Partitioning
The individual pieces are called primitive tasks. Desirable attributes of a partition:
- Many more primitive tasks than processors on the target computer
- Primitive tasks of roughly equal size (in computation and data)
- Number of tasks increases with problem size
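A common way to obtain tasks of roughly equal size from an array is a block decomposition. The sketch below is illustrative only: the helper names `block_low`, `block_high`, `block_size`, and `domain_decompose` are hypothetical, and the formula `rank * n // p` is one standard choice that keeps block sizes within one element of each other.

```python
def block_low(rank, p, n):
    """Index of the first element owned by task `rank` (0 <= rank < p)."""
    return rank * n // p

def block_high(rank, p, n):
    """Index of the last element owned by task `rank`."""
    return block_low(rank + 1, p, n) - 1

def block_size(rank, p, n):
    return block_high(rank, p, n) - block_low(rank, p, n) + 1

def domain_decompose(data, p):
    """Split `data` into p contiguous sub-arrays whose sizes differ by at most one."""
    n = len(data)
    return [data[block_low(r, p, n):block_high(r, p, n) + 1] for r in range(p)]
```

For example, 10 elements over 3 tasks give blocks of sizes 3, 3, and 4.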
7 Example of domain decomposition
8 Example of functional decomposition
9 2. Communication
Determine the values passed among tasks
- Local communication: a task needs values from a small number of other tasks; create channels illustrating the data flow
- Global communication: a significant number of tasks contribute data to perform a computation; don't create channels for them early in the design
10 Desirable attributes for communication
- Balanced: communication operations are balanced among tasks
- Small degree: each task communicates with only a small group of neighbors
- Concurrency: tasks can perform communications concurrently, and tasks can perform computations concurrently
11 3. Agglomeration
Agglomeration is the process of grouping tasks into larger tasks to improve performance. Here, minimizing communication is typically a design goal.
- Grouping tasks that communicate with each other eliminates the communication between them; this is called increasing the locality.
- Grouping tasks can also allow us to combine multiple communications into one.
12 Desirable attributes of agglomeration
- Increases the locality of the parallel algorithm
- Agglomerated tasks have similar computational and communication costs
- Number of tasks increases with problem size
- Number of tasks is as small as possible, yet at least as great as the number of processors on the target computer
13 4. Mapping
Mapping is the process of assigning agglomerated tasks to processors. Here, we're thinking of a distributed-memory machine.
- If we choose the number of agglomerated tasks to equal the number of processors, the mapping is already done: each processor gets one agglomerated task.
14 Mapping goals
- Processor utilization: processors should have roughly equal computational and communication costs
- Minimize interprocessor communication
This can be posed as a graph partitioning problem:
- Each partition should have roughly the same number of nodes
- The partition should cut as few edges as possible
15 Partitioning a graph
[Figure: alternative partitions of a task graph between processors P0 and P1]
Equalizing processor utilization and minimizing interprocessor communication are often competing forces.
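The two mapping goals can be measured directly on a task graph: count the nodes assigned to each processor and the edges cut by the partition. This sketch uses a hypothetical `evaluate_mapping` helper and a toy four-task chain; it only scores a given mapping and does not search for a good one.

```python
def evaluate_mapping(edges, owner, p):
    """Return (tasks per processor, number of cut edges) for a mapping.

    edges: list of (u, v) pairs, one per channel in the task graph
    owner: dict mapping each task to a processor in range(p)
    """
    load = [0] * p
    for task, proc in owner.items():
        load[proc] += 1
    cut = sum(1 for u, v in edges if owner[u] != owner[v])
    return load, cut

# A chain of four tasks: 0 - 1 - 2 - 3
chain = [(0, 1), (1, 2), (2, 3)]
contiguous = {0: 0, 1: 0, 2: 1, 3: 1}
alternating = {0: 0, 1: 1, 2: 0, 3: 1}
all_on_p0 = {t: 0 for t in range(4)}
```

Both `contiguous` and `alternating` balance the load, but the first cuts one edge and the second cuts three; `all_on_p0` cuts no edges at the price of leaving P1 idle, which is the tension the slide describes.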
16 Mapping heuristics
- Static number of tasks
  - Structured communication
    - Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor
    - Variable computation time per task: cyclically map tasks to processors
  - Unstructured communication: use a static load-balancing algorithm
- Dynamic number of tasks
  - Use a run-time task-scheduling algorithm, e.g., a master-slave strategy
  - Use a dynamic load-balancing algorithm, e.g., share load among neighboring processors, remapping periodically
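Why a cyclic mapping helps when computation time varies per task can be seen with a toy cost model. The sketch assumes, purely for illustration, that task t costs t + 1 units; `block_owner`, `cyclic_owner`, and `processor_loads` are hypothetical helper names.

```python
def block_owner(t, p, n):
    """Processor owning task t when n tasks are split into p contiguous blocks."""
    return (p * (t + 1) - 1) // n

def cyclic_owner(t, p):
    """Processor owning task t under a cyclic (round-robin) mapping."""
    return t % p

def processor_loads(costs, p, owner):
    """Total computation assigned to each processor for a given owner function."""
    loads = [0] * p
    for t, c in enumerate(costs):
        loads[owner(t)] += c
    return loads

costs = list(range(1, 9))   # hypothetical: 8 tasks whose cost grows with the index
block_loads = processor_loads(costs, 4, lambda t: block_owner(t, 4, 8))
cyclic_loads = processor_loads(costs, 4, lambda t: cyclic_owner(t, 4))
```

Here the busiest processor receives 15 units under the block mapping but only 12 under the cyclic mapping, because each processor gets a mix of cheap and expensive tasks.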
17 Example 1. Boundary value problems
[Figure: a rod surrounded by insulation; labels: ice water, rod, insulation]
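18 Boundary Value Problem
Heat conduction physics:
$$\frac{\partial u}{\partial t} = a^2 \frac{\partial^2 u}{\partial x^2}, \qquad a^2 = \frac{k}{c}$$
Discretization, with $u_{i,j}$ = temperature at position $i$ and time $j$:
$$\frac{\partial u}{\partial t} \approx \frac{u_{i,j+1} - u_{i,j}}{\Delta t}, \qquad \frac{\partial^2 u}{\partial x^2} \approx \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{(\Delta x)^2}$$
$$u_{i,j+1} = r\,u_{i-1,j} + (1 - 2r)\,u_{i,j} + r\,u_{i+1,j}, \qquad r = \frac{a^2\,\Delta t}{(\Delta x)^2}$$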
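The discretized update rule translates directly into a sequential time-stepping loop. In this sketch the end values `u[0]` and `u[-1]` play the role of the fixed boundary (for instance, 0 for the ice water), and the function names are hypothetical. This explicit scheme is known to be stable only for $r \le 1/2$.

```python
def heat_step(u, r):
    """One explicit time step: new u[i] from u[i-1], u[i], u[i+1] at the previous time.

    u[0] and u[-1] are boundary values and are held fixed.
    """
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = r * u[i - 1] + (1 - 2 * r) * u[i] + r * u[i + 1]
    return new

def solve_rod(u0, r, m):
    """Advance the initial temperatures u0 by m time steps."""
    u = list(u0)
    for _ in range(m):
        u = heat_step(u, r)
    return u
```

With `u0 = [0.0, 100.0, 100.0, 100.0, 0.0]` and `r = 0.25`, one step cools the points next to the ends to 75.0 while the middle stays at 100.0.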
19 Boundary Value Problem
Partition:
- One data item per grid point
- Associate one primitive task with each grid point
- Two-dimensional domain decomposition
Communication:
- Identify the communication pattern between primitive tasks
- Each interior primitive task has three incoming and three outgoing channels
20 Boundary Value Problem: agglomeration and mapping (figure)
21 Model Analysis
- $\chi$: time to update one element
- $n$: number of elements
- $m$: number of iterations
Sequential execution time: $m\,n\,\chi$
Parallel execution:
- $p$: number of processors
- Message time for $q$ items: $\lambda + q/\beta$, which is approximately $\lambda$ when $q$ is small, as for the single values exchanged here
Parallel execution time: $m\,(\chi \lceil n/p \rceil + 2\lambda)$
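The agglomerated version gives each task a contiguous block of the rod plus one "ghost" copy of each neighbor's boundary value, so every iteration costs one exchange with each neighbor followed by purely local updates. The sketch below simulates this in a single process (a hypothetical `parallel_heat` helper; it assumes $1 \le p \le n - 2$) and also evaluates the slide's parallel-time model.

```python
import math

def parallel_heat(u0, r, m, p):
    """Simulate p tasks, each owning a contiguous block of interior points plus one
    ghost cell on each side. Assumes 1 <= p <= len(u0) - 2.
    Returns the final temperatures and the number of point-to-point messages."""
    inner = len(u0) - 2
    # First owned index of each task (plus a sentinel), as in a block decomposition.
    first = [1 + k * inner // p for k in range(p + 1)]
    # local[k] = [left ghost, owned values..., right ghost]
    local = [list(u0[first[k] - 1:first[k + 1] + 1]) for k in range(p)]
    messages = 0
    for _ in range(m):
        # Communication: neighbours swap boundary values (2 messages per pair).
        for k in range(p - 1):
            local[k][-1] = local[k + 1][1]
            local[k + 1][0] = local[k][-2]
            messages += 2
        # Computation: each task updates its owned points using only local data.
        for k in range(p):
            old = local[k]
            local[k] = ([old[0]]
                        + [r * old[i - 1] + (1 - 2 * r) * old[i] + r * old[i + 1]
                           for i in range(1, len(old) - 1)]
                        + [old[-1]])
    owned = [x for block in local for x in block[1:-1]]
    return [u0[0]] + owned + [u0[-1]], messages

def predicted_parallel_time(m, n, p, chi, lam):
    """Model from the slide: m * (chi * ceil(n / p) + 2 * lam)."""
    return m * (chi * math.ceil(n / p) + 2 * lam)
```

The simulated result matches the sequential update exactly, while each task sends at most two messages per iteration, which is where the $2\lambda$ term comes from.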
22 Example: Parallel reduction
Given an associative operator $\oplus$, compute $a_0 \oplus a_1 \oplus a_2 \oplus \cdots \oplus a_{n-1}$.
Examples of $\oplus$: add, multiply, and, or, maximum, minimum
Data decomposition: one task per value (each task holds one of the $a_i$)
23 Parallel reduction: further steps to reach a binomial tree
24–27 Parallel reduction (figure sequence)
28 Parallel reduction: binomial tree
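The binomial-tree pattern can be simulated in a few lines: at each step, every task whose index is an odd multiple of the current stride sends its partial result to the task `stride` positions to its left. This is a single-process sketch with a hypothetical `tree_reduce` name; the combines within one step involve disjoint pairs, so on a real machine they would run concurrently.

```python
import operator

def tree_reduce(values, op=operator.add):
    """Binomial-tree reduction: values[k] is the value held by task k.
    Returns (result held by task 0, number of communication steps)."""
    partial = list(values)
    p = len(partial)
    steps = 0
    stride = 1
    while stride < p:
        # Tasks k = stride, 3*stride, 5*stride, ... send their partial result
        # to task k - stride, which combines it with its own (left operand first).
        for k in range(stride, p, 2 * stride):
            partial[k - stride] = op(partial[k - stride], partial[k])
        stride *= 2
        steps += 1
    return partial[0], steps
```

Because the left operand always comes from the lower-numbered block, the operator only needs to be associative, not commutative; the number of steps is $\lceil \log p \rceil$.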
29 Agglomeration
[Figure: four agglomerated tasks, each labeled "sum"]
30 Analysis
Parallel running time, with
- $\chi$: time to perform the binary operation
- $\lambda$: time to communicate a value via a channel
- $n$ values and $p$ tasks
Time for each task to perform its local computation: $(\lceil n/p \rceil - 1)\,\chi$
Communication steps: $\lceil \log p \rceil$; each receive is followed by one operation
Total time: $(\lceil n/p \rceil - 1)\,\chi + \lceil \log p \rceil\,(\lambda + \chi)$
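The two terms of the total time can be checked by counting operations in a simulation: each of $p$ agglomerated tasks first reduces its own block, then the $p$ partial sums are combined in a binomial tree. The sketch uses addition and hypothetical function names, and assumes $1 \le p \le n$.

```python
import math

def agglomerated_sum(values, p):
    """p tasks each sum their own block, then combine partial sums in a binomial tree.
    Assumes 1 <= p <= len(values).
    Returns (total, largest number of local additions, communication steps)."""
    n = len(values)
    partial, local_ops = [], []
    for k in range(p):
        block = values[k * n // p:(k + 1) * n // p]
        acc = block[0]
        for x in block[1:]:
            acc += x                      # (block size - 1) local operations
        partial.append(acc)
        local_ops.append(len(block) - 1)
    steps, stride = 0, 1
    while stride < p:
        for k in range(stride, p, 2 * stride):
            partial[k - stride] += partial[k]   # one receive, then one operation
        stride *= 2
        steps += 1
    return partial[0], max(local_ops), steps

def predicted_reduction_time(n, p, chi, lam):
    """(ceil(n/p) - 1) * chi + ceil(log2 p) * (lam + chi)"""
    return (math.ceil(n / p) - 1) * chi + math.ceil(math.log2(p)) * (lam + chi)
```

For 16 values on 4 tasks, the busiest task performs 3 local additions and there are 2 communication steps, matching $(\lceil 16/4 \rceil - 1)\chi + 2(\lambda + \chi)$.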
31 Example: the N-body problem
[Figure: bodies B1, B2, B3; labels: mass m, position (x, y), velocity v, forces f1 and f2]
32 The N-body problem (figure)
33 The N-body problem: partitioning
Domain partitioning:
- Assume one task per particle
- Each task holds its particle's position, velocity vector, and mass
Each iteration:
- Get the positions and masses of all other particles
- Compute the new position and velocity
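Once a task has the positions and masses of all particles, its own update needs no further communication. The sketch below shows the per-task work under stated assumptions not taken from the slides: 2-D Newtonian gravity with a hypothetical constant `G = 1.0`, distinct particle positions, and a semi-implicit Euler step (velocity first, then position with the new velocity).

```python
import math

G = 1.0  # hypothetical gravitational constant in arbitrary units

def step_particle(i, pos, vel, mass, dt):
    """New (position, velocity) of particle i, given gathered pos and mass of all particles."""
    xi, yi = pos[i]
    fx = fy = 0.0
    for j, (xj, yj) in enumerate(pos):
        if j == i:
            continue
        dx, dy = xj - xi, yj - yi
        d = math.hypot(dx, dy)               # assumes particles never coincide
        f = G * mass[i] * mass[j] / (d * d)
        fx += f * dx / d
        fy += f * dy / d
    vx = vel[i][0] + dt * fx / mass[i]
    vy = vel[i][1] + dt * fy / mass[i]
    return (xi + dt * vx, yi + dt * vy), (vx, vy)

def nbody_iteration(pos, vel, mass, dt):
    """One iteration: each 'task' i computes its own particle's update."""
    results = [step_particle(i, pos, vel, mass, dt) for i in range(len(pos))]
    return [r[0] for r in results], [r[1] for r in results]
```

Two equal masses at rest at (-1, 0) and (1, 0) start moving toward each other with equal and opposite velocities.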
34 Gather and All-Gather operations
- Gather operation (sequential): $p - 1$ steps
- All-Gather operation
35 All-Gather
To avoid conflicts, all-gather is performed in $\log p$ steps, doubling the data in each step.
- Communication time for $n$ items: $\lambda + n/\beta$
- With $p$ tasks there are $\log p$ iterations
- The number of items doubles at each iteration:
$$\sum_{i=1}^{\log p} \left(\lambda + \frac{2^{i-1}\,n}{\beta p}\right) = \lambda \log p + \frac{n(p-1)}{\beta p}$$
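The doubling pattern can be simulated with partners chosen by XOR: at step $i$, task $k$ exchanges everything it holds with task $k \oplus 2^{i-1}$. This single-process sketch (hypothetical names; it assumes $p$ is a power of two) also records how many items each task sends per step, so the sum can be compared with $n(p-1)/p$.

```python
import math

def all_gather(blocks):
    """Recursive-doubling all-gather among p = len(blocks) tasks (p a power of two).
    Returns (data held by every task afterwards, steps, items sent per task per step)."""
    p = len(blocks)
    if p & (p - 1):
        raise ValueError("this sketch assumes p is a power of two")
    held = [{k: list(blocks[k])} for k in range(p)]   # task k starts with its own block
    steps, sent = 0, []
    dist = 1
    while dist < p:
        before = [dict(h) for h in held]              # messages carry pre-step data
        for k in range(p):
            held[k].update(before[k ^ dist])          # exchange with partner k XOR dist
        sent.append(sum(len(b) for b in before[0].values()))
        dist *= 2
        steps += 1
    gathered = [[x for r in sorted(h) for x in h[r]] for h in held]
    return gathered, steps, sent

def all_gather_time(n, p, lam, beta):
    """lam * log2(p) + n * (p - 1) / (beta * p)"""
    return lam * math.log2(p) + n * (p - 1) / (beta * p)
```

With 8 items on 4 tasks, each task sends 2 items and then 4 items, for a total of $6 = 8 \cdot 3 / 4$.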
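36 Analysis of the N-body problem, parallel version
$n$ bodies, $p$ tasks, $m$ iterations over time; $\chi$ is the time to compute the new position and velocity of one particle
Total time excluding I/O:
$$m\left(\lambda \log p + \frac{n(p-1)}{\beta p} + \chi\,\frac{n}{p}\right)$$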
37 Considering I/O
- Reading or writing $n$ items of data through an I/O channel: $\lambda_{io} + n/\beta_{io}$
- In the N-body problem, the initial values must be transmitted to the other tasks
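38 Scatter operation
Improving on a sequential scatter:
1. The first task transmits n/2 items to another task
2. The 2 tasks each transmit n/4 items to 2 other tasks
3. The 4 tasks each transmit n/8 items to 4 other tasks
4. And so on
The cost is the same sum as for all-gather, with the terms taken in reverse order:
$$\sum_{i=1}^{\log p} \left(\lambda + \frac{2^{i-1}\,n}{\beta p}\right) = \lambda \log p + \frac{n(p-1)}{\beta p}$$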
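The halving steps above can be simulated by letting every current holder split its range and send the upper part to the task `half` positions away. In this sketch (hypothetical names; $p$ must be a power of two), task $k$ ends up with the block `data[k*n//p:(k+1)*n//p]`, and the message sizes shrink by half each step.

```python
def scatter(data, p):
    """Recursive-halving scatter from task 0 to p tasks (p a power of two).
    Task k ends up with data[k*n//p:(k+1)*n//p].
    Returns (blocks held by each task, steps, largest message size at each step)."""
    if p < 1 or p & (p - 1):
        raise ValueError("this sketch assumes p is a power of two")
    n = len(data)
    held = {0: list(data)}          # task 0 (the root) starts with everything
    group, steps, sent = p, 0, []
    while group > 1:
        half = group // 2
        sizes = []
        for src in sorted(held):    # every current holder splits its range in two
            cut = (src + half) * n // p - src * n // p
            held[src + half] = held[src][cut:]   # send the upper part
            held[src] = held[src][:cut]          # keep the lower part
            sizes.append(len(held[src + half]))
        sent.append(max(sizes))
        group = half
        steps += 1
    return [held[k] for k in range(p)], steps, sent
```

For 8 items on 4 tasks the messages carry 4 and then 2 items, again totalling $n(p-1)/p = 6$.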
39 Analysis considering I/O
Total time after $m$ iterations = initial reading + scattering + computing $m$ iterations + final gathering + writing:
$$2\left(\lambda_{io} + \frac{n}{\beta_{io}}\right) + 2\left(\lambda \log p + \frac{n(p-1)}{\beta p}\right) + m\left(\lambda \log p + \frac{n(p-1)}{\beta p} + \chi\,\frac{n}{p}\right)$$
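The full cost model is easy to evaluate numerically, for example to compare candidate values of $p$. This sketch simply transcribes the formula above into a function (the name and the sample parameter values are hypothetical); all parameters must be in consistent time units.

```python
import math

def nbody_total_time(n, p, m, chi, lam, beta, lam_io, beta_io):
    """Total time: 2 * I/O + 2 * scatter/gather + m iterations of all-gather plus compute."""
    io = lam_io + n / beta_io                                     # read or write n items
    collective = lam * math.log2(p) + n * (p - 1) / (beta * p)    # scatter, gather, or all-gather
    return 2 * io + 2 * collective + m * (collective + chi * n / p)
```

With $n = 8$, $p = 4$, $m = 3$, $\chi = \lambda = 1$, $\beta = 2$, $\lambda_{io} = 10$, and $\beta_{io} = 4$, the I/O term is 12, each collective costs 5, and the total is $24 + 10 + 3(5 + 2) = 55$.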
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationEE382N (20): Computer Architecture - Parallelism and Locality Lecture 13 Parallelism in Software IV
EE382 (20): Computer Architecture - Parallelism and Locality Lecture 13 Parallelism in Software IV Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality (c) Rodric Rabbah, Mattan
More informationAlgorithms and Applications
Algorithms and Applications 1 Areas done in textbook: Sorting Algorithms Numerical Algorithms Image Processing Searching and Optimization 2 Chapter 10 Sorting Algorithms - rearranging a list of numbers
More informationCS 475: Parallel Programming Introduction
CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.
More informationTELCOM2125: Network Science and Analysis
School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning
More informationCPSC 340: Machine Learning and Data Mining. Density-Based Clustering Fall 2016
CPSC 340: Machine Learning and Data Mining Density-Based Clustering Fall 2016 Assignment 1 : Admin 2 late days to hand it in before Wednesday s class. 3 late days to hand it in before Friday s class. 0
More informationParallelization Strategy
COSC 335 Software Design Parallel Design Patterns (II) Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure
More informationA First Step to the Evaluation of SimGrid in the Context of a Real Application. Abdou Guermouche
A First Step to the Evaluation of SimGrid in the Context of a Real Application Abdou Guermouche Hélène Renard 19th International Heterogeneity in Computing Workshop April 19, 2010 École polytechnique universitaire
More information