Parallel Programming
1 Parallel Programming 7. Data Parallelism Christoph von Praun 07-1
2 (1) Parallel algorithm structure design space
- Organization by Data: (1.1) Geometric Decomposition, (1.2) Recursive Data
- Organization by Tasks: (1.3) Task Parallelism, (1.4) Divide and Conquer
- Organization by Data Flow: (1.5) Pipeline
07-2
3 (1.1) Geometric decomposition Context: The application operates on a large data structure with many data items. The operation on each data item has regular access patterns and clear dependencies. The application is typically data-intensive, with little computation per data item. 07-3
4 Example: Heat transfer simulation. [Figure: 2D temperature field (normal vs. hot cells) evolving over n simulation steps] 07-4
5 Stencil (= schema according to which T is updated): [Figure: cell T with its four neighbors t1..t4; normal vs. hot temperature] T_new = (t1_old + t2_old + t3_old + t4_old) / 4. Iterate until |T_new - T_old| < ε. 07-5
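As a point of reference, a sequential sketch of one such update sweep in the X10 style used later in this deck; told/tnew, NROWS, NCOLS and the convergence bookkeeping are illustrative assumptions, not from the slides:

    // One Jacobi-style sweep over the interior of an NROWS x NCOLS grid
    // (indexed from 0): every interior cell becomes the average of its four
    // old neighbors; the returned delta drives the "until < eps" test.
    def sweep(told: Array[Double], tnew: Array[Double]): Double {
        var delta: Double = 0.0;
        for ([r,c] in (1..(NROWS-2)) * (1..(NCOLS-2))) {
            tnew(r,c) = (told(r-1,c) + told(r+1,c) + told(r,c-1) + told(r,c+1)) / 4.0;
            delta = Math.max(delta, Math.abs(tnew(r,c) - told(r,c)));
        }
        return delta; // caller swaps told/tnew and repeats while delta >= eps
    }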
6 Forces Data decomposition: different parts of the data structure are assigned to different activities. Which granularity and topology? A naive decomposition may not be ideal. Scheduling: coordination is required if the operation of one activity requires input from data belonging to another activity. 07-6
7 Solution (template) Partition the data into chunks, one chunk per activity. Activities must access their chunks plus inputs efficiently (this may require explicit communication if the data is distributed). Each activity updates only its own chunk. 07-7
8 Use data copies ("ghost cells") to reduce dependencies across different chunks (old/new schema):
- The red activity keeps a copy of t2_old.
- The red and blue activity exchange data in lockstep.
[Figure: stencil with T and its neighbors t1..t4 split across two chunks] T_new = (t1_old + t2_old + t3_old + t4_old) / 4. 07-8
9 For certain stencils: avoid dependencies by alternating updates of red and black elements. Activities operate in lockstep (all activities update red, then all activities update black, ...). 07-9
10 Computations that proceed in lockstep are best described following the SPMD (Single Program, Multiple Data) style:

    Clock clk = Clock.make();
    for (c in chunks) {
      // each chunk is processed by a separate activity
      async clocked (clk) {
        <update red in chunk c>
        clk.next();
        <update black in chunk c>
      }
    }
    clk.drop(); // spawning activity unregisters after creating the clocked activities

The same (single) program is run by different activities on multiple chunks of data. 07-10
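For illustration, one way the <update red in chunk c> placeholder could be realized for the four-point stencil from slide 5 — a sketch assuming each chunk covers grid rows lo..hi, with NCOLS as above:

    // Update all "red" cells (r+c even) of rows lo..hi in place; reading only
    // "black" neighbors makes the in-place update safe.
    // <update black in chunk c> is the same loop with odd parity.
    def updateRed(t: Array[Double], lo: Int, hi: Int) {
        for ([r,c] in (lo..hi) * (1..(NCOLS-2)))
            if ((r + c) % 2 == 0)
                t(r,c) = (t(r-1,c) + t(r+1,c) + t(r,c-1) + t(r,c+1)) / 4.0;
    }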
11 Consequences The data decomposition (which chunk is processed by which activity) must be chosen wisely, according to the caching behavior and in-memory layout of the data structures. See the patterns in Category (2). 07-11
12 Which decomposition is preferable?

    for (r in rows) async {
      for (c in columns)
        <update array at (r,c)>
    }

    for (c in columns) async {
      for (r in rows)
        <update array at (r,c)>
    }

07-12
13 ... it depends on the memory layout. A cache line holds variables at consecutive addresses.
- Row major: offset = row*ncols + col (programming languages: C, X10)
- Column major: offset = col*nrows + row (programming languages: Fortran, MATLAB)
07-13
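The two formulas, spelled out as code over a flat rank-1 array (a hedged sketch; flat and the helper names are illustrative):

    // Row major (C, X10): elements of one row are adjacent in memory.
    def atRowMajor(flat: Array[Double], r: Int, c: Int, ncols: Int): Double {
        return flat(r*ncols + c);
    }
    // Column major (Fortran, MATLAB): elements of one column are adjacent.
    def atColMajor(flat: Array[Double], r: Int, c: Int, nrows: Int): Double {
        return flat(c*nrows + r);
    }

Traversing along the inner dimension of the layout visits consecutive addresses, and hence few cache lines; that is exactly what the two decompositions on the previous slide differ in.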
14 Sequential traversal in X10 (row major):

    val region = (0..NROWS) * (0..NCOLS);
    val arr = new Array[Double](region, 0.0);

    for ([r,c] in region)  // r, c: Int
      <update arr(r,c)>

    for (p in region)      // p: Point{rank==2}
      <update arr(p)>

07-14
15 Parallel traversal in X10 (row major):

    val region = (0..NROWS) * (0..NCOLS);
    val arr = new Array[Double](region, 0.0);

    for ([r] in region.projection(0)) async
      for ([c] in region.projection(1))
        <update arr(r,c)>

07-15
16 Consequences (cont.) Explicit exchange of data at synchronization points:
- may require explicit communication on platforms without shared memory (e.g. MPI)
- is a copy operation in shared memory (from real cells to ghost cells)
07-16
17 Abstract model: mesh. Implementation: array with ghost cells. [Figure: local memories of the blue and red activity; at the boundary, each activity copies from its real cells to the other's ghost cells] 07-17
18 Why ghost cells?
- To facilitate data independence on shared memory machines.
- To aggregate communication in distributed memory systems and enable computations on local memory.
07-18
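A minimal sketch of the shared-memory copy operation, assuming a 1D decomposition where each chunk of width B stores one ghost cell on each side (indices 0 and B+1); all names are illustrative:

    // Copy the neighbors' boundary ("real") cells into this chunk's ghost cells.
    def updateGhosts(chunks: Array[Array[Double]], me: Int, nchunks: Int, B: Int) {
        if (me > 0)
            chunks(me)(0) = chunks(me-1)(B);    // left neighbor's last real cell
        if (me < nchunks - 1)
            chunks(me)(B+1) = chunks(me+1)(1);  // right neighbor's first real cell
    }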
19 Consequences (cont.) Activities operate in lockstep; performance depends on the (dynamic) frequency of synchronization. Frequent synchronization is a sign of frequent dependences and hence little parallelism. 07-19
20 Lockstep computation. Data exchange (ghost cell update) at synchronization points.

    Clock clk = Clock.make();
    for (c in chunks) {
      async clocked (clk) {
        <initialize ghost cells>
        clk.next();
        while (!done) {
          <read local data and ghost cells, update local data>
          clk.next();
          <update ghost cells>
          clk.next();
        }
      }
    }
    clk.drop(); // spawning activity unregisters from the clock

07-20
21 Lockstep computation: [Diagram: two activities side by side, each executing <init GC>, then repeatedly <local stencil computation> and <update GC>, synchronizing at each clk.next()] 07-21
22 Further examples Dense linear algebra computations, e.g. solvers for systems of linear equations (LINPACK, the measure of floating-point performance for the TOP 500 supercomputer list) and matrix multiplication. 07-22
23 Matrix multiplication C = A · B, with A of size n × m, B of size m × k, and C of size n × k. Naive parallel algorithm:
- Each element c(i,j) is computed by an activity.
- The activity reads row i of A and column j of B.
07-23
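A minimal sketch of this naive algorithm in the X10 style used above; finish/async is assumed available as in standard X10 (the slides themselves use async and clocks), and all names are illustrative:

    // One activity per element c(i,j); each reads row i of A and column j of B.
    def matmul(a: Array[Double], b: Array[Double], c: Array[Double],
               n: Int, m: Int, k: Int) {
        finish for ([i,j] in (0..(n-1)) * (0..(k-1))) async {
            var sum: Double = 0.0;
            for ([l] in 0..(m-1))
                sum += a(i,l) * b(l,j);
            c(i,j) = sum;
        }
    }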
24 [Figure: in row-major layout, a row of A lies on one cache line, while a column of B spans several (3 in the figure)] Challenge:
- Computing c(i,j) requires 2m read operations.
- These 2m variables typically fall on many different cache lines (row major).
- Reading a line from memory into the cache incurs a significant delay (cache miss).
07-24
25 Further examples (cont.) Finite element methods (structured grids), e.g. simulation of electromagnetism, fluid dynamics, heat transfer (PDE solvers). [Figure: simulation of airflow and temperature in a data center rack with different component layouts. Source: [2]] 07-25
26 Image processing, e.g. Gaussian image blurring: a per-pixel stencil computation; the value of a pixel is the weighted average of neighboring pixels. [Figure: original image vs. blurs with 5px and 20px radius] 07-26
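A sketch of one such per-pixel stencil pass, assuming a 3×3 weight matrix w (illustrative; a real Gaussian kernel is larger and normalized so the weights sum to 1) and image dimensions H × W:

    // Each output pixel is the weighted average of its 3x3 neighborhood.
    def blur(img: Array[Double], out: Array[Double], w: Array[Double]) {
        finish for ([r,c] in (1..(H-2)) * (1..(W-2))) async {
            var acc: Double = 0.0;
            for ([dr,dc] in (-1..1) * (-1..1))
                acc += w(dr+1, dc+1) * img(r+dr, c+dc);
            out(r,c) = acc;
        }
    }

In practice one would assign a block of pixels, not a single pixel, to each activity, per the geometric decomposition pattern.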
27 Limits on parallelism? Conceptually: most data-parallel algorithms are embarrassingly parallel:
- no dependency among tasks (e.g. matrix multiplication)
- no or few synchronization points
- lots of / arbitrary parallelism
- perfect scaling
In practice: limitations due to the implementation and the physics of the machine... Source: [3] 07-27
28 Scaling of data parallel problems Strong scaling: fix the overall problem/data size and vary the number of computational resources. Do additional computational resources shorten the solution of a fixed-size problem? Sometimes called scale-up. Very challenging: a decreasing amount of computation per activity, less opportunity for data reuse within an activity (caching), and a need for very efficient coordination and sharing between computational resources. Source: [3] 07-28
29 Scaling of data parallel problems Weak scaling: fix the problem size per computational unit and vary the number of computational units. Can a larger problem be computed in the same time with additional computational resources? Sometimes called scale-out. Less challenging than strong scaling. Examples: cluster computing, blade centers, many Google applications. Source: [3] 07-29
30 (1) Parallel algorithm structure design space
- Organization by Data: (1.1) Data Parallelism (Geometric Decomposition), (1.2) Recursive Data
- Organization by Tasks: (1.3) Task Parallelism, (1.4) Divide and Conquer
- Organization by Data Flow: (1.5) Pipeline
07-30
31 (1.2) Recursive data Context: like (1.1), but the data structure is recursive: lists, trees, graphs. Operations are sometimes recursive, sometimes seem inherently sequential. 07-31
32 Example: Reduction Data structure: List of values 3, 5, 17, 3, 6, 8, 12, 13 Operation: compute sum of values 07-32
33 Sequential algorithm: (((((((3+5)+17)+3)+6)+8)+12)+13) [Figure: linear chain of additions over time] 07-33
34 Sequential algorithm

    def sum(arr: Array[Int]{rank==1}): Int {
      var sum: Int = 0;
      for (i in arr)
        sum += arr(i);
      return sum;
    }

07-34
35 Parallel algorithm: pair-wise summation ((3+5)+(17+3))+((6+8)+(12+13)) [Figure: balanced addition tree over time, with root value 67] This just changes the evaluation order of the sequential program: a simple change of schedule enables/increases parallelism. 07-35
36 Parallel algorithm

    def sum(arr: Array[Int]{rank==1}): Int {
      return pairwise(arr, arr.region.min(0), arr.region.max(0));
    }

    def pairwise(arr: Array[Int]{rank==1}, lo: Int, hi: Int): Int {
      if (lo == hi)
        return arr(lo);
      else {
        val lsum = Future.make(() => pairwise(arr, lo, lo + (hi-lo)/2));
        val rsum = Future.make(() => pairwise(arr, lo + (hi-lo)/2 + 1, hi));
        return lsum.force() + rsum.force();
      }
    }

07-36
37 Semantics of the X10 future

    S1;
    val v1: Future[T] = Future.make(E1);
    S2;
    val v2: T = v1.force();

A feasible execution: 1) spawn the asynchronous evaluation of expression E1; 2) force the future and claim the result. [Diagram: happens-before order among s1, e1, s2, and v2 = force] 07-37
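A minimal usage example of this schema, built only from the Future operations shown above (expensive() and doOtherWork() are illustrative placeholders):

    val v1 = Future.make(() => expensive()); // 1) spawn async evaluation of expensive()
    doOtherWork();                           // runs concurrently with expensive()
    val v2 = v1.force();                     // 2) block and claim the result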
38 Parallel algorithm

    import x10.util.concurrent.Future;

    def sum(arr: Array[Int]{rank==1}): Int {
      return pairwise(arr, arr.region.min(0), arr.region.max(0));
    }

    def pairwise(arr: Array[Int]{rank==1}, lo: Int, hi: Int): Int {
      if (lo == hi)
        return arr(lo);
      else {
        // concurrent recursive descent
        val lsum = Future.make(() => pairwise(arr, lo, lo + (hi-lo)/2));
        val rsum = Future.make(() => pairwise(arr, lo + (hi-lo)/2 + 1, hi));
        // block until results are available
        return lsum.force() + rsum.force();
      }
    }

07-38
39 The algorithm follows the divide and conquer pattern:
- always possible for recursive operations
- a natural opportunity for parallelization
07-39
40 Consequences The problem and its solution must be cast into a recursive form. This sometimes incurs additional cost that must be traded off against the performance improvement due to parallelization. In the example: additional variables for temporary results, a recursive caller chain. A recursive formulation may not be intuitive to read. 07-40
41 The amount of computation in the recursive descent must be significant to offset the cost of communication and synchronization. Example: the sequential sum may be faster for arrays smaller than some threshold size (1000 elements in the code on the next slide); for larger arrays, take the recursive, parallel algorithm. 07-41
42 Parallel algorithm

    import x10.util.concurrent.Future;

    val THRESHOLD = 1000;

    def par_sum(arr: Array[Int]{rank==1}): Int {
      return pairwise(arr, arr.region.min(0), arr.region.max(0));
    }

    def pairwise(arr: Array[Int]{rank==1}, lo: Int, hi: Int): Int {
      if (hi-lo < THRESHOLD)
        return seq_sum(arr, lo, hi);
      else {
        val lsum = Future.make(() => pairwise(arr, lo, lo + (hi-lo)/2));
        val rsum = Future.make(() => pairwise(arr, lo + (hi-lo)/2 + 1, hi));
        return lsum.force() + rsum.force();
      }
    }

    def seq_sum(arr: Array[Int]{rank==1}, lo: Int, hi: Int): Int {
      var sum: Int = 0;
      for ((i): Point{rank==1} in [lo..hi])
        sum += arr(i);
      return sum;
    }

07-42
43 Another example: prefix sum (= scan) of a list. Data structure: list of values 3, 5, 17, 3, 6, 8, 12, 13. Operation: compute the partial sum of the first up to the current value in the list: 3, 8, 25, 28, 34, 42, 54, 67. 07-43
44 Sequential algorithm

    def prefix_sum(arr: Array[Int]{rank==1},
                   res: Array[Int]{rank==1 && self.region == arr.region}) {
      for ((i): Point{rank==1} in arr) {
        if (i == 0) res(i) = arr(i);
        else res(i) = res(i-1) + arr(i);
      }
    }

07-44
45 Prefix sum is more difficult to parallelize than sum, because all values res(i), i < k, of the sequential solution are required to compute res(k). 07-45
46-49 Parallel prefix sum [Diagrams over four slides: step-wise parallel scan of the eight values. Each node is labeled i:j and holds the sum of elements i through j; legend: copy, add, complete, temporary. Step 0 copies the input (nodes 0:0, 1:1, ..., 7:7); each following step adds pairs at doubling distance (producing 0:1, 1:2, ..., 6:7, then 0:2, 0:3, 1:4, ..., 4:7, then 0:4, ..., 0:7), until every node 0:k holds a prefix sum, e.g. 0:3 = 28 and 0:7 = 67] 07-46 to 07-49
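The diagrams show the classic step-wise scan: at distance d = 1, 2, 4, ..., every element adds in the value d positions to its left. A minimal sketch of this schedule in the X10 style used above, assuming indices 0..n-1, finish/async as in standard X10, and a temporary array for the "temporary" column of the diagrams (all names illustrative):

    def par_prefix_sum(arr: Array[Int]{rank==1}, res: Array[Int]{rank==1}, n: Int) {
      for ([i] in 0..(n-1)) res(i) = arr(i);        // copy step
      val tmp = new Array[Int](0..(n-1), 0);
      var d: Int = 1;
      while (d < n) {
        finish for ([i] in 0..(n-1)) async {        // one lockstep round
          if (i >= d) tmp(i) = res(i) + res(i-d);   // add: combine with element d to the left
          else        tmp(i) = res(i);              // prefix already complete
        }
        finish for ([i] in 0..(n-1)) async res(i) = tmp(i);
        d *= 2;
      }
    }

This schedule performs O(n log n) additions in log2(n) rounds, versus n-1 additions sequentially; the gain is that all additions within one round are independent.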
50 Sources
[1] Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill: Patterns for Parallel Programming. Addison-Wesley.
[2] Future Facilities:
[3] Maged Michael, Jose Moreira, Doron Shiloach, Robert Wisniewski: "Scale-up x Scale-out: A Case Study using Nutch/Lucene". Parallel and Distributed Processing Symposium (IPDPS), 2007.
07-50
51 This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License. You are free: to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work), under the following conditions: Attribution: you must attribute the work to The Art of Multiprocessor Programming (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike: if you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work; the best way to do this is with a link to the license. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights. 07-51