EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100

1 [10 points]

1. Task parallelism: The computations in a parallel algorithm can be split into a set of tasks for concurrent execution. Task parallelism exploits this by distributing the execution of different tasks across different parallel processing elements.

2. Race condition: When the output of a parallel program for a given input is nondeterministic because it depends on the rate at which the various threads execute, the program has a race condition.

3. PRAM: PRAM is a shared memory programming model which consists of p (p > 1) processors connected to a shared memory and executing in a synchronous manner (using a common clock). Each computation and each access to memory takes 1 unit of time.

4. Shared memory programming model: The shared memory programming model provides a globally shared data space that is accessible to all threads. Threads can also have their own private data. The programmer is responsible for synchronizing access to globally shared data to ensure correctness of the program.

5. Asynchronous execution: Asynchronous execution has no global clock to coordinate execution. The order of execution of instructions depends on the input data, the scheduling algorithm, the speed of the processors, and the speed of the communication network.
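To make definitions 2 and 4 concrete, here is a minimal Python sketch of threads updating globally shared data under programmer-supplied synchronization. The function name `count_with_lock` and the counter workload are illustrative assumptions, not part of the homework.

```python
import threading

def count_with_lock(num_threads, increments):
    """Each thread adds to one shared counter; a lock serializes the updates."""
    counter = {"value": 0}          # globally shared data (hypothetical workload)
    lock = threading.Lock()

    def worker():
        for _ in range(increments):  # the loop variable is thread-private data
            with lock:               # without this, the read-modify-write could race
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

With the lock in place the result is always `num_threads * increments`, independent of how the threads are scheduled.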

2 [25 points]

1. For simplicity, assume the number of threads $w$ is a power of 2, written $w = 2^m$ with $m \le k$. We evenly divide the input vector $p$ into $w$ sub-vectors, each of length $2^{k-m}$, denoted $p_{sub_0}, \ldots, p_{sub_{w-1}}$. Similarly, we obtain $q_{sub_0}, \ldots, q_{sub_{w-1}}$. Then Thread $i$ ($0 \le i < w$) computes the dot product of $p_{sub_i}$ and $q_{sub_i}$. After each thread obtains its partial dot product, we sum the partial dot products following the algorithm in Lecture 7 (Title: Adding in PRAM) to obtain the final result.

The time complexity of the serial execution is $O(2^k) + O(2^k - 1) = O(2^k)$, for the $2^k$ multiplications and $2^k - 1$ additions. The time complexity of the parallel execution is $O(2^{k-m}) + O(\log w) = O(2^{k-m}) + O(m) = O(2^{k-m}) = O(2^k / w)$, where the $O(m)$ reduction term is absorbed as long as the per-thread work $2^{k-m}$ dominates $m$. Therefore, the speedup is $O(2^k) / O(2^k / w) = O(w)$, so this is a scalable solution.

2.

    /* Pseudo code executed by the thread with index id */
    Partial_dot_product[id] = 0;
    // Partial_dot_product is a shared array to store partial dot products;
    // the final result will be output as Partial_dot_product[0] by the thread with index 0
    for (i = id*2^(k-m); i < (id+1)*2^(k-m); i++)
        Partial_dot_product[id] += p_i * q_i;
    end for
    barrier; // A barrier is needed here to synchronize threads
    for (i = 0; i < m; i++)
        if (id mod 2^(i+1) = 0) then
            Partial_dot_product[id] += Partial_dot_product[id + 2^i];
        end if
        barrier;
    end for
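The pseudo code above can be sketched as runnable Python, using `threading.Barrier` for the barrier steps. The function name `parallel_dot` and the use of Python lists are assumptions of this sketch; it follows the two phases of the solution (local partial products, then an m-round tree reduction).

```python
import threading

def parallel_dot(p, q, m):
    """Dot product of p and q using w = 2**m threads.

    Assumes len(p) == len(q) and that the length is divisible by w.
    """
    w = 2 ** m
    n = len(p)
    assert n == len(q) and n % w == 0
    chunk = n // w                    # plays the role of 2^(k-m)
    partial = [0] * w                 # shared array of partial dot products
    barrier = threading.Barrier(w)

    def worker(tid):
        # Phase 1: dot product of this thread's pair of sub-vectors.
        acc = 0
        for i in range(tid * chunk, (tid + 1) * chunk):
            acc += p[i] * q[i]
        partial[tid] = acc
        barrier.wait()                # all partial products are now visible
        # Phase 2: tree reduction in m rounds.
        for r in range(m):
            if tid % (2 ** (r + 1)) == 0:
                partial[tid] += partial[tid + 2 ** r]
            barrier.wait()            # finish round r before starting round r + 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(w)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial[0]                 # the final result ends up in slot 0
```

In each reduction round the writers are the threads whose index is a multiple of 2^(r+1), and the slots they read are not multiples of 2^(r+1), so no slot is read and written in the same round.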

3 [30 points]

1. The shared variables include At_least_one_vertex_has_update and the s array, which records the shortest path lengths.

2.

    /* Pseudo code executed by Thread(i,j) */
    for (k = 0; k < # of vertices; k++)
        if At_least_one_vertex_has_update = true then
            At_least_one_vertex_has_update = false;
            barrier;
            Lock(s(i), s(j));
            if s(i)+w(i,j) < s(j) then
                s(j) = s(i)+w(i,j);
                At_least_one_vertex_has_update = true;
            end if
            Unlock(s(i), s(j));
        else
            Return;
        end if
        barrier;
    end for
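A minimal Python sketch of the same edge-per-thread relaxation follows. Several details are assumptions of this sketch rather than part of the solution: the function name `parallel_shortest_paths`, the edge-list input, a single global lock standing in for Lock(s(i), s(j)), and an extra barrier so that every thread snapshots the shared flag before any thread resets it (this keeps all threads agreeing on whether to continue).

```python
import threading

INF = float("inf")

def parallel_shortest_paths(num_vertices, edges, source):
    """edges is a list of (i, j, weight); one thread per edge, like Thread(i,j)."""
    s = [INF] * num_vertices              # shared shortest path lengths
    s[source] = 0
    if not edges:
        return s
    state = {"updated": True}             # At_least_one_vertex_has_update
    lock = threading.Lock()               # assumption: one lock for the whole s array
    barrier = threading.Barrier(len(edges))

    def worker(i, j, w_ij):
        for _ in range(num_vertices):
            keep_going = state["updated"]  # snapshot the flag
            barrier.wait()                 # everyone has read it
            if not keep_going:
                return                     # all threads return in the same round
            state["updated"] = False
            barrier.wait()                 # everyone has reset it
            with lock:
                if s[i] + w_ij < s[j]:     # relax edge (i, j)
                    s[j] = s[i] + w_ij
                    state["updated"] = True
            barrier.wait()                 # all relaxations of this round are done

    threads = [threading.Thread(target=worker, args=e) for e in edges]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return s
```

For example, `parallel_shortest_paths(4, [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1)], 0)` relaxes the four edges concurrently until a round makes no update.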

4 [25 points]

The Pthreads program discussed in class, when converted to PRAM, will cause multiple writes to the same location C(i, k) in the k-th iteration by threads (i, 1 : n). We can add another for loop within each of the k iterations to serialize the accesses by threads (i, 1 : n); the resulting program takes $n^2$ clocks. A better approach is to rearrange the accesses by the threads to the elements of C. The program can then be implemented in n clocks, as shown by the program below.

    Thread(i,j) {
        Do k from 1 to n
            index = (k + j - 1) % (n+1) + floor((k + j - 1)/(n+1))
            C(i, index) = C(i, index) + A(i,j)*B(j, index)
        End
    }
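To check that the index formula really gives each thread in row i a distinct column at every clock, the schedule can be simulated sequentially. This is a sketch under stated assumptions: the helper names `rotated_index` and `simulate` are hypothetical, matrices are square Python lists, and the inner loop over j stands in for the n threads acting in the same clock.

```python
def rotated_index(k, j, n):
    """1-based column of C that Thread(i, j) updates at step k."""
    t = k + j - 1
    return t % (n + 1) + t // (n + 1)

def simulate(A, B):
    """Simulate the n synchronous steps for n x n matrices A and B."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for k in range(1, n + 1):                  # one PRAM clock per value of k
        for i in range(1, n + 1):
            written = set()
            for j in range(1, n + 1):          # threads (i, 1..n) in the same clock
                idx = rotated_index(k, j, n)
                assert idx not in written      # no two threads share C(i, idx)
                written.add(idx)
                C[i - 1][idx - 1] += A[i - 1][j - 1] * B[j - 1][idx - 1]
    return C
```

For 1 <= k, j <= n the formula is equivalent to ((k + j - 2) mod n) + 1, so over the n steps each thread visits every column exactly once and the result equals the ordinary matrix product.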

5 [10 points]

No, the parallel execution will not produce the same output A. This is because the loop being parallelized has a dependence across its iterations (i.e., a loop-carried dependence). For example, in the serial version, A[i][j-1] is always computed before A[i][j]; in the parallel version, A[i][j-1] may well be computed after A[i][j], depending on how the threads are scheduled. This is problematic because the computation of A[i][j] depends on A[i][j-1].
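The effect of the dependence can be seen deterministically by replaying the iterations of one row in different orders. The loop body of the homework problem is not reproduced in this solution, so the recurrence used here, a[j] = a[j-1] + 1, is a hypothetical stand-in, and the explicit iteration order stands in for one possible thread schedule.

```python
def run_in_order(row, order):
    """Apply the assumed update a[j] = a[j-1] + 1 in the given iteration order."""
    a = list(row)
    for j in order:
        a[j] = a[j - 1] + 1   # reads a[j-1], which an earlier iteration should have written
    return a

row = [0, 0, 0, 0]
serial = run_in_order(row, [1, 2, 3])      # the order the serial loop uses
reordered = run_in_order(row, [3, 2, 1])   # one order a parallel schedule might produce
```

Here `serial` is `[0, 1, 2, 3]` while `reordered` is `[0, 1, 1, 1]`: iterations that run before their predecessor read a stale value.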


More information

Fork / Join Parallelism

Fork / Join Parallelism Fork / Join Parallelism Image courtesy of http://www.llnl.gov/computing/tutorials/openmp/ Speedup limited by linear portion Amdahl s Law, Speedup = 1 / [(1- F) + F/S] Synchronization wait time OpenMP:

More information

Comp2310 & Comp6310 Systems, Networks and Concurrency

Comp2310 & Comp6310 Systems, Networks and Concurrency The Australian National University Final Examination November 2017 Comp2310 & Comp6310 Systems, Networks and Concurrency Study period: 15 minutes Writing time: 3 hours (after study period) Total marks:

More information

Routing algorithms. Jan Lönnberg, 51101M. October 2, Based on G. Tel: Introduction to Distributed Algorithms, chapter 4.

Routing algorithms. Jan Lönnberg, 51101M. October 2, Based on G. Tel: Introduction to Distributed Algorithms, chapter 4. Routing algorithms Jan Lönnberg, 51101M October 2, 2002 Based on G. Tel: Introduction to Distributed Algorithms, chapter 4. 1 Contents Introduction Destination-based routing Floyd-Warshall (single processor)

More information

Introduction Single-source shortest paths All-pairs shortest paths. Shortest paths in graphs

Introduction Single-source shortest paths All-pairs shortest paths. Shortest paths in graphs Shortest paths in graphs Remarks from previous lectures: Path length in unweighted graph equals to edge count on the path Oriented distance (δ(u, v)) between vertices u, v equals to the length of the shortest

More information

CMSC 714 Lecture 14 Lamport Clocks and Eraser

CMSC 714 Lecture 14 Lamport Clocks and Eraser Notes CMSC 714 Lecture 14 Lamport Clocks and Eraser Midterm exam on April 16 sample exam questions posted Research project questions? Alan Sussman (with thanks to Chris Ackermann) 2 Lamport Clocks Distributed

More information

COSC 6385 Computer Architecture - Pipelining (II)

COSC 6385 Computer Architecture - Pipelining (II) COSC 6385 Computer Architecture - Pipelining (II) Edgar Gabriel Spring 2018 Performance evaluation of pipelines (I) General Speedup Formula: Time Speedup Time IC IC ClockCycle ClockClycle CPI CPI For a

More information

Processor speed. Concurrency Structure and Interpretation of Computer Programs. Multiple processors. Processor speed. Mike Phillips <mpp>

Processor speed. Concurrency Structure and Interpretation of Computer Programs. Multiple processors. Processor speed. Mike Phillips <mpp> Processor speed 6.037 - Structure and Interpretation of Computer Programs Mike Phillips Massachusetts Institute of Technology http://en.wikipedia.org/wiki/file:transistor_count_and_moore%27s_law_-

More information

Lecture 7. OpenMP: Reduction, Synchronization, Scheduling & Applications

Lecture 7. OpenMP: Reduction, Synchronization, Scheduling & Applications Lecture 7 OpenMP: Reduction, Synchronization, Scheduling & Applications Announcements Section and Lecture will be switched on Thursday and Friday Thursday: section and Q2 Friday: Lecture 2010 Scott B.

More information

Generation of parallel synchronization-free tiled code

Generation of parallel synchronization-free tiled code Computing (2018) 100:277 302 https://doi.org/10.1007/s00607-017-0576-3 Generation of parallel synchronization-free tiled code Wlodzimierz Bielecki 1 Marek Palkowski 1 Piotr Skotnicki 1 Received: 22 August

More information

CSE : PARALLEL SOFTWARE TOOLS

CSE : PARALLEL SOFTWARE TOOLS CSE 4392-601: PARALLEL SOFTWARE TOOLS (Summer 2002: T R 1:00-2:50, Nedderman 110) Instructor: Bob Weems, Associate Professor Office: 344 Nedderman, 817/272-2337, weems@uta.edu Hours: T R 3:00-5:30 GTA:

More information

Lecture 16/17: Distributed Shared Memory. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 16/17: Distributed Shared Memory. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 16/17: Distributed Shared Memory CSC 469H1F Fall 2006 Angela Demke Brown Outline Review distributed system basics What is distributed shared memory? Design issues and tradeoffs Distributed System

More information

Lecture 9: Load Balancing & Resource Allocation

Lecture 9: Load Balancing & Resource Allocation Lecture 9: Load Balancing & Resource Allocation Introduction Moler s law, Sullivan s theorem give upper bounds on the speed-up that can be achieved using multiple processors. But to get these need to efficiently

More information

SciDAC CScADS Summer Workshop on Libraries and Algorithms for Petascale Applications

SciDAC CScADS Summer Workshop on Libraries and Algorithms for Petascale Applications Parallel Tiled Algorithms for Multicore Architectures Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou SciDAC CScADS Summer Workshop on Libraries and Algorithms for Petascale Applications

More information

C09: Process Synchronization

C09: Process Synchronization CISC 7310X C09: Process Synchronization Hui Chen Department of Computer & Information Science CUNY Brooklyn College 3/29/2018 CUNY Brooklyn College 1 Outline Race condition and critical regions The bounded

More information

OpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system

OpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system OpenMP A parallel language standard that support both data and functional Parallelism on a shared memory system Use by system programmers more than application programmers Considered a low level primitives

More information

Total Points: 60. Duration: 1hr

Total Points: 60. Duration: 1hr CS5800 : Algorithms Fall 015 Nov, 015 Quiz Practice Total Points: 0. Duration: 1hr 1. (7,8) points Binary Heap. (a) The following is a sequence of elements presented to you (in order from left to right):

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Dynamic-Programming algorithms for shortest path problems: Bellman-Ford (for singlesource) and Floyd-Warshall (for all-pairs).

Dynamic-Programming algorithms for shortest path problems: Bellman-Ford (for singlesource) and Floyd-Warshall (for all-pairs). Lecture 13 Graph Algorithms I 13.1 Overview This is the first of several lectures on graph algorithms. We will see how simple algorithms like depth-first-search can be used in clever ways (for a problem

More information