Lecture 11 Loop Transformations for Parallelism and Locality

Lecture 11: Loop Transformations for Parallelism and Locality
1. Examples
2. Affine Partitioning: Do-all
3. Affine Partitioning: Pipelining
Readings: Chapter , ,

Shared Memory Machines
Performance on shared address space multiprocessors: parallelism & locality.
[Figure: processors P connected to memories M through an interconnect]

Parallelism and Locality
Parallelism does NOT imply speedup!
Parallel performance: improve locality with loop transformations; minimize communication; operations that use the same data are executed on the same processor.
Sequential performance: improve locality with loop transformations; minimize cache misses; operations that use the same data are executed close together in time.

Loop Permutation (Loop Interchange)
Original nest:
  for i = 1 to 4
    for j = 1 to 3
      Z[i,j] = Z[i-1,j]
After interchange:
  for j = 1 to 3
    for i = 1 to 4
      Z[i,j] = Z[i-1,j]
The interchange is the unimodular mapping
  [ j' ]   [ 0 1 ] [ i ]
  [ i' ] = [ 1 0 ] [ j ]
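A minimal C sketch of the interchange (not from the slides; the bounds i = 1..4, j = 1..3, the array size, and the function name are assumptions chosen to match the running example):

    /* Loop interchange: the dependence Z[i-1][j] -> Z[i][j] is carried by the
     * i loop, so moving j outermost leaves an outer loop whose iterations are
     * independent of each other.  (In row-major C the original form has the
     * better spatial locality; the interchange trades that for an outer loop
     * that could be run in parallel.) */
    void interchange_demo(int Z[5][4]) {
        /* Original nest: i outer, j inner. */
        for (int i = 1; i <= 4; i++)
            for (int j = 1; j <= 3; j++)
                Z[i][j] = Z[i - 1][j];

        /* Interchanged nest: same iterations, same dependences, j outer. */
        for (int j = 1; j <= 3; j++)
            for (int i = 1; i <= 4; i++)
                Z[i][j] = Z[i - 1][j];
    }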

Loop Fusion
Before fusion:
  for i = 1 to 4
    T[i] = A[i] + B[i]        (s1)
  for i = 1 to 4
    C[i] = T[i] x T[i]        (s2)
After fusion:
  for i = 1 to 4
    T[i] = A[i] + B[i]        (s1)
    C[i] = T[i] x T[i]        (s2)
Affine mappings to the fused loop index:
  s1: [j] = [1] [i]
  s2: [j] = [1] [i']

Loop Transformations
Unimodular transforms on loop nests: interchange, skewing, reversal.
Cross-statement transforms: loop fusion, loop fission, re-indexing.
How do we combine them to get parallelism and locality?
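A C sketch of the fusion shown above (the wrapper function, array sizes, and element type are assumptions; the statements are the slide's s1 and s2):

    /* Loop fusion: after fusion, each T[i] is consumed by s2 immediately
     * after s1 produces it, so it is still in cache (or a register) when
     * it is read. */
    void fusion_demo(const double A[5], const double B[5],
                     double T[5], double C[5]) {
        /* Before fusion: two loops over the same index range. */
        for (int i = 1; i <= 4; i++)
            T[i] = A[i] + B[i];            /* s1 */
        for (int i = 1; i <= 4; i++)
            C[i] = T[i] * T[i];            /* s2 */

        /* After fusion: one loop, s1 and s2 back to back. */
        for (int i = 1; i <= 4; i++) {
            T[i] = A[i] + B[i];            /* s1 */
            C[i] = T[i] * T[i];            /* s2 */
        }
    }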

Affine Partitioning: A Contrived but Illustrative Example
  FOR j = 1 TO n
    FOR i = 1 TO n
      A[i,j] = A[i,j] + B[i-1,j];    (S1)
      B[i,j] = A[i,j-1] * B[i,j];    (S2)
[Figure: the iteration spaces of S1 and S2 in the (i, j) plane]

Best Parallelization Scheme
The algorithm finds an affine partition mapping for each instruction:
  S1: execute iteration (i, j) on processor i - j.
  S2: execute iteration (i, j) on processor i - j + 1.
SPMD code (let p be the processor's ID number):
  if (1-n <= p <= n) then
    if (1 <= p) then
      B[p,1] = A[p,0] * B[p,1];                          (S2)
    for i1 = max(1, 1+p) to min(n, n-1+p) do
      A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];            (S1)
      B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];          (S2)
    if (p <= 0) then
      A[n+p,n] = A[n+p,n] + B[n+p-1,n];                  (S1)
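A C sketch of the SPMD code above, as the routine one processor would run (the function name, parameter layout, and element type are assumptions; the index arithmetic is transcribed from the slide):

    /* Processor p's share under the partition S1 -> i-j, S2 -> i-j+1.
     * Arrays are indexed 0..n in both dimensions; valid ids are 1-n <= p <= n. */
    void worker(int p, int n, double A[n + 1][n + 1], double B[n + 1][n + 1]) {
        if (p < 1 - n || p > n)
            return;
        if (1 <= p)                                    /* leading S2 instance  */
            B[p][1] = A[p][0] * B[p][1];
        int lo = (1 + p > 1) ? 1 + p : 1;              /* max(1, 1+p)          */
        int hi = (n - 1 + p < n) ? n - 1 + p : n;      /* min(n, n-1+p)        */
        for (int i1 = lo; i1 <= hi; i1++) {
            A[i1][i1 - p] = A[i1][i1 - p] + B[i1 - 1][i1 - p];          /* S1 */
            B[i1][i1 - p + 1] = A[i1][i1 - p] * B[i1][i1 - p + 1];      /* S2 */
        }
        if (p <= 0)                                    /* trailing S1 instance */
            A[n + p][n] = A[n + p][n] + B[n + p - 1][n];
    }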

2. Iteration Space
  FOR i = 0 to 5
    FOR j = i to 7
An n-deep loop nest defines an n-dimensional polytope; the iterations are the coordinates in this iteration space. Assume each iteration index is incremented in its loop. The sequential execution order is lexicographic order:
  [0,0], [0,1], ..., [0,6], [0,7], [1,1], ..., [1,6], [1,7], ...

Maximum Parallelism & No Communication
[Figure: accesses F1 i1 + f1 and F2 i2 + f2 map loop iterations into the array; mappings C1 i1 + c1 and C2 i2 + c2 assign iterations to processor IDs]
For every pair of data-dependent accesses F1 i1 + f1 and F2 i2 + f2, find C1, c1, C2, c2 such that, for all i1, i2:
  F1 i1 + f1 = F2 i2 + f2  implies  C1 i1 + c1 = C2 i2 + c2,
with the objective of maximizing the rank of C1, C2.
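Written out in LaTeX (standard notation; the shapes of the matrices and vectors are the usual ones and are not spelled out on the slide), the space-partition constraint is:

    % Communication-free (space) partition: whenever two data-dependent
    % accesses touch the same array element, the affine processor mappings
    % must place the two iterations on the same processor.
    \[
      \forall \mathbf{i}_1, \mathbf{i}_2 :\quad
      F_1 \mathbf{i}_1 + \mathbf{f}_1 = F_2 \mathbf{i}_2 + \mathbf{f}_2
      \;\Longrightarrow\;
      C_1 \mathbf{i}_1 + \mathbf{c}_1 = C_2 \mathbf{i}_2 + \mathbf{c}_2,
    \]
    \[
      \text{maximizing } \operatorname{rank}(C_1),\ \operatorname{rank}(C_2).
    \]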

Rank of Partitioning = Degree of Parallelism
  Affine mapping                            Rank   Mapped to the same processor
  p = [0 0] [i; j] = 0                       0     all iterations
  p = [0 1] [i; j] = j                       1     iterations that differ only in i
  [p1; p2] = [1 0; 0 1] [i; j] = [i; j]      2     no two distinct iterations

Example 1: Loop Transform
  Z[i,j] = Z[i-1,j]
Find an affine partitioning, i.e. c1, c2, c0 such that
  p = [c1 c2] [i; j] + c0
Suppose iterations (i, j) and (i', j') refer to the same location:
  i' = i - 1, j' = j
No communication means
  c1 i' + c2 j' + c0 = c1 i + c2 j + c0
  c1 (i - 1) + c2 j + c0 = c1 i + c2 j + c0
  =>  c1 = 0, so p = c2 j + c0
Pick the simplest c2, c0: c2 = 1, c0 = 0, giving p = j.
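A sketch (not slide code) of the partition p = j from Example 1, realized with OpenMP; the bounds follow the running example, and the pragma is simply ignored if the code is compiled without OpenMP support:

    /* p = j: all iterations with the same j run on the same thread, so the
     * dependence Z[i-1][j] -> Z[i][j] never crosses threads and no
     * communication or synchronization is needed inside the nest. */
    void example1(int Z[5][4]) {
        #pragma omp parallel for
        for (int j = 1; j <= 3; j++)
            for (int i = 1; i <= 4; i++)
                Z[i][j] = Z[i - 1][j];
    }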

Code Generation
Naive: each processor visits all the iterations and executes an iteration only if it owns it.
Optimization: remove the unnecessary looping and condition evaluation.

Code Generation
Original statement (loops over i and j implicit):
  Z[i,j] = Z[i-1,j]
Partition: p = j.
Naive:
  for P = 1 to 3
    if (j == P)
      Z[i,j] = Z[i-1,j]
Optimized:
  for P = 1 to 3
    Z[i,P] = Z[i-1,P]
SPMD (single program multiple data) code, on processor P:
  Z[i,P] = Z[i-1,P]
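A C rendering of the two code-generation strategies for one processor P (the explicit i and j bounds are taken from the running example, not from the slide):

    /* Naive: processor P walks the whole iteration space and executes only
     * the iterations it owns, i.e. those with j == P. */
    void naive(int P, int Z[5][4]) {
        for (int i = 1; i <= 4; i++)
            for (int j = 1; j <= 3; j++)
                if (j == P)
                    Z[i][j] = Z[i - 1][j];
    }

    /* Optimized: the ownership test is folded into the subscripts and
     * bounds, so processor P touches only column P and evaluates no
     * conditionals. */
    void optimized(int P, int Z[5][4]) {
        for (int i = 1; i <= 4; i++)
            Z[i][P] = Z[i - 1][P];
    }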

Loop Permutation (Loop Interchange)
  Z[i,j] = Z[i-1,j]
  for P = 1 to 3
    Z[i,P] = Z[i-1,P]
The permutation corresponds to the affine mapping
  [ p' ]   [ 0 1 ] [ i ]
  [ i' ] = [ 1 0 ] [ j ]

Example 2: Loop Fusion
  for i = 1 to 4
    T[i] = A[i] + B[i]        (s1)
  for i = 1 to 4
    C[i] = T[i] x T[i]        (s2)
Find an affine partitioning, i.e. c1,1, c1,0, c2,1, c2,0 such that
  s1: [p] = [c1,1] [i]  + c1,0
  s2: [p] = [c2,1] [i'] + c2,0
Suppose iterations i and i' refer to the same location: i' = i.
No communication means
  c1,1 i + c1,0 = c2,1 i' + c2,0
  =>  c1,1 = c2,1 and c1,0 = c2,0
Pick the simplest values: c1,1 = c2,1 = 1, c1,0 = c2,0 = 0, giving p = i and p = i'.
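A small SPMD sketch in C of the partition p = i just derived (the function name and array sizes are assumptions): giving both statements the same mapping is exactly what makes the fused, communication-free code on the next slide possible.

    /* Processor P owns element P of every array: it runs s1 and then s2 on
     * that element, so the value T[P] never has to move between processors. */
    void fused_worker(int P, const double A[5], const double B[5],
                      double T[5], double C[5]) {
        T[P] = A[P] + B[P];        /* s1 */
        C[P] = T[P] * T[P];        /* s2 */
    }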

Loop Fusion
Original:
  for i = 1 to 4
    T[i] = A[i] + B[i]        (s1)
  for i = 1 to 4
    C[i] = T[i] x T[i]        (s2)
Naive code for the partition p = i:
  for P = 1 to 4
    for i = 1 to 4
      if (i == P) T[i] = A[i] + B[i]     (s1)
    for i = 1 to 4
      if (i == P) C[i] = T[i] x T[i]     (s2)
Affine mappings:
  s1: [p] = [1] [i]
  s2: [p] = [1] [i']
Optimized code:
  for P = 1 to 4
    T[P] = A[P] + B[P]        (s1)
    C[P] = T[P] x T[P]        (s2)

Example 3: 2 Nested, Parallel Loops
  Z[i,j] = Z[i,j] + 1
Find an affine partitioning, i.e. c1, c2, c0 such that
  p = [c1 c2] [i; j] + c0
Suppose iterations (i, j) and (i', j') refer to the same location: i' = i, j' = j.
No communication means
  c1 i' + c2 j' + c0 = c1 i + c2 j + c0
  c1 i + c2 j + c0 = c1 i + c2 j + c0
which imposes no constraints. There are two basis vectors, [c1 c2] = [1 0] or [c1 c2] = [0 1], and therefore two answers for p: two degrees of parallelism:
  [ p1 ]   [ 1 0 ] [ i ]
  [ p2 ] = [ 0 1 ] [ j ]
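Since any combination of the two basis vectors is a valid partition, both loop dimensions can run in parallel; a sketch with a collapsed OpenMP nest (not slide code; bounds as in the code-generation example that follows):

    /* Two degrees of parallelism for Z[i][j] = Z[i][j] + 1: no iteration
     * depends on any other, so the whole 2-D iteration space can be
     * distributed; collapse(2) hands out (p1, p2) pairs to threads. */
    void example3(int Z[5][4]) {
        #pragma omp parallel for collapse(2)
        for (int p1 = 1; p1 <= 4; p1++)
            for (int p2 = 1; p2 <= 3; p2++)
                Z[p1][p2] = Z[p1][p2] + 1;
    }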

Example 3: 2 Nested, Parallel Loops
  Z[i,j] = Z[i,j] + 1
  [ p1 ]   [ 1 0 ] [ i ]
  [ p2 ] = [ 0 1 ] [ j ]
Naive code:
  for p1 = 1 to 4
    for p2 = 1 to 3
      if (i == p1 & j == p2)
        Z[i,j] = Z[i,j] + 1
Optimized code:
  for p1 = 1 to 4
    for p2 = 1 to 3
      Z[p1,p2] = Z[p1,p2] + 1

Optimizing Arbitrary Loop Nesting Using Affine Partitions (chotst, NAS)
      DO 1 J = 0, N
         I0 = MAX ( -M, -J )
         DO 2 I = I0, -1
            DO 3 JJ = I0 - I, -1
               DO 3 L = 0, NMAT
    3             A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
            DO 2 L = 0, NMAT
    2          A(L,I,J) = A(L,I,J) * A(L,0,I+J)
         DO 4 L = 0, NMAT
    4       EPSS(L) = EPS * A(L,0,J)
         DO 5 JJ = I0, -1
            DO 5 L = 0, NMAT
    5          A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
         DO 1 L = 0, NMAT
    1       A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

      DO 6 I = 0, NRHS
         DO 7 K = 0, N
            DO 8 L = 0, NMAT
    8          B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 7 JJ = 1, MIN (M, N-K)
               DO 7 L = 0, NMAT
    7             B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
         DO 6 K = N, 0, -1
            DO 9 L = 0, NMAT
    9          B(I,L,K) = B(I,L,K) * A(L,0,K)
            DO 6 JJ = 1, MIN (M, K)
               DO 6 L = 0, NMAT
    6             B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)
[Figure: affine partitions for arrays A, B, and EPSS, each along the L dimension]

Chotst: Results with Affine Partitioning + Blocking
(Unimodular transformation is a subset of affine partitioning, applicable only to perfect loop nests.)
[Figure: speedup vs. number of processors for "Unimodular + Blocking" and "Affine Partitioning + Blocking"]

Summary of Affine Partitioning
Communication-free partitioning:
[Figure: accesses F1 i1 + f1 and F2 i2 + f2 map loop iterations into the array; mappings C1 i1 + c1 and C2 i2 + c2 assign iterations to processor IDs]

Advanced Topic: Pipelining
SOR (Successive Over-Relaxation): an example
  for i = 1 to m
    for j = 1 to n
      A[i,j] = c * (A[i-1,j] + A[i,j-1])
[Figure: the (i, j) iteration space with its dependences]

Finding the Maximum Degree of Pipelining
[Figure: accesses F1 i1 + f1 and F2 i2 + f2 map loop iterations into the array; mappings C1 i1 + c1 and C2 i2 + c2 assign iterations to time stages]
For every pair of data-dependent accesses F1 i1 + f1 and F2 i2 + f2, let B1 i1 + b1 >= 0 and B2 i2 + b2 >= 0 be the corresponding loop bound constraints. Find C1, c1, C2, c2 such that, for all i1, i2 with
  B1 i1 + b1 >= 0,  B2 i2 + b2 >= 0,  i1 executed before i2,  and  F1 i1 + f1 = F2 i2 + f2,
we have
  C1 i1 + c1 <= C2 i2 + c2,
with the objective of maximizing the rank of C1, C2.
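A C sketch of one way to realize the pipelined parallelism in SOR (not slide code): skew the nest into time stages w = i + j; every iteration in a stage depends only on iterations in the previous stage, so each stage's iterations can run in parallel, with a barrier between stages.

    /* SOR over A[0..m][0..n]: A[i][j] depends on A[i-1][j] and A[i][j-1],
     * both of which lie on the previous anti-diagonal w - 1.  The outer w
     * loop is the time stage (synchronization); the inner loop is the
     * parallel stage. */
    void sor_wavefront(int m, int n, double c, double A[m + 1][n + 1]) {
        for (int w = 2; w <= m + n; w++) {              /* time stage     */
            int lo = (w - n > 1) ? w - n : 1;           /* max(1, w-n)    */
            int hi = (w - 1 < m) ? w - 1 : m;           /* min(m, w-1)    */
            #pragma omp parallel for
            for (int i = lo; i <= hi; i++) {            /* parallel stage */
                int j = w - i;
                A[i][j] = c * (A[i - 1][j] + A[i][j - 1]);
            }
        }
    }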

Key Insight
Choice in the time mapping => (pipelined) parallelism.
Rank(C) - 1 degrees of parallelism with 1 degree of synchronization; blocks with Rank(C) dimensions can be created.
Finding time partitions is not as straightforward as finding space partitions: we need to deal with linear inequalities. This is solved using Farkas' Lemma; there is no simple intuitive proof.

Summary of Affine Partitioning
Communication-free:
[Figure: accesses F1 i1 + f1 and F2 i2 + f2 from the loops into the array, with mappings C1 i1 + c1 and C2 i2 + c2 onto processor IDs]
Pipelining:
[Figure: the same accesses, with mappings C1 i1 + c1 and C2 i2 + c2 onto time stages]
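For reference, the affine form of Farkas' Lemma that the time-partitioning step leans on (a standard statement, not spelled out on the slide): an affine function is nonnegative over a nonempty polyhedron exactly when it is a nonnegative affine combination of the polyhedron's constraints.

    % Affine form of Farkas' Lemma, for a nonempty polyhedron {x : Ax + b >= 0}
    % and an affine function psi:
    \[
      \psi(\mathbf{x}) \ge 0 \;\text{ for all } \mathbf{x} \text{ with } A\mathbf{x} + \mathbf{b} \ge \mathbf{0}
      \quad\Longleftrightarrow\quad
      \exists\, \lambda_0 \ge 0,\ \boldsymbol{\lambda} \ge \mathbf{0} :\;
      \psi(\mathbf{x}) \equiv \lambda_0 + \boldsymbol{\lambda}^{\mathsf{T}} (A\mathbf{x} + \mathbf{b}).
    \]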
