CS4230 Parallel Programming. Lecture 7: Loop Scheduling cont., and Data Dependences. Mary Hall, September 7, 2012.

CS4230 Parallel Programming
Lecture 7: Loop Scheduling cont., and Data Dependences
Mary Hall, September 7, 2012

Administrative

I will be on travel on Tuesday, September 11. There will be no lecture; instead, you should watch the following two videos:
- "Intro to Parallel Programming. Lesson 3, pt. 1 - Implementing Domain Decompositions with OpenMP," by Clay Breshears, Intel, http://www.youtube.com/watch?v=9n1b9uwvgaw&feature=relmfu (9:53); reinforces OpenMP data parallelism
- "Dense Linear Algebra," by Jim Demmel, CS237 at UC Berkeley, http://www.youtube.com/watch?v=2pfoivcj74g&feature=relmfu (start at 18:15, watch until 30:40); establishes bounds on dense linear algebra performance, taking parallelism and locality into account
- There will be a quiz on this material on Thursday, September 13 (participation grade)

Homework 2: Mapping to Architecture

Due before class, Thursday, September 6
Objective: Begin thinking about architecture mapping issues
Turn in electronically on the CADE machines using the handin program: handin cs4230 hw2 <probfile>
- Problem 1 (2.3 in text): [Locality]
- Problem 2 (2.8 in text): [Caches and multithreading]
- Problem 3: [Amdahl's Law] A multiprocessor consists of 100 processors, each capable of a peak execution rate of 20 Gflops. What is the performance of the system, measured in Gflops, when 20% of the code is sequential and 80% is parallelizable?
- Problem 4 (2.16 in text): [Parallelization scaling]
- Problem 5: [Buses and crossbars] Suppose you have a computation that uses two vector inputs to compute a vector output, where each vector is stored in consecutive memory locations. Each input and output location is unique, but data is loaded/stored from cache in 4-word transfers. Suppose you have P processors and N data elements, and execution time is a function of time L for a load from memory and time C for the computation. Compare parallel execution time for a shared memory architecture with a bus (Nehalem) versus a full crossbar (Niagara) from Lecture 3, assuming a write-back cache that is larger than the data footprint.

Programming Assignment 1: Due Friday, Sept. 14, 11 PM MDT

To be done on water.eng.utah.edu (you all have accounts; passwords are available if your CS account doesn't work).
1. Write a program to calculate π in OpenMP for a problem size and data set to be provided. Use a block data distribution.
2. Write the same computation in Pthreads.
Report your results in a separate README file:
- What is the parallel speedup of your code? To compute parallel speedup, you will need to time the execution of both the sequential and parallel code, and report speedup = Time(seq) / Time(parallel).
- If your code does not speed up, you will need to adjust the parallelism granularity, the amount of work each processor does between synchronization points. You can do this by either decreasing the number of threads or increasing the number of iterations.
- Report results for two different numbers of threads, holding iterations fixed, and for two different numbers of iterations, holding threads fixed. Also report lines of code for the two solutions.
Extra credit: Rewrite both codes using a cyclic distribution and measure performance for the same configurations.
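The speedup the assignment asks for is Time(seq) / Time(parallel) over the same problem size. As a minimal sketch of how that timing might look with omp_get_wtime; this is not the assignment's required block-distribution solution (the midpoint-rule π computation and the reduction clause here are illustrative assumptions, and the provided pi-test-harness.c already includes its own timing and validation):

    #include <stdio.h>
    #include <omp.h>

    /* Midpoint-rule estimate of pi = integral of 4/(1+x^2) on [0,1].
       Illustrative only; the assignment requires an explicit block
       data distribution rather than a reduction. */
    static double pi_seq(long n) {
        double h = 1.0 / n, sum = 0.0;
        for (long i = 0; i < n; i++) {
            double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * h;
    }

    static double pi_par(long n) {
        double h = 1.0 / n, sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++) {
            double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * h;
    }

    int main(void) {
        long n = 100000000;

        double t0 = omp_get_wtime();
        double ps = pi_seq(n);
        double t_seq = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        double pp = pi_par(n);
        double t_par = omp_get_wtime() - t0;

        /* speedup = Time(seq) / Time(parallel), as the assignment asks you to report */
        printf("seq: %.10f (%.3fs)  par: %.10f (%.3fs)  speedup: %.2f\n",
               ps, t_seq, pp, t_par, t_seq / t_par);
        return 0;
    }

Compiled with OpenMP enabled (for example, with the compile commands on the next slide), the parallel time should drop as threads are added, and the printed ratio is the speedup to report.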

Programming Assignment 1, cont.

A test harness is provided in pi-test-harness.c that provides a sequential pi calculation, validation, speedup timing, and substantial instructions on what you need to do to complete the assignment. Here are the key points:
- You'll need to write the parallel code, and the things needed to support that. Read the top of the file, and search for TODO.
- Compile w/ OpenMP: cc -o pi-openmp -O3 -xopenmp pi-openmp.c
- Compile w/ Pthreads: cc -o pi-pthreads -O3 pi-pthreads.c -lpthread
- Run OpenMP version: ./pi-openmp > openmp.out
- Run Pthreads version: ./pi-pthreads > pthreads.out
Note that editing on water is somewhat primitive; I'm using vim. You may want to edit on a different CADE machine.

Estimating π (figure slide)

Today's Lecture
- OpenMP parallel, parallel for, and for constructs from last time
- OpenMP loop scheduling demonstration
- Data dependences
  - How compilers reason about them
  - Formal definition of reordering transformations that preserve program meaning
  - Informal determination of parallelization safety
Sources for this lecture:
- Notes on website

OpenMP Sum Examples: Improvement

Last time (sum v3):

    int sum, mysum[64];
    sum = 0;
    #pragma omp parallel
    {
        mysum[my_id] = 0;
    }
    #pragma omp parallel for
    for (i = 0; i < size; i++) {
        mysum[my_id] += _iplist[i];
    }
    #pragma omp parallel
    {
        #pragma omp critical
        sum += mysum[my_id];
    }
    return sum;

Improved, posted (sum v3):

    int sum, mysum[64];
    sum = 0;
    #pragma omp parallel
    {
        mysum[my_id] = 0;
        #pragma omp for
        for (i = 0; i < size; i++) {
            mysum[my_id] += _iplist[i];
        }
        #pragma omp critical
        sum += mysum[my_id];
    }
    return sum;
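For comparison, the private-partial-sums-plus-critical pattern above can also be written with OpenMP's reduction clause, which gives each thread a private copy of sum and combines the copies at the end of the loop. A minimal, self-contained sketch (the array name, contents, and size are made up; the slide's _iplist, size, and my_id come from surrounding code not shown here):

    #include <stdio.h>

    #define SIZE 1024

    int main(void) {
        int iplist[SIZE];                 /* made-up input array for illustration */
        for (int i = 0; i < SIZE; i++)
            iplist[i] = i % 10;

        int sum = 0;
        /* reduction(+:sum) replaces the explicit mysum[] array and the
           critical section: each thread accumulates privately, and the
           private copies are combined when the loop ends */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < SIZE; i++)
            sum += iplist[i];

        printf("sum = %d\n", sum);
        return 0;
    }

Besides being shorter, this avoids serializing threads at a critical section and avoids adjacent mysum[] entries sharing a cache line.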

Common Data Distributions

Consider a 1-dimensional array to solve the global sum problem: 16 elements, 4 threads.
Example array: 3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

CYCLIC (chunk = 1) (version 2):

    for (i = 0; i < blocksize; i++)
        ... in[i*blocksize + tid];

BLOCK (chunk = 4) (default, version 1):

    for (i = tid*blocksize; i < (tid+1)*blocksize; i++)
        ... in[i];

BLOCK-CYCLIC (chunk = 2) (version 3)

The Schedule Clause
- Default schedule (figure): corresponds to the BLOCK distribution above; each thread gets one contiguous chunk of iterations.
- Cyclic schedule (figure): corresponds to the CYCLIC distribution above; iterations are dealt out to threads round-robin.

OpenMP environment variables

OMP_NUM_THREADS
- Sets the number of threads to use during execution.
- When dynamic adjustment of the number of threads is enabled, the value of this environment variable is the maximum number of threads to use.
For example:

    setenv OMP_NUM_THREADS 16         [csh, tcsh]
    export OMP_NUM_THREADS=16         [sh, ksh, bash]

OMP_SCHEDULE (version 4)
- Applies only to do/for and parallel do/for directives that have the schedule type RUNTIME.
- Sets the schedule type and chunk size for all such loops.
For example:

    setenv OMP_SCHEDULE "GUIDED,4"    [csh, tcsh]
    export OMP_SCHEDULE="GUIDED,4"    [sh, ksh, bash]

Race Condition or Data Dependence
- A race condition exists when the result of an execution depends on the timing of two or more events.
- A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness.
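Tying the distributions and the schedule clause above to OpenMP syntax, here is a minimal sketch; the global-sum loop body and the use of a reduction are illustrative assumptions. With OMP_SCHEDULE set as in the environment-variable examples, the last loop would run with the guided schedule and chunk size 4.

    #include <stdio.h>

    #define N 16

    int main(void) {
        int in[N] = {3,6,7,5,3,5,6,2,9,1,2,7,0,9,3,6};  /* example array from the slide */
        int sum = 0;

        /* BLOCK: static schedule gives each thread one contiguous chunk */
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (int i = 0; i < N; i++) sum += in[i];

        /* CYCLIC: chunk size 1 deals iterations out round-robin */
        #pragma omp parallel for schedule(static, 1) reduction(+:sum)
        for (int i = 0; i < N; i++) sum += in[i];

        /* BLOCK-CYCLIC: chunk size 2 */
        #pragma omp parallel for schedule(static, 2) reduction(+:sum)
        for (int i = 0; i < N; i++) sum += in[i];

        /* RUNTIME: schedule chosen by the OMP_SCHEDULE environment variable */
        #pragma omp parallel for schedule(runtime) reduction(+:sum)
        for (int i = 0; i < N; i++) sum += in[i];

        printf("sum over four passes = %d\n", sum);
        return 0;
    }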

Key Control Concept: Data Dependence

Question: When is parallelization guaranteed to be safe?
Answer: If there are no data dependences across reordered computations.
Definition: Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the accesses is a write.
Bernstein's conditions (1966): Let Ij be the set of memory locations read by process Pj, and Oj the set updated by Pj. To execute Pj and another process Pk in parallel:
- Ij ∩ Ok = ∅   (write after read)
- Ik ∩ Oj = ∅   (read after write)
- Oj ∩ Ok = ∅   (write after write)

Data Dependence and Related Definitions

Actually, parallelizing compilers must formalize this to guarantee correct code. Let's look at how they do it; it will help us understand how to reason about correctness as programmers.
Definition: Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a write.
A data dependence can be either between two distinct program statements or between two different dynamic executions of the same program statement.
Source: Optimizing Compilers for Modern Architectures: A Dependence-Based Approach, Allen and Kennedy, 2002, Ch. 2.

Data Dependence of Scalar Variables
- True (flow) dependence
- Anti-dependence
- Output dependence
- Input dependence (for locality)
Definition: A data dependence exists from a reference instance i to i' iff
- either i or i' is a write operation,
- i and i' refer to the same variable, and
- i executes before i'.

Some Definitions (from Allen & Kennedy)

Definition 2.5: Two computations are equivalent if, on the same inputs,
- they produce identical outputs, and
- the outputs are executed in the same order.
Definition 2.6: A reordering transformation
- changes the order of statement execution
- without adding or deleting any statement executions.
Definition 2.7: A reordering transformation preserves a dependence if
- it preserves the relative execution order of the dependence's source and sink.
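To make the four kinds of scalar dependences concrete, a small illustrative fragment (the variables and values are made up):

    #include <stdio.h>

    int main(void) {
        int b = 2, c = 3, e = 7;
        int a, d, f;

        a = b + c;   /* S1 writes a                                             */
        d = a * 2;   /* S2 reads a: true (flow) dependence S1 -> S2             */
        a = e + 1;   /* S3 writes a: anti-dependence S2 -> S3 (read then write),
                        and output dependence S1 -> S3 (write then write)       */
        f = b + 5;   /* S4 reads b, as S1 did: input dependence (locality only) */

        printf("a=%d d=%d f=%d\n", a, d, f);  /* prints: a=8 d=10 f=7 */
        return 0;
    }

Any reordering of S1, S2, and S3 that reverses one of these source/sink pairs changes the values computed; the input dependence between S1 and S4 does not constrain correctness, only locality.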

Fundamental Theorem of Dependence

Theorem 2.2: Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.

In this course, we consider two kinds of reordering transformations

- Parallelization: Computations that execute in parallel between synchronization points are potentially reordered. Is that reordering safe? According to our definition, it is safe if it preserves the dependences in the code.
- Locality optimizations: Suppose we want to modify the order in which a computation accesses memory so that it is more likely to be in cache. This is also a reordering transformation, and it is safe if it preserves the dependences in the code.
- Reduction computations: We have to relax this rule for reductions. It is safe to reorder reductions for commutative and associative operations.

How to determine safety of reordering transformations

- Informally:
  - Must preserve the relative order of each dependence's source and sink
  - So, cannot reverse their order
- Formally:
  - Track dependences
  - A simple abstraction: distance vectors

Brief Detour on Parallelizable Loops as a Reordering Transformation

Forall or Doall loops: loops whose iterations can execute in parallel (a particular reordering transformation).

Example:

    forall (i=1; i<=n; i++)
        A[i] = B[i] + C[i];

Meaning?
- Each iteration can execute independently of the others
- Free to schedule iterations in any order (e.g., #pragma omp parallel for in OpenMP)
- Source of scalable, balanced work
- Common to scientific, multimedia, graphics, and other domains
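A minimal OpenMP realization of the forall example above (the array size, initialization, and 0-based indexing are illustrative assumptions):

    #include <stdio.h>

    #define N 16

    int main(void) {
        double A[N], B[N], C[N];
        for (int i = 0; i < N; i++) { B[i] = i; C[i] = 2 * i; }

        /* Each A[i] is computed from B[i] and C[i] only; no iteration reads
           or writes a location touched by another iteration, so there are no
           loop-carried dependences and any iteration order (or parallel
           execution) preserves the meaning. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] = B[i] + C[i];

        printf("A[N-1] = %g\n", A[N - 1]);  /* 15 + 30 = 45 */
        return 0;
    }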

Data Dependence for Arrays

Loop-carried dependence:

    for (i=2; i<5; i++)
        A[i] = A[i-2] + 1;

Loop-independent dependence:

    for (i=1; i<=3; i++)
        A[i] = A[i] + 1;

Recognizing parallel loops (intuitively):
- Find the data dependences in the loop.
- If no dependence crosses an iteration boundary, parallelization of the loop's iterations is safe.
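A sketch contrasting the two loops above (the array contents are made up): the loop-carried case must run in order as written, while the loop-independent case can safely carry a parallel for.

    #include <stdio.h>

    int main(void) {
        int A[8] = {0, 1, 2, 3, 4, 5, 6, 7};

        /* Loop-carried dependence: iteration i reads A[i-2], which iteration
           i-2 wrote (dependence distance 2). Running iterations in parallel
           or out of order could read A[i-2] before it is updated, so this
           loop is not safe to mark parallel for as written. */
        for (int i = 2; i < 5; i++)
            A[i] = A[i - 2] + 1;

        /* Loop-independent dependence: each iteration reads and writes only
           its own A[i], so no dependence crosses an iteration boundary and
           the iterations may execute in parallel. */
        #pragma omp parallel for
        for (int i = 1; i <= 3; i++)
            A[i] = A[i] + 1;

        for (int i = 0; i < 8; i++)
            printf("%d ", A[i]);    /* prints: 0 2 2 3 2 5 6 7 */
        printf("\n");
        return 0;
    }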