CS4961 Parallel Programming
Lecture 14: Reasoning about Performance
Mary Hall, October 7, 2010

Administrative: What's Coming
- Programming Assignment 2 due Friday, 11:59 PM
- Homework assignment out on Tuesday, Oct. 19, and due Monday, October 25
- Midterm quiz on October 26
- Start CUDA after break
- Start thinking about independent/group projects

Today's Lecture
- Estimating locality benefits
- Finish OpenMP example
- Ch. 3, Reasoning About Performance

Programming Assignment 2: Due 11:59 PM, Friday October 8
Combining Locality, Thread and SIMD Parallelism: The following code excerpt is representative of a common signal processing technique called convolution. Convolution combines two signals to form a third signal. In this example, we slide a small (32x32) signal around in a larger (4128x4128) signal to look for regions where the two signals have the most overlap.

  for (l = 0; l < n; l++) {
    for (k = 0; k < n; k++) {
      C[k][l] = 0.0;
      for (j = 0; j < w; j++) {
        for (i = 0; i < w; i++) {
          C[k][l] += A[k+i][l+j] * B[i][j];
        }
      }
    }
  }
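As a hedged illustration only (not the assignment's required solution, which also asks for locality and SIMD), one way to introduce thread parallelism into the convolution nest above is an OpenMP parallel for over the outer loop; the arrays C, A, B and the bounds n and w are assumed to be declared as in the excerpt.

  /* Sketch: thread-parallel convolution; compile with -fopenmp.
     Assumes C, A, B, n, and w are declared as in the assignment excerpt. */
  #pragma omp parallel for private(k, j, i)
  for (l = 0; l < n; l++) {
    for (k = 0; k < n; k++) {
      double sum = 0.0;                    /* accumulate locally, store once */
      for (j = 0; j < w; j++) {
        for (i = 0; i < w; i++) {
          sum += A[k+i][l+j] * B[i][j];
        }
      }
      C[k][l] = sum;
    }
  }

Each outer-loop iteration writes a disjoint column of C, so there are no cross-thread dependences; the tiling and SIMD transformations discussed next would be layered on the inner loops.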

How to use Tiling
- What data are you trying to get to reside in cache?
- Is your tiling improving whether that data is reused out of cache?
- Do you need to tile different loops? Do you need to tile multiple loops?
- Do you need a different loop order after tiling?
- How big are W and N?
The right way to approach this is to reason about what data is accessed in the innermost 2 or 3 loops after tiling. You can permute to get the order you want.
- Let's review matrix multiply from Lecture 12
- Calculate the footprint of the data in the innermost loops not controlled by N, which is presumably very large
- Advanced idea: take spatial locality into account

Locality + SIMD (SSE-3) Example
Example: matrix multiply

  for (J = 0; J < N; J++)
    for (K = 0; K < N; K++)
      for (I = 0; I < N; I++)
        C[J][I] = C[J][I] + A[K][I] * B[J][K];

(Figure: access patterns of C, A, and B over the I, J, K loop indices.)

Locality + SIMD (SSE-3) Example
Tiling inner loops I and K with tile sizes TI and TK (+ permutation):

  for (K = 0; K < N; K += TK)
    for (I = 0; I < N; I += TI)
      for (J = 0; J < N; J++)
        for (KK = K; KK < min(K + TK, N); KK++)
          for (II = I; II < min(I + TI, N); II++)
            C[J][II] = C[J][II] + A[KK][II] * B[J][KK];

(Figure: TI-by-TK tiles of C, A, and B.)

Motivating Example: Linked List Traversal

  while (my_pointer) {
    (void) do_independent_work(my_pointer);
    my_pointer = my_pointer->next;
  } // End of while loop

How to express this with a parallel for?
- Must have a fixed number of iterations
- Loop-invariant loop condition and no early exits
Convert to a parallel for:
- A priori count the number of iterations (if possible), as sketched below
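One hedged way to do the "a priori count" conversion just mentioned: walk the list once sequentially to count and collect the node pointers, then loop over the resulting array with a parallel for. The struct node type is an assumption, do_independent_work() is the routine from the slide, and the code needs <stdlib.h> and -fopenmp.

  /* Sketch: linked-list traversal rewritten as a parallel for.
     Assumes a type like: struct node { ...; struct node *next; }; */
  int count = 0;
  for (struct node *p = listhead; p != NULL; p = p->next)
      count++;                                /* a priori iteration count */

  struct node **nodes = malloc(count * sizeof(struct node *));
  int idx = 0;
  for (struct node *p = listhead; p != NULL; p = p->next)
      nodes[idx++] = p;                       /* gather pointers into an array */

  #pragma omp parallel for
  for (int i = 0; i < count; i++)
      (void) do_independent_work(nodes[i]);   /* fixed trip count, no early exit */

  free(nodes);

The two sequential passes over the list are pure overhead relative to the serial code, which is exactly why the task construct on the next slide is attractive.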

OpenMP 3.0: Tasks!

  my_pointer = listhead;
  #pragma omp parallel
  {
    #pragma omp single nowait
    {
      while (my_pointer) {
        #pragma omp task firstprivate(my_pointer)
        {
          (void) do_independent_work(my_pointer);
        }
        my_pointer = my_pointer->next;
      }
    } // End of single - no implied barrier (nowait)
  }   // End of parallel region - implied barrier here

- firstprivate = private, and copy the initial value in from the global variable
- lastprivate = private, and copy the final value back to the global variable

Chapter 3: Reasoning about Performance
Recall the introductory lecture: it is easy to write a parallel program that is slower than the sequential one!
Naively, many people think that applying P processors to a computation that takes time T will result in T/P time. Generally wrong:
- For a few problems (Monte Carlo) it is possible to apply more processors directly to the solution
- For most problems, using P processors requires a paradigm shift, additional code, and communication, and therefore overhead
- Also, differences in hardware
- Assume P processors => T/P time is the best case possible
- In some cases, can actually do better (why?)

Sources of Performance Loss
- Overhead not present in the sequential computation
- Non-parallelizable computation
- Idle processors, typically due to load imbalance
- Contention for shared resources

Sources of parallel overhead
- Thread/process management (next few slides)
- Extra computation
  - Which part of the computation do I perform?
  - Select which part of the data to operate upon
  - Local computation that is later accumulated with a reduction (see the sketch after this slide)
- Extra storage
  - Auxiliary data structures
  - Ghost cells
- Communication
  - Explicit message passing of data
  - Access to remote shared global data (in shared memory)
  - Cache flushes and coherence protocols (in shared memory)
  - Synchronization (the book separates synchronization from communication)
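The "local computation that is later accumulated with a reduction" item above can be made concrete with a small hedged OpenMP sketch; the array a and its length n are hypothetical.

  /* Sketch: per-thread partial sums combined by a reduction (compile with -fopenmp).
     The array 'a' and length 'n' are hypothetical. */
  double sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; i++)
      sum += a[i];          /* each thread adds into its own private copy of sum */
  /* At the end of the loop the private copies are combined into the shared sum:
     extra computation and synchronization that the sequential code never pays. */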

Processes and Threads (and Filaments)
Let's formalize some things we have discussed before.
Threads:
- consist of program code, a program counter, a call stack, and a small amount of thread-specific data
- share access to memory (and the file system) with other threads
- communicate through that shared memory (a small sketch follows this page)
Processes:
- execute in their own private address space
- do not communicate through shared memory, but need another mechanism such as message passing; a shared address space is another possibility
- logically subsume threads
- Key issue: how is the problem divided among the processes, which includes both data and work?

Comparison
Both have code, a PC, a call stack, and local data
- Threads -- one address space
- Processes -- separate address spaces
- Filaments and similar are extremely fine-grain threads
Weight and agility
- Threads: lighter weight; faster to set up and tear down; more dynamic
- Processes: heavier weight; setup and teardown are more time consuming; communication is slower

Latency vs. Throughput
Parallelism can be used either to reduce latency or to increase throughput
- Latency refers to the amount of time it takes to complete a given unit of work (speedup).
- Throughput refers to the amount of work that can be completed per unit time (pipelining computation).
There is an upper limit on reducing latency
- Speed of light, especially for bit transmissions
- In networks, switching time (node latency)
- (Clock rate) x (issue width), for instructions
- Diminishing returns (overhead) for problem instances
- Limitations on the number of processors or the size of memory
- Power/energy constraints

Throughput Improvements
Throughput improvements are often easier to achieve by adding hardware
- More wires improve bits/second
- Use processors to run separate jobs
- Pipelining is a powerful technique to execute more (serial) operations in unit time
Common way to improve throughput
- Multithreading (e.g., NVIDIA GPUs and Cray El Dorado)
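A minimal hedged sketch of the thread model described above: two POSIX threads communicate through a shared counter protected by a mutex. The counter name and iteration count are illustrative only; build with -pthread.

  /* Sketch: threads share an address space and communicate through memory. */
  #include <pthread.h>
  #include <stdio.h>

  static long shared_counter = 0;                   /* visible to both threads */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg) {
      for (int i = 0; i < 100000; i++) {
          pthread_mutex_lock(&lock);                /* synchronization overhead */
          shared_counter++;                         /* communication via shared memory */
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("shared_counter = %ld\n", shared_counter);   /* 200000 */
      return 0;
  }

A process-based version would need message passing (or an explicitly shared segment) to exchange the counter, which is the distinction the slide draws.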

Latency Hiding from Multithreading
Reduce wait times by switching to work on a different operation
- An old idea, dating back to Multics
- In parallel computing it's called latency hiding
The idea is most often used to lower λ (latency) costs
- Have many threads ready to go
- Execute a thread until it makes a nonlocal reference
- Switch to the next thread
- When the nonlocal reference is filled, add the thread back to the ready list

Interesting Phenomenon: Superlinear Speedup
Why might Program 1 be exhibiting superlinear speedup? A different amount of work? Cache effects?
(Figure 3.5 from the text: a typical speedup graph showing performance for two programs; the dashed line represents linear speedup.)

Performance Loss: Contention
Contention -- the action of one processor interferes with another processor's actions -- is an elusive quantity
- Lock contention: one processor's lock stops other processors from referencing; they must wait
- Bus contention: bus wires are in use by one processor's memory reference
- Network contention: wires are in use by one packet, blocking other packets
- Bank contention: multiple processors try to access different locations on one memory chip simultaneously

Performance Loss: Load Imbalance
Load imbalance, work not evenly assigned to the processors, underutilizes parallelism (see the scheduling sketch below)
- The assignment of work, not data, is key
- Static assignments, being rigid, are more prone to imbalance
- Because dynamic assignment carries overhead, the quantum of work must be large enough to amortize the overhead
- With flexible allocations, load balance can be solved late in the design/programming cycle
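To make the static-versus-dynamic point concrete, here is a hedged OpenMP sketch: schedule(dynamic, chunk) hands out chunks of iterations on demand, trading some scheduling overhead for better balance. The work() routine and bound n are hypothetical.

  /* Sketch: dynamic assignment with a chunk size chosen to amortize overhead. */
  #pragma omp parallel for schedule(dynamic, 16)
  for (int i = 0; i < n; i++)
      work(i);             /* hypothetical iterations with widely varying cost */
  /* schedule(static) would pre-assign equal-sized blocks and can leave threads
     idle when a few iterations dominate the running time. */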

Scalability Consideration: Efficiency
Efficiency example from the textbook, page 82
- Parallel efficiency = speedup / number of processors
- Tells you how much gain is likely from adding more processors
- Assume for this example that overhead is fixed at 20% of T_S
- What are the speedup and efficiency with 2 processors? 10 processors? 100 processors? (A worked sketch follows the summary.)

Summary: Issues in Reasoning about Performance
- Finish your assignment (try tiling!)
- Have a nice fall break!
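One plausible way to work the efficiency question above, stated as an assumption rather than the textbook's exact numbers: take T_P = T_S/P + 0.2*T_S, i.e., the fixed overhead is 20% of the sequential time. The small C sketch below just evaluates that model.

  /* Sketch: speedup and efficiency assuming T_P = T_S/P + 0.2 * T_S. */
  #include <stdio.h>

  int main(void) {
      int procs[] = {2, 10, 100};
      for (int i = 0; i < 3; i++) {
          double P = procs[i];
          double speedup = 1.0 / (1.0 / P + 0.2);   /* T_S / T_P */
          double efficiency = speedup / P;          /* speedup / #processors */
          printf("P = %3.0f  speedup = %5.2f  efficiency = %.2f\n",
                 P, speedup, efficiency);
      }
      return 0;   /* roughly: 1.43 / 0.71, 3.33 / 0.33, 4.76 / 0.05 */
  }

Under this model efficiency collapses as P grows, which is the slide's point: a fixed overhead sharply limits the gain from adding processors.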