EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 11 Parallelism in Software II

EE382N (20): Computer Architecture - Parallelism and Locality
Fall 2011 -- Lecture 11: Parallelism in Software II
Mattan Erez
The University of Texas at Austin
EE382: Parallelism and Locality, Fall 2011 -- Lecture 11 (c) Rodric Rabbah, Mattan Erez

Credits

Most of the slides are courtesy of Dr. Rodric Rabbah (IBM), taken from 6.189 IAP taught at MIT in 2007.

4 Common Steps to Creating a Parallel Program

[Diagram: Sequential computation → (decomposition) → tasks → (assignment) → units of execution (UE 0 .. UE 3) → (orchestration) → parallel program → (mapping) → processors]

Decomposition

- Identify concurrency and decide at what level to exploit it
- Break up the computation into tasks to be divided among processes
  - Tasks may become available dynamically
  - The number of tasks may vary with time
- Create enough tasks to keep processors busy
  - The number of tasks available at a time is an upper bound on achievable speedup
- Main considerations: coverage and Amdahl's Law
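Amdahl's Law, the "main consideration" named above, bounds the speedup by the fraction of the program that parallelizes. A minimal Python sketch of the formula (the function name `amdahl_speedup` is my own, not from the slides):

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Upper bound on speedup when only parallel_fraction of the
    runtime can be spread across num_processors."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / num_processors)

# Even with 90% parallel coverage, 16 processors give well under 16x:
print(round(amdahl_speedup(0.9, 16), 2))  # 6.4
```

Note how quickly the serial fraction dominates: this is why coverage matters more than raw processor count.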

Assignment

- Specify the mechanism to divide work among PEs
  - Balance work and reduce communication
- Structured approaches usually work well
  - Code inspection or understanding of the application
  - Well-known design patterns
- As programmers, we worry about partitioning first
  - Independent of architecture or programming model?
  - Complexity often affects decisions
  - The architectural model affects decisions
- Main considerations: granularity and locality

Orchestration and Mapping

- Computation and communication concurrency
- Preserve locality of data
- Schedule tasks to satisfy dependences early
- Survey the available mechanisms on the target system
- Main considerations: locality, parallelism, mechanisms (efficiency and dangers)

Parallel Programming by Pattern

- Provides a cookbook to systematically guide programmers
  - Decompose, Assign, Orchestrate, Map
  - Can lead to high-quality solutions in some domains
- Provides a common vocabulary for the programming community
  - Each pattern has a name, providing a vocabulary for discussing solutions
- Helps with software reusability, malleability, and modularity
  - Written in a prescribed format to allow the reader to quickly understand the solution and its context
- Otherwise, parallel programming is too difficult for programmers, and software will not fully exploit parallel hardware

History

- Berkeley architecture professor Christopher Alexander
- In 1977, published patterns for city planning, landscaping, and architecture in an attempt to capture principles for "living" design

Example 167 (p. 783): 6ft Balcony

Patterns in Object-Oriented Programming

- Design Patterns: Elements of Reusable Object-Oriented Software (1995)
  - Gang of Four (GoF): Gamma, Helm, Johnson, Vlissides
- Catalogue of patterns
  - Creational, structural, behavioral

Patterns for Parallelizing Programs

4 Design Spaces

Algorithm Expression
- Finding Concurrency: expose concurrent tasks
- Algorithm Structure: map tasks to processes to exploit the parallel architecture

Software Construction
- Supporting Structures: code and data structuring patterns
- Implementation Mechanisms: low-level mechanisms used to write parallel programs

Source: Patterns for Parallel Programming. Mattson, Sanders, and Massingill (2005).

Here's my algorithm. Where's the concurrency?

[MPEG Decoder block diagram: the MPEG bit stream enters VLD, producing macroblocks and motion vectors; the stream splits into frequency-encoded macroblocks (ZigZag → IQuantization → IDCT → Saturation) and differentially coded motion vectors (Motion Vector Decode → Repeat); the spatially encoded macroblocks and motion vectors join at Motion Compensation, whose recovered picture flows through Picture Reorder → Color Conversion → Display]

Here's my algorithm. Where's the concurrency?

[Same MPEG decoder diagram as above]

Task decomposition
- Independent coarse-grained computation
- Inherent to the algorithm
- A sequence of statements (instructions) that operate together as a group
- Corresponds to some logical part of the program
- Usually follows from the way the programmer thinks about a problem
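The independent coarse-grained tasks described above can be sketched with a thread pool: two stages that share no state run concurrently. The functions below are hypothetical stand-ins for real decoder stages (the names and the placeholder arithmetic are mine, not from the MPEG standard):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for two independent decoder tasks. Because
# motion-vector decoding and macroblock inverse-transform read and
# write disjoint data, they can safely run as separate tasks.
def decode_motion_vectors(coded):
    return [v + 1 for v in coded]      # placeholder computation

def inverse_transform(blocks):
    return [b * 2 for b in blocks]     # placeholder computation

with ThreadPoolExecutor(max_workers=2) as ex:
    mv = ex.submit(decode_motion_vectors, [0, 1, 2])
    mb = ex.submit(inverse_transform, [3, 4, 5])
    print(mv.result(), mb.result())  # [1, 2, 3] [6, 8, 10]
```

The point is structural: each task corresponds to a logical part of the program, and the scheduler is free to overlap them.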

Here's my algorithm. Where's the concurrency?

[Same MPEG decoder diagram as above]

Task decomposition
- Parallelism in the application

Pipeline task decomposition
- Data assembly lines
- Producer-consumer chains

Here's my algorithm. Where's the concurrency?

[Same MPEG decoder diagram as above]

Task decomposition
- Parallelism in the application

Pipeline task decomposition
- Data assembly lines
- Producer-consumer chains

Data decomposition
- The same computation is applied to small data chunks derived from a large data set

Guidelines for Task Decomposition

- Algorithms start with a good understanding of the problem being solved
- Programs often naturally decompose into tasks
  - Two common decompositions are function calls and distinct loop iterations
- It is easier to start with many tasks and later fuse them than to start with too few tasks and later try to split them

Guidelines for Task Decomposition

- Flexibility
  - The program design should afford flexibility in the number and size of tasks generated
  - Tasks should not be tied to a specific architecture
  - Fixed tasks vs. parameterized tasks
- Efficiency
  - Tasks should have enough work to amortize the cost of creating and managing them
  - Tasks should be sufficiently independent that managing dependencies doesn't become the bottleneck
- Simplicity
  - The code has to remain readable and easy to understand and debug

Case for Pipeline Decomposition

- Data is flowing through a sequence of stages (ZigZag → IQuantization → IDCT → Saturation)
  - An assembly line is a good analogy
- What's a prime example of pipeline decomposition in computer architecture?
  - The instruction pipeline in modern CPUs
- What's an example pipeline you may use in your UNIX shell?
  - Pipes in UNIX: cat foobar.c | grep bar | wc
- Other examples
  - Signal processing
  - Graphics
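The shell pipeline above can be mimicked with chained Python generators; each stage consumes items from its predecessor and passes results downstream. This is a minimal single-threaded sketch of the stage structure (a real pipeline decomposition would run stages concurrently, e.g. with queues and threads); the stage names mirror the UNIX tools:

```python
# Each stage consumes from the previous one, like `cat f | grep bar | wc`.
def cat(lines):
    for line in lines:
        yield line

def grep(pattern, lines):
    for line in lines:
        if pattern in line:
            yield line

def wc(lines):
    return sum(1 for _ in lines)

source = ["int bar;", "int baz;", "bar += 1;"]
print(wc(grep("bar", cat(source))))  # 2
```

Because every stage only touches the item in flight, the stages could be mapped to separate units of execution connected by bounded queues without changing the logic.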

Guidelines for Data Decomposition

- Data decomposition is often implied by task decomposition
- Programmers need to address both task and data decomposition to create a parallel program
  - Which decomposition to start with?
- Data decomposition is a good starting point when
  - The main computation is organized around manipulation of a large data structure
  - Similar operations are applied to different parts of the data structure

Common Data Decompositions

Geometric data structures
- Decomposition of arrays along rows, columns, blocks
- Decomposition of meshes into domains

Common Data Decompositions

Geometric data structures
- Decomposition of arrays along rows, columns, blocks
- Decomposition of meshes into domains

Recursive data structures
- Example: decomposition of trees into sub-trees

[Diagram: problem → split → split, split → compute × 4 → merge, merge → merge → solution]

Guidelines for Data Decomposition

- Flexibility
  - The size and number of data chunks should support a wide range of executions
- Efficiency
  - Data chunks should generate comparable amounts of work (for load balancing)
- Simplicity
  - Complex data decompositions can get difficult to manage and debug
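A geometric block decomposition that meets the flexibility and load-balance guidelines above can be sketched in a few lines: split an index range into a parameterized number of contiguous, near-equal chunks (the helper name `block_ranges` is my own):

```python
def block_ranges(n, num_chunks):
    """Split indices 0..n-1 into num_chunks contiguous blocks whose
    sizes differ by at most one element (for load balancing)."""
    base, extra = divmod(n, num_chunks)
    ranges, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges

print(block_ranges(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Because the chunk count is a parameter rather than a constant, the same decomposition adapts to different processor counts, which is exactly the "parameterized tasks" flexibility the guidelines call for.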

Data Decomposition Examples

Molecular dynamics
- Compute forces
- Update accelerations and velocities
- Update positions

Decomposition
- The baseline algorithm is N^2: all-to-all communication
- The best decomposition is to treat molecules as a set
- Some advantages to geometric decomposition, discussed in a future lecture

Data Decomposition Examples

- Molecular dynamics: geometric decomposition
- Merge sort: recursive decomposition

[Diagram: problem → split → split, split → compute × 4 → merge, merge → merge → solution]
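The split/compute/merge structure of the merge-sort example can be sketched directly: each recursive split becomes an independent task up to a cutoff depth. This is a minimal illustration, not a tuned implementation; the depth cutoff and worker count are arbitrary choices of mine:

```python
from concurrent.futures import ThreadPoolExecutor

def merge(left, right):
    """Merge step: combine two sorted lists into one."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(data, executor, depth=2):
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    if depth > 0:
        # Split: each half is an independent sub-tree (a task).
        lf = executor.submit(merge_sort, data[:mid], executor, depth - 1)
        rf = executor.submit(merge_sort, data[mid:], executor, depth - 1)
        return merge(lf.result(), rf.result())
    # Below the cutoff, compute sequentially to keep tasks coarse.
    return merge(merge_sort(data[:mid], executor, 0),
                 merge_sort(data[mid:], executor, 0))

with ThreadPoolExecutor(max_workers=4) as ex:
    print(merge_sort([5, 3, 8, 1, 9, 2], ex))  # [1, 2, 3, 5, 8, 9]
```

The depth cutoff embodies the efficiency guideline: tasks must be large enough to amortize their creation cost.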

Dependence Analysis

Given two tasks, how do we determine whether they can safely run in parallel?

Bernstein's Condition

- Ri: the set of memory locations read (input) by task Ti
- Wj: the set of memory locations written (output) by task Tj
- Two tasks T1 and T2 can run in parallel if
  - the input to T1 is not part of the output from T2
  - the input to T2 is not part of the output from T1
  - the outputs from T1 and T2 do not overlap

Example

T1: a = x + y        T2: b = x + z

R1 = {x, y}   W1 = {a}
R2 = {x, z}   W2 = {b}

R1 ∩ W2 = ∅,  R2 ∩ W1 = ∅,  W1 ∩ W2 = ∅  →  T1 and T2 can run in parallel
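Bernstein's three conditions translate directly into set intersections, so the check can be written as a one-line predicate over read/write sets (the function name is my own):

```python
def bernstein_parallel(r1, w1, r2, w2):
    """Tasks T1 and T2 can run in parallel iff
    R1 ∩ W2 = ∅, R2 ∩ W1 = ∅, and W1 ∩ W2 = ∅."""
    return not (r1 & w2) and not (r2 & w1) and not (w1 & w2)

# a = x + y  and  b = x + z: only a shared read of x, so they are parallel.
print(bernstein_parallel({'x', 'y'}, {'a'}, {'x', 'z'}, {'b'}))  # True
# a = x + y  and  x = z + 1: T2 writes T1's input x, so they are not.
print(bernstein_parallel({'x', 'y'}, {'a'}, {'z'}, {'x'}))       # False
```

Note that a shared location appearing only in both read sets (like x above) is harmless; conflicts arise only when a write is involved.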

Patterns for Parallelizing Programs

4 Design Spaces

Algorithm Expression
- Finding Concurrency: expose concurrent tasks
- Algorithm Structure: map tasks to processes to exploit the parallel architecture

Software Construction
- Supporting Structures: code and data structuring patterns
- Implementation Mechanisms: low-level mechanisms used to write parallel programs

Source: Patterns for Parallel Programming. Mattson, Sanders, and Massingill (2005).

Algorithm Structure Design Space

- Given a collection of concurrent tasks, what's the next step?
- Map tasks to units of execution (e.g., threads)
- Important considerations
  - The magnitude of the number of execution units the platform will support
  - The cost of sharing information among execution units
  - Avoid the tendency to over-constrain the implementation
    - Work well on the intended platform
    - Flexible enough to easily adapt to different architectures

Major Organizing Principle

- How do we determine the algorithm structure that represents the mapping of tasks to units of execution?
- Concurrency usually implies a major organizing principle
  - Organize by tasks
  - Organize by data decomposition
  - Organize by flow of data

Organize by Tasks?

- Recursive? Yes → Divide and Conquer
- Recursive? No → Task Parallelism