Review: Creating a Parallel Program. Programming for Performance

Slides (C) 2001 Mark D. Hill (adapted from Adve, Falsafi, Lebeck, Reinhardt & Singh) and (C) 2003 J. E. Smith, CS/ECE 757.

Review: Creating a Parallel Program
Parallelization can be done by the programmer, the compiler, the run-time system, or the OS. The steps in creating a parallel program are:
- Decomposition
- Assignment of tasks to processes
- Orchestration
- Mapping

Programming for Performance
- Partitioning
- Granularity
- Communication
- Caches and their effects

Where Do Programs Spend Time?
- Sequential programs: performing real computation, or stalled in the memory system
- Parallel programs: performing real computation, stalled for local memory, stalled for remote memory (communication), synchronizing (due to load imbalance and the synchronization operations themselves), and overhead
- Speedup(p) = time(1) / time(p)
- Amdahl's Law: low concurrency limits speedup (a small sketch follows this slide pair)
- Superlinear speedup is possible (how?)

Partitioning for Performance
- Balance the workload and reduce the time spent at synchronization
- Reduce communication
- Reduce the extra work spent determining and managing the assignment
- These goals are at odds with each other; e.g., communication is minimized by running on a single processor
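
To make the Amdahl's Law point concrete, here is a minimal sketch (not from the slides) that evaluates the speedup bound 1 / (s + (1 - s)/p); the 5% serial fraction is an assumed value for illustration.

```c
#include <stdio.h>

/* Amdahl's Law: if a fraction s of the work is serial, then
 * speedup(p) = 1 / (s + (1 - s) / p), which approaches 1/s as p grows. */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    double s = 0.05;                      /* assumed: 5% of the work is serial */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p = %4d  speedup <= %5.2f\n", p, amdahl_speedup(s, p));
    /* even with 1024 processors the bound stays close to 1 / 0.05 = 20 */
    return 0;
}
```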

Load Balance and Synch. Wait Time
- Basic load-balance equation:
  Speedup_problem(p) <= Sequential work / max work on any processor
- "Work" includes not only computation but also communication and data access
- The work must also be done at the same time (concurrently), not just spread evenly
- Real goal: reduce the time spent waiting at synchronization points, including the implied one at the end of the program

Load Balance and Synch. Wait Time (cont.)
- Identify concurrency
- Manage concurrency: static vs. dynamic
- Choose the granularity of concurrency
- Control serialization and synchronization costs

Data vs. Functional Parallelism
- Data parallelism: the same operations on different data items; leads to SPMD programming
- Functional parallelism (also called control or task parallelism); example: a task pipeline (a series of producers and consumers)
- Hybrids are possible, e.g., a pipeline of data-parallel stages
- Impact on load balancing: functional parallelism is harder to balance because there are relatively few functions, so tasks run longer, and it often requires more software development (consider 2K processors each needing a different functional task)

Managing Concurrency: Load Balance
- Static: an algorithmic mapping of tasks to processes (e.g., the example solver); requires little task-management overhead; works better when the amount of work is predictable; cannot adapt to runtime events and data variations
- Semi-static: the assignment for a phase is determined before that phase, and assignments are reconfigured periodically to restore load balance (e.g., Barnes-Hut, where bodies move among regions)
- Dynamic tasking with a task queue: a centralized queue suffers contention, while distributed queues allow idle processes to steal from other queues
- In general, dynamic methods have more overhead (see the sketch below)
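
A minimal pthreads sketch (not from the slides) contrasting the two extremes above: a static block assignment versus dynamic tasking through a shared counter that acts as a trivial centralized task queue. The array size, chunk size, and per-item work are made-up placeholders.

```c
#include <pthread.h>
#include <stdatomic.h>

#define N        100000
#define NTHREADS 4
#define GRAIN    256              /* assumed chunk size for dynamic tasking */

static double data[N];
static atomic_int next_chunk;     /* shared counter = trivial centralized task queue */

static void do_work(int i) { data[i] += 1.0; }   /* placeholder per-item work */

/* Static assignment: each thread gets a fixed contiguous block.
 * Little overhead, but load imbalance if items cost different amounts. */
static void *static_worker(void *arg)
{
    int id  = (int)(long)arg;
    int per = N / NTHREADS;
    int lo  = id * per;
    int hi  = (id == NTHREADS - 1) ? N : lo + per;
    for (int i = lo; i < hi; i++)
        do_work(i);
    return NULL;
}

/* Dynamic assignment: threads repeatedly grab the next chunk of GRAIN items.
 * Better balance, at the cost of contention on the shared counter. */
static void *dynamic_worker(void *arg)
{
    (void)arg;
    for (;;) {
        int start = atomic_fetch_add(&next_chunk, GRAIN);
        if (start >= N)
            break;
        int end = (start + GRAIN < N) ? start + GRAIN : N;
        for (int i = start; i < end; i++)
            do_work(i);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)          /* static phase */
        pthread_create(&t[i], NULL, static_worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    atomic_store(&next_chunk, 0);                /* dynamic phase */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, dynamic_worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```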

Task Queues
(figure: task-queue organizations; copyright 1999 Morgan Kaufmann Publishers, Inc.)

Dynamic Load Balancing
(figure: dynamic load balancing; copyright 1999 Morgan Kaufmann Publishers, Inc.)

Impact of Task Granularity
- Granularity = the amount of work associated with a task
- Large tasks: load balancing is more difficult, but overhead, contention, and communication are lower
- Small tasks: more load-balancing possibilities, but potentially too much synchronization, too much management overhead, and too much communication (use affinity scheduling to limit the last)

Impact of Synchronization and Serialization
- Synchronization that is too coarse (e.g., barriers instead of point-to-point synchronization) causes poor load balancing
- Too many synchronization operations (e.g., locking each element of an array) are harder to program and add synchronization overhead
- Coarse-grain locking (e.g., one lock for the entire array) serializes access to the array (see the locking sketch below)
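
A small illustrative sketch (not from the slides) of the locking-granularity trade-off mentioned above: one lock for the whole array versus one lock per element. The array size is an assumption.

```c
#include <pthread.h>

#define N 1024

static double a[N];

/* Coarse grain: one lock for the whole array. Few synchronization
 * operations, but every update serializes behind this single lock. */
static pthread_mutex_t array_lock = PTHREAD_MUTEX_INITIALIZER;

static void update_coarse(int i, double v)
{
    pthread_mutex_lock(&array_lock);
    a[i] += v;
    pthread_mutex_unlock(&array_lock);
}

/* Fine grain: one lock per element. Updates to different elements can
 * proceed in parallel, but there are many more lock/unlock operations
 * and many more locks to manage. */
static pthread_mutex_t elem_lock[N];

static void update_fine(int i, double v)
{
    pthread_mutex_lock(&elem_lock[i]);
    a[i] += v;
    pthread_mutex_unlock(&elem_lock[i]);
}

int main(void)
{
    for (int i = 0; i < N; i++)
        pthread_mutex_init(&elem_lock[i], NULL);
    update_coarse(3, 1.0);        /* single-threaded smoke test */
    update_fine(3, 1.0);
    return 0;
}
```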

Example: Task Queue
- Operations: add a task to the queue, search the queue for another task, remove a task from the queue
- Option 1: make the whole sequence one critical section
- Option 2: critical section to add a task; non-critical section to search the queue for another task; critical section to remove the task (if it is still there)
- General guideline: searching (reading) does not have to be in a critical section; updating does
- Programming trick: first check in the non-critical section, then lock and re-verify (sketched below)

Architectural Support for Dynamic Task Stealing
How can the architecture help?
- Communication: support for transferring small amounts of data and for mutual exclusion can make tasks smaller, giving better load balance
- Naming: make it easy to name or access the data associated with a stolen task
- Synchronization: support for point-to-point synchronization gives better load balancing
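
A minimal sketch (not from the slides) of the "check, then lock and re-verify" trick for Option 2. The queue representation, a flat array of pending flags, is a made-up simplification; a production version would also make the unlocked read an atomic load to be race-free under the C11 memory model.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAXTASKS 1024

static bool pending[MAXTASKS];        /* true = task still in the queue */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

static void add_task(int i)
{
    pthread_mutex_lock(&qlock);       /* updates always inside the critical section */
    pending[i] = true;
    pthread_mutex_unlock(&qlock);
}

/* Search outside the critical section (reading only), then lock and
 * re-verify before removing: another process may have taken the task
 * between the unlocked check and the lock acquisition. */
static int claim_task(void)
{
    for (int i = 0; i < MAXTASKS; i++) {
        if (!pending[i])
            continue;                         /* unlocked check: looks available */
        pthread_mutex_lock(&qlock);
        if (pending[i]) {                     /* re-verify while holding the lock */
            pending[i] = false;               /* remove it from the queue */
            pthread_mutex_unlock(&qlock);
            return i;                         /* we now own task i */
        }
        pthread_mutex_unlock(&qlock);         /* someone beat us to it; keep looking */
    }
    return -1;                                /* no task found */
}

int main(void)
{
    add_task(7);
    return claim_task() == 7 ? 0 : 1;         /* single-threaded smoke test */
}
```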

Reducing Inherent Communication
- Inherent communication is the communication required by the parallel algorithm itself
- Communication-to-computation ratio: bytes per unit time, or bytes per instruction
- The ratio is affected by the assignment of tasks to processes: assign heavily communicating tasks to the same process
- Domain decomposition: each partition interacts with its neighbors in space; good for simulations of physical systems
(figure: a data space divided into a 4x4 grid of partitions P0..P15, with the communicated values lying along partition boundaries)

Domain Decomposition, cont'd
- Communication grows with the surface of a partition; computation grows with its volume (a short worked ratio follows)
- The shape of the partitions is application- and architecture-dependent: squares, row blocks, interleaved rows, etc.
(figure: the same data space divided into row-block partitions P0..P15)
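
A short worked version of the surface-to-volume argument, with the grid model (nearest-neighbor communication, as in a 5-point stencil solver) stated as an assumption rather than taken from the slides:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
For an $n \times n$ grid on $p$ processors with nearest-neighbor communication
(e.g., a 5-point stencil solver), square
$\frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}}$ blocks give
\[
  \text{computation} \propto \frac{n^2}{p}, \qquad
  \text{communication} \propto \frac{4n}{\sqrt{p}}, \qquad
  \frac{\text{communication}}{\text{computation}} \propto \frac{4\sqrt{p}}{n},
\]
while row blocks of $n/p$ rows communicate about $2n$ boundary values each,
giving a ratio of $2p/n$: the square partition's ratio falls with problem size
and grows only as $\sqrt{p}$, which is the surface-to-volume effect on the slide.
\end{document}
```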

Speedup Revisited
The load-balance bound

  Speedup(p) <= Sequential work / max over processors (Work)

is refined to account for waiting and communication:

  Speedup(p) <= Sequential work / max over processors (Work + Synch wait time + Communication)

Reducing Extra Work
- Redundant computation: if a node would be idle anyway, let it compute data locally to avoid communication; e.g., at startup every process computes a shared table, versus one process computing it and then communicating it to the others
- Creating processes has a high cost: create them once and manage them within the application
- Adding the extra-work term gives:

  Speedup(p) <= Sequential work / max over processors (Work + Synch wait time + Communication + Extra work)

Multiprocessor as an Extended Memory Hierarchy
Example:
- Computation: 2 instructions per cycle (IPC), i.e., 500 cycles per 1000 instructions
- 1 long (remote) miss per 1000 instructions, at 200 cycles per miss
- Result: 700 cycles per 1000 instructions, a 40% slowdown
(figure: four nodes, each with a processor, cache, memory, and communication assist (CA), connected by an interconnect)

Inherent vs. Artifactual Communication
Artifactual communication comes from:
- Poor allocation of data, causing many accesses to remote nodes
- Unnecessary data included in a transfer
- Unnecessary transfers caused by system granularity, e.g., cache lines larger than the inherent data transfer
- Redundant communication
- Limited capacity for replication
Communication structure also matters: large vs. small messages, burstiness, overlap, and whether the communication patterns match the network structure (mapping)

Cache Memory 101
- Locality plus "smaller hardware is faster" gives the memory hierarchy
- Levels: each level is smaller, faster, and more expensive per byte than the level below
- Inclusive: data found in an upper level is also found in the level below
- Definitions: "upper" means closer to the processor; a block is the minimum unit of data that is present or not present in an upper level; a frame is the hardware (physical) place that holds a block (same size as a block); address = block address + block offset; hit time is the time to access the upper level, including determining whether the access hits
- 3C miss model: compulsory, capacity, conflict; for multiprocessors, add a fourth C: communication misses

Cache Coherent Shared Memory
(figure: processors P1 and P2, each with a cache, connected through an interconnection network to main memory holding location x; time advances down the figure)

Cache Coherent Shared Memory (cont.)
(figure: P1 executes "ld r2, x", bringing a copy of x into its cache)

Cache Coherent Shared Memory (cont.)
(figure: P2 then also executes "ld r2, x", so both caches now hold a copy of x)

Cache Coherent Shared Memory (cont.)
(figure: P1 then executes "add r1, r2, r4" and "st x, r1", writing a new value of x while P2 still caches the old one; this is the coherence problem)

Orchestration for Performance
- Exploit temporal and spatial locality
- Temporal locality affects replication: touching too much data causes capacity misses
- Computation blocking: restructure from a naive computation order to a blocked computation order (see the tiling sketch below)
(figure: naive computation order vs. blocked computation order)
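
A minimal sketch (not from the slides) of computation blocking: a tiled matrix multiply that reuses each loaded tile many times while it is resident in cache. The matrix size and tile size are illustrative assumptions.

```c
#define N 256
#define B 32                         /* assumed tile size, tuned to cache capacity */

static double A[N][N], Bm[N][N], C[N][N];

/* Naive order: streams through whole rows and columns, so once the working
 * set exceeds the cache, each element of Bm is evicted before it is reused. */
static void matmul_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * Bm[k][j];
}

/* Blocked order: work on B x B tiles so the three active tiles fit in cache
 * and each loaded value is reused about B times before eviction. */
static void matmul_blocked(void)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        for (int k = kk; k < kk + B; k++)
                            C[i][j] += A[i][k] * Bm[k][j];
}

int main(void)
{
    matmul_naive();
    matmul_blocked();
    return 0;
}
```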

Spatial Locality
- Three granularities matter: the communication grain, the allocation grain (e.g., pages), and the coherence grain (cache blocks, for cache-coherent shared memory)
- What benefit do you get from a larger block size?
- The potential disadvantage is false sharing: two or more processors access the same cache block but do not share any of the data in it (see the sketch below)

Poor Data Allocation
(figure: a data layout in which elements assigned to different processors fall on the same page and on the same cache block)
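
A minimal sketch (not from the slides) of false sharing: per-thread counters packed into one cache block versus counters padded to separate blocks. The 64-byte coherence grain and the iteration count are assumptions.

```c
#include <pthread.h>

#define NTHREADS 4
#define ITERS    50000000L

/* The packed counters share one cache block, so every increment by one
 * thread invalidates the block in the other threads' caches even though
 * no data is actually shared. Padding each counter to its own line
 * (64 bytes assumed) removes this artifactual communication. */
struct padded_counter {
    long value;
    char pad[64 - sizeof(long)];
};

static long packed[NTHREADS];                   /* falsely shared */
static struct padded_counter padded[NTHREADS];  /* one cache line each */

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (long i = 0; i < ITERS; i++)
        packed[id]++;            /* swap in padded[id].value++ to compare timings */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return (int)padded[0].value;                /* keep both arrays referenced */
}
```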

Data Blocking
(figure: the same data set restructured so that each processor's partition occupies its own pages and cache blocks; a layout sketch follows)

Data Structuring and Performance
(figure: performance of alternative data structurings; copyright 1999 Morgan Kaufmann Publishers, Inc.)
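
A minimal sketch (consistent with the data-blocking idea above, but not taken from the slides) of restructuring a 2D grid so that each process's block is contiguous in memory and therefore aligned with pages and cache blocks. The grid and process-grid dimensions are assumptions.

```c
/* Instead of grid[i][j] in row-major order, where a process's square block is
 * scattered across many pages and straddles cache blocks owned by neighbors,
 * allocate a 4D array block[pi][pj][i][j]: each process's block is one
 * contiguous chunk, so the allocation grain (pages) and coherence grain
 * (cache blocks) line up with the assignment. */
#define N    1024                  /* grid is N x N                        */
#define PDIM 4                     /* processes form a PDIM x PDIM grid    */
#define BLK  (N / PDIM)            /* each process owns a BLK x BLK block  */

static double block[PDIM][PDIM][BLK][BLK];

/* Translate global grid coordinates to the blocked layout. */
static inline double *elem(int gi, int gj)
{
    return &block[gi / BLK][gj / BLK][gi % BLK][gj % BLK];
}

/* Process (pi, pj) sweeps only its own contiguous block. */
static void sweep_own_block(int pi, int pj)
{
    for (int i = 0; i < BLK; i++)
        for (int j = 0; j < BLK; j++)
            block[pi][pj][i][j] *= 0.5;        /* placeholder update */
}

int main(void)
{
    *elem(10, 20) = 1.0;
    sweep_own_block(0, 0);
    return 0;
}
```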

Reducing Communication Cost

  Cost = Frequency x (Overhead + Latency + Transfer size / Bandwidth - Overlap)

- Reduce overhead: send fewer, larger messages; this is easier with message passing (more control over messages) and with regular data access (e.g., send an entire row at a time in the solver)
- Reduce delay (latency): depends on the hardware; use pipelined rather than store-and-forward networks, and reduce the number of hops (generally not considered important today)
- Reduce contention (for bandwidth): resources have nonzero occupancy; contention is difficult to program around and can cause bottlenecks (with under-utilization elsewhere); it arises at endpoints (e.g., memory banks) and in the network (e.g., interconnect links)

Reducing Communication Cost, cont'd
- Hotspots: consider a global sum accumulated at one processor versus combined across a tree of processors (see the tree-sum sketch below)
- Increasing overlap (the Overlap term above): pre-communicate (like prefetching), use non-blocking communication (do something else and come back), or multithread (assign multiple tasks to the same process and switch among them)
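
A minimal pthreads sketch (not from the slides) of the tree-structured global sum: partial sums are combined pairwise in log2(NTHREADS) steps instead of every thread updating one hotspot location. POSIX barriers and a power-of-two thread count are assumed.

```c
#include <pthread.h>

#define NTHREADS 8                  /* assumed power of two */
#define N (1 << 20)

static double data[N];
static double partial[NTHREADS];    /* a tuned version would pad these to avoid false sharing */
static pthread_barrier_t barrier;

/* Tree-based global sum: at each step half as many threads are active,
 * and each active thread adds its neighbor's partial sum to its own. */
static void *sum_worker(void *arg)
{
    int id = (int)(long)arg;
    int chunk = N / NTHREADS;
    double s = 0.0;
    for (int i = id * chunk; i < (id + 1) * chunk; i++)
        s += data[i];
    partial[id] = s;

    for (int stride = 1; stride < NTHREADS; stride *= 2) {
        pthread_barrier_wait(&barrier);          /* wait for the previous level */
        if (id % (2 * stride) == 0)
            partial[id] += partial[id + stride]; /* combine with my neighbor */
    }
    return NULL;                                 /* partial[0] ends up with the total */
}

int main(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, sum_worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;                                    /* result is in partial[0] */
}
```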

Review: Programming for Performance
- Partitioning for performance: identify concurrency; manage concurrency (static or dynamic); choose the granularity of concurrency; control serialization and synchronization costs; reduce communication
- Orchestration for performance: exploit locality; use data and computation blocking; match the system's granularities (page size, cache block size)