Lect. 2: Types of Parallelism

Parallelism in Hardware
- Parallelism in a uniprocessor
  - Pipelining
  - Superscalar, VLIW, etc.
  - SIMD instructions, vector processors, GPUs
- Multiprocessor
  - Symmetric shared-memory multiprocessors
  - Distributed-memory multiprocessors
  - Chip-multiprocessors, a.k.a. multi-cores
- Multicomputers, a.k.a. clusters

Parallelism in Software
- Instruction-level parallelism
- Task-level parallelism
- Data parallelism
- Transaction-level parallelism

Taxonomy of Parallel Computers

According to instruction and data streams (Flynn):
- Single instruction, single data (SISD): the standard uniprocessor
- Single instruction, multiple data streams (SIMD): the same instruction is executed on all processors, with different data
  - E.g., vector processors, SIMD instructions, GPUs
- Multiple instruction, single data streams (MISD): different instructions on the same data
  - E.g., fault-tolerant computers, near-memory computing (Micron Automata processor)
- Multiple instruction, multiple data streams (MIMD): the common multiprocessor
  - Each processor uses its own data and executes its own program
  - Most flexible approach
  - Easier/cheaper to build by putting together off-the-shelf processors
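
To make the SISD/SIMD distinction concrete, here is a small sketch (an illustration added for these notes, not code from the lecture) that contrasts a scalar loop with a hand-vectorized version using x86 SSE intrinsics; the 128-bit _mm_add_ps instruction applies one operation to four floats at once.

    #include <immintrin.h>   /* x86 SSE intrinsics */

    /* SISD style: one addition per instruction. */
    void add_scalar(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* SIMD style: a single instruction operates on four floats at a time
       (n is assumed to be a multiple of 4 to keep the sketch short). */
    void add_simd(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
        }
    }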

Taxonomy of Parallel Computers

According to physical organization of processors and memory:
- Physically centralized memory, uniform memory access (UMA)
  - All memory is allocated at the same distance from all processors
  - Also called symmetric multiprocessors (SMP)
  - Memory bandwidth is fixed and must accommodate all processors; does not scale to a large number of processors
  - Used in CMPs today (single-socket ones)

[Figure: CPUs, each with a private cache, connected through an interconnect to a single shared main memory]

Taxonomy of Parallel Computers

According to physical organization of processors and memory:
- Physically distributed memory, non-uniform memory access (NUMA)
  - A portion of memory is allocated with each processor (node)
  - Accessing local memory is much faster than accessing remote memory
  - If most accesses are to local memory, then overall memory bandwidth increases linearly with the number of processors
  - Used in multi-socket CMPs, e.g., Intel Nehalem

[Figure: nodes, each pairing CPU, cache, and local memory, connected by an interconnect; the two-socket Nehalem-EP example shows four cores per socket with private L1 and L2 caches, a shared inclusive L3 cache, an integrated memory controller (IMC) driving three DDR3 channels per socket, and QPI links between the sockets and to I/O]
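
On Linux, memory placement on a NUMA machine can be controlled explicitly. The sketch below is an added illustration (it assumes the libnuma library and a glibc that provides sched_getcpu; it is not part of the lecture): it allocates a buffer on the node of the calling CPU so that most accesses stay local.

    #define _GNU_SOURCE
    #include <sched.h>     /* sched_getcpu */
    #include <numa.h>      /* libnuma; link with -lnuma */
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return 1;
        }
        /* Find the NUMA node of the CPU we are currently running on. */
        int node = numa_node_of_cpu(sched_getcpu());
        size_t size = 64UL * 1024 * 1024;
        /* Allocate 64 MB physically placed on that node. */
        double *buf = numa_alloc_onnode(size, node);
        /* ... compute on buf: accesses are local to this socket ... */
        numa_free(buf, size);
        return 0;
    }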

Taxonomy of Parallel Computers

According to memory communication model:
- Shared address, or shared memory
  - Processes in different processors can use the same virtual address space
  - Any processor can directly access memory in another processor's node
  - Communication is done through shared memory variables
  - Explicit synchronization with locks and critical sections
  - Arguably easier to program??
- Distributed address, or message passing
  - Processes in different processors use different virtual address spaces
  - Each processor can only directly access memory in its own node
  - Communication is done through explicit messages
  - Synchronization is implicit in the messages
  - Arguably harder to program??
  - Some standard message passing libraries exist (e.g., MPI)

Shared Memory vs. Message Passing

Shared memory:

    Producer (p1)                 Consumer (p2)
    flag = 0;                     flag = 0;
    a = 10;                       while (!flag) { }   /* spin until producer sets flag */
    flag = 1;                     x = a * y;

Message passing:

    Producer (p1)                 Consumer (p2)
    a = 10;                       receive(p1, b, label);
    send(p2, a, label);           x = b * y;
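
The message-passing half of this example maps naturally onto MPI. The following is a minimal sketch added for these notes (the slide's send/receive are generic, so the use of MPI_Send/MPI_Recv and the two-rank layout are my assumptions, not the lecture's code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, a, b, y = 3, label = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                 /* producer (p1) */
            a = 10;
            MPI_Send(&a, 1, MPI_INT, 1, label, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* consumer (p2) */
            MPI_Recv(&b, 1, MPI_INT, 0, label, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("x = %d\n", b * y);   /* x = b * y */
        }
        MPI_Finalize();
        return 0;
    }

Note that synchronization is indeed implicit here: MPI_Recv blocks until the producer's message arrives, whereas the shared-memory version needs the explicit flag (and, on real hardware, appropriate memory ordering).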

Types of Parallelism in Applications

Instruction-level parallelism (ILP):
- Multiple instructions from the same instruction stream can be executed concurrently
- Generated and managed by hardware (superscalar) or by the compiler (VLIW)
- Limited in practice by data and control dependences

Thread-level or task-level parallelism (TLP):
- Multiple threads or instruction sequences from the same application can be executed concurrently
- Generated by the compiler/user and managed by the compiler and the hardware
- Limited in practice by communication/synchronization overheads and by algorithm characteristics
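
As a small added illustration of exposing ILP (not taken from the lecture), loop unrolling with independent accumulators removes the dependence chain through a single sum, so a superscalar core can issue the additions in parallel:

    /* Rolled loop: every addition depends on the previous value of s,
       so the hardware can overlap very little. */
    float sum_rolled(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Unrolled by 4 with independent accumulators: the four additions in the
       loop body have no dependences on each other, exposing ILP
       (n is assumed to be a multiple of 4 to keep the sketch short). */
    float sum_unrolled(const float *a, int n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }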

Types of Parallelism in Applications

Data-level parallelism (DLP):
- Instructions from a single stream operate concurrently on several data
- Limited by non-regular data manipulation patterns and by memory bandwidth

Transaction-level parallelism:
- Multiple threads/processes from different transactions can be executed concurrently
- Limited by concurrency overheads

Example: Equation Solver Kernel

The problem:
- Operate on an (n+2) x (n+2) matrix
- Points on the rim have fixed values
- Inner points are updated as:

    A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

- Updates are in-place, so the top and left neighbours are new values and the bottom and right neighbours are old ones
- Updates occur over multiple sweeps
- Keep the difference between old and new values and stop when the difference for all points is small enough

Example: Equation Solver Kernel

Dependences:
- Computing the new value of a given point requires the new value of the point directly above and to the left
- By transitivity, it requires all points in the sub-matrix in the upper-left corner
- Points along the top-right to bottom-left diagonals can be computed independently
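
This diagonal structure suggests a wavefront traversal. The loop nest below is an added sketch (not code from the lecture, and written in plain C with A[i][j] indexing): it visits the inner points anti-diagonal by anti-diagonal, so all points touched by one inner loop are independent and could be updated in parallel.

    /* One in-place sweep in wavefront order over an (n+2) x (n+2) grid.
       For a point (i,j) on anti-diagonal d = i + j, its "new" neighbours
       (i,j-1) and (i-1,j) lie on diagonal d-1 and are already updated,
       while points on the same diagonal do not depend on each other. */
    void sweep_wavefront(int n, double A[n+2][n+2]) {
        for (int d = 2; d <= 2 * n; d++) {
            int ilo = (d - n > 1) ? d - n : 1;
            int ihi = (d - 1 < n) ? d - 1 : n;
            for (int i = ilo; i <= ihi; i++) {   /* independent across i */
                int j = d - i;
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                 + A[i][j+1] + A[i+1][j]);
            }
        }
    }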

Example: Equation Solver Kernel

ILP version (from sequential code):
- Some machine instructions from each j iteration can occur in parallel
- Branch prediction allows overlap of multiple iterations of the j loop
- Some of the instructions from multiple j iterations can occur in parallel

    while (!done) {
      diff = 0;
      for (i=1; i<=n; i++) {
        for (j=1; j<=n; j++) {
          temp = A[i,j];
          A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
          diff += abs(A[i,j] - temp);
        }
      }
      if (diff/(n*n) < TOL) done = 1;
    }

Example: Equation Solver Kernel

TLP version (shared-memory):

    int mymin = 1 + (pid * n/p);
    int mymax = mymin + n/p - 1;

    while (!done) {
      diff = 0; mydiff = 0;
      for (i=mymin; i<=mymax; i++) {
        for (j=1; j<=n; j++) {
          temp = A[i,j];
          A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
          mydiff += abs(A[i,j] - temp);
        }
      }
      lock(diff_lock);
      diff += mydiff;
      unlock(diff_lock);
      barrier(bar, P);
      if (diff/(n*n) < TOL) done = 1;
      barrier(bar, P);
    }

Example: Equation Solver Kernel

TLP version (shared-memory), for 2 processors:
- Each processor gets a chunk of rows
- E.g., processor 0 gets mymin=1 and mymax=2, and processor 1 gets mymin=3 and mymax=4

    int mymin = 1 + (pid * n/p);
    int mymax = mymin + n/p - 1;

    while (!done) {
      diff = 0; mydiff = 0;
      for (i=mymin; i<=mymax; i++) {
        for (j=1; j<=n; j++) {
          temp = A[i,j];
          A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
          mydiff += abs(A[i,j] - temp);
        }
      }
      ...

Example: Equation Solver Kernel

TLP version (shared-memory):
- All processors can freely access the same data structure A
- Access to diff, however, must be taken in turns (using diff_lock)
- All processors update their own done variable together

    ...
    for (i=mymin; i<=mymax; i++) {
      for (j=1; j<=n; j++) {
        temp = A[i,j];
        A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
        mydiff += abs(A[i,j] - temp);
      }
    }
    lock(diff_lock);
    diff += mydiff;
    unlock(diff_lock);
    barrier(bar, P);
    if (diff/(n*n) < TOL) done = 1;
    barrier(bar, P);
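
The lock(), unlock(), and barrier() calls in these slides are generic primitives. One way they could be realized, sketched here as an added example with POSIX threads (the helper names are illustrative, not the lecture's):

    #include <pthread.h>

    static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t bar;

    /* Call once, before spawning the P worker threads. */
    void sync_init(int P) {
        pthread_barrier_init(&bar, NULL, P);
    }

    void lock_diff(void)    { pthread_mutex_lock(&diff_lock); }    /* lock(diff_lock)   */
    void unlock_diff(void)  { pthread_mutex_unlock(&diff_lock); }  /* unlock(diff_lock) */
    void barrier_all(void)  { pthread_barrier_wait(&bar); }        /* barrier(bar, P)   */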

Types of Speedups and Scaling

Scalability: adding x times more resources to the machine yields close to x times better performance
- Usually the resources are processors (but they can also be memory size or interconnect bandwidth)
- Usually means that with x times more processors we can get ~x times speedup for the same problem
- In other words: how does efficiency (see Lecture 1) hold up as the number of processors increases?

In reality we have different scalability models:
- Problem constrained
- Time constrained

The most appropriate scalability model depends on the user's interests.

Types of Speedups and Scaling

Problem constrained (PC) scaling:
- Problem size is kept fixed
- Wall-clock execution time reduction is the goal
- Number of processors and memory size are increased
- Speedup is then defined as:

    S_PC = Time(1 processor) / Time(p processors)

- Example: a weather simulation that does not complete in reasonable time
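
As an illustrative worked example (the numbers are made up, not from the lecture): if the sequential run takes 1000 s and the same problem takes 160 s on 8 processors, then S_PC = 1000 / 160 = 6.25, which corresponds to a parallel efficiency of 6.25 / 8, i.e. about 78%.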

Types of Speedups and Scaling

Time constrained (TC) scaling:
- Maximum allowable execution time is kept fixed
- Problem size increase is the goal
- Number of processors and memory size are increased
- Speedup is then defined as:

    S_TC = Work(p processors) / Work(1 processor)

- Example: a weather simulation with a refined grid
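
As an illustrative worked example (again with made-up numbers): if, within the same time budget, one processor can sweep a 100 x 100 grid while 8 processors can sweep a 280 x 280 grid, then S_TC = 280^2 / 100^2 = 7.84.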