ECE 669 Parallel Computer Architecture Lecture 19 Processor Design

Overview
- Special features in microprocessors provide support for parallel processing
- Already discussed: bus snooping
- Memory latency becoming worse, so multi-process support important
- Provide for rapid context switches inside the processor
- Support for prefetching
- Directly affects processor utilization

Why are traditional RISCs ill-suited for multiprocessing?
- Cannot handle asynchrony well
  - Traps
  - Context switches
- Cannot deal with pipelined memories (multiple outstanding requests)
- Inadequate support for synchronization
  - E.g., the R2000 (used by SGI) had no synchronization instruction; synchronization had to be memory-mapped

Three major topics
- Pipelining the processor-memory-network path
  - Fast context switching (pipelining via multithreading)
  - Prefetching
- Synchronization
- Messages

Pipelining / Multithreading: resource usage
(Figure: usage timelines for the memory bus and the ALU, with fetch of an instruction or operand on the bus overlapped with execute in the ALU.)
- Overlap memory and ALU usage
- More effective use of resources
- Related mechanisms: prefetch, cache, pipelining in general

RISC issues
- 1 inst/cycle means huge memory bandwidth requirements
  - Caches: a single data cache, or separate I & D caches
- Lots of registers, state
- Pipeline hazards
  - Handled by the compiler, reservation bits, bypass paths, interlocks: more state!
- Other stuff, e.g., register windows: even more state!

Fundamental conflict
- Better single-thread (sequential) performance requires more on-chip state
- More on-chip state makes it harder to handle asynchronous events
  - Traps
  - Context switches
  - Synchronization faults
  - Message arrivals
- But why is this a problem in MPs? It makes pipelining processor-memory-network harder. Consider...

Ignore communication system latency (T = 0)
(Figure: processor issuing network requests and receiving responses, one request per cache-miss interval t.)
- Then the maximum network bandwidth per node limits the maximum processor speed
- Processor and network are matched when the processor request rate (one request per cache-miss interval t, i.e., 1/t) equals the per-node network bandwidth
- If the processor has a higher request rate, it will suffer idle time

Now, include network latency
- Each request suffers T cycles of latency
(Figure: processor idle for T cycles after each request, with requests issued every t cycles.)
- Processor utilization = t / (t + T) = 1 / (1 + mT), where m is the miss rate and t = 1/m is the interval between misses
- For example, with m = 2% and T = 50 cycles, utilization = 1 / (1 + 0.02 * 50) = 50%
- Network bandwidth is also wasted because of lost issue opportunities!
- FIX?

One solution: overlap communication with computation
- Multithread the processor
  - Needs rapid context switch; see HEP, Sparcle
- And/or allow multiple outstanding requests (non-blocking memory)
- With p contexts, run length t, latency T, and context-switch interval Z:
  Processor utilization = pt / (t + T)   if pt < (t + T)
                        = t / (t + Z)    otherwise
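A minimal sketch of this utilization model in C (parameter names p, t, T, Z follow the slide; the numbers in main are illustrative assumptions, not from the lecture):

    #include <stdio.h>

    /* Multithreaded-processor utilization model from the slide:
     * p contexts, run length t cycles between misses, memory latency T,
     * context-switch interval Z. Below saturation (p*t < t + T) the
     * processor still idles waiting on memory; at saturation every run
     * of t cycles pays only the switch cost Z. */
    double utilization(double p, double t, double T, double Z) {
        if (p * t < t + T)
            return p * t / (t + T);   /* too few threads to hide latency */
        else
            return t / (t + Z);       /* latency hidden; only switch cost remains */
    }

    int main(void) {
        /* Example: t = 50-cycle runs, T = 100-cycle latency, Z = 4-cycle switch */
        for (int p = 1; p <= 5; p++)
            printf("p = %d: utilization = %.2f\n", p, utilization(p, 50, 100, 4));
        return 0;
    }

With these assumed numbers, utilization climbs from 0.33 at p = 1 to about 0.93 once three contexts saturate the latency, illustrating why a handful of hardware contexts suffices.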

Caveat!
- Of course, the previous analysis assumed network bandwidth was not a limitation. Consider:
(Figure: processor must wait until the next issue opportunity allowed by the network.)
- Computation speed (processor utilization) limited by network bandwidth
Lessons:
- Multithreading allows full utilization of the network bandwidth
- Processor utilization will reach 1 only if network BW is not a limitation

The same applies to synchronization delays
(Figure: timeline of three processes; Process 1 takes a synchronization fault, the processor switches to Processes 2 and 3, and resumes Process 1 once the fault is satisfied. Without multithreading, the cycles between fault and satisfaction are wasted processor cycles.)

Requirements for latency tolerance (communication or synchronization)
- Processors must switch contexts fast
- Memory system must allow multiple outstanding requests
- Processors must handle traps fast (especially synchronization traps)
But, caution: latency-tolerant processors are no excuse for not exploiting locality and trying to minimize latency. Consider...

Fine multithreading versus block multithreading
Block multithreading:
1. Switch on a cache miss or synchronization fault
2. Long runs between switches because of caches
3. Fewer requests in the network
Fine multithreading:
1. Switch on each memory request
2. Short runs need very fast context switch: minimal processor state, poor single-thread performance
3. Need a huge amount of network bandwidth; need lots of threads

How to implement fast context switches?
(Figure: processor holding only a PC; instructions and data live in memory.)
- Switch by putting a new value into the PC
- Minimizes processor state
- Very poor single-thread performance

How to implement fast context switches?
(Figure: special state memory holding register sets for processes i and j, with a high-bandwidth transfer path to the processor's registers and PC.)
- Dedicate memory to hold state, plus a high-bandwidth path to that state memory
- Is this the best use of expensive off-chip bandwidth?

How to implement fast context switches?
(Figure: processor with on-chip register frames for processes i and j, selected by a frame pointer FP.)
- Include a few (say, 4) register frames, one per process context
- Switch by bumping the FP (frame pointer)
- Switches between 4 processes are fast; otherwise invoke a software loader/unloader
- Sparcle uses SPARC register windows
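A rough software analogy of the frame-pointer scheme, to make the mechanism concrete (the structure and names below are hypothetical; the real mechanism is hardware register frames, not a C array):

    #include <stdint.h>

    #define NFRAMES 4   /* resident hardware register frames */
    #define NREGS   32

    /* One on-chip register frame per resident context. */
    typedef struct {
        uint32_t regs[NREGS];
        uint32_t pc;
    } Frame;

    static Frame frames[NFRAMES];  /* the four resident contexts */
    static int   fp = 0;           /* frame pointer: current context */

    /* A context switch is just bumping the frame pointer: no memory
     * traffic, as long as the next context is one of the NFRAMES
     * resident ones. Otherwise a software loader/unloader would spill
     * a frame to memory and reload another. */
    static inline void switch_context(void) {
        fp = (fp + 1) % NFRAMES;
    }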

How to implement fast context switches?
(Figure: block register files with a wide path to the on-chip data cache.)
- Block register files: fast transfer of registers to the on-chip data cache via a wide path

How to implement fast context switches?
(Figure: registers, PC, and a trap frame between processor and memory.)
- Fast traps are also needed
- Also need dedicated synchronous trap lines: synchronization, cache miss...
- Need trap vector spreading to inline common trap code

Pipelining processor-memory-network: prefetching
(Figure: two instruction timelines for loads of A and B. Without prefetching, each LD stalls the pipeline where its result is needed; with prefetching, the loads are issued early, at cycles 0 and 1 instead of 3 and 8, so memory latency overlaps computation.)
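The same idea survives today as a software prefetch hint. A small sketch assuming a GCC-like compiler (the function, array, and lookahead distance are illustrative; __builtin_prefetch is a real GCC/Clang builtin):

    /* Issue the load for a future element early so its memory latency
     * overlaps with the computation on the current element. */
    void scale(double *a, long n, double k) {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16]);  /* ~16 iterations ahead */
            a[i] *= k;
        }
    }

The lookahead distance (16 here) is a tuning assumption: it should roughly cover the memory latency divided by the work per iteration.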

Synchronization
Key issues:
- What hardware support?
- What to do in software?
Consider atomic update of the bound variable in traveling salesman.

Synchronization
(Figure: processors P contending for a lock L and the shared bound variable in memory.)
Atomic update of bound:
    while (Lock(L) == 1) ;   /* spin */
    read bound
    incr bound
    store bound
    Unlock(L)
Lock(L) is a test-and-set:
    read L
    if (L == 1) return 1;
    else { L = 1; store L; return 0; }
Need a mechanism to lock out other requests to L between the test and the set.
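A minimal working version of this spin lock using C11 atomics, where atomic_flag supplies the hardware test-and-set (the variable names mirror the slide; the increment stands in for the bound update):

    #include <stdatomic.h>

    static atomic_flag L = ATOMIC_FLAG_INIT;  /* the lock variable */
    static int bound = 0;                     /* shared bound variable */

    void update_bound(void) {
        /* Lock(L): atomic test-and-set; spin while it returns 1 (held). */
        while (atomic_flag_test_and_set(&L))
            ;                       /* spin */
        bound = bound + 1;          /* read bound, incr bound, store bound */
        atomic_flag_clear(&L);      /* Unlock(L) */
    }

The atomic test-and-set is exactly the "lock out other requests to L" mechanism the slide asks for: the test and the set happen as one indivisible memory operation.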

- In uniprocessors: raise the interrupt level to maximum to gain uninterrupted access
- In multiprocessors: need an instruction to prevent access to L. Methods:
  - Keep synchronization variables in memory; do not release the bus
  - Keep synchronization variables in the cache; prevent outside invalidations
- Usually, some data fetches can be memory-mapped such that the cache controller locks out other requests

Data-parallel synchronization
- Can also allow the controller to do the update of L, e.g., Sparcle (in the Alewife machine)
(Figure: memory word with a full/empty bit, as in the HEP; the cache controller traps the processor if the f/e bit is 1.)
- Instructions such as ldet / ldent: load, trap if full, set empty

Given a primitive atomic operation, higher forms can be synthesized in software. E.g., producer-consumer on a word D:
- Producer: stf D -- store if f/e = 0 and set f/e = 1; trap otherwise... retry
- Consumer: lde D -- load if f/e = 1 and set f/e = 0; trap otherwise... retry
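A software sketch of these full/empty semantics using C11 atomics, assuming one producer and one consumer (the hardware does each operation in a single instruction; the struct and function names here are illustrative, and "trap" becomes a false return that the caller retries):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* A word D with an emulated full/empty bit. */
    typedef struct {
        atomic_bool full;   /* the f/e bit, initially false (empty) */
        int data;           /* the word D */
    } FEWord;

    /* Producer: stf D -- store only if empty, then mark full. */
    bool stf(FEWord *w, int v) {
        if (atomic_load_explicit(&w->full, memory_order_acquire))
            return false;   /* f/e = 1: caller retries (hardware would trap) */
        w->data = v;
        atomic_store_explicit(&w->full, true, memory_order_release);
        return true;
    }

    /* Consumer: lde D -- load only if full, then mark empty. */
    bool lde(FEWord *w, int *v) {
        if (!atomic_load_explicit(&w->full, memory_order_acquire))
            return false;   /* f/e = 0: caller retries */
        *v = w->data;
        atomic_store_explicit(&w->full, false, memory_order_release);
        return true;
    }

The acquire/release ordering ensures the consumer never observes the full bit before the producer's data write, mirroring what the hardware f/e bit guarantees in one operation.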

Some machines provide massive hardware support for synchronization, e.g., the Ultracomputer and RP3: combining networks.
- Say each processor wants a unique index i from a shared counter L
(Figure: three requests F&A(L,1), F&A(L,1), F&A(L,2) are combined in the network switches; with L = 5 the combined add of 4 updates L to 9 once, and the switches distribute the distinct results 5, 6, and 7 back to the requesters.)
- But the switches become processors: slow, expensive
- Software combining: implement the combining tree in software using a tree data structure, spreading the memory requests across the tree
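Fetch-and-add itself maps directly onto modern atomics. A minimal sketch of the unique-index idiom (no combining network; the sequential main reproduces the slide's example, and under real concurrency the values would still be distinct, just in an unspecified order):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int L = 5;   /* shared counter, initially 5 as in the figure */

    /* F&A(L, n): atomically add n to L and return the old value.
     * Each caller receives a distinct range [old, old + n). */
    int fetch_and_add(atomic_int *v, int n) {
        return atomic_fetch_add(v, n);
    }

    int main(void) {
        printf("%d ", fetch_and_add(&L, 1));   /* 5 */
        printf("%d ", fetch_and_add(&L, 1));   /* 6 */
        printf("%d\n", fetch_and_add(&L, 2));  /* 7 */
        printf("L = %d\n", atomic_load(&L));   /* 9 */
        return 0;
    }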

Summary
- Processor support for parallel processing is growing
- Latency tolerance is supported by fast context switching
  - Also by more advanced software systems
- Maintaining processor utilization is key; it is tied to network performance
- Important to maintain RISC (single-thread) performance
- Even uniprocessors can benefit from fast context switches, e.g., register windows