Multiprocessors and Multithreading


Multiprocessors
Why would you want a multiprocessor? More is better: replicate processors to add performance, rather than designing a single faster processor.

Classifying Multiprocessors
Three useful axes: Flynn taxonomy, interconnection network topology, and programming model.

Flynn Taxonomy
- SISD (Single Instruction, Single Data): uniprocessors.
- SIMD (Single Instruction, Multiple Data): examples include Illiac-IV, CM-2, Nvidia GPUs. Simple programming model, low overhead.
- MIMD (Multiple Instruction, Multiple Data): examples are many; nearly all modern multiprocessors and multicores. Flexible, and can use off-the-shelf microprocessors or microprocessor cores.
- MISD (Multiple Instruction, Single Data): no clear examples.
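The SIMD/MIMD distinction can be sketched in code. This is an illustrative analogy, not from the slides: SIMD applies one operation across a data vector; MIMD runs independent instruction streams (here, a made-up `worker` function with data-dependent control flow) on independent data.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))

# SIMD flavor: a single operation applied elementwise to a vector of data.
simd_result = [x * 2 for x in data]

# MIMD flavor: each worker follows its own code path on its own data.
def worker(x):
    return x * 2 if x % 2 == 0 else x + 100

with ThreadPoolExecutor(max_workers=4) as pool:
    mimd_result = list(pool.map(worker, data))

print(simd_result)  # [0, 2, 4, 6, 8, 10, 12, 14]
print(mimd_result)  # [0, 101, 4, 103, 8, 105, 12, 107]
```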

Interconnection Network Topology
- Bus: all processors (P0..P7) and their caches share one medium. Pros/cons: simple, with natural broadcast, but bandwidth does not scale.
- Network: point-to-point links between processor/memory nodes. Scales better, but communication is more complex.
- UMA (Uniform Memory Access): every processor sees the same latency to all of memory.
- NUMA (Non-Uniform Memory Access): each processor has fast local memory; remote memory is slower. Pros/cons: scales better, but data placement matters.

Programming Model
- Shared memory: every processor can name every address location.
- Message passing: each processor can name only its local memory; communication is through explicit messages. Pros/cons?

Example (shared memory): with i = 47, processors A and B both execute "index = i++". Without mutual exclusion, both may read 47 and one increment is lost. Shared-memory programming therefore requires synchronization, locks (semaphores) and barriers, to provide mutual exclusion and prevent race conditions.

Exercise: find the max of 100,000 integers on 10 processors.
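The slides' exercise (max of 100,000 integers on 10 processors) can be sketched in the shared-memory model. This is my sketch, not the slides' code; names like `NUM_WORKERS` are illustrative. Each worker scans a private slice with no synchronization, and a lock makes the update of the shared result a critical section, avoiding exactly the lost-update race in the "i++" example above.

```python
import random
import threading

NUM_WORKERS = 10
data = [random.randrange(1_000_000) for _ in range(100_000)]

global_max = float("-inf")   # shared location, named by every worker
lock = threading.Lock()

def scan(chunk):
    global global_max
    local_max = max(chunk)   # private data: no synchronization needed
    with lock:               # critical section: one writer at a time
        if local_max > global_max:
            global_max = local_max

chunk = len(data) // NUM_WORKERS
threads = [threading.Thread(target=scan,
                            args=(data[i * chunk:(i + 1) * chunk],))
           for i in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()

print(global_max == max(data))  # True
```

Note the design choice: synchronizing once per worker (on the combine step) rather than once per element keeps lock traffic negligible.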

Multiprocessor Caches (Shared Memory)
The problem: cache coherency. Each processor's cache may hold its own copy of a shared location. The solution?

What Does Coherence Mean?
Informally: any read must return the most recent write. That is too strict and very difficult to implement. Better:
- A processor sees its own writes to a location in the correct order.
- Any write must eventually be seen by a read.
- All writes are seen in order ("serialization"): writes to the same location are seen in the same order by all processors.
Without these guarantees, synchronization doesn't work.

Potential Solutions
Snooping solution (snoopy bus): send all requests for unknown data to all processors; caches snoop to see if they have a copy and respond accordingly. Requires broadcast, since the caching information is at the processors. Works well with a bus (a natural broadcast medium) and dominates for small-scale machines (most of the market).

Directory-based schemes: keep track of what is being shared in one centralized place. Distributed memory => distributed directory (avoids bottlenecks). Send point-to-point requests to processors. Scales better than snooping; actually existed BEFORE snooping-based schemes.

Basic Snoopy Protocols
- Write-update: on each write, each cache holding that location updates its value. Potentially requires a LOT of messages.
- Write-invalidate (most common): on each write, each cache holding that location invalidates the cache line.
Both schemes are MUCH easier on a bus-based multiprocessor.
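The two snoopy policies can be contrasted with a toy model (my sketch, not the slides' code): on a write, write-invalidate drops every other cache's copy, while write-update pushes the new value into them.

```python
class Cache:
    def __init__(self):
        self.lines = {}              # address -> cached value

def bus_write(caches, writer, addr, value, policy):
    """Perform a write and let the other caches snoop it."""
    caches[writer].lines[addr] = value
    for i, c in enumerate(caches):
        if i == writer or addr not in c.lines:
            continue
        if policy == "invalidate":
            del c.lines[addr]        # other copies become invalid
        else:                        # "update"
            c.lines[addr] = value    # other copies get the new value

caches = [Cache() for _ in range(3)]
for c in caches:
    c.lines[0x40] = 1                # everyone starts with a shared copy

bus_write(caches, 0, 0x40, 2, "invalidate")
print([c.lines.get(0x40) for c in caches])   # [2, None, None]

for c in caches:
    c.lines[0x40] = 2                # re-share the line
bus_write(caches, 0, 0x40, 3, "update")
print([c.lines.get(0x40) for c in caches])   # [3, 3, 3]
```

The model makes the tradeoff visible: invalidate sends one small message and later readers miss; update moves data on every write.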

Basic Snoopy Protocols: Write-Invalidate versus Broadcast (Write-Update)
- Invalidate requires one transaction per write-run, and exploits spatial locality: one transaction per block.
- Broadcast has lower latency between write and read.
- Broadcast trades increased bandwidth for decreased latency.

An Example Snoopy Protocol
An invalidation protocol with write-back caches. Each block of memory is in one state:
- Clean in all caches and up-to-date in memory;
- Dirty in exactly one cache;
- Not in any caches.
Each cache block is in one state:
- (S)hared: block can be read;
- (E)xclusive: this cache has the only copy; it is writeable and dirty;
- (I)nvalid: block contains no data.
Read misses cause all caches to snoop. Writes to a clean line are treated as misses.

Snoopy-Cache State Machine
Transitions on CPU requests:
- Invalid, CPU read: place read miss on bus, go to Shared.
- Invalid, CPU write: place write miss on bus, go to Exclusive.
- Shared, CPU read hit: stay Shared. Shared, CPU write: place write miss on bus, go to Exclusive.
- Exclusive, CPU read or write hit: stay Exclusive.
- CPU read miss or write miss to a line held Exclusive: write back the block, place the miss on the bus.
Transitions on bus requests:
- Shared, write miss for this block: go to Invalid.
- Exclusive, write miss for this block: write back the block, abort the memory access, go to Invalid.
- Exclusive, read miss for this block: write back the block, abort the memory access, go to Shared.

Other Protocols (More Common)
- ESI = Exclusive, Shared, Invalid (the protocol above).
- MESI = Modified, Exclusive, Shared, Invalid. Exclusive = private, clean; Modified = private, dirty.
- MOESI = Modified, Owned, Exclusive, Shared, Invalid. Owned = responsible for writing back a shared, dirty line.
How does MESI avoid sending messages across the bus (compared to ESI)? What traffic does MOESI avoid?
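The three-state (E/S/I) machine above can be written as a transition table. This is my simplification, not a full protocol implementation: it tracks only the state of one cache line, and events prefixed `bus_` are requests snooped from other processors. Write-backs and data transfer are not modeled.

```python
# (state, event) -> next state; unlisted pairs leave the state unchanged.
TRANSITIONS = {
    ("I", "cpu_read"):       "S",  # read miss: place read miss on bus
    ("I", "cpu_write"):      "E",  # write miss: place write miss on bus
    ("S", "cpu_read"):       "S",  # read hit
    ("S", "cpu_write"):      "E",  # write to clean line treated as a miss
    ("E", "cpu_read"):       "E",  # read hit
    ("E", "cpu_write"):      "E",  # write hit
    ("S", "bus_write_miss"): "I",  # another core is writing: invalidate
    ("E", "bus_write_miss"): "I",  # write back dirty block, then invalidate
    ("E", "bus_read_miss"):  "S",  # write back; downgrade to shared
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)

state = "I"
for ev in ["cpu_read", "cpu_write", "bus_read_miss", "bus_write_miss"]:
    state = step(state, ev)
print(state)  # I
```

Encoding the protocol as data rather than nested conditionals mirrors how such controllers are specified, and makes it easy to diff ESI against MESI or MOESI by editing the table.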

Cache Coherency
How do you know when an external (not this processor core) read/write occurs?
- Snooping protocols: every cache watches a shared bus.
- Directory protocols: a directory (possibly distributed across the network) tracks which caches hold each block.

Hardware Multithreading
A conventional CPU holds one set of registers and one instruction stream. A multithreaded CPU replicates the register state per thread, so it can hold several instruction streams at once and switch among them.
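A toy directory (my sketch, not from the slides) shows why directory schemes avoid broadcast: the home node records the sharers of each block, so a write generates point-to-point invalidations only to the caches that actually hold a copy.

```python
directory = {}   # block address -> set of processor ids caching it

def dir_read(block, proc):
    """A read miss registers proc as a sharer of the block."""
    directory.setdefault(block, set()).add(proc)

def dir_write(block, proc):
    """A write invalidates all other sharers; returns who must be notified."""
    sharers = directory.setdefault(block, set())
    invalidations = sharers - {proc}   # point-to-point, not broadcast
    directory[block] = {proc}          # writer is now the sole holder
    return invalidations

dir_read(0x80, 0)
dir_read(0x80, 1)
dir_read(0x80, 2)
print(dir_write(0x80, 0))   # {1, 2}
```

With a bus, every cache would have snooped this write; here only processors 1 and 2 receive messages, which is what lets the scheme scale.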

Superscalar Execution
Issue-slot diagrams (time in processor cycles vs. issue slots; figures omitted) show two kinds of waste in a conventional superscalar:
- Horizontal waste: some issue slots in a cycle go unused.
- Vertical waste: entire cycles issue nothing.

Superscalar Execution with Fine-Grain Multithreading
Fine-grain multithreading (threads 1-3 interleaved) fills vertical waste by switching threads every cycle, but each cycle still issues from only one thread, so horizontal waste remains.

Simultaneous Multithreading
SMT (threads 1-5) issues instructions from multiple threads in the same cycle, attacking both horizontal and vertical waste.

SMT Performance
Throughput (instructions per cycle, 0-7) versus number of threads (1-8): the conventional superscalar stays flat, fine-grain multithreading improves modestly, and simultaneous multithreading scales highest of the three (figure omitted).

Goals
We had three primary goals for this architecture:
1. Minimize the architectural impact on conventional superscalar design.
2. Minimize the performance impact on a single thread.
3. Achieve significant throughput gains with many threads.

A Conventional Superscalar Architecture
Fetch unit and instruction cache feed an 8-wide decode and register renaming stage, then separate floating-point and integer instruction queues, floating-point and integer register files, and functional units (the integer units also handle load/store), backed by the data cache. Fetches up to 8 instructions per cycle; out-of-order, speculative execution; issues 3 floating-point and 6 integer instructions per cycle.

An SMT Architecture
The SMT pipeline is nearly identical: the same fetch unit, caches, 8-wide decode, register renaming, instruction queues, register files, and functional units, now shared by all threads.

Performance of the Naïve Design
Throughput (instructions per cycle) versus number of threads (2-8) for the naïve SMT design compared with the unmodified superscalar (figure omitted).

Bottlenecks of the Baseline Architecture
- Instruction queue full conditions (12-21% of cycles): lack of parallelism in the queue.
- Fetch throughput: only 4.2 instructions per cycle even when the queue is not full.

Improving Fetch Throughput
The fetch unit in an SMT architecture has two distinct advantages over a conventional architecture: it can fetch from multiple threads at once, and it can choose which threads to fetch.

Improved Fetch Performance
- Fetching from 2 threads/cycle achieved most of the performance available from multiple-thread fetch.
- Fetching from the thread(s) with the fewest unissued instructions in flight significantly increases parallelism and throughput (the ICOUNT fetch policy).
Instructions per cycle versus number of threads (1-8): the improved design outperforms both the baseline and the unmodified superscalar (figure omitted).
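The ICOUNT heuristic described above is simple enough to sketch directly: each cycle, fetch from the threads with the fewest unissued instructions in flight. The thread names and counts below are made up for illustration.

```python
def icount_pick(inflight, n=2):
    """Return the ids of the n threads with the fewest unissued
    instructions in flight; those threads are fetched this cycle."""
    return sorted(inflight, key=inflight.get)[:n]

# Hypothetical per-thread counts of unissued instructions in the queue.
inflight = {"T0": 12, "T1": 3, "T2": 7, "T3": 5}
print(icount_pick(inflight))  # ['T1', 'T3']
```

The intuition matches the slide: threads with few instructions in the queue are making progress and not clogging it, so giving them fetch slots keeps the queue full of issuable work.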

This SMT Architecture, Then:
- Borrows heavily from conventional superscalar design.
- Minimizes the impact on single-thread performance.
- Achieves significant throughput gains over the superscalar (2.5x, up to 5.4 instructions per cycle).

Multithreaded Processors
- Coarse-grain multithreading (Alewife, MIT): context switch at long-latency operations (cache misses).
- Fine-grain multithreading (Tera supercomputer, Sun Niagara T1): context switch every cycle.
- Simultaneous multithreading (SMT): execute instructions from multiple threads in the same cycle. SMT differs from fine-grain multithreading only in the context of superscalar execution, and requires surprisingly few changes to a conventional out-of-order superscalar processor.
SMT was announced for the next Compaq Alpha processor (21464), but that processor was never completed. It was introduced in the Intel Pentium 4, announced as Hyper-Threading Technology (HT Technology). IBM POWER5 and POWER6 have 2 cores, each 2-way SMT. Nehalem multicore has 2 threads/core.

Multi-Core Processors (aka Chip Multiprocessors)
Multiple cores on the same die, which may or may not share the L2 cache. Intel and AMD both have quad-core processors. Sun Niagara T2 is 8 cores x 8 threads (64 contexts!). Everyone's roadmap seems to be increasingly multi-core.

Why Multicore, Multithreaded? Performance versus number of threads (figure omitted).

Intel Nehalem
First instantiation: Intel Core i7.

Intel Nehalem, Summary
- Up to 8 cores (i7: 4 cores), 2 SMT threads per core.
- 20+ stage pipeline; x86 instructions translated to RISC-like uops.
- Superscalar: 4 instructions (uops) per cycle, more with fusing.
Caches (i7):
- 32 KB 4-way set-associative I-cache per core.
- 32 KB 8-way set-associative D-cache per core.
- 256 KB unified 8-way set-associative L2 cache per core.
- 8 MB shared 16-way set-associative L3 cache.

Multiprocessors -- Key Points
- Network vs. bus; message passing vs. shared memory.
- Shared memory is more intuitive, but creates problems for both the programmer (memory consistency, requiring synchronization) and the architect (cache coherency).
- Multithreading gives the illusion of multiprocessing (including, in many cases, the performance) with very little additional hardware.
- When multiprocessing happens within a single die/processor, we call that a chip multiprocessor, or a multi-core architecture.