Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University]

Getting More Performance. Keep pushing IPC and/or frequency? Possible, but too costly: design complexity (time to market), cooling (cost), power delivery (cost).

Bridging the Gap. Power has been growing exponentially as well. [Figure: Watts/IPC (log scale, 1 to 100) for Single-Issue Pipelined, Superscalar Out-of-Order (today), and a hypothetical aggressive Superscalar Out-of-Order design] Diminishing returns w.r.t. a larger instruction window and higher issue width limit this approach.

Higher Complexity Not Worth the Effort. [Figure: performance vs. design effort for Scalar In-Order, Moderate-Pipe Superscalar/OOO, and Very-Deep-Pipe Aggressive Superscalar/OOO] It made sense to go Superscalar/OOO; beyond that there is very little gain for substantial effort.

User Visible/Invisible. All performance gains up to this point were free: no user intervention required (beyond buying the new chip), though recompilation/rewriting could provide even more benefit. Higher frequency and higher IPC came from the same ISA with a different micro-architecture. Multiprocessing pushes parallelism above the ISA: coarse-grained parallelism, provided as multiple processing elements. The user (or developer) is responsible for finding parallelism and decides how to use the resources.

Sources of (Coarse) Parallelism. Different applications: an MP3 player in the background while you work in Office; other background tasks such as the OS/kernel and virus checking. Piped applications: gunzip -c foo.gz | grep bar | perl some-script.pl. Threads within the same application: Java (scheduling, GC, etc.), or explicitly coded multi-threading (pthreads, MPI, etc.).
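
A minimal sketch of the last point, explicitly coded multi-threading with pthreads; the work function and the choice of four threads are made up for illustration.

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical per-thread work; each software thread can run on its own CPU/core. */
    static void *work_item(void *arg) {
        long id = (long)arg;
        printf("thread %ld running\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t tid[4];
        for (long i = 0; i < 4; i++)              /* spawn four software threads */
            pthread_create(&tid[i], NULL, work_item, (void *)i);
        for (int i = 0; i < 4; i++)               /* wait for all of them */
            pthread_join(tid[i], NULL);
        return 0;
    }

Compile with -pthread; the OS scheduler is then free to place the four threads on separate processing elements.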

Encountering Amdahl's Law. The speedup due to an enhancement E is Speedup(E) = ExTime(without E) / ExTime(with E). Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected. Then ExTime(with E) = ExTime(without E) × ((1-F) + F/S), so Speedup(E) = 1 / ((1-F) + F/S).

Examples: Amdahl's Law. Speedup(E) = 1 / ((1-F) + F/S). Consider an enhancement which runs 20 times faster but which is only usable 25% of the time: Speedup(E) = 1/(0.75 + 0.25/20) = 1.31. What if it's usable only 15% of the time? Speedup(E) = 1/(0.85 + 0.15/20) = 1.17. Amdahl's Law tells us that to achieve linear speedup with 100 processors, essentially none of the original computation can remain serial (scalar): to get a speedup of 99 from 100 processors, the percentage of the original program that could stay serial would have to be about 0.01% or less.
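
A short numeric check of these figures, using a helper that just evaluates the formula above; the function name amdahl_speedup is mine.

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - F) + F/S) */
    static double amdahl_speedup(double F, double S) {
        return 1.0 / ((1.0 - F) + F / S);
    }

    int main(void) {
        printf("F=0.25,   S=20  -> %.2f\n", amdahl_speedup(0.25, 20.0));     /* ~1.31 */
        printf("F=0.15,   S=20  -> %.2f\n", amdahl_speedup(0.15, 20.0));     /* ~1.17 */
        printf("F=0.9999, S=100 -> %.2f\n", amdahl_speedup(0.9999, 100.0));  /* ~99   */
        return 0;
    }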

Flynn's Classification Scheme. SISD (single instruction, single data stream): aka a uniprocessor, what we have been talking about all semester. SIMD (single instruction, multiple data streams): a single control unit broadcasting operations to multiple datapaths. MISD (multiple instruction, single data stream): no such machine. MIMD (multiple instructions, multiple data streams): aka multiprocessors and multicomputers (SMPs, clusters).

SMP Machines. SMP = Symmetric Multi-Processing; symmetric = all CPUs have equal access to memory. The OS sees multiple CPUs and runs one process (or thread) on each CPU. [Figure: CPU 0 through CPU 3 sharing memory]
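
A tiny illustration of "OS sees multiple CPUs": on Linux/glibc (an assumption about the platform), the CPU count the OS exposes can be queried with sysconf.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Number of CPUs the OS currently sees online (sockets x cores x threads). */
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        printf("OS sees %ld CPUs\n", ncpus);
        return 0;
    }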

Multiprocessing Workload Benefits. [Figure: runtime of Task A and Task B on one 3-wide OOO CPU, one 4-wide OOO CPU, two 3-wide OOO CPUs, and two 2-wide OOO CPUs; the two-CPU configurations show a clear benefit] Assumes you have multiple tasks/programs to run.

If Only One Task Is Available. [Figure: runtime of Task A on one 3-wide OOO CPU, one 4-wide OOO CPU, two 3-wide OOO CPUs (no benefit over 1 CPU; the second CPU sits idle), and two 2-wide OOO CPUs (performance degradation!)]

Benefit of Multiprocessing Depends on Workload. There is only a limited number of parallel tasks to run, and adding more CPUs than tasks provides zero benefit. For parallel code, Amdahl's law curbs speedup. [Figure: runtime on 1, 2, 3, and 4 CPUs; only the parallelizable portion shrinks]

Hardware Modifications for SMP. Processor: memory interface. Motherboard: multiple sockets (one per CPU) and datapaths between CPUs and memory. Other: case, larger (bigger motherboard, better airflow); power, a bigger power supply for N CPUs; cooling, more fans to remove N CPUs' worth of heat.

Chip-Multiprocessing (CMP). Simply SMP on the same chip. The CPUs are now called cores by hardware designers; OS designers still call them CPUs. [Figures: Intel Smithfield block diagram; AMD dual-core Athlon FX]

On-chip Interconnects (1/5). Today, (CPU + L1 + L2) = core and (L3 + I/O + memory) = uncore. How to interconnect multiple cores to the uncore? Possible topologies: bus, crossbar, ring, mesh, torus. [Figure: cores connected by a bus to the LLC and memory controller]

On-chip Interconnects (2/5). Possible topologies: bus, crossbar, ring, mesh, torus. [Figure: crossbar connecting cores to LLC banks (Bank 0-3) and the memory controller; example: Oracle UltraSPARC T5 (3.6GHz, 16 cores, 8 threads per core)]

On-chip Interconnects (3/5). Possible topologies: bus, crossbar, ring, mesh, torus. Ring: 3 ports per switch, simple and cheap, and it can be bi-directional to reduce latency. [Figure: ring connecting cores, LLC banks (Bank 0-3), and the memory controller; example: Intel Sandy Bridge (3.5GHz, 6 cores, 2 threads per core)]

On-chip Interconnects (4/5). Possible topologies: bus, crossbar, ring, mesh, torus. Mesh: up to 5 ports per switch; the tiled organization combines core and cache. [Figure: mesh of tiles with LLC banks (Bank 0-7) and the memory controller; example: Tilera Tile64 (866MHz, 64 cores)]

On-chip Interconnects (5/5). Possible topologies: bus, crossbar, ring, mesh, torus. Torus: 5 ports per switch; it can be folded to avoid long links. [Figure: torus connecting tiles, LLC banks (Bank 0-7), and the memory controller]
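
A small sketch (not from the slides) contrasting how far a message travels on two of the topologies above; the node numbering and the helper names ring_hops and mesh_hops are assumptions for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hops between node a and node b on an n-node ring; a bi-directional
       ring can take the shorter direction, which is what reduces latency. */
    static int ring_hops(int a, int b, int n) {
        int d = abs(a - b);
        return d < n - d ? d : n - d;
    }

    /* Hops on a w-wide mesh with row-major node numbering (Manhattan distance). */
    static int mesh_hops(int a, int b, int w) {
        return abs(a % w - b % w) + abs(a / w - b / w);
    }

    int main(void) {
        printf("ring of 8, node 0 -> 5: %d hops\n", ring_hops(0, 5, 8));   /* 3 */
        printf("4x4 mesh, node 0 -> 15: %d hops\n", mesh_hops(0, 15, 4));  /* 6 */
        return 0;
    }

The bi-directional ring case is why the ring slide notes that latency can be reduced.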

Benefits of CMP. Cheaper than multi-chip SMP: all or most interface logic is integrated on chip, so fewer chips, a single CPU socket, and a single interface to memory. Less power than multi-chip SMP: communication on die uses less power than chip-to-chip. Efficiency: use the transistors for more cores instead of a wider or more aggressive OoO core, which is potentially a better use of hardware resources.

Multi-Threading. Uni-processor: 4-6 wide, lucky if you get 1-2 IPC; poor utilization of transistors. SMP: 2-4 CPUs, but you need independent threads; poor utilization as well (if tasks are limited). {Coarse-Grained, Fine-Grained, Simultaneous}-MT: use a single large uni-processor as a multi-processor by providing multiple hardware contexts (threads), i.e., a per-thread PC and a per-thread ARF (or map table). Each core appears as multiple CPUs; OS designers still call these CPUs.
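
A rough sketch (my own, not from the slides) of the per-thread state such a core duplicates; the register count and the number of contexts are arbitrary.

    #include <stdio.h>

    #define NUM_ARCH_REGS   32
    #define NUM_HW_CONTEXTS  4

    /* State duplicated per hardware thread: a PC plus an architectural
       register file (an SMT design would instead keep a per-thread rename
       map table into one shared physical register file). */
    struct hw_thread_context {
        unsigned long pc;
        unsigned long arf[NUM_ARCH_REGS];
    };

    /* Everything else (caches, functional units, etc.) is shared. */
    struct multithreaded_core {
        struct hw_thread_context ctx[NUM_HW_CONTEXTS];
    };

    int main(void) {
        printf("per-core thread state: %zu bytes\n", sizeof(struct multithreaded_core));
        return 0;
    }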

Scalar Pipeline. [Figure: functional unit occupancy over time for a scalar pipeline] Dependencies limit functional unit utilization.

Superscalar Pipeline. [Figure: functional unit occupancy over time for a superscalar pipeline] Higher performance than scalar, but lower utilization.

Chip Multiprocessing (CMP). [Figure: functional unit occupancy over time across multiple cores] Limited utilization when running one thread.

Coarse-Grained Multithreading (1/3). [Figure: functional unit occupancy over time, with hardware context switches between threads] Only good for long-latency ops (e.g., cache misses).

Coarse-Grained Multithreading (2/3). + Sacrifices only a little single-thread performance. - Tolerates only long latencies (e.g., L2 misses). Thread scheduling policy: designate a preferred thread (e.g., thread A); switch to thread B on a thread A L2 miss; switch back to A when A's L2 miss returns. Pipeline partitioning: none, flush on a switch, so it can't tolerate latencies shorter than twice the pipeline depth and needs a short in-order pipeline for good performance.
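
A toy sketch of the switch-on-miss policy just described; the hook functions (a_l2_miss_outstanding, flush_pipeline, fetch_from) are stubs I made up to keep the example self-contained.

    #include <stdio.h>

    enum thread_id { THREAD_A, THREAD_B };

    /* Hypothetical pipeline hooks, stubbed out for illustration. */
    static int  a_l2_miss_outstanding(void) { return 0; }  /* 1 while A waits on an L2 miss */
    static void flush_pipeline(void)        { }            /* no partitioning: flush on switch */
    static void fetch_from(enum thread_id t){ printf("fetch thread %d\n", (int)t); }

    /* One cycle of the policy: run preferred thread A, switch to B on A's
       L2 miss, switch back as soon as the miss data returns. */
    static void coarse_grained_cycle(void) {
        static enum thread_id current = THREAD_A;
        enum thread_id next = a_l2_miss_outstanding() ? THREAD_B : THREAD_A;
        if (next != current) {
            flush_pipeline();
            current = next;
        }
        fetch_from(current);
    }

    int main(void) { coarse_grained_cycle(); return 0; }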

Coarse-Grained Multithreading (3/3). [Figure: the original pipeline vs. a coarse-grained MT pipeline with a thread scheduler and one register file per thread; the scheduler switches threads on an L2 miss]

Fine-Grained Multithreading (1/3). [Figure: functional unit occupancy over time for a saturated workload (lots of threads) and an unsaturated workload (lots of stalls)] Intra-thread dependencies still limit performance.

Fine-Grained Multithreading (2/3). - Sacrifices significant single-thread performance. + Tolerates everything: L2 misses, mispredicted branches, etc. Thread scheduling policy: switch threads often (e.g., every cycle), using a round-robin policy that skips threads with long-latency ops in flight. Pipeline partitioning: dynamic, no flushing; the length of the pipeline doesn't matter.
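
A minimal sketch of that round-robin, skip-stalled-threads selection; the stalled[] flags and the thread count are placeholders.

    #include <stdio.h>

    #define NUM_THREADS 4

    /* Hypothetical per-thread stall flags (set while a thread waits on a
       long-latency op such as an L2 miss). */
    static int stalled[NUM_THREADS];

    /* Round-robin thread select, one per cycle, skipping stalled threads;
       returns -1 (a bubble) if every thread is stalled. */
    static int fine_grained_select(void) {
        static int last = NUM_THREADS - 1;
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last + i) % NUM_THREADS;
            if (!stalled[t]) { last = t; return t; }
        }
        return -1;
    }

    int main(void) {
        stalled[1] = 1;                        /* pretend thread 1 is waiting on a miss */
        for (int c = 0; c < 6; c++)
            printf("cycle %d fetches thread %d\n", c, fine_grained_select());
        return 0;
    }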

Fine-Grained Multithreading (3/3). (Many) more threads, with multiple threads in the pipeline at once. [Figure: pipeline with a thread scheduler and one register file per thread]

Simultaneous Multithreading (1/3). [Figure: functional unit occupancy over time, with instructions from several threads issued in the same cycle] Max utilization of functional units.

Simultaneous Multithreading (2/3). + Tolerates all latencies. ± Sacrifices some single-thread performance. Thread scheduling policy: round-robin (like fine-grained MT). Pipeline partitioning: dynamic. Examples: Pentium 4 (Hyper-Threading), 5-way issue, 2 threads; Alpha 21464, 8-way issue, 4 threads (canceled).

Simultaneous Multithreading (3/3). [Figure: the original pipeline with one map table and register file vs. an SMT pipeline with a thread scheduler, per-thread map tables, and a shared register file]

Latency vs. Throughput. MT trades (single-thread) latency for throughput: sharing the processor degrades the latency of individual threads but improves the aggregate latency of both threads and improves utilization. Example: thread A has an individual latency of 10s and a latency of 15s when run with thread B; thread B has an individual latency of 20s and a latency of 25s when run with thread A. Sequential latency (first A then B, or vice versa): 30s. Parallel latency (A and B simultaneously): 25s. MT slows each thread by 5s but improves total latency by 5s. The benefits of MT depend on the workload.

CMP vs. MT. If you wanted to run multiple threads, would you build a chip multiprocessor (CMP, multiple separate pipelines) or a multithreaded processor (MT, a single larger pipeline)? Both will get you throughput on multiple threads. CMP will be simpler, with a possibly faster clock. SMT will get you better performance (IPC) on a single thread: SMT is basically an ILP engine that converts TLP to ILP, while CMP is mainly a TLP engine. Or do both (a CMP of MTs). Example: Sun UltraSPARC T1, 8 cores, each with 4 threads (fine-grained threading), a 1GHz clock, in-order, short pipeline, designed for power-efficient throughput computing.

Combining MP Techniques (1/2). A system can have SMP, CMP, and SMT at the same time. Example: a machine with 32 threads uses a 2-socket SMP motherboard with two chips, each chip an 8-core CMP, where each core is 2-way SMT. This makes life difficult for the OS scheduler, which needs to know which CPUs are real physical processors (SMP: highest independent performance), which are cores in the same chip (fast core-to-core communication, but shared resources), and which are threads in the same core (competing for resources). Distinct apps are scheduled on different CPUs; cooperative apps (e.g., pthreads) are scheduled on the same core; SMT is used as the last choice (or not used for some apps).
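
A Linux-specific sketch of that last scheduling point, pinning two cooperating threads onto two logical CPUs before they start; it assumes glibc's pthread_attr_setaffinity_np, and the CPU numbers 0 and 1 are an assumption about how this machine numbers sibling hardware threads of one core.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        printf("worker pinned to CPU %d\n", *(int *)arg);
        return NULL;
    }

    /* Create a thread whose affinity is fixed to one logical CPU before it runs. */
    static void spawn_pinned(pthread_t *t, int *cpu) {
        pthread_attr_t attr;
        cpu_set_t set;
        pthread_attr_init(&attr);
        CPU_ZERO(&set);
        CPU_SET(*cpu, &set);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(t, &attr, worker, cpu);
        pthread_attr_destroy(&attr);
    }

    int main(void) {
        pthread_t a, b;
        int cpu_a = 0, cpu_b = 1;   /* assumed to be SMT siblings of one core */
        spawn_pinned(&a, &cpu_a);
        spawn_pinned(&b, &cpu_b);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }

On Linux the sysfs CPU topology entries report which logical CPUs actually share a core, so a real scheduler would consult those rather than hard-coding 0 and 1.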

Combining MP Techniques (2/2)

Scalability Beyond the Machine

Server Racks

Datacenters (1/2)

Datacenters (2/2)