
18-742 Spring 2011 Parallel Computer Architecture
Lecture 4: Multi-core
Prof. Onur Mutlu
Carnegie Mellon University

Research Project
- Project proposal due: Jan 31
- Project topics
  - Does everyone have a topic?
  - Does everyone have a partner?
  - Does anyone have too many partners?

Last Lecture
- Programming model vs. architecture
- Message Passing vs. Shared Memory
- Data Parallel
- Dataflow
- Generic Parallel Machine

Readings for Today
- Required:
  - Hill and Marty, "Amdahl's Law in the Multi-Core Era," IEEE Computer 2008.
  - Annavaram et al., "Mitigating Amdahl's Law Through EPI Throttling," ISCA 2005.
  - Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.
  - Ipek et al., "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," ISCA 2007.
- Recommended:
  - Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.
  - Barroso et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," ISCA 2000.
  - Kongetira et al., "Niagara: A 32-Way Multithreaded SPARC Processor," IEEE Micro 2005.
  - Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
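Several of these readings build on Amdahl's Law, so as a quick reminder (background added here, not part of the slide deck): the speedup of a program with parallel fraction f on n cores is speedup(f, n) = 1 / ((1 - f) + f/n). A tiny C program makes the diminishing returns concrete:

/* Amdahl's Law: speedup(f, n) = 1 / ((1 - f) + f / n), where f is the
   fraction of the program that can run in parallel and n is the number
   of cores. (Background example, not from the lecture.) */
#include <stdio.h>

static double amdahl(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    const double fracs[] = { 0.50, 0.90, 0.99 };
    const int cores[] = { 4, 16, 64 };
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("f = %.2f, n = %2d  ->  speedup = %5.2fx\n",
                   fracs[i], cores[j], amdahl(fracs[i], cores[j]));
    /* Note: with f = 0.90 the speedup never exceeds 10x,
       no matter how many cores are added. */
    return 0;
}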

Reviews
- Due Today (Jan 21)
  - Seitz, "The Cosmic Cube," CACM 1985.
  - Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.
- Due Next Tuesday (Jan 25)
  - Papamarcos and Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA 1984.
  - Kelm et al., "Cohesion: a hybrid memory model for accelerators," ISCA 2010.
- Due Next Friday (Jan 28)
  - Suleman et al., "Data Marshaling for Multi-Core Architectures," ISCA 2010.

Recap of Last Lecture

Review: Shared Memory vs. Message Passing
- Loosely coupled multiprocessors
  - No shared global memory address space
  - Multicomputer network; network-based multiprocessors
  - Usually programmed via message passing
    - Explicit calls (send, receive) for communication
- Tightly coupled multiprocessors
  - Shared global memory address space
  - Traditional multiprocessing: symmetric multiprocessing (SMP)
  - Existing multi-core processors, multithreaded processors
  - Programming model similar to uniprocessors (multitasking uniprocessor), except operations on shared data require synchronization
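To make the last point concrete, here is a minimal shared-memory sketch in C with POSIX threads (an illustrative example, not from the lecture): two threads update shared data through ordinary loads and stores, so the read-modify-write must be protected by a lock.

/* Minimal shared-memory sketch (illustrative, not from the lecture).
   Build: gcc counter_demo.c -o counter_demo -pthread */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;  /* shared data in the global address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);   /* synchronization required */
        counter++;                   /* plain load + store on shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 2000000 with the lock held */
    return 0;
}

Without the mutex, the two read-modify-write sequences interleave and the final count typically falls well short of 2,000,000. In a loosely coupled system, the same communication would instead go through explicit send/receive calls.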

Programming Models vs. Architectures
- Five major models
  - Shared memory
  - Message passing
  - Data parallel (SIMD)
  - Dataflow
  - Systolic
- Hybrid models?

Scalability, Convergence, and Some Terminology

Scaling Shared Memory Architectures

Interconnection Schemes for Shared Memory
- Scalability dependent on interconnect

UMA/UCA: Uniform Memory or Cache Access

Uniform Memory/Cache Access

Example SMP
- Quad-pack Intel Pentium Pro

How to Scale Shared Memory Machines?
- Two general approaches
- Maintain UMA
  - Provide a scalable interconnect to memory
  - Downside: every memory access incurs the round-trip network latency
- Interconnect complete processors with local memory: NUMA (Non-Uniform Memory Access)
  - Local memory is faster than remote memory
  - Still needs a scalable interconnect for accessing remote memory
    - But it is not on the critical path of local memory access
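NUMA is visible to software. As an illustrative sketch (assuming Linux with libnuma installed; not part of the lecture), a program can pin an allocation to one node, so threads on that node get local-latency accesses while threads elsewhere pay the remote penalty:

/* Illustrative NUMA sketch (assumes Linux + libnuma).
   Build: gcc numa_demo.c -o numa_demo -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t size = 64UL * 1024 * 1024;  /* 64 MB buffer */
    /* Pin the allocation to node 0. Threads scheduled on node 0 get
       local (fast) accesses; threads on other nodes reach this memory
       over the interconnect and pay the remote latency. */
    char *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    for (size_t i = 0; i < size; i += 4096)
        buf[i] = 0;  /* touch each page so it is actually placed */
    printf("placed %zu bytes on node 0 (system has %d node(s))\n",
           size, numa_max_node() + 1);
    numa_free(buf, size);
    return 0;
}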

NUMA/NUCA: Non-Uniform Memory/Cache Access

Example NUMA Machines
- Sun Enterprise Server
- Cray T3E

Convergence of Parallel Architectures
- Scalable shared memory architecture is similar to scalable message passing architecture
  - Main difference: is remote memory accessible with loads/stores?

Historical Evolution: 1960s & 70s

Historical Evolution: 1980s

Historical Evolution: Late 80s, mid 90s

Historical Evolution: Current
- Chip multiprocessors (multi-core)
- Small- to mid-scale multi-socket CMPs
  - One module type: processor + caches + memory
- Clusters/datacenters
  - Use high-performance LANs to connect SMP blades, racks
- Driven by economics and cost
  - Smaller systems => higher volumes
  - Off-the-shelf components
- Driven by applications
  - Many more throughput applications (web servers) than parallel applications (weather prediction)
  - Cloud computing

Historical Evolution: Future
- Cluster/datacenter on a chip?
- Heterogeneous multi-core?
- Bounce back to small-scale multi-core?

Multi-Core Processors

Moore's Law
- Moore, "Cramming more components onto integrated circuits," Electronics, 1965.
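The trend this slide plots is commonly summarized (a rough fit, assuming a doubling period of about two years) as N(t) ≈ N(t₀) · 2^((t − t₀)/2), where N is the number of transistors that can economically be placed on a chip in year t. It is an economic observation, not a law of physics.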

Multi-Core
- Idea: Put multiple processors on the same die
- Technology scaling (Moore's Law) enables more transistors to be placed on the same die area
- What else could you do with the die area you dedicate to multiple processors?
  - Have a bigger, more powerful core
  - Have larger caches in the memory hierarchy
  - Simultaneous multithreading
  - Integrate platform components on chip (e.g., network interface, memory controllers)

Why Multi-Core?
- Alternative: Bigger, more powerful single core
  - Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
+ Improves single-thread performance transparently to the programmer and compiler
- Very difficult to design (scalable algorithms for improving single-thread performance are elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled. Why? (see the note below)
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this are elusive)
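One standard way to see the power cost (the textbook CMOS dynamic-power relation, added here as background): P_dynamic ≈ α · C · V² · f, where α is the activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Scaling up out-of-order structures grows C directly, and keeping a high clock on larger structures pushes V and f as well, so power grows much faster than the resulting single-thread performance.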

Large Superscalar vs. Multi-Core
- Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.

Multi-Core vs. Large Superscalar
- Multi-core advantages
  + Simpler cores: more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
  + Higher system throughput on multiprogrammed workloads: reduced context switches
  + Higher system throughput in parallel applications
- Multi-core disadvantages
  - Requires parallel tasks/threads to improve performance (parallel programming; see the sketch after this slide)
  - Resource sharing can reduce single-thread performance
  - Shared hardware resources need to be managed
  - Number of pins limits data supply for increased demand
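A minimal sketch of what "requires parallel tasks/threads" means in practice (illustrative C/pthreads code; the names and sizes are made up): the work must be explicitly split so each core gets an independent chunk.

/* Illustrative sketch: summing an array by splitting the index range
   across NTHREADS workers. Build: gcc sum_demo.c -o sum_demo -pthread */
#include <pthread.h>
#include <stdio.h>

#define N        (1 << 24)   /* 16M elements */
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];  /* one slot per thread (unpadded; a real
                                     version would pad to avoid false sharing) */

static void *sum_range(void *arg) {
    long id = (long)arg;          /* thread index passed via the arg */
    long chunk = N / NTHREADS;
    double s = 0.0;
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        s += data[i];
    partial[id] = s;              /* private slot: no lock needed */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, sum_range, (void *)id);
    double total = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(t[id], NULL);
        total += partial[id];
    }
    printf("sum = %.0f (expected %d)\n", total, N);
    return 0;
}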

Large Superscalar vs. Multi-Core
- Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.
- Technology push
  - Instruction issue queue size limits the cycle time of the superscalar, OoO processor: diminishing performance
    - Quadratic increase in complexity with issue width (see the note below)
  - Large, multi-ported register files are needed to support large instruction windows and issue widths: reduced frequency or longer register file access, diminishing performance
- Application pull
  - Integer applications: little parallelism?
  - FP applications: abundant loop-level parallelism
  - Others (transaction processing, multiprogramming): CMP is a better fit
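A back-of-the-envelope way to see the quadratic term (added here, not on the slide): wakeup logic compares each window entry's two source tags against every result tag broadcast per cycle, so the comparator count is roughly 2 · W · N for issue width W and window size N. Since a wider-issue machine also needs a proportionally larger window to find enough ready instructions, complexity grows roughly with W².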

Why Multi-Core?
- Alternative: Bigger caches
+ Improves single-thread performance transparently to the programmer and compiler
+ Simple to design
- Diminishing single-thread performance returns from cache size. Why? (see the note below)
- Multiple levels complicate the memory hierarchy
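One common empirical answer (a rule of thumb, not from the lecture): miss rate tends to follow a power law in cache size, roughly MR(S) ≈ MR(S₀) · (S/S₀)^(−1/2), so doubling the cache cuts the miss rate by only about a factor of √2 ≈ 1.4. Each additional megabyte of die area therefore buys less performance than the last, whereas a core placed in the same area can run an entire additional thread.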

Cache vs. Core
[Figure: number of transistors devoted to cache vs. to the microprocessor core, plotted over time]

Why Multi-Core?
- Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance when there is a single thread
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches
- Scalability is limited: need bigger register files and larger issue width (and their associated costs) to have many threads; becomes complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing in the pipeline and memory system reduces both single-thread and parallel application performance