Beyond Instruction-Level Parallelism: Multiple Issue and Static Scheduling (MSc Informatics Eng.)

Computing Systems & Performance: Beyond Instruction-Level Parallelism
MSc Informatics Eng. 2012/13, A.J. Proença
From ILP to Multithreading and Shared Cache (most slides are borrowed)

When exploiting ILP, the goal is to minimize CPI:

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls + Memory stalls

(a worked instance follows below)

Techniques:
- Multiple issue: find enough parallelism to keep the pipeline(s) occupied
- Multithreading: find other ways to keep the pipeline(s) occupied
- Insert data parallelism features (next set of slides)

Multiple Issue
- To achieve CPI < 1, need to complete multiple instructions per clock
- Solutions:
  - statically scheduled superscalar processors
  - VLIW (very long instruction word) processors
  - dynamically scheduled superscalar processors
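As a worked instance of the CPI decomposition above (the stall contributions are invented for illustration, not taken from the slides):

\[
\text{Pipeline CPI} = \underbrace{0.5}_{\text{ideal, 2-issue}} + \underbrace{0.05}_{\text{structural}} + \underbrace{0.20}_{\text{data hazards}} + \underbrace{0.10}_{\text{control}} + \underbrace{0.15}_{\text{memory}} = 1.0
\]

A 2-way issue machine has an ideal CPI of 0.5, yet these hypothetical stall terms drag the achieved CPI back to 1.0: issuing more instructions per clock only pays off when the stalls are attacked as well, which is why the slides pair multiple issue with multithreading and, later, data parallelism.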

Multithreading
- Performing multiple threads of execution in parallel
- Replicate registers, PC, etc.
- Fast switching between threads
- Fine-grain multithreading:
  - switch threads after each cycle
  - interleave instruction execution
  - if one thread stalls, others are executed (see the scheduler sketch below)
- Coarse-grain multithreading:
  - only switch on a long stall (e.g., L2-cache miss)
  - simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

Simultaneous Multithreading (SMT)
- In a multiple-issue dynamically scheduled processor:
  - schedule instructions from multiple threads
  - instructions from independent threads execute when functional units are available
  - within threads, dependencies are handled by scheduling and register renaming
- Example: Intel Pentium 4 HT
  - two threads: duplicated registers, shared functional units and caches

Multithreading Example (figure only in the original slides)

Instruction and Data Streams
- An alternate classification (Flynn):

                               Data stream: Single        Data stream: Multiple
  Instruction stream: Single   SISD: Intel Pentium 4      SIMD: SSE instructions of x86
  Instruction stream: Multiple MISD: no examples today    MIMD: Intel Xeon e5345

- SPMD: Single Program Multiple Data
  - a parallel program on a MIMD computer
  - conditional code for different processors (see the sketch below)
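To make SPMD concrete, here is a minimal sketch in C. MPI is not part of these slides, so its use here is an assumption of the example; every process runs the same binary, and the rank test provides the "conditional code for different processors":

    /* spmd_example.c: one program, many processes (SPMD on a MIMD machine).
       Illustrative sketch; assumes an MPI installation (mpicc, mpirun). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);                  /* same program starts everywhere */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which processor am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many processors in total? */

        if (rank == 0)   /* conditional code: rank 0 plays coordinator */
            printf("coordinator: %d processes, one program\n", nprocs);
        else             /* every other rank runs the worker branch */
            printf("worker %d: processing my slice of the data\n", rank);

        MPI_Finalize();
        return 0;
    }

A typical run would be mpicc spmd_example.c -o spmd followed by mpirun -np 4 ./spmd.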

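Returning to the fine-grain multithreading slide above, the toy scheduler below (entirely invented for illustration; real hardware selects a thread per cycle in the issue stage, not in software) shows the round-robin, skip-stalled policy that keeps the pipeline occupied:

    /* fgmt_sketch.c: toy model of a fine-grain multithreading issue policy. */
    #include <stdio.h>

    #define NTHREADS 3

    /* thread 1 is modelled as stalled, e.g. waiting on a cache miss */
    int stalled[NTHREADS] = { 0, 1, 0 };

    /* round-robin over hardware threads, skipping stalled ones */
    int pick_next(int last) {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (!stalled[t]) return t;
        }
        return -1;  /* every thread stalled: the pipeline bubbles */
    }

    int main(void) {
        int t = NTHREADS - 1;
        for (int cycle = 0; cycle < 6; cycle++) {
            t = pick_next(t);
            if (t < 0) printf("cycle %d: bubble\n", cycle);
            else       printf("cycle %d: issue from thread %d\n", cycle, t);
        }
        return 0;
    }

Coarse-grain multithreading would instead keep issuing from a single thread and consult the stall flag only on a long-latency event such as an L2-cache miss.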
Introduction to multithreading
- Thread-level parallelism: have multiple program counters
- Uses the MIMD model
- Targeted for tightly-coupled shared-memory multiprocessors
- For n processors, need n threads
- Amount of computation assigned to each thread = grain size
- Threads can be used for data-level parallelism, but the overheads may outweigh the benefit

Multiprocessor Types
- Symmetric multiprocessors (SMP):
  - small number of cores
  - share a single memory with uniform memory latency
- Distributed shared memory (DSM):
  - memory distributed among processors
  - non-uniform memory access/latency (NUMA)
  - processors connected via direct (switched) and non-direct (multi-hop) interconnection networks

Cache Coherence Problem
- Suppose two CPU cores share a physical address space, with write-through caches:

  Time step | Event               | CPU A's cache | CPU B's cache | Memory
  0         |                     |               |               | 0
  1         | CPU A reads X       | 0             |               | 0
  2         | CPU B reads X       | 0             | 0             | 0
  3         | CPU A writes 1 to X | 1             | 0             | 1

Cache Coherence
- Coherence:
  - all reads by any processor must return the most recently written value
  - writes to the same location by any two processors are seen in the same order by all processors
- Consistency:
  - determines when a written value will be returned by a read
  - if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A (a sketch of this rule follows below)
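The consistency rule just stated is the classic "message passing" pattern. The sketch below (variable names invented for illustration) uses C11 atomics, whose default sequentially consistent ordering enforces exactly this property:

    /* mp_litmus.c: write A then B; a reader that sees the new B
       must also see the new A. Compile with -pthread. */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int data = 0;   /* "location A" */
    atomic_int flag = 0;   /* "location B" */

    void *writer(void *arg) {
        atomic_store(&data, 42);  /* write A ... */
        atomic_store(&flag, 1);   /* ... then write B */
        return NULL;
    }

    void *reader(void *arg) {
        while (atomic_load(&flag) == 0)  /* spin until the new B is visible */
            ;
        /* consistency: the new A must be visible here as well */
        printf("data = %d (expected 42)\n", atomic_load(&data));
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }

Under relaxed atomic orderings this guarantee disappears: a machine with a weaker consistency model may let the reader observe the new flag but stale data, which is what the memory consistency models of section 5.6 classify.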

Cache Coherence Protocols
- Operations performed by caches in multiprocessors to ensure coherence:
  - migration of data to local caches (reduces bandwidth demand on the shared memory)
  - replication of read-shared data (reduces contention for access)
- Snooping protocols:
  - each cache monitors bus reads/writes
- Directory-based protocols:
  - caches and memory record the sharing status of blocks in a directory

Write invalidate vs. write update
- Write invalidate:
  - on a write, invalidate all other copies
  - use the bus itself to serialize: a write cannot complete until bus access is obtained
- Write update:
  - on a write, update all copies

Locating an item when a read miss occurs
- In write-back caches, the updated value must be sent to the requesting processor
- Cache lines are marked as shared or exclusive/modified:
  - only writes to shared lines need an invalidate broadcast
  - after this, the line is marked as exclusive

Invalidating Snooping Protocols
- A cache gets exclusive access to a block when it is to be written:
  - it broadcasts an invalidate message on the bus
  - a subsequent read in another cache misses, and the owning cache supplies the updated value (see the state-machine sketch below)

  CPU activity        | Bus activity     | CPU A's cache | CPU B's cache | Memory
                      |                  |               |               | 0
  CPU A reads X       | Cache miss for X | 0             |               | 0
  CPU B reads X       | Cache miss for X | 0             | 0             | 0
  CPU A writes 1 to X | Invalidate for X | 1             |               | 0
  CPU B reads X       | Cache miss for X | 1             | 1             | 1
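The transitions described above can be made explicit with a small state machine. This is a deliberately simplified sketch (it ignores bus arbitration, write-back of dirty data, and the races discussed on the next slide), not the full protocol from CAQA:

    /* msi_sketch.c: simplified MSI-style states for one cache line. */
    #include <stdio.h>

    typedef enum { INVALID, SHARED, MODIFIED } line_state;

    /* local read: a miss fetches the block in the shared state */
    line_state on_cpu_read(line_state s) {
        return (s == INVALID) ? SHARED : s;
    }

    /* local write: broadcast an invalidate unless we already own the line */
    line_state on_cpu_write(line_state s, int *bus_invalidate) {
        *bus_invalidate = (s != MODIFIED);
        return MODIFIED;
    }

    /* invalidate observed on the bus: drop our copy */
    line_state on_bus_invalidate(void) {
        return INVALID;
    }

    /* another cache's read miss: a modified owner supplies data, demotes to shared */
    line_state on_bus_read_miss(line_state s) {
        return (s == MODIFIED) ? SHARED : s;
    }

    int main(void) {  /* replays the snooping-protocol table above */
        int inv;
        line_state a = INVALID, b = INVALID;
        a = on_cpu_read(a);                 /* CPU A reads X   -> A shared   */
        b = on_cpu_read(b);                 /* CPU B reads X   -> B shared   */
        a = on_cpu_write(a, &inv);          /* CPU A writes X  -> A modified */
        if (inv) b = on_bus_invalidate();   /*                 -> B invalid  */
        a = on_bus_read_miss(a);            /* CPU B rereads X -> A shared   */
        b = on_cpu_read(b);                 /*                 -> B shared   */
        printf("A=%d B=%d (both SHARED=%d)\n", a, b, SHARED);
        return 0;
    }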

Extensions to the basic MSI protocol
- Complications for the basic MSI protocol:
  - operations are not atomic (e.g., detect miss, acquire bus, receive a response)
  - this creates the possibility of deadlock and races
  - one solution: the processor that sends an invalidate holds the bus until the other processors receive the invalidate
- Extensions:
  - add an exclusive state to indicate a clean block present in only one cache (MESI protocol); prevents having to send an invalidate on a write (a sketch follows after the reading list)
  - add an owned state (as in MOESI)

Reading suggestions (from CAQA 5th Ed.)
- Concepts and challenges in ILP: section 3.1
- Exploiting ILP with multiple issue & static scheduling: 3.7
- Exploiting ILP with dynamic scheduling, multiple issue & speculation: 3.8
- Multithreading: exploiting TLP on uniprocessors: 3.12
- Multiprocessor cache coherence and the snooping coherence protocol, with example: 5.2
- Basics of directory-based cache coherence: 5.4
- Models of memory consistency: 5.6
- A tutorial by Sarita Adve & K. Gharachorloo (see link at website)
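As a closing illustration (again a simplified invention, not CAQA's full protocol), the MESI delta over the earlier MSI sketch is a fourth state: a block loaded when no other cache holds it becomes EXCLUSIVE, and writing an EXCLUSIVE line needs no invalidate broadcast:

    /* mesi_sketch.c: the MESI delta over the MSI sketch above. */
    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

    /* read miss: the sole copy in the system is installed clean-exclusive */
    mesi_state on_read_miss(int other_sharers) {
        return other_sharers ? SHARED : EXCLUSIVE;
    }

    /* write: only SHARED or INVALID lines must broadcast an invalidate;
       an EXCLUSIVE (or MODIFIED) line is written silently */
    mesi_state on_cpu_write(mesi_state s, int *bus_invalidate) {
        *bus_invalidate = (s == SHARED || s == INVALID);
        return MODIFIED;
    }

    int main(void) {
        int inv;
        mesi_state line = on_read_miss(0);   /* no other sharers -> EXCLUSIVE */
        line = on_cpu_write(line, &inv);     /* -> MODIFIED, no bus traffic   */
        printf("invalidate broadcast needed: %d\n", inv);  /* prints 0 */
        return 0;
    }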