Exploring different levels of parallelism. Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously. 1. e = a + b 2. f = c + d 3. m = e * f. Statements 1 and 2 can operate in parallel; 3 depends on 1 and 2. A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows higher CPU throughput (the number of instructions that can be executed in a unit of time) than would otherwise be possible at a given clock rate. Source: Wikipedia
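
To make the dependence structure concrete, here is a minimal C version of the three statements above (an illustrative sketch added for this write-up, not tied to any particular processor); the comments note which operations a superscalar core could issue in the same cycle:

    int ilp_example(int a, int b, int c, int d) {
        int e = a + b;   /* independent of the next statement */
        int f = c + d;   /* can issue in the same cycle as e = a + b */
        int m = e * f;   /* must wait for both results above */
        return m;
    }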

Pipelining Execution: Single CPU

Superscalar 2-issue pipeline

Explicit Thread Level Parallelism or Data Level Parallelism. Thread: a process with its own instructions and data; a thread may be one process of a parallel program of multiple processes, or it may be an independent program. Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute. Data Level Parallelism: perform identical operations on data, and lots of data.

Thread Level Parallelism (TLP). ILP exploits implicit parallel operations within a loop or straight-line code segment. TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel. Goal: use multiple instruction streams to improve 1. the throughput of computers that run many programs and 2. the execution time of multi-threaded programs. TLP could be more cost-effective to exploit than ILP.

New Approach: Multithreaded Execution. Multithreading: multiple threads share the functional units of one processor via overlapping. The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table. Memory is shared through the virtual memory mechanisms, which already support multiple processes. HW support allows a fast thread switch, much faster than a full process switch, which takes 100s to 1000s of clock cycles. When to switch? Alternate instructions per thread (fine grain), or switch when a thread is stalled, perhaps for a cache miss (coarse grain).

Fine-Grained Multithreading. Switches between threads on each instruction, causing the execution of multiple threads to be interleaved. Usually done in a round-robin fashion, skipping any stalled threads. The CPU must be able to switch threads every clock. Advantage: it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls. Disadvantage: it slows down the execution of individual threads, since a thread ready to execute without stalls is delayed by instructions from other threads. Used on Sun's Niagara.

Sources of Underutilization. A pipeline slot is stalled or underutilized because of: L1 cache misses (data and instruction), L2 cache misses, instruction dependencies, long-latency operations (for example, divide), and branch mispredictions.

Coarse-Grained Multithreading. Switches threads only on costly stalls, such as L2 cache misses. Advantages: relieves the need for very fast thread switching, and doesn't slow down a thread, since instructions from other threads are issued only when that thread encounters a costly stall. Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs. Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen, and the new thread must fill the pipeline before instructions can complete. Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time. Used in the IBM AS/400. A toy software model of the two switching policies follows.
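
As a rough illustration of the two switch policies (a toy C model written for this write-up, not a description of any real core): each cycle, the fine-grained scheduler rotates to the next ready thread, while the coarse-grained scheduler keeps issuing from the current thread until it hits a long stall. The stall pattern below is invented.

    #include <stdio.h>
    #include <stdbool.h>

    #define NTHREADS 4

    /* Invented per-thread state: true while that thread waits on a long stall. */
    static bool long_stall[NTHREADS] = { false, true, false, false };

    /* Fine-grained policy: rotate to the next ready thread every cycle. */
    static int pick_fine_grained(int last) {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i + NTHREADS) % NTHREADS;
            if (!long_stall[t])
                return t;                  /* round-robin, skipping stalled threads */
        }
        return -1;                         /* everything is stalled: issue nothing */
    }

    /* Coarse-grained policy: stay on the current thread until it hits a costly stall. */
    static int pick_coarse_grained(int current) {
        if (current >= 0 && !long_stall[current])
            return current;
        return pick_fine_grained(current); /* switch only on a costly stall */
    }

    int main(void) {
        int fine = 0, coarse = 0;
        for (int cycle = 0; cycle < 8; cycle++) {
            if (cycle == 4)
                long_stall[0] = true;      /* pretend thread 0 takes an L2 miss here */
            fine = pick_fine_grained(fine);
            coarse = pick_coarse_grained(coarse);
            printf("cycle %d: fine-grained issues from thread %d, coarse-grained from thread %d\n",
                   cycle, fine, coarse);
        }
        return 0;
    }

In this model the fine-grained scheduler rotates among whatever threads are ready each cycle, while the coarse-grained one sticks with thread 0 until the simulated L2 miss at cycle 4 forces a switch.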

Simultaneous Multi-threading. [Figure: issue-slot utilization per cycle of 8 functional units (M M FX FX FP FP BR CC), for one thread versus two threads.] M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes.

[Figure: processor pipeline block diagram (L2 cache and control, L1 D-cache/D-TLB, integer and floating-point units, schedulers, uop queues, rename/alloc, trace cache, decoder, ucode ROM, BTB and I-TLB).] Without SMT, only a single thread can run at any given time. Thread 1: floating point.

[Figure: same pipeline diagram as above.] Without SMT, only a single thread can run at any given time. Thread 2: integer operation.

[Figure: same pipeline diagram as above.] SMT processor: both threads can run concurrently. Thread 1: floating point; Thread 2: integer operation.

Simultaneous Multithreading (SMT). Simultaneous multithreading builds on the insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading: a large set of virtual registers that can be used to hold the register sets of independent threads; register renaming, which provides unique register identifiers so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads; and out-of-order completion, which allows the threads to execute out of order and get better utilization of the HW. SMT mostly just adds a per-thread renaming table and keeps separate PCs; independent commitment can be supported by logically keeping a separate reorder buffer for each thread. Source: Microprocessor Report, December 6, 1999, "Compaq Chooses SMT for Alpha".

Multithreaded Categories. [Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; shading distinguishes Threads 1 through 5 and idle slots.]

Multi-core architectures

Single-core computer

Single-core CPU chip. [Figure: chip layout highlighting the single core.]

Multi-core architectures. New trend in computer architecture: replicate multiple processor cores on a single die. [Figure: multi-core CPU chip with Core 1, Core 2, Core 3, Core 4.]

Multi-core CPU chip. The cores fit on a single processor socket. Also called CMP (Chip Multi-Processor). [Figure: one socket containing cores 1 through 4.]

The cores run in parallel. [Figure: threads 1 through 4 running simultaneously, one on each of cores 1 through 4.]

Within each core, threads are timesliced (just like on a uniprocessor). [Figure: several threads multiplexed onto each of cores 1 through 4.]

Interaction with the Operating System. The OS perceives each core as a separate processor, and the OS scheduler maps threads/processes to different cores. Most major OSes support multi-core today: Windows, Linux, Mac OS X, etc.
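
As a small illustration added here (assuming a Linux or other Unix-like system, where _SC_NPROCESSORS_ONLN is a widely supported extension rather than core POSIX), a program can ask the OS how many processors its scheduler currently sees:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Number of processors (cores/hardware threads) the OS currently exposes. */
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        printf("OS reports %ld online processors\n", ncores);
        return 0;
    }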

Why multi-core? It is difficult to push single-core clock frequencies even higher. Deeply pipelined circuits bring heat problems, difficult design and verification, the need for large design teams, and server farms that need expensive air conditioning. Many new applications are multithreaded. It is the general trend in computer architecture (a shift towards more parallelism).

General context: Multiprocessors. A multiprocessor is any computer with several processors. SIMD (single instruction, multiple data): e.g., modern graphics cards. MIMD (multiple instruction, multiple data): e.g., the Lemieux cluster at the Pittsburgh Supercomputing Center.
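
For a flavor of the SIMD style, here is a minimal sketch using GCC/Clang vector extensions (an assumption of this example, not something the slides prescribe): one operation in the source adds four integer lanes at once.

    #include <stdio.h>

    /* Requires GCC or Clang; v4si packs four 32-bit ints processed together. */
    typedef int v4si __attribute__((vector_size(16)));

    int main(void) {
        v4si a = {1, 2, 3, 4};
        v4si b = {10, 20, 30, 40};
        v4si c = a + b;   /* single operation applied to all four lanes */
        printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
        return 0;
    }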

Programming for multi-core. Programmers must use threads or processes to spread the workload across multiple cores, i.e., write parallel algorithms. The OS will then map the threads/processes to cores.
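
As a minimal sketch of what "use threads" looks like in practice (using POSIX threads, which the slides do not prescribe; any threading API works), each thread runs a function and the OS is free to place the threads on different cores:

    #include <pthread.h>
    #include <stdio.h>

    static void *work(void *arg) {
        long id = (long)arg;
        printf("thread %ld running (possibly on its own core)\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&threads[i], NULL, work, (void *)i);   /* spawn workers */
        for (int i = 0; i < 4; i++)
            pthread_join(threads[i], NULL);                       /* wait for all */
        return 0;
    }

Build with cc -pthread.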

As programmers, do we care? What happens if we run a program on a multi-core?

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

What if we want a program to run on both processors? We have to explicitly tell the machine exactly how to do this. This is called parallel programming or concurrent programming. There are many parallel/concurrent programming models; we will look at a relatively simple one: fork-join parallelism.

Fork/Join Logical Example: 1. Fork N-1 threads. 2. Break the work into N pieces (and do it). 3. Join the N-1 threads. How good is this with caches?

    void array_add(int A[], int B[], int C[], int length) {
        int cpu_num = fork(N-1);
        int i;
        for (i = cpu_num; i < length; i += N) {
            C[i] = A[i] + B[i];
        }
        join();
    }
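
The fork()/join() calls above are pseudocode primitives from the slide. One possible way to realize the same pattern with POSIX threads (chosen here for illustration; the helper names are invented) is sketched below: the main thread forks N-1 workers, every participant handles the elements whose index is congruent to its id modulo N, and join() becomes pthread_join.

    #include <pthread.h>
    #include <stdio.h>

    #define N 4   /* number of participants: main thread + (N-1) workers */

    struct task { int *A, *B, *C; int length; int id; };

    static void *add_strided(void *p) {
        struct task *t = p;
        for (int i = t->id; i < t->length; i += N)   /* piece id, id+N, id+2N, ... */
            t->C[i] = t->A[i] + t->B[i];
        return NULL;
    }

    void array_add_parallel(int A[], int B[], int C[], int length) {
        pthread_t workers[N - 1];
        struct task tasks[N];
        for (int id = 0; id < N; id++)
            tasks[id] = (struct task){ A, B, C, length, id };
        for (int id = 1; id < N; id++)               /* "fork(N-1)" */
            pthread_create(&workers[id - 1], NULL, add_strided, &tasks[id]);
        add_strided(&tasks[0]);                      /* main thread does piece 0 */
        for (int id = 1; id < N; id++)               /* "join()" */
            pthread_join(workers[id - 1], NULL);
    }

    int main(void) {
        enum { LEN = 8 };
        int A[LEN], B[LEN], C[LEN];
        for (int i = 0; i < LEN; i++) { A[i] = i; B[i] = 10 * i; }
        array_add_parallel(A, B, C, LEN);
        for (int i = 0; i < LEN; i++) printf("%d ", C[i]);
        printf("\n");
        return 0;
    }

Build with cc -pthread. The stride-by-N decomposition follows the slide's loop, and how well that stride behaves with caches is exactly the question the slide raises.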

How does this help performance? Parallel speedup measures the improvement from parallelization: speedup(p) = (time for best serial version) / (time for version with p processors). What can we realistically expect?

Reason #1: Amdahl's Law. In general, the whole computation is not (easily) parallelizable; it contains serial regions.

Reason #1: Amdahl's Law. Suppose a program takes 1 unit of time to execute serially, and a fraction of the program, s, is inherently serial (unparallelizable). Then New Execution Time = (1 - s)/P + s. For example, consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system? New Execution Time = 0.9T/3 + 0.1T = ? Speedup = ? What is the maximum speedup from parallelization?
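
Working the example through (the slide leaves the blanks as an exercise; the arithmetic and the small calculator below are additions to this write-up): with s = 0.1 and P = 3, New Execution Time = 0.9T/3 + 0.1T = 0.4T, so Speedup = T / 0.4T = 2.5, and as P grows the speedup is bounded by 1/s = 10.

    #include <stdio.h>

    /* Amdahl's Law: serial fraction s, P processors, serial time normalized to 1. */
    static double amdahl_speedup(double s, int P) {
        double new_time = (1.0 - s) / P + s;
        return 1.0 / new_time;
    }

    int main(void) {
        double s = 0.1;
        printf("speedup on 3 processors: %.2f\n", amdahl_speedup(s, 3));  /* 2.50 */
        printf("maximum speedup (P -> infinity): %.1f\n", 1.0 / s);       /* 10.0 */
        return 0;
    }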

Reason #2: Overhead.

    void array_add(int A[], int B[], int C[], int length) {
        int cpu_num = fork(N-1);
        int i;
        for (i = cpu_num; i < length; i += N) {
            C[i] = A[i] + B[i];
        }
        join();
    }

Forking and joining is not instantaneous: it involves communicating between processors, may involve calls into the operating system, and depends on the implementation. New Execution Time = (1 - s)/P + s + overhead(P).