Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen


OoO Superscalar Processor
- Fetch instructions into the instruction window
- Use register renaming to eliminate false dependencies
- Schedule an instruction to the execution stage (issue) whenever all of its data inputs are ready
- Put the instruction in the reorder buffer, and commit it when (1) it is not mis-predicted and (2) all instructions prior to it have committed

Intel Skylake architecture
Front end: BPU; 32K L1 instruction cache; legacy decode pipeline (5 uops/cycle); decoded icache / DSB (6 uops/cycle); MSROM (4 uops/cycle); Instruction Decode Queue (IDQ, or micro-op queue); Allocate/Rename/Retire/Move Elimination/Zero Idiom
Scheduler, with a 4-issue integer pipeline and a 4-issue memory pipeline:
- Port 0: Int ALU, Vec FMA, Vec MUL, Vec Add, Vec ALU, Vec Shft, Divide, Branch2
- Port 1: Int ALU, Fast LEA, Vec FMA, Vec MUL, Vec Add, Vec ALU, Vec Shft, Int MUL, Slow LEA
- Port 5: Int ALU, Fast LEA, Vec SHUF, Vec ALU, CVT
- Port 6: Int ALU, Int Shft, Branch1
- Ports 2 and 3: LD/STA; Port 4: STD; Port 7: STA
Memory: 32K L1 data cache; 256K unified L2 cache

Dynamic execution with register renaming
Consider the following dynamic instructions (this is what your code looks like when performing a linked-list traversal):
1: lw $t1, 0($a0)
2: lw $a0, 4($a0)
3: add $v0, $v0, $t1
4: bne $a0, $zero, LOOP
5: lw $t1, 0($a0)
6: lw $t2, 4($a0)
7: add $v0, $v0, $t1
8: bne $t2, $zero, LOOP
Assume a superscalar processor with unlimited issue width and physical registers, which can fetch up to 4 instructions per cycle, where a memory instruction takes 2 cycles to execute. How many cycles does it take to issue all the instructions?
A. 1  B. 2  C. 3  D. 4  E. 5
cycle #1: 1, 2
cycle #2: wasted (waiting for the loads)
cycle #3: 3, 4, 5, 6
cycle #4: wasted (waiting for the loads)
cycle #5: 7, 8

Outline
- Simultaneous multithreading
- Chip multiprocessor
- Parallel programming

Simultaneous Multi-Threading (SMT)

Simultaneous Multi-Threading (SMT)
- Fetch instructions from different threads/processes to fill the under-utilized parts of the pipeline
- Exploit thread-level parallelism (TLP) to solve the problem of insufficient ILP in a single thread
- Keep separate architectural states for each thread: PC, register files, reorder buffer
- The rest of the superscalar processor hardware is shared
- Creates the illusion of multiple processors for the OS
Invented by Dean Tullsen, now a professor in UCSD CSE! You may take his CSE148 in Spring 2015.

Simplified SMT-OoO pipeline
Per-thread instruction fetch (T0, T1, T2, T3) feeds a shared instruction decode stage, register renaming logic, scheduler, execution units, and data cache; each thread keeps its own reorder buffer (ROB: T0, T1, T2, T3).

Simultaneous Multi-Threading (SMT)
Fetch 2 instructions from each thread/process at each cycle to fill the under-utilized parts of the pipeline. Issue width is still 2; commit width is still 4.
T1-1: lw $t1, 0($a0)
T1-2: lw $a0, 0($t1)
T2-1: sll $t0, $a1, 2
T2-2: add $t1, $a0, $t0
T1-3: addi $a1, $a1, -1
T1-4: bne $a1, $zero, LOOP
T2-3: lw $v0, 0($t1)
T2-4: addi $t1, $t1, 4
T2-5: add $v0, $v0, $t2
T2-6: jr $ra
The processor can execute 6 instructions before the bne resolves.
[pipeline diagram: interleaved IF/ID/Ren/EXE/MEM/Commit stages for the two threads]

Simultaneous Multithreading
- SMT helps cover the long-memory-latency problem
- But an SMT processor is still a superscalar processor: power consumption and hardware complexity can still be high (think of the Pentium 4)

SMT
- Improves the throughput of execution, but may increase the latency of a single thread
- Less branch penalty per thread
- Increases hardware utilization
- Simple hardware design: only need to duplicate the PC and register files
Real case: Intel Hyper-Threading (supports up to two threads per core), as in the Intel Pentium 4, Intel Atom, and Intel Core i7; AMD Zen also implements SMT.

SMT
How many of the following statements about SMT are correct?
- SMT makes processors with deep pipelines more tolerant of mis-predicted branches
- SMT can improve the throughput of a single-threaded application (it can actually hurt, because the thread shares resources with other threads)
- SMT processors can better utilize hardware during cache misses, compared with superscalar processors of the same issue width
- SMT processors can have higher cache miss rates than superscalar processors with the same cache sizes when executing the same set of applications
A. 0  B. 1  C. 2  D. 3  E. 4

Chip multiprocessor (CMP)

A wide-issue processor or multiple narrower-issue processors?
What can you do within a 21 mm x 21 mm die area?
Option 1: a 6-issue superscalar processor
- 3 integer ALUs, 3 floating-point ALUs, 3 load/store units
- 32 KB instruction cache, 32 KB data cache, TLB, 256 KB on-chip L2 cache
- Instruction fetch; instruction decode & rename; reorder buffer, instruction queues, and out-of-order logic; clocking & pads; external interface
Option 2: four 2-issue superscalar processors
- 4 x 1 integer ALUs, 4 x 1 floating-point ALUs, 4 x 1 load/store units
- An 8 KB I-cache and an 8 KB D-cache per core (I-Cache #1-#4, D-Cache #1-#4)
- L2 communication crossbar, 256 KB on-chip L2 cache, clocking & pads, external interface
You will have more ALUs if you choose this!

Die photo of a CMP processor

CMP advantages
How many of the following are advantages of a CMP over a traditional superscalar processor?
- CMP can provide better energy efficiency within the same area
- CMP can deliver better instruction throughput within the same die area (chip size)
- CMP can achieve better ILP for each running thread
- CMP can improve the performance of a single-threaded application without modifying code
A. 0  B. 1  C. 2  D. 3  E. 4

CMP vs. SMT
Assume applications X and Y have a similar instruction mix: say 60% ALU, 20% load/store, and 20% branches. Consider two processors:
- P1: a CMP with a 2-issue pipeline on each core; each core has a private 32 KB L1 D-cache
- P2: an SMT processor with a 4-issue pipeline and a 64 KB L1 D-cache
Which one do you think is better?
A. P1  B. P2

Speeding up a single application on multi-threaded processors

Parallel programming
To exploit CMP/SMT parallelism, you need to break your computation into multiple processes or multiple threads.
Processes (in OS/software systems):
- Separate programs actually running (not sitting idle) on your computer at the same time
- Each process has its own virtual memory space, and you must explicitly exchange data using inter-process communication APIs
Threads (in OS/software systems):
- Independent portions of your program that can run in parallel
- All threads share the same virtual memory space
We will refer to these collectively as threads. A typical user system might have 1-8 actively running threads; servers can have more if needed (the sysadmins will hopefully configure it that way).

Create threads/processes
Creating threads or processes is the only way to improve a single application's performance on CMP/SMT. You can use fork() to create a child process (CSE 120), or you can use pthreads or OpenMP to compose multi-threaded programs:
/* Do matrix multiplication */
for(i = 0; i < NUM_OF_THREADS; i++) {
    tids[i] = i;
    /* Spawn a thread */
    pthread_create(&thread[i], NULL, threaded_blockmm, &tids[i]);
}
for(i = 0; i < NUM_OF_THREADS; i++)
    /* Synchronize: wait for a thread to terminate */
    pthread_join(thread[i], NULL);

Is it easy?
Threads are hard to find, and some algorithms are hard to parallelize:
- Finite state machines
- N-body problems
- Linear programming
- Circuit value problem
- LZW data compression
Data sharing among threads needs architectural or API support: you need to use locks to control race conditions, and they are hard to debug!

Supporting the shared memory model
- Provide a single memory space that all processors can share
- All threads within the same program share the same address space
- Threads communicate with each other using shared variables in memory
- Provides the same memory abstraction as single-threaded programming

Simple idea...
Connect all processors and the shared memory to a bus: Core 0, Core 1, Core 2, and Core 3 all attach to the bus, which connects them to a shared $. Processor speed will be slow, because all devices on a bus must run at the same speed.

Memory hierarchy on CMP
Each processor has its own local cache: Core 0 through Core 3 each have a local $, all connected through the bus to a shared $.

Caches on a multiprocessor
Coherency: guarantees that all processors see the same value for a variable/memory address when they need the value at the same time (what value should be seen).
Consistency: all threads see changes to data in the same order (when a memory operation should be visible).

Simple cache coherency protocol
Snooping protocol: each processor broadcasts its cache misses on the bus and listens to the misses of others.
State associated with each block (cache line):
- Invalid: the data in this block is invalid
- Shared: the processor can read the data; the data may also exist in other processors' caches
- Exclusive: the processor has full permission for the data, and is the only one with the up-to-date copy

Cache coherency practice
What happens when core 0 modifies 0x1000?
Core 0 broadcasts a write miss for 0x1000 on the bus. Core 0's copy goes from Shared to Exclusive, while the copies of 0x1000 in cores 1, 2, and 3 go from Shared to Invalid.

Cache coherency practice
Then, what happens when core 2 reads 0x1000?
Core 2 broadcasts a read miss for 0x1000 on the bus. Core 0 writes back 0x1000, core 2 fetches the block, and both core 0's and core 2's copies end up Shared; cores 1 and 3 remain Invalid.

Simple cache coherency protocol (state machine)
Invalid:
- read miss (processor): go to Shared
- write miss (processor): go to Exclusive
- read/write miss (bus): stay Invalid
Shared:
- read miss/hit: stay Shared
- write hit / write request (processor): go to Exclusive
- write miss (bus): go to Invalid
Exclusive:
- read miss (bus): write back data, go to Shared
- write miss (bus): write back data, go to Invalid

Cache coherency
Assuming that we are running the following code on a CMP with some cache coherency protocol, which output is NOT possible? (a is initialized to 0)
thread 1: while(1) printf("%d ", a);
thread 2: while(1) a++;
A. 0 1 2 3 4 5 6 7 8 9
B. 1 2 5 9 3 6 8 10 12 13
C. 1 1 1 1 1 1 1 1 1 1
D. 1 1 1 1 1 1 1 1 1 100

It's show time! Demo!
thread 1: while(1) printf("%d ", a);
thread 2: while(1) a++;

Cache coherency practice
Now, what happens when core 2 writes 0x1004, which belongs to the same block as 0x1000?
Core 2 broadcasts a write miss for 0x1004 on the bus; its copy of the block becomes Exclusive, and the copies in cores 0, 1, and 3 become Invalid. Then, if core 0 accesses 0x1000, it will be a miss!

4C model
Beyond the classic 3 Cs (compulsory, conflict, capacity) there is a fourth:
Coherency miss: a block is invalidated because of sharing among processors.
- True sharing: processor A modifies X, and processor B also wants to access X.
- False sharing: processor A modifies X, and processor B wants to access Y. However, Y is invalidated because X and Y are in the same block!

Hard to debug
thread 1:
int loop;
int main() {
    pthread_t thread;
    loop = 1;
    pthread_create(&thread, NULL, modifyloop, NULL);
    while(loop == 1) {
        continue;
    }
    pthread_join(thread, NULL);
    fprintf(stderr, "user input: %d\n", loop);
    return 0;
}
thread 2:
void* modifyloop(void *x) {
    sleep(1);
    printf("Please input a number:\n");
    scanf("%d", &loop);
    return NULL;
}
(With optimization enabled, the compiler may keep loop in a register and spin forever; declaring it volatile forces the load on every iteration.)

Performance of multi-threaded programs
Multi-threaded blocked algorithm for matrix multiplication. Demo!

Announcements
- CAPE (course evaluation): drop one more lowest quiz grade (currently 2) if the response rate is higher than 60%
- Homework #5 due next Tuesday
- Special office hour: 2pm-3pm this Friday @ CSE 3208

Hardware complexity of wide-issue processors
- Multi-ported register file: grows ~ IW^4
- Instruction queue (IQ) size: ~ IW^4
- Bypass logic: ~ IW^4
- Wiring delay

Chip Multiprocessor (CMP)
- Multiple processors on a single die!
- Increasing the frequency increases power consumption cubically: doubling frequency increases power by 8x, while doubling cores increases power by only 2x
- Process technology (Moore's law) allows us to cram more cores into a single chip
- Instead of building a wide-issue processor, we can have multiple narrower-issue processors, e.g., one 4-issue vs. two 2-issue processors
- Now commonplace; improves the throughput of applications
- You may combine CMP and SMT together, like the Core i7