Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) An evolutionary processor architecture, originally introduced in 1995 by Dean Tullsen at the University of Washington, that aims at reducing resource waste in wide-issue processors. SMT has the potential of greatly enhancing processor computational capabilities by: exploiting thread-level parallelism (TLP), simultaneously executing instructions from different threads during the same cycle; and providing multiple hardware contexts, hardware thread scheduling, and context-switching capability. #1 Lec # 2 Fall 2002 9-11-2002

SMT Issues
- SMT CPU performance gain potential.
- Modifications to superscalar CPU architecture necessary to support SMT.
- SMT performance evaluation vs. fine-grain multithreading, superscalar, and chip multiprocessors.
- Hardware techniques to improve SMT performance: optimal level-one cache configuration for SMT; SMT thread instruction fetch and issue policies; instruction recycling (reuse) of decoded instructions.
- Software techniques: compiler optimizations for SMT; software-directed register deallocation; operating system behavior and optimization.
- SMT support for fine-grain synchronization.
- SMT as a viable architecture for network processors.
- Current SMT implementation: Intel's Hyper-Threading (2-way SMT) microarchitecture and performance in compute-intensive workloads.

Microprocessor Architecture Trends
- CISC machines: instructions take variable times to complete.
- RISC machines (microcode): simple instructions, optimized for speed.
- RISC machines (pipelined): same individual instruction latency, greater throughput through instruction "overlap".
- Superscalar processors: multiple instructions executing simultaneously.
- Multithreaded processors: additional HW resources (regs, PC, SP); each context gets the processor for x cycles.
- VLIW: "superinstructions" grouped together, decreased HW control complexity.
- Single-chip multiprocessors: duplicate entire processors (technology soon feasible due to Moore's Law).
- Simultaneous multithreading: multiple HW contexts (regs, PC, SP); each cycle, any context may execute.

Evolution of Microprocessors. Source: John P. Chen, Intel Labs.

CPU Architecture Evolution: Single-Threaded/Single-Issue Pipeline. Traditional 5-stage integer pipeline (Fetch, Decode, Execute, Memory, Writeback) with one register file, PC, and SP, backed by the memory hierarchy. Increases throughput; ideal CPI = 1.

CPU Architecture Evolution: Superscalar Architectures. Fetch, decode, execute, etc. more than one instruction per cycle (CPI < 1): multiple parallel pipelines (Fetch/Decode/Execute/Memory/Writeback per issue slot) share a single register file, PC, SP, and memory hierarchy. Limited by instruction-level parallelism (ILP).

Superscalar Architectures: Issue Slot Waste Classification. Empty or wasted issue slots can be classified as either vertical waste or horizontal waste: vertical waste is introduced when the processor issues no instructions in a cycle; horizontal waste occurs when not all issue slots can be filled in a cycle.
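As a concrete illustration of the two waste categories, here is a small sketch that tallies vertical and horizontal waste from per-cycle issue counts on an 8-issue machine (the trace values are invented for illustration, not taken from the slides):

```python
# Classify wasted issue slots for a hypothetical 8-issue superscalar.
# `trace` holds the number of instructions actually issued each cycle.
ISSUE_WIDTH = 8

def classify_waste(trace):
    vertical = horizontal = 0
    for issued in trace:
        if issued == 0:
            vertical += ISSUE_WIDTH             # empty cycle: all slots wasted
        else:
            horizontal += ISSUE_WIDTH - issued  # partially filled cycle
    return vertical, horizontal

# Two empty cycles waste 16 slots vertically; the non-empty cycles
# waste 5 + 0 + 3 + 6 = 14 slots horizontally.
print(classify_waste([3, 0, 8, 5, 0, 2]))  # (16, 14)
```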

Sources of Unused Issue Cycles in an 8-issue Superscalar Processor. "Processor busy" represents the utilized issue slots; all others represent wasted issue slots. 61% of the wasted cycles are vertical waste; the remainder are horizontal waste. Workload: SPEC92 benchmark suite. Source: "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Superscalar Architectures: All possible causes of wasted issue slots, and the latency-hiding or latency-reducing techniques that can reduce the number of cycles wasted by each cause. Source: "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Advanced CPU Architectures: Fine-grain or Traditional Multithreaded Processors. Multiple HW contexts (PC, SP, and registers); one context gets the CPU for x cycles at a time. Limited by thread-level parallelism (TLP): can reduce some of the vertical issue-slot waste, but offers no reduction in horizontal issue-slot waste. Example architectures: HEP, Tera.

Advanced CPU Architectures: VLIW: Intel/HP IA-64 Explicitly Parallel Instruction Computing (EPIC). Strengths: allows for a high level of instruction-level parallelism (ILP); takes much of the dependency analysis out of HW and places the focus on smart compilers. Weaknesses: limited by the instruction-level parallelism (ILP) available in a single thread; difficulty keeping functional units (FUs) busy (control hazards); static FU scheduling limits performance gains; resulting overall performance depends heavily on compiler performance.

Advanced CPU Architectures: Single-Chip Multiprocessor. Strengths: create a single processor block and duplicate it; takes much of the dependency analysis out of HW and places the focus on smart compilers. Weakness: performance limited by individual thread performance (ILP).

Advanced CPU Architectures: Single-Chip Multiprocessor. [Diagram: n copies of a two-way superscalar pipeline, each with its own register file, PC, SP, and control unit, all sharing the memory hierarchy.]

SMT: Simultaneous Multithreading. Multiple hardware contexts running at the same time (HW context: registers, PC, and SP). Reduces both horizontal and vertical waste by having multiple threads keep the functional units busy during every cycle. Builds on top of current time-proven advancements in CPU design: superscalar execution, dynamic scheduling, hardware speculation, and dynamic HW branch prediction. Enabling technology: VLSI logic density on the order of hundreds of millions of transistors per chip.

SMT. With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions are hidden: a reduction of both horizontal and vertical waste, and thus an improved instructions-issued-per-cycle (IPC) rate. Pipelines are separate until the issue stage. Functional units are shared among all contexts during every cycle: this means a more complicated writeback stage, but more threads issuing to functional units results in higher resource utilization.

SMT: Simultaneous Multithreading. [Diagram: n hardware contexts (each with its own register file, PC, and SP) feeding n two-way superscalar pipelines under a single chip-wide control unit, all sharing the memory hierarchy.]

The Power of SMT. [Diagram: instruction issue slots over time (processor cycles) for a superscalar, a traditional multithreaded, and a simultaneous multithreaded processor. Rows of squares represent instruction issue slots; a box with number x is an instruction issued from thread x; an empty box is a wasted slot.]

SMT Performance Example

Inst  Code            Description      Functional unit
A     LUI R5,100      R5 = 100         Int ALU
B     FMUL F1,F2,F3   F1 = F2 x F3     FP ALU
C     ADD R4,R4,8     R4 = R4 + 8      Int ALU
D     MUL R3,R4,R5    R3 = R4 x R5     Int mul/div
E     LW R6,R4        R6 = (R4)        Memory port
F     ADD R1,R2,R3    R1 = R2 + R3     Int ALU
G     NOT R7,R7       R7 = !R7         Int ALU
H     FADD F4,F1,F2   F4 = F1 + F2     FP ALU
I     XOR R8,R1,R7    R8 = R1 XOR R7   Int ALU
J     SUBI R2,R1,4    R2 = R1 - 4      Int ALU
K     SW ADDR,R2      (ADDR) = R2      Memory port

Functional units: 4 integer ALUs (1-cycle latency), 1 integer multiplier/divider (3-cycle latency), 3 memory ports (2-cycle latency, assume cache hit), 2 FP ALUs (5-cycle latency). Assume all functional units are fully pipelined.

SMT Performance Example (continued)

Cycle  Superscalar issue slots      SMT issue slots
1      LUI(A) FMUL(B) ADD(C)        T1.LUI(A) T1.FMUL(B) T1.ADD(C) T2.LUI(A)
2      MUL(D) LW(E)                 T1.MUL(D) T1.LW(E) T2.FMUL(B) T2.ADD(C)
3      -                            T2.MUL(D) T2.LW(E)
4      -                            -
5      ADD(F) NOT(G)                T1.ADD(F) T1.NOT(G)
6      FADD(H) XOR(I) SUBI(J)       T1.FADD(H) T1.XOR(I) T1.SUBI(J) T2.ADD(F)
7      SW(K)                        T1.SW(K) T2.NOT(G) T2.FADD(H)
8      -                            T2.XOR(I) T2.SUBI(J)
9      -                            T2.SW(K)

The SMT machine needs 2 additional cycles to complete program 2. Throughput: superscalar: 11 inst / 7 cycles = 1.57 IPC; SMT: 22 inst / 9 cycles = 2.44 IPC.
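The throughput figures can be re-derived from the issue schedule; a minimal sketch, with per-cycle issue counts transcribed from the example above:

```python
# Instructions issued per cycle, read off the example's issue slots.
superscalar = [3, 2, 0, 0, 2, 3, 1]    # 11 instructions in 7 cycles
smt = [4, 4, 2, 0, 2, 4, 3, 2, 1]      # 22 instructions (2 threads) in 9 cycles

def ipc(schedule):
    """Throughput = total instructions issued / cycles elapsed."""
    return sum(schedule) / len(schedule)

print(round(ipc(superscalar), 2), round(ipc(smt), 2))  # 1.57 2.44
```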

Modifications to Superscalar CPUs Necessary to Support SMT
- Multiple program counters and some mechanism by which the fetch unit selects one each cycle (thread instruction fetch policy).
- A separate return stack for each thread for predicting subroutine return destinations.
- Per-thread instruction retirement, instruction queue flush, and trap mechanisms.
- A thread ID with each branch target buffer entry to avoid predicting phantom branches.
- A larger register file, to support logical registers for all threads plus additional registers for register renaming (may require additional pipeline stages).
- A higher available main memory fetch bandwidth may be required.
- A larger data TLB with more entries to compensate for the increased volume of virtual-to-physical address translations.
- Improved cache organization to offset the cache performance degradation due to cache sharing among the threads and the resulting reduced locality, e.g. private per-thread vs. shared L1 cache.
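The register-file growth in particular is easy to quantify. A back-of-the-envelope sketch, assuming 32 integer + 32 FP logical registers per context (the rename-register pool of 100 is an illustrative assumption, not a figure from the slides):

```python
def smt_regfile_size(contexts, rename_regs=100):
    """Total physical registers: per-context logical registers (32 int +
    32 FP each) plus a shared pool of rename registers (assumed value)."""
    return contexts * (32 + 32) + rename_regs

# A single-threaded core vs. a hypothetical 4-context SMT core:
print(smt_regfile_size(1), smt_regfile_size(4))  # 164 356
```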

Current Implementations of SMT. Intel's recent implementation of Hyper-Threading Technology (2-thread SMT) in its current P4 Xeon processor family represents the first and only current implementation of SMT in a commercial microprocessor. The Alpha EV8 (4-thread SMT), originally scheduled for production in 2001, is currently on indefinite hold :( Current technology has the potential for 4-8 simultaneous threads, based on transistor count and design complexity.

A Base SMT Hardware Architecture. Source: "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Example SMT Vs. Superscalar Pipeline. The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines. Source: "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Intel Xeon Processor Pipeline. Source: Intel Technology Journal, Volume 6, Number 1, February 2002.

Intel Xeon Out-of-order Execution Engine Detailed Pipeline. Source: Intel Technology Journal, Volume 6, Number 1, February 2002.

SMT Performance Comparison. Instruction throughput (IPC) from simulations by Eggers et al. at the University of Washington, using both multiprogramming and parallel workloads:

Multiprogramming workload:
Threads  Superscalar  Traditional Multithreading  SMT
1        2.7          2.6                         3.1
2        -            3.3                         3.5
4        -            3.6                         5.7
8        -            2.8                         6.2

Parallel workload:
Threads  Superscalar  MP2  MP4  Traditional Multithreading  SMT
1        3.3          2.4  1.5  3.3                         3.3
2        -            4.3  2.6  4.1                         4.7
4        -            -    4.2  4.2                         5.6
8        -            -    -    3.5                         6.1
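Simple arithmetic on the multiprogramming numbers above shows the headline result: SMT's throughput keeps scaling with thread count while traditional multithreading's does not.

```python
# Relative throughput gains implied by the multiprogramming table above.
superscalar_1t = 2.7   # single-threaded superscalar baseline (IPC)
trad_mt_8t = 2.8       # traditional multithreading, 8 threads
smt_8t = 6.2           # SMT, 8 threads

print(round(smt_8t / superscalar_1t, 2))      # 2.3  (~2.3x the baseline)
print(round(trad_mt_8t / superscalar_1t, 2))  # 1.04 (barely above baseline)
```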

Simultaneous Vs. Fine-Grain Multithreading Performance. Instruction throughput as a function of the number of threads. (a)-(c) show the throughput by thread priority for particular models, and (d) shows the total throughput for all threads for each of the six machine models. The lowest segment of each bar is the contribution of the highest-priority thread to the total throughput. Source: "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Simultaneous Multithreading Vs. Single-Chip Multiprocessing. Results for the multiprocessor (MP) vs. simultaneous multithreading (SM) comparisons. The multiprocessor always has one functional unit of each type per processor. In most cases the SM processor has the same total number of each FU type as the MP. Source: "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Impact of Level 1 Cache Sharing on SMT Performance. Results for the simulated cache configurations, shown relative to the throughput (instructions per cycle) of the 64s.64p configuration. The caches are specified as: [total I-cache size in KB][private or shared].[D-cache size][private or shared]. For instance, 64p.64s has eight private 8 KB I-caches and a shared 64 KB data cache. Source: "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
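The cache-configuration naming scheme is mechanical enough to parse; an illustrative helper (the function name and dictionary keys are ours, only the naming scheme comes from the study):

```python
import re

def parse_cache_cfg(name):
    """Parse names like '64p.64s': [I-cache KB][p|s].[D-cache KB][p|s]."""
    m = re.fullmatch(r"(\d+)([ps])\.(\d+)([ps])", name)
    if m is None:
        raise ValueError(f"bad cache config name: {name}")
    kind = {"p": "private", "s": "shared"}
    i_kb, i_kind, d_kb, d_kind = m.groups()
    return {"icache_kb": int(i_kb), "icache": kind[i_kind],
            "dcache_kb": int(d_kb), "dcache": kind[d_kind]}

print(parse_cache_cfg("64p.64s"))
# {'icache_kb': 64, 'icache': 'private', 'dcache_kb': 64, 'dcache': 'shared'}
```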

SMT Thread Instruction Fetch Scheduling Policies
- Round Robin: fetch instructions from Thread 1, then Thread 2, then Thread 3, etc. (e.g. RR.1.8: each cycle, one thread fetches up to eight instructions; RR.2.4: each cycle, two threads fetch up to four instructions each).
- BR-Count: give highest priority to those threads that are least likely to be on a wrong path, by counting branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those with the fewest unresolved branches.
- MISS-Count: give priority to those threads that have the fewest outstanding data cache misses.
- I-Count: highest priority is assigned to the thread with the lowest number of instructions in the static portion of the pipeline (decode, rename, and the instruction queues).
- IQPOSN: give lowest priority to those threads with instructions closest to the head of either the integer or floating-point instruction queues (the oldest instruction is at the head of the queue).
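The I-Count heuristic in particular is simple to sketch. A minimal illustration (the per-thread counts below are invented, not from the study):

```python
def icount_pick(pre_issue_counts):
    """I-Count fetch policy: pick the thread with the fewest instructions
    in the static portion of the pipeline (decode, rename, queues)."""
    return min(pre_issue_counts, key=pre_issue_counts.get)

# Thread 1 has the emptiest front end, so it gets to fetch this cycle.
print(icount_pick({0: 12, 1: 5, 2: 9}))  # 1
```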

Instruction Throughput & Thread Fetch Policy. Source: "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Possible SMT Instruction Issue Policies
- OLDEST FIRST: issue the oldest instructions (those deepest into the instruction queue).
- OPT LAST and SPEC LAST: issue optimistic and speculative instructions after all others have been issued.
- BRANCH FIRST: issue branches as early as possible in order to identify mispredicted branches quickly.
Source: "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
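An OLDEST FIRST pass can be sketched as a sort over the ready instructions (the queue contents and (age, name) encoding here are invented for illustration):

```python
def oldest_first(ready, issue_width=4):
    """Issue up to issue_width ready instructions, oldest first.
    Each instruction is an (age, name) pair; larger age = older."""
    return sorted(ready, key=lambda inst: inst[0], reverse=True)[:issue_width]

ready = [(3, "ADD"), (7, "LW"), (1, "MUL"), (5, "SUB"), (9, "XOR")]
print(oldest_first(ready))  # [(9, 'XOR'), (7, 'LW'), (5, 'SUB'), (3, 'ADD')]
```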

RIT-CE SMT Project Goals
- Investigate performance gains from exploiting thread-level parallelism (TLP) in addition to current instruction-level parallelism (ILP) in processor design.
- Design and simulate an architecture incorporating simultaneous multithreading (SMT), including OS interaction (LINUX-based kernel?).
- Study operating system and compiler optimizations to improve SMT processor performance.
- Performance studies with various workloads using the simulator/OS/compiler: suitability for fine-grained parallel applications? Effect on multimedia applications?

RIT-CE SMT Project: Project Chart. [Diagram: the simulator represents the hardware with a kernel context; the compiler is simply a hacked version of gcc (using the assembler from the host system), feeding a linker/loader and a system-call proxy (OS specific); kernel code (memory management, process management) provides the thread held in the HW kernel context; the SMT kernel simulation produces the simulation results for the running program.]

Simulator (sim-smt) @ RIT CE
- Execution-driven performance simulator, derived from the SimpleScalar tool set.
- Simulates cache, branch prediction, and five pipeline stages.
- Flexible: a configuration file controls cache size, buffer sizes, and the number of functional units.
- A cross-compiler is used to generate SimpleScalar assembly language; binary utilities, compiler, and assembler are available; the standard C library (libc) has been ported.
- Sim-SMT simulator limitations: does not keep precise exceptions; system-call instructions are not tracked; limited memory space (four test programs' memory spaces run within one simulator memory space, so it is easy to run out of stack space).

Simulator Memory Address Space

Sim-SMT Simulation Runs & Results
- Test programs used: Newton interpolation; matrix solver using LU decomposition; integer test program; FP test program.
- Simulations of a single program with 1, 2, and 4 threads.
- System simulations involve a combination of all programs running simultaneously; several different combinations were run.
- From the simulation results, the biggest performance increase occurs when changing from one to two threads: higher issue rate and functional unit utilization.

Simulation Results: Performance (IPC)

Simulation Results: Simulation Time

Simulation Results: Instruction Issue Rate

Simulation Results: Performance Vs. Issue BW

Simulation Results: Functional Unit Utilization

Simulation Results: No Functional Unit Available

Simulation Results: Horizontal Waste Rate

Simulation Results: Vertical Waste Rate

SMT: Simultaneous Multithreading
Strengths:
- Overcomes the limitations imposed by low single-thread instruction-level parallelism.
- Multiple running threads hide individual control hazards (branch mispredictions).
Weaknesses:
- Additional stress placed on the memory hierarchy.
- Control unit complexity.
- Sizing of resources (cache, branch prediction, TLBs, etc.).
- Accessing registers (32 integer + 32 FP for each HW context): some designs devote two clock cycles to both register reads and register writes.