Simultaneous Multithreading (SMT)
- Blaise Floyd
Transcription
1 Simultaneous Multithreading (SMT)
- An evolutionary processor architecture, originally introduced in 1995 by Dean Tullsen at the University of Washington, that aims to reduce resource waste in wide-issue processors.
- SMT has the potential to greatly enhance processor computational capabilities by:
  - Exploiting thread-level parallelism (TLP): simultaneously executing instructions from different threads during the same cycle.
  - Providing multiple hardware contexts, hardware thread scheduling, and context-switching capability.
#1 Lec # 2 Fall
2 SMT Issues
- SMT CPU performance gain potential.
- Modifications to superscalar CPU architecture necessary to support SMT.
- SMT performance evaluation vs. fine-grain multithreading, superscalar, and chip multiprocessors.
- Hardware techniques to improve SMT performance:
  - Optimal level-one cache configuration for SMT.
  - SMT thread instruction fetch and issue policies.
  - Instruction recycling (reuse) of decoded instructions.
- Software techniques:
  - Compiler optimizations for SMT.
  - Software-directed register deallocation.
  - Operating system behavior and optimization.
- SMT support for fine-grain synchronization.
- SMT as a viable architecture for network processors.
3 Microprocessor Architecture Trends
- CISC machines: instructions take variable times to complete.
- RISC machines (microcode): simple instructions, optimized for speed.
- RISC machines (pipelined): same individual instruction latency; greater throughput through instruction "overlap".
- Superscalar processors: multiple instructions executing simultaneously.
- Multithreaded processors: additional HW resources (regs, PC, SP); each context gets the processor for x cycles.
- VLIW: "superinstructions" grouped together; decreased HW control complexity.
- Single-chip multiprocessors: duplicate entire processors (tech soon due to Moore's Law).
- Simultaneous multithreading: multiple HW contexts (regs, PC, SP); each cycle, any context may execute.
4 Performance Increase of Workstation-Class Microprocessors
[Figure: integer SPEC92 performance.]
5 Microprocessor Logic Density
[Figure: microprocessor logic density by year, with points for the i4004, i8080, i8086, i80386, i80486, and Pentium.]
- Labeled transistor counts: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million.
- Moore's Law: 2X transistors/chip every 1.5 years.
6 Increase of Capacity of VLSI Dynamic RAM Chips
[Table and figure: DRAM chip size (megabit) by year; the values did not survive transcription.]
- Growth rate: 1.55X/yr, or doubling every 1.6 years.
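The two growth figures quoted on the last two slides (2X every 1.5 years for logic, 1.55X per year for DRAM) can be converted into each other with a short calculation. This is a minimal sketch; the function names are illustrative and not from the slides.

```python
import math

def doubling_time_years(growth_per_year):
    """Years needed to double, given a constant yearly growth factor."""
    return math.log(2) / math.log(growth_per_year)

def annual_growth(doubling_years):
    """Yearly growth factor implied by a given doubling period."""
    return 2 ** (1 / doubling_years)

# DRAM capacity: 1.55X per year -> about 1.6 years to double.
print(round(doubling_time_years(1.55), 2))  # 1.58
# Logic density: doubling every 1.5 years -> about 1.59X per year.
print(round(annual_growth(1.5), 2))         # 1.59
```

Under these figures, logic density and DRAM capacity grow at nearly the same yearly rate.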
7 CPU Architecture Evolution: Single Threaded Pipeline
- Traditional 5-stage pipeline.
- Increases throughput: ideal CPI = 1.
[Diagram: Fetch, Decode, Execute, Memory, and Writeback stages, with register file, PC, SP, and memory hierarchy (management).]
8 CPU Architecture Evolution: Superscalar Architectures
- Fetch, decode, execute, etc. more than one instruction per cycle (CPI < 1).
- Limited by instruction-level parallelism (ILP).
[Diagram: parallel Fetch/Decode/Execute/Memory/Writeback pipelines for instructions i and i+1, sharing a register file, PC, SP, and memory hierarchy (management).]
9 Superscalar Architectures: Issue Slot Waste Classification
- Empty or wasted issue slots can be classified as either vertical waste or horizontal waste:
  - Vertical waste is introduced when the processor issues no instructions in a cycle.
  - Horizontal waste occurs when not all issue slots can be filled in a cycle.
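The two waste categories can be counted directly from a per-cycle issue trace. The sketch below assumes a hypothetical 4-wide machine and a made-up trace; `classify_waste` is an illustrative name, not part of any simulator mentioned in these slides.

```python
def classify_waste(issued_per_cycle, issue_width):
    """Split unused issue slots into vertical and horizontal waste."""
    vertical = 0    # slots lost in cycles where nothing issued
    horizontal = 0  # unfilled slots in cycles where something issued
    for issued in issued_per_cycle:
        if issued == 0:
            vertical += issue_width
        else:
            horizontal += issue_width - issued
    used = sum(issued_per_cycle)
    return {"used": used, "vertical": vertical, "horizontal": horizontal}

# Hypothetical trace: instructions issued in each of 7 cycles on a 4-wide machine.
trace = [3, 2, 0, 0, 2, 3, 1]
print(classify_waste(trace, issue_width=4))
# {'used': 11, 'vertical': 8, 'horizontal': 9}
```

The three counts always sum to cycles times issue width (here 7 x 4 = 28 slots).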
10 Sources of Unused Issue Cycles in an 8-issue Superscalar Processor
- "Processor busy" represents the utilized issue slots; all others represent wasted issue slots.
- 61% of the wasted cycles are vertical waste; the remainder is horizontal waste.
- Workload: SPEC92 benchmark suite.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages
11 Superscalar Architectures: All possible causes of wasted issue slots, and latency-hiding or latency-reducing techniques that can reduce the number of cycles wasted by each cause.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages
12 Advanced CPU Architectures: Fine-grain or Traditional Multithreaded Processors
- Multiple HW contexts (PC, SP, and registers).
- One context gets the CPU for x cycles at a time.
- Limited by thread-level parallelism (TLP):
  - Can reduce some of the vertical issue slot waste.
  - No reduction in horizontal issue slot waste.
- Example architectures: HEP, Tera.
13 Advanced CPU Architectures: VLIW: Intel/HP Explicitly Parallel Instruction Computing (EPIC)
- Strengths:
  - Allows for a high level of instruction-level parallelism (ILP).
  - Takes a lot of the dependency analysis out of HW and places the focus on smart compilers.
- Weaknesses:
  - Limited by instruction-level parallelism (ILP) in a single thread.
  - Keeping functional units (FUs) busy (control hazards).
  - Static FU scheduling limits performance gains.
14 Advanced CPU Architectures: Single Chip Multiprocessor
- Strengths:
  - Create a single processor block and duplicate.
  - Takes a lot of the dependency analysis out of HW and places the focus on smart compilers.
- Weakness:
  - Performance limited by individual thread performance (ILP).
15 Advanced CPU Architectures: Single Chip Multiprocessor
[Diagram: two-way superscalar pipelines i, i+1, ..., n, each with its own register file, PC, SP, and control unit, sharing the memory hierarchy (management).]
16 SMT: Simultaneous Multithreading
- Multiple hardware contexts running at the same time (HW context: registers, PC, and SP).
- Avoids both horizontal and vertical waste by having multiple threads keep functional units busy during every cycle.
- Builds on top of current time-proven advancements in CPU design: superscalar, dynamic scheduling, hardware speculation, dynamic HW branch prediction.
- Enabling technology: VLSI logic density on the order of hundreds of millions of transistors/chip.
17 SMT
- With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions will be hidden:
  - Reduction of both horizontal and vertical waste, and thus an improved Instructions Issued Per Cycle (IPC) rate.
- Pipelines are separated until the issue stage.
- Functional units are shared among all contexts during every cycle:
  - More complicated writeback stage.
- More threads issuing to functional units results in higher resource utilization.
18 SMT: Simultaneous Multithreading
[Diagram: two-way superscalar pipelines i, i+1, ..., n, each with its own register file, PC, and SP, under a single chip-wide control unit and a shared memory hierarchy (management).]
19 The Power Of SMT
[Figure: issue slots over time (processor cycles) for superscalar, traditional multithreaded, and simultaneous multithreading processors.]
- Rows of squares represent instruction issue slots.
- Box with number x: instruction issued from thread x.
- Empty box: slot is wasted.
20 SMT Performance Example
Inst  Code           Description      Functional unit
A     LUI R5,100     R5 = 100         Int ALU
B     FMUL F1,F2,F3  F1 = F2 x F3     FP ALU
C     ADD R4,R4,8    R4 = R4 + 8      Int ALU
D     MUL R3,R4,R5   R3 = R4 x R5     Int mul/div
E     LW R6,R4       R6 = (R4)        Memory port
F     ADD R1,R2,R3   R1 = R2 + R3     Int ALU
G     NOT R7,R7      R7 = !R7         Int ALU
H     FADD F4,F1,F2  F4 = F1 + F2     FP ALU
I     XOR R8,R1,R7   R8 = R1 XOR R7   Int ALU
J     SUBI R2,R1,4   R2 = R1 - 4      Int ALU
K     SW ADDR,R2     (ADDR) = R2      Memory port
- 4 integer ALUs (1 cycle latency)
- 1 integer multiplier/divider (3 cycle latency)
- 3 memory ports (2 cycle latency, assume cache hit)
- 2 FP ALUs (5 cycle latency)
- Assume all functional units are fully pipelined.
21 SMT Performance Example (continued)
Cycle  Superscalar issuing slots     SMT issuing slots
1      LUI (A), FMUL (B), ADD (C)    T1.LUI (A), T1.FMUL (B), T1.ADD (C), T2.LUI (A)
2      MUL (D), LW (E)               T1.MUL (D), T1.LW (E), T2.FMUL (B), T2.ADD (C)
3      (none)                        T2.MUL (D), T2.LW (E)
4      (none)                        (none)
5      ADD (F), NOT (G)              T1.ADD (F), T1.NOT (G)
6      FADD (H), XOR (I), SUBI (J)   T1.FADD (H), T1.XOR (I), T1.SUBI (J), T2.ADD (F)
7      SW (K)                        T1.SW (K), T2.NOT (G), T2.FADD (H)
8                                    T2.XOR (I), T2.SUBI (J)
9                                    T2.SW (K)
- SMT needs 2 additional cycles to complete program 2.
- Throughput:
  - Superscalar: 11 inst/7 cycles = 1.57 IPC
  - SMT: 22 inst/9 cycles = 2.44 IPC
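The schedule above can be reproduced with a small issue model. This is only a sketch under stated assumptions, none of which the slides spell out: a 4-wide issue stage, in-order issue within each thread, thread 1 given priority over thread 2 for slots, and only read-after-write register dependences tracked. The latencies and unit counts are the ones listed in the example.

```python
# Latencies and unit counts are taken from the example; ISSUE_WIDTH is an assumption.
LATENCY = {"int": 1, "mul": 3, "mem": 2, "fp": 5}
UNITS = {"int": 4, "mul": 1, "mem": 3, "fp": 2}
ISSUE_WIDTH = 4

# (name, functional unit, destination register, source registers)
PROGRAM = [
    ("A", "int", "R5", ()),
    ("B", "fp", "F1", ("F2", "F3")),
    ("C", "int", "R4", ("R4",)),
    ("D", "mul", "R3", ("R4", "R5")),
    ("E", "mem", "R6", ("R4",)),
    ("F", "int", "R1", ("R2", "R3")),
    ("G", "int", "R7", ("R7",)),
    ("H", "fp", "F4", ("F1", "F2")),
    ("I", "int", "R8", ("R1", "R7")),
    ("J", "int", "R2", ("R1",)),
    ("K", "mem", None, ("R2",)),
]

def simulate(num_threads):
    """Return the list of instructions issued in each cycle."""
    ready = [{} for _ in range(num_threads)]  # per-thread: register -> first usable cycle
    pc = [0] * num_threads                    # next instruction index per thread
    schedule = []
    cycle = 0
    while any(p < len(PROGRAM) for p in pc):
        cycle += 1
        slots = ISSUE_WIDTH
        free = dict(UNITS)  # fully pipelined units: capacity resets every cycle
        issued = []
        for t in range(num_threads):  # lower thread index has priority
            while slots and pc[t] < len(PROGRAM):
                name, unit, dest, srcs = PROGRAM[pc[t]]
                operands_ready = all(ready[t].get(r, 0) <= cycle for r in srcs)
                if not operands_ready or free[unit] == 0:
                    break  # in-order issue: a stalled instruction blocks its thread
                if dest is not None:
                    ready[t][dest] = cycle + LATENCY[unit]
                free[unit] -= 1
                slots -= 1
                pc[t] += 1
                issued.append(f"T{t + 1}.{name}")
        schedule.append(issued)
    return schedule

for threads in (1, 2):
    sched = simulate(threads)
    total = threads * len(PROGRAM)
    print(threads, "thread(s):", len(sched), "cycles,", round(total / len(sched), 2), "IPC")
```

With one thread the model finishes in 7 cycles (1.57 IPC), and with two copies of the program it finishes in 9 cycles (2.44 IPC), matching the table.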
22 Changes to Superscalar CPUs Necessary to Support SMT
- Multiple program counters and some mechanism by which one fetch unit selects one each cycle (thread instruction fetch policy).
- A separate return stack for each thread for predicting subroutine return destinations.
- Per-thread instruction retirement, instruction queue flush, and trap mechanisms.
- A thread ID with each branch target buffer entry to avoid predicting phantom branches.
- A larger register file, to support logical registers for all threads plus additional registers for register renaming (may require additional pipeline stages).
- A higher available main memory fetch bandwidth may be required.
- Improved cache to offset the cache performance degradation due to cache sharing among the threads and the resulting reduced locality, e.g., private per-thread vs. shared L1 cache.
23 A Base SMT Hardware Architecture
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages
24 Example SMT Vs. Superscalar Pipeline
The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages
25 SMT Performance Comparison
Instruction throughput from simulations by Eggers et al. at the University of Washington, using both multiprogramming and parallel workloads.
[Tables: throughput by number of threads. Multiprogramming workload: superscalar, traditional multithreading, SMT. Parallel workload: superscalar, MP2, MP4, traditional multithreading, SMT. The values did not survive transcription.]
26 Simultaneous Vs. Fine-Grain Multithreading Performance
Instruction throughput as a function of the number of threads. (a)-(c) show the throughput by thread priority for particular models, and (d) shows the total throughput for all threads for each of the six machine models. The lowest segment of each bar is the contribution of the highest-priority thread to the total throughput.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages
27 Simultaneous Multithreading Vs. Single-Chip Multiprocessing
Results for the multiprocessor (MP) vs. simultaneous multithreading (SM) comparisons. The multiprocessor always has one functional unit of each type per processor. In most cases the SM processor has the same total number of each FU type as the MP.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages
28 Impact of Level 1 Cache Sharing on SMT Performance
Results for the simulated cache configurations, shown relative to the throughput (instructions per cycle) of the 64s.64p configuration.
The caches are specified as: [total I cache size in KB][private or shared].[D cache size][private or shared]
For instance, 64p.64s has eight private 8 KB I caches and a shared 64 KB data cache.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages
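The cache naming scheme is compact enough to decode mechanically. The sketch below is illustrative only: `parse_cache_config` is a hypothetical helper, and the default of eight hardware threads is inferred from the "eight private 8 KB I caches" example rather than stated as a parameter in the slides.

```python
def parse_cache_config(spec, num_threads=8):
    """Decode a name such as '64p.64s' into I-cache and D-cache layouts."""
    def parse_part(part):
        total_kb = int(part[:-1])
        kind = part[-1]
        if kind == "p":   # private: total capacity split evenly across threads
            return {"sharing": "private", "count": num_threads,
                    "each_kb": total_kb // num_threads}
        if kind == "s":   # shared: one cache used by all threads
            return {"sharing": "shared", "count": 1, "each_kb": total_kb}
        raise ValueError(f"unknown sharing flag in {part!r}")

    icache, dcache = spec.split(".")
    return {"icache": parse_part(icache), "dcache": parse_part(dcache)}

config = parse_cache_config("64p.64s")
print(config["icache"])  # {'sharing': 'private', 'count': 8, 'each_kb': 8}
print(config["dcache"])  # {'sharing': 'shared', 'count': 1, 'each_kb': 64}
```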
29 SMT Thread Instruction Fetch Scheduling Policies
- Round Robin: instruction from Thread 1, then Thread 2, then Thread 3, etc. (e.g., RR 1.8: each cycle one thread fetches up to eight instructions; RR 2.4: each cycle two threads fetch up to four instructions each).
- BR-Count: give highest priority to those threads that are least likely to be on a wrong path, by counting branch instructions that are in the decode stage, the rename stage, and the instruction queues, and favoring those with the fewest unresolved branches.
- MISS-Count: give priority to those threads that have the fewest outstanding data cache misses.
- I-Count: highest priority is assigned to the thread with the lowest number of instructions in the static portion of the pipeline (decode, rename, and the instruction queues).
- IQPOSN: give lowest priority to those threads with instructions closest to the head of either the integer or floating-point instruction queues (the oldest instruction is at the head of the queue).
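The counter-based policies above all reduce to "rank threads by a per-thread count and fetch from the lowest". A minimal sketch follows; the counter field names and both function names are hypothetical, and IQPOSN is left out because it needs queue positions rather than a simple count.

```python
# Hypothetical per-thread counter names; a real front end would maintain these in hardware.
POLICY_KEYS = {
    "BRCOUNT": "unresolved_branches",
    "MISSCOUNT": "outstanding_dcache_misses",
    "ICOUNT": "front_end_instructions",
}

def pick_fetch_threads(stats, policy, num_threads=2):
    """Return the thread ids to fetch from this cycle, best candidate first."""
    key = POLICY_KEYS[policy]
    # The lowest count wins; ties are broken by thread id to keep the choice deterministic.
    ranked = sorted(stats, key=lambda tid: (stats[tid][key], tid))
    return ranked[:num_threads]

def round_robin(last_thread, total_threads, num_threads=1):
    """Return the next thread ids in rotation after last_thread."""
    return [(last_thread + 1 + i) % total_threads for i in range(num_threads)]

stats = {
    0: {"unresolved_branches": 3, "outstanding_dcache_misses": 0, "front_end_instructions": 12},
    1: {"unresolved_branches": 1, "outstanding_dcache_misses": 2, "front_end_instructions": 5},
    2: {"unresolved_branches": 0, "outstanding_dcache_misses": 1, "front_end_instructions": 20},
}
print(pick_fetch_threads(stats, "ICOUNT"))     # [1, 0]
print(pick_fetch_threads(stats, "BRCOUNT"))    # [2, 1]
print(pick_fetch_threads(stats, "MISSCOUNT"))  # [0, 2]
print(round_robin(2, 3, 2))                    # [0, 1]
```

Passing `num_threads=2` corresponds to a "2.4"-style scheme in which two threads fetch per cycle.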
30 Instruction Throughput & Thread Fetch Policy
31 Possible SMT Instruction Issue Policies
- OLDEST FIRST: issue the oldest instructions (those deepest into the instruction queue).
- OPT LAST and SPEC LAST: issue optimistic and speculative instructions after all others have been issued.
- BRANCH FIRST: issue branches as early as possible in order to identify mispredicted branches quickly.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages
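Each issue policy can be read as a different sort order over the ready instructions in the queue. The sketch below assumes every entry carries a sequence number (lower means older) and boolean flags marking it as a branch, optimistic, or speculative; these field names and the `issue_order` function are illustrative, not taken from the cited paper.

```python
def issue_order(queue, policy):
    """Return instruction names in the order the given policy would issue them."""
    if policy == "OLDEST_FIRST":
        key = lambda inst: inst["seq"]
    elif policy == "OPT_LAST":
        key = lambda inst: (inst["optimistic"], inst["seq"])
    elif policy == "SPEC_LAST":
        key = lambda inst: (inst["speculative"], inst["seq"])
    elif policy == "BRANCH_FIRST":
        key = lambda inst: (not inst["is_branch"], inst["seq"])
    else:
        raise ValueError(f"unknown policy {policy!r}")
    # False sorts before True, so flagged instructions move to the back
    # (or, for BRANCH_FIRST, branches move to the front); age breaks ties.
    return [inst["name"] for inst in sorted(queue, key=key)]

# Hypothetical ready instructions.
queue = [
    {"name": "add", "seq": 0, "is_branch": False, "optimistic": False, "speculative": False},
    {"name": "use_load", "seq": 1, "is_branch": False, "optimistic": True, "speculative": False},
    {"name": "beq", "seq": 2, "is_branch": True, "optimistic": False, "speculative": False},
    {"name": "sub", "seq": 3, "is_branch": False, "optimistic": False, "speculative": True},
]
print(issue_order(queue, "OLDEST_FIRST"))  # ['add', 'use_load', 'beq', 'sub']
print(issue_order(queue, "OPT_LAST"))      # ['add', 'beq', 'sub', 'use_load']
print(issue_order(queue, "BRANCH_FIRST"))  # ['beq', 'add', 'use_load', 'sub']
```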
32 Simulator (Sim-SMT), RIT CE
- Execution-driven performance simulator.
- Derived from the SimpleScalar tool set.
- Simulates cache, branch prediction, and five pipeline stages.
- Flexible: a configuration file controls cache size, buffer sizes, and number of functional units.
- Cross compiler used to generate SimpleScalar assembly language.
- Binary utilities, compiler, and assembler available.
- Standard C library (libc) has been ported.
33 Simulator Memory Address Space
34 Alternate Functional Unit Configurations
- New functional unit configurations attempted (by adding one of each type of FU):
  - +1 integer multiplier/divider: +2.8% IPC and issue rate; -74% in times with no FU available.
- Simulator is very flexible (only one line in the configuration file required a change).
35 Sim-SMT Simulator Limitations
- Does not keep precise exceptions.
- System call instructions are not tracked.
- Limited memory space:
  - The memory spaces of four test programs run in one simulator memory space.
  - Easy to run out of stack space.
36 Simulation Runs & Results
- Test programs used:
  - Newton interpolation.
  - Matrix solver using LU decomposition.
  - Integer test program.
  - FP test program.
- Simulations of a single program with 1, 2, and 4 threads.
- System simulations involve a combination of all programs simultaneously; several different combinations were run.
- From simulation results:
  - Performance increase: the biggest increase occurs when changing from one to two threads.
  - Higher issue rate and functional unit utilization.
37 Simulation Results: Performance (IPC)
38 Simulation Results: Simulation Time
39 Simulation Results: Instruction Issue Rate
40 Simulation Results: Performance Vs. Issue BW
41 Simulation Results: Functional Unit Utilization
42 Simulation Results: No Functional Unit Available
43 Simulation Results: Horizontal Waste Rate
44 Simulation Results: Vertical Waste Rate
45 SMT: Simultaneous Multithreading
- Strengths:
  - Overcomes the limitations imposed by low single-thread instruction-level parallelism.
  - Multiple running threads will hide individual control hazards (branch mispredictions).
- Weaknesses:
  - Additional stress placed on the memory hierarchy.
  - Control unit complexity.
  - Sizing of resources (cache, branch prediction, etc.).
  - Accessing registers (32 integer + 32 FP for each HW context): some designs devote two clock cycles for both register reads and register writes.
46 SMT: Simultaneous Multithreading Kernel Code
- Many, if not all, benchmarks are based upon limited interaction with kernel code.
- How can the kernel overhead be minimized (context switching, process management, etc.)?
- CHAOS (Context Hardware Accelerated Operating System).
- Introduce a lightweight dedicated kernel context to handle process management: when there are 4 contexts, there is a good chance that one of them will continue to run. Why take an (expensive) chance on swapping it out when it will be brought right back in by the swapper (process management)?
47 SMT & Technology
- SMT architecture has not yet been implemented in any existing commercial microprocessor (first 4-thread SMT CPU: Alpha EV8, ~2001).
- Current technology has the potential for 4-8 simultaneous threads, based on transistor count and design complexity.
48 RIT-CE SMT Project Goals
- Investigate performance gains from exploiting Thread-Level Parallelism (TLP) in addition to current Instruction-Level Parallelism (ILP) in processor design.
- Design and simulate an architecture incorporating Simultaneous Multithreading (SMT).
- Study operating system and compiler modifications needed to support SMT processor architectures.
- Define a standard interface for efficient SMT-processor/OS kernel interaction.
- Modify an existing OS kernel (Linux?) to take advantage of hardware multithreading capabilities.
- Long term: VLSI implementation of an SMT prototype.
49 Current Project Status
- Architecture/OS interface definition.
- Study of design alternatives and impact on performance.
- SMT simulator development: system call development, kernel support, and compiler/assembler changes.
- Development of code (programs and OS kernel) is key to getting results.
50 Short-Term Project Chart
[Chart components: Simulator; Compiler; Linker/Loader; System Call Proxy (OS specific); Kernel Code (Memory Management, Process Management); Simulation Results (running program); SMT Kernel Simulation.]
- Simulator will represent hardware with a kernel context.
- Compiler is simply a hacked version of gcc (using the assembler from the host system).
- Kernel code will provide the thread that will be held in the HW kernel context.
51 Current/Future Project Goals
- SMT simulator completion, refinement, and further testing.
- Development of an SMT-capable OS kernel.
- Extensive performance studies with various workloads using the simulator/OS/compiler:
  - Suitability for fine-grained parallel applications?
  - Effect on multimedia applications?
- Architectural changes based on benchmarks.
- Cache impact on SMT performance investigation.
- Investigation of an in-order SMT processor (C or VHDL model).
- MOSIS Tiny Chip (partial/full) implementation.
- Investigate the suitability of SMT processors as building blocks for MPPs.
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationDEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationSimultaneous Multithreading and the Case for Chip Multiprocessing
Simultaneous Multithreading and the Case for Chip Multiprocessing John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 2 10 January 2019 Microprocessor Architecture
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationFour Steps of Speculative Tomasulo cycle 0
HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly
More informationMetodologie di Progettazione Hardware-Software
Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationECE 571 Advanced Microprocessor-Based Design Lecture 4
ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationIntroducing Multi-core Computing / Hyperthreading
Introducing Multi-core Computing / Hyperthreading Clock Frequency with Time 3/9/2017 2 Why multi-core/hyperthreading? Difficult to make single-core clock frequencies even higher Deeply pipelined circuits:
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationLecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture
More informationEECC551 Exam Review 4 questions out of 6 questions
EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationArchitectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.
Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationModule 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.
Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationCourse on Advanced Computer Architectures
Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationCS146 Computer Architecture. Fall Midterm Exam
CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state
More informationELE 375 Final Exam Fall, 2000 Prof. Martonosi
ELE 375 Final Exam Fall, 2000 Prof. Martonosi Question Score 1 /10 2 /20 3 /15 4 /15 5 /10 6 /20 7 /20 8 /25 9 /30 10 /30 11 /30 12 /15 13 /10 Total / 250 Please write your answers clearly in the space
More informationSRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design
SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these
More informationHyperthreading Technology
Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?
More informationSimultaneous Multithreading Architecture
Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationCS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines
CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per
More informationAdvanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationTi Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr
Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions
More information" # " $ % & ' ( ) * + $ " % '* + * ' "
! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File
More informationBasic Computer Architecture
Basic Computer Architecture CSCE 496/896: Embedded Systems Witawas Srisa-an Review of Computer Architecture Credit: Most of the slides are made by Prof. Wayne Wolf who is the author of the textbook. I
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More informationEvolution of Computers & Microprocessors. Dr. Cahit Karakuş
Evolution of Computers & Microprocessors Dr. Cahit Karakuş Evolution of Computers First generation (1939-1954) - vacuum tube IBM 650, 1954 Evolution of Computers Second generation (1954-1959) - transistor
More information