Simultaneous Multithreading (SMT)

Size: px
Start display at page:

Download "Simultaneous Multithreading (SMT)"

Transcription

1 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue processors (superscalars). SMT has the potential of greatly enhancing superscalar processor computational capabilities by: Chip-Level TLP Exploiting thread-level parallelism (TLP) in a single processor core, simultaneously issuing, executing and retiring instructions from different threads during the same cycle. A single physical SMT processor core acts as a number of logical processors each executing a single thread Providing multiple hardware contexts, hardware thread scheduling and context switching capability. Providing effective long latency hiding. e.g FP, branch misprediction, memory latency #1 Lec # 2 Spring

2 SMT Issues SMT CPU performance gain potential. Modifications to Superscalar CPU architecture to support SMT. SMT performance evaluation vs. Fine-grain multithreading, Superscalar, Chip Multiprocessors. Hardware techniques to improve SMT performance: Optimal level one cache configuration for SMT. SMT thread instruction fetch, issue policies. Ref. Papers SMT-1, SMT-2 Instruction recycling (reuse) of decoded instructions. Software techniques: Compiler optimizations for SMT. SMT-3 Software-directed register deallocation. Operating system behavior and optimization. SMT-7 SMT support for fine-grain synchronization. SMT-4 SMT as a viable architecture for network processors. Current SMT implementation: Intel s Hyper-Threading (2-way SMT) Microarchitecture and performance in compute-intensive workloads. SMT-8 SMT-9 #2 Lec # 2 Spring

3 Evolution of Microprocessors General Purpose Processors (GPPs) Multi-cycle Pipelined (single issue) Multiple Issue (CPI <1) Superscalar/VLIW/SMT/CMP 1 GHz to???? GHz Original (2002) Intel Predictions 15 GHz IPC T = I x CPI x C Source: John P. Chen, Intel Labs Single-issue Processor = Scalar Processor Instructions Per Cycle (IPC) = 1/CPI #3 Lec # 2 Spring

4 Microprocessor Frequency Trend 10,000 Mhz 1, A 21064A Intel IBM Power PC DEC Gate delays/clock 21264S Pentium(R) II MPC Pentium Pro 601, 603 (R) Pentium(R) Processor freq scales by 2X per generation Gate Delays/ Clock Realty Check: Clock frequency scaling is slowing down! (Did silicone finally hit the wall?) Why? 1- Power leakage 2- Clock distribution delays Result: Deeper Pipelines Longer stalls Higher CPI (lowers effective performance per cycle) 1. Frequency used to double each generation 2. Number of gates/clock reduce by 25% 3. Leads to deeper pipelines with more stages (e.g Intel Pentium 4E has 30+ pipeline stages) T = I x CPI x C Possible Solutions? Chip-Level TLP - Exploit Thread-Level Parallelism (TLP) at the chip level (SMT/CMP) - Utilize/integrate more-specialized computing elements other than GPPs #4 Lec # 2 Spring

5 Parallelism in Microprocessor VLSI Generations Transistors 100,000,000 10,000,000 1,000, ,000 10,000 i4004 Bit-level parallelism Instruction-level Thread-level (?) Multiple micro-operations per cycle (multi-cycle non-pipelined) AKA Operation-Level Parallelism Not Pipelined CPI >> 1 i8080 i8008 i80286 i8086 i80386 R10000 Pentium (ILP) Single-issue Pipelined CPI =1 R2000 Single Thread R3000 Superscalar /VLIW CPI <1 (TLP) Simultaneous Multithreading SMT: e.g. Intel s Hyper-threading Chip-Multiprocessors (CMPs) e.g IBM Power 4, 5 Intel Pentium D, Core 2 AMD Athlon 64 X2 Dual Core Opteron Sun UltraSparc T1 (Niagara) Chip-Level Parallel Processing Even more important due to slowing clock rate increase Thread-Level Parallelism (TLP) 1, Improving microprocessor generation performance by exploiting more levels of parallelism #5 Lec # 2 Spring

6 Microprocessor Architecture Trends General Purpose Processor (GPP) CISC Machines instructions take variable times to complete Single Threaded RISC Machines (microcode) simple instructions, optimized for speed RISC Machines (pipelined) same individual instruction latency greater throughput through instruction "overlap" Multithreaded Processors additional HW resources (regs, PC, SP) each context gets processor for x cycles SIMULTANEOUS MULTITHREADING multiple HW contexts (regs, PC, SP) each cycle, any context may execute Superscalar Processors multiple instructions executing simultaneously VLIW "Superinstructions" grouped together decreased HW control complexity (Single or Multi-Threaded) (SMT) CMPs Single Chip Multiprocessors duplicate entire processors (tech soon due to Moore's Law) Chip-level TLP (e.g IBM Power 4/5, AMD X2, X3, X4, Intel Core 2) e.g. Intel s HyperThreading (P4) SMT/CMPs e.g. IBM Power5,6,7, Intel Pentium D, Sun Niagara - (UltraSparc T1) Intel Nehalem (Core i7) #6 Lec # 2 Spring

7 CPU Architecture Evolution: Single Threaded/ Single Issue Pipeline Traditional 5-stage integer pipeline. Increases Throughput: Ideal CPI = 1 Register File Fetch Decode Execute Memory Writeback PC SP Memory Hierarchy (Management) #7 Lec # 2 Spring

8 CPU Architecture Evolution: Single-Threaded/Superscalar Architectures Fetch, issue, execute, etc. more than one instruction per cycle (CPI < 1). Limited by instruction-level parallelism (ILP). Due to single thread limitations Fetch i Decode i Execute i Memory i Writeback i Register File PC SP Fetch i+1 Decode i+1 Execute i+1 Memory i+1 Writeback i+1 Memory Hierarchy (Management) Fetch i Decode i Execute i Memory i Writeback i #8 Lec # 2 Spring

9 Hardware-Based Commit or Retirement (In Order) FIFO Speculation Speculative Execution + Tomasulo s Algorithm = Speculative Tomasulo Instructions to issue in Order: Instruction Queue (IQ) Usually implemented as a circular buffer Next to commit Store Results Speculative Tomasulo-based Processor 4 th Edition: page 107 (3 rd Edition: page 228) #9 Lec # 2 Spring

10 Four Steps of Speculative Tomasulo Algorithm 1. Issue (In-order) Get an instruction from Instruction Queue If a reservation station and a reorder buffer slot are free, issue instruction & send operands & reorder buffer number for destination (this stage is sometimes called dispatch ) 2. Execution (out-of-order) Operate on operands (EX) Includes data MEM read When both operands are ready then execute; if not ready, watch CDB for result; when both operands are in reservation station, execute; checks RAW (sometimes called issue ) 3. Write result (out-of-order) Finish execution (WB) Write on Common Data Bus (CDB) to all awaiting FUs & reorder buffer; mark reservation station available. i.e Reservation Stations 4. Commit (In-order) Update registers, memory with reorder buffer result When an instruction is at head of reorder buffer & the result is present, update register with result (or store to memory) and remove instruction from reorder buffer. Mispredicted Branch Handling Stage 0 Instruction Fetch (IF): No changes, in-order No write to registers or memory in WB A mispredicted branch at the head of the reorder buffer flushes the reorder buffer (cancels speculated instructions after the branch) Instructions issue in order, execute (EX), write result (WB) out of order, but must commit in order. No WB for stores or branches Successfully completed instructions write to registers and memory (stores) here 4 th Edition: pages (3 rd Edition: pages ) #10 Lec # 2 Spring

11 Advanced CPU Architectures: VLIW: Intel/HP IA-64 Explicitly Parallel Instruction Computing (EPIC) Strengths: Allows for a high level of instruction parallelism (ILP). Takes a lot of the dependency analysis out of HW and places focus on smart compilers. Weakness: Limited by instruction-level parallelism (ILP) in a single thread. Keeping Functional Units (FUs) busy (control hazards). Static FUs Scheduling limits performance gains. Resulting overall performance heavily depends on compiler performance. #11 Lec # 2 Spring

12 Superscalar Architecture Limitations: Issue Slot Waste Classification Empty or wasted issue slots can be defined as either vertical waste or horizontal waste: Vertical waste is introduced when the processor issues no instructions in a cycle. Horizontal waste occurs when not all issue slots can be filled in a cycle. Why not 8-issue? Example: 4-Issue Superscalar Ideal IPC =4 Ideal CPI =.25 Also applies to VLIW Instructions Per Cycle = IPC = 1/CPI Result of issue slot waste: Actual Performance << Peak Performance #12 Lec # 2 Spring

13 Sources of Unused Issue Cycles in an 8-issue Superscalar Processor. (wasted) Ideal Instructions Per Cycle, IPC = 8 Here real IPC about 1.5 (CPI = 1/8) (18.75 % of ideal IPC) Average IPC= 1.5 instructions/cycle issue rate Real IPC << Ideal IPC 1.5 << 8 ~ 81% of issue slots wasted Processor busy represents the utilized issue slots; all others represent wasted issue slots. 61% of the wasted cycles are vertical waste, the remainder are horizontal waste. Workload: SPEC92 benchmark suite. SMT-1 Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al., Proceedings of the 22rd Annual International Symposium on Computer Architecture, June 1995, pages #13 Lec # 2 Spring

14 Superscalar Architecture Limitations : All possible causes of wasted issue slots, and latency-hiding or latency reducing techniques that can reduce the number of cycles wasted by each cause. Main Issue: One Thread leads to limited ILP (cannot fill issue slots) Solution: Exploit Thread Level Parallelism (TLP) within a single microprocessor chip: How? SMT-1 Simultaneous Multithreaded (SMT) Processor: -The processor issues and executes instructions from a number of threads creating a number of logical processors within a single physical processor e.g. Intel s HyperThreading (HT), each physical processor executes instructions from two threads AND/OR Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al., Proceedings of the 22rd Annual International Symposium on Computer Architecture, June 1995, pages Chip-Multiprocessors (CMPs): - Integrate two or more complete processor cores on the same chip (die) - Each core runs a different thread (or program) - Limited ILP is still a problem in each core (Solution: combine this approach with SMT) #14 Lec # 2 Spring

15 Advanced CPU Architectures: Single Chip Multiprocessors (CMPs) Strengths: Create a single processor block and duplicate. AKA Multi-Core Processors Each Single Threaded Exploits Thread-Level Parallelism. at chip level Takes a lot of the dependency analysis out of HW and places focus on smart compilers. Weakness: Performance within each processor still limited by individual thread performance (ILP). High power requirements using current VLSI processes. Almost entire processor cores are replicated on chip. May run at lower clock rates to reduce heat/power consumption. e.g IBM Power 4/5, Intel Pentium D, Core Duo, Core 2 (Conroe), Core i7 AMD Athlon 64 X2, X3, X4, Dual/quad Core Opteron Sun UltraSparc T1 (Niagara) #15 Lec # 2 Spring

16 Advanced CPU Architectures: Single Chip Multiprocessor (CMP) Register File i PC i SP i Register File i+1 PC i+1 SP i+1 Register File n PC n SP n Control Unit i Control Unit i+1 Control Unit n Or 4-way Superscalar (Two-way) Pipeline i Superscalar (Two-way) Pipeline i+1 Superscalar (Two-way) Pipeline n Memory Hierarchy (Management) Shared L2 Or L3 Cache CMP with n cores #16 Lec # 2 Spring

17 Current Dual-Core Chip-Multiprocessor (CMP) Architectures Single Die Shared L2 Cache Single Die Private Caches Shared System Interface Two Dies Shared Package Private Caches Private System Interface L 2 Or L 3 Cores communicate using shared cache (Lowest communication latency) Examples: IBM POWER4/5 Intel Pentium Core Duo (Yonah), Conroe Sun UltraSparc T1 (Niagara) Quad Core AMD K10 (shared L3 cache) On-chip crossbar/switch Source: Real World Technologies, Cores communicate using on-chip Interconnects (shared system interface) Examples: AMD Dual Core Opteron, Athlon 64 X2 Intel Itanium2 (Montecito) FSB Cores communicate over external Front Side Bus (FSB) (Highest communication latency) Example: Intel Pentium D #17 Lec # 2 Spring

18 Advanced CPU Architectures: Fine-grained or Traditional Multithreaded Processors Multiple hardware contexts (PC, SP, and registers). Only one context or thread issues instructions each cycle. Performance limited by Instruction-Level Parallelism (ILP) within each individual thread: Can reduce some of the vertical issue slot waste. No reduction in horizontal issue slot waste. Example Architecture: The Tera Computer System Cray MTA 2 #18 Lec # 2 Spring

19 Fine-grain or Traditional Multithreaded Processors The Tera (Cray MTA 2) Computer System The Tera computer system is a shared memory multiprocessor that can accommodate up to 256 processors. Each Tera processor is fine-grain multithreaded: Each processor can issue one 3-operation Long Instruction Word (LIW) every 3 ns cycle (333MHz) from among as many as 128 distinct instruction streams (hardware threads), thereby hiding up to 128 cycles (384 ns) of memory latency. In addition, each stream can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor. A stream implements a load/store architecture with three addressing modes and 31 general-purpose 64-bit registers. The instructions are 64 bits wide and can contain three operations: a memory reference operation (M-unit operation or simply M-op for short), an arithmetic or logical operation (A-op), and a branch or simple arithmetic or logical operation (C-op). Source: threads per core From one thread #19 Lec # 2 Spring

20 SMT: Simultaneous Multithreading Multiple Hardware Contexts (or threads) running at the same time (HW context: ISA registers, PC, and SP etc.). Thread state A single physical SMT processor core acts (and reports to the operating system) as a number of logical processors each executing a single thread Reduces both horizontal and vertical waste by having multiple threads keeping functional units busy during every cycle. Builds on top of current time-proven advancements in CPU design: superscalar, dynamic scheduling, hardware speculation, dynamic HW branch prediction, multiple levels of cache, hardware pre-fetching etc. Enabling Technology: VLSI logic density in the order of hundreds of millions of transistors/chip. Potential performance gain is much greater than the increase in chip area and power consumption needed to support SMT. Improved Performance/Chip Area/Watt (Computational Efficiency) vs. single-threaded superscalar cores. 2-way SMT processor 10-15% increase in area Vs. ~ 100% increase for dual-core CMP #20 Lec # 2 Spring

21 SMT With multiple threads running penalties from long-latency operations, cache misses, and branch mispredictions will be hidden: Thus SMT is an effective long latency-hiding technique Reduction of both horizontal and vertical waste and thus improved Instructions Issued Per Cycle (IPC) rate. Functional units are shared among all contexts during every cycle: More complicated register read and writeback stages. More threads issuing to functional units results in higher resource utilization. CPU resources may have to resized to accommodate the additional demands of the multiple threads running. (e.g cache, TLBs, branch prediction tables, rename registers) context = hardware thread #21 Lec # 2 Spring

22 SMT: Simultaneous Multithreading Register File i n Hardware Contexts PC i SP i Register File i+1 PC i+1 SP i+1 Register File n Control Unit (Chip-Wide) Superscalar (Two-way) Pipeline i Superscalar (Two-way) Pipeline i+1 Memory Hierarchy (Management) PC n SP n Superscalar (Two-way) Pipeline n Modified out-of-order Superscalar Core One n-way SMT Core #22 Lec # 2 Spring

23 The Power Of SMT Time (processor cycles) Superscalar Traditional Multithreaded Rows of squares represent instruction issue slots Box with number x: instruction issued from thread x Empty box: slot is wasted (Fine-grain) Simultaneous Multithreading #23 Lec # 2 Spring

24 SMT Performance Example Inst Code Description Functional unit A LUI R5,100 R5 = 100 Int ALU B FMUL F1,F2,F3 F1 = F2 x F3 FP ALU C ADD R4,R4,8 R4 = R4 + 8 Int ALU D MUL R3,R4,R5 R3 = R4 x R5 Int mul/div E LW R6,R4 R6 = (R4) Memory port F ADD R1,R2,R3 R1 = R2 + R3 Int ALU G NOT R7,R7 R7 =!R7 Int ALU H FADD F4,F1,F2 F4=F1 + F2 FP ALU I XOR R8,R1,R7 R8 = R1 XOR R7 Int ALU J SUBI R2,R1,4 R2 = R1 4 Int ALU K SW ADDR,R2 (ADDR) = R2 Memory port 4 integer ALUs (1 cycle latency) 1 integer multiplier/divider (3 cycle latency) 3 memory ports (2 cycle latency, assume cache hit) 2 FP ALUs (5 cycle latency) Assume all functional units are fully-pipelined #24 Lec # 2 Spring

25 SMT Performance Example 4-issue (single-threaded) (continued) Cycle Superscalar Issuing Slots SMT Issuing Slots LUI (A) FMUL (B) ADD (C) T1.LUI (A) T1.FMUL T1.ADD (C) T2.LUI (A) (B) 2 MUL (D) LW (E) T1.MUL (D) T1.LW (E) T2.FMUL (B) T2.ADD (C) 3 T2.MUL (D) T2.LW (E) 4 5 ADD (F) NOT (G) T1.ADD (F) T1.NOT (G) 6 FADD (H) XOR (I) SUBI (J) T1.FADD (H) T1.XOR (I) T1.SUBI (J) T2.ADD (F) 7 SW (K) T1.SW (K) T2.NOT (G) T2.FADD (H) 8 T2.XOR (I) T2.SUBI (J) 9 T2.SW (K) 2 additional cycles for SMT to complete program 2 Throughput: 2-thread SMT i.e 2 nd thread Superscalar: 11 inst/7 cycles = 1.57 IPC SMT: 22 inst/9 cycles = 2.44 IPC SMT is 2.44/1.57 = 1.55 times faster than superscalar for this example Ideal speedup = 2 #25 Lec # 2 Spring

26 Modifications to Superscalar CPUs to Support SMT i.e thread state Necessary Modifications: Multiple program counters (PCs), ISA registers and some mechanism by which one fetch unit selects one each cycle (thread instruction fetch/issue policy). A separate return stack for each thread for predicting subroutine return destinations. Per-thread instruction issue/retirement, instruction queue flush, and trap mechanisms. A thread id with each branch target buffer entry to avoid predicting phantom branches. Modifications to Improve SMT performance: A larger rename register file, to support logical registers for all threads plus additional registers for register renaming. (may require additional pipeline stages). A higher available main memory fetch bandwidth may be required. Larger data TLB with more entries to compensate for increased virtual to physical address translations. Improved cache to offset the cache performance degradation due to cache sharing among the threads and the resulting reduced locality. e.g Private per-thread vs. shared L1 cache. Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages Resize some hardware resources for performance SMT-2 #26 Lec # 2 Spring

27 SMT Implementations Intel s implementation of Hyper-Threading (HT) Technology (2-thread SMT): Originally implemented in its NetBurst microarchitecture (P4 processor family). Current Hyper-Threading implementation: Intel s Nehalem (Core i7 introduced 4 th quarter 2008): 2, 4 or 8 cores per chip each 2-thread SMT (4-16 threads per chip). IBM POWER 5/6: Dual cores each 2-thread SMT. The Alpha EV8 (4-thread SMT) originally scheduled for production in A number of special-purpose processors targeted towards network processor (NP) applications. Sun UltraSparc T1 (Niagara): Eight processor cores each executing from 4 hardware threads (32 threads total). Actually, not SMT but fine-grain multithreaded (each core issues one instruction from one thread per cycle). Current technology has the potential for at least 4-8 simultaneous threads per core (based on transistor count and design complexity). #27 Lec # 2 Spring

28 A Base SMT Hardware Architecture. In-Order Front End Out-of-order Core Fetch/Issue Modified Superscalar Speculative Tomasulo Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, SMT-2 Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages #28 Lec # 2 Spring

29 Example SMT Vs. Superscalar Pipeline SMT Based on the Alpha Two extra pipeline stages added for reg. Read/write to account for the size increase of the register file The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines. Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages SMT-2 #29 Lec # 2 Spring

30 Intel Hyper-Threaded (2-way SMT) P4 Processor Pipeline Source: Intel Technology Journal, Volume 6, Number 1, February SMT-8 #30 Lec # 2 Spring

31 Intel P4 Out-of of-order order Execution Engine Detailed Pipeline Hyper-Threaded (2-way SMT) Source: Intel Technology Journal, Volume 6, Number 1, February SMT-8 #31 Lec # 2 Spring

32 SMT Performance Comparison Instruction throughput (IPC) from simulations by Eggers et al. at The University of Washington, using both multiprogramming and parallel workloads: Multiprogramming workload Superscalar Traditional SMT Threads Multithreading issue 8-threads IPC Parallel Workload i.e Fine-grained Superscalar MP2 MP4 Traditional SMT (MP = Chip-multiprocessor) Threads Multithreading IPC Multiprogramming workload = multiple single threaded programs (multi-tasking) Parallel Workload = Single multi-threaded program #32 Lec # 2 Spring

33 Possible Machine Models for an 8-way Multithreaded Processor The following machine models for a multithreaded CPU that can issue 8 instruction per cycle differ in how threads use issue slots and functional units: Fine-Grain Multithreading: Only one thread issues instructions each cycle, but it can use the entire issue width of the processor. This hides all sources of vertical waste, but does not hide horizontal waste. SM:Full Simultaneous Issue: i.e SM: Eight Issue Most Complex This is a completely flexible simultaneous multithreaded superscalar: all eight threads compete for each of the 8 issue slots each cycle. This is the least realistic model in terms of hardware complexity, but provides insight into the potential for simultaneous multithreading. The following models each represent restrictions to this scheme that decrease hardware complexity. SM:Single Issue,SM:Dual Issue, and SM:Four Issue: These three models limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle. For example, in a SM:Dual Issue processor, each thread can issue a maximum of 2 instructions per cycle; therefore, a minimum of 4 threads would be required to fill the 8 issue slots in one cycle. SM:Limited Connection. SMT-1 Each hardware context is directly connected to exactly one of each type of functional unit. For example, if the hardware supports eight threads and there are four integer units, each integer unit could receive instructions from exactly two threads. The partitioning of functional units among threads is thus less dynamic than in the other models, but each functional unit is still shared (the critical factor in achieving high utilization). i.e. Partition functional units among threads Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al., Proceedings of the 22rd Annual International Symposium on Computer Architecture, June 1995, pages #33 Lec # 2 Spring

34 Comparison of Multithreaded CPU Models Complexity A comparison of key hardware complexity features of the various models (H=high complexity). The comparison takes into account: the number of ports needed for each register file, the dependence checking for a single thread to issue multiple instructions, the amount of forwarding logic, and the difficulty of scheduling issued instructions onto functional units. SMT-1 Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al., Proceedings of the 22rd Annual International Symposium on Computer Architecture, June 1995, pages #34 Lec # 2 Spring

35 Simultaneous Vs. Fine-Grain Multithreading Performance 4.8? IPC 3.1 IPC 6.4 Workload: SPEC92 Instruction throughput as a function of the number of threads. (a)-(c) show the throughput by thread priority for particular models, and (d) shows the total throughput for all threads for each of the six machine models. The lowest segment of each bar is the contribution of the highest priority thread to the total throughput. SMT-1 Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al., Proceedings of the 22rd Annual International Symposium on Computer Architecture, June 1995, pages #35 Lec # 2 Spring

36 Simultaneous Multithreading (SM) Vs. Single-Chip Multiprocessing (MP) IPC Results for the multiprocessor MP vs. simultaneous multithreading SM comparisons.the multiprocessor always has one functional unit of each type per processor. In most cases the SM processor has the same total number of each FU type as the MP. SMT-1 Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al., Proceedings of the 22rd Annual International Symposium on Computer Architecture, June 1995, pages #36 Lec # 2 Spring

37 Notation: Impact of Level 1 Cache Sharing on SMT Performance Results for the simulated cache configurations, shown relative to the throughput (instructions per cycle) of the 64s.64p Instruction Data The caches are specified as: [total I cache size in KB][private or shared].[d cache size][private or shared] For instance, 64p.64s has eight private 8 KB I caches and a shared 64 KB data 64K instruction cache shared 64K data cache private (8K per thread) Best overall performance of configurations considered achieved by 64s.64s (64K instruction cache shared 64K data cache shared) Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al., Proceedings of the 22rd Annual International Symposium on Computer Architecture, June 1995, pages SMT-1 #37 Lec # 2 Spring

38 The Impact of Increased Multithreading on Some Low Level Metrics for Base SMT Architecture IPC Renaming Registers SMT-2 So? More threads supported may lead to more demand on hardware resources (e.g here D and I miss rated increased substantially, and thus need to be resized) Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages #38 Lec # 2 Spring

39 Possible SMT Thread Instruction Fetch Scheduling Policies Round Robin: Instruction from Thread 1, then Thread 2, then Thread 3, etc. (eg RR 1.8 : each cycle one thread fetches up to eight instructions RR 2.4 each cycle two threads fetch up to four instructions each) BR-Count: Give highest priority to those threads that are least likely to be on a wrong path by by counting branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those with the fewest unresolved branches. MISS-Count: Give priority to those threads that have the fewest outstanding Data cache misses. ICount: Highest priority assigned to thread with the lowest number of instructions in static portion of pipeline (decode, rename, and the instruction queues). IQPOSN: Instruction Queue (IQ) Position Give lowest priority to those threads with instructions closest to the head of either the integer or floating point instruction queues (the oldest instruction is at the head of the queue). SMT-2 Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages #39 Lec # 2 Spring

40 Instruction Throughput For Round Robin Instruction Fetch Scheduling IPC = 4.2 RR.2.8 RR with best performance Best overall instruction throughput achieved using round robin RR.2.8 (in each cycle two threads each fetch a block of 8 instructions) Workload: SPEC92 Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages SMT-2 #40 Lec # 2 Spring

41 Instruction throughput & Thread Fetch Policy ICOUNT.2.8 All other fetch heuristics provide speedup over round robin Instruction Count ICOUNT.2.8 provides most improvement 5.3 instructions/cycle vs 2.5 for unmodified superscalar. Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Workload: SPEC92 Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages ICOUNT: Highest priority assigned to thread with the lowest number of instructions in static portion of pipeline (decode, rename, and the instruction queues). SMT-2 #41 Lec # 2 Spring

42 Low-Level Metrics For Round Robin 2.8, Icount 2.8 Renaming Registers ICOUNT improves on the performance of Round Robin by 23% by reducing Instruction Queue (IQ) clog by selecting a better mix of instructions to queue SMT-2 #42 Lec # 2 Spring

43 Possible SMT Instruction Issue Policies OLDEST FIRST: Issue the oldest instructions (those deepest into the instruction queue, the default). OPT LAST and SPEC LAST: Issue optimistic and speculative instructions after all others have been issued. BRANCH FIRST: Issue branches as early as possible in order to identify mispredicted branches quickly. Instruction issue bandwidth is not a bottleneck in SMT as shown above Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages ICOUNT.2.8 Fetch policy used for all issue policies above SMT-2 #43 Lec # 2 Spring

44 SMT: Simultaneous Multithreading Strengths: Overcomes the limitations imposed by low single thread instruction-level parallelism. Resource-efficient support of chip-level TLP. Multiple threads running will hide individual control hazards ( i.e branch mispredictions) and other long latencies (i.e main memory access latency on a cache miss). Weaknesses: Additional stress placed on memory hierarchy. Control unit complexity. Sizing of resources (cache, branch prediction, TLBs etc.) Accessing registers (32 integer + 32 FP for each HW context): Some designs devote two clock cycles for both register reads and register writes. Deeper pipeline #44 Lec # 2 Spring

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) #1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Simultaneous Multithreading Processor

Simultaneous Multithreading Processor Simultaneous Multithreading Processor Paper presented: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor James Lue Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Advanced Processor Architecture

Advanced Processor Architecture Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong

More information

Lecture 9: Multiple Issue (Superscalar and VLIW)

Lecture 9: Multiple Issue (Superscalar and VLIW) Lecture 9: Multiple Issue (Superscalar and VLIW) Iakovos Mavroidis Computer Science Department University of Crete Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Lecture 26: Parallel Processing. Spring 2018 Jason Tang

Lecture 26: Parallel Processing. Spring 2018 Jason Tang Lecture 26: Parallel Processing Spring 2018 Jason Tang 1 Topics Static multiple issue pipelines Dynamic multiple issue pipelines Hardware multithreading 2 Taxonomy of Parallel Architectures Flynn categories:

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3 EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

" # " $ % & ' ( ) * + $ " % '* + * ' "

 #  $ % & ' ( ) * + $  % '* + * ' ! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File

More information

Metodologie di Progettazione Hardware-Software

Metodologie di Progettazione Hardware-Software Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism

More information

Multiple Issue ILP Processors. Summary of discussions

Multiple Issue ILP Processors. Summary of discussions Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware

More information

Instruction Fetch Bandwidth/Memory Latency Reduction: Conventional & Block-based Trace Cache (Intel P4).

Instruction Fetch Bandwidth/Memory Latency Reduction: Conventional & Block-based Trace Cache (Intel P4). Advanced Computer Architecture Course Goal: Understanding important emerging design techniques, machine structures, technology factors, evaluation methods that will determine the form of highperformance

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) 1 EEC 581 Computer Architecture Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview

More information

Exploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Why do I need four computing cores on my phone?! Why do I need eight computing

More information

Multicore and Parallel Processing

Multicore and Parallel Processing Multicore and Parallel Processing Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter 4.10 11, 7.1 6 xkcd/619 2 Pitfall: Amdahl s Law Execution time after improvement

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

How to write powerful parallel Applications

How to write powerful parallel Applications How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome

More information

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Lec 25: Parallel Processors. Announcements

Lec 25: Parallel Processors. Announcements Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza

More information

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1 Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]

More information

Chapter 4 The Processor (Part 4)

Chapter 4 The Processor (Part 4) Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW) EEC 581 Computer Architecture Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Advanced Instruction-Level Parallelism

Advanced Instruction-Level Parallelism Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu

More information

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

TDT 4260 TDT ILP Chap 2, App. C

TDT 4260 TDT ILP Chap 2, App. C TDT 4260 ILP Chap 2, App. C Intro Ian Bratt (ianbra@idi.ntnu.no) ntnu no) Instruction level parallelism (ILP) A program is sequence of instructions typically written to be executed one after the other

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 4

ECE 571 Advanced Microprocessor-Based Design Lecture 4 ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted

More information

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case

More information

High-Performance Processors Design Choices

High-Performance Processors Design Choices High-Performance Processors Design Choices Ramon Canal PD Fall 2013 1 High-Performance Processors Design Choices 1 Motivation 2 Multiprocessors 3 Multithreading 4 VLIW 2 Motivation Multiprocessors Outline

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture

More information

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic

More information

Parallelism, Multicore, and Synchronization

Parallelism, Multicore, and Synchronization Parallelism, Multicore, and Synchronization Hakim Weatherspoon CS 3410 Computer Science Cornell University [Weatherspoon, Bala, Bracy, McKee, and Sirer, Roth, Martin] xkcd/619 3 Big Picture: Multicore

More information

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by: Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by: Result forwarding (register bypassing) to reduce or eliminate stalls needed

More information

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information