Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel


Aalborg Universitet
Publication date: 2006. Document version: early version, also known as pre-print.
Citation (APA): Ortiz-Arroyo, D. (2006). Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture. Poster session presented at Advanced Computer Architecture and Compilation for Embedded Systems, L'Aquila, Italy.

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture

Daniel Ortiz-Arroyo 1
Computer Science and Engineering Department, Aalborg University Esbjerg, Niels Bohrs Vej 8, 6700 Esbjerg, Denmark

ABSTRACT
In this paper we describe the Dynamic Simultaneous Multithreading (DSMT) architecture, together with a simple resource allocation policy that may be employed to improve its performance. DSMT exploits the TLP and ILP available in single applications on an SMT architecture core. In DSMT, threads are speculatively created and executed, while special hardware mechanisms keep track of misspeculations and dependence violations through registers and memory.

KEYWORDS: Thread Level Parallelism; Simultaneous Multithreading; Speculation

1 Introduction
In multithreaded architectures, TLP can be exploited using implicit or explicit mechanisms. Implicit multithreaded architectures aim to improve the execution time of a single program, whereas explicit multithreading aims to improve the execution time of a multiprogrammed workload. Implicit multithreading comprises either static methods, such as parallelizing compilers or binary annotators that identify threads, or dynamic techniques that extract threads speculatively at run time from a single program. Explicitly multithreaded architectures such as Simultaneous Multithreading (SMT) exploit processor resources efficiently by executing multiple independent threads simultaneously, at the cost of a small increase in complexity. However, because SMT is an explicit architecture, it does nothing to improve the performance of a single program. Moreover, SMT may cause cache pollution because different threads compete for the shared cache and, as described in [Caz04], the fetch policy employed greatly impacts the performance of the processor.
This paper briefly describes an implicitly multithreaded architecture called Dynamic Simultaneous Multithreading (DSMT), based on an SMT microarchitecture. We also describe a resource allocation policy that can be used to improve DSMT's performance. This paper is organized as follows. Section 2 briefly describes previous related work. Section 3 provides a detailed description of the DSMT microarchitecture. The proposed resource allocation policy is briefly presented in Section 4. Finally, Section 5 presents some conclusions.

1 Email: do@cs.aaue.dk

Figure 1: DSMT Microarchitecture.
Figure 2: TCIU and Multiple Contexts.

2 Related Work
In the Dynamic Multithreaded (DMT) processor [Akk98], threads are generated from loops and procedure continuations, i.e., from the instructions that follow either a call or a backward branch. This precludes the exploitation of the potentially high degree of parallelism that exists across inner loop iterations. Moreover, threads are allowed to be spawned out of program order, which results in lower branch and return-address prediction accuracies. These factors lead to long threads that increase the chance of data dependence misspeculation. In contrast, the Clustered Speculative Multithreaded (CSM) processor [Mar99] generates threads from the targets of backward branches, in particular the innermost iterations of a loop. In CSM, inter-thread data dependencies through registers are resolved by performing data value prediction. When a thread finishes, its output predictions are verified and mispredictions are handled by selective re-execution. Inter-thread memory dependence speculation is performed by means of a multi-value cache. This special cache memory stores, for each address, as many different data words as there are thread units.
Among the proposed SMT-based implicitly multithreaded processors, DSMT employs less complex mechanisms than similar architectures [Akk98, Mar99], specifically: a) a simple mechanism based on utility bits to detect and resolve inter-thread dependencies through registers and memory; b) no trace buffers or specialized cache structures; c) speculation control performed by state bits and simple flushing mechanisms; and d) a simple loop stride value prediction mechanism. Diverse resource allocation policies for SMT processors have been proposed. The goal of these mechanisms is to improve SMT's performance, avoiding resource contention by giving each thread fair access to resources. The dynamic resource allocation policy presented in [Caz04] (DCRA) classifies threads according to their demands on resources as fast or slow and active or inactive. Special counters keep track of resource utilization.

3 DSMT Microarchitecture
DSMT operates in one of three modes: non-DSMT, pre-DSMT, or DSMT [Ort03]. In non-DSMT mode, there is only a single thread of execution and thus the processor behaves as a

superscalar processor. When a loop is detected, the processor transitions to pre-DSMT mode to detect not only live registers and intra- and inter-thread dependent registers, but also those registers most likely used as induction variables in the loop. Later, if it is determined that multithreaded execution will improve performance, the microarchitecture transitions to DSMT mode, where the overlapped execution of multiple threads occurs. Figure 1 shows the organization of the DSMT microarchitecture. It consists of an SMT architecture organized into nine pipeline stages: Fetch, Decode/Dispatch, Inter-Thread Register Resolution, Issue, Execute, Memory, Write-back, Commit, and Thread Spawning. The Fetch stage fetches a block of instructions from a thread, but can also fetch instructions from different threads based on the scheduling policy of the Scheduler. The fetched instructions are then decoded, renamed, and dispatched to the Reservation Stations (RSs) associated with each functional unit. At the Inter-Thread Register Resolution stage, inter-thread dependencies through registers are resolved. Finally, during the Thread Spawning stage, speculative contexts are spawned and thread-level speculation is controlled. To support simultaneous execution of multiple threads, each thread has its own Instruction Queue (IQ), Reorder Buffer (ROB), and Context registers. Each Context represents the state of a thread, and the multiple Contexts are interfaced to the Thread Creation and Initiation Unit (TCIU), which controls how threads are spawned and executed. The Register Data Flow (RDF) and Memory Data Flow (MDF) mechanisms, associated with the Contexts and Load/Store Queues (LSQ), keep track of inter-thread dependencies through registers and memory respectively.
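As a rough structural sketch of the Contexts and the TCIU ring just described, the model below captures the per-thread state and the head pointer marking the single non-speculative context. All class and field names here are hypothetical illustrations, not taken from the DSMT implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """Per-thread state: a PC plus a private IQ and ROB (simplified)."""
    pc: int = 0
    instruction_queue: list = field(default_factory=list)
    reorder_buffer: list = field(default_factory=list)
    speculative: bool = True  # only the head context is non-speculative

class TCIU:
    """Thread Creation and Initiation Unit (sketch): contexts arranged in
    a ring, with a head index marking the non-speculative context."""
    def __init__(self, num_contexts: int):
        self.contexts = [Context() for _ in range(num_contexts)]
        self.head = 0
        self.contexts[0].speculative = False

    def successor(self, i: int) -> int:
        # Ring interconnect: each context's successor is the next slot.
        return (i + 1) % len(self.contexts)

    def commit_head(self) -> int:
        # The non-speculative context commits; non-speculative status is
        # transferred to its successor, which becomes the new head.
        old = self.head
        self.head = self.successor(old)
        self.contexts[old].speculative = True
        self.contexts[self.head].speculative = False
        return self.head
```

For example, with four contexts, committing the head once would make context 1 the new non-speculative context and return context 0 to the speculative pool.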
The Loop Detection Unit is responsible for detecting loops and supplying target addresses to the Thread Creation and Initiation Unit (TCIU), which ultimately spawns multiple threads. Whenever a taken backward-branch instruction is detected, its branch and target addresses are recorded in the Loop Detection Unit. Later, a second branch instruction with the same branch target address causes the processor to enter pre-DSMT mode. A loop is classified as good or bad based on a break-even policy, in which the IPC measured for a loop during pre-DSMT mode is compared against the Sustained IPC (SIPC) observed during previous run(s) of the same loop in DSMT mode. DSMT aims to guarantee that DSMT-mode execution will result in better performance than non-DSMT-mode execution regardless of misspeculations. The control of nested loop execution is handled by a special stack structure associated with the Loop Detection Unit. Figure 2 shows the multiple contexts in the processor, physically interconnected in a ring fashion. In DSMT mode, the thread spawning process starts by copying the target address of a loop, held in the Continuation register, to the PCs of each spawned context. To jump-start each thread, the values of immediate-addressing-mode instructions of the form addi rd,rd,#immd detected during pre-DSMT mode are predicted and stored in the Loop Stride Speculation Table (LSST). DSMT speculates by predicting the content of a source register rd in the LSST as rd = rd + iteration*immd, where iteration is the number of the iteration being speculatively executed. To avoid blind speculation on rd, each entry in the LSST is associated with confidence (Conf) bits based on 2-bit saturating counters. To keep track of inter-thread dependent registers, each physical register is associated with a set of utility bits.
They are: a Ready (R) bit, which indicates that some instruction logically preceding the current one in the thread's program order has committed a value to the register; Dependency (D) bits, which keep track of registers that have inter-thread dependencies; and a Load (L) bit, which indicates that the register has been speculatively read from a predecessor context. Since register reads are speculative, a confidence estimate based on a 2-bit saturating counter is associated with each register and stored in the Register Read Confidence Table. Finally, to synchronize the execution of

multiple threads, each context has an associated Join (J) bit, which is set to indicate that the thread has completed the execution of a single iteration of a loop. When this occurs, speculative contexts that may have finished earlier due to dynamic behavior must wait until the non-speculative context commits its results and transfers the non-speculative status to a new successor context. Transferring the non-speculative status comprises copying the register values generated locally by the non-speculative context to the successor context, and updating the TCIU with the tag identification of the new non-speculative context (the head). In DSMT, loads from different threads can be executed speculatively. However, only the non-speculative thread is allowed to perform stores. The Memory Dataflow Mechanism ensures that sequential semantics are not violated.

4 Resource Allocation Policy for DSMT
Our experiments presented in [Ort03] show that DSMT suffers from resource contention among threads. DSMT's fetch policy modifies the original SMT policy by assigning one fetch port to the non-speculative thread and sharing the other available port among the speculative threads using ICOUNT. However, as pointed out in [Caz04], when loops have dynamic behavior, i.e., diverse resource demands, resource allocation policies can improve performance by dynamically assigning critical resources among the speculative and non-speculative threads. Our experiments show that DSMT's critical resources are the integer and floating-point units, in addition to the load/store ports.
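The fetch-port arrangement just described, with one port reserved for the non-speculative thread and the other shared among speculative threads via ICOUNT (the thread with the fewest in-flight instructions wins), can be sketched as follows. The function name and the example counts are illustrative assumptions, not part of the DSMT design.

```python
def select_fetch_threads(inflight_counts, nonspec_id):
    """Pick the threads for the two fetch ports.

    Port 0 is always given to the non-speculative thread. Port 1 goes
    to the speculative thread with the fewest in-flight instructions
    (the ICOUNT heuristic), or to no thread if none are speculative.

    inflight_counts: dict mapping thread id -> instructions in flight.
    """
    speculative = {t: n for t, n in inflight_counts.items()
                   if t != nonspec_id}
    port1 = min(speculative, key=speculative.get) if speculative else None
    return nonspec_id, port1
```

For instance, with thread 0 non-speculative and in-flight counts {0: 12, 1: 5, 2: 9}, port 0 fetches for thread 0 and port 1 for thread 1, since thread 1 has the fewest instructions in flight.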
The dynamic resource allocation policy proposed for DSMT is described by the following equation:

E = R(1 + SD) / (T - 1)                for the non-speculative thread
E = (R - R(1 + SD)/(T - 1)) / (T - 1)  for each speculative thread

where E is the number of resource entries allocated to each thread, R is the total number of entries of a resource, T is the total number of contexts in the processor, S is the sharing factor (to be determined experimentally), and D is the number of threads with dynamic behavior.

5 Conclusions
We have presented the DSMT architecture and a resource allocation policy that dynamically allocates critical resources. The performance obtained by the use of this policy in loops with dynamic behavior will be evaluated experimentally.

References
[Akk98] Akkary, H. and Driscoll, M., "A Dynamic Multithreading Processor," 31st Int'l Symp. on Microarchitecture, Dec. 1998.
[Mar99] Marcuello, P. and González, A., "Clustered Speculative Multithreaded Processors," Proc. of the Int'l Conference on Supercomputing (ICS '99), 1999.
[Caz04] Cazorla, F.J., Ramirez, A., Valero, M., and Fernandez, E., "Dynamically Controlled Resource Allocation in SMT Processors," 37th International Symposium on Microarchitecture (MICRO-37), pp. 171-182, Dec. 2004.
[Ort03] Ortiz-Arroyo, D. and Lee, B., "Dynamic Simultaneous Multithreaded Architecture," 16th International Conference on Parallel and Distributed Computing Systems (PDCS-2003), August 13-15, 2003.
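Taking the allocation equation of Section 4 in the piecewise form E_nonspec = R(1+SD)/(T-1) and E_spec = (R - E_nonspec)/(T-1), one interpretation of the formula as transcribed, a quick numeric sketch shows how the entries divide. Under this reading, the non-speculative thread's share grows with S and D, the T-1 speculative threads split the remainder equally, and the allocations always sum to R. All numbers below are hypothetical.

```python
def allocate_entries(R, T, S, D):
    """Split R entries of one resource among T contexts.

    Assumed interpretation of the DSMT allocation equation:
      - the non-speculative thread gets R(1 + S*D)/(T - 1) entries,
        boosted by the sharing factor S and the number D of threads
        with dynamic behavior;
      - each of the T-1 speculative threads gets an equal share of
        the remaining entries.
    Returns (non-speculative share, per-speculative-thread share).
    """
    e_nonspec = R * (1 + S * D) / (T - 1)
    e_spec = (R - e_nonspec) / (T - 1)
    return e_nonspec, e_spec
```

For example, with R = 32 entries, T = 4 contexts, S = 0.1, and D = 2 dynamic threads, the non-speculative thread receives 12.8 entries and each of the three speculative threads 6.4, totaling the full 32.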