The Use of Multithreading for Exception Handling


The Use of Multithreading for Exception Handling
Craig Zilles, Joel Emer*, Guri Sohi
University of Wisconsin - Madison
*Compaq - Alpha Development Group
International Symposium on Microarchitecture - 32, November 1999

Overview
Extensions to a multithreaded processor to reclaim lost performance during exception handling in a pipelined, out-of-order processor
  HARDWARE EXCEPTIONS
  PERFORMANCE IN TRADITIONAL IMPLEMENTATION
  IMPORTANT CHARACTERISTICS OF EXCEPTION HANDLERS
  EXPLOIT THEM WITH EXTENSION TO SMT PROCESSOR
  METHODOLOGY/PERFORMANCE
  AN OPTIMIZATION: QUICK-STARTING
  CONCLUSIONS

Hardware exceptions
COST-EFFECTIVE HARDWARE: UNCOMMON CASE HANDLED BY SOFTWARE
RECOVERABLE EXCEPTIONS (NOT SEGFAULTS)
  TLB miss
  unaligned access
  emulated instructions
EVENT DETECTED BY HARDWARE, RESOLVED BY SOFTWARE
  A short piece of code is executed
  Control is returned to the application at the exception

Performance problem
MUCH LIKE BRANCH MISPREDICT
  causes CHANGE IN CONTROL FLOW
  often DETECTED AT EXECUTE TIME
[Figure: dynamic instruction stream over time]
  1. EXCEPTION DETECTED: pre-exception application | post-exception application
  2. SQUASH THE EXCEPTION AND POST-EXCEPTION INSTRUCTIONS: pre-exception application
  3. FETCH/EXECUTE EXCEPTION HANDLER: pre-exception application | exception handler
  4. REFETCH APPLICATION CODE: pre-exception application | exception handler | post-exception application

Trends
WITH INCREASED PIPELINE LENGTH, SUPERSCALAR WIDTH, AND WINDOW SIZE IT ONLY GETS WORSE
[Figure: penalty cycles per TLB miss (y-axis, 0 to 45) for 3-, 7-, and 11-stage pipelines; benchmarks: alphadoom, applu, compress, deltablue, gcc, hydro2d, murphi, vortex, and the average]

Structure of Exception Handler
RECONVERGENT CONTROL FLOW
  The same application instructions are executed in the same order, INDEPENDENT of the exception handler's execution
MINIMAL DATA DEPENDENCES between application and exception handler
  typically only involving the excepting instruction
Example: TLB MISS HANDLER
  reads miss address from privileged register
  loads from page table
  writes TLB
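
The three steps of the TLB miss handler example can be sketched in a few lines of Python. This is a toy model for illustration only, not the Alpha handler used in the study: the names ToyMMU, tlb_miss_handler, and PAGE_SHIFT, the dictionary-based page table, and the 8 KB page size are all assumptions.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 13  # assumed 8 KB pages, chosen only for illustration


@dataclass
class ToyMMU:
    """Hypothetical model: dictionaries stand in for the page table and the TLB."""
    page_table: dict                          # virtual page number -> physical frame number
    tlb: dict = field(default_factory=dict)   # cached subset of page_table
    miss_address: int = 0                     # stands in for the privileged register

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn not in self.tlb:
            self.miss_address = vaddr   # event detected by "hardware"
            tlb_miss_handler(self)      # resolved by "software"
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (self.tlb[vpn] << PAGE_SHIFT) | offset


def tlb_miss_handler(mmu):
    vaddr = mmu.miss_address     # 1. read miss address from privileged register
    vpn = vaddr >> PAGE_SHIFT
    pfn = mmu.page_table[vpn]    # 2. load from page table (KeyError models a page fault)
    mmu.tlb[vpn] = pfn           # 3. write TLB


mmu = ToyMMU(page_table={1: 7})
paddr = mmu.translate((1 << PAGE_SHIFT) + 5)
```

Note that the handler touches only the miss address, the page table, and the TLB. It neither reads nor writes application registers, which is the minimal-data-dependence property the slide describes.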

Extension to SMT processor
RECONVERGENT CONTROL FLOW: DON'T SQUASH
ALLOCATE THE HANDLER TO SEPARATE THREAD
  FIFO management of window resources (within a thread)
  extra hardware required for ordering threads
[Figure: thread #1 holds the pre-exception and post-exception application instructions; thread #2 holds the exception handler]
PROVIDE APPEARANCE OF SEQUENTIAL EXECUTION
  Control thread retirement order
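
One way to picture "control thread retirement order" is as two per-thread FIFO queues merged at retirement. The sketch below is a simplified model, not the hardware described in the talk. It assumes that the handler retires as a block just before the excepting instruction, and the function name retirement_order and its arguments are hypothetical.

```python
from collections import deque


def retirement_order(app, handler, excepting_index):
    """Merge two per-thread FIFO queues into one sequential retirement stream.

    Assumption (not stated on the slide): the handler thread retires as a
    block immediately before the excepting application instruction.
    """
    if not 0 <= excepting_index < len(app):
        raise ValueError("excepting_index must point at an application instruction")
    app_q, handler_q = deque(app), deque(handler)
    retired = []
    app_retired = 0
    while app_q or handler_q:
        if app_retired == excepting_index and handler_q:
            # Application thread stalls at retirement until the handler drains.
            retired.append(handler_q.popleft())
        else:
            retired.append(app_q.popleft())
            app_retired += 1
    return retired


order = retirement_order(["a0", "a1", "ld", "a3"], ["h0", "h1"], excepting_index=2)
```

In this model the post-exception instructions stay in the window instead of being squashed. Only their retirement waits for the handler, which is what keeps the appearance of sequential execution.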

Extension to SMT processor
MINIMAL DATA DEPENDENCES: USE SEPARATE REGISTER FILE
  Avoids additional renamer complexity
UNCOMMON CASE (TLB MISS -> PAGE FAULT -> CONTEXT SWITCH): REVERT TO NORMAL MECHANISM
MEMORY DEPENDENCES: DETECT CONFLICTS, RECOVER (MUCH LIKE R10K, OR ARB)
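
The memory-dependence check can be illustrated with a small address-matching sketch. It is a loose analogy rather than a model of the R10K or ARB mechanisms the slide mentions. It assumes that a conflict means a handler store writes an address that an application load has already read speculatively, and the name first_conflict and its arguments are hypothetical.

```python
def first_conflict(handler_store_addrs, speculative_loads):
    """Return the sequence number of the oldest conflicting load, or None.

    handler_store_addrs: addresses written by the exception handler.
    speculative_loads: (sequence_number, address) pairs for application loads
        that executed before the handler finished.
    Assumption: recovery restarts the application at the oldest such load.
    """
    written = set(handler_store_addrs)
    conflicts = [seq for seq, addr in speculative_loads if addr in written]
    return min(conflicts) if conflicts else None


restart_at = first_conflict({0x100}, [(5, 0x200), (9, 0x100), (7, 0x100)])
no_conflict = first_conflict({0x100}, [(5, 0x200)])
```

Because handlers rarely store to application data, the check should almost always find nothing, so recovery stays off the common path.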

Methodology
EXAMPLE IMPLEMENTATION: SOFTWARE TLB MISS HANDLING
EXECUTION-DRIVEN SMT SIMULATOR BUILT FROM ALPHA ARCHITECTURE SIMPLESCALAR TOOLKIT
  SUPPORTS ENOUGH OF 21164 PRIVILEGED ARCHITECTURE TO RUN COMMON-CASE TLB HANDLER
  SPECULATIVE EXECUTION, MULTIPLE IN-FLIGHT MISSES
  8 WIDE, 128 WINDOW, 7 STAGE, BIG YAGS, 64K L1s, 1M L2
BENCHMARKS WITH NON-TRIVIAL TLB BEHAVIOR (FROM SPEC AND ELSEWHERE)
  SCALED DOWN (64 ENTRY) DATA TLB
METRIC: PENALTY PER MISS
  (additional overhead vs. simulation with perfect TLB) / misses
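
The penalty-per-miss metric is simple arithmetic, shown here as a small helper. The function name and the cycle counts in the usage line are made up for illustration and are not results from the study.

```python
def penalty_per_miss(cycles, cycles_perfect_tlb, misses):
    """Extra cycles relative to a perfect-TLB simulation, divided by the miss count."""
    if misses <= 0:
        raise ValueError("metric is undefined without at least one TLB miss")
    return (cycles - cycles_perfect_tlb) / misses


# Hypothetical run: 1.2M cycles with TLB misses, 1.0M with a perfect TLB, 10,000 misses.
example_penalty = penalty_per_miss(1_200_000, 1_000_000, 10_000)
```

Normalizing by the miss count lets benchmarks with very different TLB miss rates share one axis in the charts.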

Performance
DOES MUCH BETTER THAN TRADITIONAL SOFTWARE APPROACH
NOT AS GOOD AS AGGRESSIVE HARDWARE TLB MISS WIDGET
[Figure: penalty cycles per TLB miss (y-axis, 0 to 40) for traditional software, multithreaded(1), multithreaded(3), and hardware; benchmarks: alphadoom, applu, compress, deltablue, gcc, hydro2d, murphi, vortex, and the average]

Optimization: Quick-starting
PERFORMANCE GAP BETWEEN HARDWARE AND MULTI-THREADED: FETCH/DECODE LATENCY
SOLUTION: CACHE EXCEPTION HANDLER PARTWAY DOWN PIPELINE
OUR SMT IMPLEMENTATION: PER THREAD FETCH BUFFERS, IDLE RESOURCES WHEN THREAD IS IDLE
PREDICT NEXT EXCEPTION, USE IDLE FETCH CYCLES TO PREFETCH HANDLER.
REDUCES MULTI-THREADED EXCEPTION LATENCY.
[Figure: pipeline stages FETCH, FETCH, DECODE, RENAME, REGREAD, REGREAD, EXECUTE]

Performance: Quick-starting
ALMOST CUTS PERFORMANCE GAP IN HALF
[Figure: penalty cycles per TLB miss (y-axis, 0 to 25) for multithreaded(1), quick start(1), and hardware; benchmarks: alphadoom, applu, compress, deltablue, gcc, hydro2d, murphi, vortex, and the average]

Single Thread Performance vs. Throughput
SINGLE APPLICATION (PREVIOUS RESULTS)
  FOCUS: IMPROVE SINGLE THREAD PERFORMANCE
MULTIPROGRAMMED/MULTITHREADED WORKLOAD
  FOCUS: MAXIMIZE THROUGHPUT
OUR EXPERIMENT (NOT NECESSARILY A FAIR COMPARISON)
  RUN 3 APPLICATIONS, 1 IDLE THREAD FOR EXCEPTION HANDLING

Performance on Multiprogrammed Workloads
[Figure: penalty cycles per TLB miss (y-axis, 0 to 12) for traditional, multithreaded(1), quick start(1), and hardware; three-application workloads: adm-gcc-vor, apl-cmp-h2d, apl-dbl-vor, dbl-gcc-h2d, adm-cmp-vor, adm-h2d-mph, apl-dbl-mph, cmp-gcc-mph, and the average]
PERFORMANCE IS MORE COMPLICATED
  SMT IS MORE LATENCY TOLERANT
  SMT IS LESS TOLERANT OF WASTED BANDWIDTH

Related Work
ARCHITECTURES:
  M-MACHINE
    FILLO, KECKLER, DALLY, CARTER, CHANG, GUREVICH, LEE
    KECKLER, DALLY, CHANG, LEE, CHATTERJEE
  MULTISCALAR/KESTREL
SUBORDINATE MULTITHREADING:
  CHAPPELL, STARK, KIM, REINHARDT, AND PATT
  SONG AND DUBOIS

Conclusions
SIGNIFICANTLY IMPROVES EXCEPTION HANDLING PERFORMANCE:
  software TLB miss performance approaches that of an aggressive hardware TLB miss implementation
NOT ALL EXCEPTIONS CAN BE IMPLEMENTED IN HARDWARE
HIGH PERFORMANCE EXCEPTIONS ENABLE NOVEL SOFTWARE SYSTEMS
  a la SOFTWARE DSM or CONCURRENT GC