Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero


Efficient Runahead Threads. Tanausú Ramírez, Alex Pajuelo, Oliverio J. Santana, Onur Mutlu, Mateo Valero. The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT), 11-15 September 2010, Vienna, Austria.

Efficient Runahead Threads, Tanausú Ramírez (UPC)

2 Introduction. Simultaneous Multithreaded (SMT) processors execute multiple instructions from several threads at the same time, but resource sharing still hits the memory wall: a memory-intensive thread streams long-latency memory accesses and holds too many resources, starving the other threads of available resources and preventing their forward progress as well. [Figure: resource slots occupied over time by the memory-intensive thread.]

3 Runahead Threads (RaT). A multithreaded approach to the Runahead paradigm: an interesting way of using resources in SMT in the presence of long-latency memory instructions. [Figure: timeline in which thread 1 turns into a speculative runahead thread on a long-latency miss and generates prefetches while thread 2 continues as a normal thread.] A runahead thread executes speculative instructions while a long-latency load is resolving: it follows the program path instead of being stalled, overlapping prefetch accesses with the main load.

4 RaT benefits. RaT allows memory-bound threads to go speculatively in advance, doing prefetching:
- provides speedup by increasing memory-level parallelism (MLP)
- improves the individual performance of that thread
Long-latency loads don't hold resources: the runahead thread frees resources faster, and its prefetches exploit MLP. RaT also prevents monopolization by memory-bound threads, transforming a resource-hoarder thread into a light-consumer thread, so shared resources can be fairly used by the other threads.

5 RaT shortcoming. RaT causes extra work due to the additional instructions it executes: more instructions mean more energy. RaT increases power consumption by ~50%. [Chart: extra instruction ratio for ILP2, MIX2, MEM2, ILP4, MIX4, MEM4, AVG.] This has a direct impact on performance-energy efficiency.

6 Proposal. Improve RaT efficiency by avoiding the inefficient cases: a runahead period with no long-latency loads does no prefetching, so its instructions are useless work. We want only the useful executions that maximize performance, controlling the number of useless instructions executed by runahead threads. Our goal is an energy-performance balance: reduce instructions without losing performance, and thereby reduce power.

7 Talk outline
- Introduction
- RaT problem
- Proposal: Efficient Runahead Threads
  - Useful Runahead Distance
  - Runahead Distance Prediction (RDP)
- Evaluation: Framework, Results
- Conclusions

8 Our approach. We propose a mechanism to improve RaT efficiency, based on features of the runahead threads themselves: capture the useful lifetime of a runahead thread so as to exploit as much MLP as possible with the minimum number of speculative instructions. [Figure: after a long-latency miss, the runahead thread first prefetches (MLP) and then executes useless instructions until the load is resolved.] The key is the balance between prefetching and useless instructions.

9 Useful Runahead Distance. [Figure: long-latency loads LLL1-LLL4 issued during an efficient runahead thread; the useful runahead distance ends at the last one.] The useful runahead distance is how far a thread should run ahead speculatively such that its execution is efficient. By limiting the runahead execution of a thread, we generate efficient runahead threads that avoid unnecessary speculative execution: we control the number of useless speculative instructions while obtaining the maximum performance for a given thread.
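
The notion above can be sketched in a few lines of Python (a hypothetical software model, not the hardware): the useful runahead distance is the count of speculative instructions up to and including the last long-latency load, since anything executed after that point prefetches nothing.

```python
def useful_runahead_distance(spec_instrs):
    """Useful runahead distance: number of speculative instructions up to
    and including the last long-latency (L2-miss) load issued in runahead.
    `spec_instrs` is the speculative instruction stream, each entry flagged
    True if it is a long-latency load (an illustrative encoding)."""
    distance = 0
    for i, is_lll in enumerate(spec_instrs, start=1):
        if is_lll:
            distance = i  # extend the useful window to this LLL
    return distance

# A runahead period of 8 speculative instructions whose last LLL is the 5th:
trace = [False, True, False, False, True, False, False, False]
# useful_runahead_distance(trace) -> 5; the final 3 instructions are useless work
```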

10 Using useful runahead distance. The useful runahead distance can help the processor decide whether a runahead thread should be initiated at all on an L2-miss load, eliminating useless runahead threads. [Figure: a runahead thread whose whole period is useless instructions until the load is resolved.] It can also indicate how long the runahead execution period should be when a runahead thread is initiated, reducing the unnecessary extra work done by useful runahead threads. [Figure: an efficient runahead thread cut off before its useless tail.]

11 Runahead Distance Prediction. RDP operation: calculate the useful runahead distances of runahead thread executions, and use them to predict future runahead periods caused by the same static load instruction. RDP implementation: use two useful runahead distances (full and last) to improve the reliability of the useful-distance information; the runahead thread information is associated with its causing long-latency load and stored in a table.

12 RDP mechanism (I). First time: there is no useful-distance information for the load. Execute the runahead thread fully and compute the new useful runahead distance. [Figure: load PC1 triggers runahead; the useful runahead distance spans LLL1-LLL4.] Both the full and last distances are updated with the obtained useful runahead distance.

13 RDP mechanism (II). Next times: check whether the distances are reliable. No: the full and last distances are very different, based on a threshold, the Distance Difference Threshold (DDT): |full - last| > DDT. In that case, execute the runahead thread fully [Figure: load PC1 runs fully; LLL1-LLL2 define the new useful runahead distance] and update both the full and last runahead distances.

14 RDP mechanism (III). Next times: check whether the distances are reliable. Yes: |full - last| <= DDT. Take the last distance as the useful runahead distance and decide whether to start the runahead thread: if last < activation threshold, do not start the runahead thread; otherwise, start the runahead thread and execute as many instructions as the last useful distance indicates. [Figure: the last runahead distance determines how long the runahead thread executes.]

15 RDP mechanism (IV). Ensuring the reliability of small useful distances: some loads have small initial runahead distances, but their useful runahead distance becomes larger later (26% of loads increase their useful runahead distance). We use a simple heuristic, an additional confidence mechanism that determines whether a small distance is reliable: while the useful runahead distance stays below the activation threshold for fewer than N times (N=3), keep executing full runahead periods.
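
The decision and update steps of slides 12-15 can be sketched as a small Python model. This is a hypothetical illustration: the threshold values, the entry fields, and the assumption that a limited period refreshes only the last distance (letting full and last drift apart) are ours, not taken from the talk.

```python
# Hypothetical model of the RDP per-load decision logic (slides 12-15).
DDT = 64                   # Distance Difference Threshold (illustrative value)
ACTIVATION_THRESHOLD = 32  # minimum last distance worth starting runahead
N_CONFIRM = 3              # small-distance repetitions before it is trusted

class RDPEntry:
    """Per-static-load state stored in the distance table."""
    def __init__(self):
        self.full = None   # useful distance measured on the last full period
        self.last = None   # most recently measured useful distance
        self.small = 0     # confidence counter for consecutive small distances

def rdp_decide(entry):
    """Return ('full', None), ('limited', n), or ('no_runahead', None)."""
    if entry.full is None:                      # first time: no information
        return ('full', None)
    if abs(entry.full - entry.last) > DDT:      # distances disagree: unreliable
        return ('full', None)
    if entry.last < ACTIVATION_THRESHOLD:
        if entry.small < N_CONFIRM:             # small distance not yet trusted
            return ('full', None)
        return ('no_runahead', None)            # useless runahead: don't start
    return ('limited', entry.last)              # run ahead exactly `last` instrs

def rdp_update(entry, useful_distance, full_period):
    """Record the useful distance measured in a runahead period. Full periods
    refresh both distances and the confidence counter; limited periods refresh
    only `last` (an assumption about how full and last can drift apart)."""
    if full_period:
        entry.full = useful_distance
        if useful_distance < ACTIVATION_THRESHOLD:
            entry.small += 1
        else:
            entry.small = 0
    entry.last = useful_distance
```

For example, a load whose full period measures a distance of 100 is later started in limited mode for exactly 100 instructions, while a load that measures a small distance N_CONFIRM times in a row stops triggering runahead threads altogether.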

16 Talk outline
- Introduction
- RaT problem
- Proposal: Efficient Runahead Threads
  - Useful Runahead Distance
  - Runahead Distance Prediction (RDP)
- Evaluation: Framework, Results
- Conclusions

17 Framework. Tools: a simulator derived from the SMTSIM execution-driven simulator, enhanced with the Wattch power model and integrated CACTI. Benchmarks: multiprogrammed workloads of 2 and 4 threads mixed from the SPEC 2000 benchmarks (ILP, MEM, MIX). Methodology: FAME, which ensures a continuous multithreaded state so that threads are fairly represented in the final measurements.

18 Framework. SMT model with a complete resource-sharing organization.
Processor core:
- Processor depth: 10 stages
- Processor width: 8-way (2 or 4 contexts)
- Reorder buffer size: 64 entries per context
- INT/FP registers: 48 regs per context
- INT/FP/LS issue queues: 64 / 64 / 64
- INT/FP/LdSt units: 4 / 4 / 2
- Hardware prefetcher: stream buffers (16)
Memory subsystem:
- Icache: 64 KB, 4-way, pipelined
- Dcache: 64 KB, 4-way, 3-cycle latency
- L2 cache: 1 MB, 8-way, 20-cycle latency
- Cache line size: 64 bytes
- Main memory latency: 300 cycles minimum

19 Evaluation. Performance and energy-efficiency analysis: throughput, harmonic mean of per-thread weighted speedups, extra instructions, power, and ED2. Comparison with state-of-the-art mechanisms: MLP-aware Flush (MLP-Flush), MLP-aware Runahead threads (MLP-RaT), RaT plus single-thread execution techniques (RaT-SET), and our Efficient Runahead Threads (RDP).

20 Performance. IPC throughput and Hmean: RDP preserves the performance of RaT better than the other mechanisms, staying within ~2% throughput and ~4% hmean deviation.

21 Extra Instructions. RDP achieves the highest extra-work reduction (-44% on average), which comes from eliminating both useless runahead threads and the useless instructions beyond the runahead distance. [Chart: speculative instruction ratio w.r.t. RaT for RaT-SET, MLP-RaT, RDP, and RDP Oracle across ILP2, MIX2, MEM2, ILP4, MIX4, MEM4, Avg.]

22 Power. RDP effectively reduces power consumption compared to RaT: -14%. [Chart: power (Watts) for ICOUNT, MLP-Flush, RaT, RaT-SET, MLP-RaT, and RDP across ILP2, MIX2, MEM2, ILP4, MIX4, MEM4, AVG.]

23 Energy-efficiency. Efficiency measured as energy-delay squared (ED2). RDP presents the best ratio (10% better): it provides performance similar to RaT while consuming less power (-14% on average). [Chart: ED2 normalized to SMT+RaT for MLP-Flush, RaT-SET, MLP-RaT, and RDP across ILP2, MIX2, MEM2, ILP4, MIX4, MEM4, Avg.]

24 Runahead threads behavior with RDP. Controlled runahead-thread distribution: RDP generates smaller runahead threads while providing the same usefulness. [Chart: number of runahead threads (thousands) by range of runahead distances, RDP vs. RaT.]

25 Talk outline
- Introduction
- RaT problem
- Proposal: Efficient Runahead Threads
  - Useful Runahead Distance
  - Runahead Distance Prediction (RDP)
- Evaluation: Framework, Results
- Conclusions

26 Conclusions. RaT has a shortcoming: it increases the extra work executed due to speculative instructions, trading performance gain for extra energy consumption. Our approach, efficient runahead threads via useful runahead distance prediction (RDP), eliminates at a fine grain the useless portion of runahead threads, reducing extra instructions while maintaining the performance benefits of runahead threads. RDP shows a significant performance and ED2 advantage over state-of-the-art techniques.

Thank you! Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15 September, Vienna, Austria

Backup Slides. PACT 2010 Talk

29 Runahead Threads (RaT). A multithreaded approach to the Runahead paradigm. [Figure: timeline in which Thread 1 turns into a speculative runahead thread on a long-latency miss while Thread 2 continues.] A runahead thread executes speculative instructions while a long-latency load is resolving: it follows the program path instead of being stalled, enabling independent execution of long-latency misses.

30 Starting a runahead thread. [Figure: reorder buffer (ROB) of Thread 1 with the missing load at its head, interleaving loads and branches; Thread 1 becomes a runahead thread while Thread 2 remains a normal thread.] When a thread is stalled by an L2 cache miss: turn the thread into a runahead thread, checkpoint the architectural state, retire the miss, and enter runahead mode (a context switch is not needed).

31 Executing a runahead thread. [Figure: Thread 1 runs as a normal thread until a long-latency miss, then as a runahead thread issuing other loads; Thread 2 continues normally.] Instructions are speculatively executed. L2-miss-dependent instructions are marked invalid and dropped; their status is tracked through the registers (INV bits). Valid instructions are executed, and their loads are issued to memory.

32 Executing a runahead thread. [Figure: invalid and valid instructions within the runahead thread of Thread 1.] Long-latency loads don't hold resources, so resources are freed faster, and valid loads generate prefetches. A runahead thread uses and releases shared resources faster, making them available to the other threads.

33 Finishing a runahead thread. [Figure: when the load is resolved, Thread 1 performs checkpoint recovery and resumes as a normal thread.] Leave runahead when the L2 miss resolves: flush the pipeline, restore the state from the checkpoint, and resume normal execution.

34 RaT benefit results. Performance (IPC) of a 2-thread mix of a memory-bound and an ILP-bound thread: each thread is faster without disturbing the other. Overall SMT processor performance is improved: single-thread performance rises and resource clogging is avoided. [Chart: IPC of Thd1 (MEM), Thd2 (ILP), SMT (mix), and RaT (mix).]

35 Evaluation Framework. Methodology: FAME, which ensures a continuous multithreaded state so that threads are fairly represented in the final measurements. Metrics: performance, as throughput (sum of individual IPCs) and harmonic mean (hmean, a performance/fairness balance); energy efficiency, as energy-delay squared (ED2), a ratio of performance and energy consumption.

36 RDIT. RDIT-related data: a 4-way table of 1024 entries, ~4.5 KB in size, indexed by the load PC XORed with 2 bits of the GHR; power 1.09 Watts, based on CACTI.
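
The indexing scheme can be sketched as follows; this is a hypothetical reading of the slide, and the set/way split (1024 entries in a 4-way table gives 256 sets) and the choice of the low 2 GHR bits are our assumptions.

```python
# Hypothetical sketch of the RDIT set-index computation: a 4-way
# set-associative table of 1024 entries, indexed by the load PC XORed
# with 2 bits of the global history register (GHR).
NUM_SETS = 1024 // 4  # 1024 entries, 4 ways per set -> 256 sets

def rdit_index(load_pc, ghr):
    """Hash a load PC and the low 2 GHR bits into a set index."""
    return (load_pc ^ (ghr & 0b11)) % NUM_SETS

# The same static load can map to different sets under different
# global branch histories, e.g.:
# rdit_index(0x400123, 0b01) vs. rdit_index(0x400123, 0b10)
```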

37 DDT analysis. Distance Difference Threshold study. [Chart: performance and extra-work variation (%) across DDT values, ranging roughly from -10% to 35%.]

38 Distribution of RDP decisions. Breakdown of runahead threads based on the RDP decisions: not started, limited, or full. [Chart: per-workload percentages for the 2-thread mixes. ILP2: apsi,eon; apsi,gcc; bzip2,vortex; fma3d,gcc; fma3d,mesa; gcc,mgrid; gzip,bzip2; gzip,vortex; mgrid,galgel; wupwise,gcc. MIX2: applu,vortex; art,gzip; bzip2,mcf; equake,bzip2; galgel,equake; lucas,crafty; mcf,eon; swim,mgrid; twolf,apsi; wupwise,twolf. MEM2: applu,art; art,mcf; art,twolf; art,vpr; equake,swim; mcf,twolf; parser,mcf; swim,mcf; swim,vpr; twolf,swim.]