An Integrated Hardware/Software Approach to On-Line Power-Performance Optimization

An Integrated Hardware/Software Approach to On-Line Power-Performance Optimization
Sandhya Dwarkadas, University of Rochester

Outline
- Framework: Dynamically Tunable Clustered Multithreaded Architecture
- Motivation: workload characterization
- Architectural support for adaptation
- Role of program analysis
- Resource-aware operating system support
- Collaborators at UR: David Albonesi, Chen Ding, Eby Friedman, Michael L. Scott, UR Systems and Architecture groups
- Collaborators at IBM: Pradip Bose, Alper Buyuktosunoglu, Calin Cascaval, Evelyn Duesterwald, Hubertus Franke, Zhigang Hu, Bonnie Ray, Viji Srinivasan

Emerging Trends
- Wire delays and faster clocks will necessitate aggressive use of clustering
- Larger transistor budgets and low cluster design costs will enable incremental addition of more clusters
- There is a trend toward multithreading to exploit the transistor budget for improved throughput by combining ILP and TLP
- Combine clustering and multithreading?

Conventional Processor Design / A Clustered Processor / A Clustered Multithreaded (CMT) Architecture
[Block diagrams: a conventional design with a monolithic issue queue and register file (large structures, slower clock speed); a clustered processor that splits the issue queue, register file, and integer/floating-point units into small, fast clusters behind a shared front end (I-cache, predictor, fetch queue, rename/map), at the cost of high latency for some instructions; and a CMT organization with per-thread front ends feeding multiple clusters over a shared unified L2 cache.]
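The cluster/thread organization in the diagrams above can be summarized as a small configuration model. The following Python sketch is ours, not from the talk (class and field names, and the 8-cluster/8-bank machine size, are assumptions); it only illustrates the knobs a dynamically tunable CMT exposes: which clusters and cache banks each hardware thread is currently allowed to use.

```python
from dataclasses import dataclass, field

TOTAL_CLUSTERS = 8       # assumed machine size, for illustration only
TOTAL_CACHE_BANKS = 8

@dataclass
class ThreadConfig:
    """Resources currently granted to one hardware thread."""
    clusters: set = field(default_factory=set)     # cluster ids this thread may dispatch to
    cache_banks: set = field(default_factory=set)  # cache banks this thread may allocate in

@dataclass
class CMTConfig:
    """Global view of the dynamically tunable CMT."""
    threads: dict = field(default_factory=dict)    # thread id -> ThreadConfig

    def unused_clusters(self):
        used = set().union(*(t.clusters for t in self.threads.values()))
        return set(range(TOTAL_CLUSTERS)) - used

    def grant_cluster(self, tid, cluster_id):
        # Moving a cluster between threads is the basic adaptation step;
        # clusters left unused can instead be power-gated to save energy.
        self.threads.setdefault(tid, ThreadConfig()).clusters.add(cluster_id)

# Example: two threads with four clusters each; nothing is left idle.
cfg = CMTConfig()
for c in range(4):
    cfg.grant_cluster("T0", c)
for c in range(4, 8):
    cfg.grant_cluster("T1", c)
print(sorted(cfg.unused_clusters()))   # -> []
```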

Components of IPC Degradation / Overall Energy Impact
[Charts: IPC degradation and power for ilp, com, and mix workloads, comparing monolithic, clustered, and centralized variants (no communication penalty, centralized register file), plus SMT and TD configurations; energy breakdown into dynamic communication, result bus, ALU, I-cache, register files, branch predictor, clock, and issue-queue RAM.]

Problems
- Tradeoff in communication vs. parallelism for a single thread
- Increased communication delays and contention when employing multiple threads
- Reduced performance
- Increased energy consumption

Single Thread Execution
[Block diagram: a single thread spread across all clusters of the CMT over the shared unified L2 cache.]
Goal: intelligent mapping of applications to resources for improved throughput and resource utilization as well as reduced energy.

Communication vs. Parallelism
[Figure: with fewer clusters active, fewer instructions are in flight but communication stays local; with more clusters active, more ready instructions can be exploited. Distant parallelism: distant instructions that are ready to execute.]

Single-Thread Adaptation [ISCA]
- Dynamic interval-based exploration can adapt to the available instruction-level parallelism in a single thread
- Determine when communication can no longer be tolerated in exploiting additional clusters
- Allow remaining clusters to be turned off to reduce power consumption or to be used by a different thread/application
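To make the interval-based exploration concrete, here is a minimal Python sketch written for this transcript rather than taken from the talk: at interval boundaries it tries each allowed cluster count, records the observed IPC, and settles on the best-performing configuration, leaving the remaining clusters free to be gated off or reassigned. The measure_ipc callback and the candidate cluster counts are assumptions.

```python
def interval_based_exploration(measure_ipc, candidate_clusters=(1, 2, 4, 8)):
    """Explore cluster counts for one interval each, then pick the best.

    measure_ipc(n_clusters) is assumed to run the next interval with
    n_clusters enabled and return the IPC observed over that interval.
    """
    samples = {}
    for n in candidate_clusters:          # exploration phase: one interval per configuration
        samples[n] = measure_ipc(n)
    best = max(samples, key=samples.get)  # configuration with the highest observed IPC
    # Clusters beyond `best` can be power-gated or handed to another thread.
    return best, samples


if __name__ == "__main__":
    # Toy IPC model: parallelism saturates at 4 clusters, after which
    # inter-cluster communication starts to hurt.
    toy_ipc = {1: 0.9, 2: 1.5, 4: 1.8, 8: 1.6}
    best, seen = interval_based_exploration(lambda n: toy_ipc[n])
    print("chosen clusters:", best, "observed:", seen)
```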

Results with Interval-Based Scheme (ISCA)
[Chart: instructions per cycle (IPC) for cjpeg, crafty, gzip, parser, vpr, djpeg, galgel, mgrid, swim, and the harmonic mean, comparing fixed cluster counts against the interval-based scheme.]
Overall improvement: %

An Integrated Approach to Dynamic Tuning of the CMT
- Architectural design and dynamic configuration for fine-grain adaptation
- Program analysis to determine application behavior
- Runtime support to match predicted application behavior and resource requirements with available resources
- Resource-aware thread scheduling for maximum throughput and fairness
- Runtime support for balancing ILP with TLP in parallel application environments

Multithreaded Adaptation: Out-of-Order Dispatch & Fetch Gating
[Figure: a shared dispatch queue holding instructions from several threads (Tx = thread id). With in-order dispatch, a blocked instruction at the head stalls dispatch; with out-of-order dispatch, ready instructions from other threads behind it can still be dispatched.]
- Basic scheme: interval-based with a fixed number of cycles
- Exploration-based
- Hysteresis to avoid spurious changes
(A code sketch of the out-of-order dispatch selection appears below.)

Thread to Cluster Assignment / Thread to Cache Bank Assignment
[Charts: IPC for SMT and CMT with varying numbers of cache banks (N_WAY = 8) on ilp, com, and mix workloads, and a comparison of monolithic, TD + SD + FG, and adaptive assignments.]
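The following Python sketch, added for illustration and not taken from the slides, shows the selection logic behind out-of-order dispatch: scan the dispatch queue from head to tail, skip instructions from blocked threads, and dispatch up to the machine's dispatch width from whichever threads are ready.

```python
from collections import namedtuple

# One queue slot: which thread the instruction belongs to and whether it is ready.
Slot = namedtuple("Slot", ["thread", "ready"])

def dispatch(queue, blocked_threads, width=4):
    """Out-of-order dispatch: unlike in-order dispatch, a blocked instruction
    at the head does not stall younger, ready instructions from other threads."""
    dispatched = []
    for slot in queue:                       # head of the queue first
        if len(dispatched) == width:
            break
        if slot.thread in blocked_threads or not slot.ready:
            continue                         # skip (in-order dispatch would stall here)
        dispatched.append(slot)
    return dispatched

# Example: T0 is blocked (e.g., fetch-gated after a long-latency miss); T1/T2 proceed.
q = [Slot("T0", True), Slot("T1", True), Slot("T0", True), Slot("T2", True), Slot("T1", True)]
print([s.thread for s in dispatch(q, blocked_threads={"T0"})])   # -> ['T1', 'T2', 'T1']
```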

ILP vs. TLP
[Charts: IPC vs. number of threads for LU and Jacobi under shared, ideal, write-back (WB), write-through (WT), centralized, and SMT cache configurations.]

A Dynamically Tunable Clustered Multithreaded (DT-CMT) Architecture
[Block diagram: per-thread front ends (I-cache, predictor, fetch queue, rename/map) feeding integer and floating-point clusters over a shared unified L2 cache.]

Current Approaches to Adaptation
- Reactive: adaptive change is triggered after observed variation in program behavior
- Inspect counters; if there is a phase change, explore configurations, record CPIs, and pick the best configuration; otherwise remain at the present configuration
- Success depends on the ability to repeat behavior across successive intervals

Interval Length
- Problem: unstable behavior across intervals
- Solution: start with the minimum allowed interval length; if phase changes are too frequent, double the interval length to find a coarse enough granularity such that behavior is consistent; periodically reset the interval length to the minimum
- Small interval lengths can result in noisy measurements

Varied Interval Lengths
[Table: for gzip, vpr, crafty, parser, swim, mgrid, galgel, cjpeg, and djpeg, the instability factor at a fixed small interval length, and the minimum acceptable interval length with its instability factor.]
Instability factor: percentage of intervals that flag a phase change.
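A minimal sketch of the interval-length rule described above, written in Python for this transcript (the phase-change test, thresholds, and interval sizes are placeholder assumptions): count how often successive intervals flag a phase change, and double the interval length whenever that instability factor is too high.

```python
def instability_factor(metric_per_interval, threshold=0.10):
    """Fraction of intervals whose metric differs from the previous interval
    by more than `threshold` (relative), i.e., intervals that flag a phase change."""
    changes = sum(
        1 for prev, cur in zip(metric_per_interval, metric_per_interval[1:])
        if abs(cur - prev) > threshold * max(abs(prev), 1e-9)
    )
    return changes / max(len(metric_per_interval) - 1, 1)

def choose_interval_length(samples_at_length, min_length=10_000,
                           max_length=10_000_000, acceptable=0.05):
    """Start at the minimum interval length and keep doubling until the
    instability factor is acceptable.  samples_at_length(length) is assumed
    to return the per-interval metric (e.g., CPI) measured at that length."""
    length = min_length
    while length <= max_length:
        if instability_factor(samples_at_length(length)) <= acceptable:
            return length
        length *= 2          # too many phase changes: coarsen the granularity
    return max_length        # fall back to the coarsest allowed interval

# Toy example: short intervals look noisy, longer ones look stable.
fake = {l: ([1.0, 1.5, 0.8, 1.4] * 4 if l < 80_000 else [1.2] * 16)
        for l in (10_000, 20_000, 40_000, 80_000, 160_000)}
print(choose_interval_length(lambda l: fake[l]))   # -> 80000
```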

Characterizing Program Behavior Variability
- Whole-program instrumentation (currently SPEC2K)
- Periodic hardware performance counter sampling using Ticker
- Dynamic Probe Class Library (DPCL) to insert a timer-based interrupt into the program
- Performance Monitoring API (PMAPI) to read the hardware counters
- AIX-based; sampling interval on the order of milliseconds
- Examination of IPC, D-cache miss rate, instruction mix, and branch mispredict rate
- Statistical analysis: correlation, frequency analysis, behavior variation

Example IPC Plots (SPEC2K: bzip2)
- Existence of macro phase behavior
- Significant behavior variation even at coarse granularities
- Strong frequency components/periodicity across several metrics

Similarity Across Metrics (SPEC2K: bzip2)

Example IPC Plots (SPEC2K: art)
- High rate of behavior variation from one measurement to the next

Comparing Frequency Spectra (bzip2, art)
- Strong low-frequency (bzip2) and high-frequency (art) components, indicating a high rate of repeatability

Program Behavior Variability
- Variation in behavior, while different, persists across different sampling interval sizes
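The statistical analysis step can be illustrated with a short, self-contained Python sketch (the sampled series below are made up; in the talk's setup they would come from PMAPI counters read on a timer interrupt): it computes the Pearson correlation between two per-interval metrics, the kind of cross-metric similarity the slides report for IPC and D-cache miss rate.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long per-interval series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up per-interval samples standing in for counter readings:
ipc        = [1.2, 1.1, 0.6, 0.5, 1.2, 1.1, 0.6, 0.5]
dmiss_rate = [0.02, 0.03, 0.10, 0.12, 0.02, 0.03, 0.11, 0.12]

# A strong (here, negative) correlation says the two metrics track the same
# phases, so one metric can stand in for (predict) the other.
print(round(pearson(ipc, dmiss_rate), 2))   # close to -1.0
```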

Important Behavior Characteristics
- Programs exhibit high degrees of repeatability across all metrics
- The rate of behavior repeatability (periodicity) across metrics is highly similar
- Variation in behavior from one interval to the next can be high
- Variation in behavior, while different, persists across different sampling interval sizes

On-Line Program Behavior Prediction
- Linear (statistical) predictors to exploit behavior in the immediate past: last value, Average(N), Mode(N)
- Table-based predictors to exploit periodicity (non-linear): run-length encoded or fixed-size history
- Cross-metric predictors to exploit similarity across metrics: use one metric to predict several potentially different metrics
- Efficiently combine multiple predictors
- On-line power-performance optimization needs to be predictive rather than reactive

Table-Based Predictors
- Example: a table-based, asymmetric predictor indexed by the recent history a_{t-4}, a_{t-3}, a_{t-2}, a_{t-1}; each entry holds the predictions (a_t, b_t) and the vote values (a_vote, b_vote)
- Default to last value during the learning period
- Use a voting mechanism to update table entries: the prediction (a_t or b_t) is updated with the mode of the actual value (vote) the last time this history was encountered, the current prediction, and the measured value at the end of the interval
- The encoding and length of the history (index) can be varied: fixed size or run-length encoded

Design Trade-offs
- Precision: too coarse a precision implies insensitivity to fine-grained behavior; too fine a precision implies sensitivity to noise
- Size of history: too long a history implies a potentially long learning period; too short a history implies an inability to distinguish between common histories of otherwise distinct regions
- Both precision and history have table size implications
- Trade-off between noise tolerance, learning period, and prediction accuracy

Program Behavior Predictability
[Chart: mean IPC prediction error (Power).]
- Variations in program behavior are predictable to within a few percent
- Table-based predictors outperform any others for programs with high variability
- Cross-metric table-based predictors make it possible to predict multiple metrics using a single predictor
- Microarchitecture-independent metrics allow stable prediction even when the predicted metric changes due to dynamic optimization
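The linear and table-based predictors can be sketched compactly; the Python code below is illustrative only (quantization step, history length, and the learning-period rule are simplifications invented for this transcript). The table predictor indexes on the last few quantized values and updates each entry by a small vote among the stored prediction, the stored vote, and the newly measured value.

```python
from collections import deque, Counter

def last_value(history):            # predict the most recent observation
    return history[-1]

def average_n(history, n=4):        # mean of the last n observations
    recent = history[-n:]
    return sum(recent) / len(recent)

def mode_n(history, n=4):           # most common of the last n observations
    return Counter(history[-n:]).most_common(1)[0][0]

class TablePredictor:
    """Toy table-based predictor: index = the last `hist_len` quantized values,
    entry = (predicted next value, last vote), updated by a simple majority vote."""
    def __init__(self, hist_len=4, step=0.1):
        self.hist_len, self.step = hist_len, step
        self.history = deque(maxlen=hist_len)
        self.table = {}                       # history tuple -> (prediction, last_vote)

    def _q(self, x):                          # quantize to the chosen precision
        return round(x / self.step)

    def predict(self):
        key = tuple(self.history)
        if len(key) < self.hist_len or key not in self.table:
            # learning period / unseen history: fall back to last value
            return self.history[-1] * self.step if self.history else 0.0
        return self.table[key][0] * self.step

    def update(self, measured):
        key, q = tuple(self.history), self._q(measured)
        if len(key) == self.hist_len:
            old_pred, old_vote = self.table.get(key, (q, q))
            # vote among previous vote, current prediction, and the new measurement
            new_pred = Counter([old_vote, old_pred, q]).most_common(1)[0][0]
            self.table[key] = (new_pred, q)
        self.history.append(q)

# Periodic toy signal: the table predictor learns the repeating pattern.
p = TablePredictor()
for x in [1.0, 1.2, 0.5, 0.6] * 8:
    _ = p.predict()
    p.update(x)
print(round(p.predict(), 1))   # -> 1.0, the next value in the repeating pattern
```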

Problems
- High variability in program behavior
- Interval length is hard to determine: too small leads to measurement noise, too large misses opportunities for adaptation
- Interval and actual phase boundaries do not match

Information Space for Workload Analysis
[Figure]

Data Locality Analysis (Shen and Ding)
- Program phase detection
- Phase marker insertion
- Basic block trace analysis

TOMCATV Reuse-Distance (RD) Signature with Phase Boundaries
[Plot of the reuse-distance signature with detected phase boundaries.]
Objective: to find basic blocks marking unique phase boundaries.
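As an illustration of the kind of signal the locality analysis works with, here is a small Python sketch. It is a simplified stand-in written for this transcript, not the Shen and Ding algorithm itself: it flags candidate phase boundaries wherever the average reuse distance over a window shifts sharply, which is the intuition behind marking phases from an RD signature.

```python
def phase_boundaries(reuse_distances, window=100, jump=2.0):
    """Flag indices where the mean reuse distance of the current window differs
    from the previous window by more than a factor of `jump` (either direction)."""
    boundaries = []
    prev_mean = None
    for start in range(0, len(reuse_distances) - window + 1, window):
        mean = sum(reuse_distances[start:start + window]) / window
        if prev_mean is not None:
            lo, hi = sorted((prev_mean, mean))
            if lo > 0 and hi / lo > jump:
                boundaries.append(start)      # locality changed sharply: phase boundary
        prev_mean = mean
    return boundaries

# Toy trace: small reuse distances (tight loop nest) followed by large ones (streaming).
trace = [8] * 300 + [512] * 300 + [8] * 300
print(phase_boundaries(trace))   # -> [300, 600]
```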

Similarity of Locality Phases
[Charts: similarity of detected locality phases for TOMCATV and COMPRESS, and similarity of phases detected with basic block vectors (BBV) for TOMCATV.]

Bringing It All Together
- Locality analysis for phase detection and marking of macro phases
- Linear or non-linear (table-based) prediction within each phase for improved learning

Resource-Aware Thread Scheduling
- Multiple levels of thread scheduling: application threads, O/S threads, H/W threads, and the processor pipeline
[Diagram: processes feed O/S tasks into a resource-aware O/S scheduler; hardware counters and sensors (e.g., current temperature) drive a counter-based resource model and resource usage prediction, which a resource-aware extension to the kernel scheduler(s) uses to pick the next task(s); application-level scheduling sits above O/S scheduling. A scheduler sketch follows below.]
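A minimal Python sketch of the counter-driven scheduling idea in the diagram (the resource model, the per-task attributes, and the temperature threshold are all invented for illustration): predicted per-thread resource demand, derived from counters, steers which ready task the scheduler picks next.

```python
import heapq

class ResourceAwareScheduler:
    """Toy resource-aware ready queue: among runnable tasks, prefer one whose
    predicted resource demand (here, heat contribution) fits the current sensor
    reading, falling back to plain priority order otherwise."""

    def __init__(self, temp_limit=80.0):
        self.temp_limit = temp_limit
        self.ready = []                    # heap of (priority, task_id, predicted_heat)

    def add(self, task_id, priority, predicted_heat):
        heapq.heappush(self.ready, (priority, task_id, predicted_heat))

    def pick_next(self, current_temp):
        """Pop tasks in priority order, but skip 'hot' tasks while the core is
        near its temperature limit; if every task is hot, run the best anyway."""
        skipped, chosen = [], None
        while self.ready:
            prio, tid, heat = heapq.heappop(self.ready)
            if current_temp + heat <= self.temp_limit:
                chosen = (prio, tid, heat)
                break
            skipped.append((prio, tid, heat))
        if chosen is None and skipped:
            chosen = skipped.pop(0)        # nothing fits: take the highest-priority task
        for item in skipped:               # put the rest back on the ready queue
            heapq.heappush(self.ready, item)
        return chosen[1] if chosen else None

# Example: the core is already warm, so the cooler, lower-priority task runs first.
sched = ResourceAwareScheduler()
sched.add("hot_loop", priority=0, predicted_heat=15.0)
sched.add("mem_bound", priority=1, predicted_heat=2.0)
print(sched.pick_next(current_temp=70.0))   # -> 'mem_bound'
```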

Fair Cooperative Scheduling [PPoPP]
- Each process is allocated a piggy-bank of time (set to a quantum) from which it can borrow and to which it can add
- The piggy-bank is used to boost a process's priority (with the original purpose of responding to a communication request when notified by a wakeup signal)
- A process can add to the piggy-bank whenever it relinquishes the processor
- Adapt the above so the piggy-bank is used to schedule a process earlier than its priority would dictate, and replenished, for example, when reconfiguring at a phase marker
- Coordinate among schedulers for multiple hardware contexts (a code sketch of the piggy-bank mechanism appears at the end of this section)

Application-Level Scheduling
- Provide a framework for trading ILP for TLP based on application characteristics and available resources
- Specify cache and cluster sharing configurations at appropriate points
- At the JVM level: target server workloads
- At the level of an API such as OpenMP: target scientific/parallel applications

Resource-Aware Thread Scheduling: Other Applications
- Power/thermal management: temperature-aware process/thread scheduling to avoid temperature hotspots; characterize threads based on expected temperature contribution; schedule based on a thread's predicted heat contribution and the current temperature
- Performance: improve L2 bandwidth utilization on a multiprocessor (e.g., the two cores of a POWER chip); characterize threads based on expected L2 cache accesses; avoid scheduling different threads with high L2 access concurrently

Resource-Aware Thread Scheduling (cont'd)
- Performance and power: resource (memory, ..., and temperature) aware thread scheduling for simultaneous multithreaded processors (e.g., the IBM POWER and the hyper-threads of the Pentium IV) or our proposed clustered multithreaded architecture

Summary: An Integrated Hardware/Software Approach to DT-CMT
- Aggressive clustering and multithreading requires a whole-system integrated view in order to maximize resource efficiency
- Architectural configuration support (while carefully considering circuit-level issues)
- Program analysis
- Runtime/OS support

On-Going Projects at UofR
- CAP: dynamic reconfigurable general-purpose processor design
- MCD: Multiple Clock Domain Processors
- DT-CMT: Dynamically Tunable Clustered Multithreaded architectures
- InterWeave: versioned shared state (predecessors: InterAct and Cashmere)
- ARCH: Architecture, Runtime, and Compiler Integration for High-Performance Computing
- See http://www.cs.rochester.edu/research and http://www.cs.rochester.edu/~sandhya
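The piggy-bank mechanism above can be illustrated with a short Python sketch (class names, the boost rule, and the quantum value are assumptions made for this transcript, not the PPoPP design): each process owns a time balance that it spends to get scheduled ahead of its priority and refills when it yields the processor early.

```python
QUANTUM_MS = 10.0   # assumed scheduling quantum, for illustration

class Process:
    def __init__(self, pid, priority):
        self.pid = pid
        self.priority = priority           # lower number = higher priority
        self.piggy_bank = QUANTUM_MS       # time balance, initialized to one quantum

    def yield_early(self, unused_ms):
        """Relinquishing the CPU before the quantum expires adds to the piggy-bank."""
        self.piggy_bank += unused_ms

    def request_boost(self, cost_ms=QUANTUM_MS / 2):
        """Spend from the piggy-bank (e.g., at a phase-marker reconfiguration)
        to be scheduled earlier than priority alone would allow."""
        if self.piggy_bank >= cost_ms:
            self.piggy_bank -= cost_ms
            return True
        return False

def pick_next(ready, boosted_pids):
    """Boosted processes go first; everyone else falls back to priority order."""
    return min(ready, key=lambda p: (p.pid not in boosted_pids, p.priority))

# Example: B has lower priority but cashes in its piggy-bank at a phase marker.
a, b = Process("A", priority=0), Process("B", priority=5)
b.yield_early(4.0)                          # B gave back time earlier
boosted = {b.pid} if b.request_boost() else set()
print(pick_next([a, b], boosted).pid)       # -> 'B'
```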