Reconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj


Reconfigurable and Self-optimizing Multicore Architectures Presented by: Naveen Sundarraj 1 11/9/2012

OUTLINE
- Introduction
- Motivation
- Reconfiguration
- Performance evaluation (reconfiguration)
- Self-optimization
- Performance evaluation (self-optimization)
- Applications of RL in computer systems
- Conclusion

Motivation
- Transistor counts double every two years (Moore's Law).
- Chip multiprocessors (CMPs) are an attractive alternative to monolithic processors for translating transistor budgets into performance improvements.
- CMPs have performance limitations: software overhead limits exploiting the full potential of these chips.
- Software must expose exponentially increasing levels of thread-level parallelism (TLP).

Introduction
To meet the challenges created by the adoption and scaling of multicore architectures, we explore versatile CMP architectures. Solution:
- A reconfigurable CMP substrate that can accommodate software at different stages of parallelization by allowing the granularity of the architecture to be changed at runtime.
- A self-optimizing memory controller that learns to optimize its scheduling policy on the fly and adapts to changing memory reference streams and workload demands via runtime interaction with the system.

Reconfiguration
Reconfiguration is achieved through a novel mechanism called core fusion: an architectural technique that empowers groups of relatively small, independent CMP cores with the ability to fuse into one large CPU on demand. Benefits:
- Support for software diversity.
- Support for smoother software evolution.
- Single-design solution.
- Optimized for parallel code.
- Design-bug and hard-fault resilience.

Core Fusion - Design Challenges
- Increase in software complexity.
- Restructuring the base cores.
- Effective dynamic reconfiguration.
Hardware solutions:
- Reconfigurable, distributed front-end and i-cache.
- Effective remote wake-up mechanism.
- Reconfigurable, distributed load/store queue and d-cache.
- Reconfigurable, distributed ROB organization.

Core Fusion - Architecture
A bus connects the L1 i- and d-caches and provides data coherence. The on-chip memory controller resides on the other side of the bus. Cores can execute independently if desired, and it is also possible to fuse groups of two or four cores into larger cores.

Modifications to achieve core fusion - Front end
- Fetch mechanism and instruction cache
- Branch prediction
- Return address stack
- Global history registers
- Handling fetch stalls
- Collective decode/rename

Fetch Mechanism and Instruction Cache
Collective fetch: a small coordinating unit called the Fetch Management Unit (FMU) facilitates collective fetch.
Fetch mechanism:
- Each core fetches two instructions from its own i-cache every cycle, for a total of eight instructions.
- On an i-cache miss, an eight-word block is (a) delivered to the requesting core if it is operating independently, or (b) distributed across all four cores in a fused configuration to permit collective fetch.
To support this mechanism, the i-caches are made reconfigurable.

Reconfigurable i-cache
Each i-cache has enough tags to organize data in two-word sub-blocks. When running independently, four such sub-blocks and one tag make up a cache block. When fused, cache blocks span all four i-caches, with each i-cache holding one sub-block and a replica of the cache block's tag.
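
The fused-mode striping described above can be sketched as a simple word-to-core mapping. This is a minimal illustration: the two-word sub-block size comes from the slide, but the interleaving order is an assumption, not necessarily the paper's exact layout.

```python
WORDS_PER_SUBBLOCK = 2  # two-word sub-blocks, as on the slide
NUM_CORES = 4           # four fused cores

def home_core(word_index: int) -> int:
    # In fused mode, an eight-word cache block is striped across the
    # four i-caches, one two-word sub-block per core.
    return (word_index // WORDS_PER_SUBBLOCK) % NUM_CORES

# Words 0-1 land on core 0, words 2-3 on core 1, and so on.
striping = [home_core(w) for w in range(8)]
```

Each core therefore holds exactly one quarter of every fused-mode cache block, which is what lets the four cores fetch a full block collectively in one cycle.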

Branch and subroutine-call prediction
Each core accesses its own branch predictor and BTB. The branch predictor and BTB are indexed to maximize utilization while retaining simplicity; the indexing scheme incurs no loss in prediction accuracy.

Branch prediction mechanism
Each cycle, every core that predicts a taken branch or detects a branch misprediction sends the new target PC to the FMU. The FMU selects the correct PC, giving highest priority to the oldest misprediction-redirect PC and lowest priority to the youngest branch-prediction PC. On a misprediction, misspeculated instructions are squashed in all cores.
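
The FMU's arbitration rule can be sketched as follows. The message encoding (age, kind, target PC) is hypothetical; the slide only specifies the priority order, with mispredictions outranking predictions and older messages outranking younger ones.

```python
def fmu_select_pc(messages):
    """Pick the next fetch PC from per-core redirect messages.

    messages: list of (age, kind, target_pc) tuples, where a smaller
    age means an older instruction and kind is 'mispredict' or
    'predict_taken'.
    """
    mispredicts = [m for m in messages if m[1] == 'mispredict']
    if mispredicts:
        return min(mispredicts)[2]   # oldest misprediction redirect wins
    predictions = [m for m in messages if m[1] == 'predict_taken']
    if predictions:
        return min(predictions)[2]   # else oldest taken-branch prediction
    return None                      # no redirect this cycle
```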

Branch prediction mechanism - example
Core 2 predicts branch B to be taken. After two cycles, all cores receive this prediction, squash overfetched instructions, and adjust their PCs.

Global History Register (GHR)
Independent, uncoordinated history registers on each core may make it impossible for the branch predictor to learn cross-core branch correlations. Solution: the GHR is replicated across all cores, and updates are coordinated through the FMU.

Return Address Stack (RAS)
The target PC of a subroutine call is sent to all cores by the FMU, and core zero pushes the return address onto its RAS. When a return instruction is encountered and communicated to the FMU, core zero pops its RAS and communicates the return address back through the FMU.
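
The core-zero RAS behaves as an ordinary LIFO stack; only the push/pop events travel through the FMU. A toy model, with the FMU message round-trips abstracted away:

```python
class CoreZeroRAS:
    """Core zero's return-address stack in fused mode (toy sketch)."""

    def __init__(self):
        self._stack = []

    def on_call(self, return_pc: int) -> None:
        # The FMU broadcasts the call; core zero pushes the return
        # address onto its RAS.
        self._stack.append(return_pc)

    def on_return(self) -> int:
        # Core zero pops its RAS and sends the target back via the FMU.
        return self._stack.pop()
```

Nested calls unwind in reverse order, exactly as a per-core RAS would in independent mode.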

Handling fetch stalls
To preserve correct fetch alignment, all fetch engines must stall when a fetch stall is encountered by any one core. To accomplish this, cores communicate stalls to the FMU, which in turn informs the other cores. Once all cores have been informed, they all discard any overfetched instructions at the same time, and fetching resumes in sync from the correct PC.

Collective Decode/Rename
After fetch, each core pre-decodes its instructions independently. A Steering Management Unit (SMU) renames all instructions in the fetch group. The SMU contains a global steering table that tracks the mapping of architectural registers to cores.

Back-end modifications to achieve core fusion
- Wake-up and selection
- Reorder buffer and commit support
- Load/store queue organization

Wake-up and selection
To support operand communication, a copy-out queue and a copy-in queue are added to each core. When copy instructions reach the consumer core, they are placed in a FIFO copy-in queue. Every cycle, the scheduler considers the two copy instructions at the head of this queue, along with the instructions in the conventional issue queue. Once issued, copies wake up their dependent instructions and update the physical register file.

Reorder buffer and commit support
Example: ROB 1's head instruction pair is not ready to commit, and this is communicated to the other ROBs. The pre-commit and conventional heads are spaced so that the message arrives just in time. Upon completion of ROB 1's head instruction pair, a similar message is propagated, again arriving just in time to retire all four head instruction pairs in sync.

Load/Store queue organization
In fused mode, a banked-by-address load/store queue (LSQ) implementation is adopted. This keeps data coherent without requiring cache flushes, and it supports store forwarding and speculative loads. For loads, if a bank misprediction is detected, the load queue entry is recycled and the load is sent to the correct bank.
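
Banking by address means the effective address alone decides which LSQ bank owns an access. A minimal sketch; the bank count and line size are assumptions for illustration:

```python
NUM_BANKS = 4    # assumption: one LSQ bank per fused core
LINE_BYTES = 32  # assumption: cache-line granularity of banking

def lsq_bank(effective_address: int) -> int:
    # The line address, not the raw byte address, selects the bank, so
    # every access to a given line meets in the same LSQ bank, which is
    # what makes store forwarding work without cache flushes.
    return (effective_address // LINE_BYTES) % NUM_BANKS
```

If the bank guessed at dispatch differs from `lsq_bank(addr)` once the address resolves, that is the bank misprediction case above: the entry is recycled and the load reissued to the correct bank.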

Dynamic Reconfiguration
CMP support for dynamic reconfiguration in response to software changes (e.g., dynamic multiprogrammed environments, or serial/parallel regions in a partially parallelized application) can greatly improve versatility, and thus performance. FUSE and SPLIT ISA instructions are used:
- FUSE: the application requests that cores be fused to execute a sequential region after executing a parallel region.
- SPLIT: in-flight instructions are allowed to drain and enough copy instructions are generated before the cores split apart.
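
The intended usage pattern can be sketched as a toy runtime model. Only the two ISA operations (FUSE, SPLIT) come from the slide; the class, method names, and state tracking are illustrative assumptions.

```python
class ToyCMP:
    """Toy model of FUSE/SPLIT reconfiguration requests."""

    def __init__(self, cores: int = 4):
        self.cores = cores
        self.fused = False

    def fuse(self) -> None:
        # Entering a sequential region: request that the cores fuse
        # into one large virtual core.
        self.fused = True

    def split(self) -> None:
        # Entering a parallel region: let in-flight instructions drain
        # and generate the needed copy instructions, then split back
        # into independent small cores.
        self.fused = False

cmp_ = ToyCMP()
cmp_.fuse()   # sequential region -> one large fused core
cmp_.split()  # parallel region -> four independent cores
```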

Performance Evaluation
Simulation is done on parallel, evolving-parallel, and sequential workloads.

Performance Analysis

Parallel application performance

Why self-optimization?
Efficient utilization of off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance CMP platforms. Conventional memory controllers deliver relatively low performance because they often employ fixed, rigid access-scheduling policies designed for average-case application behavior. As a result, they cannot learn and optimize the long-term performance impact of their scheduling decisions, and cannot adapt their scheduling policies to dynamic workload behavior.

Reinforcement Learning (RL)
Reinforcement learning is a field of machine learning that studies how autonomous agents situated in a stochastic environment can learn optimal control policies through interaction with their environment. RL provides a general framework for high-performance, self-optimizing memory controller design: the memory controller is designed as an RL agent whose goal is to automatically learn an optimal memory scheduling policy via interaction with the rest of the system.
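
The learning rule behind such an agent is the standard one-step Q-learning update; the learning rate and discount values below are illustrative, not from the slides:

```python
def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.95):
    # Textbook Q-learning update:
    #   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    # Q is a dict keyed by (state, action); unseen pairs default to 0.
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```

The discount factor gamma is what lets the controller weigh long-term consequences of a scheduling decision rather than only its immediate reward.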

Advantages of an RL-based memory controller
An RL-based memory controller:
- Takes parts of the system state as input and considers the long-term performance impact of each action it can take.
- Anticipates the long-term consequences of its scheduling decisions, and continuously optimizes its scheduling policy based on this anticipation.
- Utilizes experience learned in previous system states to make good scheduling decisions in new, previously unobserved states.
- Adapts to dynamically changing workload demands and memory reference streams.

RL-based DRAM schedulers
- Each DRAM cycle, the scheduler examines the valid transaction-queue entries.
- The scheduler maximizes DRAM utilization by choosing the command with the highest expected long-term performance benefit.
- The scheduler first derives a state-action pair for each candidate command under the current system state and uses this information to calculate the corresponding Q-values.
- The scheduler implements its control policy by issuing the command with the highest Q-value each DRAM cycle.

Performance Evaluation
Performance comparison of in-order, FR-FCFS, RL-based, and optimistic memory controllers.

DRAM bandwidth utilization evaluation
Comparison of the DRAM bandwidth utilization of in-order, FR-FCFS, RL-based, and optimistic controllers.

Applications of RL in computer systems
- Autonomic resource allocation decisions in data centers.
- Autonomous navigation and flight, helicopter control.
- Dynamic channel assignment in cellular networks.
- Processor and memory allocation in data centers.
- Routing in ad-hoc networks.

Performance Review
For a 4-core CMP with a single-channel DDR2-800 memory subsystem (6.4 GB/s peak bandwidth), the RL-based memory controller improves the performance of a set of parallel applications by 19% and DRAM bandwidth utilization by 22% over a state-of-the-art FR-FCFS scheduler. For a dual-channel subsystem, the RL-based scheduler delivers an additional 14% performance improvement. This reduces the performance gap between the single-channel configuration and a dual-channel DDR2-800 subsystem with twice the peak bandwidth.

Conclusions
- Core fusion allows relatively simple CMP cores to dynamically fuse into larger, more powerful processors. It accommodates software diversity gracefully and dynamically adapts to changing workload demands.
- Core fusion adopts complexity-effective solutions for fetch, rename, execution, cache access, and commit.
- The RL-based, self-optimizing memory controller continuously and automatically adapts its DRAM scheduling policy based on its interaction with the system to optimize performance.
- The RL-based, self-optimizing memory controller efficiently utilizes the DRAM bandwidth available in a CMP.

Questions?

Thank you!