Architectural and Implementation Tradeoffs for Multiple-Context Processors

Motivation

Latency is a serious problem for modern processors:
- wide gap between processor and memory speeds
- deeply pipelined designs
- multiprocessors

Three major forms of latency:
- memory
- instruction
- synchronization

Coping with Latency

A two-step approach to managing latency.

First, reduce latency:
- coherent caches
- locality optimizations
- pipeline bypassing

Then, tolerate the remaining latency:
- relaxed memory consistency
- prefetching
- multiple contexts

Multiple-Context Processors

Multiple-context processors address latency by:
- switching to another thread whenever one thread performs a long-latency operation
- keeping the context switch overhead low, so that thread switches can be done often

Latency Tolerance Techniques

  Technique             Parallelism       Latency tolerated
  Relaxed consistency   single-thread     memory (write)
  Prefetching           single-thread     memory (read, write)
  Multiple contexts     multiple-thread   memory, instruction, synchronization

Outline
- Multiple-Context Approaches
- Performance Results
- Implementation Issues
- Conclusions

Trends

What are the microprocessor trends?
- relaxed memory consistency
- prefetching

Why haven't multiple contexts been included?
- multiple contexts are thought to be expensive
- the performance benefits are relatively unknown
- existing multiple-context designs do not help uniprocessors

For multiple contexts to gain acceptance, all of these issues must be addressed.

HEP and TERA Approach

HEP was the first machine to use multiple contexts to hide latency.

Processor architecture:
- pipelined, but with no interlocks and no caches
- a large register file, cheap thread creation, and full/empty (F/E) bits on memory words (see the sketch below)
- cycle-by-cycle context switching

(Figure: execution timelines showing memory and instruction-dependency stalls for a single thread versus multiple threads.)
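
HEP's full/empty bits give every memory word a one-bit synchronization state: a synchronizing read of an empty word blocks the reading thread until a write fills it. Here is a minimal sketch of the idea, with Python threads standing in for hardware contexts (the class and method names are illustrative, not HEP's actual interface):

```python
import threading

class FullEmptyWord:
    """Toy model of a HEP-style memory word with a full/empty bit.

    Illustrative only: the names and the blocking behavior are a software
    simplification of HEP's synchronizing loads and stores.
    """

    def __init__(self):
        self._value = None
        self._full = False                  # the full/empty (F/E) bit
        self._cond = threading.Condition()

    def write_when_empty(self, value):
        """Store that waits for the word to be empty, then marks it full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        """Load that waits for the word to be full, then marks it empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value
```

On HEP a blocked thread simply stops issuing instructions while the other contexts keep the pipeline full; the condition variable here is only a software stand-in for that hardware behavior.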

HEP and TERA Approach (cont.)

Multiple contexts are used to hide two kinds of latency:
- pipeline latency
- memory latency

Many contexts per processor (the focus is entirely on toleration).

Drawbacks of the HEP approach:
- low single-context performance (bad for applications with limited parallelism)
- lots of contexts implied lots of hardware resources and high cost
- no caches implied high memory bandwidth requirements and high cost

Blocked Scheme

Combine multiple contexts with latency reduction (caches):
- assumes a base cache-coherent system
- assumes a pipelined processor with interlocks
- contexts are switched only at long-latency operations (see the sketch below)

(Figure: cache-miss and instruction-dependency timelines for multiple threads versus a single thread.)

(Figure: TERA system pipeline.)

Design Considerations

Issues:
- number of contexts per PE
- context switch overhead
- effect of memory latency
- cache interference effects
- when to switch contexts
- implementation issues

Advantage of the blocked scheme: a small number of contexts suffices to hide memory latency.
Disadvantage of the blocked scheme: the switch overhead is still too large to hide pipeline latency.
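
To make the blocked policy concrete, here is a toy single-processor simulation: a minimal sketch assuming exponentially distributed run lengths and a round-robin choice among ready contexts (all parameters and policies are illustrative assumptions, not the paper's simulator):

```python
import random

def simulate_blocked(num_contexts, run_length, switch_cost, miss_latency,
                     num_misses=10000):
    """Toy model of the blocked scheme: switch contexts only on a cache miss.

    A context runs for ~run_length cycles of useful work, then takes a miss;
    the processor pays switch_cost cycles to squash and refill the pipeline,
    and the missing context becomes ready again miss_latency cycles after
    the miss was issued.  Returns processor utilization.
    """
    ready_at = [0] * num_contexts       # cycle at which each context can run
    cycle, busy, current = 0, 0, 0
    for _ in range(num_misses):
        # Pick the context that becomes ready soonest (round-robin on ties).
        nxt = min(range(num_contexts),
                  key=lambda c: (ready_at[c], (c - current) % num_contexts))
        cycle = max(cycle, ready_at[nxt])        # idle if nothing is ready
        run = max(1, round(random.expovariate(1.0 / run_length)))
        cycle += run                             # useful work until the miss
        busy += run
        ready_at[nxt] = cycle + miss_latency     # miss is now outstanding
        cycle += switch_cost                     # squash + refetch overhead
        current = nxt
    return busy / cycle

# Example: 4 contexts, run length 20, switch cost 5, memory latency 50.
print(f"utilization ~ {simulate_blocked(4, 20, 5, 50):.2f}")
```

Utilization climbs with the number of contexts until some context is almost always ready, after which the switch cost sets the ceiling; the analytical model later in the talk captures the same behavior.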

Context Switch Cost

(Figure: IF RF EX DF WB pipeline, with the cache miss detected at the DF stage.)

- The cache miss is detected late in the pipeline.
- Partially executed instructions must be squashed.
- Fetching then starts from the next context.

Interleaved Scheme

Assumptions:
- coherent caches
- parallelism is available, but not necessarily abundant

Properties:
- full single-thread support
- cycle-by-cycle interleaving lowers the switch cost, giving instruction latency tolerance
- combines the best of the HEP-like and blocked approaches

(Figure: interleaved-scheme pipeline timelines, showing cache-miss and instruction-dependency handling for multiple threads versus a single thread.)

Switch Cost, Latency, and Number of Contexts

Given number of contexts k, average run length R, switch cost C, and average latency L:

(Figure: processor utilization versus number of contexts, rising linearly and then saturating.)

- In the linear region, processor utilization = kR / (R + C + L).
- The knee of the curve is close to k = 1 + L / (R + C).
- In the saturation region, maximum processor utilization = R / (R + C); efficiency increases only marginally with more contexts.
- It is important to keep C small if R is going to be low (see the worked example below).

Cost model: Cost(n) = $shared + n × $inc + $ovhd.
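
Plugging numbers into this model shows where the knee falls and how the switch cost caps utilization; a small worked example in Python (the parameter values are invented for illustration):

```python
def utilization(k, R, L, C):
    """Analytical model: k contexts, run length R, latency L, switch cost C.

    Linear region: the processor runs out of ready contexts while misses
    are outstanding, so utilization grows with k.  Saturation region: some
    context is always ready, and only the switch cost C limits utilization.
    """
    linear = k * R / (R + L + C)
    saturation = R / (R + C)
    return min(linear, saturation)

# Example: run length R = 20 cycles, latency L = 50, switch cost C = 5.
for k in (1, 2, 3, 4, 8):
    print(f"k={k}: utilization = {utilization(k, 20, 50, 5):.2f}")
# Knee near k = 1 + L/(R + C) = 3; beyond it extra contexts add little,
# and the ceiling R/(R + C) = 0.8 is set entirely by the switch cost.
```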

Blocked vs. Interleaved

  Type of latency       Blocked scheme   Interleaved scheme
  Memory                context switch   make context unavailable
  Synchronization       context switch   make context unavailable
  Instruction (long)    context switch   make context unavailable
  Instruction (short)   stall!           rely on context interleaving

(A sketch of the interleaved issue decision follows the methodology below.)

Outline
- Current Multiple-Context Approaches
- Performance Results
- Implementation Issues
- Conclusions

(Figure: blocked versus interleaved IF RF EX DF WB pipeline timelines across several threads.)

Methodology

Shared-memory multiprocessor:

(Figure: processors, each with a cache, connected through a network to memory and directory modules.)

- multiple contexts per processor
- pipeline based on a MIPS R-series processor (pipelined floating point)
- ideal instruction cache; fixed-size data cache
- a range of memory latencies
- event-driven simulation of optimized code, scheduled for the pipeline
- parallel application suite (SPLASH)
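
The "make context unavailable" entries in the table above can be sketched as a per-cycle issue decision: a long-latency event marks its context unavailable, and every cycle the processor simply picks the next available context, so no explicit switch cost is paid. A toy Python model (the round-robin policy and data structures are illustrative assumptions):

```python
def interleaved_issue(avail_at, cycle, last):
    """Choose which context issues this cycle (interleaved scheme).

    avail_at: dict mapping context id -> cycle at which it is next
    available; a context with an outstanding miss is 'unavailable' until
    its data returns, but no explicit switch cost is charged.
    """
    n = len(avail_at)
    for step in range(1, n + 1):          # round-robin starting after `last`
        cid = (last + step) % n
        if avail_at[cid] <= cycle:        # context available this cycle?
            return cid
    return None                           # every context is waiting: bubble

# Example: context 1 misses at cycle 0 and is unavailable until cycle 40.
avail_at = {0: 0, 1: 40, 2: 0, 3: 0}
last, order = 3, []
for cycle in range(6):
    cid = interleaved_issue(avail_at, cycle, last)
    order.append(cid)
    if cid is not None:
        last = cid
print(order)  # [0, 2, 3, 0, 2, 3]: contexts 0, 2, 3 interleave while 1 waits
```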

MP3D: Memory Latency

- Both schemes are effective at tolerating memory latency.
- The interleaved scheme has lower context switch overhead.

LocusRoute: Limited Latency

Simulation Results

(Figure: speedup due to multiple contexts for Cholesky, PTHOR, Ocean, LocusRoute, Water, Barnes, and MP3D.)

Water: Instruction Latency

  Floating-point operation   Latency   Pipelined?
  Division                   …         no
  Add, convert, multiply     …         yes

Benefits from multiple contexts occur when:
- latency is a problem
- the application has additional parallelism

Performance Summary

- Multiple contexts work well when extra parallelism is available.
- The interleaved scheme has a performance advantage: its mean speedup (…9) exceeds that of the blocked scheme across the application suite.

Uniprocessor Issues

A more difficult environment for multiple contexts, whether running a parallel application or multiple independent applications:
- greater cache interference
- shorter latencies need to be tolerated

Uniprocessor Study

Several workloads composed of SPLASH and SPEC applications:
- three random mixes (R-R)
- two that stress the data cache (D-D)
- one that stresses the instruction cache (I)
- one that is floating-point intensive (F)

Simulation parameters:

  Processor           R-series, multiple contexts
  Instruction cache   …K, …-cycle access
  Data cache          …K, …-cycle access
  Secondary cache     …M, 9-cycle access
  Main memory         … cycles

(Figure: multiple-context throughput improvement for each uniprocessor workload: the R-R, D-D, I, and F mixes.)

Uniprocessor Summary

- The blocked scheme is unable to address uniprocessor needs.
- The interleaved scheme improves uniprocessor throughput: a mean improvement of …% for our workloads.

Implementation Study

- Only a single implementation data point exists for the blocked scheme (MIT APRIL).
- We explored implementation issues for both schemes, in enough detail to expose the major issues:
  - RTL diagrams
  - transistor-level design, layout, and SPICE simulation

Outline
- Current Multiple-Context Approaches
- Performance Results
- Implementation Issues
- Conclusions

Basic Implementation Needs

- A cache capable of handling multiple outstanding requests (lockup-free); see the sketch below.
- Replication of key hardware state.
- Context scheduling logic.
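
The lockup-free cache is the piece that lets one context's miss stay outstanding while other contexts run. A minimal sketch, assuming a simple MSHR (miss status holding register) table keyed by block address (all structure and field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class MSHR:
    """Miss Status Holding Register: tracks one outstanding cache miss."""
    block_addr: int
    waiting_contexts: list = field(default_factory=list)

class LockupFreeCache:
    """Toy lockup-free cache: a miss is recorded in an MSHR, not stalled on."""

    def __init__(self, num_mshrs=4, block_size=16):
        self.lines = set()          # block addresses currently resident
        self.mshrs = {}             # block address -> MSHR
        self.num_mshrs = num_mshrs
        self.block_size = block_size

    def access(self, addr, context_id):
        baddr = addr // self.block_size
        if baddr in self.lines:
            return "hit"
        if baddr in self.mshrs:     # same block already being fetched: merge
            self.mshrs[baddr].waiting_contexts.append(context_id)
            return "miss-merged"
        if len(self.mshrs) == self.num_mshrs:
            return "stall"          # structural hazard: every MSHR is busy
        self.mshrs[baddr] = MSHR(baddr, [context_id])
        return "miss-issued"        # request sent to memory; context sleeps

    def fill(self, baddr):
        """Memory reply: install the block, return the contexts to wake."""
        self.lines.add(baddr)
        return self.mshrs.pop(baddr).waiting_contexts
```

A blocked or interleaved processor would call access() on behalf of the issuing context, put that context to sleep on a miss, and wake the contexts returned by fill() when the memory reply arrives.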

Lockup-Free Cache

- Required for all latency tolerance schemes.

State Replication

- Blocked scheme: a single active context.
  - a single set of active state
  - a master copy plus backup copies
  - the master and a backup are swapped during a context switch
- Interleaved scheme: all contexts active.
  - all sets of state are active
  - the state being used changes each cycle (see the sketch below)

State to replicate:
- register file
- program counter
- related process status word

State Replication Example

Optimizing the four-context register file (relative to a single-context register file):
- one design point: …× the single-context area, with a …% longer access time
- another design point: …× the single-context area, with a …% longer access time
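
One way to picture the replicated state is as per-context copies selected by a context identifier (CID): the blocked scheme changes which copy is active only on a switch, while the interleaved scheme indexes by the CID of the instruction in each pipeline stage. A toy sketch (names and structure are illustrative):

```python
class ReplicatedState:
    """Per-context copies of PC, status word, and registers, selected by CID."""

    def __init__(self, num_contexts, num_regs=32):
        self.pc = [0] * num_contexts
        self.psw = [0] * num_contexts
        self.regs = [[0] * num_regs for _ in range(num_contexts)]

    def read_reg(self, cid, r):
        # Interleaved scheme: cid can differ every cycle, since it travels
        # down the pipeline with each instruction.  Blocked scheme: cid is
        # a global value that changes only during a context switch.
        return self.regs[cid][r]

    def write_reg(self, cid, r, value):
        self.regs[cid][r] = value
```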

Context Control

Blocked scheme:
- provide a context switch signal
- a global context identifier (CID)
- the CID and active state are changed during the switch cycles

Interleaved scheme:
- provide a selective squash signal
- a CID is associated with each instruction (a CID chain)
- the CID becomes another pipeline control signal
- which state is used depends on the CID value

Implementation Summary

- There is more implementation flexibility for the blocked scheme:
  - most of the time it looks like a single-context processor
  - context control is simpler
- Implementation cost and complexity are manageable for both schemes:
  - small area overhead (e.g., the register file is …% of the die)
  - the extra delays are not on the critical path

Concluding Remarks

- Multiple contexts work well when combined with caches, and are better at handling unstructured programs.
- The interleaved multiple-context architecture offers better multiprocessor performance than the blocked approach, and can improve uniprocessor throughput.
- Implementation of both the blocked and interleaved multiple-context architectures is manageable:
  - more flexibility in implementing the blocked scheme
  - the increase in area is small for both schemes and should not impact cycle time