Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun


Data Speculation Support for a Chip Multiprocessor
Lance Hammond, Mark Willey, and Kunle Olukotun
Computer Systems Laboratory, Stanford University
http://www-hydra.stanford.edu

A Chip Multiprocessor (MOTIVATION)
- Implementation benefits:
  - High-speed signals localized within CPUs
  - Simple design replicated over the die
  - (See ASPLOS-VII and IEEE Computer, 9/97)
- Great when parallel threads are available
- Problem: lack of parallel software!

Parallel Software (MOTIVATION)
- Traditional auto-parallelization is limited
  - Works for dense-matrix Fortran applications
- Many applications are only hand-parallelizable
  - Parallelism exists in the algorithm
  - One can't always statically verify parallelism
  - Pointer disambiguation is a major problem!
- Some applications are just not parallelizable
  - True data dependencies may be present

Data Speculation (MOTIVATION)
- Previous work: Knight, Multiscalar, TLDS
- Eases compiler parallelization of loops
  - Hardware can protect against pointer aliasing
  - Synchronization isn't required for correctness
  - Parallelization of loops can easily be automated!
- Allows more ways to parallelize code
  - HW can break up code into arbitrary threads
  - One possibility: speculative subroutines
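Thread-level speculation targets exactly the case a static compiler must give up on: loop iterations whose independence depends on runtime pointer or index values. A minimal sketch (the function and data are hypothetical, not from the slides):

```python
# Whether these iterations are independent depends on runtime data,
# so a static compiler must conservatively serialize the loop.  TLS
# hardware runs the iterations in parallel anyway and catches the
# rare actual conflicts as they happen.
def histogram_update(hist, idx, val):
    for i in range(len(idx)):      # iteration i writes hist[idx[i]]
        hist[idx[i]] += val[i]     # idx[i] == idx[j]?  Unknown statically.

hist = [0, 0, 0, 0]
histogram_update(hist, idx=[0, 1, 1, 3], val=[5, 2, 3, 7])
# Iterations 1 and 2 really do collide on hist[1]; TLS detects that
# one dependence at run time instead of serializing every iteration.
print(hist)  # -> [5, 5, 0, 7]
```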

Hydra CMP (OUR SPECULATION DESIGN)
[Block diagram: four CPUs, each with an L1 instruction cache, an L1 data cache, and a memory controller, connected through centralized bus arbitration to a 64-bit write-through bus and a 256-bit read/replace bus, the on-chip L2 cache, a Rambus interface to DRAM main memory, and an I/O bus interface to I/O devices.]
- 4 processors and secondary cache on a chip
- 2 buses connect processors and memory
- Coherence: writes are broadcast on the write bus

Speculative Memory I (OUR SPECULATION DESIGN)
[Diagram: an original sequential loop vs. a speculatively parallelized one. With iterations i and i+1 running in parallel, a write to X in iteration i is FORWARDED to a later read of X in iteration i+1; a read of X in iteration i+1 that occurs before the write in iteration i is a VIOLATION.]
1. Forward data between threads
2. Detect violations

Speculative Memory II (OUR SPECULATION DESIGN)
[Diagram: writes after violations vs. writes after successful iterations. When iteration i+1 is violated, its buffered writes are trashed; when iterations succeed, their writes are retired to permanent state in iteration order.]
3. Safely back up after violations
4. Retire speculative state in the correct order
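The four requirements on these two slides can be sketched in software. The toy model below (hypothetical names; a drastic simplification of what Hydra does in hardware) keeps a per-thread write buffer and read set: reads are forwarded from the nearest earlier writer, a write to an address a later thread has already read triggers a violation, violated threads discard their buffers, and surviving buffers retire to memory strictly in thread order.

```python
class SpecThread:
    def __init__(self, tid):
        self.tid = tid
        self.writes = {}      # speculative write buffer: addr -> value
        self.reads = set()    # addresses read (the "read bits")
        self.violated = False

class SpecMemory:
    def __init__(self, memory):
        self.memory = memory  # committed (permanent) state
        self.threads = []     # ordered: earlier threads first

    def spawn(self):
        t = SpecThread(len(self.threads))
        self.threads.append(t)
        return t

    def read(self, t, addr):
        t.reads.add(addr)     # mark the read for violation detection
        # 1. Forwarding: the nearest earlier (or own) buffered write wins.
        for earlier in reversed(self.threads[:t.tid + 1]):
            if addr in earlier.writes:
                return earlier.writes[addr]
        return self.memory.get(addr, 0)

    def write(self, t, addr, value):
        t.writes[addr] = value  # buffered, not yet permanent
        # 2. Violation detection: a later thread already read stale data.
        for later in self.threads[t.tid + 1:]:
            if addr in later.reads:
                later.violated = True

    def commit_all(self):
        # 3./4. Backup and in-order retirement: violated threads discard
        # their buffers (and would restart); the rest drain their buffers
        # into permanent state in thread order.
        for t in self.threads:
            if t.violated:
                t.writes.clear()
            else:
                self.memory.update(t.writes)

mem = SpecMemory({'X': 1})
t0, t1 = mem.spawn(), mem.spawn()
stale = mem.read(t1, 'X')   # t1 reads X before t0 writes it
mem.write(t0, 'X', 42)      # out-of-order write: t1 is now violated
assert t1.violated
```

After the violating write, a fresh read by t1 is forwarded the value 42 from t0's buffer, and `commit_all()` retires only t0's writes.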

Hydra Speculation Support (OUR SPECULATION DESIGN)
[Block diagram: the Hydra CMP extended with speculation hardware — a CP2 coprocessor per CPU, speculation bits added to each L1 data cache, and per-CPU speculation write buffers (#0-#3) in front of the on-chip L2 cache that retire in order.]
1. Write bus & L2 buffers provide forwarding
2. Read L1 tag bits detect violations
3. Dirty L1 bits & L2 buffers allow backup
4. L2 buffers reorder & retire speculative state

Speculative Threads (OUR SPECULATION DESIGN)
- Try to run post-subroutine code speculatively, in parallel with subroutines
  - Requires return-value prediction
- Try to execute loop iterations in parallel
- Explicit synchronization is still allowed during speculation
  - Only needed if it helps performance!
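Subroutine speculation can be illustrated with a toy sequential model (hypothetical code, not the Hydra runtime): the code after a call runs speculatively against a predicted return value, and its results are kept only if the prediction turns out to be right.

```python
def call_speculatively(subroutine, continuation, predict):
    """Toy model of speculative subroutine execution.

    On Hydra the continuation would run on another CPU in parallel
    with the subroutine; here we only model the commit/squash decision.
    """
    predicted = predict()                  # return-value prediction
    spec_result = continuation(predicted)  # post-subroutine code, speculative
    actual = subroutine()                  # the real call completes
    if actual == predicted:
        return spec_result                 # prediction held: commit
    return continuation(actual)            # mispredicted: squash, re-execute

# Example: a lookup that almost always returns 0, so 0 is a good prediction.
def lookup():
    return 0

result = call_speculatively(lookup, lambda rv: rv + 10, predict=lambda: 0)
print(result)  # -> 10, with the speculative continuation's work kept
```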

Speculation Control (OUR SPECULATION DESIGN)
- Software control simplifies implementation
  - Hand-coded assembly for speed
  - Initiated with quick, vectored exceptions
- Essential control functions
  - Control the L1 cache with coprocessor instructions
  - Send messages to other processors & the L2 with stores
  - Pass registers through memory
  - Control speculation and value-prediction logic
  - Manage threads (a small runtime system)
- Overheads
  - Subroutine control: 70-100 instructions at start and end
  - Loop control: 70 instructions at start and end, 16 instructions per iteration
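These overhead figures imply a simple back-of-envelope break-even model (an assumption of ours, not from the slides): with roughly 70 startup instructions, 16 overhead instructions per iteration, and idealized 4-way parallelism, a speculated loop only wins once its iterations carry enough real work.

```python
def loop_speedup(body, iters, cpus=4, startup=70, per_iter=16):
    """Idealized speedup of a speculated loop: ignores violations,
    load imbalance, and memory delays, and only charges the software
    control overheads quoted on the slide (instruction counts)."""
    sequential = body * iters
    parallel = startup + (body + per_iter) * iters / cpus
    return sequential / parallel

# A 250-instruction body (the small end of the ijpeg loops) wins big
# in this ideal model; a 5-instruction body loses to its own overhead.
print(loop_speedup(body=250, iters=100))  # ~3.7
print(loop_speedup(body=5, iters=100))    # ~0.84
```

Measured speedups (wc at 0.62, ijpeg at 1.51) sit well below this ideal because the model deliberately omits violations, load imbalance, and interprocessor communication.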

Simulation Methodology (SIMULATION RESULTS)
- A simple, automatic loop pre-translator
  - Used only for speculative code
- Compiled with a commercial compiler at -O2 optimization
- 4 single-issue pipelined MIPS processors
- Fully simulated memory system
  - 1-cycle instruction and data caches
  - 5+ cycle on-chip secondary cache

vortex Results (SIMULATION RESULTS)
Speedup: 0.58
[Bar chart: utilization (%) of CPUs 0-3, split into idle, overhead, wait/discard, run/discard, wait/used, and run/used time.]
- Subroutine speculation is difficult
  - Lots of overhead in the control routines
  - Load imbalance limits parallelism
  - Many subroutines are poor speculation targets

wc Results (SIMULATION RESULTS)
Speedup: 0.62 (0.66 with extra delay)
[Bar chart: utilization (%) of CPUs 0-3, split into idle, overhead, wait/discard, run/discard, wait/used, and run/used time.]
- Overheads can be significant (small regions)
  - Software control handlers
  - Additional load/store instructions needed
  - Interprocessor communication delays

m88ksim Results (SIMULATION RESULTS)
Speedup: 1.04
[Bar chart: utilization (%) of CPUs 0-3, split into idle, overhead, wait/discard, run/discard, wait/used, and run/used time.]
- Dependencies are a problem (large regions)
  - One dependency can force serialization

compress Results (SIMULATION RESULTS)
Speedup: 1.00 (1.09 with explicit synchronization)
[Bar chart: utilization (%) of CPUs 0-3, split into idle, overhead, wait/discard, run/discard, wait/used, and run/used time.]
- Explicit synchronization can help
  - Key dependencies can be protected
  - But most synchronization may be omitted

ijpeg Results (SIMULATION RESULTS)
Speedup: 1.51
[Bar chart: utilization (%) of CPUs 0-3, split into idle, overhead, wait/discard, run/discard, wait/used, and run/used time.]
- Good speedup can occur on some loops
  - Reasonable size (250-2,500 instructions)
  - Limited dependencies = parallelism exists

L2 Data Buffering (SIMULATION RESULTS)
- Small buffers are sufficient
  - We used a fully associative line buffer
  - 1-2 KB per thread captures most writes
[Graph: fraction of speculative threads (%) vs. number of 64 B cache lines written (0 to >32), for wc, ijpeg, vortex, compress, and m88ksim, with 0 KB, 1 KB, and 2 KB buffer sizes marked.]

Conclusions (CONCLUSIONS)
- Reasonable cost/performance
  - Small gain, but small investment
  - Allows extraction of some parallelism that compilers can't normally find
  - Just turn it off if limited by dependencies or grain size
- Normal MP performance is not impacted
  - Multiprogrammed and explicitly parallelized applications still get more speedup, when available
  - Flexible: can freely mix speculative and MP threads

Future Work (CONCLUSIONS)
- Hardware modifications
  - Updated data cache protocol
  - Special instructions to lower overhead
  - Hardware thread control
- Compiler analysis of routines
  - Pruning of poor routines at compile time
  - Profile-directed compilation
- Compiler control of variable access (TLDS)
  - Move loads from shared variables as late as possible
  - Move stores to shared variables as early as possible
  - Add synchronization where useful