Using Software Logging To Support Multi-Version Buffering in Thread-Level Speculation


Using Software Logging To Support Multi-Version Buffering in Thread-Level Speculation
María Jesús Garzarán, M. Prvulovic, V. Viñals, J. M. Llabería*, L. Rauchwerger ψ, and J. Torrellas
U. of Illinois at Urbana-Champaign; U. of Zaragoza, Spain; *U.P.C., Spain; ψ Texas A&M University

Thread-Level Speculation (TLS)
- Execute potentially-dependent tasks in parallel
- Assume no cross-task dependence will be violated
- Track memory accesses; buffer unsafe state
- Detect any dependence violation
- Squash offending tasks, repair polluted state, restart tasks

Example:
    for (i=0; i<n; i++) {
        ... = A[B[i]];
        A[C[i]] = ...;
    }
Task J:   = A[4];  A[5] =
Task J+1: = A[2] (RAW);  A[2] =
Task J+2: = A[5];  A[6] =

State Buffering in TLS
Speculative tasks generate speculative memory state.
This speculative state must be buffered and managed.

Approaches to Buffering [HPCA-03]
Architectural Main Memory (AMM):
- Speculative task state kept in caches
- Main memory only keeps safe program state
- Cached state merges with memory when a task commits
Future Main Memory (FMM):
- Speculative task state can merge with memory at any time
- Main memory keeps the future (unsafe) state
- Previous state is saved into an Undo Log

Our Contributions
- A software-only design for the undo-log system in FMM
- Simplifies the hardware implementation
- Very cost-effective approach: on average, it introduces only a 10% slow-down vs. a hardware-only design

Roadmap
- Taxonomy of Buffering: AMM vs FMM
- Hardware Support
- Software Support
- Evaluation
- Conclusions

Task Execution under TLS
[Figure: timeline of tasks 1-5 on processors 1-3; tasks are speculative or non-speculative, and the commit token passes from task to task as each commits]

Architectural Main Memory [HPCA03]
[Figure: speculative tasks 3-5 buffered in caches; main memory holds the non-speculative state]
Main memory keeps the architectural (safe) state.
Caches keep the speculative state.

Future Main Memory [HPCA03]
[Figure: tasks 3-6 with per-task logs; main memory holds the future state]
Main memory keeps the future state.
Logs keep the previous state.

Future Main Memory
Example: task i writes 2 to 0x400; later, task i+j writes 10 to 0x400.
Architectural MM: the cache keeps both versions, each tagged with its Task ID (i: 0x400 = 2; i+j: 0x400 = 10).
Future MM: the cache keeps task i's version (i: 0x400 = 2).

Future Main Memory (continued)
Architectural MM: the cache keeps both versions (producer i: 0x400 = 2; producer i+j: 0x400 = 10).
Future MM: the cache keeps only the newest version (i+j: 0x400 = 10); the log (in cache or memory) records the overwritten one: overwriting Task ID i+j, producer Task ID i, address 0x400, value 2.
Performance cost: faster commit but slower version recovery; needs a log.

Roadmap
- Taxonomy of Buffering
- Hardware Support
- Software Support
- Evaluation
- Conclusions

Hardware Support
- Existing TLS protocol
- Add hooks to support Software Logging

TLS protocol [Zhang99]
[Figure: each processor's cache keeps per-word Read/Write bits; local memory, reached over the network, keeps per-word MaxR/MaxW Task IDs]

Hooks to Support Software Logging
1) Make Task ID pages visible to the software
2) Cache Task IDs
[Figure: processor cache extended to hold Task IDs alongside data; local memory keeps MaxR/MaxW Task IDs; the software log stores (Producer Task ID, Data) entries]

Accessing Task IDs in Software
Fixed offset between the mapping of data pages and the corresponding Task ID pages.
[Figure: data pages and Task ID pages in memory, separated by a fixed offset]
Use two different loads:
    ld     variable   (data access)
    ld_tid variable   (Task ID access)

Accessing Data
[Figure: the TLB translates the virtual page to the physical data page; ld variable sends the data access to the cache]

Accessing Task IDs
[Figure: the TLB translates the virtual page to the physical data page; ld_tid variable applies the fixed offset and sends the Task ID access to the cache]

Caching Task IDs
- Bring the Task ID into the cache with the ld_tid instruction, as regular data
- Task IDs can be reused from the cache
- Task IDs in memory are updated in hardware by the TLS protocol
- To keep cached Task IDs up-to-date in software: st_tid instruction

Roadmap
- Taxonomy of Buffering
- Hardware Support
- Software Support
- Evaluation
- Conclusions

Software Logs
A compiler instruments the application:
- Insert entry in log: before a store operation, add instructions to save the previous value, address, and Task ID in the log
- Recycle log entries: free up the log created by a task when that task commits
Interrupt handler:
- Recovery: on an out-of-order RAW and squash, undo the modifications using data from the log

Software Data Structures
Logs are allocated locally before speculation starts.
- Task Pointer Table: one entry per task (Valid, Task ID, End, Next)
- Log Buffer: one record per logged store (Task ID, Vaddr, Value)

Instructions to insert an entry in the log:
    addu   r4, r3, offset        ; address of the variable
    sw     r4, 0(r2)             ; store it in the log
    ld_tid r4, offset(r3)        ; load Task ID
    sw     r4, 4(r2)             ; store it in the log
    lw     r4, offset(r3)        ; load value of the variable
    sw     r4, 8(r2)             ; store it in the log
    addu   r2, r2, log_record_size
    st_tid r5, offset(r3)        ; store Task ID
    sw     r5, offset(r3)        ; the original store

Reducing Overheads
Only the first store to a location in a task needs to create a log entry.
Solution: at run time, check whether the store is a first store. Use the cached Task ID to filter:
    if (Cached Task ID == Current Task ID) skip logging
    else insert log entry

Filtering the First Speculative Store
    ld_tid r6, offset(r3)        ; load Task ID
    beq    r6, r5, no_insert     ; first store?
    addu   r4, r3, offset        ; insert as usual
    sw     r4, 0(r2)
    ...
    addu   r2, r2, log_record_size
    st_tid r5, offset(r3)        ; store Task ID
no_insert:
    sw     r5, offset(r3)        ; the original store

Roadmap
- Taxonomy of Buffering
- Hardware Support
- Software Support
- Evaluation
- Conclusions

Evaluation Environment
- Execution-driven simulator
- Multiprocessor with 16 processors
- 4-issue out-of-order superscalar processor + 2 levels of cache
- Compared against:
  - FMM with Hardware Logging [Zhang99]
  - Advanced AMM system [HPCA03]

Applications
Numerical applications:
- Apsi (SPECfp2000)
- Dsmc3d and Euler (HPF-2)
- P3m (NCSA)
- Tree (Univ. of Hawaii)
- Track (Perfect Club)
The non-analyzable loops account on average for 61% of the serial execution time.
- Non-analyzable loops are identified with the Polaris parallelizing compiler
- Speed-ups are shown for the non-analyzable loops only

Execution Time Comparison
[Figure: normalized execution time of FMM with hardware logging (Hw), FMM with software logging (Sw), and the advanced AMM, for P3m, Tree, Apsi, Track, Dsmc3d, Euler, and the average]
On average, FMM.Sw introduces only a 10% slow-down over FMM.Hw.
On average, FMM.Sw is similar to the advanced AMM.

First-Store Filtering
[Figure: execution time of FMM.Sw with and without first-store filtering, for P3m, Tree, Apsi, Track, Dsmc3d, and Euler]
Filtering has a significant impact on execution time in Apsi.
Since filtering does not hurt, we recommend using it.

Other Results in the Paper
Studied design tradeoffs:
- Log accesses bypass / do not bypass the L1 cache
- Log space is / is not recycled
Neither affects the performance of FMM.Sw when filtering is used.

Conclusions
FMM.Sw is a cost-effective solution:
- Simplified design relative to FMM.Hw
- Introduces low execution overhead (10% over FMM.Hw)
- Filtering first stores is beneficial


FMM.Sw versus Advanced AMM
Advanced AMM needs Version Combining Logic:
- Collect all the versions that are committed
- Select the youngest one and invalidate the others
FMM.Sw needs:
- A Task ID in main memory
- A comparator to pick the youngest version

Problem: Addressing task IDs in software
[Figure: the TLB maps to a non-existent physical address; the cache, tagged with that non-existent physical address, holds data and load/store bits; local memory holds the task IDs; a fixed offset is added]

Problem: Addressing task IDs in software
[Figure: the TLB maps to a non-existent physical address; the NIC distinguishes local from shared physical addresses; local memory holds the task IDs; shared data is reached over the network]

Problem: Addressing time stamps in software
The OS allocates a page of task IDs and maps its virtual address to a non-existent physical page.
[Figure: TLB, cache with data and load/store bits, NIC with local/shared physical addresses, local memory with task IDs, and shared data over the network]

Speculative protocol
- Cache: load and store bits per word in the cache
- Local Memory: a task ID per word
- ISA: new load/store instructions

Problem: Addressing task IDs in software
The task ID is not mapped in virtual space. How do we make the task ID visible to the software?
Logging instruction: load r3, addr_task_ID?
Undo Log entry: (Vaddr, task ID, Value); followed by the store: sw r5, offset(r3)

Problem: Addressing task IDs in software
The task ID is not mapped in virtual space. How do we make the task ID visible to the software?
Use a special instruction: lh_tid
Logging instruction: lh_tid r3, offset(r3)
Undo Log entry: (Vaddr, task ID, Value); followed by the store: sw r5, offset(r3)

Accessing Task IDs in Software
Option 2) Map to a non-existent physical address.
[Figure: the virtual page maps to a non-existent physical page at a fixed offset from the physical data page; lh_tid r3, offset(r3) sends the task ID access to the cache]

Reducing Overheads
Only first stores in the task need to create a log entry.
Solution: at run time, check whether the store is a first store, using the cached Task ID to filter.
[Figure: breakdown of stores into first stores vs. others and speculative vs. non-speculative, showing which are instrumented]

Filtering the first speculative store using extended loads (per-word load/store tag bits):
    xlw    r6, r1, offset(r3)    ; store bit goes to r6
    bgtz   r6, no_insert         ; first store?
    addu   r4, r3, offset        ; insert as usual
    sw     r4, 0(r2)
    lh_ts  r4, offset(r3)
    addu   r2, r2, log_record_size
    sh_tid r5, offset(r3)
no_insert:
    sw     r5, offset(r3)        ; the original store

Software handlers
- Recovery (out-of-order RAW): undo the modifications using data from the log
- Retrieval (some in-order RAWs): the exposed load needs to get its version from the log

[Figure: execution time for P3m, Tree, Apsi, Track, Dsmc3d, and Euler under four log configurations: bypass / no-bypass of the L1 cache (By/NoBy) crossed with recycling / no recycling of log space (R/NoR), with first-store filtering]

Accessing Task IDs in Software
Naïve solution: add an extra field to the TLB (Virtual Page -> Physical Data Page, Physical Task ID Page).
ld var: the data access goes to the cache.

Accessing Task IDs in Software
Naïve solution: add an extra field to the TLB (Virtual Page -> Physical Data Page, Physical Task ID Page).
ld_tid var: the task ID access goes to the cache.

Accessing Task IDs in Software
Fixed offset between the mapping of data pages and the corresponding Task ID pages:
- No TLB modifications
- In the ld_tid instruction, the hardware subtracts the offset to obtain the Physical Task ID page