DeAliaser: Alias Speculation Using Atomic Region Support

Wonsun Ahn*, Yuelu Duan, Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.illinois.edu

Memory Aliasing Prevents Good Code Generation
- Many popular compiler optimizations require code motion
  - Loop Invariant Code Motion (LICM): Body -> Preheader
  - Redundancy elimination: Redundant expr. -> First expr.
      r1 = a + b; r2 = a + b; c = r2   becomes   r1 = a + b; c = r1
- Memory aliasing prevents code motion
      r1 = a + b; *p = ...; r2 = a + b; c = r2
  If *p may alias a or b, the second a + b cannot be replaced by r1.
- Problem: compiler alias analysis is notoriously difficult
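
The effect of the intervening store can be reproduced in a few lines of C. This is only an illustrative sketch: the function names and the stored value v are hypothetical, and a and b are passed by pointer so that p can actually alias them.

```c
#include <assert.h>

/* Baseline: a + b is recomputed after the store through p. */
int sum_recomputed(int *a, int *b, int *p, int v) {
    assert(a && b && p);
    int r1 = *a + *b;
    *p = v;               /* may overwrite *a or *b */
    int r2 = *a + *b;     /* must be recomputed unless p is known not to alias */
    (void)r1;
    return r2;
}

/* Redundancy eliminated: correct only when p aliases neither a nor b. */
int sum_reused(int *a, int *b, int *p, int v) {
    assert(a && b && p);
    int r1 = *a + *b;
    *p = v;
    return r1;            /* stale if the store changed *a or *b */
}
```

When p points at an unrelated variable both functions agree; when p points at a, only the recomputed version sees the new value.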

Alias Speculation
- Compile time: optimize assuming certain alias relationships
- Run time: check those assumptions
- Recover if assumptions are incorrect
- Enables further optimizations beyond what's provable statically

Contribution: Repurpose Transactions for Alias Speculation
- Atomic Regions (a.k.a. transactions) are here: Intel TSX, AMD ASF, IBM Blue Gene/Q, IBM Power
- HW for Atomic Regions performs:
  - Memory alias detection across threads
  - Buffering of speculative state
- DeAliaser: repurpose it to detect aliasing within a thread as we move accesses
- How?
  - Cover the code motion span in an Atomic Region
  - Speculate that may-aliases in the span are no-aliases
  - Check speculated aliases using transactional HW
  - Recover from failure by rolling back the transaction

Repurposing Transactional Hardware
Cache line fields: SR | SW | Tag | Data
- Repurpose SR (Speculatively Read) bits to mark load locations that need monitoring due to code motion
  - Do not mark SR bits for regular loads inside the atomic region
  - Atomic region cannot be used for conventional TM
- SW (Speculatively Written) bits are still set by all the stores
  - Record all the transaction's speculative data for rollback
- Add ISA extensions to manipulate and check SR and SW bits

Instructions to Mark Atomic Regions
- begin atomic PC / end atomic: start / end an optimization atomic region
- PC is the address of the Safe-Version of the atomic region
  - Atomic region code without speculative optimizations
  - Execution jumps to the Safe-Version after rollback
- Same as regular atomic regions in TM systems, except that SR bit marking by regular loads is turned off

Extensions to the ISA (for Recording Monitored Locations)
- load.r r1, addr
  - Loads location addr to r1 just like a regular load
  - Marks the SR bit in the cache line containing addr
  - Used for marking monitored loads
- clear.r addr
  - Clears the SR bit in the cache line containing addr
  - Used to mark the end of load monitoring
- Repurposing of SR bits allows selective monitoring of the loaded location between load.r and clear.r
- Recall: all stored locations are monitored until the end of the atomic region

Extensions to the ISA (for Checking Monitored Locations)
- storechk.(r/w/rw) r1, addr
  - Stores r1 to location addr just like a regular store
  - r: if the SR bit is set, rollback
  - w: if the SW bit is set, rollback
  - rw: if either SR or SW is set, rollback
- loadchk.(r/w/rw) r1, addr
  - Loads location addr to r1 just like a regular load
  - r: if the SR bit is set, rollback
  - w: if the SW bit is set, rollback
  - rw: if either SR or SW is set, rollback
  - r, rw: set the SR bit after checking
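
The bit-level behavior of these instructions can be mimicked with a small software model. The sketch below is an assumption-laden illustration, not the proposed hardware: it tracks only SR/SW bits per 64-byte line (no data, no real rollback), and every name in it (AtomicRegion, line_of, CHK_R, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { LINE_BYTES = 64, NLINES = 16 };                /* assumed sizes */
enum { CHK_R = 1, CHK_W = 2, CHK_RW = CHK_R | CHK_W };

typedef struct {
    bool sr[NLINES];   /* Speculatively Read bit per cache line */
    bool sw[NLINES];   /* Speculatively Written bit per cache line */
    bool rollback;     /* latched when any check fails */
} AtomicRegion;

int line_of(unsigned addr) {
    assert(addr < (unsigned)(LINE_BYTES * NLINES));
    return (int)(addr / LINE_BYTES);
}

/* Start of an atomic region: all bits clear. */
void region_begin(AtomicRegion *t) { memset(t, 0, sizeof *t); }

/* Regular store: always sets SW. Regular loads set nothing, so they are not modeled. */
void store(AtomicRegion *t, unsigned addr) { t->sw[line_of(addr)] = true; }

/* load.r sets the SR bit of the line; clear.r clears it. */
void load_r(AtomicRegion *t, unsigned addr)  { t->sr[line_of(addr)] = true; }
void clear_r(AtomicRegion *t, unsigned addr) { t->sr[line_of(addr)] = false; }

/* Shared check for the .r / .w / .rw variants. */
void check(AtomicRegion *t, int mode, int line) {
    if (((mode & CHK_R) && t->sr[line]) || ((mode & CHK_W) && t->sw[line]))
        t->rollback = true;
}

/* storechk: check first, then behave like a regular store. */
void storechk(AtomicRegion *t, int mode, unsigned addr) {
    int line = line_of(addr);
    check(t, mode, line);
    t->sw[line] = true;
}

/* loadchk: check first; the r and rw variants then set SR. */
void loadchk(AtomicRegion *t, int mode, unsigned addr) {
    int line = line_of(addr);
    check(t, mode, line);
    if (mode & CHK_R) t->sr[line] = true;
}
```

Because the bits are kept per line, two different addresses in the same 64-byte line are treated as aliases in this model.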

How are these Instructions Used?
- Four code motions are supported
  - Hoisting / sinking loads
  - Hoisting / sinking stores
- Some color coding before going into details
  - Green: moved instructions
  - Red: instructions alias-checked against moved instructions
  - Orange: instructions alias-checked against moved instructions unnecessarily (checks due to imprecision)

Code Motion 1: Hoisting Loads
    Before:  store X; load A
    After:   load.r A; storechk.r X; clear.r A
1. Change load A to load.r A to set up monitoring of A
2. Change store X to storechk.r X to check the monitor
3. Insert clear.r A to turn off monitoring at the end of the motion span
4. If there is an overlapping monitor (for example, a preceding load.r B), loadchk.r A is used instead of load.r A
   - Checks whether load.r B set up a monitor in the same cache line
   - Prevents clear.r A from clearing the monitor set up by load.r B
    With overlap:  load.r B; loadchk.r A; storechk.r X; clear.r A
Alias check is precise: selectively check against only the stores in the code motion span
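
The hoisted sequence can be traced with a minimal bit-level model. This is a sketch under stated assumptions: 64-byte lines, SR/SW bits only, and hypothetical names throughout (AtomicRegion, hoist_load, the prior parameter standing for some store executed before the motion span).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { LINE_BYTES = 64, NLINES = 16 };   /* assumed sizes */

typedef struct { bool sr[NLINES]; bool sw[NLINES]; bool rollback; } AtomicRegion;

int line_of(unsigned addr) {
    assert(addr < (unsigned)(LINE_BYTES * NLINES));
    return (int)(addr / LINE_BYTES);
}

void store(AtomicRegion *t, unsigned addr)   { t->sw[line_of(addr)] = true; }
void load_r(AtomicRegion *t, unsigned addr)  { t->sr[line_of(addr)] = true; }
void clear_r(AtomicRegion *t, unsigned addr) { t->sr[line_of(addr)] = false; }

/* storechk.r: roll back if the line is being monitored, then store. */
void storechk_r(AtomicRegion *t, unsigned addr) {
    int line = line_of(addr);
    if (t->sr[line]) t->rollback = true;
    t->sw[line] = true;
}

/* store X; load A  ==>  load.r A; storechk.r X; clear.r A
 * Returns true if the region would roll back. */
bool hoist_load(unsigned prior, unsigned a, unsigned x) {
    AtomicRegion t;
    memset(&t, 0, sizeof t);
    store(&t, prior);       /* outside the motion span: never checked */
    load_r(&t, a);          /* start monitoring A */
    storechk_r(&t, x);      /* the only store checked against A */
    clear_r(&t, a);         /* end of the motion span */
    return t.rollback;
}
```

A store to A's line before the span does not cause a rollback, which is the precision the slide refers to; a storechk.r to any address in A's line does.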

Code Motion 2: Sinking Stores
    Before:  load.r W; store X; store A; load Y; store Z
    After:   load.r W; store X; loadchk.r Y; store Z; storechk.rw A; clear.r Y
1. Change store A to storechk.rw A to check preceding reads and writes
2. Change load Y to loadchk.r Y to set up monitoring of Y
3. Note store Z is already monitored, so no change is needed
4. Note load.r W and store X are checked unnecessarily even though they are not in the code motion span
Alias check is imprecise: checks against all preceding stores and monitored loads
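
The imprecision in step 4 shows up directly in a bit-level trace of the sunk-store sequence. As before, this is a hedged sketch: only SR/SW bits per 64-byte line are modeled, and all identifiers (AtomicRegion, sink_store, and the helpers) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { LINE_BYTES = 64, NLINES = 16 };   /* assumed sizes */

typedef struct { bool sr[NLINES]; bool sw[NLINES]; bool rollback; } AtomicRegion;

int line_of(unsigned addr) {
    assert(addr < (unsigned)(LINE_BYTES * NLINES));
    return (int)(addr / LINE_BYTES);
}

void store(AtomicRegion *t, unsigned addr)   { t->sw[line_of(addr)] = true; }
void load_r(AtomicRegion *t, unsigned addr)  { t->sr[line_of(addr)] = true; }
void clear_r(AtomicRegion *t, unsigned addr) { t->sr[line_of(addr)] = false; }

/* loadchk.r: roll back if SR is already set, then set SR. */
void loadchk_r(AtomicRegion *t, unsigned addr) {
    int line = line_of(addr);
    if (t->sr[line]) t->rollback = true;
    t->sr[line] = true;
}

/* storechk.rw: roll back if SR or SW is set, then store. */
void storechk_rw(AtomicRegion *t, unsigned addr) {
    int line = line_of(addr);
    if (t->sr[line] || t->sw[line]) t->rollback = true;
    t->sw[line] = true;
}

/* load.r W; store X; loadchk.r Y; store Z; storechk.rw A; clear.r Y
 * Returns true if the region would roll back. */
bool sink_store(unsigned w, unsigned x, unsigned y, unsigned z, unsigned a) {
    AtomicRegion t;
    memset(&t, 0, sizeof t);
    load_r(&t, w);          /* before the span, but still monitored */
    store(&t, x);           /* before the span, but SW stays set */
    loadchk_r(&t, y);       /* in the span: monitor Y */
    store(&t, z);           /* in the span: SW set as usual */
    storechk_rw(&t, a);     /* sunk store checks every SR and SW bit */
    clear_r(&t, y);
    return t.rollback;
}
```

Aliasing with Y or Z is a real conflict, while aliasing with W or X triggers a rollback that the original program order did not require.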

Code Motion 3: Sinking Clears
    Before:  loadchk.r A; storechk.r X; clear.r A; store Y; storechk.r Z
    After:   load.r A; storechk.r X; store Y; storechk.r Z
1. Sink clear.r A to the end of the atomic region
2. Trivially remove clear.r A at the end of the atomic region
3. Change loadchk.r A to load.r A
4. Note storechk.r Z may now trigger an unnecessary rollback
- Sinking clears can reduce overhead at the price of potentially increasing imprecision
- Clears are the only source of instrumentation overhead (besides begin atomic and end atomic)
- Can perform alias checking with almost no overhead
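
The trade-off between the two sequences can be checked with the same kind of bit-level model. This sketch assumes 64-byte lines and tracks SR/SW bits only; run_sequence and the other names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { LINE_BYTES = 64, NLINES = 16 };   /* assumed sizes */

typedef struct { bool sr[NLINES]; bool sw[NLINES]; bool rollback; } AtomicRegion;

int line_of(unsigned addr) {
    assert(addr < (unsigned)(LINE_BYTES * NLINES));
    return (int)(addr / LINE_BYTES);
}

void store(AtomicRegion *t, unsigned addr)   { t->sw[line_of(addr)] = true; }
void load_r(AtomicRegion *t, unsigned addr)  { t->sr[line_of(addr)] = true; }
void clear_r(AtomicRegion *t, unsigned addr) { t->sr[line_of(addr)] = false; }

void loadchk_r(AtomicRegion *t, unsigned addr) {
    int line = line_of(addr);
    if (t->sr[line]) t->rollback = true;
    t->sr[line] = true;
}

void storechk_r(AtomicRegion *t, unsigned addr) {
    int line = line_of(addr);
    if (t->sr[line]) t->rollback = true;
    t->sw[line] = true;
}

/* clears_sunk == false: loadchk.r A; storechk.r X; clear.r A; store Y; storechk.r Z
 * clears_sunk == true:  load.r A;    storechk.r X;            store Y; storechk.r Z
 * Returns true if the region would roll back. */
bool run_sequence(bool clears_sunk, unsigned a, unsigned x, unsigned y, unsigned z) {
    AtomicRegion t;
    memset(&t, 0, sizeof t);
    if (clears_sunk) load_r(&t, a); else loadchk_r(&t, a);
    storechk_r(&t, x);
    if (!clears_sunk) clear_r(&t, a);   /* monitoring of A ends here */
    store(&t, y);
    storechk_r(&t, z);                  /* still sees A's SR bit when the clear was sunk */
    return t.rollback;
}
```

With Z in the same line as A, only the version without the clear rolls back; a conflict with X is caught by both.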

Illustrative Example: LICM and GVN

Source loop:
    // a, b, *q may alias with *p
    for (i = 0; i < 100; i++) {
        a = b + 10;
        *p = *q + 20;
        ... = *q + 20;
    }

Steps:
1. Put an atomic region around the loop
2. Perform optimizations after inserting appropriate checks
   - Hoist b + 10 (LICM): load r1, b becomes load.r r1, b in the preheader, store r4, *p becomes storechk.r r4, *p, and clear.r b is inserted after the loop
   - Remove the 2nd *q + 20 (GVN): load r3, *q becomes loadchk.r r3, *q, and clear.r *q is inserted after storechk.r r4, *p
   - Sink / remove all clears
   - Sink store r2, a (LICM): it becomes storechk.w r2, a after the loop

Before:
    begin atomic PC
    for (i = 0; i < 100; i++) {
        load r1, b
        r2 = r1 + 10
        store r2, a
        load r3, *q
        r4 = r3 + 20
        store r4, *p
        load r5, *q
        r6 = r5 + 20
        ...
    }

After:
    begin atomic PC
    load.r r1, b
    r2 = r1 + 10
    for (i = 0; i < 100; i++) {
        loadchk.r r3, *q
        r4 = r3 + 20
        storechk.r r4, *p
        ...
    }
    storechk.w r2, a

Loop body reduced from 8 instructions to 3 instructions, with no alias check overhead.
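
For readers without atomic-region hardware, the before/after pair can be approximated in plain C. Note the substitution: instead of hardware checks and rollback, this sketch tests the pointers once before the loop and falls back to the safe version, which works here only because the pointers are loop-invariant. The function names and the sum accumulator (standing in for the elided use of the second *q + 20) are hypothetical.

```c
#include <assert.h>

/* Safe version: the loop exactly as written in the source. */
long loop_safe(int *a, int *b, int *p, int *q) {
    long sum = 0;
    for (int i = 0; i < 100; i++) {
        *a = *b + 10;
        *p = *q + 20;
        sum += *q + 20;
    }
    return sum;
}

/* Optimized version: valid only if *p aliases none of a, b, *q. */
long loop_optimized(int *a, int *b, int *p, int *q) {
    long sum = 0;
    int r2 = *b + 10;              /* hoisted (LICM) */
    for (int i = 0; i < 100; i++) {
        int r4 = *q + 20;
        *p = r4;
        sum += r4;                 /* second *q + 20 removed (GVN) */
    }
    *a = r2;                       /* store sunk out of the loop (LICM) */
    return sum;
}

/* Software stand-in for speculation: check the assumed no-alias
 * relationships up front and pick the matching version. */
long loop_speculative(int *a, int *b, int *p, int *q) {
    assert(a && b && p && q);
    if (p == a || p == b || p == q)
        return loop_safe(a, b, p, q);
    return loop_optimized(a, b, p, q);
}
```

With distinct variables the optimized path runs; when p and q point to the same int, the wrapper takes the safe path and produces the same result as the original loop.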

Issues
- Imprecision
  - Issue: a single set of SR & SW bits makes checks imprecise
  - Solution: could add more SR & SW bits to encode different code motion spans in different sets
  - Can be implemented efficiently using HW Bloom filters
- Isolation
  - Issue: repurposing SR bits compromises isolation
  - Solution: do not use the same atomic region for both alias speculation and TM
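
One way to picture per-span monitor sets backed by Bloom filters is the sketch below. It is an assumption, not the design evaluated in the talk: the 64-bit signature size, the two hash functions, the number of spans, and all names (Signature, SpanMonitors, monitor_line, store_conflicts) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { NSPANS = 4 };                       /* assumed number of code motion spans */

typedef struct { uint64_t bits; } Signature;          /* 64-bit Bloom filter */
typedef struct { Signature sr[NSPANS]; } SpanMonitors; /* one SR set per span */

/* Two simple hash functions, each selecting one of 64 bits. */
unsigned hash1(uint32_t line) { return line & 63u; }
unsigned hash2(uint32_t line) { return (uint32_t)(line * UINT32_C(2654435761)) >> 26; }

uint64_t line_mask(uint32_t line) {
    return (UINT64_C(1) << hash1(line)) | (UINT64_C(1) << hash2(line));
}

/* Record that `line` is monitored for code motion span `span`. */
void monitor_line(SpanMonitors *m, unsigned span, uint32_t line) {
    assert(span < NSPANS);
    m->sr[span].bits |= line_mask(line);
}

/* May report a false conflict, but never misses a monitored line. */
bool store_conflicts(const SpanMonitors *m, unsigned span, uint32_t line) {
    assert(span < NSPANS);
    uint64_t mask = line_mask(line);
    return (m->sr[span].bits & mask) == mask;
}

/* End of a span's monitoring: drop its whole set at once. */
void clear_span(SpanMonitors *m, unsigned span) {
    assert(span < NSPANS);
    m->sr[span].bits = 0;
}
```

A store is checked only against the set of the span it belongs to, so a monitor set up for one span no longer forces a rollback in another; the filter can still yield false positives within a span.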

Compiler Toolchain
1. Performs loop blocking that uses memory footprint estimation
2. Wraps loops in atomic regions and creates safe versions
3. Performs speculative optimizations using DeAliaser
4. Profiles the binary to find out which optimizations are beneficial according to a cost-benefit model
5. Disables non-beneficial optimizations in the final binary

Experimental Setup
- Compare three environments using LICM and GVN/PRE optimizations:
  - BaselineAA: unmodified LLVM-2.8 using basic alias analysis
    - Default alias analysis used by O3 optimization
  - DSAA: unmodified LLVM-2.8 using data structure alias analysis
    - Experimental alias analysis with high time/space complexity
  - DeAliaser: modified LLVM-2.8 using DeAliaser to perform alias speculation
- Applications: SPEC INT2006, SPEC FP2006
- Simulation: SESC timing simulator with Atomic Region support
  - 32KB 8-way associative speculative L1 cache with 64B lines

Breakdown of Alias Analysis Results
[Figure: stacked bars showing the share of Must Alias / No Alias / May Alias results (0% to 100%) for BaselineAA, DSAA, and DeAliaser on SPECINT2006 and SPECFP2006]
- DeAliaser is able to convert almost all may-aliases to no-aliases

Speedups Normalized to Baseline
[Figure: speedup bars (y-axis from 1 to 1.1), split into GVN/PRE and LICM contributions, for DSAA and DeAliaser on SPECINT2006 and SPECFP2006]
- DeAliaser speeds up SPEC INT by 2.5% and SPEC FP by 9%

Summary
- Proposed a set of ISA extensions to expose Atomic Regions to SW for alias checking
  - Performed hoisting / sinking of loads and stores
  - With minimal instrumentation overhead
  - Some imprecision due to HW limitations
- Evaluated using LICM and GVN/PRE
  - May-alias results: 56% to 4% for SPEC INT, 43% to 1% for SPEC FP
  - Speedup: 2.5% for SPEC INT, 9% for SPEC FP

Questions?

Atomic Region Characterization
- Low L1 cache occupancy due to not buffering speculatively read lines
- Overhead amortized over large atomic region

Speedups (SPECINT)
Normalized against BaselineAA. D = DSAA, A = Line-granularity DeAliaser, W = Word-granularity DeAliaser

Speedups (SPECFP)
Normalized against BaselineAA. D = DSAA, A = Line-granularity DeAliaser, W = Word-granularity DeAliaser

Commit Latency Sensitivity (SPECINT)
Normalized against BaselineAA. DeAliaser with A = 1-cycle commit, B = 10-cycle commit, C = 100-cycle commit

Commit Latency Sensitivity (SPECFP)
Normalized against BaselineAA. DeAliaser with A = 1-cycle commit, B = 10-cycle commit, C = 100-cycle commit

Rollback Overhead (SPECINT)
Normalized against BaselineAA. A = DeAliaser, G = Aggressive DeAliaser ignoring cost model

Rollback Overhead (SPECFP)
Normalized against BaselineAA. A = DeAliaser, G = Aggressive DeAliaser ignoring cost model

Dynamic Instruction Reduction (SPECINT)
B = BaselineAA, D = DSAA, A = DeAliaser

Dynamic Instruction Reduction (SPECFP)
B = BaselineAA, D = DSAA, A = DeAliaser

Alias Analysis Results (SPECINT)
B = BaselineAA, D = DSAA, A = DeAliaser

Alias Analysis Results (SPECFP)
B = BaselineAA, D = DSAA, A = DeAliaser