FlexTM. Flexible Decoupled Transactional Memory Support. Arrvindh Shriraman Sandhya Dwarkadas Michael L. Scott Department of Computer Science

Similar documents
Chí Cao Minh 28 May 2008

EazyHTM: Eager-Lazy Hardware Transactional Memory

Improving the Practicality of Transactional Memory

Lowering the Overhead of Nonblocking Software Transactional Memory

LogTM: Log-Based Transactional Memory

Conflict Detection and Validation Strategies for Software Transactional Memory

Log-Based Transactional Memory

Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

OS Support for Virtualizing Hardware Transactional Memory

Lecture 8: Transactional Memory TCC. Topics: lazy implementation (TCC)

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28

Relaxing Concurrency Control in Transactional Memory. Utku Aydonat

Lecture 17: Transactional Memories I

Dependence-Aware Transactional Memory for Increased Concurrency. Hany E. Ramadan, Christopher J. Rossbach, Emmett Witchel University of Texas, Austin

LogSI-HTM: Log Based Snapshot Isolation in Hardware Transactional Memory

TSO-CC: Consistency-directed Coherence for TSO. Vijay Nagarajan

Software-Controlled Multithreading Using Informing Memory Operations

An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees

EXPLOITING SEMANTIC COMMUTATIVITY IN HARDWARE SPECULATION

DHTM: Durable Hardware Transactional Memory

Transactional Memory

Lecture: Transactional Memory. Topics: TM implementations

TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Design and Implementation of Signatures in Transactional Memory Systems

Hardware Transactional Memory Architecture and Emulation

6 Transactional Memory. Robert Mullins

Performance Evaluation of Adaptivity in STM. Mathias Payer and Thomas R. Gross Department of Computer Science, ETH Zürich

Work Report: Lessons learned on RTM

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Tradeoffs in Transactional Memory Virtualization

Implementing and Evaluating Nested Parallel Transactions in STM. Woongki Baek, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun Stanford University

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

Prototyping Architectural Support for Program Rollback Using FPGAs

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

Partition-Based Hardware Transactional Memory for Many-Core Processors

The Bulk Multicore Architecture for Programmability

Design and Implementation of Signatures. for Transactional Memory Systems

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Exploiting object structure in hardware transactional memory

Lecture: Consistency Models, TM

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

ATLAS: A Chip-Multiprocessor. with Transactional Memory Support

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

DESIGNING AN EFFECTIVE HYBRID TRANSACTIONAL MEMORY SYSTEM

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007

Consistency & TM. Consistency

Page 1. Consistency. Consistency & TM. Consider. Enter Consistency Models. For DSM systems. Sequential consistency

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

EigenBench: A Simple Exploration Tool for Orthogonal TM Characteristics

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Performance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing

Two Academic Papers attached for use with Q2 and Q3. Two and a Half hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE

Transactional Memory. Lecture 19: Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

Token Coherence. Milo M. K. Martin Dissertation Defense

Hardware Support For Serializable Transactions: A Study of Feasibility and Performance

CPS104 Computer Organization and Programming Lecture 16: Virtual Memory. Robert Wagner

System Challenges and Opportunities for Transactional Memory

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

CPS 104 Computer Organization and Programming Lecture 20: Virtual Memory

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

DeAliaser: Alias Speculation Using Atomic Region Support

The Design Complexity of Program Undo Support in a General Purpose Processor. Radu Teodorescu and Josep Torrellas

McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime

Kaisen Lin and Michael Conley

Speculative Synchronization

An Integrated Pseudo-Associativity and Relaxed-Order Approach to Hardware Transactional Memory

Unbounded Page-Based Transactional Memory

Nonblocking Transactions Without Indirection Using Alert-on-Update

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Commit Algorithms for Scalable Hardware Transactional Memory. Abstract

Potential violations of Serializability: Example 1

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional. memory

tapping into parallel ism with transactional memory

Computer Architecture

SELF-TUNING HTM. Paolo Romano

Cache Coherence Protocols for Chip Multiprocessors - I

Lecture 12 Transactional Memory

Portland State University ECE 588/688. Transactional Memory

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence

Decoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor

Lecture 6: TM Eager Implementations. Topics: Eager conflict detection (LogTM), TM pathologies

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

FARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures

PERFORMANCE PATHOLOGIES IN HARDWARE TRANSACTIONAL MEMORY

ByteSTM: Java Software Transactional Memory at the Virtual Machine Level

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

ASelectiveLoggingMechanismforHardwareTransactionalMemorySystems

FlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Lecture 10: TM Implementations. Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation

EECS 570 Final Exam - SOLUTIONS Winter 2015

Exploiting Semantic Commutativity in Hardware Speculation

ABORTING CONFLICTING TRANSACTIONS IN AN STM

Transcription:

FlexTM Flexible Decoupled Transactional Memory Support Arrvindh Shriraman Sandhya Dwarkadas Michael L. Scott Department of Computer Science 1

Transactions: Our Goal Lazy Txs (i.e., optimistic conflict resolution) more concurrency SW coordinates conflict management when (i.e., eagerly or lazily) how (i.e., stalling, who aborts) Limitless Txs Large: cache victimization and paging Long: thread switches 2

Flexible Transactional Memory 100 80 Versioning (Isolation) STM (e.g., RSTM) all software approach Execution Time 60 40 Validation (Consistency check) Bookkeeping (Metadata ops.) 20 0 Application (Useful Work) 3

Flexible Transactional Memory Execution Time 100 80 60 40 Versioning (Isolation) Validation (Consistency check) Bookkeeping (Metadata ops.) STM (e.g., RSTM) all software approach RTM [ISCA 07] new cache states help bounded txs software handles large & long txs 20 0 Application (Useful Work) 3

Flexible Transactional Memory Execution Time 100 80 60 40 20 0 Versioning (Isolation) Validation (Consistency check) Bookkeeping (Metadata ops.) Application (Useful Work) STM (e.g., RSTM) all software approach RTM [ISCA 07] new cache states help bounded txs software handles large & long txs FlexTM [this paper] Good Performance No per-location software metadata Simple hardware No bulk arbiters like lazy HTMs Allows software policy 3

Decoupled Hardware Primitives (1/2) Separate interchangeable basic hardware ops. that can be coordinated by software Why? Minimizes hardware state small footprint, simplifies virtualization reduces development time Software accessible to build transactions & fine-tune policy decisions to repurpose hardware for non-tx applications 4

Decoupled Hardware Primitives (2/2) 1. Data Isolation (delaying visibility of stores) caches buffer speculative values, provide fast-commit SW allocates overflow region & HW performs access 2. Access Summary (tracking locations accessed) maintains list of locations read & written check on coherence messages or local memory ops. 3. Conflict Summary (tracking data conflict events) tracks conflict occurrence and type between processors 4. Alert-On-Update monitor cache-blocks and trigger handlers 5

Outline Preview Data Isolation (aka. Lazy Versioning) Lazy coherence Overflow-Table Conflict Management FlexTM Software Evaluation Summary 6

Lazy Coherence (1/2): Approach Lazy coherence: permit multiple readers & writers for a cache block restore coherence for multiple lines simultaneously Current Research (e.g., TCC, Bulk) bulk arbiters, bulk GetXs, bulk ops. on directory Our approach: eager messages but lazy coherence look out for sharer conflicts in standard coherence msgs. continue caching data, but use T-MESI states simple bit-clear ops. convert T-MESI to MESI No bulk messages or address ops. 7

Lazy Coherence (2/2): Protocol Two new T tagged states: TMI (T+M) and TI (T+I) TStores & TLoads denote speculative operations ISA can include instructions or SW can tell HW the regions MESI states TLoad / ~Threat TStore Commit Abort TLoad/Threat TMI TI TStore TMI buffers TStores allows multiple writers and readers no data response but threaten On commit, T+M => M On abort, T+M =>T+I => I TI caches threatened TLoads cache remotely TStored block On commit/abort, T+I => I + cached locations are accessed directly + bounded txs perform in-place update 8

Overflow Table Challenge : Where to put evicted TMI lines? Solution : Per-thread hash table (in virtual memory) Hardware controller fill table with TMI lines evicted from cache removes table entries when reloaded into cache performs look-aside transparently on L1 miss in parallel with L2 Addr 80 current values Config. Sets,Ways Base 100 TMI WB / L1 miss Overflow-Table controller Lookaside OSig {80} 100 120 TAGS 80 Data new values per-thread Overflow Table 9

Outline Preview Data Isolation (aka. Lazy Versioning) Conflict Management (flexible) Access summary signatures Conflict table Alert-On-Update FlexTM Software Evaluation Summary 10

Access Summary (1/2): Signatures Signatures [Bulk ISCA 06, LogTM-SE HPCA 07, SigTM ISCA 07] Bloom filters to represent unbounded set of cache blocks approx. representation with false positives Cache block Addr. hash1 hash2 hash3 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 Processor has two signatures: Rsig (Wsig) summarizes locations TLoad (TStore) Conflict Detection: Signatures snoop coherence messages responder detects conflict and overloads response requester picks response and resolves or notes conflict 11

Access Summary (2/2): Virtualization [details in paper] Required to handle long running txs & tx pauses Challenge : How to detect conflicts with suspended txs? Solution : Read and Write summary signatures at the directory, (note: does not affect cache hit critical path) Details: merge suspended txns signature with summary sig. all L1 cache misses test signatures if miss, no further action necessary if hit, trap to software routine that mimics conflict HW 12

Conflict Tables: Tracking Conflicts Current HTMs detect and resolve at the same time Eager HTM systems perform both on a conflict Lazy HTM systems perform both at commit time Our approach: decouple detection from resolution HW bitmaps record conflict event & expose to SW SW decides when and how to resolve conflicts Per-core conflict bitmap Core-P s table R-W W-R Ncore bits P s read--remote write P s write--remote write P s write--remote read Is there a conflict between P and core i? Ans: Yes (1) / No (0) 13

Conflict Tables: Operation 4 core machine C0 W C1 sig:{} Rsig:{} Wsig:{A} Rsig:{} L2 Directory A : M@C1 Either processor can resolve conflict prior to commit If eager, requester resolves conflict immediately Conflicter known, no central arbiter required 14

Conflict Tables: Operation 4 core machine TStore A C0 W C1 sig:{} Rsig:{} Wsig:{A} Rsig:{} L2 Directory A : M@C1 Either processor can resolve conflict prior to commit If eager, requester resolves conflict immediately Conflicter known, no central arbiter required 14

Conflict Tables: Operation 4 core machine TStore A 3 Threat C0 W C1 Wsig:{A} sig:{} Rsig:{} Wsig:{A} Rsig:{} 1 1 1 TGETX 2 Data 4 ACK_INV L2 Directory A : M@C1,C0 2 Fwd_TGETX Either processor can resolve conflict prior to commit If eager, requester resolves conflict immediately Conflicter known, no central arbiter required 14

Alert-On-Update (AOU) [ISCA 07] Vector specific coherence or update events to the processor in the form of a lightweight event/interrupt on invalidation (capacity eviction or coherence) on access/update (local event) Aload/Arelease A Tag Data Ld Add... Handler Remote Store / Eviction 15

Outline Preview Data Isolation (aka. Lazy Versioning) Conflict Management (flexible) FlexTM Software FlexTM Transaction Example Evaluation Summary 16

FlexTM Transaction (1/2) Per-Tx descriptor TSW State active / committed / aborted running / suspended CMPC AbortPC handler for conflict table events AOU events on TSW FlexTM deploys Signatures for detecting and notifying conflicts Conflict Tables for tracking and managing conflicts T-MESI for in-cache buffering and OT for cache overflows AOU for propagating abort events to remote txs. FlexTM software checkpoints registers at Begin_Tx manages conflicts; aborts remote tx by changing TSW controls commit protocol routine 17

Lazy Transactions: Example T1 Begin_Tx abort_pc1 T2 Begin_Tx abort_pc2 L1 Wsig:{} Rsig:{} C0 L1 Wsig:{} Rsig:{} C1 L2 Directory 18

Lazy Transactions: Example T1 Begin_Tx abort_pc1 ALD TSW0 T2 Begin_Tx abort_pc2 ALD TSW1 L1 TSW0: AE Wsig:{} Rsig:{} C0 L1 TSW1: AE Wsig:{} Rsig:{} C1 L2 Directory TSW0 : M@C0 TSW1 : M@C1 18

Lazy Transactions: Example T1 Begin_Tx abort_pc1 ALD TSW0 T2 Begin_Tx abort_pc2 ALD TSW1 TSt A L1 A: TMI TSW0: AE Wsig:{A} Wsig:{} Rsig:{} C0 L1 TSW1: AE Wsig:{} Rsig:{} C1 A : M@C0 L2 Directory TSW0 : M@C0 TSW1 : M@C1 18

Lazy Transactions: Example T1 Begin_Tx abort_pc1 ALD TSW0 T2 Begin_Tx abort_pc2 ALD TSW1 TSt TSt A B L1 A: TMI B: TMI TSW0: AE Wsig:{A,B} Wsig:{A} Wsig:{} Rsig:{} C0 L1 TSW1: AE Wsig:{} Rsig:{} C1 A : M@C0 B : M@C0 L2 Directory TSW0 : M@C0 TSW1 : M@C1 18

Lazy Transactions: Example T1 Begin_Tx abort_pc1 ALD TSW0 T2 Begin_Tx abort_pc2 ALD TSW1 TSt TSt A B TSt A L1 A: TMI B: TMI TSW0: AE Wsig:{A,B} Wsig:{A} Wsig:{} 1 Rsig:{} C0 L1 A: TMI TSW1: AE Wsig:{A} Wsig:{} 1 Rsig:{} C1 A : M@C0,C1 B : M@C0 L2 Directory TSW0 : M@C0 TSW1 : M@C1 18

Lazy Transactions: Example T1 Begin_Tx abort_pc1 ALD TSW0 T2 Begin_Tx abort_pc2 ALD TSW1 TSt TSt A B TSt TSt A B L1 A: TMI B: TMI TSW0: AE Wsig:{A,B} Wsig:{A} Wsig:{} 1 Rsig:{} C0 L1 A: TMI B: TMI TSW1: AE Wsig:{A,B} Wsig:{A} Wsig:{} 1 Rsig:{} C1 A : M@C0,C1 B : M@C0,C1 L2 Directory TSW0 : M@C0 TSW1 : M@C1 18

Lazy Transactions: Example T1 Begin_Tx abort_pc1 ALD TSW0 T2 Begin_Tx abort_pc2 ALD TSW1 TSt TSt A B Conflict & Commit protocol For-each i set in W-R or CAS (Status[i], ACT, ABORT) TSt TSt A B In software, decentralized, minimal overhead No. of conflicting Txs L1 A: TMI B: TMI TSW0: AE Wsig:{A,B} Wsig:{A} Wsig:{} 1 Rsig:{} C0 L1 A: TMI B: TMI TSW1: AE Wsig:{A,B} Wsig:{A} Wsig:{} 1 Rsig:{} C1 A : M@C0,C1 B : M@C0,C1 L2 Directory TSW0 : M@C0 TSW1 : M@C1 18

Lazy Transactions: Example T1 Begin_Tx abort_pc1 ALD TSW0 T2 Begin_Tx abort_pc2 ALD TSW1 TSt TSt A B Conflict & Commit protocol For-each i set in W-R or CAS (Status[i], ACT, ABORT) TSt TSt A B Conflict Handler! In software, decentralized, minimal overhead No. of conflicting Txs L1 A: TMI B: TMI TSW0: AE TSW1: M Wsig:{A,B} Wsig:{A} Wsig:{} 1 Rsig:{} C0 L1 A: TMI B: TMI TSW1: AE Wsig:{A,B} Wsig:{A} Wsig:{} 1 Rsig:{} C1 A : M@C0,C1 B : M@C0,C1 L2 Directory TSW0 : M@C0 TSW1 : M@C0 M@C1 18

Lazy Transactions: Example T1 Begin_Tx abort_pc1 ALD TSW0 T2 Begin_Tx abort_pc2 ALD TSW1 TSt TSt A B Conflict & Commit protocol For-each i set in W-R or CAS (Status[i], ACT, ABORT) CAS-Commit Status[id] TSt TSt A B Conflict Handler! In software, decentralized, minimal overhead No. of conflicting Txs L1 A: A: TMI M B: B: TMI M TSW0: AE M TSW1: M Wsig:{A,B} Wsig:{A} Wsig:{} 1 Rsig:{} C0 L1 A: TMI B: TMI TSW1: AE Wsig:{A,B} Wsig:{A} Wsig:{} 1 Rsig:{} C1 A : M@C0,C1 B : M@C0,C1 L2 Directory TSW0 : M@C0 TSW1 : M@C0 M@C1 18

Outline Preview Data Isolation (aka. Lazy Versioning) Conflict Management (flexible) FlexTM Software Evaluation Speedup Conflict resolution tradeoffs Other results Summary 19

Evaluation set-up Full system simulation, GEMS/SIMICS framework 16 core CMP with shared L2 ORIGIN 2000 like coherence protocol (3 hop requests and silent evictions) Workloads Data Structures: Hash,RBTree, LFUCache, Graph Applications: Scott s Delaunay, STAMP*, STMBench7 Runtime systems CGL, FlexTM (HTM interface), RTM-F, RSTM, & TL2 Polka conflict manager * - STAMP does not (yet?) interface with RTM-F and RSTM 20

FlexTM is Fast (1/2) 16 threads Normalized Throughput 10 8 6 4 2 0 2.3X FlexTM RTM-F RSTM CGL, 1 thread=1 1.9X 1.8X HashTable RBTree Delaunay 10 8 6 4 2 0 15X STMBench7 FlexTM gains over RTM-F proportional to SW bookkeeping overheads software metadata management ~50% of tx latency FlexTM gains over RSTM comparable to rigid policy HTMs 21

FlexTM is Fast (2/2) 16 threads Normalized Throughput 12 10 8 6 4 2 CGL, 1 thread=1 3.8X FlexTM 4.1X 1.4X TL2 H-High contention L-Low contention 1.9X 1.5X 0 Vacation-H Vacation-L Kmeans-L Bayes Genome Kmeans-L and Genome performance gains lower TL-2 per-access overheads low (i.e., high instructions / mem_access) Performance gains in Vacation higher lower number of instructions per memory word accessed 22

Lazy mode aids progress Normalized Throughput 10.0 7.5 5.0 2.5 0 Eager 1 thread=1, X-axis: No. of threads 1 2 4 8 16 RBTree 2 1 0 Lazy 1 2 4 8 16 Graph Lazy provides more commits Exploits R-W sharing, allows reader & writers to commit in parallel Eager causes cascaded stalls and aborts Lazy narrows conflict window 23

Mixed-mode can be better STMBench7 Normalized Throughput Eager Lazy EagerWW-LazyRW 12 10 8 6 4 2 0 1 thread=1, X-axis: No. of threads 1 2 4 8 16 Long writer (~1ms) mixed with short readers (tens thousands cycles) Pair-wise conflicts between writers, conflicts with multiple readers Eager doesn t permit R-W sharing and reduces reader throughput Lazy permits sharing, but wastes writer work on aborts Best Policy: Eager-WW with Lazy-RW 24

Other Results Area analysis [in paper] increase in core area small, OoO (0.6%), InO (3%) minimal change to pipeline, most hardware on L1 miss Comparison with Central-Arbiter HTM [in paper] broadcasts and central arbiters are an overkill de-centralized SW commit is efficient & important Non-Tx Applications Watchpoints [in TR-925] Two memory monitoring primitives, AOU & Signatures SW framework for detecting buffer overflows, memory leaks etc. 15-50X speedup over binary instrumentation 25

Summary Decouple TM hardware components to reduce HW complexity enable deployment for varied purposes FlexTM HW manages TM operations, SW manages policy decentralized conflict and commit protocol in SW Conflict management laziness is an important design requirement provides best value when left under software control 26

Summary Decouple TM hardware components to Questions? reduce HW complexity enable deployment for varied purposes http://www.cs.rochester.edu/research/cosyn http://www.cs.rochester.edu/research/synchronization FlexTM HW manages TM operations, SW manages policy decentralized conflict and commit protocol in SW Conflict management Acknowledgments Multifacet Research group, Wisconsin STAMP group, Stanford Transaction Benchmark group, EPFL Shan Lu, Opera group, Illinois laziness is an important design requirement provides best value when left under software control 26

27

28

FlexTM per-core Hardware Handler PC Flag register Processor Context Registers Control Regs. Read Signature Write Signature Read & Write Access Summary Tag Data L1 Data Cache 29

FlexTM per-core Hardware Handler PC Flag register R-W W-R Processor Context Registers Control Regs. Read Signature Write Signature Conflict Tables Ncore bits Read & Write Access Summary Conflict Table Tag Data L1 Data Cache 29

FlexTM per-core Hardware Handler PC Flag register R-W W-R Processor Context Registers Control Regs. Read Signature Write Signature Conflict Tables Ncore bits Read & Write Access Summary Conflict Table Alert-On-Update A Tag Data L1 Data Cache 29

FlexTM per-core Hardware Handler PC Flag register Data Isolation R-W W-R Processor Context Registers Control Regs. Read Signature Write Signature Conflict Tables Ncore bits Read & Write Access Summary Conflict Table Alert-On-Update ASI Base Address Overflow Sig. Hash Param. L1 miss T A Tag Data Overflow Count C/A Overflow Table Controller L1 Data Cache 29

FlexTM per-core Hardware Handler PC Flag register Data Isolation R-W W-R Processor Context Registers Control Regs. Read Signature Write Signature Conflict Tables Ncore bits Read & Write Access Summary Conflict Table Alert-On-Update ASI Base Address Overflow Sig. Hash Param. L1 miss T A Tag Data Overflow Count C/A Overflow Table Controller L1 Data Cache 29

FlexTM Area Complexity Core2 Power6 Niagara2 Orig. Core Area L1 area Signatures (2Kbit) Overflow Control %L1D area inc. % core area inc. 32mm 2 53mm 2 12mm 2 1.8mm 2 2.6mm 2 0.4mm 2 0.10% 0.12% 2.1% 0.5% 0.45% 0.3% 0.35% 0.3% 3.9% 0.61% 0.58% 2.5% Effect on the processor core minimal OoO cores (~0.6\%), In-Order (~4%) Negligible effect on L1 latency small area effects, data array is the critical path Signature effects noticeable only on Niagara2 8-way SMT needs 16 2Kbit signatures (4KB state) 30

Normalized Throughput 12 10 8 6 4 2 0 Hash Table CST Serial Parallel 1 2 4 8 16

Normalized Throughput 1.2 1.0 0.8 0.6 0.4 0.2 0 RandomGraph CST Serial Parallel 1 2 4 8 16

FlexWatcher: Memory Bug Detection FlexTM HW provides two HW primitives for watching memory AOU precisely monitors cache block aligned regions but is limited by cache size Signatures provided unlimited monitoring but are vulnerable to false positives. Extended the ISA to support them as first-class entities insert, member, read-index, activate, clear etc Developed a software bug detection tool add required addresses to signatures HW checks local & remote accesses against the signatures. triggers SW trampoline on signature hits handler disambiguates, if false positive return to execution 33

FlexWatcher Evaluation BugBench from illinois, set of real-life programs with known bugs. Bugs detected Buffer Overflow Solution: Pad all heap allocated buffers with 64bytes, watch padded locations Memory Leak Solution: Monitor all heap allocated objects and update the address s timestamp on access. Invariant Violation: Solution: ALoad cache line of interested variable X. On AOU handler trigger assert program specific invariants. 34

FlexWatcher Performance Compared against Discover, popular SPARC binary instrumentation tool from Benchmark Bug FlexWatcher Discover BC GZIP GZIP 2 Man Squid BO 1.5X 75X BO 1.15X 17X IV 1.05X N/A BO 1.80X 65X ML 2.50X N/A Execution time normalized to sequential thread performance FlexWatcher overheads were estimated on the simulator Discover overheads were estimated on a Sun T1000 server 35