STEPS Towards Cache-Resident Transaction Processing


STEPS Towards Cache-Resident Transaction Processing
Stavros Harizopoulos, joint work with Anastassia Ailamaki
VLDB 2004, Carnegie Mellon

OLTP workloads on modern CPUs
[Chart: CPI breakdown (computation, L1-I stalls, L2-I stalls, L2-D stalls, other stalls) vs. L2 cache size, 256KB to 10MB]
[Chart: max on-chip L2/L3 cache and L1-I cache sizes (10KB to 1MB) of server CPUs by year introduced, 1996-2004]
- L1-I stalls account for 25-40% of execution time
- Instruction caches cannot grow
- We need a solution for instruction cache-residency

Steps for cache-resident code
- Eliminate misses for a group of Xactions
- Xactions are assigned to threads
- Multiplex execution at fine granularity
- Reuse instructions in L1-I cache
STEPS: Synchronized Transactions through Explicit Processor Scheduling

Fewer misses & mispredicted branches
[Chart: # of L1-I misses (2K-8K) for index selection, Shore vs. Steps, at 1, 2, 4, 8 concurrent threads]
[Chart: normalized counts for the Payment Xaction (TPC-C): cycles, L1-I misses, branch mispredictions, L1-D misses, Shore vs. Steps]
- Up to 1.4x speedup
- Eliminates 96% of L1-I misses for each additional thread
- Eliminates 64% of mispredicted branches

Outline
- Background & related work
- Basic implementation of Steps
- Microbenchmarks (AthlonXP, SimFlex simulator)
- Applying Steps to OLTP workloads (TPC-C results, Shore on AthlonXP)

Background
- Caches trade size for lookup speed
- L1-I misses are expensive
Example (2-way set-associative L1-I cache): a loop executing code blocks F1-F4 with a conditional call to B( ) incurs capacity misses when the loop's code exceeds the cache size, and conflict misses when blocks map to the same cache set.
[Diagram: CPU with L1-I and L1-D caches above an L2 cache holding the loop's code and data]

Background (continued)
Higher associativity and a larger cache size would absorb the conflict and capacity misses of the previous example, but at a cost: slower access to the L1-I cache and a slower CPU clock.
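The conflict misses in the background example come from the cache's address-to-set mapping. As a toy illustration (a Python sketch with an assumed 8-set geometry, not code from the talk), code blocks whose addresses sit a multiple of the cache's way size apart land in the same set and evict each other even while other sets have spare capacity:

```python
# Toy sketch of a set-associative L1-I cache (assumed geometry: 8 sets;
# 2-way associativity and 64-byte blocks as in the slide's example).
# Blocks mapping to the same set beyond its associativity conflict.

BLOCK = 64          # cache block (line) size in bytes
WAYS = 2            # associativity
SETS = 8            # sets in this toy cache -> 8 * 2 * 64 = 1KB total

def set_index(addr):
    """Set an address maps to: block number modulo number of sets."""
    return (addr // BLOCK) % SETS

class SetAssocCache:
    def __init__(self):
        self.sets = [[] for _ in range(SETS)]  # per-set LRU order, LRU first

    def access(self, addr):
        """Return True on hit; maintain LRU order within the set."""
        s = self.sets[set_index(addr)]
        tag = addr // (BLOCK * SETS)
        if tag in s:
            s.remove(tag)
            s.append(tag)          # move to MRU position
            return True
        s.append(tag)              # fill on miss
        if len(s) > WAYS:
            s.pop(0)               # evict the LRU way
        return False

cache = SetAssocCache()
# Three code blocks that all map to set 0: addresses 0, SETS*BLOCK, 2*SETS*BLOCK.
conflicting = [0, SETS * BLOCK, 2 * SETS * BLOCK]
# Cycling through 3 conflicting blocks in a 2-way set never hits:
misses = sum(not cache.access(a) for a in conflicting * 2)
```

In the slide's loop, if F1-F4 and B( ) happen to map to the same sets they keep evicting each other on every iteration; the code layout optimizations cited under related work attack exactly this placement problem.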

Related work
Database & architecture papers:
- DB workloads are increasingly non-I/O-bound; L2/L3 data misses and L1-I misses dominate
- ORACLE OLTP code working set: 560KB
Hardware & compiler approaches:
- Increase block size, add stream buffer [asplos98]
- Call graph prefetching (for DSS) [tocs03]
- Code layout optimizations [isca01] [..]

Related work: within the DBS
- Data-cache misses (mostly DSS): cache-aware page layout, B-trees, join algorithms; active area [..]
- Instruction-cache misses in DSS: batch processing of tuples [icde01] [sigmod04]
- Instruction-cache misses in OLTP: challenging!

Outline
- Related work
- Basic implementation of Steps
- Microbenchmarks
- Applying Steps to OLTP workloads: TPC-C results

Steps overview
- The DBS assigns Xactions to threads
- Xactions consist of a few basic operators: index select, scan, update, insert, delete, commit
- Steps groups threads per Op
- Within each Op, reuse instructions: I-cache aware context-switching
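The grouping step can be pictured as a per-operator queue of ready threads. A minimal Python sketch (hypothetical thread IDs and structure; the real system wraps Shore's operators):

```python
# Sketch of grouping concurrent transactions by the basic operator each is
# about to run, so one execution team per operator can reuse that
# operator's instructions. Thread IDs and the arrival order are invented.

from collections import defaultdict

xactions = [                     # (thread id, next operator to execute)
    (1, "index_select"), (2, "update"), (3, "index_select"),
    (4, "commit"), (5, "update"), (6, "index_select"),
]

teams = defaultdict(list)
for tid, op in xactions:
    teams[op].append(tid)        # join the execution team for this operator

# Each team is then run with I-cache aware context switching: e.g. the
# "index_select" team multiplexes threads 1, 3 and 6 through one copy of
# the index-select code.
```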

I-cache aware context-switching
BEFORE: thread 1 runs select( ) to completion (its code s1-s7 does not fit in the I-cache, so every block misses); then thread 2 runs select( ) and misses on every block again.
AFTER: the CPU executes a cache-sized chunk of code, then performs a context-switch (CTX); thread 2 hits on the instructions thread 1 just loaded.
[Diagram: instruction cache shared by threads 1 and 2 executing select( ), with hits after each context-switch point]
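The before/after difference is easy to quantify with a toy miss counter. The sketch below (a Python model with assumed sizes and a fully associative LRU cache for simplicity, not the paper's simulator) counts L1-I misses when each thread runs the operator to completion versus when execution is multiplexed at cache-sized granularity:

```python
# Toy model: the operator's code is OP_BLOCKS cache blocks; the L1-I cache
# holds CACHE_BLOCKS of them (fully associative LRU here for simplicity).
# "Before": each thread runs select() to completion.
# "After" (Steps): context-switch after every cache-sized chunk.

OP_BLOCKS = 12      # code footprint of select(), in cache blocks
CACHE_BLOCKS = 8    # toy L1-I capacity: the code does NOT fit
THREADS = 10

def run(schedule):
    """Count misses for a sequence of (thread, code_block) executions."""
    cache, misses = [], 0       # LRU order: least recent first
    for _, block in schedule:
        if block in cache:
            cache.remove(block)
        else:
            misses += 1
            if len(cache) == CACHE_BLOCKS:
                cache.pop(0)    # evict the LRU block
        cache.append(block)     # MRU position
    return misses

# BEFORE: thread-at-a-time; earlier blocks are evicted before any reuse.
before = run([(t, b) for t in range(THREADS) for b in range(OP_BLOCKS)])

# AFTER: run one cache-sized chunk, then switch threads, so the chunk
# loaded by thread 0 is hit by threads 1..N before moving on.
after_sched = []
for chunk_start in range(0, OP_BLOCKS, CACHE_BLOCKS):
    chunk = range(chunk_start, min(chunk_start + CACHE_BLOCKS, OP_BLOCKS))
    for t in range(THREADS):
        after_sched += [(t, b) for b in chunk]
after = run(after_sched)
```

In this model the naive schedule misses on every block for every thread, while the multiplexed schedule pays the operator's footprint once for the whole group — the effect behind the 92-96% miss reductions reported in the microbenchmarks.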

Basic implementation on Shore
Assume (for now):
- Threads interested in the same Op
- Uninterrupted flow (no locks, I/O)
Implementation:
- Fast, small, compatible CTX code: 76 bytes; bypass (for now) the full CTX
- Add CTX calls throughout Op source code
- Use hardware counters (PAPI) on a sample Op
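"Add CTX calls throughout Op source code" means the operator yields to the scheduler at fixed points. A Python sketch of the idea (generators stand in for Shore's user-level threads, `yield` for the 76-byte CTX call; the step names are made up):

```python
# Sketch of CTX calls threaded through an operator. Each "thread" is a
# generator; `yield` marks a context-switch point (CTX), and the scheduler
# round-robins the whole team through one code region before any thread
# moves to the next region.

def index_fetch(tid, trace):
    trace.append((tid, "search_btree"))
    yield                      # CTX: let the other team members run this region
    trace.append((tid, "pin_record"))
    yield                      # CTX
    trace.append((tid, "copy_to_output"))

def steps_schedule(threads):
    """Round-robin a team of generator 'threads' until all finish."""
    while threads:
        nxt = []
        for t in threads:
            try:
                next(t)        # run until the next CTX point
                nxt.append(t)  # still inside the operator; keep it in the team
            except StopIteration:
                pass           # thread finished the operator
        threads = nxt

trace = []
steps_schedule([index_fetch(t, trace) for t in range(3)])
# All threads execute "search_btree" back to back, then "pin_record", etc.,
# so each code region is loaded into the I-cache once per team, not per thread.
```

The fixed round-robin order is what the later design calls fast CTX through fixed scheduling: the scheduler never has to decide whom to run next.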

Outline
- Related work
- Basic implementation of Steps
- Microbenchmarks
- Applying Steps to OLTP workloads: TPC-C results

Microbenchmark setup
- All experiments on index fetch, in-memory index: 45KB footprint
- Fast CTX for both Steps/Shore, warm cache
- AMD AthlonXP, plus simulated IA-32 with SimFlex (vary all cache parameters)
AthlonXP cache parameters: L1 I + D cache size 64KB + 64KB; associativity 2-way; block size 64 bytes; L2 cache size 256KB

L1-I cache misses
[Chart: L1-I cache misses (1K-4K) on AthlonXP, Shore vs. Steps, at 1-10 concurrent threads]
- Steps eliminates 92-96% of misses for additional threads
- All remaining misses are conflict misses (cache is 64KB)

L1-I misses & speedup
[Chart: L1-I miss reduction (40-100%) vs. its upper limit, at 10-80 concurrent threads, AthlonXP]
- Steps achieves max performance for 6-10 threads
- No need for larger thread groups

L1-I misses & speedup (continued)
[Chart: L1-I miss reduction (40-100%) vs. its upper limit, and speedup (1.1-1.4x), at 10-80 concurrent threads, AthlonXP]
- Steps achieves max performance for 6-10 threads
- No need for larger thread groups

Smaller L1-I cache
[Chart: normalized counts (cycles, L1-I misses, L1-D misses, branches, branch mispredictions, missed BTB, instruction stall cycles) for 10 threads, AthlonXP vs. Pentium III; one bar reaches 209%]
- Steps outperforms Shore even on smaller caches (PIII)
- 62-64% fewer mispredicted branches on both CPUs

SimFlex: L1-I misses
[Chart: L1-I cache misses (2K-10K), Shore vs. Steps, at 16KB, 32KB, and 64KB caches, from direct-mapped through 2-way, 4-way, 8-way to fully associative; 10 threads, 64-byte cache block; AthlonXP shown for reference]
- Steps eliminates all capacity misses (16KB, 32KB caches)
- Up to 89% overall miss reduction (upper limit is 90%)

Outline
- Related work
- Basic implementation of Steps
- Microbenchmarks
- Applying Steps to OLTP workloads: TPC-C results

Design goals
- High concurrency on similar Ops
- Cover the full spectrum of Ops
- Correctness & low overhead for: locks, latches, mutexes; disk I/O; exceptions (abort & roll back); housekeeping (deadlock detection, buffer pool)

Overview
1. Thin wrappers per Op to sync Xactions: form execution teams per Op; flexible definition of Op
2. Best-effort scheduling within execution teams: fast CTX through fixed scheduling; threads leave the team on exceptions
3. Repair thread structures at exceptions: modify only the thread package
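Point 2's exception handling can be modeled as team membership per operator. A toy Python sketch (invented thread IDs and operator names; the real exceptions are lock conflicts, I/O waits, or aborts):

```python
# Toy model of best-effort team scheduling: a thread that hits an exception
# leaves its team ("goes astray") and finishes that operator alone, without
# shared I-cache reuse; it regroups with a team at the next operator.

def run_ops(threads, ops, exceptions):
    """exceptions: set of (thread, op) pairs that force a thread astray.
    Returns, per operator, who ran as a team and who ran astray."""
    log = []
    for op in ops:
        team = [t for t in threads if (t, op) not in exceptions]
        strays = [t for t in threads if (t, op) in exceptions]
        log.append((op, team, strays))  # strays rejoin at the next operator
    return log

# Thread 2 blocks on a lock during "update", runs it astray, then rejoins:
log = run_ops([1, 2, 3], ["index_select", "update", "commit"],
              exceptions={(2, "update")})
```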

System design
[Diagram: Steps wrappers around Ops X, Y, Z, each with an execution team; a stray thread leaves Op Z's team and heads to another Op]
- Threads go astray on exceptions; they regroup at the next Op
- Can have execution teams per database table

Outline
- Related work
- Basic implementation of Steps
- Microbenchmarks
- Applying Steps to OLTP workloads: TPC-C results

Experimentation setup
- Shore/Steps: AthlonXP, 2GB RAM, 2 disks
- Shore locking — hierarchy: record, page, table, DB; protocol: 2-phase
- TPC-C: wholesale parts supplier; 10-30 warehouses, 100-300 users
- Increased concurrency through zero think time
- TPC-C workload: in-memory database, lazy commits

One Xaction: Payment
[Chart: normalized counts (cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, branch mispredictions) at 10, 20, 30 warehouses]
- Steps outperforms Shore: 1.4x speedup, 65% fewer L1-I misses, 48% fewer mispredicted branches
- For 10 warehouses: 15 ready threads, 7 threads / team

Mix of four Xactions
[Chart: normalized counts (cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, branch mispredictions) at 10 and 20 warehouses; two bars reach 121% and 125%]
- The Xaction mix reduces average team size (4.3 at 10W)
- Still, Steps has 56% fewer L1-I misses (out of a 77% max)

Summary of results
- Steps can handle full OLTP workloads
- Significant improvements in TPC-C: 65% fewer L1-I misses, 48% fewer mispredicted branches
- Room for improvement: Steps was not tuned for TPC-C; Shore's code yields low concurrency
Steps minimizes both capacity & conflict misses without increasing I-cache size / associativity

Thank you