
Database Workload
+ Low throughput (0.8 IPC on an 8-wide superscalar, 1/4 of SPEC)
+ Naturally threaded (and widely used) application
- Already high cache miss rates on a single-threaded machine (destructive interference could be a problem)
Key question: Can SMT's simultaneous multi-thread instruction issue hide the latencies from additional misses in these already memory-intensive databases?

Workload & Methodology
- On-line transaction processing (OLTP): updating account balances for a bank; 16 server threads
- Decision support systems (DSS): data warehousing and data mining to answer business questions
- Trace-driven simulation: ATOM-generated instruction traces of Oracle 7.3.2 on an 8-context, 8-wide SMT simulator

OLTP Characterization
Memory behavior for OLTP (1 context, 16 server processes):

Memory region                          L1 cache miss rate (128KB caches)   Memory footprint
Instruction text                       13.7%                               556 KB
Program global area (PGA) (private)     7.4%                               1.3 MB
Buffer cache                            6.8%                               9.3 MB
Metadata                               12.9%                               26.5 MB

High miss rates and large memory footprints.

Shared Resources are Critical
- Almost all SMT hardware resources are dynamically shared by all executing threads
- Benefit: better hardware resource utilization (SMT outperformed a superscalar by 2-3x and single-chip MPs by 52%)
- Cost: potential inter-thread contention; shared hardware structures hold the working sets of multiple threads, increasing some types of conflicts
- Managing the shared resources effectively may avoid these conflicts

L2 Cache Interference 1 9.0 8.0 7.0 6.0 5.0 4.0 3.0 2.0 1.0 OLTP L2 cache miss rate (global) 1 2 4 8 Number of contexts 1 9.0 8.0 7.0 6.0 5.0 4.0 3.0 2.0 1.0 DSS L2 cache miss rate (global) 1 2 4 8 Number of contexts Interthread conflict misses

Critical Working Sets
- The total footprint is too big, but how much of it do we really need? Some data is more important than others.
- Skewed reference behavior: a majority of references go to a minority of memory blocks. For example, 87% of OLTP instruction references are to 31% of the instruction footprint (98% to 6KB for DSS), and 41% of OLTP metadata references are to 26KB.
- Commercial workloads have small critical working sets: even with multiple threads, the performance-critical working sets might fit in the caches.
- Multithreading doesn't have to cause more misses.

Cache Conflicts
[Figure: breakdown of L2 cache references (0-100%) and L1 cache references (0-20%) for OLTP and DSS into misses on metadata, buffer cache, PGA, and instructions]
- L1 and L2 misses are dominated by PGA references
- Conflict misses can be avoided by page mapping and per-thread address offsetting

Page Mapping Alternatives
Map virtual pages to physical page frames. Example reference stream (virtual pages 0-7): 0 4 0 2 4 5 2 3 0 1 7 6

Page coloring:
Reference stream: 0 4 0 2 4 5 2 3 0 1 7 6
Hits/misses:      M M M M M M H M M M M M
Page coloring exploits spatial locality.

Bin hopping (frames assigned sequentially at first touch):
Reference stream: 0 4 0 2 4 5 2 3 0 1 7 6
Hits/misses:      M M H M H M H M M M M M
Bin hopping exploits temporal locality.
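The two page-mapping policies above can be sketched as a small simulation. This is a minimal sketch, not the paper's simulator: the cache geometry (a direct-mapped cache holding four page frames) is an assumption chosen to reproduce the slide's hit/miss strings, and the helper names are hypothetical.

```python
REFS = [0, 4, 0, 2, 4, 5, 2, 3, 0, 1, 7, 6]  # virtual page reference stream from the slide
NUM_SETS = 4  # assumed: direct-mapped cache with 4 page frames

def simulate(set_of):
    """Return the hit/miss string for a direct-mapped cache,
    given a function mapping a virtual page to its cache set."""
    cache = {}
    out = []
    for page in REFS:
        s = set_of(page)
        if cache.get(s) == page:
            out.append("H")
        else:
            out.append("M")
            cache[s] = page  # direct-mapped: new page evicts the old one
    return "".join(out)

# Page coloring: the physical frame keeps the low bits of the virtual page
# number, so pages 0 and 4 repeatedly collide in set 0 (spatial locality).
coloring = simulate(lambda page: page % NUM_SETS)

# Bin hopping: frames are handed out sequentially at first touch, so pages
# touched close together in time land in different sets (temporal locality).
frame_of = {}
next_frame = 0
def bin_hop(page):
    global next_frame
    if page not in frame_of:
        frame_of[page] = next_frame
        next_frame += 1
    return frame_of[page] % NUM_SETS

hopping = simulate(bin_hop)

print(coloring)  # MMMMMMHMMMMM (matches the slide)
print(hopping)   # MMHMHMHMMMMM (matches the slide)
```

Under these assumed parameters, page coloring takes 11 misses while bin hopping takes 9, matching the slide's streams.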

Page Mapping Results
[Figure: global L2 cache miss rate for OLTP and DSS at 1, 2, 4, and 8 contexts, comparing bin hopping, page coloring, and page coloring with seed]

Application-level Offsetting
[Figure: virtual address spaces of thread 0 and thread 1 before and after offsetting; before, both PGAs start at 0x100000000; after, thread 1's PGA starts at 0x100002000]
- The base of each thread's PGA is at the same virtual address, which causes inter-thread conflicts in the (virtually-indexed) L1 cache
- Per-thread address offsets (thread id * 8KB) can avoid the interference
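The offsetting idea can be illustrated by computing where each thread's PGA base lands in a virtually-indexed, direct-mapped L1 cache. The cache geometry here (128KB with 64B blocks) is an assumption for illustration, matching only the 128KB cache size mentioned earlier; the slide does not give the block size.

```python
CACHE_BYTES = 128 * 1024   # 128KB L1, as in the characterization slide
BLOCK_BYTES = 64           # assumed block size
NUM_SETS = CACHE_BYTES // BLOCK_BYTES  # 2048 sets, direct-mapped

PGA_BASE = 0x100000000     # same virtual base in every thread (from the slide)
OFFSET = 8 * 1024          # per-thread offset of thread_id * 8KB

def l1_set(addr):
    """Set index in a virtually-indexed, direct-mapped L1 cache."""
    return (addr // BLOCK_BYTES) % NUM_SETS

for tid in range(8):
    without = l1_set(PGA_BASE)                 # every thread indexes set 0
    with_off = l1_set(PGA_BASE + tid * OFFSET) # threads spread across sets
    print(tid, without, with_off)
```

Without offsetting, all eight contexts index the same L1 set for their PGA base and evict each other; with the 8KB-per-thread offset, each context's base maps to a distinct set (0, 128, 256, ...), removing the conflict.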

Offsetting Results
[Figure: L1 data cache miss rate (%) for OLTP and DSS at 1, 2, 4, and 8 contexts, comparing bin hopping with and without offsetting]

SMT Performance
With bin hopping, application-level offsetting, and L1 I-cache thread-sharing:
[Figure: instructions per cycle for OLTP and DSS at 1, 2, 4, and 8 contexts]

Summary
- SMT gets good speedups on commercial database workloads: 3x for debit/credit systems, 1.5x for big search problems
- Limits inter-thread data interference to superscalar levels via virtual-to-physical page mapping policies and address offsetting for thread-private data
- Exploits inter-thread instruction sharing (35% reduction in instruction cache miss rates)