Performance Issues and Query Optimization in Monet

Performance Issues and Query Optimization in Monet
Stefan Manegold (Stefan.Manegold@cwi.nl)

1  Contents

   Modern Computer Architecture: CPU & Memory System
   Consequences for DBMS
   - Data structures: vertical decomposition
   - Algorithms: tune random memory access
   - Implementation techniques: avoid CPU stalls
   Dissecting CPU & Memory Optimization Effects
   - Monet experiments (focus: partitioned hash-join)
   - Gain (or lose) an order of magnitude in performance
   Query Optimization in Monet
   Conclusion

2  Trends in DRAM and CPU Speed

   [figure: memory latency (ns), memory bandwidth (MB/s), and CPU clock
   speed (Hz) plotted per year, 1986-2000; CPU clock speed improves by a
   far larger percentage per year than memory latency]

   CPU parallelism:

   year  type          execution units
   1988  classic       simple
   1992  scalar        single pipeline
   1996  super-scalar  multiple pipelines
   1999  Athlon        9 pipelines
   2000  Pentium 4     12 pipelines (?)
         Itanium       16 pipelines

3  Modern Computer Architecture: Exploiting CPU Parallelism

   Keeping all pipelines busy requires
   - independent operations
   - simple/predictable code

   But there are...
   - conditional branches (if-then-else)
   - (dereferenced) function calls
   - late binding in C++
   ... and then compilers fail.

   Good: scientific code (matrix computation)
   Bad:  (generic) DBMS code

   [figure: operator sequence A, B, C, D, E, F issued out of order to
   fill the pipelines]
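
   To make the contrast concrete, here is a small C sketch (invented
   names, not Monet code): the first variant mimics generic DBMS code,
   with a late-bound predicate call and a data-dependent branch per
   tuple; the second is the kind of simple, predictable loop that keeps
   the execution pipelines filled.

       #include <stddef.h>

       typedef int (*pred_fn)(int);           /* late-bound predicate */

       /* generic: one indirect call plus one unpredictable branch per tuple */
       size_t select_generic(const int *col, size_t n, pred_fn pred, int *out)
       {
           size_t k = 0;
           for (size_t i = 0; i < n; i++)
               if (pred(col[i]))
                   out[k++] = col[i];
           return k;
       }

       /* tight: no calls, no data-dependent branch in the inner loop
        * (out must have room for n entries) */
       size_t select_tight(const int *col, size_t n, int lo, int hi, int *out)
       {
           size_t k = 0;
           for (size_t i = 0; i < n; i++) {
               out[k] = col[i];                      /* always store the candidate */
               k += (col[i] >= lo) & (col[i] < hi);  /* advance only on a hit      */
           }
           return k;
       }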

4  Modern Computer Architecture: Hierarchical Memory System

   [figure: CPU with on-chip/on-die L1 and L2 caches in front of main
   memory; size grows with each level down, while latency grows and
   bandwidth drops]

   Caches reduce memory latency only if the requested data is found in
   the cache. Otherwise, the CPU stalls for up to 400 ns on each cache
   miss.
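
   The effect is easy to reproduce with a toy C program (an illustration,
   not from the slides): the same array is summed once in sequential
   order and once in random order; as soon as the data no longer fits in
   the caches, the random order runs at main-memory latency.

       #include <stdio.h>
       #include <stdlib.h>
       #include <time.h>

       #define N (1 << 24)                 /* 16M ints = 64 MB, well beyond L2 */

       static double run(const int *a, const int *idx, long n)
       {
           clock_t t0 = clock();
           long sum = 0;
           for (long i = 0; i < n; i++)
               sum += a[idx[i]];
           volatile long sink = sum; (void)sink;   /* keep the loop alive */
           return (double)(clock() - t0) / CLOCKS_PER_SEC;
       }

       int main(void)
       {
           int *a   = malloc(N * sizeof *a);
           int *seq = malloc(N * sizeof *seq);
           int *rnd = malloc(N * sizeof *rnd);
           for (long i = 0; i < N; i++) { a[i] = (int)i; seq[i] = (int)i; rnd[i] = (int)i; }
           for (long i = N - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
               long j = rand() % (i + 1);
               int t = rnd[i]; rnd[i] = rnd[j]; rnd[j] = t;
           }
           printf("sequential: %.2f s   random: %.2f s\n",
                  run(a, seq, N), run(a, rnd, N));
           return 0;
       }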

5  Consequences for DBMS

   Goal                              Optimize
   Use cache lines fully             Data structures
   Prevent cache misses              Memory access / algorithms
   Prevent CPU stalls                Implementation techniques
   Exploit CPU-inherent parallelism  Implementation techniques

6  Data Structures: Vertical Decomposition in Monet

   Store vertical fragments instead of wide relations.

   Wide relations (A_1, A_2, A_3, ..., A_n) waste bandwidth: a scan of
   one requested attribute drags whole cache lines of unneeded
   attributes along.

   Monet stores each attribute in its own vertical fragment, so a scan
   uses the complete bandwidth of every cache line it fetches.
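
   A minimal C sketch of the layout difference (illustrative types and
   names, not the actual BAT implementation; in Monet a relation with n
   attributes becomes n binary tables of (oid, value) pairs):

       #include <stddef.h>

       typedef unsigned int oid;

       /* row store: one wide record per tuple; a scan of "price" pulls
        * all other attributes through the cache as well */
       struct order_row {
           oid  id;
           int  price;
           int  quantity;
           char status[8];
           /* ... further attributes ... */
       };

       /* Monet-style vertical fragment: one (oid, value) column */
       struct bat_int {
           oid   *head;        /* object ids       */
           int   *tail;        /* attribute values */
           size_t count;
       };

       long sum_price(const struct bat_int *price)
       {
           long s = 0;
           for (size_t i = 0; i < price->count; i++)
               s += price->tail[i];   /* sequential, fully used cache lines */
           return s;
       }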

7  Algorithms: Partitioned Joins

   Cluster both input relations:
   - create clusters that fit in the memory cache
   - restrict random memory access to the smallest cache
   - avoid cache capacity misses
   Then join matching clusters.

   [figure: example relations L and R, shown non-clustered and clustered
   on the join key, with matching clusters aligned]
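
   A hedged C sketch of the cluster-wise join phase (layout and names
   invented; a real implementation builds a small per-cluster hash table
   instead of the quadratic loop shown here):

       #include <stddef.h>

       #define BITS   8
       #define NCLUST (1 << BITS)     /* clusters selected by key & (NCLUST-1) */

       struct cluster { int *keys; size_t n; };

       /* join one pair of clusters; both are small enough to stay
        * cache-resident, so the repeated probes cause no memory stalls */
       static long join_clusters(const struct cluster *l, const struct cluster *r)
       {
           long matches = 0;
           for (size_t i = 0; i < l->n; i++)
               for (size_t j = 0; j < r->n; j++)
                   matches += (l->keys[i] == r->keys[j]);
           return matches;
       }

       long partitioned_join(const struct cluster L[NCLUST],
                             const struct cluster R[NCLUST])
       {
           long matches = 0;
           for (int c = 0; c < NCLUST; c++)   /* only matching clusters can join */
               matches += join_clusters(&L[c], &R[c]);
           return matches;
       }

   Because each cluster of R fits in the cache, the random access of the
   probe phase stays inside the cache instead of going to main memory.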

8  Algorithms: Straightforward Clustering

   [figure: input scattered into the clustered output in a single pass]

   Problem: if the number of clusters exceeds the number of (active)
   cache lines, the cache thrashes.

   Solution: multi-pass clustering.

9  Algorithms: Multi-Pass Clustering

   [figure: input clustered in two passes (pass 1, pass 2), with the
   number of active cache lines limited in each pass]

   - Limit the number of clusters per pass
   - Avoid cache thrashing
   - Trade memory cost for CPU cost
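
   A sketch of the idea in C (illustrative, not Monet's radix-cluster
   code): a 12-bit clustering is done in two 6-bit passes, so no pass
   ever writes to more than 64 active output regions at a time.

       #include <stddef.h>

       /* one pass: partition src[0..n) on bits [shift, shift+bits) of the
        * key; start[0..nclust] receives the start offset of each cluster.
        * pos[] is sized for at most 6 bits per pass in this sketch. */
       static void cluster_pass(const int *src, int *dst, size_t n,
                                int shift, int bits, size_t *start)
       {
           size_t nclust = (size_t)1 << bits;
           size_t pos[64];

           for (size_t c = 0; c <= nclust; c++) start[c] = 0;
           for (size_t i = 0; i < n; i++)                 /* scan 1: histogram */
               start[((src[i] >> shift) & (nclust - 1)) + 1]++;
           for (size_t c = 0; c < nclust; c++)            /* prefix sums       */
               start[c + 1] += start[c];
           for (size_t c = 0; c < nclust; c++) pos[c] = start[c];
           for (size_t i = 0; i < n; i++)                 /* scan 2: scatter   */
               dst[pos[(src[i] >> shift) & (nclust - 1)]++] = src[i];
       }

       /* pass 1 creates 64 coarse clusters; pass 2 refines each of them
        * into 64 sub-clusters, yielding 4096 clusters in total */
       void radix_cluster_2pass(int *data, int *tmp, size_t n)
       {
           size_t start[65], sub[65];

           cluster_pass(data, tmp, n, 0, 6, start);
           for (int c = 0; c < 64; c++)
               cluster_pass(tmp + start[c], data + start[c],
                            start[c + 1] - start[c], 6, 6, sub);
       }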

10  Dissecting CPU- & Memory-Optimization

    Platforms:
    - SGI Origin2000 (MIPS R10000, 250 MHz)
    - Sun Ultra (UltraSPARC, 200 MHz)
    - Intel PC (Pentium III, 450 MHz)
    - AMD PC (Athlon, 600 MHz)

    Use hardware event counters.
    Break down execution times into
    - memory access (cache misses)
    - CPU stalls (on the Intel PC)
    - divisions
    - real CPU work

11  Memory-Optimization: Multi-Pass Clustering

    [figure: clustering time, broken down into memory access and CPU
    stalls, vs. number of clusters, for 1 pass and for P passes (P=1,
    P=2), on the SGI Origin2000 and the Intel PC]

12  Memory-Optimization: Partitioned Hash-Join

    [figure: partitioned hash-join time, broken down into memory access,
    CPU stalls, and divisions, vs. cluster size [bytes], on the SGI
    Origin2000 and the Intel PC]

13  CPU-Optimization

    DBMS techniques for CPU optimization:
    - column-at-a-time Monet operators have a fixed layout and few types
    - join([oid,T],[T,oid]) -> [oid,oid] has just one degree of freedom (T)
    - bulk type-switch technique: separate routine for each T

        join( [oid,T], [T,oid] ) : [oid,oid]
        {
            switch (T) {
                int:     return join_int( [oid,int], [int,oid] );
                string:  return join_string( [oid,string], [string,oid] );
                default: return join_ADT( [oid,T], [T,oid], ADT );
            }
        }

    The type-specific join replaces all function calls by inlined code:
    - less overhead
    - code more predictable for the CPU

    Also: replace expensive divisions by bit operations.
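
    For example, the bucket/cluster number computation in the inner loop
    can drop its division or modulo as soon as the number of clusters is
    forced to be a power of two (illustrative C, not Monet source):

        #include <stddef.h>

        /* assumes unsigned keys and a power-of-two cluster count, so the
         * bitwise AND is equivalent to the modulo */
        void assign_clusters(const unsigned *key, unsigned *clust,
                             size_t n, unsigned nclust_bits)
        {
            unsigned mask = (1u << nclust_bits) - 1;

            for (size_t i = 0; i < n; i++) {
                /* clust[i] = key[i] % (1u << nclust_bits);  costly: tens of
                 * cycles per integer division on CPUs of that era          */
                clust[i] = key[i] & mask;   /* single-cycle bit operation */
            }
        }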

14  CPU-Optimization: Partitioned Hash-Join

    SGI Origin2000: 15 s -> 4 s    Intel PC: 10 s -> 4 s

    [figure: execution time, broken down into memory access, CPU stalls,
    and divisions, vs. cluster size [bytes], for the default and the
    optimized code on both platforms]

15  CPU-Optimization: Multi-Pass Clustering

    SGI Origin2000: 3 s -> 0.75 s    Intel PC: 2.25 s -> 0.75 s

    [figure: execution time, broken down into memory access and CPU
    stalls, vs. number of clusters, for the default and the optimized
    code with P=1, P=2 (and P=3) passes, on both platforms]

16  CPU- & Memory-Optimization: Overall Performance

    Boosting effects: Mc > M & Cm > C

    [figure: execution time vs. cluster size [byte] for default and
    optimized code (simple minimum; 1, 2, 3, 4 clustering passes) on the
    SGI Origin2000 (C = 1.5 s, Cm = 16.4 s, M = 28.6 s, Mc = 34.5 s) and
    the AMD PC (C = 4.3 s, Cm = 8.4 s, M = 18.9 s, Mc = 23.0 s)]

17  Automatic Tuning of Algorithms

    Detailed and accurate main-memory cost models:
    - calibrate CPU costs
    - estimate the number of cache misses
    - memory access cost: misses multiplied by their latencies

    Calibration tool:
    - automatically analyzes the memory system of any computer
    - extracts the number of cache levels, cache sizes, cache line
      sizes, and cache miss latencies
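
    A minimal sketch of the pointer-chasing idea behind such a tool (the
    real calibration tool is considerably more careful about timers,
    prefetching, and TLB effects; the 64-byte line size below is an
    assumption): chase a dependent pointer chain through arrays of
    growing size, and the jumps in time per load reveal cache sizes and
    miss latencies.

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define STRIDE 64                      /* assumed cache-line size */

        static double ns_per_load(size_t bytes, long loads)
        {
            size_t n = bytes / sizeof(void *);
            size_t step = STRIDE / sizeof(void *);
            void **chain = malloc(n * sizeof *chain);

            for (size_t i = 0; i < n; i++)     /* cyclic chain, one hop per line */
                chain[i] = &chain[(i + step) % n];

            void **p = &chain[0];
            clock_t t0 = clock();
            for (long i = 0; i < loads; i++)   /* each load depends on the last */
                p = (void **)*p;
            double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / loads;
            volatile void *sink = p; (void)sink;
            free(chain);
            return ns;
        }

        int main(void)
        {
            for (size_t kb = 4; kb <= 64 * 1024; kb *= 2)
                printf("%7zu KB : %6.1f ns/load\n",
                       kb, ns_per_load(kb * 1024, 10000000L));
            return 0;
        }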

18  Extreme CPU-Optimization: Select

    Exhaustive code expansion on:
    - data types
    - predicate type (x = C, x < C, C < x, C < x < C1)
    - various other properties

    Avoid as many branches/conditionals as possible in the inner loop.
    This results in 12(!) specific routines generated from one template.

    Performance improvement: up to a factor 10(!)

    Lesson learned: even correctly predicted branches hurt.
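
    One way to picture the expansion (a C macro sketch with invented
    names, not the Monet template mechanism): every predicate shape gets
    its own specialized, branch-minimized inner loop, so predicate
    dispatch happens once per column instead of once per tuple.

        #include <stddef.h>

        typedef unsigned int oid;

        /* stamp out one specialized select routine per predicate shape */
        #define DEFINE_SELECT(NAME, COND)                                   \
            static size_t NAME(const int *val, size_t n, int c0, int c1,    \
                               oid *out)                                    \
            {                                                               \
                size_t k = 0;                                               \
                (void)c1;                                                   \
                for (size_t i = 0; i < n; i++) {                            \
                    out[k] = (oid)i;        /* candidate oid          */    \
                    k += (COND);            /* advance only on a hit  */    \
                }                                                           \
                return k;                                                   \
            }

        DEFINE_SELECT(select_eq,    val[i] == c0)                 /* x = C      */
        DEFINE_SELECT(select_lt,    val[i] <  c0)                 /* x < C      */
        DEFINE_SELECT(select_gt,    val[i] >  c0)                 /* C < x      */
        DEFINE_SELECT(select_range, (val[i] > c0) & (val[i] < c1))/* C < x < C1 */

    The slide's lesson that even correctly predicted branches hurt is why
    the hit is accumulated arithmetically rather than with an if.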

19  Monet Query Optimizer: System Architecture

    [diagram: multiple users/applications issue multiple MIL streams; a
    MIL stream merger & result dispatcher produces a merged MIL stream
    and returns query results]

    Strategic optimization (Multi-Query Optimizer):
    - elimination of common (sub-)expressions
    - re-use of cached (intermediate) results

    Tactical optimization (Dataflow Graph):
    - pattern rewriting
    - cache / memory management
    - emits the optimized MIL stream

    Operational optimization (Execution Engine, Result Cache):
    - parallelization

    Monet Database System

20  Monet Query Optimizer: Common Sub-expression Elimination

    Avoid redundancy:

      s39 := {sum}(s13.reverse().join(s38), s13.tunique());
      s49 := {sum}(s13.reverse().join(s48), s13.tunique());
      s66 := {sum}(s13.reverse().join(s65), s13.tunique());
      s76 := {count}(s13.reverse(), s13.tunique());

    becomes

      s13r := s13.reverse();
      s13t := s13.tunique();
      s39  := {sum}(s13r.join(s38), s13t);
      s49  := {sum}(s13r.join(s48), s13t);
      s66  := {sum}(s13r.join(s65), s13t);
      s76  := {count}(s13r, s13t);

21  Monet Query Optimizer: Heuristic Pattern Rewriting

    Inline the join (also for {sum} & {count}):

      x := {avg}(g.reverse().join(b), e);   =>   x := {avg}(b, g, e);

    Avoid the join:

      x := {count}(b, g, e);                =>   x := {count}(g.reverse(), e);

    Avoid grouping and join:

      x1 := {sum}(b, g, e);                 =>   x1 := {sum}(b, g, e);
      x2 := {count}(g.reverse(), e);        =>   x2 := {count}(g.reverse(), e);
      x3 := {avg}(b, g, e);                 =>   x3 := x1 [/] x2;

22  Monet Query Optimizer

    Parallelization (SMP):
    - schedule independent operations for concurrent execution

    Cache / memory management:
    - re-use intermediate results as soon as possible
    - discard intermediate results (free memory) as soon as possible

    In preparation:
    - cost-based pattern rewriting
    - cost-based cache / memory management
    - horizontal fragmentation / partitioning

23  Experiment: Q1 of the TPC-H Benchmark on the Origin2000

    #CPU   time [ms]   comment
       1       74887   original Monet code
       1        2993   optimized Monet code
       1       12174   optimized query
       2        1434   parallel execution
       3         136
       4        8666
       5        8745
       6        7823
       7        6957
       8        6955

24  Conclusion

    Bad memory access patterns and poor usage of CPU-inherent parallelism
    ruin database performance.
    - Use data structures that exploit the full memory bandwidth.
    - Tune algorithms to achieve optimal memory access.
    - Optimize (simplify) code to efficiently exploit CPU resources.

    The tactical optimization layer allows lazy MIL generation in
    front-ends and applications.

    Monet home page: www.cwi.nl/~monet