How TokuDB Fractal Tree™ Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.


6.172 How Fractal Trees Work 1 How TokuDB Fractal Tree™ Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT 6.172 Performance Engineering, 18 November 2010.

6.172 How Fractal Trees Work 2 My Background. I'm an MIT alum: MIT degrees = 2 × S.B. + S.M. + Ph.D. I was a principal architect of the Connection Machine CM-5 supercomputer at Thinking Machines. I was an Assistant Professor at Yale. I worked at Akamai on network mapping and billing. I am research faculty in the SuperTech group, working with Charles.

6.172 How Fractal Trees Work 3 Tokutek A few years ago I started collaborating with Michael Bender and Martin Farach-Colton on how to store data on disk to achieve high performance. We started Tokutek to commercialize the research.

6.172 How Fractal Trees Work 4 I/O is a Big Bottleneck. Systems include sensors and disk storage, and want to perform queries on recent data: millions of data elements arrive per second, and recently arrived data is queried using indexes.

6.172 How Fractal Trees Work 5 The Data Indexing Problem. Data arrives in one order (say, sorted by the time of the observation). Data is queried in another order (say, by URL or location). Millions of data elements arrive per second; recently arrived data is queried using indexes.

6.172 How Fractal Trees Work 6 Why Not Simply Sort? This is what data warehouses do: take data sorted by time, and sort it again by URL. The problem is that you must wait to sort the data before querying it: typically an overnight delay. The system must maintain data in (effectively) several sorted orders. This problem is called maintaining indexes.

6.172 How Fractal Trees Work 7 B-Trees are Everywhere B-Trees show up in database indexes (such as MyISAM and InnoDB), file systems (such as XFS), and many other storage systems.

6.172 How Fractal Trees Work 8 B-Trees are Fast at Sequential Inserts. Sequential insertions all land in the same leaf node, whose root-to-leaf path is in memory: one disk I/O per leaf (which contains many rows), and the disk I/O is sequential. Performance is limited by disk bandwidth.

6.172 How Fractal Trees Work 9 B-Trees are Slow for High-Entropy Inserts. Most nodes are not in main memory, so most insertions require a random disk I/O. Performance is limited by disk head movement: only 100s of inserts/s/disk (about 0.2% of disk bandwidth).

6.172 How Fractal Trees Work 10 New B-Trees Run Fast Range Queries. In newly created B-trees, the leaf nodes are often laid out sequentially on disk, so a range scan can get near 100% of disk bandwidth: about 100 MB/s per disk.

6.172 How Fractal Trees Work 11 Aged B-Trees Run Slow Range Queries. In aged trees, the leaf blocks end up scattered over disk, so a range scan must seek from leaf to leaf. For 16 KB nodes, that yields as little as 1.6% of disk bandwidth: about 1.6 MB/s per disk.

6.172 How Fractal Trees Work 12 Append-to-file Beats B-Trees at Insertions. Here's a data structure that is very fast for insertions: write each new key to the end of a file (e.g., 5 4 2 7 9 4, with the next key written after the last). Pros: achieves disk bandwidth even for random keys. Cons: looking up anything requires a table scan.
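The append-to-file structure can be sketched in a few lines of Python (a toy model, not TokuDB code; the `AppendLog` name is made up for illustration):

```python
class AppendLog:
    """Toy append-to-file index: O(1) appends, O(N) lookups."""

    def __init__(self):
        self.rows = []  # stands in for the on-disk file

    def insert(self, key, value):
        # Sequential write to the end of the "file": full disk bandwidth,
        # even when the keys arrive in random order.
        self.rows.append((key, value))

    def lookup(self, key):
        # Table scan: O(N/B) block transfers -- the "horrible" lookup cost.
        for k, v in self.rows:
            if k == key:
                return v
        return None

log = AppendLog()
for k in [5, 4, 2, 7, 9]:
    log.insert(k, str(k))
print(log.lookup(7))   # "7", found only after scanning earlier rows
```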

6.172 How Fractal Trees Work 14 A Performance Tradeoff?

Structure      Inserts     Point Queries   Range Queries
B-Tree         Horrible    Good            Good (young)
Append         Wonderful   Horrible        Horrible
Fractal Tree   Good        Good            Good

B-trees are good at lookup, but bad at insert. Append-to-file is good at insert, but bad at lookup. Is there a data structure that is about as good as a B-tree for lookup, but has insertion performance closer to append? Yes, Fractal Trees!

6.172 How Fractal Trees Work 16 An Algorithmic Performance Model. To analyze performance we use the Disk-Access Machine (DAM) model [Aggarwal, Vitter '88]: two levels of memory, and two parameters, block size B and memory size M. The game: minimize the number of block transfers; don't worry about CPU cycles. Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/fairuse.
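DAM-model accounting can be illustrated with a toy cost calculator (a sketch of mine, not from the slides; the helper names are invented):

```python
import math

# Toy DAM-model cost accounting: only block transfers count; CPU work is free.

def scan_cost(N, B):
    """Block transfers to read N contiguous elements: ceil(N/B)."""
    return -(-N // B)  # ceiling division

def btree_point_query_cost(N, B):
    """Block transfers for a B-tree root-to-leaf walk: O(log_B N)."""
    return math.ceil(math.log(N, B))

print(scan_cost(10**6, 8192))               # 123 block transfers
print(btree_point_query_cost(10**6, 8192))  # 2 block transfers
```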

6.172 How Fractal Trees Work 17 Theoretical Results.

Structure      Insert                  Point Query
B-Tree         O(log N / log B)        O(log N / log B)
Append         O(1/B)                  O(N/B)
Fractal Tree   O(log N / (ε B^(1-ε)))  O(log N / (ε log B))

6.172 How Fractal Trees Work 18 Example of Insertion Cost. 1 billion 128-byte rows: N = 2^30, so log N = 30. A 1 MB block holds 8192 rows: B = 8192, so log B = 13. B-Tree: O(log N / log B) = O(30/13) ≈ 3. Fractal Tree: O(log N / B) = O(30/8192) ≈ 0.004. Fractal Trees use ≪ 1 disk I/O per insertion.
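The arithmetic above can be checked directly (a quick sketch; the variable names are mine):

```python
import math

N = 2**30   # 1 billion 128-byte rows
B = 8192    # rows per 1 MB block

log_N = math.log2(N)   # 30
log_B = math.log2(B)   # 13

btree_cost = log_N / log_B   # O(log N / log B) block I/Os per insert
fractal_cost = log_N / B     # O(log N / B) block I/Os per insert

print(round(btree_cost, 2))    # 2.31 -- a few I/Os per insert
print(round(fractal_cost, 4))  # 0.0037 -- far less than one I/O
```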

6.172 How Fractal Trees Work 19 A Simplified Fractal Tree. Keep log N sorted arrays, one array for each power of two; each array is completely full or empty. For example, with 10 elements: a full 2-array [5, 10] and a full 8-array [3, 6, 8, 12, 17, 23, 26, 30].

6.172 How Fractal Trees Work 20 Example (4 elements). If there are 4 elements in our fractal tree, the structure looks like this: only the 4-array is full, holding [12, 17, 23, 30].

6.172 How Fractal Trees Work 21 Example (10 elements). If there are 10 elements in our fractal tree, the structure might look like this: 2-array [5, 10] and 8-array [3, 6, 8, 12, 17, 23, 26, 30]. But there is some freedom. Each array is full or empty, and 10 = 2 + 8, so the 2-array and the 8-array must be full. However, which elements go where isn't completely specified.
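Which arrays are full follows the binary representation of the element count: the 2^k-array is full exactly when bit k of N is set. A small sketch (the helper name is mine):

```python
def full_levels(n):
    """Return the sizes of the full arrays when n elements are stored.
    The array of size 2**k is full iff bit k of n is set."""
    return [1 << k for k in range(n.bit_length()) if (n >> k) & 1]

print(full_levels(4))    # [4]    -- matches the 4-element example
print(full_levels(10))   # [2, 8] -- the 2-array and 8-array are full
```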

6.172 How Fractal Trees Work 22 Searching in a Simplified Fractal Tree. Idea: perform a binary search in each array. Pros: it works, and it's faster than a table scan. Cons: it's slower than a B-tree, at O(log² N) block transfers. Let's put search aside, and consider insert.
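The binary-search-per-array idea can be sketched as follows (a toy model, not TokuDB code; here `levels` holds the arrays smallest-first, with empty lists for empty arrays):

```python
import bisect

def search(levels, key):
    """Binary-search every non-empty level.
    log N levels x O(log N) per binary search = O(log^2 N) overall."""
    for arr in levels:
        i = bisect.bisect_left(arr, key)
        if i < len(arr) and arr[i] == key:
            return True
    return False

# The 10-element example: 1-array and 4-array empty, 2-array and 8-array full.
levels = [[], [5, 10], [], [3, 6, 8, 12, 17, 23, 26, 30]]
print(search(levels, 12))   # True
print(search(levels, 4))    # False
```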

6.172 How Fractal Trees Work 23 Inserting in a Simplified Fractal Tree. Add another array of each size for temporary storage. At the beginning of each step, the temporary arrays are empty.

6.172 How Fractal Trees Work 24 Insert 15. To insert 15, there is only one place to put it: the empty 1-array.

6.172 How Fractal Trees Work 25 Insert 7. To insert 7, there is no space in the 1-array, so put it in the temporary 1-array. Then merge the two 1-arrays ([15] and [7]) to make a new 2-array: [7, 15].

6.172 How Fractal Trees Work 26 Not done inserting 7. The 2-array [5, 10] is already full, so we must merge the two 2-arrays to make a 4-array: [5, 7, 10, 15].

6.172 How Fractal Trees Work 27 An Insert Can Cause Many Merges. Starting from full arrays [9], [5, 10], [2, 18, 33, 40], and [3, 6, 8, 12, 17, 23, 26, 30], inserting 31 cascades through every level: [9] and [31] merge into [9, 31]; merging with [5, 10] gives [5, 9, 10, 31]; merging with [2, 18, 33, 40] gives [2, 5, 9, 10, 18, 31, 33, 40]; and merging with the 8-array gives the 16-array [2, 3, 5, 6, 8, 9, 10, 12, 17, 18, 23, 26, 30, 31, 33, 40].
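The cascading merge behaves like binary addition with a carry. A minimal sketch (not TokuDB code), replayed on the earlier 10-element example with inserts of 15 and then 7:

```python
from heapq import merge

def insert(levels, key):
    """Insert by carrying a merged run up the levels, like binary addition:
    a carry arriving at a full 2**k-array merges into a 2**(k+1)-run."""
    carry = [key]
    for k in range(len(levels)):
        if not levels[k]:
            levels[k] = carry          # found an empty slot: done
            return
        carry = list(merge(levels[k], carry))  # sequential, I/O-efficient merge
        levels[k] = []
    levels.append(carry)               # all levels were full: grow by one level

levels = [[], [5, 10], [], [3, 6, 8, 12, 17, 23, 26, 30]]
insert(levels, 15)   # fills the empty 1-array
insert(levels, 7)    # merges [15]+[7], then [5,10]+[7,15] into the 4-array
print(levels)        # [[], [], [5, 7, 10, 15], [3, 6, 8, 12, 17, 23, 26, 30]]
```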

6.172 How Fractal Trees Work 28 Analysis of Insertion into Simplified Fractal Tree. The cost to merge two arrays of size X is O(X/B) block I/Os, so merging is very I/O efficient. The cost per element merged is O(1/B), since O(X) elements are merged. Each element is merged at most O(log N) times, so the average insert cost is O(log N / B).
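The O(log N) merges-per-element bound can be checked empirically (an instrumentation sketch of mine; the merge schedule depends only on the number of inserts, not on key order):

```python
import random
from heapq import merge

def insert_counting(levels, key, counter):
    """Same cascading insert as before, but count elements touched by merges."""
    carry = [key]
    for k in range(len(levels)):
        if not levels[k]:
            levels[k] = carry
            return
        counter[0] += len(levels[k]) + len(carry)  # elements moved by this merge
        carry = list(merge(levels[k], carry))
        levels[k] = []
    levels.append(carry)

N = 1024
levels, moved = [], [0]
keys = list(range(N))
random.shuffle(keys)
for key in keys:
    insert_counting(levels, key, moved)

# Each element is merged at most ~log2(N) times, so total work is O(N log N).
print(moved[0] <= N * N.bit_length())   # True
```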

6.172 How Fractal Trees Work 29 Improving Worst-Case Insertion. Although the average cost of a merge is low, occasionally we merge a lot of stuff. Idea: a separate thread merges arrays, so an insert returns quickly. Lemma: as long as we merge Ω(log N) elements for every insertion, the merge thread won't fall behind.

6.172 How Fractal Trees Work 30 Speeding up Search. At O(log² N), search is too expensive. Now let's shave a factor of log N. The idea: having searched an array for a row, we know where that row would belong in that array, and we can gain information about where the row belongs in the next array.

6.172 How Fractal Trees Work 31 Forward Pointers. Each element gets a forward pointer to where that element goes in the next array, using Fractional Cascading [Chazelle, Guibas 1986]. If you are careful, you can arrange for forward pointers to land frequently (separated by at most a constant). Search becomes O(log N) levels, each looking at a constant number of elements, for O(log N) I/Os.

6.172 How Fractal Trees Work 32 Industrial-Grade Fractal Trees. A real implementation, like TokuDB, must deal with: variable-sized rows; deletions as well as insertions; transactions, logging, and ACID-compliant crash recovery; further optimization of sequential inserts; better search cost (O(log_B N), not O(log² N)); compression; and multithreading.

6.172 How Fractal Trees Work 33 iibench Insert Benchmark iibench was developed by us and Mark Callaghan to measure insert performance. Percona took these measurements about a year ago.

6.172 How Fractal Trees Work 34 iibench on SSD TokuDB on rotating disk beats InnoDB on SSD.

6.172 How Fractal Trees Work 35 Disk Size and Price Technology Trends. SSD is getting cheaper, but rotating disk is getting cheaper faster. Seagate indicates that 67TB drives will be here in 2017. Moore's law for silicon lithography is slower over the next decade than Moore's law for rotating disks. Conclusion: big data stored on disk isn't going away any time soon. Fractal Tree indexes are good on disk. One cannot simply keep indexes in main memory; one must use disk efficiently.

6.172 How Fractal Trees Work 36 Speed Trends Bandwidth off a rotating disk will hit about 500MB/s at 67TB. Seek time will not change much. Conclusion: Scaling with bandwidth is good. Scaling with seek time is bad. Fractal Tree indexes scale with bandwidth. Unlike B-trees, Fractal Tree indexes can consume many CPU cycles.

6.172 How Fractal Trees Work 37 Power Trends Big disks are much more power efficient per byte stored than little disks. Making good use of disk bandwidth offers further power savings. Fractal Tree indexes can use 1/100th the power of B-trees.

6.172 How Fractal Trees Work 38 CPU Trends CPU power will grow dramatically inside servers over the next few years. 100-core machines are around the corner. 1000-core machines are on the horizon. Memory bandwidth will also increase. I/O bus bandwidth will also grow. Conclusion: Scale-up machines will be impressive. Fractal Tree indexes will make good use of cores.

6.172 How Fractal Trees Work 39 The Future Fractal Tree indexes dominate B-trees theoretically. Fractal Tree indexes ride the right technology trends. In the future, all storage systems will use Fractal Tree indexes.

MIT OpenCourseWare http://ocw.mit.edu 6.172 Performance Engineering of Software Systems Fall 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.