FUNCTIONALLY OBLIVIOUS (AND SUCCINCT) Edward Kmett

BUILDING BETTER TOOLS Cache-Oblivious Algorithms Succinct Data Structures

RAM MODEL Almost everything you do in Haskell assumes this model. Good for ADTs, but not a realistic model of today's hardware.

IO MODEL CPU + Memory; Disk of size N. Can read/write contiguous blocks of size B. Can hold M/B blocks in working memory. All other operations are free.

B-TREES Occupy O(N/B) blocks worth of space. Update in time O(log_B(N)). Search in O(log_B(N) + a/B), where a is the result set size.

IO MODEL In reality: CPU + Registers → L1 → L2 → L3 → Main Memory → Disk.

IO MODEL CPU + Registers → L1 → L2 → L3 → Main Memory → Disk, each level with its own block size B_i and capacity M_i. Huge numbers of constants to tune. Optimizing for one necessarily sub-optimizes others. Caches grow exponentially in size and slowness.

CACHE-OBLIVIOUS MODEL CPU + Memory of size M; Disk. Can read/write contiguous blocks of size B. Can hold M/B blocks in working memory. All other operations are free. But now you don't get to know M or B! Various refinements exist, e.g. the tall cache assumption.

CACHE-OBLIVIOUS MODEL If your algorithm is asymptotically optimal for an unknown cache with an optimal replacement policy, it is asymptotically optimal for all caches at the same time. You can relax the assumption of optimal replacement and model LRU, k-way set-associative caches, and the like by modest reductions in M.

CACHE-OBLIVIOUS MODEL As caches grow taller and more complex, it becomes harder to tune for them all at the same time; tuning for one provably renders you suboptimal for others. The overhead of this model is largely compensated for by ease of portability and vastly reduced tuning. This model is becoming more and more true over time!

DATA.MAP Built by Daan Leijen. Maintained by Johan Tibell and Milan Straka. Battle tested. Highly optimized. In use since 1998. Built on trees of bounded balance. The de facto benchmark of performance. Designed for the Pointer/RAM model.

DATA.MAP Binary search trees of bounded balance. [diagram: balanced tree with root 2, left child 1, right subtree 4 with children 3 and 5]

DATA.MAP Binary search trees of bounded balance. [diagram: the tree rebalanced around root 4 after further inserts, over keys 1-6]

DATA.MAP
Production:
empty  :: Map k a
insert :: Ord k => k -> a -> Map k a -> Map k a
Consumption:
null   :: Map k a -> Bool
lookup :: Ord k => k -> Map k a -> Maybe a
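For instance, a quick usage sketch of the API above (example mine, not from the slides):

import qualified Data.Map as Map

demo :: (Maybe String, Bool)
demo = (Map.lookup 2 m, Map.null m)   -- (Just "two", False)
  where m = Map.insert 2 "two" (Map.insert 1 "one" Map.empty)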

WHAT I WANT I need a Map that supports very efficient range queries. It also needs to support very efficient writes. It needs to support unboxed data...and I don't want to give up all the conveniences of Haskell.

THE DUMBEST THING THAT CAN WORK Take an array of (key, value) pairs sorted by key and arrange it contiguously in memory. Binary search it. Eventually your search falls entirely within a cache line.

BINARY SEARCH Binary search assuming 0 <= l <= h. Returns h if the predicate is never True over [l..h).

import Data.Bits (unsafeShiftR)

search :: (Int -> Bool) -> Int -> Int -> Int
search p = go
  where
    go l h
      | l == h    = l
      | p m       = go l m
      | otherwise = go (m + 1) h
      where m = l + unsafeShiftR (h - l) 1
{-# INLINE search #-}
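For example, here is one way to use search to find a key's index in a sorted unboxed vector (a hypothetical wrapper, not from the slides):

import qualified Data.Vector.Unboxed as V

lookupSorted :: Int -> V.Vector Int -> Maybe Int
lookupSorted k v
  | i < V.length v && v V.! i == k = Just i
  | otherwise                      = Nothing
  where i = search (\j -> v V.! j >= k) 0 (V.length v)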

OFFSET BINARY SEARCH Pro Tip! Offset binary search assuming 0 <= l <= h. Returns h if the predicate is never True over [l..h).

search :: (Int -> Bool) -> Int -> Int -> Int
search p = go
  where
    go l h
      | l == h    = l
      | p m       = go l m
      | otherwise = go (m + 1) h
      where
        hml = h - l
        m   = l + unsafeShiftR hml 1 + unsafeShiftR hml 6
{-# INLINE search #-}

Avoids thrashing the same lines in k-way set associative caches near the root.

DYNAMIZATION We have a static structure that does what we want. How can we make it updatable? Bentley and Saxe gave us one way in 1980.

BENTLEY-SAXE [5] [2 20 30 40] Now let's insert 7.

BENTLEY-SAXE [7] merges with [5] to form [5 7], giving [5 7] [2 20 30 40].

BENTLEY-SAXE [5 7] [2 20 30 40] Now let's insert 8.

BENTLEY-SAXE [8] [5 7] [2 20 30 40] Next insert causes a cascade of carries! Worst-case insert time is O(N/B). Amortized insert time is O((log N)/B). We computed that oblivious to B.

BENTLEY-SAXE Linked list of our static structure, each a power of 2 in size. The list is sorted strictly monotonically by size; bigger/older structures are later in the list. We need a way to merge query results; here we just take the first.
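A minimal sketch of the scheme, with lists standing in for the contiguous arrays (names are illustrative, not from the talk):

import Data.List (find)
import Control.Applicative ((<|>))

-- Levels are sorted arrays of strictly increasing power-of-two size,
-- newest first. Inserting is a binary increment: equal-sized levels merge.
insertBS :: Ord a => a -> [[a]] -> [[a]]
insertBS x = carry [x]
  where
    carry as []        = [as]
    carry as (bs:rest)
      | length as < length bs = as : bs : rest
      | otherwise             = carry (mergeSorted as bs) rest   -- a "carry"
    mergeSorted as [] = as
    mergeSorted [] bs = bs
    mergeSorted (a:as) (b:bs)
      | a <= b    = a : mergeSorted as (b:bs)
      | otherwise = b : mergeSorted (a:as) bs

-- Merge query results by taking the first (newest) hit, as above.
queryBS :: Eq a => a -> [[a]] -> Maybe a
queryBS x = foldr (\lvl acc -> find (== x) lvl <|> acc) Nothing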

SLOPPY AND DYSFUNCTIONAL Chris Okasaki would not approve! Our analysis assumed linear/ephemeral access. A sufficiently long carry might rebuild the whole thing, but if you went back to the old version and did it again, it'd have to do it all over. You can't earn credits and spend them twice!

AMORTIZATION Given a sequence of n operations a_1, a_2, a_3 .. a_n, what is the running time of the whole sequence? The amortized bounds are valid if, for all k ≤ n: Σ_{i=1}^{k} actual_i ≤ Σ_{i=1}^{k} amortized_i. There are algorithms for which the amortized bound is provably better than the achievable worst-case bound, e.g. Union-Find.

BANKER'S METHOD Assign a price to each operation. Store savings/borrowings in state around the data structure. If no account has any debt, then for all k ≤ n: Σ_{i=1}^{k} actual_i ≤ Σ_{i=1}^{k} amortized_i.

PHYSICIST'S METHOD Start from savings and derive costs per operation. Assign a potential Φ to each state of the data structure. The amortized cost is the actual cost plus the change in potential: amortized_i = actual_i + Φ_i − Φ_{i−1}, i.e. actual_i = amortized_i + Φ_{i−1} − Φ_i. Amortization holds if Φ_0 = 0 and Φ_n ≥ 0.
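A concrete check of the method on a binary counter (example mine, not from the slides): take Φ = number of 1 bits. An increment that flips t trailing 1s to 0 and one 0 to 1 has actual cost t + 1 and changes the potential by ΔΦ = 1 − t, so

amortized_i = (t + 1) + (1 − t) = 2 = O(1)

even though a single increment can cost Θ(log n) in the worst case.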

NUMBER SYSTEMS Unary - Linked List. Binary - Bentley-Saxe. Skew-Binary - Okasaki's Random Access Lists. Zeroless Binary - ?

UNARY
data Nat    = Zero | Succ Nat
data List a = Nil  | Cons a (List a)

BINARY Counting in binary: 0 = 0, 1 = 1, 2 = 10, 3 = 11, 4 = 100, 5 = 101, 6 = 110, 7 = 111, 8 = 1000, 9 = 1001, 10 = 1010.

ZEROLESS BINARY Digits are all 1, 2. Unique representation. 0 = (empty), 1 = 1, 2 = 2, 3 = 11, 4 = 12, 5 = 21, 6 = 22, 7 = 111, 8 = 112, 9 = 121, 10 = 122.

MODIFIED ZEROLESS BINARY Digits are all 1, 2, or 3. Only the leading digit can be 1. Unique representation. Just the right amount of lag. 0 = (empty), 1 = 1, 2 = 2, 3 = 3, 4 = 12, 5 = 13, 6 = 22, 7 = 23, 8 = 32, 9 = 33, 10 = 122.

 n | Binary | Zeroless Binary | Modified Zeroless Binary
 0 |      0 |         (empty) |                  (empty)
 1 |      1 |               1 |                        1
 2 |     10 |               2 |                        2
 3 |     11 |              11 |                        3
 4 |    100 |              12 |                       12
 5 |    101 |              21 |                       13
 6 |    110 |              22 |                       22
 7 |    111 |             111 |                       23
 8 |   1000 |             112 |                       32
 9 |   1001 |             121 |                       33
10 |   1010 |             122 |                      122
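A small sketch of incrementing in the modified system above (illustrative only; digits stored least-significant first): a carry only ever comes out of a 3, and it leaves a 2 behind, so the same position cannot carry again immediately. That lag is what the data-structure analogue exploits.

type MZB = [Int]  -- digits 1..3, least-significant first

inc :: MZB -> MZB
inc []     = [1]             -- a new leading digit (only it may be 1)
inc (d:ds)
  | d < 3     = d + 1 : ds   -- no carry
  | otherwise = 2 : inc ds   -- carry out of a 3, leaving a 2 behind

-- e.g. map (foldr (\d acc -> d + 2 * acc) 0) (iterate inc []) == [0,1,2,3,..]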

PERSISTENTLY AMORTIZED

data Map k a
  = M0
  | M1 !(Chunk k a)
  | M2 !(Chunk k a) !(Chunk k a) (Chunk k a) !(Map k a)
  | M3 !(Chunk k a) !(Chunk k a) !(Chunk k a) (Chunk k a) !(Map k a)

data Chunk k a = Chunk !(Array k) !(Array a)

-- O(log(N)/B) persistently amortized. Insert an element.
insert :: (Ord k, Arrayed k, Arrayed v) => k -> v -> Map k v -> Map k v
insert k0 v0 = go $ Chunk (singleton k0) (singleton v0)
  where
    go as M0                 = M1 as
    go as (M1 bs)            = M2 as bs (merge as bs) M0
    go as (M2 bs cs bcs xs)  = M3 as bs cs bcs xs
    go as (M3 bs _ _ cds xs) = cds `seq` M2 as bs (merge as bs) (go cds xs)
{-# INLINE insert #-}
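The slides omit the query side; here is a hedged sketch of how a lookup could consult the levels newest-first. lookupChunk (a binary search within one Chunk, e.g. using the offset search above) is a hypothetical helper, not from the talk:

import Prelude hiding (lookup)
import Control.Applicative ((<|>))

lookup :: (Ord k, Arrayed k, Arrayed v) => k -> Map k v -> Maybe v
lookup k = go
  where
    go M0                 = Nothing
    go (M1 as)            = lookupChunk k as
    go (M2 as bs _ xs)    = lookupChunk k as <|> lookupChunk k bs <|> go xs
    go (M3 as bs cs _ xs) = lookupChunk k as <|> lookupChunk k bs
                        <|> lookupChunk k cs <|> go xs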

WHY DO WE CARE? Inserts are ~7-10x faster than Data.Map and get faster with scale! The structure is easily mmap'd in from disk for offline storage. This lets us build an unboxed Map from unboxed vectors. Matches insert performance of a B-Tree without knowing B. Nothing to tune.

PROBLEMS Searching the structure we've defined so far takes O(log²(N/B) + a/B). We only matched insert performance, not query performance. We have to query O(log n) structures to answer queries.

BLOOM-FILTERS Associate a hierarchical Bloom filter with each array, tuned to a false-positive rate that balances the cost of the cache misses for the binary search against the cost of hashing into the filter. Improves upon a version of the Stratified Doubling Array. Not cache-oblivious!

FRACTIONAL CASCADING Search m sorted arrays, each of size up to n, at the same time. Precalculations are allowed, but not a huge explosion in space. Very useful for many computational geometry problems. Naïve solution: binary search each separately in O(m log n). With fractional cascading: O(log(mn)) = O(log m + log n).

FRACTIONAL CASCADING Consider 2 sorted lists, e.g. [1 3 10 20 35 40] and [2 5 6 8 11 21 36 37 38 41 42]. Copy every kth entry from the second into the first: [1 2 3 8 10 20 35 36 40 41] and [2 5 6 8 11 21 36 37 38 41 42]. After a failed search in the first, you now only have to search a constant, k-sized fragment of the second.
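A toy version of the promotion step (illustrative only); with k = 3 it reproduces the example above:

import Data.List (sort)

-- Promote every k-th element of the second sorted list into the first,
-- keeping the result sorted.
promoteEvery :: Ord a => Int -> [a] -> [a] -> [a]
promoteEvery k first second =
  sort (first ++ [x | (i, x) <- zip [0 ..] second, i `mod` k == 0])

-- promoteEvery 3 [1,3,10,20,35,40] [2,5,6,8,11,21,36,37,38,41,42]
--   == [1,2,3,8,10,20,35,36,40,41]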

IMPLICIT FRACTIONAL CASCADING New trick: we copy every kth entry up from the next largest array. If we had a way to count the number of forwarding pointers up to a given position, we could just multiply that count by k and not have to store the pointers themselves.

SUCCINCT DICTIONARIES Given a bit vector S of length n containing k ones, e.g. 0 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 There exist C(n, k) = (n choose k) such vectors, so knowing nothing else we could store the choice in H_0 = log₂ C(n, k) + 1 bits. rank_α(i) = # of occurrences of α in S[0..i). select_α(i) = position of the i-th α in S.

NON-SUCCINCT DICTIONARIES Given a bit vector S of length n containing k ones, e.g. 0 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 Break it into chunks of size log(n) (or 64). Store a prefix sum up to each chunk. With just 2n total space we get an O(1) version of: rank_α(S, i) = # of occurrences of α in S[0..i).
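A minimal sketch of that scheme over 64-bit words (names are mine, not from the talk):

import Data.Bits (popCount, (.&.), shiftL)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V

data Ranked = Ranked
  { rBits :: V.Vector Word64  -- the bit vector, 64 bits per word
  , rSums :: V.Vector Int     -- popcount of everything before each word
  }

mkRanked :: V.Vector Word64 -> Ranked
mkRanked ws = Ranked ws (V.prescanl (\acc w -> acc + popCount w) 0 ws)

-- rank1 s i = number of 1 bits in positions [0..i), for 0 <= i < 64 * #words:
-- one prefix-sum lookup plus one masked popcount, hence O(1).
rank1 :: Ranked -> Int -> Int
rank1 (Ranked ws ps) i =
  ps V.! q + popCount (ws V.! q .&. ((1 `shiftL` r) - 1))
  where (q, r) = i `divMod` 64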

IMPLICIT FORWARDING Store a bitvector that indicates, for each key in the vector, whether the key is a forwarding pointer or has an associated value. To index into the values, use rank up to the given position instead. This can also be used to represent deletion flags succinctly. In practice we can use non-succinct algorithms (rank9, poppy).

BENEFITS Match the asymptotic B-Tree performance without knowing B. Fully persistent; we can edit previous versions. Always uses sequential writes on disk. We get ~10x faster inserts than Data.Map. We can reuse the dynamization technique for other domains.

QUESTIONS? The code is on github: http://github.com/ekmett/structures http://github.com/ekmett/succinct

SUCCINCT TREES Parsed data takes several times more space than the raw format. Pointers and ADTs are big. How can we do better?

JACOBSON TREES Start with an implicit tree: node k has parent k `div` 2 and children 2k and 2k+1.
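In code, the implicit layout is just index arithmetic (1-indexed, as on the slide):

parent, leftChild, rightChild :: Int -> Int
parent k     = k `div` 2
leftChild k  = 2 * k
rightChild k = 2 * k + 1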