I/O Model and Cache-Oblivious Algorithms. 15-853: Algorithms in the Real World. 4/9/13


I/O Model
15-853: Algorithms in the Real World
Locality II: Cache-oblivious algorithms. Matrix multiplication, distribution sort, static searching.

The I/O model abstracts a single level of the memory hierarchy:
- Fast memory (cache) of size M. Accessing fast memory is free, but moving data from slow memory is expensive.
- Memory is grouped into size-B blocks of contiguous data, so the fast memory holds M/B blocks.
- Cost: the number of block transfers (or I/Os) from slow memory to fast memory.

Cache-Oblivious Algorithms
Algorithms not parameterized by B or M; these algorithms are unaware of the parameters of the memory hierarchy. Analyze them in the ideal cache model, which is the same as the I/O model except that optimal replacement is assumed.

Advantages of Cache-Oblivious Algorithms
- Since cache-oblivious algorithms do not depend on memory parameters, their bounds generalize to multilevel hierarchies.
- The algorithms are platform independent.
- The algorithms remain effective even when B and M are not static.
- Because optimal replacement is assumed, a proof may posit any convenient replacement policy, even an explicit algorithm for selecting which blocks to load and evict; the optimal policy can only do better.
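The cost model can be made concrete with a toy simulator (a minimal sketch; the function name `count_ios` and the choice of LRU replacement are my assumptions, with LRU standing in for the ideal model's optimal replacement, which can only do better):

```python
# Sketch: count block transfers (I/Os) for a sequence of memory accesses.
# M = fast-memory size, B = block size; LRU stands in for optimal replacement.
from collections import OrderedDict

def count_ios(accesses, M, B):
    """Count block transfers for a sequence of memory addresses."""
    cache = OrderedDict()              # block id -> None, ordered by recency
    capacity = M // B                  # fast memory holds M/B blocks
    ios = 0
    for addr in accesses:
        blk = addr // B                # addresses group into size-B blocks
        if blk in cache:
            cache.move_to_end(blk)     # hit: accessing fast memory is free
        else:
            ios += 1                   # miss: one block transfer
            cache[blk] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return ios

# Scanning n contiguous elements costs ceil(n/B) I/Os:
print(count_ios(range(1000), 64, 8))   # -> 125
```

Scanning is the cheapest access pattern in the model: every block is transferred exactly once.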

Matrix Multiplication
Consider the standard iterative matrix-multiplication algorithm Z := X·Y, where X, Y, and Z are n×n matrices:

for i = 1 to n do
  for j = 1 to n do
    for k = 1 to n do
      Z[i][j] += X[i][k] * Y[k][j]

This is Θ(n³) computation in the RAM model. What about I/O?

How Are Matrices Stored?
How data is arranged in memory affects I/O performance. Suppose X, Y, and Z are all in row-major order:
- If n ≥ B, reading a column of Y is expensive: Θ(n) I/Os.
- If the matrices do not fit in fast memory (n² > M), there is no locality across iterations for X and Y, giving Θ(n³) I/Os.

Suppose instead that X and Z are in row-major order but Y is in column-major order. This is not too inconvenient: transposing Y is relatively cheap. Each innermost loop now scans a row of X and a column of Y, costing Θ(n/B) I/Os, so with no locality across iterations (n² > M) the total is Θ(n³/B) I/Os.

Recursive Matrix Multiplication
We can do much better than Θ(n³/B) I/Os, even if all matrices are row-major. Split each matrix into quadrants and compute 8 submatrix products recursively:

Z11 := X11·Y11 + X12·Y21
Z12 := X11·Y12 + X12·Y22
Z21 := X21·Y11 + X22·Y21
Z22 := X21·Y12 + X22·Y22

Summing two matrices with row-major layout is cheap: just scan the matrices in memory order. The cost is Θ(n²/B) I/Os to sum two matrices, assuming n ≥ B.
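The quadrant recursion can be sketched as follows (a minimal in-memory sketch, assuming n is a power of two; it illustrates the control structure only, since in the cache-oblivious setting the I/O savings come from the analysis, not from any explicit blocking in the code):

```python
def matmul_iterative(X, Y):
    """Reference triple-loop multiply from the slides."""
    n = len(X)
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                Z[i][j] += X[i][k] * Y[k][j]
    return Z

def matmul_recursive(X, Y, base=1):
    """8-way quadrant recursion; n is assumed to be a power of two."""
    n = len(X)
    if n <= base:                       # base case: multiply directly
        return matmul_iterative(X, Y)
    h = n // 2
    quad = lambda A, r, c: [row[c*h:(c+1)*h] for row in A[r*h:(r+1)*h]]
    add = lambda A, B: [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    X11, X12, X21, X22 = (quad(X, r, c) for r in (0, 1) for c in (0, 1))
    Y11, Y12, Y21, Y22 = (quad(Y, r, c) for r in (0, 1) for c in (0, 1))
    # the 8 recursive submatrix products, combined with 4 quadrant sums
    Z11 = add(matmul_recursive(X11, Y11, base), matmul_recursive(X12, Y21, base))
    Z12 = add(matmul_recursive(X11, Y12, base), matmul_recursive(X12, Y22, base))
    Z21 = add(matmul_recursive(X21, Y11, base), matmul_recursive(X22, Y21, base))
    Z22 = add(matmul_recursive(X21, Y12, base), matmul_recursive(X22, Y22, base))
    top = [r1 + r2 for r1, r2 in zip(Z11, Z12)]
    bot = [r1 + r2 for r1, r2 in zip(Z21, Z22)]
    return top + bot

# Example: [[1,2],[3,4]] x [[5,6],[7,8]] -> [[19,22],[43,50]]
print(matmul_recursive([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```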

Recursive Multiplication Analysis
The recursive algorithm gives the recurrence

  Mult(n) = 8·Mult(n/2) + Θ(n²/B)
  Mult(n₀) = O(M/B) when the n₀×n₀ submatrices of X, Y, and Z fit in memory.

The big question is the base case n₀. Suppose n₀×n₀ submatrices of X, Y, and Z fit in memory at the same time. Then multiplying them in memory is free after paying Θ(M/B) I/Os to load them.

Array Storage
How many blocks does a size-n array occupy? If it is aligned on a block boundary (usually arrangeable for cache-aware algorithms), it takes exactly ⌈n/B⌉ blocks. If you are unlucky, it is ⌈n/B⌉+1 blocks; this is generally what you must assume for cache-oblivious algorithms, since you cannot force alignment. In either case, it is Θ(1 + n/B) blocks.

Row-Major Matrix
If you look at the full matrix, it is just a single array: the rows appear one after the other in slow memory. So the entire n×n matrix fits in ⌈n²/B⌉+1 = Θ(1 + n²/B) blocks.

Row-Major Submatrix
In a k×k submatrix of a row-major matrix, the rows are not adjacent in slow memory. We need to treat the submatrix as k separate arrays, so the total number of blocks it occupies is k·(⌈k/B⌉+1) = Θ(k + k²/B).

Recall the recurrence
  Mult(n) = 8·Mult(n/2) + Θ(n²/B).   (1)
When does the base case occur? Specifically, does a Θ(√M) × Θ(√M) matrix fit in cache, i.e., does it occupy at most M/B different blocks? If so, we stop the analysis at size Θ(√M): we pay Θ(M/B) to load the full submatrix into cache, and all lower levels of the recursion are free:
  Mult(Θ(√M)) = Θ(M/B).   (2)
Solving (1) with (2) as a base case gives
  Mult(n) = Θ(n²/B + n³/(B·√M)).

Is that assumption correct? Does a Θ(√M) × Θ(√M) matrix occupy at most Θ(M/B) different blocks? From the formula above, a k×k submatrix requires Θ(k + k²/B) blocks, so a Θ(√M) × Θ(√M) submatrix occupies roughly √M + M/B blocks. The answer is yes only if Θ(√M + M/B) = Θ(M/B), which holds iff √M ≤ M/B, i.e., M ≥ B². If not, the base case of the analysis is broken: recursing into the submatrix will still require more I/Os.

Fixing the Base Case
Two fixes:
1. The tall-cache assumption: M ≥ B². Then the base case is correct, completing the analysis.
2. Change the matrix layout.
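The Θ(k + k²/B) block count can be checked by brute force (an illustrative sketch; `blocks_touched` and the row-major address convention addr = row·n + col are my assumptions, not from the slides):

```python
def blocks_touched(n, k, r0, c0, B):
    """Distinct size-B blocks covered by the k x k submatrix at (r0, c0)
    of an n x n row-major matrix, where element (r, c) has address r*n + c."""
    blocks = {(r * n + c) // B
              for r in range(r0, r0 + k)
              for c in range(c0, c0 + k)}
    return len(blocks)

# Each of the k rows is a separate contiguous array, so the total is
# about k*(k/B + 1) = Theta(k + k^2/B) blocks:
print(blocks_touched(1024, 64, 3, 5, 16))  # -> 320  (= 64 + 64*64/16)
```

For B = 16 and k = 64, the formula k + k²/B predicts 64 + 256 = 320 blocks, matching the enumeration.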

Without the Tall-Cache Assumption
Try a better matrix layout. The algorithm is recursive, so use a layout that matches its recursive structure. For example, Z-Morton ordering: construct the layout by storing each quadrant of the matrix contiguously, and recurse within each quadrant. (Drawing the line that connects elements adjacent in memory traces out a recursive "Z" pattern.)

Recursive MatMul with Z-Morton
The analysis becomes easier. Each quadrant of the matrix is contiguous in memory, so a c√M × c√M submatrix fits in memory. The tall-cache assumption is not required to make this base case work, and the rest of the analysis is the same.

Searching: Binary Search Is Bad
Example: binary search for element A in the sorted array A B C D E F G H I J K L M N O, with block size B = 2. The search hits a different block at every step until the keyspace is reduced to size Θ(B). Thus the total cost is log₂n − Θ(log₂B) = Θ(log₂(n/B)) = Θ(log₂n) for n ≫ B.

Static Cache-Oblivious Searching
Goal: organize the keys in memory to facilitate efficient searching. The van Emde Boas layout:
1. Build a balanced binary tree on the keys.
2. Lay out the tree recursively in memory, splitting the tree at half its height.
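One common way to compute Z-Morton positions is to interleave the bits of the row and column indices (a sketch; the convention below, row bits at odd positions and column bits at even positions, is one arbitrary but standard choice):

```python
def z_morton_index(row, col, bits=16):
    """Position of element (row, col) in a Z-Morton (bit-interleaved) layout."""
    idx = 0
    for b in range(bits):
        idx |= ((row >> b) & 1) << (2 * b + 1)   # row bits at odd positions
        idx |= ((col >> b) & 1) << (2 * b)       # col bits at even positions
    return idx

# For a 4x4 matrix, each quadrant is stored contiguously, recursively:
for r in range(4):
    print([z_morton_index(r, c) for c in range(4)])
# -> [0, 1, 4, 5]
#    [2, 3, 6, 7]
#    [8, 9, 12, 13]
#    [10, 11, 14, 15]
```

Note that positions 0-3 fill the top-left quadrant, 4-7 the top-right, 8-11 the bottom-left, and 12-15 the bottom-right, exactly the quadrant-contiguous layout the recursion needs.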

Static Layout Example
For the keys A through O, the balanced tree has root H with children D and L. Splitting at half the height stores the top subtree {H, D, L} first, followed by each bottom subtree laid out recursively. The resulting memory layout is:

H D L  B A C  F E G  J I K  N M O

Cache-Oblivious Searching: Analysis I
Consider the recursive subtrees of size √B to B along a root-to-leaf search path. Each such subtree is contiguous in memory and fits in O(1) blocks. Each subtree has height Θ(lg B), so a root-to-leaf path of length Θ(lg n) crosses Θ(log_B n) of them. Total cost: Θ(log_B n) I/Os.

Cache-Oblivious Searching: Analysis II
Analyze using a recurrence:
  S(n) = 2·S(√n), with base case S(n) = 1 for n < B,
which solves to O(log_B n). Alternatively,
  S(n) = 2·S(√n) + O(1), with base case S(n) = 0 for n < B,
which counts the number of random accesses.

Distribution Sort Outline
The sort is analogous to multiway quicksort:
1. Split the input array into √n contiguous subarrays of size √n. Sort the subarrays recursively.
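The two-step construction can be sketched directly: build a balanced tree, then recursively emit the top half-height subtree followed by each bottom subtree (helper names `build_bst`, `cut`, and `veb_layout` are mine; the sketch assumes a perfect tree, i.e. 2^h − 1 keys):

```python
def build_bst(keys):
    """Balanced BST from sorted keys: (root, left_subtree, right_subtree)."""
    if not keys:
        return None
    mid = len(keys) // 2
    return (keys[mid], build_bst(keys[:mid]), build_bst(keys[mid + 1:]))

def height(t):
    return 0 if t is None else 1 + max(height(t[1]), height(t[2]))

def cut(t, h):
    """Split t into its top h levels and the list of subtrees hanging below."""
    if t is None or h == 0:
        return None, ([t] if t is not None else [])
    top_l, bot_l = cut(t[1], h - 1)
    top_r, bot_r = cut(t[2], h - 1)
    return (t[0], top_l, top_r), bot_l + bot_r

def veb_layout(t):
    """van Emde Boas order: lay out the top half-height subtree,
    then each bottom subtree, recursively."""
    h = height(t)
    if h <= 1:
        return [t[0]] if t else []
    top, bottoms = cut(t, h // 2)
    out = veb_layout(top)
    for b in bottoms:
        out += veb_layout(b)
    return out

print(''.join(veb_layout(build_bst(list("ABCDEFGHIJKLMNO")))))  # -> HDLBACFEGJIKNMO
```

The output reproduces the example layout: the top subtree H D L, then the four bottom subtrees B A C, F E G, J I K, N M O.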

Distribution Sort Outline (continued)
2. Choose √n − 1 good pivots p₁ ≤ p₂ ≤ … ≤ p_{√n−1}.
3. Distribute the subarrays into buckets according to the pivots, so that bucket i receives the keys in (p_{i−1}, p_i]. Each bucket has size at most 2√n.
4. Recursively sort the buckets.
5. Copy the concatenated buckets back to the input array, which is now sorted.

Distribution Sort Analysis Sketch
Step 1 (implicitly) divides the array and sorts √n size-√n subproblems. Step 4 sorts buckets of size nᵢ ≤ 2√n with total size n. Step 5 copies back the output with a scan. Together with steps 2 and 3, this gives the recurrence

  T(n) = √n·T(√n) + Σᵢ T(nᵢ) + Θ(n/B) ≤ 2√n·T(2√n) + Θ(n/B),

with base case T(n) = Θ(M/B) for n < M. This solves to T(n) = Θ((n/B)·log_{M/B}(n/B)) if M ≥ B².

Missing Steps
Step 2, choosing √n − 1 good pivots p₁ ≤ p₂ ≤ … ≤ p_{√n−1}, is not too hard to do in Θ(n/B) I/Os. Step 3, distributing the subarrays into buckets of size at most 2√n, is the interesting part.
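Steps 1 through 5 can be sketched in miniature (an in-memory sketch only, not I/O-efficient; taking every √n-th element as a pivot is a crude stand-in for the slides' good pivots, with a fallback for degenerate pivot sets such as all-equal keys):

```python
import math
import bisect

def distribution_sort(a):
    """Distribution sort, steps 1-5 in miniature (control flow only)."""
    n = len(a)
    if n <= 4:                               # small base case
        return sorted(a)
    m = max(1, math.isqrt(n))                # subarray size ~ sqrt(n)
    subs = [distribution_sort(a[i:i + m])    # step 1: sort ~sqrt(n) subarrays
            for i in range(0, n, m)]
    pivots = sorted(a[::m])                  # step 2: crude pivot choice
    buckets = [[] for _ in range(len(pivots) + 1)]
    for s in subs:                           # step 3: distribute into buckets
        for x in s:
            buckets[bisect.bisect_right(pivots, x)].append(x)
    out = []
    for b in buckets:                        # step 4: sort buckets recursively
        out += sorted(b) if len(b) == n else distribution_sort(b)
    return out                               # step 5: concatenation is sorted

print(distribution_sort([3, 1, 4, 1, 5]))  # -> [1, 1, 3, 4, 5]
```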

Naïve Distribution
Distribute the first subarray, then the second, then the third, and so on. The cost to scan the input array is only Θ(n/B). But what about writing to the output buckets? Suppose each subarray writes one element to each bucket: the cost is 1 I/O per write, for √n·√n = n I/Os total!

Better Recursive Distribution
Given subarrays s₁, …, s_k and buckets b₁, …, b_k:
1. Recursively distribute s₁, …, s_{k/2} to b₁, …, b_{k/2}.
2. Recursively distribute s₁, …, s_{k/2} to b_{k/2+1}, …, b_k.
3. Recursively distribute s_{k/2+1}, …, s_k to b₁, …, b_{k/2}.
4. Recursively distribute s_{k/2+1}, …, s_k to b_{k/2+1}, …, b_k.
Despite the odd order, each subarray is consumed left to right, so we only need to check against the next pivot.

Distribute Analysis
Counting only random accesses:
  D(k) = 4·D(k/2) + O(k).
The base case occurs when the next block of each of the k buckets and subarrays fits in memory at once (this is like an (M/B)-way merge), so D(M/B) is free; under the tall-cache assumption M/B ≥ B, we can conservatively take D(B) = free. The recurrence then solves to D(k) = O(k²/B): with k = √n, the distribute step uses O(n/B) random accesses, and the rest is sequential scanning at a cost of O(1/B) per element.

Note on Distribute
If you unroll the recursion, it proceeds in Z-Morton order over the (subarray, bucket) matrix: first distribute s₁ to b₁, then s₁ to b₂, then s₂ to b₁, then s₂ to b₂, and so on.
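The four-way recursion can be sketched as follows (assuming k is a power of two; the `pos` array, which records how far each sorted subarray has been consumed, is my bookkeeping, and the Z-Morton visit order guarantees each subarray meets its buckets in increasing order):

```python
def distribute(subs, buckets, pivots, s_lo, s_hi, b_lo, b_hi, pos):
    """Recursive distribution of sorted subarrays into buckets, for k a
    power of two. Bucket b receives keys in (pivots[b-1], pivots[b]];
    pos[s] records how much of subarray s has been consumed so far."""
    if s_hi - s_lo == 1:                 # single (subarray, bucket) pair
        s, b = s_lo, b_lo
        upper = pivots[b] if b < len(pivots) else float('inf')
        # the subarray is sorted and consumed left to right, so we only
        # need to compare against the next pivot
        while pos[s] < len(subs[s]) and subs[s][pos[s]] <= upper:
            buckets[b].append(subs[s][pos[s]])
            pos[s] += 1
        return
    sm, bm = (s_lo + s_hi) // 2, (b_lo + b_hi) // 2
    distribute(subs, buckets, pivots, s_lo, sm, b_lo, bm, pos)  # step 1
    distribute(subs, buckets, pivots, s_lo, sm, bm, b_hi, pos)  # step 2
    distribute(subs, buckets, pivots, sm, s_hi, b_lo, bm, pos)  # step 3
    distribute(subs, buckets, pivots, sm, s_hi, bm, b_hi, pos)  # step 4

# 4 sorted subarrays, pivots splitting the keys 1..16 into 4 buckets:
subs = [[1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, 8, 12, 16]]
pivots = [4, 8, 12]
buckets = [[] for _ in range(4)]
pos = [0] * 4
distribute(subs, buckets, pivots, 0, 4, 0, 4, pos)
print(buckets)  # -> [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
```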