Cache-Oblivious Algorithms


Cache-Oblivious Algorithms (Paper Reading Group)
Paper by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran
Presented by Maksym Planeta, 03.09.2015

Table of Contents
  Introduction
  Cache-oblivious algorithms
    Matrix multiplication
    Matrix transposition
    Fast Fourier Transform
    Sorting
  Relaxed system model
  Experimental evaluation
  Conclusion


Matrix multiplication

ORD-MULT(A, B, C)
1  for i ← 1 to m
2    for j ← 1 to p
3      for k ← 1 to n
4        C_ij ← C_ij + A_ik · B_kj
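For concreteness, a minimal C sketch of ORD-MULT, assuming contiguous row-major arrays passed together with their dimensions (the function name and signature are illustrative, not from the paper):

    #include <stddef.h>

    /* Ordinary triple-loop multiply: C += A * B, with A m-by-n,
       B n-by-p, C m-by-p, all row-major and contiguous. */
    void ord_mult(const double *A, const double *B, double *C,
                  size_t m, size_t n, size_t p)
    {
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < p; j++)
                for (size_t k = 0; k < n; k++)
                    C[i * p + j] += A[i * n + k] * B[k * p + j];
    }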

Matrix layout: like in C.

[Figure: row-major order of an 8 × 8 matrix (elements 0-63 stored row by row); the figure also shows blocked and bit-interleaved layouts for comparison.]

Matrix layout: like in C (row-major order), or like in Fortran (column-major order).

[Figure: (c) row-major order vs. (d) column-major order of an 8 × 8 matrix.]
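The difference between these layouts is just the index function. A small C sketch of the address computations, plus a blocked layout of the kind cache-aware codes use (the blocked variant assumes the block side s divides the dimensions, an illustrative simplification):

    #include <stddef.h>

    /* Linear index of element (i, j) of an m x n matrix. */
    size_t row_major(size_t i, size_t j, size_t m, size_t n)
    {
        (void)m;
        return i * n + j;            /* C convention */
    }

    size_t col_major(size_t i, size_t j, size_t m, size_t n)
    {
        (void)n;
        return j * m + i;            /* Fortran convention */
    }

    size_t blocked(size_t i, size_t j, size_t n, size_t s)
    {
        size_t bi = i / s, bj = j / s;   /* block coordinates */
        size_t oi = i % s, oj = j % s;   /* offset inside the block */
        return (bi * (n / s) + bj) * (s * s) + oi * s + oj;
    }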

Cache-friendly algorithm

BLOCK-MULT(A, B, C, n)
1  for i ← 1 to n/s
2    for j ← 1 to n/s
3      for k ← 1 to n/s
4        ORD-MULT(A_ik, B_kj, C_ij, s)

(Here A_ik denotes the s × s block of A at block row i and block column k.)
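A C sketch of the blocked multiply (assuming s divides n; tuning s to the cache is exactly the problem the next slide lists):

    #include <stddef.h>

    /* Blocked multiply: C += A * B for n-by-n row-major matrices,
       processed in s-by-s tiles. */
    void block_mult(const double *A, const double *B, double *C,
                    size_t n, size_t s)
    {
        for (size_t i = 0; i < n; i += s)
            for (size_t j = 0; j < n; j += s)
                for (size_t k = 0; k < n; k += s)
                    /* ORD-MULT on the tiles C_ij, A_ik, B_kj */
                    for (size_t ii = i; ii < i + s; ii++)
                        for (size_t jj = j; jj < j + s; jj++)
                            for (size_t kk = k; kk < k + s; kk++)
                                C[ii * n + jj] += A[ii * n + kk] * B[kk * n + jj];
    }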

BLOCK-MULT issues: being cache-aware is hard.
  Cumbersome code structure
  Complicated choice of the tile size s
  Choosing s badly is expensive
  Problematic if n mod s ≠ 0

Motivation: keeping an algorithm simple is nice, but cache effectiveness is a must.


System model

[Figure 1: the ideal-cache model. A CPU performs W work against a cache of Z words, organized in cache lines of L words and managed by an optimal replacement strategy; Q counts cache misses against main memory. Word size is assumed constant; the particular constant does not affect the asymptotic analyses.]

  Two-level memory
  Fully associative
  Strictly optimal replacement
  Automatic replacement
  Tall cache: Z = Ω(L²), where
    Z is the number of words in the cache
    L is the number of words in a cache line

Matrix multiplication

Given: A[m × n], B[n × p], C[m × p]. REC-MULT splits the largest of the three dimensions (semicolons separate vertical blocks, juxtaposition separates horizontal blocks):

  (A1; A2) · B = (A1·B; A2·B),          if m ≥ max(n, p)   (1)
  (A1 A2) · (B1; B2) = A1·B1 + A2·B2,   if n ≥ max(m, p)   (2)
  A · (B1 B2) = (A·B1  A·B2),           if p ≥ max(n, m)   (3)
  C_ij ← C_ij + A_ik · B_kj,            if m = n = p = 1   (4)
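A minimal C sketch of the recursion implied by rules (1) through (4). As a simplification for illustration (not the paper's exact formulation), A, B, and C are taken to be submatrices of row-major arrays that all share one leading dimension ld; the caller is expected to have initialized C:

    #include <stddef.h>

    /* Recursive cache-oblivious multiply: C += A * B with A m-by-n,
       B n-by-p, C m-by-p. In practice the 1x1 base case would be
       coarsened for speed. */
    void rec_mult(const double *A, const double *B, double *C,
                  size_t m, size_t n, size_t p, size_t ld)
    {
        if (m == 1 && n == 1 && p == 1) {      /* rule (4): base case */
            C[0] += A[0] * B[0];
        } else if (m >= n && m >= p) {         /* rule (1): split rows of A and C */
            size_t h = m / 2;
            rec_mult(A, B, C, h, n, p, ld);
            rec_mult(A + h * ld, B, C + h * ld, m - h, n, p, ld);
        } else if (n >= m && n >= p) {         /* rule (2): split the inner dimension;
                                                  both halves accumulate into C */
            size_t h = n / 2;
            rec_mult(A, B, C, m, h, p, ld);
            rec_mult(A + h, B + h * ld, C, m, n - h, p, ld);
        } else {                               /* rule (3): split columns of B and C */
            size_t h = p / 2;
            rec_mult(A, B, C, m, n, h, ld);
            rec_mult(A, B + h, C + h, m, n, p - h, ld);
        }
    }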

Bounds

REC-MULT:  work Θ(n³), cache misses Θ(n + n²/L + n³/(L·√Z))
vs BLOCK-MULT:  work Θ(n³), cache misses Θ(1 + n²/L + n³/(L·√Z))
vs Strassen's [2] (cache-oblivious):  work Θ(n^(log₂ 7)), cache misses Θ(1 + n²/L + n^(log₂ 7)/(L·√Z))

Matrix transposition

Given: A[m × n], B[n × m], compute B = Aᵀ. REC-TRANSPOSE splits the longer dimension:

  A = (A1 A2),  B = (B1; B2)   (5)

and recursively transposes A1 into B1 and A2 into B2.
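A minimal C sketch of this recursion, under the same illustrative row-major conventions as rec_mult above (the leading-dimension parameters lda and ldb are assumptions, and a real implementation would coarsen the 1x1 base case):

    #include <stddef.h>

    /* Cache-oblivious transpose: B = A^T for an m-by-n A, following
       equation (5): always split the longer dimension. */
    void rec_transpose(const double *A, double *B,
                       size_t m, size_t n, size_t lda, size_t ldb)
    {
        if (m == 1 && n == 1) {
            B[0] = A[0];
        } else if (n >= m) {                   /* split A's columns = B's rows */
            size_t h = n / 2;
            rec_transpose(A, B, m, h, lda, ldb);
            rec_transpose(A + h, B + h * ldb, m, n - h, lda, ldb);
        } else {                               /* split A's rows = B's columns */
            size_t h = m / 2;
            rec_transpose(A, B, h, n, lda, ldb);
            rec_transpose(A + h * lda, B + h, m - h, n, lda, ldb);
        }
    }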

Bounds

REC-TRANSPOSE:  work Θ(mn), cache misses Θ(1 + mn/L), asymptotically optimal
vs Naïve:  work Θ(mn), cache misses Θ(mn)

Discrete Fourier Transform (DFT)

Compute:

  Y[i] = Σ_{j=0}^{n-1} X[j] · ω_n^{ij},  where ω_n = e^{2π√(-1)/n}

Assume n = 2^k for some k ∈ N. Choose n1 = 2^⌈(log₂ n)/2⌉ and n2 = 2^⌊(log₂ n)/2⌋, so that n = n1·n2. Factorized Y (Cooley-Tukey algorithm):

  Y[i1 + i2·n1] = Σ_{j2=0}^{n2-1} ( ( Σ_{j1=0}^{n1-1} X[j1·n2 + j2] · ω_{n1}^{i1·j1} ) · ω_n^{i1·j2} ) · ω_{n2}^{i2·j2}
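For illustration, a recursive radix-2 Cooley-Tukey sketch in C; this is the n1 = 2 special case of the factorization, whereas the paper's cache-oblivious FFT splits n into factors near √n to reach the optimal miss bound. It assumes n is a power of two and uses the slide's sign convention for ω_n:

    #include <complex.h>
    #include <math.h>
    #include <stddef.h>

    /* Y <- DFT(X) for n a power of two; X is read with the given
       stride so even/odd subsequences are addressed without copying. */
    void fft_rec(const double complex *X, double complex *Y,
                 size_t n, size_t stride)
    {
        if (n == 1) { Y[0] = X[0]; return; }
        fft_rec(X, Y, n / 2, 2 * stride);                  /* even-indexed terms */
        fft_rec(X + stride, Y + n / 2, n / 2, 2 * stride); /* odd-indexed terms */
        const double pi = acos(-1.0);
        for (size_t i = 0; i < n / 2; i++) {
            double complex w = cexp(2.0 * pi * I * (double)i / (double)n);
            double complex e = Y[i], o = Y[i + n / 2];
            Y[i]         = e + w * o;   /* butterfly with twiddle ω_n^i   */
            Y[i + n / 2] = e - w * o;   /* since ω_n^(i + n/2) = -ω_n^i   */
        }
    }

Calling fft_rec(X, Y, n, 1) fills Y with the transform of the contiguous array X.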

Sorting

Mergesort is not optimal with respect to cache misses. Two alternatives:
  1. Funnelsort
  2. Distribution sort
Both are recursive and asymptotically cache-optimal; note that not every recursive sort is cache-optimal.

Funnelsort
  1. Split the input into n^(1/3) contiguous arrays of size n^(2/3), and sort these arrays recursively.
  2. Merge the n^(1/3) sorted sequences using an n^(1/3)-merger.
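A structural sketch of funnelsort's outer recursion in C. Note the stand-in: a naive k-way merge replaces the recursive k-merger of the next slide, so this version does not achieve the cache-oblivious bound; it only shows the n^(1/3)/n^(2/3) split:

    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    /* Naive k-way merge: repeatedly pick the smallest head (for
       illustration only; a real funnelsort uses a k-merger). */
    static void k_merge(double **runs, size_t *len, size_t k,
                        double *out, size_t total)
    {
        size_t *pos = calloc(k, sizeof *pos);
        for (size_t t = 0; t < total; t++) {
            size_t best = k;
            for (size_t r = 0; r < k; r++)
                if (pos[r] < len[r] &&
                    (best == k || runs[r][pos[r]] < runs[best][pos[best]]))
                    best = r;
            out[t] = runs[best][pos[best]++];
        }
        free(pos);
    }

    void funnelsort(double *a, size_t n)
    {
        if (n <= 4) {                       /* tiny base case: insertion sort */
            for (size_t i = 1; i < n; i++) {
                double x = a[i]; size_t j = i;
                while (j > 0 && a[j - 1] > x) { a[j] = a[j - 1]; j--; }
                a[j] = x;
            }
            return;
        }
        size_t k = (size_t)ceil(cbrt((double)n));  /* ~n^(1/3) runs      */
        size_t chunk = (n + k - 1) / k;            /* ~n^(2/3) run length */
        double **runs = malloc(k * sizeof *runs);
        size_t *len = malloc(k * sizeof *len);
        size_t used = 0;
        for (size_t r = 0; r < k; r++) {
            size_t l = (n - used < chunk) ? n - used : chunk;
            runs[r] = a + used; len[r] = l; used += l;
            funnelsort(runs[r], l);                /* sort each run recursively */
        }
        double *tmp = malloc(n * sizeof *tmp);
        k_merge(runs, len, k, tmp, n);
        memcpy(a, tmp, n * sizeof *tmp);
        free(tmp); free(runs); free(len);
    }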

k-merger

[Figure 3: Illustration of a k-merger. A k-merger is built recursively out of √k left √k-mergers L1, L2, ..., L√k, a series of buffers, and one right √k-merger R.]

Bounds

Work: O(n log₂ n)
Cache misses (optimal): O(1 + (n/L)(1 + log_Z n))

Relaxed system model
  LRU replacement: Θ(Q(n; Z; L)) cache misses, i.e., asymptotically as good as optimal replacement
  Multilevel caches (assuming inclusive caches)


Micro-benchmarks

[Figure 5: average time taken to multiply two N × N matrices, divided by N³; iterative vs. recursive, N up to about 600.]
[Figure 4: average time to transpose an N × N matrix, divided by N²; iterative vs. recursive, N up to about 1200.]

From the accompanying text: the divide-and-conquer structure was modified to produce exact powers of 2 as submatrix sizes wherever possible, and the base cases were coarsened by inlining the recursion near the leaves to increase their size and overcome the overhead of procedure calls. (A good research problem is to determine an effective compiler strategy for coarsening base cases automatically.) Although these results must be considered preliminary, Figure 4 strongly indicates that the recursive algorithm outperforms the iterative one.

Real benchmarks [1]

[Fig. 4.8: cache misses per lookup for static search trees, for 100 to 1e+06 items. Curves: classic binary search, cache-oblivious, and cache-aware search, each in explicit and implicit variants.]

Real benchmarks [1]

[Fig. 4.9: instruction count per lookup for static search trees, same six algorithm variants.]

Real benchmarks [1]

[Fig. 4.10: execution time on Windows, in microseconds per lookup, for static search trees, same six algorithm variants.]


FFMK tribute slide...

"FFTW library, which uses a recursive strategy to exploit caches in Fourier transform calculations. FFTW's code generator produces straight-line 'codelets', which are coarsened base cases for the FFT algorithm. Because these codelets are cache oblivious, a C compiler can perform its register allocation efficiently, and yet the codelets can be generated without knowing the number of registers on the target architecture."

Open questions
  Is there a gap in asymptotic complexity between cache-aware and cache-oblivious algorithms?
  Is there a limit to how much better a cache-aware algorithm can be?

Conclusion

Cache-oblivious algorithms:
  seem to be somewhat slower in practice
  provide cache optimality without knowing the cache size
  are based on recursion (divide and conquer)

References

[1] Richard E. Ladner, Ray Fortna, and Bao-Hoang Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics, pages 78–92. Springer, 2002.
[2] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969.