Cache-Oblivious Algorithms
- August Hardy
- 5 years ago
Transcription
1 Cache-Oblivious Algorithms Paper Reading Group Matteo Frigo Charles E. Leiserson Harald Prokop Sridhar Ramachandran Presents: Maksym Planeta
2 Table of Contents Introduction Cache-oblivious algorithms Matrix multiplication Matrix transposition Fast Fourier Transform Sorting Relieved system model Experimental evaluation Conclusion
4 Matrix multiplication
ORD-MULT(A, B, C)
for i ← 1 to m
  for j ← 1 to p
    for k ← 1 to n
      C_ij ← C_ij + A_ik · B_kj
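The ORD-MULT loop nest can be written out directly. The following is a minimal sketch, assuming matrices stored as Python lists of lists and zero-based indices; the function name `ord_mult` is chosen here for illustration and does not come from the slides.

```python
def ord_mult(A, B, C):
    """Accumulate the product A·B into C in place (C <- C + A·B).

    A is m x n, B is n x p, C is m x p, all as lists of lists.
    """
    m, n, p = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
ord_mult(A, B, C)
print(C)
```

The innermost loop walks down a column of B, which is the access pattern that hurts when B is stored row by row.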
5 Matrix layout Like in C... Figure: Row major order
6 Matrix layout Like in C... Figure: Row major order. Or like in Fortran... Figure: Column major order
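The two layouts differ only in how a pair (i, j) maps to a position in linear memory. A small sketch, assuming zero-based indices and a dense matrix with `rows` rows and `cols` columns; both helper names are hypothetical.

```python
def row_major_index(i, j, rows, cols):
    """Offset of element (i, j) when rows are stored one after another (C style)."""
    return i * cols + j


def col_major_index(i, j, rows, cols):
    """Offset of element (i, j) when columns are stored one after another (Fortran style)."""
    return j * rows + i


rows, cols = 2, 3
cells = [(i, j) for i in range(rows) for j in range(cols)]
# Order in which the cells appear in memory under each layout.
row_order = sorted(cells, key=lambda c: row_major_index(c[0], c[1], rows, cols))
col_order = sorted(cells, key=lambda c: col_major_index(c[0], c[1], rows, cols))
print(row_order)
print(col_order)
```

Neighbours within a row are adjacent in the row-major layout, while neighbours within a column are adjacent in the column-major layout.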
7 Cache friendly algorithm
BLOCK-MULT(A, B, C, n)
for i ← 1 to n/s
  for j ← 1 to n/s
    for k ← 1 to n/s
      ORD-MULT(A_ik, B_kj, C_ij, s)
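A tiled version of the same computation might look as follows. This is a sketch under assumptions, not the slide's exact procedure: the tile size `s` is passed explicitly, matrices are n × n lists of lists, and edge tiles are clipped with `min()` so that the code also works when s does not divide n (the pseudocode above assumes it does).

```python
def block_mult(A, B, C, n, s):
    """C <- C + A·B for n x n matrices, processed in tiles of side s."""
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Ordinary multiplication restricted to one tile triple.
                for i in range(i0, min(i0 + s, n)):
                    for j in range(j0, min(j0 + s, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + s, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


n = 5
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
block_mult(A, B, C, n, 2)
print(C[0])
```

The tuning parameter `s` is what makes this algorithm cache aware: a good value depends on the cache size of the target machine.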
8 BLOCK-MULT issues Being cache aware is hard: Cumbersome structure. Complicated choice of s. A badly chosen s is expensive. Problematic if n mod s ≠ 0.
9 Motivation Keeping an algorithm simple is nice, but cache efficiency is a must.
11 System model
Figure 1: The ideal-cache model (a CPU performing W work; a cache of Z words organized by an optimal replacement strategy into cache lines of length L; Q cache misses to main memory).
Two-level memory. Fully associative. Strictly optimal replacement. Automatic replacement.
Tall cache: $Z = \Omega(L^2)$, where Z is the number of words in the cache and L is the number of words in a cache line.
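12 Matrix multiplication
Given: A[m × n], B[n × p], C[m × p]
\[ \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}, \quad m \ge \max(n, p) \tag{1} \]
\[ \begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2, \quad n \ge \max(m, p) \tag{2} \]
\[ A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}, \quad p \ge \max(n, m) \tag{3} \]
\[ C_{ij} := C_{ij} + A_{ik} B_{kj}, \quad m = n = p = 1 \tag{4} \]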
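The four cases translate into a recursion that always halves the largest dimension. The sketch below is one possible rendering, assuming lists of lists and zero-based index ranges instead of submatrix copies; the name `rec_mult` and its offset parameters are illustrative choices, not taken from the slides.

```python
def rec_mult(A, B, C, i0=0, m=None, k0=0, n=None, j0=0, p=None):
    """C[i0:i0+m, j0:j0+p] += A[i0:i0+m, k0:k0+n] · B[k0:k0+n, j0:j0+p]."""
    if m is None:
        m, n, p = len(A), len(B), len(B[0])
    if m == 0 or n == 0 or p == 0:
        return
    if m == n == p == 1:                 # case (4): scalar update
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif m >= max(n, p):                 # case (1): split the rows of A
        h = m // 2
        rec_mult(A, B, C, i0, h, k0, n, j0, p)
        rec_mult(A, B, C, i0 + h, m - h, k0, n, j0, p)
    elif n >= max(m, p):                 # case (2): split the shared dimension
        h = n // 2
        rec_mult(A, B, C, i0, m, k0, h, j0, p)
        rec_mult(A, B, C, i0, m, k0 + h, n - h, j0, p)
    else:                                # case (3): split the columns of B
        h = p // 2
        rec_mult(A, B, C, i0, m, k0, n, j0, h)
        rec_mult(A, B, C, i0, m, k0, n, j0 + h, p - h)


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
B = [[1, 0], [0, 1], [2, 3], [4, 5]]
C = [[0, 0] for _ in range(3)]
rec_mult(A, B, C)
print(C)
```

No cache parameter appears anywhere in the code, which is the point of the cache-oblivious formulation. A practical implementation would stop the recursion at a larger base case to reduce call overhead.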
13 Bounds
REC-MULT: Work $\Theta(n^3)$; cache misses $\Theta(n + n^2/L + n^3/(L\sqrt{Z}))$
vs. BLOCK-MULT: Work $\Theta(n^3)$; cache misses $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$
vs. Strassen's [2] (cache oblivious): Work $\Theta(n^{\log_2 7})$; cache misses $\Theta(1 + n^2/L + n^{\log_2 7}/(L\sqrt{Z}))$
14 Matrix transposition
Given: A[m × n], B[n × m]
\[ A = \begin{pmatrix} A_1 & A_2 \end{pmatrix}, \quad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} \tag{5} \]
15 Bounds
REC-TRANSPOSE: Work $\Theta(nm)$; cache misses $\Theta(1 + mn/L)$; asymptotically optimal
Naïve: Work $\Theta(nm)$; cache misses $\Theta(nm)$
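The recursive transposition can be sketched as follows, assuming lists of lists, zero-based index ranges, and a split along whichever dimension is currently larger. The name `rec_transpose` and the offset parameters are hypothetical.

```python
def rec_transpose(A, B, i0=0, m=None, j0=0, n=None):
    """Write the transpose of A[i0:i0+m, j0:j0+n] into B[j0:j0+n, i0:i0+m]."""
    if m is None:
        m, n = len(A), len(A[0])
    if m == 0 or n == 0:
        return
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]
    elif n >= m:
        # A = (A1 A2): split the columns of A, i.e. the rows of B.
        h = n // 2
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)
    else:
        # Split the rows of A, i.e. the columns of B.
        h = m // 2
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)


A = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
B = [[0] * 3 for _ in range(5)]
rec_transpose(A, B)
print(B)
```

Once a subproblem is small enough for both of its pieces to fit in cache, all further recursion on it runs without additional misses, which is the intuition behind the Θ(1 + mn/L) bound.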
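16 Discrete Fourier Transform (DFT)
Compute:
\[ Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}, \quad \text{where } \omega_n = e^{2\pi\sqrt{-1}/n} \]
Assume $n = 2^k$, $k \in \mathbb{N}$. Choose $n_1 = 2^{\lceil (\log_2 n)/2 \rceil}$, $n_2 = 2^{\lfloor (\log_2 n)/2 \rfloor}$.
Factorized Y (Cooley-Tukey algorithm), with $0 \le i_1, j_1 < n_1$ and $0 \le i_2, j_2 < n_2$:
\[ Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\,\omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2} \]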
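The factorization can be checked with a short recursive routine. This sketch assumes the sign convention $\omega_n^{-ij}$ in the exponent and an input length that is a power of two. It only illustrates the index arithmetic: it copies strided slices rather than transposing in place, so it does not by itself achieve the cache-oblivious bound. The name `ct_fft` is hypothetical.

```python
import cmath


def ct_fft(X):
    """DFT of X (length a power of two) via the n = n1 * n2 factorization."""
    n = len(X)
    if n == 1:
        return list(X)
    if n == 2:
        return [X[0] + X[1], X[0] - X[1]]
    k = n.bit_length() - 1            # n = 2**k
    n1 = 1 << ((k + 1) // 2)          # 2**ceil(k/2)
    n2 = 1 << (k // 2)                # 2**floor(k/2)
    w = cmath.exp(-2j * cmath.pi / n)  # omega_n**(-1)
    # Inner sums: for each j2, a length-n1 DFT over X[j1*n2 + j2], j1 = 0..n1-1.
    inner = [ct_fft(X[j2::n2]) for j2 in range(n2)]   # inner[j2][i1]
    Y = [0j] * n
    for i1 in range(n1):
        # Twiddle by omega_n**(-i1*j2), then a length-n2 DFT over j2.
        col = ct_fft([inner[j2][i1] * w ** (i1 * j2) for j2 in range(n2)])
        for i2 in range(n2):
            Y[i1 + i2 * n1] = col[i2]
    return Y


X = [complex(t, (t * t) % 3) for t in range(8)]
Y = ct_fft(X)
print([round(abs(y), 3) for y in Y])
```

Comparing the result with a direct evaluation of the defining sum is a quick way to confirm that the three twiddle factors are placed correctly.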
17 Sorting Mergesort is not optimal with respect to cache misses. 1. Funnelsort 2. Distribution sort Recursive Asymptotically cache-optimal Not every recursive sort is cache optimal
18 Funnelsort 1. Split the input into $n^{1/3}$ arrays of size $n^{2/3}$, and sort these arrays recursively. 2. Merge the $n^{1/3}$ sorted sequences using an $n^{1/3}$-merger.
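The two steps give the following recursion skeleton. Note the substitution: this sketch uses the standard-library `heapq.merge` as a stand-in for the k-merger, so it reproduces the control flow of funnelsort but not its cache behaviour. The function name and the small-input cutoff of 8 are assumptions made for illustration.

```python
import heapq
import random


def funnelsort_sketch(a):
    """Sort a list with funnelsort's recursion shape (merger replaced by heapq.merge)."""
    n = len(a)
    if n <= 8:                       # small inputs: sort directly (assumed cutoff)
        return sorted(a)
    size = max(1, round(n ** (2 / 3)))   # segment length of about n^(2/3)
    # Step 1: about n^(1/3) segments, each sorted recursively.
    runs = [funnelsort_sketch(a[i:i + size]) for i in range(0, n, size)]
    # Step 2: merge all sorted runs (stand-in for the n^(1/3)-merger).
    return list(heapq.merge(*runs))


random.seed(0)
data = [random.randint(0, 999) for _ in range(500)]
result = funnelsort_sketch(data)
print(result[:5], result[-5:])
```

The real algorithm gets its cache efficiency from the recursive layout of the merger and its buffers, which a heap-based merge does not replicate.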
19 k-merger Figure 3: Illustration of a k-merger. A k-merger is built recursively out of $\sqrt{k}$ left $\sqrt{k}$-mergers $L_1, L_2, \ldots, L_{\sqrt{k}}$, a series of buffers, and one right $\sqrt{k}$-merger R.
20 Bounds Work: $O(n \log_2 n)$. Optimal cache misses: $O(1 + (n/L)(1 + \log_Z n))$
21 Relieved system model LRU Θ(Q(n; Z; L)) Multilevel cache inclusive cache
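The effect of replacing the optimal policy with LRU can be explored with a tiny simulator. The sketch below assumes a fully associative cache of Z words with lines of L words and counts the misses of a row-by-row versus a column-by-column scan of an n × n row-major array. The class `LRUCache` and the function `scan_misses` are hypothetical helpers written for this illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache of Z words, lines of L words, LRU replacement."""

    def __init__(self, Z, L):
        self.L = L
        self.capacity = Z // L        # number of cache lines
        self.lines = OrderedDict()    # line number -> True, oldest first
        self.misses = 0

    def access(self, addr):
        line = addr // self.L
        if line in self.lines:
            self.lines.move_to_end(line)        # mark as most recently used
        else:
            self.misses += 1
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def scan_misses(n, Z, L, by_rows):
    """Misses when reading every element of an n x n row-major array once."""
    cache = LRUCache(Z, L)
    for a in range(n):
        for b in range(n):
            i, j = (a, b) if by_rows else (b, a)
            cache.access(i * n + j)
    return cache.misses


print(scan_misses(64, 256, 8, by_rows=True))
print(scan_misses(64, 256, 8, by_rows=False))
```

With these parameters the row scan misses once per cache line, while the column scan misses on every access because a single column touches more lines than the cache can hold.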
23 Micro-benchmarks
Figure 4: Average time to transpose an N × N matrix, divided by N^2 (time in microseconds vs. N; series: iterative, recursive).
Figure 5: Average time taken to multiply two N × N matrices, divided by N^3 (time in microseconds vs. N; series: iterative, recursive).
24 Real benchmarks [1] Figure: Cache misses per lookup for static search algorithms (average number of cache misses per lookup vs. number of items; series: Classic Binary Search, Cache Oblivious, and Cache Aware, each in Explicit and Implicit variants).
25 Real benchmarks [1] Figure: Instruction count per lookup for static search algorithms (average number of instructions per lookup vs. number of items; same six series).
26 Real benchmarks [1] Figure: Execution time on Windows for static search algorithms (time in microseconds per lookup vs. number of items; same six series).
28 FFMK tribute slide... FFTW library, which uses a recursive strategy to exploit caches in Fourier transform calculations. FFTW's code generator produces straight-line codelets, which are coarsened base cases for the FFT algorithm. Because these codelets are cache oblivious, a C compiler can perform its register allocation efficiently, and yet the codelets can be generated without knowing the number of registers on the target architecture.
29 Open questions Is there a gap in asymptotic complexity? Is there a limit as to how much better a cache-aware algorithm can be?
30 Conclusion Seem to be slower Provide cache optimality without knowing cache size Based on recursion
31 [1] Richard E. Ladner, Ray Fortna, and Bao-Hoang Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics. Springer.
[2] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4), 1969.
3.2 Cache Oblivious Algorithms
3.2 Cache Oblivious Algorithms Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science,
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms Matteo Frigo, Charles Leiserson, Harald Prokop, Sridhar Ramchandran Slides Written and Presented by William Kuszmaul THE DISK ACCESS MODEL Three Parameters: B M P Block Size
More informationCache-Oblivious Algorithms EXTENDED ABSTRACT
Cache-Oblivious Algorithms EXTENDED ABSTRACT Matteo Frigo Charles E. Leiserson Harald Prokop Sridhar Ramachandran MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139 fathena,cel,prokop,sridharg@supertech.lcs.mit.edu
More informationMemory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005
Memory Management Algorithms on Distributed Systems Katie Becker and David Rodgers CS425 April 15, 2005 Table of Contents 1. Introduction 2. Coarse Grained Memory 2.1. Bottlenecks 2.2. Simulations 2.3.
More informationCache-Oblivious Algorithms A Unified Approach to Hierarchical Memory Algorithms
Cache-Oblivious Algorithms A Unified Approach to Hierarchical Memory Algorithms Aarhus University Cache-Oblivious Current Trends Algorithms in Algorithms, - A Unified Complexity Approach to Theory, Hierarchical
More informationCache Friendly Sparse Matrix Vector Multilication
Cache Friendly Sparse Matrix Vector Multilication Sardar Anisual Haque 1, Shahadat Hossain 2, Marc Moreno Maza 1 1 University of Western Ontario, London, Ontario (Canada) 2 Department of Computer Science,
More informationCache-Efficient Algorithms
6.172 Performance Engineering of Software Systems LECTURE 8 Cache-Efficient Algorithms Charles E. Leiserson October 5, 2010 2010 Charles E. Leiserson 1 Ideal-Cache Model Recall: Two-level hierarchy. Cache
More informationReport Seminar Algorithm Engineering
Report Seminar Algorithm Engineering G. S. Brodal, R. Fagerberg, K. Vinther: Engineering a Cache-Oblivious Sorting Algorithm Iftikhar Ahmad Chair of Algorithm and Complexity Department of Computer Science
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationNetwork-oblivious algorithms. Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci and Francesco Silvestri
Network-oblivious algorithms Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci and Francesco Silvestri Overview Motivation Framework for network-oblivious algorithms Case studies: Network-oblivious
More informationCache Memories, Cache Complexity
Cache Memories, Cache Complexity Marc Moreno Maza University of Western Ontario, London (Canada) Applications of Computer Algebra Session on High-Performance Computer Algebra Jerusalem College of Technology,
More informationCS3350B Computer Architecture
CS3350B Computer Architecture Winter 2015 Lecture 3.1: Memory Hierarchy: What and Why? Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and Design, Patterson
More informationAlgorithms and Computation in Signal Processing
Algorithms and Computation in Signal Processing special topic course 18-799B spring 2005 14 th Lecture Feb. 24, 2005 Instructor: Markus Pueschel TA: Srinivas Chellappa Course Evaluation Email sent out
More informationCS473 - Algorithms I
CS473 - Algorithms I Lecture 4 The Divide-and-Conquer Design Paradigm View in slide-show mode 1 Reminder: Merge Sort Input array A sort this half sort this half Divide Conquer merge two sorted halves Combine
More informationMassive Data Algorithmics. Lecture 12: Cache-Oblivious Model
Typical Computer Hierarchical Memory Basics Data moved between adjacent memory level in blocks A Trivial Program A Trivial Program: d = 1 A Trivial Program: d = 1 A Trivial Program: n = 2 24 A Trivial
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms by Harald Prokop Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science at
More informationCache Memories, Cache Complexity
Cache Memories, Cache Complexity Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS3101 and CS4402-9535 Plan Hierarchical memories and their impact on our programs Cache Analysis
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationThe History of I/O Models Erik Demaine
The History of I/O Models Erik Demaine MASSACHUSETTS INSTITUTE OF TECHNOLOGY Memory Hierarchies in Practice CPU 750 ps Registers 100B Level 1 Cache 100KB Level 2 Cache 1MB 10GB 14 ns Main Memory 1EB-1ZB
More informationI/O Model. Cache-Oblivious Algorithms : Algorithms in the Real World. Advantages of Cache-Oblivious Algorithms 4/9/13
I/O Model 15-853: Algorithms in the Real World Locality II: Cache-oblivious algorithms Matrix multiplication Distribution sort Static searching Abstracts a single level of the memory hierarchy Fast memory
More informationLecture 24 November 24, 2015
CS 229r: Algorithms for Big Data Fall 2015 Prof. Jelani Nelson Lecture 24 November 24, 2015 Scribes: Zhengyu Wang 1 Cache-oblivious Model Last time we talked about disk access model (as known as DAM, or
More informationAlgorithm Design and Analysis
Algorithm Design and Analysis LECTURE 13 Divide and Conquer Closest Pair of Points Convex Hull Strassen Matrix Mult. Adam Smith 9/24/2008 A. Smith; based on slides by E. Demaine, C. Leiserson, S. Raskhodnikova,
More informationExperimenting with the MetaFork Framework Targeting Multicores
Experimenting with the MetaFork Framework Targeting Multicores Xiaohui Chen, Marc Moreno Maza & Sushek Shekar University of Western Ontario 26 January 2014 1 Introduction The work reported in this report
More informationNetwork-Oblivious Algorithms. Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci, Michele Scquizzato and Francesco Silvestri
Network-Oblivious Algorithms Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci, Michele Scquizzato and Francesco Silvestri Overview Background Summary of results Framework for network-oblivious algorithms
More informationAlgorithms for dealing with massive data
Computer Science Department Federal University of Rio Grande do Sul Porto Alegre, Brazil Outline of the talk Introduction Outline of the talk Algorithms models for dealing with massive datasets : Motivation,
More informationCache Oblivious Algorithms
Cache Oblivious Algorithms Volker Strumpen IBM Research Austin, TX September 4, 2007 Iterative matrix transposition #define N 1000 double A[N][N], B[N][N]; void iter(void) { int i, j; for (i = 0; i < N;
More informationCache Oblivious Matrix Transposition: Simulation and Experiment
Cache Oblivious Matrix Transposition: Simulation and Experiment Dimitrios Tsifakis, Alistair P. Rendell, and Peter E. Strazdins Department of Computer Science, Australian National University Canberra ACT0200,
More information6.895 Final Project: Serial and Parallel execution of Funnel Sort
6.895 Final Project: Serial and Parallel execution of Funnel Sort Paul Youn December 17, 2003 Abstract The speed of a sorting algorithm is often measured based on the sheer number of calculations required
More informationEffect of memory latency
CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable
More informationMultithreaded Parallelism and Performance Measures
Multithreaded Parallelism and Performance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 (Moreno Maza) Multithreaded Parallelism and Performance Measures CS 3101
More informationLecture 19 Apr 25, 2007
6.851: Advanced Data Structures Spring 2007 Prof. Erik Demaine Lecture 19 Apr 25, 2007 Scribe: Aditya Rathnam 1 Overview Previously we worked in the RA or cell probe models, in which the cost of an algorithm
More informationDIVIDE & CONQUER. Problem of size n. Solution to sub problem 1
DIVIDE & CONQUER Definition: Divide & conquer is a general algorithm design strategy with a general plan as follows: 1. DIVIDE: A problem s instance is divided into several smaller instances of the same
More informationAlgorithm Engineering
Algorithm Engineering Paolo D Alberto Electrical and Computer Engineering Carnegie Mellon University Personal Research Background Embedded and High Performance Computing Compiler: Static and Dynamic Theory
More informationAlgorithms and Data Structures. Algorithms and Data Structures. Algorithms and Data Structures. Algorithms and Data Structures
Richard Mayr Slides adapted from Mary Cryan (2015/16) with some changes. School of Informatics University of Edinburgh ADS (2018/19) Lecture 1 slide 1 ADS (2018/19) Lecture 1 slide 3 ADS (2018/19) Lecture
More informationCache-Adaptive Analysis
Cache-Adaptive Analysis Michael A. Bender1 Erik Demaine4 Roozbeh Ebrahimi1 Jeremy T. Fineman3 Rob Johnson1 Andrea Lincoln4 Jayson Lynch4 Samuel McCauley1 1 3 4 Available Memory Can Fluctuate in Real Systems
More informationChapter 4. Divide-and-Conquer. Copyright 2007 Pearson Addison-Wesley. All rights reserved.
Chapter 4 Divide-and-Conquer Copyright 2007 Pearson Addison-Wesley. All rights reserved. Divide-and-Conquer The most-well known algorithm design strategy: 2. Divide instance of problem into two or more
More informationAlgorithms and Data Structures
Algorithms and Data Structures or, Classical Algorithms of the 50s, 60s, 70s Richard Mayr Slides adapted from Mary Cryan (2015/16) with small changes. School of Informatics University of Edinburgh ADS
More information4. A Comparison of Cache Aware and Cache Oblivious Static Search Trees Using Program Instrumentation
4. A Comparison of Cache Aware and Cache Oblivious Static Search Trees Using Program Instrumentation Richard E. Ladner, Ray Fortna, and Bao-Hoang Nguyen Department of Computer Science & Engineering Universityof
More informationCache Memories. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza. CS2101 October 2012
Cache Memories Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS2101 October 2012 Plan 1 Hierarchical memories and their impact on our 2 Cache Analysis in Practice Plan 1 Hierarchical
More informationCSC Design and Analysis of Algorithms
CSC 8301- Design and Analysis of Algorithms Lecture 6 Divide and Conquer Algorithm Design Technique Divide-and-Conquer The most-well known algorithm design strategy: 1. Divide a problem instance into two
More informationCSC Design and Analysis of Algorithms. Lecture 6. Divide and Conquer Algorithm Design Technique. Divide-and-Conquer
CSC 8301- Design and Analysis of Algorithms Lecture 6 Divide and Conquer Algorithm Design Technique Divide-and-Conquer The most-well known algorithm design strategy: 1. Divide a problem instance into two
More informationCache-Oblivious and Data-Oblivious Sorting and Applications
Cache-Oblivious and Data-Oblivious Sorting and Applications T-H. Hubert Chan, Yue Guo, Wei-Kai Lin, and Elaine Shi Jan, 2018 External Memory Model Cache efficiency: # of blocks Time: # of words Memory
More informationLists Revisited: Cache Conscious STL lists
Lists Revisited: Cache Conscious STL lists Leonor Frias, Jordi Petit, Salvador Roura Departament de Llenguatges i Sistemes Informàtics. Universitat Politècnica de Catalunya. Overview Goal: Improve STL
More informationCSE 638: Advanced Algorithms. Lectures 18 & 19 ( Cache-efficient Searching and Sorting )
CSE 638: Advanced Algorithms Lectures 18 & 19 ( Cache-efficient Searching and Sorting ) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2013 Searching ( Static B-Trees ) A Static
More informationCache-Oblivious Traversals of an Array s Pairs
Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious
More informationAlgorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II
Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language
More informationParallel Algorithms CSE /22/2015. Outline of this lecture: 1 Implementation of cilk for. 2. Parallel Matrix Multiplication
CSE 539 01/22/2015 Parallel Algorithms Lecture 3 Scribe: Angelina Lee Outline of this lecture: 1. Implementation of cilk for 2. Parallel Matrix Multiplication 1 Implementation of cilk for We mentioned
More informationPlan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice
lan Multithreaded arallelism and erformance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 1 2 cilk for Loops 3 4 Measuring arallelism in ractice 5 Announcements
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Cost analysis and performance Instructor: Markus Püschel TA: Gagandeep Singh, Daniele Spampinato & Alen Stojanov Technicalities Research project: Let us know (fastcode@lists.inf.ethz.ch)
More informationScheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok
Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation
More information9. Cache Oblivious Algorithms
9. Cache Oblivious Algorithms Piyush Kumar 9.1 Introduction The cache oblivious model is a simple and elegant model to design algorithms that perform well in hierarchical memory models ubiquitous on current
More informationCache-Oblivious Algorithms and Data Structures
Cache-Oblivious Algorithms and Data Structures Erik D. Demaine MIT Laboratory for Computer Science, 200 Technology Square, Cambridge, MA 02139, USA, edemaine@mit.edu Abstract. A recent direction in the
More informationCache Oblivious Matrix Transposition: Simulation and Experiment
Cache Oblivious Matrix Transposition: Simulation and Experiment Dimitrios Tsifakis, Alistair P. Rendell * and Peter E. Strazdins Department of Computer Science Australian National University Canberra ACT0200,
More informationDivide-and-Conquer. Dr. Yingwu Zhu
Divide-and-Conquer Dr. Yingwu Zhu Divide-and-Conquer The most-well known algorithm design technique: 1. Divide instance of problem into two or more smaller instances 2. Solve smaller instances independently
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationDivide-and-Conquer. The most-well known algorithm design strategy: smaller instances. combining these solutions
Divide-and-Conquer The most-well known algorithm design strategy: 1. Divide instance of problem into two or more smaller instances 2. Solve smaller instances recursively 3. Obtain solution to original
More informationCS 140 : Numerical Examples on Shared Memory with Cilk++
CS 140 : Numerical Examples on Shared Memory with Cilk++ Matrix-matrix multiplication Matrix-vector multiplication Hyperobjects Thanks to Charles E. Leiserson for some of these slides 1 Work and Span (Recap)
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Cost analysis and performance Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Technicalities Research project: Let us know (fastcode@lists.inf.ethz.ch)
More informationLecture April, 2010
6.851: Advanced Data Structures Spring 2010 Prof. Eri Demaine Lecture 20 22 April, 2010 1 Memory Hierarchies and Models of Them So far in class, we have wored with models of computation lie the word RAM
More informationIntroduction to Algorithms
Lecture 1 Introduction to Algorithms 1.1 Overview The purpose of this lecture is to give a brief overview of the topic of Algorithms and the kind of thinking it involves: why we focus on the subjects that
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationNUMA-aware Multicore Matrix Multiplication
Parallel Processing Letters c World Scientific Publishing Company NUMA-aware Multicore Matrix Multiplication WAIL Y. ALKOWAILEET Department of Computer Science (Systems), University of California, Irvine,
More informationCache Oblivious Matrix Transpositions using Sequential Processing
IOSR Journal of Engineering (IOSRJEN) e-issn: 225-321, p-issn: 2278-8719 Vol. 3, Issue 11 (November. 213), V4 PP 5-55 Cache Oblivious Matrix s using Sequential Processing korde P.S., and Khanale P.B 1
More informationLecture 7 8 March, 2012
6.851: Advanced Data Structures Spring 2012 Lecture 7 8 arch, 2012 Prof. Erik Demaine Scribe: Claudio A Andreoni 2012, Sebastien Dabdoub 2012, Usman asood 2012, Eric Liu 2010, Aditya Rathnam 2007 1 emory
More information17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer
Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are
More informationInput parameters System specifics, user options. Input parameters size, dim,... FFT Code Generator. Initialization Select fastest execution plan
Automatic Performance Tuning in the UHFFT Library Dragan Mirković 1 and S. Lennart Johnsson 1 Department of Computer Science University of Houston Houston, TX 7724 mirkovic@cs.uh.edu, johnsson@cs.uh.edu
More informationThe M4RI & M4RIE libraries for linear algebra over F 2 and small extensions
The M4RI & M4RIE libraries for linear algebra over F 2 and small extensions Martin R. Albrecht Nancy, March 30, 2011 Outline M4RI Introduction Multiplication Elimination M4RIE Introduction Travolta Tables
More informationCache-efficient string sorting for Burrows-Wheeler Transform. Advait D. Karande Sriram Saroop
Cache-efficient string sorting for Burrows-Wheeler Transform Advait D. Karande Sriram Saroop What is Burrows-Wheeler Transform? A pre-processing step for data compression Involves sorting of all rotations
More informationReport on Cache-Oblivious Priority Queue and Graph Algorithm Applications[1]
Report on Cache-Oblivious Priority Queue and Graph Algorithm Applications[1] Marc André Tanner May 30, 2014 Abstract This report contains two main sections: In section 1 the cache-oblivious computational
More informationSystem Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries
System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay, and Markus Püschel Department of Electrical and Computer
More informationPlan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice
lan Multithreaded arallelism and erformance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 4435 - CS 9624 1 2 cilk for Loops 3 4 Measuring arallelism in ractice 5
More informationReview implementation of Stable Matching Survey of common running times. Turn in completed problem sets. Jan 18, 2019 Sprenkle - CSCI211
Objectives Review implementation of Stable Matching Survey of common running times Turn in completed problem sets Jan 18, 2019 Sprenkle - CSCI211 1 Review: Asymptotic Analysis of Gale-Shapley Alg Not explicitly
More informationCSE Introduction to Parallel Processing. Chapter 5. PRAM and Basic Algorithms
Dr Izadi CSE-40533 Introduction to Parallel Processing Chapter 5 PRAM and Basic Algorithms Define PRAM and its various submodels Show PRAM to be a natural extension of the sequential computer (RAM) Develop
More informationAdvanced Algorithms. Problem solving Techniques. Divide and Conquer הפרד ומשול
Advanced Algorithms Problem solving Techniques. Divide and Conquer הפרד ומשול 1 Divide and Conquer A method of designing algorithms that (informally) proceeds as follows: Given an instance of the problem
More informationTest 1 Review Questions with Solutions
CS3510 Design & Analysis of Algorithms Section A Test 1 Review Questions with Solutions Instructor: Richard Peng Test 1 in class, Wednesday, Sep 13, 2017 Main Topics Asymptotic complexity: O, Ω, and Θ.
More informationLecture 8. Dynamic Programming
Lecture 8. Dynamic Programming T. H. Cormen, C. E. Leiserson and R. L. Rivest Introduction to Algorithms, 3rd Edition, MIT Press, 2009 Sungkyunkwan University Hyunseung Choo choo@skku.edu Copyright 2000-2018