3.2 Cache Oblivious Algorithms
- Laura Greene
2 Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, October 1999, New York, NY, USA.
3 Outline: Cache complexity; Cache-aware algorithms; Cache-oblivious algorithms; Matrix multiplication; Matrix transposition; FFT; Conclusion
4 Assumptions. The memory hierarchy has only two levels: an ideal cache (fully associative, with an optimal replacement strategy, and tall) and a very large main memory.
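5 An Ideal Cache Model. An ideal cache is described by the pair (Z, L), where Z is the total number of words in the cache and L is the number of words in one cache line. The tall-cache assumption is $Z = \Omega(L^2)$.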
6 Cache Complexity. An algorithm with input size n is measured by two quantities: its work complexity W(n), and its cache complexity Q(n; Z, L), the number of cache misses it incurs as a function of Z and L.
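To make Q(n; Z, L) concrete, here is a minimal sketch of a cache-miss counter. It assumes a fully associative cache with LRU replacement, which only stands in for the optimal replacement strategy of the ideal-cache model. The names `count_misses`, `z_words`, and `l_words` are illustrative and do not come from the slides.

```python
from collections import OrderedDict


def count_misses(addresses, z_words, l_words):
    """Count cache misses for a sequence of word addresses.

    Models a fully associative cache of z_words words, organized in lines
    of l_words words. Replacement is LRU (an assumption; the ideal-cache
    model uses optimal replacement).
    """
    num_lines = z_words // l_words
    cache = OrderedDict()  # line id -> None, least recently used first
    misses = 0
    for addr in addresses:
        line = addr // l_words
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict least recently used line
            cache[line] = None
    return misses


# Row-major n x n matrix: element (i, j) lives at address i * n + j.
n, Z, L = 64, 256, 8
row_scan = [i * n + j for i in range(n) for j in range(n)]
col_scan = [i * n + j for j in range(n) for i in range(n)]

scan_misses = count_misses(row_scan, Z, L)     # one miss per cache line
strided_misses = count_misses(col_scan, Z, L)  # one miss per access here
```

With these parameters the row-wise scan misses once per cache line, while the column-wise scan misses on every access, because a column touches more lines than the cache can hold.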
8 Cache-Aware Algorithms. These contain parameters that are tuned to minimize the cache complexity for a particular cache size (Z) and line length (L). The parameters must be readjusted when the algorithm runs on a different platform.
9 Example: a blocked matrix multiplication algorithm. [Figure: an n × n matrix A partitioned into s × s blocks such as A11.] Here s is a tuning parameter chosen to make the algorithm run fast.
10 Example (2): Cache complexity. The three s × s submatrices should fit into the cache, so they occupy $\max(s, s^2/L) = \Theta(s + s^2/L)$ cache lines. Optimal performance is obtained when $s = \Theta(\sqrt{Z})$. $Z/L$ cache misses are needed to bring the 3 submatrices into the cache, and $n^2/L$ cache misses are needed to read the $n^2$ elements. The cache complexity is
$$\Theta\left(1 + n^2/L + (n/s)^3 (Z/L)\right) = \Theta\left(1 + n^2/L + n^3/(L\sqrt{Z})\right).$$
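The blocked scheme from this example can be sketched as follows. This is an illustrative version using plain Python lists, not the Block-MULT code from the paper. The function name `blocked_matmul` is an assumption, and `s` is the tuning parameter that would be set near $\sqrt{Z}$ on a real machine.

```python
def blocked_matmul(a, b, s):
    """Multiply square matrices a and b (lists of lists) using s x s blocks.

    s is the cache-aware tuning parameter: it should be chosen so that
    three s x s blocks fit in the cache at the same time.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply block A[i0.., k0..] by block B[k0.., j0..]
                # and accumulate into block C[i0.., j0..].
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            c[i][j] += aik * b[k][j]
    return c


c = blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], s=2)
```

The result does not depend on `s`; only the memory access pattern does, which is why `s` has to be retuned for each platform.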
12 Cache-Oblivious Algorithms. These have no parameters that depend on the hardware, such as the cache size (Z) or the cache-line length (L). No tuning is needed, so they are platform independent. The algorithms introduced in the following slides are proved to have optimal cache complexity.
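13 Matrix Multiplication. Partition the matrices A and B in half along the largest dimension, where A is n × m and B is m × p. There are three cases: if $n \ge \max(m, p)$, split A horizontally; if $m \ge \max(n, p)$, split A vertically and B horizontally; if $p \ge \max(n, m)$, split B vertically. Proceed recursively until reaching the base case of one element.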
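14 Matrix Multiplication (2). Assume A is n × 4n and B is 4n × n. The inner dimension is the largest, so with $A = (A_1 \; A_2)$ and $B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$ the product is $A B = A_1 B_1 + A_2 B_2$. Splitting each half again, with $A_1 = (A_{11} \; A_{12})$, $A_2 = (A_{21} \; A_{22})$ and B partitioned to match, gives $A B = A_{11} B_{11} + A_{12} B_{12} + A_{21} B_{21} + A_{22} B_{22}$.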
15 Matrix Multiplication (3). Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.
16 Matrix Multiplication (4): Cache complexity. The recursive algorithm achieves the same cache complexity as the cache-aware Block-MULT algorithm. For square matrices, this cache complexity is optimal.
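The recursive partitioning described in the matrix multiplication slides can be sketched as below. This is a simplified illustration under stated assumptions: matrices are lists of lists, submatrices are copied by slicing for readability (a practical version would pass index ranges instead), and the name `rec_matmul` is hypothetical.

```python
def rec_matmul(a, b):
    """Multiply a (n x m) by b (m x p) by halving the largest dimension.

    No cache parameter appears anywhere: the recursion continues down to
    single elements, as on the slide.
    """
    n, m, p = len(a), len(b), len(b[0])
    if n == 1 and m == 1 and p == 1:
        return [[a[0][0] * b[0][0]]]
    if n >= max(m, p):
        # n is largest: split A horizontally and stack the two results.
        h = n // 2
        return rec_matmul(a[:h], b) + rec_matmul(a[h:], b)
    if m >= max(n, p):
        # m is largest: split A vertically, B horizontally: A1*B1 + A2*B2.
        h = m // 2
        left = rec_matmul([row[:h] for row in a], b[:h])
        right = rec_matmul([row[h:] for row in a], b[h:])
        return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(left, right)]
    # p is largest: split B vertically and join the result columns.
    h = p // 2
    left = rec_matmul(a, [row[:h] for row in b])
    right = rec_matmul(a, [row[h:] for row in b])
    return [r1 + r2 for r1, r2 in zip(left, right)]


# The n x 4n times 4n x n case from the slides, with n = 1.
product = rec_matmul([[1, 2, 3, 4]], [[1], [2], [3], [4]])
```

In contrast to the blocked version, there is no parameter to tune; the recursion itself produces subproblems of every size, including one that fits whatever cache is present.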
18 Matrix Transposition: $A \to A^T$, where A is m × n and $B = A^T$ is n × m. The straightforward algorithm is:
for i ← 1 to m
    for j ← 1 to n
        B(j, i) = A(i, j)
If n is very large, the column-wise access to B causes a cache miss on every access (there is no spatial locality in B).
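19 Matrix Transposition (2). Partition the array A along its longer dimension and recursively execute the transpose function. [Figure: $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ is mapped to $A^T = \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}$.]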
20 Matrix Transposition (3): Cache complexity. The recursive algorithm has the optimal cache complexity $Q(m, n) = \Theta(1 + mn/L)$.
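The recursive transposition from the preceding slides can be sketched as follows. It is a minimal out-of-place illustration on lists of lists; the names `rec_transpose` and `transpose` and the index-range arguments are assumptions made for this example.

```python
def rec_transpose(a, b, r0, r1, c0, c1):
    """Write the transpose of a[r0:r1][c0:c1] into b, so b[j][i] = a[i][j]."""
    rows, cols = r1 - r0, c1 - c0
    if rows == 1 and cols == 1:
        b[c0][r0] = a[r0][c0]
    elif rows >= cols:
        # More rows than columns: halve the row range.
        mid = r0 + rows // 2
        rec_transpose(a, b, r0, mid, c0, c1)
        rec_transpose(a, b, mid, r1, c0, c1)
    else:
        # More columns than rows: halve the column range.
        mid = c0 + cols // 2
        rec_transpose(a, b, r0, r1, c0, mid)
        rec_transpose(a, b, r0, r1, mid, c1)


def transpose(a):
    """Return the n x m transpose of the m x n matrix a."""
    m, n = len(a), len(a[0])
    b = [[None] * m for _ in range(n)]
    rec_transpose(a, b, 0, m, 0, n)
    return b


t = transpose([[1, 2, 3], [4, 5, 6]])
```

Because the longer dimension is always halved, the subproblems stay roughly square, so both the block of A being read and the block of B being written eventually fit in cache together.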
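21 Fast Fourier Transform. The discrete Fourier transform (DFT) of X is
$$Y[i] = \sum_{j=0}^{n-1} X[j]\, \omega_n^{-ij}, \qquad \omega_n = e^{2\pi\sqrt{-1}/n}.$$
Use the Cooley-Tukey algorithm. Cooley-Tukey algorithms recursively re-express a DFT of a composite size $n = n_1 n_2$ as: perform $n_2$ DFTs of size $n_1$; multiply by complex roots of unity called twiddle factors; perform $n_1$ DFTs of size $n_2$.
22 Writing $i = i_1 + i_2 n_1$ and $j = j_1 n_2 + j_2$, the sum becomes
$$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\, \omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}.$$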
23 Assume X is stored as a row-major $n_1 \times n_2$ matrix. Steps:
1. Transpose X in place.
2. Compute $n_2$ DFTs.
3. Multiply by the twiddle factors.
4. Transpose X in place.
5. Compute $n_1$ DFTs.
6. Transpose X in place.
24 Fast Fourier Transform: example with n1 = 4, n2 = 2.
1. Transpose to select the n2 DFTs of size n1.
2. Call the FFT recursively with n1 = 2, n2 = 2.
3. Reach the base case and return.
4. Multiply by the twiddle factors.
5. Transpose to select the n1 DFTs of size n2.
6. Transpose and return.
25 Fast Fourier Transform: Cache complexity. The algorithm is optimal for a Cooley-Tukey algorithm when n is an exact power of 2: $Q(n) = O\left(1 + (n/L)(1 + \log_Z n)\right)$.
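The six transpose and DFT steps listed in the FFT slides can be sketched as below. This is an illustrative version under stated assumptions: the input length is a power of 2, the transposes build new lists rather than working in place, the base case is a direct DFT, and the function names (`dft`, `transpose`, `six_step_fft`) are hypothetical. The sign convention uses a negative exponent, matching the standard forward DFT.

```python
import cmath


def dft(x):
    """Direct O(n^2) DFT, used as the base case and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]


def transpose(rows):
    """Out-of-place transpose of a list of equal-length rows."""
    return [list(col) for col in zip(*rows)]


def six_step_fft(x):
    """DFT of the list x, whose length is assumed to be a power of 2."""
    n = len(x)
    if n <= 2:
        return dft(x)
    k = n.bit_length() - 1        # n == 2**k
    n1 = 1 << ((k + 1) // 2)      # n1 and n2 are both close to sqrt(n)
    n2 = n // n1
    mat = [x[r * n2:(r + 1) * n2] for r in range(n1)]  # row-major n1 x n2
    mat = transpose(mat)                               # 1. now n2 x n1
    mat = [six_step_fft(row) for row in mat]           # 2. n2 DFTs of size n1
    mat = [[mat[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
            for i1 in range(n1)] for j2 in range(n2)]  # 3. twiddle factors
    mat = transpose(mat)                               # 4. now n1 x n2
    mat = [six_step_fft(row) for row in mat]           # 5. n1 DFTs of size n2
    mat = transpose(mat)                               # 6. now n2 x n1
    return [v for row in mat for v in row]             # Y[i1 + i2 * n1]


y = six_step_fft([complex(v) for v in range(1, 9)])
```

For n = 8 this factorization gives n1 = 4 and n2 = 2, the same sizes as the example slide. Choosing both factors near the square root of n is what lets each recursive DFT shrink quickly enough to fit in cache.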
26 Other Cache-Oblivious Algorithms: funnelsort; distribution sort; LU decomposition without pivots.
28 Questions. How large is the range of practicality of cache-oblivious algorithms? What are the relative strengths of cache-oblivious and cache-aware algorithms?
29 Practicality of Cache-Oblivious Algorithms. [Figure: average time to transpose an N × N matrix, divided by $N^2$.]
30 Practicality of Cache-Oblivious Algorithms (2). [Figure: average time taken to multiply two N × N matrices, divided by $N^3$.]
31 Question 2. Do cache-oblivious algorithms perform as well as cache-aware algorithms? Example: the FFTW library. No answer yet.
32 References
Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, October 1999, New York, NY, USA.
Cache-Oblivious Algorithms by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June
Optimizing Matrix Multiplication with a Classifier Learning System by Xiaoming Li and María Jesus Garzarán. LCPC