Cache Oblivious Matrix Transpositions using Sequential Processing


IOSR Journal of Engineering (IOSRJEN) e-ISSN: 2250-3021, p-ISSN: 2278-8719 Vol. 3, Issue 11 (November 2013), V4, PP 50-55

Cache Oblivious Matrix Transpositions using Sequential Processing

Korde P.S.(1) and Khanale P.B.(2)
1 Department of Computer Science, Shri Shivaji College, Parbhani (M.S.), India
2 Department of Computer Science, Dnyanopasak College, Parbhani (M.S.), India

Abstract: - Matrix transposition is a fundamental operation in linear algebra and in Fast Fourier Transforms, with applications in numerical analysis, image processing and graphics. The optimal cache oblivious matrix transposition incurs O(1 + N^2/B) cache misses. In this paper we implement a divide-and-conquer algorithm for matrix transposition through a recursive process.

Keywords: - Cache Memory; Cache Oblivious; Cache Hit; Cache Miss; Tag.

I. INTRODUCTION
The most important data structures in algorithms are vectors, matrices and arrays. A typical modern platform features a hierarchical cascade of memories whose capacities and access times increase the further they are from the CPU. In order to amortize the larger cost incurred when referencing data in distant levels of the hierarchy, a block of contiguous data is replicated across the faster levels, either automatically by hardware or by software. The rationale for such a hierarchical organization is that the memory access cost of a computation can be reduced when the same data are frequently reused within a short time interval, and when data stored at consecutive addresses are involved in consecutive operations, two properties known as temporal and spatial locality of reference, respectively. To improve cache performance, the temporal and spatial locality of the accesses to the linearised matrix elements has to be improved; most linear algebra libraries, like BLAS [?], therefore use techniques like loop blocking and loop unrolling [?]. A lot of fine tuning is required to reach optimal cache efficiency on given hardware. The cache oblivious model was proposed in [?] and has since been used in hundreds of research papers. It has become popular among researchers in external memory algorithms, parallel algorithms, data structures and other related fields. The model was designed to capture the hierarchical nature of memory organization. This paper uses the matrix transposition problem to evaluate, on a real machine, algorithms designed to be optimal under this memory model. The purpose of this exercise is to understand how well the asymptotic predictions of the theoretical memory models match the behaviour observed in real memory hierarchies.

II. OTHER RELATED WORK
The external memory model [1] is the original model of two-level memory hierarchies. It consists of a fast memory of size M and a disk holding the data; algorithms can transfer a contiguous block of B data items to or from disk. The textbook data structure in this model is the B-tree, a dynamic dictionary that supports inserts, deletes and predecessor queries in O(log_B N) per operation. Although a number of cache oblivious algorithms have been proposed, to date most of the analyses have been theoretical, with few studies on actual machines. An exception to this is a paper by Chatterjee and Sen [2,1]. Their work is of interest in two respects. First, their timing runs showed that for large dimensions the cache oblivious transposition algorithm was actually the worst of the algorithms compared. Second, it was suggested that the poor performance of the cache oblivious matrix transposition algorithm was related to the associativity of the cache, although this relationship was not fully explored. Recently a new model that combines the simplicity of the two-level model with the realism of more complicated hierarchical models was introduced by Frigo, Leiserson, Prokop and Ramachandran [8]. This model, called the cache oblivious model, enables us to reason about a simple two-level model, but prove results about an unknown, multi-level memory model.
The idea is to avoid any memory-specific parameterization, that is, to design algorithms that do not use any information about memory access times or block sizes.

III. MATRIX TRANSPOSITION
Matrix transposition is a linear algebra operation used in Fourier transforms, and it has applications in numerical analysis, image processing and graphics. It is among the simplest of matrix algorithms.
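As a concrete reference point (our own sketch, not code from the paper), the definition B[j][i] = A[i][j] gives the following out-of-place transpose. With row-major storage the reads of A are sequential, but the writes to B jump by n elements each step, which is what drives the O(N^2) miss count for large matrices:

```c
/* Out-of-place transpose by definition: B = A^T.
 * A and B are n-by-n matrices stored in row-major order.
 * Reads of A have spatial locality; writes to B stride by n
 * elements, so each write may touch a different cache line. */
void transpose_naive(const double *A, double *B, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            B[j * n + i] = A[i * n + j];
}
```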

Algorithm 1: Cache Oblivious Matrix Transposition
Matrix transposition is one of the most common operations on matrices. The following simple algorithm transposes the matrix essentially based on the definition of the transpose, and it does so in place:

For ( i = 1; i < n; i++ )
    For ( j = 0; j < i; j++ )
        Swap A[i][j] and A[j][i];

The matrices are stored in row-major order, that is, the rightmost dimension varies the fastest. In the above, the number of cache misses is O(N^2). The optimal cache oblivious transposition makes O(1 + N^2/B) cache misses.

Algorithm 2: Cache Ignorant Matrix Transposition
This is the simple loop implementation, with no cache-aware techniques. It takes a sub-matrix of the input matrix and transposes it into the output by swapping element pairs:

For ( i = 1; i < n; i++ )
    For ( j = 0; j < i; j++ )
        temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;

In this implementation the inner loop body is executed n(n-1)/2 times, and no special effort is made to use the cache efficiently.

Algorithm 3: Cache Blocked Matrix Transposition
In the blocked transposition algorithm the matrix is effectively divided into small blocks. Two blocks are identified and their data copied into resident variables; the variables are then copied back into the matrix, but in transposed form. The source code is:

For ( i = 1; i < n; i += size )
    For ( j = 1; j < n; j += size )
        Copy block A[i][j] to x
        Copy block A[j][i] to y
        Copy x transposed to A[j][i]
        Copy y transposed to A[i][j]

In this implementation the block edge is given by size, with 2 * size^2 less than the cache size, and size is assumed to divide the matrix dimension n perfectly. In this method each element of the matrix is loaded into registers twice. Depending upon the programming language in which Algorithms 1, 2 and 3 are implemented, the elements of the matrices will be stored in row-major or column-major order, or even using a pointer-based scheme as in C/C++ or Java.
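The abstract describes the cache oblivious transposition as divide and conquer implemented through a recursive process, while the loops printed under Algorithm 1 are the definition-based version. A minimal recursive sketch in C follows (our own code; the base-case threshold BASE is an assumed tuning constant, not a value from the paper). Each recursive call halves the larger dimension, so the base cases touch only a few cache lines regardless of the actual cache size:

```c
#define BASE 16 /* base-case edge length; assumed tuning constant */

/* Transpose the n-by-m submatrix of A with top-left corner (i0, j0)
 * into B, where A and B are N-by-N row-major matrices.  The larger
 * dimension is halved until both fit the base case. */
static void transpose_rec(const double *A, double *B, int N,
                          int i0, int j0, int n, int m)
{
    if (n <= BASE && m <= BASE) {
        for (int i = i0; i < i0 + n; i++)
            for (int j = j0; j < j0 + m; j++)
                B[(long)j * N + i] = A[(long)i * N + j];
    } else if (n >= m) {            /* split the row range */
        transpose_rec(A, B, N, i0, j0, n / 2, m);
        transpose_rec(A, B, N, i0 + n / 2, j0, n - n / 2, m);
    } else {                        /* split the column range */
        transpose_rec(A, B, N, i0, j0, n, m / 2);
        transpose_rec(A, B, N, i0, j0 + m / 2, n, m - m / 2);
    }
}

void transpose_oblivious(const double *A, double *B, int N)
{
    transpose_rec(A, B, N, 0, 0, N, N);
}
```

Because the recursion never consults the cache parameters, the same code achieves the O(1 + N^2/B) bound on any two-level hierarchy.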
As is well known, the resulting programs show rather disappointing performance on most current computers due to poor use of the cache memory. We therefore reformulate the above algorithms as a Sequential Processing Matrix Transposition.

Algorithm 4: Sequential Processing for Matrix Transposition
First we reformulate Algorithms 1, 2 and 3 into the following form:

Set counter = 0
For i = 0 to n-1
    For j = 0 to n-1
        b[counter] = a[j][i]
        counter = counter + 1

Set counter = 0
For i = 0 to n-1
    For j = 0 to n-1
        a[i][j] = b[counter]
        counter = counter + 1

In this algorithm we have used a one-dimensional array b. The first loop copies the values of matrix a into b in column-major order; the second loop transfers the values back into matrix a in row-major order, leaving a transposed. Let us consider a 3-by-3 matrix. The elements of matrix A, a0 through a8, are laid out as in fig 1:

a0 a1 a2
a3 a4 a5
a6 a7 a8
fig 1: Matrix A

The elements of matrix A are indexed as in fig 2, which maps each element to its (row, column) position:

0: (0, 0)   1: (0, 1)   2: (0, 2)
3: (1, 0)   4: (1, 1)   5: (1, 2)
6: (2, 0)   7: (2, 1)   8: (2, 2)
Fig 2: Graph representation of the operations on matrix A

[Fig 3, the processing order of the matrix A elements, is not recoverable from the text.]

Sequential Processing Matrix Transposition: to compute the transpose of matrix A we perform the following two steps:
1) Initialize a single-dimensional array and transfer the elements of matrix A into it in column-major order.
2) Copy the single-dimensional array back into matrix A in row-major order.
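The two steps above translate directly into C. This is our own sketch of Algorithm 4 (the scratch array b is heap-allocated here for generality; error handling is minimal):

```c
#include <stdlib.h>

/* Algorithm 4: copy the n-by-n row-major matrix a into a
 * one-dimensional scratch array b in column-major order, then copy
 * b back into a in row-major order.  After both passes a holds its
 * own transpose. */
void transpose_sequential(double *a, int n)
{
    double *b = malloc((size_t)n * n * sizeof *b);
    if (!b) return;                      /* allocation failed; leave a unchanged */

    int counter = 0;
    for (int i = 0; i < n; i++)          /* pass 1: read a column by column */
        for (int j = 0; j < n; j++)
            b[counter++] = a[j * n + i];

    counter = 0;
    for (int i = 0; i < n; i++)          /* pass 2: write b back row by row */
        for (int j = 0; j < n; j++)
            a[i * n + j] = b[counter++];

    free(b);
}
```

Note that pass 2 writes a sequentially, which is the source of the algorithm's cache friendliness; the strided accesses are confined to the reads in pass 1.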

When the individual operations can be executed in arbitrary order, our goal is to find an optimally localized execution order of the operations on matrix A:

a0 a3 a6
a1 a4 a7
a2 a5 a8
fig 4: Sequential Processing of Matrix A

Experimental Results: We implemented four variants of our algorithms: 1) Cache Oblivious Matrix Transposition, 2) Cache Ignorant Matrix Transposition, 3) Cache Blocked Matrix Transposition, and 4) Sequential Processing for Matrix Transposition. There are several things to note about these algorithms. The Cache Oblivious, Cache Ignorant and Cache Blocked transpositions solve the problem in different orders. The Sequential Processing for Matrix Transposition algorithm gives better performance in terms of cache efficiency: the new method reduces the cache miss ratio. We also measured the execution time with different array sizes. To present the general cache behaviour, we consider 3x3, 4x4 and 5x5 matrices and record the cache miss and cache hit counts in the tables below.

Table 1: Cache misses, 3x3
1 Cache Oblivious Matrix Transposition             6
2 Cache Ignorant Matrix Transposition              6
3 Cache Blocked Matrix Transposition               6
4 Sequential Processing for Matrix Transposition   0

Table 2: Cache misses, 4x4
1 Cache Oblivious Matrix Transposition             10
2 Cache Ignorant Matrix Transposition              10
3 Cache Blocked Matrix Transposition               10
4 Sequential Processing for Matrix Transposition   0

Table 3: Cache misses, 5x5
1 Cache Oblivious Matrix Transposition             15
2 Cache Ignorant Matrix Transposition              15
3 Cache Blocked Matrix Transposition               15
4 Sequential Processing for Matrix Transposition   0

Table 4: Analysis of cache hits and cache misses
Sr.no.  Matrix  Algorithm                                        Cache hits  Cache misses
1       3x3     Cache Oblivious Matrix Transposition             3           6
2       3x3     Cache Ignorant Matrix Transposition              3           6
3       3x3     Cache Blocked Matrix Transposition               3           6
4       3x3     Sequential Processing for Matrix Transposition   9           0
5       4x4     Cache Oblivious Matrix Transposition             6           10
6       4x4     Cache Ignorant Matrix Transposition              6           10
7       4x4     Cache Blocked Matrix Transposition               6           10
8       4x4     Sequential Processing for Matrix Transposition   16          0
9       5x5     Cache Oblivious Matrix Transposition             10          15
10      5x5     Cache Ignorant Matrix Transposition              10          15
11      5x5     Cache Blocked Matrix Transposition               10          15
12      5x5     Sequential Processing for Matrix Transposition   25          0
