Dense Matrix Multiplication


1 Dense Matrix Multiplication
Abhishek Somani, Debdeep Mukhopadhyay
Mentor Graphics, IIT Kharagpur
October 7, 2015
Abhishek, Debdeep (IIT Kgp) Matrix Mult. October 7, 2015 1 / 56

2 Overview
1. The Problem
2. Analysis
3. Improvement
4. Better Cache utilization
5. Cache Oblivious Method
6. Distributed Matrix Multiplication

3 Outline: The Problem

4 Why is matrix multiplication important?

5 Matrix Representation
A single array contains the entire matrix
The matrix is arranged in row-major format
An m × n matrix contains m rows and n columns
A(i, j) is the matrix entry at the i-th row and j-th column of matrix A
It is the (i·n + j)-th entry in the matrix array

6 Triple nested loop

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; ++i) {
            const int iOffset = i*n;
            for (int j = 0; j < n; ++j) {
                double cij = 0.0;
                for (int k = 0; k < n; k++)
                    cij += A[iOffset+k] * B[k*n+j];
                C[iOffset+j] += cij;
            }
        }
    }

Total number of multiplications: n³

7 Row-based data decomposition in matrix C

8 Parallel Multiply

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i) {
            const int iOffset = i*n;
            for (int j = 0; j < n; ++j) {
                double cij = 0.0;
                for (int k = 0; k < n; k++)
                    cij += A[iOffset+k] * B[k*n+j];
                C[iOffset+j] += cij;
            }
        }
    }

9 (Almost) Perfect Scaling for matrix of size

10 How good is the serial performance?

11 How good is the serial performance?
Normalized time becomes almost 4x when the matrix size grows from 1000 to 6000
Experiments done on a 3.2 GHz machine
More than 5 clock cycles taken per double-precision multiplication!!!

12 Outline: Analysis

13 Memory Hierarchy Model for analysis
L cache lines of capacity m double-precision numbers each
Tall Cache assumption: L > m
Replacement policy: Least Recently Used (LRU)
No hardware prefetching

14 Memory Access Pattern during multiplication
A, B and C are square matrices of size n × n
n is large, i.e., n > L

15 Memory Access Pattern for A
Sequential access: accessing a row requires n/m cache misses

16 Memory Access Pattern for B
Strided access: accessing a column requires n cache misses

17 Total cache misses
For computing each C(i, j), the number of cache misses: 1 + n/m + n
If n < mL, total cache misses: 2n²/m + n³
If n > mL, total cache misses: n²/m + n³/m + n³

18 Is n < mL a practical assumption?
A 64-byte cache line size means m = 8 doubles
A 256 KB L2 cache means mL = 32768
For practical problems n < 10-15k
Thus n < mL, and the total cache misses: 2n²/m + n³ = Θ(n³)
Can this be improved?

19 Outline: Improvement

20 Alternate Memory Access Pattern
For computing the row C(i, :), cache misses: 2n/m + n²/m
Total cache misses: 2n²/m + n³/m = Θ(n³/m)

21 Improved Multiply

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; ++i) {
            const int iOffset = i*n;
            for (int k = 0; k < n; k++) {
                const int kOffset = k*n;
                for (int j = 0; j < n; ++j)
                    C[iOffset+j] += A[iOffset+k] * B[kOffset+j];
            }
        }
    }

Triple-nested loop with the order of (i, j, k) changed to (i, k, j)

22 ikj versus ijk

23 (Almost) Perfect Scaling for matrix of size

24 Outline: Better Cache utilization

25 Blocking / Tiling
Assumptions for analysis: n % b = 0 and b % m = 0
Cache misses in loading one b × b block: b²/m
Cache misses in computing one block of C: b²/m + (n/b)·(2b²/m) = b²/m + 2nb/m
Total cache misses: (n/b)²·(b²/m + 2nb/m) = n²/m + 2n³/(mb) = Θ(n³/(mb))

26 Choosing blocking parameter b
The 3 blocks for A, B and C should just fit in the cache:
3b² = mL, i.e., b = √(mL/3)
For an L1 cache of capacity 32 KB, mL = 4096 and b ≈ 36.9
A good value for b is 32
Total cache misses: 2n³/(mb) = 2√3·n³/(m·√(mL)) = Θ(n³/(m·√(Cache Size)))

27 Tiled Multiply

    #define min(a,b) (((a) < (b)) ? (a) : (b))  /* C has no built-in min */

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; i += BLOCK_SIZE) {
            const int iOffset = i * n;
            for (int j = 0; j < n; j += BLOCK_SIZE) {
                for (int k = 0; k < n; k += BLOCK_SIZE) {
                    /* Correct block dimensions if block "goes off edge of" the matrix */
                    int M = min(BLOCK_SIZE, n-i);
                    int N = min(BLOCK_SIZE, n-j);
                    int K = min(BLOCK_SIZE, n-k);
                    /* Perform individual block dgemm */
                    do_block(n, M, N, K, A + iOffset + k, B + k*n + j, C + iOffset + j);
                }
            }
        }
    }

28 Tiled Multiply...

    static void do_block(int n, int M, int N, int K, double* A, double* B, double* C)
    {
        for (int i = 0; i < M; ++i) {
            const int iOffset = i*n;
            for (int j = 0; j < N; ++j) {
                double cij = 0.0;
                for (int k = 0; k < K; ++k)
                    cij += A[iOffset+k] * B[k*n+j];
                C[iOffset+j] += cij;
            }
        }
    }

29 Tiled versus Normal

30 Tiled MT scaling

31 Instruction Level Parallelism
Given that the data being worked upon is now available in the cache closest to the processor, we can exploit some ILP
ILP kicks in when a significant amount of independent work is available in a single block of code
Loop unrolling can help us achieve that
Compilers also unroll loops, but in this case there are too many nesting levels for the compiler to do the correct thing

32 Tiled Multiply with unrolling

Before:

    for (int k = 0; k < K; ++k)
        cij += A[iOffset+k] * B[k*n+j];

After (assuming K is a multiple of 8):

    for (int k = 0; k < K; k += 8) {
        const double d0 = A[iOffset+k]   * B[k*n+j];
        const double d1 = A[iOffset+k+1] * B[(k+1)*n+j];
        const double d2 = A[iOffset+k+2] * B[(k+2)*n+j];
        const double d3 = A[iOffset+k+3] * B[(k+3)*n+j];
        const double d4 = A[iOffset+k+4] * B[(k+4)*n+j];
        const double d5 = A[iOffset+k+5] * B[(k+5)*n+j];
        const double d6 = A[iOffset+k+6] * B[(k+6)*n+j];
        const double d7 = A[iOffset+k+7] * B[(k+7)*n+j];
        cij += (d0 + d1 + d2 + d3 + d4 + d5 + d6 + d7);
    }

33 Tiled Multiply with unrolling...

34 What about the L2 cache?
In addition to L1, blocking can be done for the L2 cache also: a 2-level tiled code
For an L2 cache of capacity 256 KB, mL = 32768 and b ≈ 104.5, i.e., 105
The L2 cache is shared between data and instruction memory accesses
Finding the best L2 tile size is non-trivial and requires experimentation with different sizes
We choose 88 as the block size for L2

35 Tiling for L1 and L2

36 Unrolling diminishes the difference

37 Multithreading shock

38 Automatic tuning
Manual optimization and tuning is tedious and error-prone
The entire process needs to be redone in full for any new architecture
Multi-threaded optimization adds further complexity
Instead, code is generated automatically by parameterized code generators:
  ATLAS (Automatically Tuned Linear Algebra Software)
  PHiPAC (Portable High Performance ANSI C)
Essentially a search problem

39 Outline: Cache Oblivious Method

40 A different formulation

C = A·B

    [ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
    [ C21 C22 ] = [ A21 A22 ] · [ B21 B22 ]

C11 = A11·B11 + A12·B21
C12 = A11·B12 + A12·B22
C21 = A21·B11 + A22·B21
C22 = A21·B12 + A22·B22

Solve for C11, C12, C21, C22 recursively
Recursion formula: W(n) = 8·W(n/2) + Θ(1), so W(n) = Θ(n³)

41 How is it Cache oblivious?

42 Cache Miss Analysis

    Q(n) = Θ(n²/m)          if n² < cmL, where 0 < c ≤ 1
    Q(n) = 8·Q(n/2) + Θ(1)  otherwise

Recursion tree: depth 0 has 1 node costing O(1); depth 1 has 8 nodes; depth 2 has 8² nodes; ...; depth d has 8^d leaves, each incurring O(cL) cache misses.

43 Cache Miss Analysis...
Let n_d be the value of n at depth d; the recursion bottoms out when n_d = √(cmL)
n_0 = n and n_i = n_{i-1}/2 = n/2^i, therefore n_d = n/2^d
Thus 2^d = n/√(cmL), d = lg(n/√(cmL)), and the number of leaves l = 8^d = n³/(cmL)^(3/2)
Number of cache misses at leaf level: l·Θ(cL) = Θ(n³/(m·√(cmL)))
Total number of cache misses: Θ(n³/(m·√(mL))) = Θ(n³/(m·√(Cache Size)))
Recall that the total number of cache misses for tiled multiplication was also found to be Θ(n³/(m·√(Cache Size)))

44 Recursive Multiply

    /* n: row stride of the full matrices; m: size of the current block */
    void multiply(const int n, const int m, double* A, double* B, double* C)
    {
        // Base cases
        if (m == 2) {
            basecase2(n, m, A, B, C);  /* hand-written 2x2 block multiply */
            return;
        } else if (m == 1) {
            C[0] += A[0] * B[0];
            return;
        }
        // Recursive multiply
        const int offset12 = m/2;
        const int offset21 = n*m/2;
        const int offset22 = offset21 + offset12;
        multiply(n, m/2, A, B, C);
        multiply(n, m/2, A+offset12, B+offset21, C);
        multiply(n, m/2, A, B+offset12, C+offset12);
        multiply(n, m/2, A+offset12, B+offset22, C+offset12);
        multiply(n, m/2, A+offset21, B, C+offset21);
        multiply(n, m/2, A+offset22, B+offset21, C+offset21);
        multiply(n, m/2, A+offset21, B+offset12, C+offset22);
        multiply(n, m/2, A+offset22, B+offset22, C+offset22);
        return;
    }

45 Recursive multiply serial performance
Can be made more consistent by writing base-case code for matrices of sizes > 2

46 How far have we come?

47 Parallelization
How about 6, 10, 12, 14, 18 or 20 threads?

48 OpenMP task construct

    #pragma omp parallel
    {
        #pragma omp single
        {
            ...
            #pragma omp task
            { ... }
            #pragma omp task
            { ... }
        }
    }

Tasks generated by a single thread
Scheduling of tasks on available threads managed by OpenMP
Useful in recursive functions, linked lists

49 task based recursive multiplication

    const int offset12 = m/2;
    const int offset21 = n*m/2;
    const int offset22 = offset21 + offset12;
    #pragma omp task
    {
        multiply(n, m/2, A, B, C);
        multiply(n, m/2, A+offset12, B+offset21, C);
    }
    #pragma omp task
    {
        multiply(n, m/2, A, B+offset12, C+offset12);
        multiply(n, m/2, A+offset12, B+offset22, C+offset12);
    }
    #pragma omp task
    {
        multiply(n, m/2, A+offset21, B, C+offset21);
        multiply(n, m/2, A+offset22, B+offset21, C+offset21);
    }
    #pragma omp task
    {
        multiply(n, m/2, A+offset21, B+offset12, C+offset22);
        multiply(n, m/2, A+offset22, B+offset22, C+offset22);
    }
    #pragma omp taskwait  /* wait for all four tasks before C is used */

The two calls that update the same quadrant of C are grouped into one task, so they run sequentially within it.

50 Recursive Multiplication Scaling

51 Improved Parallel Recursive Multiplication
Too many tasks of varying sizes
Assignment of tasks to threads is arbitrary; does not consider locality
At present, gcc support for OpenMP tasks is not great
Explicit control of task creation: create tasks only at a pre-computed depth in the recursion tree

    int criticaldepth = 0;
    int leaves = 1;
    const int branches = 4;  /* tasks spawned per recursion level */
    while (2*numthreads > leaves) {
        criticaldepth++;
        leaves *= branches;
    }

52 Improved Parallel Recursive Multiplication...

    if (depth == criticaldepth) {
        #pragma omp task
        {
            multiply(n, m/2, A, B, C, depth+1, criticaldepth);
            multiply(n, m/2, A+offset12, B+offset21, C, depth+1, criticaldepth);
        }
        #pragma omp task
        {
            multiply(n, m/2, A, B+offset12, C+offset12, depth+1, criticaldepth);
            multiply(n, m/2, A+offset12, B+offset22, C+offset12, depth+1, criticaldepth);
        }
        #pragma omp task
        {
            multiply(n, m/2, A+offset21, B, C+offset21, depth+1, criticaldepth);
            multiply(n, m/2, A+offset22, B+offset21, C+offset21, depth+1, criticaldepth);
        }
        #pragma omp task
        {
            multiply(n, m/2, A+offset21, B+offset12, C+offset22, depth+1, criticaldepth);
            multiply(n, m/2, A+offset22, B+offset22, C+offset22, depth+1, criticaldepth);
        }
    } else {
        multiply(n, m/2, A, B, C, depth+1, criticaldepth);
        multiply(n, m/2, A+offset12, B+offset21, C, depth+1, criticaldepth);
        multiply(n, m/2, A, B+offset12, C+offset12, depth+1, criticaldepth);
        multiply(n, m/2, A+offset12, B+offset22, C+offset12, depth+1, criticaldepth);
        multiply(n, m/2, A+offset21, B, C+offset21, depth+1, criticaldepth);
        multiply(n, m/2, A+offset22, B+offset21, C+offset21, depth+1, criticaldepth);
        multiply(n, m/2, A+offset21, B+offset12, C+offset22, depth+1, criticaldepth);
        multiply(n, m/2, A+offset22, B+offset22, C+offset22, depth+1, criticaldepth);
    }

53 Improved Parallel Recursive Multiplication...

54 Outline: Distributed Matrix Multiplication

55 Key differences from shared memory parallel methods
Matrices can be much larger
Each node in the network holds only a part of the matrices
Focus is on reducing inter-process communication and on scalability
All methods discussed so far are applicable to the local multiplication happening at a node

56 In future classes
Network topologies
Analysis of the cost of communication in different topologies
Isoefficiency function to understand scalability with the number of processors and the problem size
Distributed methods


More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Cache Memories. Andrew Case. Slides adapted from Jinyang Li, Randy Bryant and Dave O Hallaron

Cache Memories. Andrew Case. Slides adapted from Jinyang Li, Randy Bryant and Dave O Hallaron Cache Memories Andrew Case Slides adapted from Jinyang Li, Randy Bryant and Dave O Hallaron 1 Topics Cache memory organiza3on and opera3on Performance impact of caches 2 Cache Memories Cache memories are

More information

The course that gives CMU its Zip! Memory System Performance. March 22, 2001

The course that gives CMU its Zip! Memory System Performance. March 22, 2001 15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache

More information

Cache Memories. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza. CS2101 October 2012

Cache Memories. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza. CS2101 October 2012 Cache Memories Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS2101 October 2012 Plan 1 Hierarchical memories and their impact on our 2 Cache Analysis in Practice Plan 1 Hierarchical

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Review: Major Components of a Computer Processor Devices Control Memory Input Datapath Output Secondary Memory (Disk) Main Memory Cache Performance

More information

ECE 498AL. Lecture 12: Application Lessons. When the tires hit the road

ECE 498AL. Lecture 12: Application Lessons. When the tires hit the road ECE 498AL Lecture 12: Application Lessons When the tires hit the road 1 Objective Putting the CUDA performance knowledge to work Plausible strategies may or may not lead to performance enhancement Different

More information

Performance Issues in Parallelization Saman Amarasinghe Fall 2009

Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

Lecture 4: OpenMP Open Multi-Processing

Lecture 4: OpenMP Open Multi-Processing CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

Example. How are these parameters decided?

Example. How are these parameters decided? Example How are these parameters decided? Comparing cache organizations Like many architectural features, caches are evaluated experimentally. As always, performance depends on the actual instruction mix,

More information

Concurrent Programming with OpenMP

Concurrent Programming with OpenMP Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed

More information

Overcoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science

Overcoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science Overcoming the Barriers to Sustained Petaflop Performance William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp But First Are we too CPU-centric? What about I/O? What do applications

More information

Carnegie Mellon. Cache Memories. Computer Architecture. Instructor: Norbert Lu1enberger. based on the book by Randy Bryant and Dave O Hallaron

Carnegie Mellon. Cache Memories. Computer Architecture. Instructor: Norbert Lu1enberger. based on the book by Randy Bryant and Dave O Hallaron Cache Memories Computer Architecture Instructor: Norbert Lu1enberger based on the book by Randy Bryant and Dave O Hallaron 1 Today Cache memory organiza7on and opera7on Performance impact of caches The

More information

CISC 360. Cache Memories Nov 25, 2008

CISC 360. Cache Memories Nov 25, 2008 CISC 36 Topics Cache Memories Nov 25, 28 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Cache memories are small, fast SRAM-based

More information

Parallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop

Parallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop Parallelizing The Matrix Multiplication 6/10/2013 LONI Parallel Programming Workshop 2013 1 Serial version 6/10/2013 LONI Parallel Programming Workshop 2013 2 X = A md x B dn = C mn d c i,j = a i,k b k,j

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve PTHREADS pthread_create, pthread_exit, pthread_join Mutex: locked/unlocked; used to protect access to shared variables (read/write) Condition variables: used to allow threads

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Cache-Oblivious Traversals of an Array s Pairs

Cache-Oblivious Traversals of an Array s Pairs Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious

More information

CS377P Programming for Performance Single Thread Performance Caches I

CS377P Programming for Performance Single Thread Performance Caches I CS377P Programming for Performance Single Thread Performance Caches I Sreepathi Pai UTCS September 21, 2015 Outline 1 Introduction 2 Caches 3 Performance of Caches Outline 1 Introduction 2 Caches 3 Performance

More information

CS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra

CS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra CS 294-73 Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra Slides from James Demmel and Kathy Yelick 1 Outline What is Dense Linear Algebra? Where does the time go in an algorithm?

More information

Lecture 2. Memory locality optimizations Address space organization

Lecture 2. Memory locality optimizations Address space organization Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput

More information

THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June COMP3320/6464/HONS High Performance Scientific Computing

THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June COMP3320/6464/HONS High Performance Scientific Computing THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2010 COMP3320/6464/HONS High Performance Scientific Computing Study Period: 15 minutes Time Allowed: 3 hours Permitted Materials: Non-Programmable

More information

Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology

Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology Jaewook Shin 1, Mary W. Hall 2, Jacqueline Chame 3, Chun Chen 2, Paul D. Hovland 1 1 ANL/MCS 2 University

More information

Cache Memories. EL2010 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 2010

Cache Memories. EL2010 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 2010 Cache Memories EL21 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 21 Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of

More information

5.12 EXERCISES Exercises 263

5.12 EXERCISES Exercises 263 5.12 Exercises 263 5.12 EXERCISES 5.1. If it s defined, the OPENMP macro is a decimal int. Write a program that prints its value. What is the significance of the value? 5.2. Download omp trap 1.c from

More information

Caches III. CSE 351 Spring Instructor: Ruth Anderson

Caches III. CSE 351 Spring Instructor: Ruth Anderson Caches III CSE 351 Spring 2017 Instructor: Ruth Anderson Teaching Assistants: Dylan Johnson Kevin Bi Linxing Preston Jiang Cody Ohlsen Yufang Sun Joshua Curtis Administrivia Office Hours Changes check

More information

Kathryn Chan, Kevin Bi, Ryan Wong, Waylon Huang, Xinyu Sui

Kathryn Chan, Kevin Bi, Ryan Wong, Waylon Huang, Xinyu Sui Caches III CSE 410 Winter 2017 Instructor: Justin Hsia Teaching Assistants: Kathryn Chan, Kevin Bi, Ryan Wong, Waylon Huang, Xinyu Sui Please stop charging your phone in public ports "Just by plugging

More information

Giving credit where credit is due

Giving credit where credit is due CSCE 23J Computer Organization Cache Memories Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce23j Giving credit where credit is due Most of slides for this lecture are based

More information

PARALLEL AND DISTRIBUTED COMPUTING

PARALLEL AND DISTRIBUTED COMPUTING PARALLEL AND DISTRIBUTED COMPUTING 2013/2014 1 st Semester 1 st Exam January 7, 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc. - Give your answers

More information

Cache Memories. Lecture, Oct. 30, Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition

Cache Memories. Lecture, Oct. 30, Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition Cache Memories Lecture, Oct. 30, 2018 1 General Cache Concept Cache 84 9 14 10 3 Smaller, faster, more expensive memory caches a subset of the blocks 10 4 Data is copied in block-sized transfer units Memory

More information

Caches III. CSE 351 Winter Instructor: Mark Wyse

Caches III. CSE 351 Winter Instructor: Mark Wyse Caches III CSE 351 Winter 2018 Instructor: Mark Wyse Teaching Assistants: Kevin Bi Parker DeWilde Emily Furst Sarah House Waylon Huang Vinny Palaniappan https://what-if.xkcd.com/111/ Administrative Midterm

More information

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel

More information

Cache-Oblivious Algorithms

Cache-Oblivious Algorithms Cache-Oblivious Algorithms Matteo Frigo, Charles Leiserson, Harald Prokop, Sridhar Ramchandran Slides Written and Presented by William Kuszmaul THE DISK ACCESS MODEL Three Parameters: B M P Block Size

More information

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Memory bound computation, sparse linear algebra, OSKI Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh ATLAS Mflop/s Compile Execute

More information

Denison University. Cache Memories. CS-281: Introduction to Computer Systems. Instructor: Thomas C. Bressoud

Denison University. Cache Memories. CS-281: Introduction to Computer Systems. Instructor: Thomas C. Bressoud Cache Memories CS-281: Introduction to Computer Systems Instructor: Thomas C. Bressoud 1 Random-Access Memory (RAM) Key features RAM is traditionally packaged as a chip. Basic storage unit is normally

More information

Outline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis

Outline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20

More information

HPCG Performance Improvement on the K computer ~10min. brief~

HPCG Performance Improvement on the K computer ~10min. brief~ HPCG Performance Improvement on the K computer ~10min. brief~ Kiyoshi Kumahata, Kazuo Minami RIKEN AICS HPCG BoF #273 SC14 New Orleans Evaluate Original Code 1/2 Weak Scaling Measurement 9,500 GFLOPS in

More information

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

COSC 6374 Parallel Computations Introduction to CUDA

COSC 6374 Parallel Computations Introduction to CUDA COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo

More information

Last class. Caches. Direct mapped

Last class. Caches. Direct mapped Memory Hierarchy II Last class Caches Direct mapped E=1 (One cache line per set) Each main memory address can be placed in exactly one place in the cache Conflict misses if two addresses map to same place

More information

Iterative Compilation with Kernel Exploration

Iterative Compilation with Kernel Exploration Iterative Compilation with Kernel Exploration Denis Barthou 1 Sébastien Donadio 12 Alexandre Duchateau 1 William Jalby 1 Eric Courtois 3 1 Université de Versailles, France 2 Bull SA Company, France 3 CAPS

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

Some features of modern CPUs. and how they help us

Some features of modern CPUs. and how they help us Some features of modern CPUs and how they help us RAM MUL core Wide operands RAM MUL core CP1: hardware can multiply 64-bit floating-point numbers Pipelining: can start the next independent operation before

More information

Parallel & Concurrent Programming: ZPL. Emery Berger CMPSCI 691W Spring 2006 AMHERST. Department of Computer Science UNIVERSITY OF MASSACHUSETTS

Parallel & Concurrent Programming: ZPL. Emery Berger CMPSCI 691W Spring 2006 AMHERST. Department of Computer Science UNIVERSITY OF MASSACHUSETTS Parallel & Concurrent Programming: ZPL Emery Berger CMPSCI 691W Spring 2006 Department of Computer Science Outline Previously: MPI point-to-point & collective Complicated, far from problem abstraction

More information

Go Multicore Series:

Go Multicore Series: Go Multicore Series: Understanding Memory in a Multicore World, Part 2: Software Tools for Improving Cache Perf Joe Hummel, PhD http://www.joehummel.net/freescale.html FTF 2014: FTF-SDS-F0099 TM External

More information

Cache Memories : Introduc on to Computer Systems 12 th Lecture, October 6th, Instructor: Randy Bryant.

Cache Memories : Introduc on to Computer Systems 12 th Lecture, October 6th, Instructor: Randy Bryant. Cache Memories 15-213: Introduc on to Computer Systems 12 th Lecture, October 6th, 2016 Instructor: Randy Bryant 1 Today Cache memory organiza on and opera on Performance impact of caches The memory mountain

More information