Dense Matrix Multiplication
1 Dense Matrix Multiplication
Abhishek Somani, Debdeep Mukhopadhyay
Mentor Graphics, IIT Kharagpur
October 7, 2015
Abhishek, Debdeep (IIT Kgp) Matrix Mult. October 7, 2015 1 / 56
2 Overview
1 The Problem
2 Analysis
3 Improvement
4 Better Cache utilization
5 Cache Oblivious Method
6 Distributed Matrix Multiplication
4 Why is matrix multiplication important?
5 Matrix Representation
- A single array contains the entire matrix
- The matrix is arranged in row-major format
- An m × n matrix contains m rows and n columns
- A(i, j) is the matrix entry at the i-th row and j-th column of matrix A
- It is the (i·n + j)-th entry in the matrix array
6 Triple nested loop

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; ++i) {
            const int iOffset = i * n;
            for (int j = 0; j < n; ++j) {
                double cij = 0.0;
                for (int k = 0; k < n; k++)
                    cij += A[iOffset + k] * B[k * n + j];
                C[iOffset + j] += cij;
            }
        }
    }

Total number of multiplications: n^3
7 Row-based data decomposition in matrix C
8 Parallel Multiply

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i) {
            const int iOffset = i * n;
            for (int j = 0; j < n; ++j) {
                double cij = 0.0;
                for (int k = 0; k < n; k++)
                    cij += A[iOffset + k] * B[k * n + j];
                C[iOffset + j] += cij;
            }
        }
    }
9 (Almost) Perfect Scaling for matrix of size
10 How good is the serial performance?
11 How good is the serial performance?
- Normalized time grows by almost 4x as the matrix size grows from 1000 to 6000
- Experiments done on a 3.2 GHz machine
- More than 5 clock cycles taken per double-precision multiplication!!!
13 Memory Hierarchy Model for analysis
- L lines of capacity m double-precision numbers each
- Tall Cache assumption: L > m
- Replacement Policy: Least Recently Used
- No hardware prefetching
14 Memory Access Pattern during multiplication
- A, B and C are square matrices of size n × n
- n is large, i.e., n > L
15 Memory Access Pattern for A
Sequential access: accessing a row requires n/m cache misses
16 Memory Access Pattern for B
Strided access: accessing a column requires n cache misses
17 Total cache misses
- For computing every C(i, j), the number of cache misses: 1 + n/m + n
- If n < mL, total cache misses: 2n^2/m + n^3
- If n > mL, total cache misses: n^2/m + n^3/m + n^3
18 Is n < mL a practical assumption?
- A 64-byte cache line means m = 8 doubles per line
- A 256 KB L2 cache then gives mL = 32768
- For practical problems n < 10-15k
- Thus n < mL, and the total cache misses: 2n^2/m + n^3 = Θ(n^3)
- Can this be improved?
20 Alternate Memory Access Pattern
- For computing the row C(i, :), cache misses: 2n/m + n^2/m
- Total cache misses: 2n^2/m + n^3/m = Θ(n^3/m)
21 Improved Multiply

    void square_dgemm(int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; ++i) {
            const int iOffset = i * n;
            for (int k = 0; k < n; k++) {
                const int kOffset = k * n;
                for (int j = 0; j < n; ++j)
                    C[iOffset + j] += A[iOffset + k] * B[kOffset + j];
            }
        }
    }

Triple-nested loop with the order of (i, j, k) changed to (i, k, j)
22 ikj versus ijk
23 (Almost) Perfect Scaling for matrix of size
25 Blocking / Tiling
- Assumptions for analysis: n % b = 0 and b % m = 0
- Cache misses in loading a block: b^2/m
- Cache misses in computing a block of C: b^2/m + (2n/b)(b^2/m) = b^2/m + 2nb/m
- Total cache misses: (n/b)^2 (b^2/m + 2nb/m) = n^2/m + 2n^3/(mb) = Θ(n^3/(mb))
26 Choosing blocking parameter b
- The 3 blocks for A, B and C should just fit in the cache
- 3b^2 = mL, i.e., b = sqrt(mL/3)
- For an L1 cache of capacity 32 KB, mL = 4096 and b ≈ 37
- A good value for b is 32
- Total cache misses: 2n^3/(mb) = 2·sqrt(3)·n^3/(m·sqrt(mL)) = Θ(n^3/(m·sqrt(Cache Size)))
27 Tiled Multiply

    /* min as used on the slides, e.g. #define min(a,b) (((a) < (b)) ? (a) : (b)) */
    void square_dgemm(int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; i += BLOCK_SIZE) {
            const int iOffset = i * n;
            for (int j = 0; j < n; j += BLOCK_SIZE) {
                for (int k = 0; k < n; k += BLOCK_SIZE) {
                    /* Correct block dimensions if block "goes off edge of" the matrix */
                    int M = min(BLOCK_SIZE, n - i);
                    int N = min(BLOCK_SIZE, n - j);
                    int K = min(BLOCK_SIZE, n - k);
                    /* Perform individual block dgemm */
                    do_block(n, M, N, K, A + iOffset + k, B + k * n + j, C + iOffset + j);
                }
            }
        }
    }
28 Tiled Multiply...

    static void do_block(int n, int M, int N, int K,
                         double* A, double* B, double* C)
    {
        for (int i = 0; i < M; ++i) {
            const int iOffset = i * n;
            for (int j = 0; j < N; ++j) {
                double cij = 0.0;
                for (int k = 0; k < K; ++k)
                    cij += A[iOffset + k] * B[k * n + j];
                C[iOffset + j] += cij;
            }
        }
    }
29 Tiled versus Normal
30 Tiled MT scaling
31 Instruction Level Parallelism
- Now that the data being worked upon is available in the cache closest to the processor, we can exploit some ILP
- ILP kicks in when there is a significant amount of independent work available in a single block of code
- Loop unrolling can help us achieve that
- Compilers also unroll loops, but in this case there are too many nesting levels for the compiler to do the right thing
32 Tiled Multiply with unrolling

Original inner loop:

    for (int k = 0; k < K; ++k)
        cij += A[iOffset + k] * B[k * n + j];

Unrolled by 8:

    for (int k = 0; k < K; k += 8) {
        const double d0 = A[iOffset + k]     * B[k * n + j];
        const double d1 = A[iOffset + k + 1] * B[(k + 1) * n + j];
        const double d2 = A[iOffset + k + 2] * B[(k + 2) * n + j];
        const double d3 = A[iOffset + k + 3] * B[(k + 3) * n + j];
        const double d4 = A[iOffset + k + 4] * B[(k + 4) * n + j];
        const double d5 = A[iOffset + k + 5] * B[(k + 5) * n + j];
        const double d6 = A[iOffset + k + 6] * B[(k + 6) * n + j];
        const double d7 = A[iOffset + k + 7] * B[(k + 7) * n + j];
        cij += (d0 + d1 + d2 + d3 + d4 + d5 + d6 + d7);
    }
33 Tiled Multiply with unrolling...
34 What about the L2 cache?
- In addition to L1, blocking can be done for the L2 cache also: a 2-level tiled code
- For an L2 cache of capacity 256 KB, mL = 32768 and b ≈ 105
- The L2 cache is shared between data and instruction memory accesses
- Finding the best L2 tile size is non-trivial and requires experimentation with different sizes
- We choose 88 as the block size for L2
35 Tiling for L1 and L2
36 Unrolling diminishes the difference
37 Multithreading shock
38 Automatic tuning
- Manual optimization and tuning is tedious and error-prone
- The entire process needs to be redone in full for any new architecture
- Multi-threaded optimization adds further complexity
- Code generated automatically by parameterized code generators
- Automatically Tuned Linear Algebra Software (ATLAS)
- Portable High Performance ANSI C (PHiPAC)
- Essentially a search problem
40 A different formulation

C = A · B

    [ C11 C12 ]   [ A11 A12 ] [ B11 B12 ]
    [ C21 C22 ] = [ A21 A22 ] [ B21 B22 ]

    C11 = A11·B11 + A12·B21
    C12 = A11·B12 + A12·B22
    C21 = A21·B11 + A22·B21
    C22 = A21·B12 + A22·B22

Solve for C11, C12, C21, C22 recursively
Recursion formula: W(n) = 8·W(n/2) + Θ(1), so W(n) = Θ(n^3)
41 How is it Cache oblivious?
42 Cache Miss Analysis

    Q(n) = Θ(n^2/m)          if n^2 < cmL, where 0 < c ≤ 1
    Q(n) = 8·Q(n/2) + Θ(1)   otherwise

(Recursion tree: depth d has 8^d nodes; internal nodes cost O(1) misses each, leaves O(cL).)
43 Cache Miss Analysis...

    Q(n) = Θ(n^2/m)          if n^2 < cmL, where 0 < c ≤ 1
    Q(n) = 8·Q(n/2) + Θ(1)   otherwise

- Let n_d be the value of n at depth d; the recursion bottoms out when n_d = sqrt(cmL)
- n_0 = n and n_i = n_{i-1}/2 = n/2^i, therefore n_d = n/2^d and 2^d = n/sqrt(cmL)
- d = lg(n/sqrt(cmL)) and the number of leaves l = 8^d = n^3/(cmL)^(3/2)
- Number of cache misses at the leaf level: l·Θ(cL) = Θ(n^3/(m·sqrt(mL)))
- Total number of cache misses: Θ(n^3/(m·sqrt(mL))) = Θ(n^3/(m·sqrt(Cache Size)))
- Recall that the total number of cache misses for tiled multiplication was also Θ(n^3/(m·sqrt(Cache Size)))
44 Recursive Multiply

    void multiply(const int n, const int m, double* A, double* B, double* C)
    {
        // Base cases
        if (m == 2) {
            basecase2(n, m, A, B, C);
            return;
        } else if (m == 1) {
            C[0] += A[0] * B[0];
            return;
        }
        // Recursive multiply
        const int offset12 = m / 2;
        const int offset21 = n * m / 2;
        const int offset22 = offset21 + offset12;
        multiply(n, m/2, A,          B,          C);
        multiply(n, m/2, A+offset12, B+offset21, C);
        multiply(n, m/2, A,          B+offset12, C+offset12);
        multiply(n, m/2, A+offset12, B+offset22, C+offset12);
        multiply(n, m/2, A+offset21, B,          C+offset21);
        multiply(n, m/2, A+offset22, B+offset21, C+offset21);
        multiply(n, m/2, A+offset21, B+offset12, C+offset22);
        multiply(n, m/2, A+offset22, B+offset22, C+offset22);
        return;
    }
45 Recursive multiply serial performance
Can be made more consistent by writing base-case code for matrices of sizes > 2
46 How far have we come?
47 Parallelization
How about 6, 10, 12, 14, 18 or 20 threads?
48 OpenMP task construct

    #pragma omp parallel
    #pragma omp single
    {
        ...
        #pragma omp task
        { ... }
        #pragma omp task
        { ... }
    }

- Tasks generated by a single thread
- Scheduling of tasks on available threads managed by OpenMP
- Useful in recursive functions, linked lists
49 task based recursive multiplication

    const int offset12 = m / 2;
    const int offset21 = n * m / 2;
    const int offset22 = offset21 + offset12;
    #pragma omp task
    {
        multiply(n, m/2, A,          B,          C);
        multiply(n, m/2, A+offset12, B+offset21, C);
    }
    #pragma omp task
    {
        multiply(n, m/2, A,          B+offset12, C+offset12);
        multiply(n, m/2, A+offset12, B+offset22, C+offset12);
    }
    #pragma omp task
    {
        multiply(n, m/2, A+offset21, B,          C+offset21);
        multiply(n, m/2, A+offset22, B+offset21, C+offset21);
    }
    #pragma omp task
    {
        multiply(n, m/2, A+offset21, B+offset12, C+offset22);
        multiply(n, m/2, A+offset22, B+offset22, C+offset22);
    }
50 Recursive Multiplication Scaling
51 Improved Parallel Recursive Multiplication
- Too many tasks of varying sizes
- Assignment of tasks to threads is arbitrary; it does not consider locality
- At present, gcc support for OpenMP tasks is not great
- Explicit control of task creation: create tasks only at a pre-computed depth in the recursion tree

    int criticaldepth = 0;
    int leaves = 1;
    int branches = 4;
    while (2 * numthreads > leaves) {
        criticaldepth++;
        leaves *= branches;
    }
52 Improved Parallel Recursive Multiplication...

    if (depth == criticaldepth) {
        #pragma omp task
        {
            multiply(n, m/2, A,          B,          C,          depth+1, criticaldepth);
            multiply(n, m/2, A+offset12, B+offset21, C,          depth+1, criticaldepth);
        }
        #pragma omp task
        {
            multiply(n, m/2, A,          B+offset12, C+offset12, depth+1, criticaldepth);
            multiply(n, m/2, A+offset12, B+offset22, C+offset12, depth+1, criticaldepth);
        }
        #pragma omp task
        {
            multiply(n, m/2, A+offset21, B,          C+offset21, depth+1, criticaldepth);
            multiply(n, m/2, A+offset22, B+offset21, C+offset21, depth+1, criticaldepth);
        }
        #pragma omp task
        {
            multiply(n, m/2, A+offset21, B+offset12, C+offset22, depth+1, criticaldepth);
            multiply(n, m/2, A+offset22, B+offset22, C+offset22, depth+1, criticaldepth);
        }
    } else {
        multiply(n, m/2, A,          B,          C,          depth+1, criticaldepth);
        multiply(n, m/2, A+offset12, B+offset21, C,          depth+1, criticaldepth);
        multiply(n, m/2, A,          B+offset12, C+offset12, depth+1, criticaldepth);
        multiply(n, m/2, A+offset12, B+offset22, C+offset12, depth+1, criticaldepth);
        multiply(n, m/2, A+offset21, B,          C+offset21, depth+1, criticaldepth);
        multiply(n, m/2, A+offset22, B+offset21, C+offset21, depth+1, criticaldepth);
        multiply(n, m/2, A+offset21, B+offset12, C+offset22, depth+1, criticaldepth);
        multiply(n, m/2, A+offset22, B+offset22, C+offset22, depth+1, criticaldepth);
    }
53 Improved Parallel Recursive Multiplication...
55 Key differences from shared memory parallel methods
- Matrices can be much larger
- Each node in the network holds only a part of the matrices
- Focus is on reducing inter-process communication, and on scalability
- All methods discussed so far are applicable to the local multiplication happening at a node
56 In future classes
- Network topologies
- Analysis of the cost of communication in different topologies
- Isoefficiency function to understand scalability with number of processors and problem size
- Distributed methods
CISC 36 Topics Cache Memories Nov 25, 28 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Cache memories are small, fast SRAM-based
More informationParallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop
Parallelizing The Matrix Multiplication 6/10/2013 LONI Parallel Programming Workshop 2013 1 Serial version 6/10/2013 LONI Parallel Programming Workshop 2013 2 X = A md x B dn = C mn d c i,j = a i,k b k,j
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve PTHREADS pthread_create, pthread_exit, pthread_join Mutex: locked/unlocked; used to protect access to shared variables (read/write) Condition variables: used to allow threads
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationCache-Oblivious Traversals of an Array s Pairs
Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious
More informationCS377P Programming for Performance Single Thread Performance Caches I
CS377P Programming for Performance Single Thread Performance Caches I Sreepathi Pai UTCS September 21, 2015 Outline 1 Introduction 2 Caches 3 Performance of Caches Outline 1 Introduction 2 Caches 3 Performance
More informationCS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra
CS 294-73 Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra Slides from James Demmel and Kathy Yelick 1 Outline What is Dense Linear Algebra? Where does the time go in an algorithm?
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationTHE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June COMP3320/6464/HONS High Performance Scientific Computing
THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2010 COMP3320/6464/HONS High Performance Scientific Computing Study Period: 15 minutes Time Allowed: 3 hours Permitted Materials: Non-Programmable
More informationAutotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology
Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology Jaewook Shin 1, Mary W. Hall 2, Jacqueline Chame 3, Chun Chen 2, Paul D. Hovland 1 1 ANL/MCS 2 University
More informationCache Memories. EL2010 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 2010
Cache Memories EL21 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 21 Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of
More information5.12 EXERCISES Exercises 263
5.12 Exercises 263 5.12 EXERCISES 5.1. If it s defined, the OPENMP macro is a decimal int. Write a program that prints its value. What is the significance of the value? 5.2. Download omp trap 1.c from
More informationCaches III. CSE 351 Spring Instructor: Ruth Anderson
Caches III CSE 351 Spring 2017 Instructor: Ruth Anderson Teaching Assistants: Dylan Johnson Kevin Bi Linxing Preston Jiang Cody Ohlsen Yufang Sun Joshua Curtis Administrivia Office Hours Changes check
More informationKathryn Chan, Kevin Bi, Ryan Wong, Waylon Huang, Xinyu Sui
Caches III CSE 410 Winter 2017 Instructor: Justin Hsia Teaching Assistants: Kathryn Chan, Kevin Bi, Ryan Wong, Waylon Huang, Xinyu Sui Please stop charging your phone in public ports "Just by plugging
More informationGiving credit where credit is due
CSCE 23J Computer Organization Cache Memories Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce23j Giving credit where credit is due Most of slides for this lecture are based
More informationPARALLEL AND DISTRIBUTED COMPUTING
PARALLEL AND DISTRIBUTED COMPUTING 2013/2014 1 st Semester 1 st Exam January 7, 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc. - Give your answers
More informationCache Memories. Lecture, Oct. 30, Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition
Cache Memories Lecture, Oct. 30, 2018 1 General Cache Concept Cache 84 9 14 10 3 Smaller, faster, more expensive memory caches a subset of the blocks 10 4 Data is copied in block-sized transfer units Memory
More informationCaches III. CSE 351 Winter Instructor: Mark Wyse
Caches III CSE 351 Winter 2018 Instructor: Mark Wyse Teaching Assistants: Kevin Bi Parker DeWilde Emily Furst Sarah House Waylon Huang Vinny Palaniappan https://what-if.xkcd.com/111/ Administrative Midterm
More informationUvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP
Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms Matteo Frigo, Charles Leiserson, Harald Prokop, Sridhar Ramchandran Slides Written and Presented by William Kuszmaul THE DISK ACCESS MODEL Three Parameters: B M P Block Size
More informationReview for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects
Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory bound computation, sparse linear algebra, OSKI Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh ATLAS Mflop/s Compile Execute
More informationDenison University. Cache Memories. CS-281: Introduction to Computer Systems. Instructor: Thomas C. Bressoud
Cache Memories CS-281: Introduction to Computer Systems Instructor: Thomas C. Bressoud 1 Random-Access Memory (RAM) Key features RAM is traditionally packaged as a chip. Basic storage unit is normally
More informationOutline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis
Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20
More informationHPCG Performance Improvement on the K computer ~10min. brief~
HPCG Performance Improvement on the K computer ~10min. brief~ Kiyoshi Kumahata, Kazuo Minami RIKEN AICS HPCG BoF #273 SC14 New Orleans Evaluate Original Code 1/2 Weak Scaling Measurement 9,500 GFLOPS in
More informationStatistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform
Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationLast class. Caches. Direct mapped
Memory Hierarchy II Last class Caches Direct mapped E=1 (One cache line per set) Each main memory address can be placed in exactly one place in the cache Conflict misses if two addresses map to same place
More informationIterative Compilation with Kernel Exploration
Iterative Compilation with Kernel Exploration Denis Barthou 1 Sébastien Donadio 12 Alexandre Duchateau 1 William Jalby 1 Eric Courtois 3 1 Université de Versailles, France 2 Bull SA Company, France 3 CAPS
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationSome features of modern CPUs. and how they help us
Some features of modern CPUs and how they help us RAM MUL core Wide operands RAM MUL core CP1: hardware can multiply 64-bit floating-point numbers Pipelining: can start the next independent operation before
More informationParallel & Concurrent Programming: ZPL. Emery Berger CMPSCI 691W Spring 2006 AMHERST. Department of Computer Science UNIVERSITY OF MASSACHUSETTS
Parallel & Concurrent Programming: ZPL Emery Berger CMPSCI 691W Spring 2006 Department of Computer Science Outline Previously: MPI point-to-point & collective Complicated, far from problem abstraction
More informationGo Multicore Series:
Go Multicore Series: Understanding Memory in a Multicore World, Part 2: Software Tools for Improving Cache Perf Joe Hummel, PhD http://www.joehummel.net/freescale.html FTF 2014: FTF-SDS-F0099 TM External
More informationCache Memories : Introduc on to Computer Systems 12 th Lecture, October 6th, Instructor: Randy Bryant.
Cache Memories 15-213: Introduc on to Computer Systems 12 th Lecture, October 6th, 2016 Instructor: Randy Bryant 1 Today Cache memory organiza on and opera on Performance impact of caches The memory mountain
More information