OPTIMIZED THREAD CREATION FOR PROCESSOR MULTITHREADING 393

EXAMPLE 2. In two successive iterations, [i, j] and [i, j+1], the reference X[i+2j+2, i+j+3] in Figure 5 refers to the memory locations (assuming array X of size M x N, and row-major mapping of the array)

X[i+2j+2, i+j+3] -> (i+2j+2) N + (i+j+3)
X[i+2(j+1)+2, i+(j+1)+3] -> (i+2(j+1)+2) N + (i+(j+1)+3).

for (i = 1; i <= m; i++)
  for (j = 1; j <= n; j++)
    X[i+2j+2, i+j+3] = ...

(a)

for (k = -2n-m; k <= -3; k++)
  for (l = max(-n-k, (1-k)/2); l <= min(-1-k, (m-k)/2); l++)
    X[-k+2, l+3] = ...

(b)

FIGURE 5. Spatial locality transformation. The program in (a) is transformed into the program in (b), with the corresponding change in the access pattern of the successive array elements, which reduces cache misses when the array is mapped to memory in row-major order. The horizontal axis is the first dimension of X and the vertical axis is the second dimension.

So the spatial reuse distance between successive iterations is 2N+1. In matrix notation, with stride vector \bar{u} = [N \; 1]^T, the spatial reuse distance can be expressed as

\bar{e}_2^T A^T \bar{u} = [0 \; 1] \begin{bmatrix} 1 & 1 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} N \\ 1 \end{bmatrix} = [2 \; 1] \begin{bmatrix} N \\ 1 \end{bmatrix} = 2N+1.

If the cache line contains 32 matrix elements, the spatial reuse fraction is only (32-21)/32, about 34%, for N = 10, and 0 for N >= 16:

T = \begin{bmatrix} -1 & -2 \\ 1 & 1 \end{bmatrix}.   (2)

Applying the transformation in Equation (2), reference X[i+2j+2, i+j+3] changes to X[-k+2, l+3], as shown in Figure 5, and the resulting spatial reuse distance is 1. For the same cache line size, the spatial reuse fraction is 31/32, about 97%.

Spatial locality transformation

To determine the spatial reuse transformation, we first determine all the references to the array elements in the loop nest and prioritize them. If profile-directed feedback is not available to prioritize the references, heuristics such as [19] can be used, which have been shown to provide reasonably good branch predictions for SPEC benchmarks.
For each reference matrix, A_i, we want to determine a non-zero column vector \bar{t}_n so that A_i \bar{t}_n = [0 ... 0 γ]^T, where γ is such that u_n γ < cache line size. To increase the likelihood of spatial reuse, we choose γ as small as possible. We define the spatial reuse condition for reference A_i as [A_i]_{row k} \bar{t}_n = 0 for 1 <= k <= n-1. The spatial reuse fraction for A_i is

Spatial reuse fraction = (cache line size - u_n γ) / (cache line size).

Figure 6 shows the algorithm to determine the spatial transformation matrix. The algorithm first tries to determine the best transformation for the innermost loop and then successively moves out towards the outermost loop, until there is no reference matrix available which satisfies the spatial reuse condition, or the spatial reuse fraction is zero for all the reference matrices. At each iteration, the algorithm determines the last column of the transformation matrix, T, which is then expanded to full rank. The values of the smallest γ_j can be determined using the Euclidean algorithm to find the GCD. For example, let n = 5 and suppose that, after applying the spatial reuse conditions, we can express t_1 and t_2 in terms of t_3, t_4 and t_5. Assume that there are two references and that

γ_1 = 5 t_3 + 10 t_4 + 15 t_5   (3)

and

γ_2 = 9 t_3 + 3 t_4.   (4)

Since γ_1 and γ_2 must be a multiple of 5 and 3 respectively, for the smallest values of γ_1 and γ_2 we must have

t_3 + 2 t_4 + 3 t_5 = 1   (5)

and

3 t_3 + t_4 = 1.   (6)

From Equations (5) and (6), we can write

5 t_3 - 3 t_5 = 1.   (7)

Since the GCD of the coefficients on the left-hand side divides the constant on the right-hand side, there is a solution to Equation (7), which can be obtained from the Euclidean algorithm. The final solution for [t_3 t_4 t_5] that minimizes γ_1 and γ_2 is [2 -5 3].

Temporal locality

So far we have discussed a transformation that will increase the reuse of the same cache line in the next iterations.
To increase the temporal locality, we look for a transformation that decreases the distance between the iterations that reference the same memory location. Thus this transformation has the effect of reducing the footprint of a loop nest in the data cache. Let us consider the program in Figure 7. Variable A[i, j] is referenced in three different iterations, [i, j], [i+3, j+1]

394 B. SINHAROY

Input:  Ordered set of reference matrices S = {A_1, A_2, ...}.
        Priority order P = {p_1, p_2, ...}.
        Set of dependence vectors D = {\bar{d}_1, \bar{d}_2, ...}.
        Set of stride vectors {\bar{u}_1, \bar{u}_2, ...}.
Output: Transformation matrix M for spatial reuse.

M = identity matrix;
d = n;  /* loop depth */
while (S != NULL) {
  Find the largest i for which the spatial reuse condition holds.
    /* determines the last column of T_1 = [t_{n1} ... t_{nd}]^T */
  Delete A_j from S for j > i. If (i == 0) terminate.
  k = rank [A_1 A_2 ... A_i]^T
  If (k < d) express t_1, ..., t_k in terms of t_{k+1}, ..., t_d
  Find the smallest γ_j in the priority order of A_j for 1 <= j <= i
  If ((u_d)_j γ_j > line size) delete A_j from S.
  Expand the last column of T_1 to n.
  Complement T_1 to full rank and find T.
  If T \bar{d}_i > 0 for all \bar{d}_i in D then M = T M
  else delete A_i from S and continue with the next iteration
  Compute spatial reuse fractions f_j for 1 <= j <= i
  Order A_j in increasing order of p_j = p_j f_j for 1 <= j <= i
  Order entries in S in the priority order.
  d = d - 1
}
return (M)

FIGURE 6. Spatial transformation algorithm.

for (i = 6; i <= m; i++)
  for (j = 2; j <= n; j++)
    A[i,j] = A[i-3, j-1] + A[i-6, j-2];

FIGURE 7. Simple program with three temporal reuses.

and [i+6, j+2] by the three references A[i, j], A[i-3, j-1] and A[i-6, j-2], respectively. The distance between the two iterations in which reference to the same memory location is made is called the reuse vector. For the example in Figure 7, there are three reuse vectors, which form the reuse matrix

R = \begin{bmatrix} 3 & 3 & 6 \\ 1 & 1 & 2 \end{bmatrix}.

For a more general array reference, let us consider two references to array X, X[A \bar{i}_1 + \bar{c}_1] and X[A \bar{i}_2 + \bar{c}_2], where A = [\bar{α}_1 \bar{α}_2 ...]^T. Iterations \bar{i}_1 and \bar{i}_2 refer to the same memory location when

A \bar{i}_1 + \bar{c}_1 = A \bar{i}_2 + \bar{c}_2.
The temporal reuse distance vector between the two iterations, \bar{r} = \bar{i}_1 - \bar{i}_2, can be found by solving

A (\bar{i}_1 - \bar{i}_2) = A \bar{r} = \bar{c}_2 - \bar{c}_1,   (8)

that is,

\bar{r} = null space(A) + \bar{s} = span{\bar{a}_1, \bar{a}_2, ..., \bar{a}_k} + \bar{s},   (9)

where k = n - rank(A), {\bar{a}_1, \bar{a}_2, ..., \bar{a}_k} is the set of basis vectors of the null space of A and \bar{s} = [s_1 s_2 ...]^T is a special solution of Equation (8). To increase the temporal reuse, the transformation matrix should be chosen in such a way that the higher-dimensional elements of the vector T \bar{r} are zero or very small. To reduce the number of reuse vectors, we choose the maximum of all the reuse distances in Equation (9),

\bar{r} = [r_1 r_2 ... r_n]^T, where r_l = s_l if a_{jl} = 0 for all j, 1 <= j <= k, and r_l = the range of [\bar{α}_l \bar{i} + c_1 - c_2] over the iteration space \bar{i}, otherwise.

For more general array references, let us consider the references X[A \bar{i} + \bar{c}_1] and X[B \bar{i} + \bar{c}_2], where A = [\bar{α}_1 \bar{α}_2 ...]^T and B = [\bar{β}_1 \bar{β}_2 ...]^T. Let {\bar{a}_1, \bar{a}_2, ..., \bar{a}_k} and {\bar{b}_1, \bar{b}_2, ..., \bar{b}_k} be the sets of basis vectors of the null spaces of A and B, respectively. The same memory location will be referred to in iterations \bar{i}_1 and \bar{i}_2 by these two references if and only if

A \bar{i}_1 + \bar{c}_1 = B \bar{i}_2 + \bar{c}_2 = \bar{x} (say).

TABLE 1. Miss rates (per 100 references) and execution times for two different memory latency and switch times, when the cache associativity is 4 and the cache line size is 128. (Columns: cache size (KB); miss rate and execution times (A) and (B) for the non-threaded original, non-threaded transformed, threaded original and threaded transformed programs.)

TABLE 2. Miss rates (per 100 references) and execution times for two different memory latency and switch times, when the cache size is 128 Kbytes and the associativity is 4. (Columns: line size (B); miss rate and execution times (A) and (B) for the same four program versions.)

Loop kernel

The original loop kernel was

for (i = L1; i <= U1; i++)
  for (j = L2; j <= U2; j++)
    A[i,j] = A[i-3, j-1] + A[i-6, j-2];

The reuse matrix R, the transformation matrix T and the transformed reuse matrix R' are

R = \begin{bmatrix} 3 & 3 & 6 \\ 1 & 1 & 2 \end{bmatrix},  T = \begin{bmatrix} 1 & -3 \\ 0 & 1 \end{bmatrix},

R' = T R = \begin{bmatrix} 1 & -3 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3 & 3 & 6 \\ 1 & 1 & 2 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 1 & 2 \end{bmatrix}.

Transformed program with multithreading, executed by thread t, 0 <= t <= T-1, with T threads:

for (k = L1-3*U2+t; k <= U1-3*L2; k += T)
  for (l = max(L2, (L1-k)/3); l <= min((U1-k)/3, U2); l++)
    A[k+3*l, l] = A[k+3*l-3, l-1] + A[k+3*l-6, l-2];

7. CONCLUSION

A novel compiler optimization method based on loop transformation theory, which reduces data cache misses on a uniprocessor by increasing the temporal and spatial locality of references, has been discussed in this paper. The algorithms presented work for general array references. The method helps in creating threads on a multithreaded processor with increased intra- and inter-thread locality. The method can also be extended to partitioning and mapping loop nests on a multicomputer so that the number of iterations between communication steps is large. This reduces the total communication and synchronization cost on a multicomputer.

Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

15-740/ Computer Architecture, Fall 2011 Midterm Exam II

15-740/18-740 Computer Architecture, Fall 2011 Midterm Exam II Instructor: Onur Mutlu Teaching Assistants: Justin Meza, Yoongu Kim Date: December 2, 2011 Name: Instructions: Problem I (69 points) : Problem

Space Filling Curves and Hierarchical Basis. Klaus Speer

Space Filling Curves and Hierarchical Basis Klaus Speer Abstract Real world phenomena can be best described using differential equations. After linearisation we have to deal with huge linear systems of

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses

Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Miss Penalty: Read Priority over Write on Miss Write buffers may offer RAW

Global Scheduler. Global Issue. Global Retire

The Delft-Java Engine: An Introduction C. John Glossner 1;2 and Stamatis Vassiliadis 2 1 Lucent / Bell Labs, Allentown, Pa. 2 Delft University oftechnology, Department of Electrical Engineering Delft,

Agenda Cache memory organization and operation Chapter 6 Performance impact of caches Cache Memories

Agenda Chapter 6 Cache Memories Cache memory organization and operation Performance impact of caches The memory mountain Rearranging loops to improve spatial locality Using blocking to improve temporal

Caches in Real-Time Systems. Instruction Cache vs. Data Cache

Caches in Real-Time Systems [Xavier Vera, Bjorn Lisper, Jingling Xue, Data Caches in Multitasking Hard Real- Time Systems, RTSS 2003.] Schedulability Analysis WCET Simple Platforms WCMP (memory performance)

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture: