Second Semester, 2016-17

Homework 3 (r1.2)

Due: Part (A) -- Apr 28, 2017, 11:55pm
     Part (B) -- Apr 28, 2017, 11:55pm
     Part (C) -- Apr 28, 2017, 11:55pm

Instruction: Submit your answers electronically through Moodle.

There are 3 major parts in this homework. Part A includes questions that aim to help you with understanding the lecture materials. They resemble the kind of questions you will encounter in quizzes and the final exam. Your answers to this part will be graded on your effort.

Part B of this homework contains hands-on exercises that require you to design and evaluate processor systems using various software and hardware tools, including Chisel and the RISC-V compilation tool chain. They are designed to help you understand real-world processor design and the use of various tools to help you along the way. This part of the homework will be graded on correctness.

Part C of this homework contains open-ended mini-project ideas. They are open-ended by nature, meaning there are no right or wrong answers. You must choose to attempt one of the several available topics. You may work individually or in groups of up to 3 for this part. If you work in a group, each of you must submit an independent report on the project.

The following summarizes the 3 parts:

  Part  Type               Indv/Grp                      Grading
  A     Basic problem set  Individual                    Graded on effort
  B     Hands-on           Individual or Group of 2 to 3 Graded on correctness
  C     Mini-project       Individual or Group of 2 to 3 Graded on effort

In all cases, you are encouraged to discuss the homework problems offline or online using Piazza. However, you should not ask for or give out solutions directly, as that defeats the purpose of the homework exercises. Giving out answers or copying answers directly will likely constitute an act of plagiarism.

Part A: Problem Set

A.1 Column-Row

In class, we discussed how matrices may be stored in memory with row-major or column-major orientation. In a row-major organization, matrices are stored row-by-row in memory, while in a column-major organization, matrices are stored column-by-column. Standard C compilers organize matrices as row-major, while Matlab organizes matrices as column-major. Consider the following C code:

    #define N 128
    int a[N][N];    // a[0][0] located at 0xA
    int i, j, sum;  // i, j, sum in registers

    sum = 0;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = sum + a[i][j];
        }
    }

You are running this code on a processor with the following tiny data cache:

  Direct mapped
  8 words block size
  32 entries

A.1.1 What is the capacity of the cache?

A.1.2 When the code is run with an initially empty cache, list out the sequence of Hits and Misses that is generated. What type of misses are they? What is the miss rate?

A.1.3 Suppose now that the i and j loops are exchanged as in the following code:

    for (j = 0; j < N; j++) {
        for (i = 0; i < N; i++) {
            sum = sum + a[i][j];
        }
    }

Repeat your work in A.1.2. How does it affect your miss rate?

A.1.4 If instead the matrix is stored in column-major order, which loop ordering may produce better cache performance?

r1.2 Page 2 of 17
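As a reference for the questions above, here is a minimal sketch (not part of the graded code) of the index-to-address arithmetic behind the two storage orders; base, n and elem are hypothetical names for the array base address, matrix dimension and element size:

    #include <stddef.h>

    /* Row-major (C): elements of one row are adjacent in memory, so
     * stepping j advances the address by elem bytes. */
    size_t addr_row_major(size_t base, size_t i, size_t j, size_t n, size_t elem)
    {
        return base + (i * n + j) * elem;
    }

    /* Column-major (Matlab/Fortran): elements of one column are adjacent,
     * so stepping j advances the address by n * elem bytes. */
    size_t addr_col_major(size_t base, size_t i, size_t j, size_t n, size_t elem)
    {
        return base + (j * n + i) * elem;
    }

Which loop order enjoys spatial locality follows directly from which loop index advances the address by a single element.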

A.1.5 If the cache is changed to a 4-way set associative cache while keeping the same capacity, how would it change the miss rate in the above matrix access?

A.1.6 Now consider a different loop that copies the top half of the matrix to the bottom half:

    for (i = 0; i < N/2; i++) {
        for (j = 0; j < N; j++) {
            a[i+N/2][j] = a[i][j];
        }
    }

When executed with the original direct mapped cache, what is the sequence of Hits and Misses that is generated? What is the miss rate? Assume the matrix is stored row-major in memory. Also assume a write-back, write-allocate policy.

A.1.7 Repeat A.1.6 if a 2-way set associative cache is used. Assume a write-through, no-write-allocate policy.

r1.2 Page 3 of 17

A.2 Page Table & TLB

In this exercise you will experiment with the interaction between the TLB and the page table in a VM system. Assume the following system configuration:

  16-bit virtual and physical address
  256 B page size
  4-entry, fully associative TLB, true LRU replacement policy

Initially, the TLB and page table contain the following entries. An invalid entry is marked as empty. If a page is located on hard disk, it is marked as disk.

TLB:
  Tag    PPN
  0xAC   0x07
  0x04   0xCA
  0x28   0xD0
  empty

Page Table:
  Loc    PPN or Disk
  0xFF   disk
  0xFE   empty
  0xFD   empty
  0xFC   empty
  0xFB   0xAB
  0xFA   empty
  ...
  0x08   empty
  0x07   empty
  0x06   disk
  0x05   empty
  0x04   0xCA
  0x03   0x18
  0x02   disk
  0x01   empty
  0x00   0xFF

A.2.1 The following sequence of memory accesses is issued:

  0xFF02, 0x00FF, 0xAC98, 0x0801, 0xFC98

Assume there is only 1 process in the system. Answer the following:

1. What is the final state of the TLB and Page Table after the above accesses?
2. For each memory access, is it a hit in the TLB, a hit in the page table, or a page fault?

A.2.2 Given the above VM setup, what is its TLB reach?

A.2.3 What are some of the advantages and disadvantages of a larger page size?

A.2.4 Assume now that there are 2 processes running in the system. The 2 processes generate the following list of memory references:

r1.2 Page 4 of 17

Process 1 / Process 2:

0x0700 0x0704 0xFA00 0xFA04 0x0700 0x0704 0xFA00 0xFA04 0x0708 0xFE00 0xFE04 0x0708 0x070C 0xFFF0 0xFFEC 0xFFE8 0x070C 0x0710 0x0714 0x0718 0xFE00 0xFE04 0xFE0C 0xFE04 0xFE00 0xFE04

Given the above accesses, how many TLB misses will be generated for Process 1 and Process 2 respectively? How many page faults are generated?

A.2.5 Describe what kind of hardware/software changes you can make to reduce the number of TLB misses in such a scenario.

r1.2 Page 5 of 17
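For A.2, a minimal sketch of how a virtual address splits under the stated configuration: 256 B pages give an 8-bit page offset, so with 16-bit virtual addresses the high 8 bits form the VPN used as the TLB tag.

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 8  /* 256 B page => 8-bit offset; 16-bit VA => 8-bit VPN */

    int main(void)
    {
        uint16_t va = 0xFF02;                          /* first access in A.2.1 */
        uint8_t vpn    = (uint8_t)(va >> OFFSET_BITS); /* 0xFF: looked up in the TLB / page table */
        uint8_t offset = (uint8_t)(va & 0xFF);         /* 0x02: passes through translation unchanged */
        printf("VA 0x%04X -> VPN 0x%02X, offset 0x%02X\n", va, vpn, offset);
        return 0;
    }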

A.3 Vector Co-Processor

Adapted from the 2016 final exam.

As an attempt to improve the performance of a simple 32-bit scalar processor, you are considering adding a vector co-processor to your system as shown below:

[Figure: the CPU with its data cache connects to the main memory (DRAM); the vector co-processor attaches directly to the main memory.]

As shown in the figure, the additional co-processor, called MMV, is a memory-memory vector unit that is attached directly to the system main memory. When instructed by the main processor, it fetches data from the main memory, operates on the data, and stores the results back to the main memory. Your task is to evaluate the effectiveness of this MMV unit regarding the following very important program:

    #define N 256
    int i, a;
    int x[N], y[N];

    Preprocess(x);

    // compute kernel
    for (i = 0; i < N; i++) {
        y[i] = x[i] + y[i];
    }

Mathematically, the loop marked as compute kernel performs the operation y = x + y, where x and y are vectors with N elements.

A.3.1 Scalar Performance

The D-cache of the processor has the following properties:

  Cache miss takes 150 cycles
  Cache hit takes 1 cycle
  Capacity: 4 MiB
  Organization: 2-way set associative
  Cache line is 4 words
  Policy: write back, write allocate

r1.2 Page 6 of 17

The for loop marked as compute kernel in the above C program is compiled as follows:

    # a0 = 256
    # a1 is base address of x[]: 0xA
    # a2 is base address of y[]: 0xA0C

    00: loop: addi a0, a0, -1
    04:       lw   t1, 0(a1)
    08:       lw   t2, 0(a2)
    0C:       add  t2, t1, t2
    10:       sw   t2, 0(a2)
    14:       addi a1, a1, 4
    18:       addi a2, a2, 4
    1C:       bne  a0, zero, loop

Assume the CPI of all instructions is 1, with the CPI of memory operations depending on cache performance. Assuming the cache is initially empty, how many cycles does it take to complete the above loop?

r1.2 Page 7 of 17
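One way to organize the accounting for A.3.1 (a sketch of a first-order model under the stated assumptions, not the official solution method): every instruction retires in 1 cycle, and each data-cache miss adds the miss penalty on top of the hit time already counted.

    /* First-order cycle model: a sketch assuming 1 cycle per instruction
     * plus (miss_cycles - hit_cycles) extra cycles per data-cache miss. */
    long estimate_cycles(long n_instructions, long n_misses,
                         long hit_cycles, long miss_cycles)
    {
        return n_instructions + n_misses * (miss_cycles - hit_cycles);
    }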

A.3.2 The vector co-processor MMV is able to perform similar vector additions by directly reading input from the main memory and storing the results to the main memory. Taking into account DRAM timing, on average it can perform 16 additions in 100 cycles, which includes the cycles needed to check for the loop boundary. Now, assuming all the data are in main memory, how many cycles does MMV take to complete the loop?

A.3.3 When compared to the scalar processor, what is the speedup offered by MMV, considering the compute kernel loop only?

A.3.4 One limitation of MMV is that it operates directly on the main memory. Assume both x[] and y[] are dirty in the cache and thus require flushing before MMV can take over to accelerate the loop. Now, assume flushing each cache line requires 150 cycles. How long does it take to flush the entire vectors x[] and y[] from the cache to main memory if they are all dirty?

A.3.5 Taking into account the time to flush x[] and y[] to memory, what is now the speedup of using MMV?

A.3.6 In practice, since it is usually difficult for the OS to know in which cache line a variable resides, it often needs to flush the entire cache instead. Recall that the cache is 4 MiB. How long does it take to flush the entire cache assuming 25% of the cache lines are dirty? Also, what is the resulting speedup of using MMV?

r1.2 Page 8 of 17

A.3.7 Your project partner suggests that the need to flush dirty lines from the cache can be eliminated by implementing a write-through policy. Does the use of a write-through policy eliminate the need to flush dirty lines from the cache? Does it resolve all data coherence problems between MMV and the scalar processor? Explain your answers.

r1.2 Page 9 of 17

A.4 Page Table Size

You are considering a system with the following parameters:

  64-bit architecture
  1 MiB page size
  Each page table entry is 8 bytes

A.4.1 Assuming a single linear page table, with all entries pre-allocated when a process is launched, what is the size of the page table for 1 process?

A.4.2 Obviously, it is unrealistic to allocate all available space for a process. In practice, most processes reference only a few clusters of addresses, such as the addresses within the data heap, the stack, and addresses for the program instructions, etc. Assuming each process accesses 4 disjoint clusters of addresses, each consisting of 256 MiB of data, what is the theoretical minimum amount of storage needed for page table entries (PTEs)?

A.4.3 To get a storage requirement closer to the above theoretical minimum, a 2-level page table system is being explored. The first level of page table is pre-allocated when a process is launched, and contains pointers to 2nd-level page tables. The 2nd-level page tables are allocated on demand as the 4 clusters of memory regions are allocated. The following breakdown of the address between the 2 levels is used:

  | First Level | Second Level | Page Offset |

If the 4 clusters of memory are located in a contiguous 1 GiB virtual address space starting from address 0, how many 2nd-level page tables are allocated for this process? With that, what is the total size of the page tables in both levels?

A.4.4 If the 4 regions are located at 0x…, 0x…, 0x… and 0xC…, will it affect your answer from the previous part?

A.4.5 If a 3-level page table is used instead, can it reduce the total memory needed to store all page table entries, assuming the addresses are distributed as in A.4.4?

A.4.6 Assuming the addresses are distributed as in A.4.4, will having even more levels of page table be beneficial? Is there a limit on how many levels of page table may be used? What are the tradeoffs?

r1.2 Page 10 of 17
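For A.4.1, the arithmetic reduces to counting virtual pages; here is a minimal sketch of that counting argument (a generic helper with hypothetical names, not a worked answer):

    #include <stdint.h>

    /* Bytes needed by a single-level linear page table: one PTE per
     * virtual page. */
    uint64_t linear_pt_bytes(unsigned va_bits, unsigned page_bits,
                             unsigned pte_bytes)
    {
        uint64_t n_pages = 1ULL << (va_bits - page_bits);
        return n_pages * pte_bytes;
    }
    /* For A.4: va_bits = 64, page_bits = 20 (1 MiB pages), pte_bytes = 8. */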

Part B: Hands-on Exercise

In this exercise, you will learn techniques to optimize an application for CPU and for GPU. The core of the matrix-matrix multiplication program is the following loop:

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                c[i*N+j] += a[i*N+k] * b[k*N+j];
            }
        }
    }

B.1 Optimizing Matrix-Matrix Multiplication for CPU

Obtain the homework files:

    tux-1$ tar xzf ~elec3441/elec3441hw3.tar.gz
    tux-1$ cd hw3
    tux-1$ export HW3ROOT=$PWD

B.1.1 Loop Interchange

There are no data or control dependencies between loop iterations, so it is possible to reorder the operations arbitrarily. Depending on the order of the loops, one of the three elements a[i*N+k], b[k*N+j], c[i*N+j] stays constant during the whole inner loop. Compare the following three versions:

IJK

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            x = 0;
            for (k = 0; k < N; k++) {
                x += a[i*N+k] * b[k*N+j];
            }
            c[i*N+j] += x;
        }
    }

IKJ

    for (i = 0; i < N; i++) {
        for (k = 0; k < N; k++) {
            x = a[i*N+k];

r1.2 Page 11 of 17

            for (j = 0; j < N; j++) {
                c[i*N+j] += x * b[k*N+j];
            }
        }
    }

JKI

    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            x = b[k*N+j];
            for (i = 0; i < N; i++) {
                c[i*N+j] += a[i*N+k] * x;
            }
        }
    }

B.1.2 Analyze the cache behavior of each version. Assume the matrices are very large, such that even one row will not fit in the cache. Matrices are stored in row-major order. One cache line fits four matrix elements. How many loads and stores does each iteration of the innermost loop need? How many cache misses per iteration will it produce for matrices A, B, C?

B.1.3 What will be the behavior for the loop orders JIK, KIJ and KJI?

B.1.4 The file test_mmm_inter.c measures the time needed for the three above versions of matrix multiplication on various matrix sizes. Compile and run this code:

    tux-1$ cd ${HW3ROOT}/mmm_cpu
    tux-1$ gcc -O3 -o test_mmm_inter test_mmm_inter.c -lrt
    tux-1$ ./test_mmm_inter

It will output the time taken in nanoseconds for the loop orders ijk, kij and jki for a range of matrix sizes. Plot the time taken per iteration (time/N^3) for each of the loop orders. Does the result match your analysis in B.1.2?

B.1.5 Can you make a guess at the L1 and L2 cache sizes of the machine based on this graph? You can change the values of BASE, ITER and DELTA to get a closer look at smaller matrix sizes.

B.1.6 Blocking

Another way to apply the observation that the different iterations of the loop body can be executed in any order is cache blocking. After loading a small part of the matrices into the cache, as many operations using this data as possible should be executed before loading new data, in order to obtain the best performance. Examine the function mmm_iijjkk_blocked in test_mmm_block.c (a sketch of the technique follows below). In the following diagram, each numbered block is of size block_size x block_size. Highlight the blocks of A, B and C that will be accessed for the values ii=1 and jj=2.

r1.2 Page 12 of 17
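For reference, here is a minimal sketch of a blocked (iijjkk-ordered) multiply. It only illustrates the technique; the handout's mmm_iijjkk_blocked may differ in details, and the sketch assumes N is a multiple of block_size:

    /* Blocked matrix multiply, iijjkk order: each (ii,jj,kk) step works on
     * one block_size x block_size tile of A, B and C while it is cached. */
    void mmm_blocked(const double *a, const double *b, double *c,
                     int N, int block_size)
    {
        for (int ii = 0; ii < N; ii += block_size)
            for (int jj = 0; jj < N; jj += block_size)
                for (int kk = 0; kk < N; kk += block_size)
                    /* multiply the (ii,kk) block of A by the (kk,jj) block
                     * of B, accumulating into the (ii,jj) block of C */
                    for (int i = ii; i < ii + block_size; i++)
                        for (int j = jj; j < jj + block_size; j++)
                            for (int k = kk; k < kk + block_size; k++)
                                c[i*N+j] += a[i*N+k] * b[k*N+j];
    }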

What inequality between the block size and the cache size needs to be fulfilled for this method to be efficient?

B.1.7 Compile and run test_mmm_block.c:

    tux-1$ cd ${HW3ROOT}/mmm_cpu
    tux-1$ gcc -O3 -o test_mmm_block test_mmm_block.c -lrt
    tux-1$ ./test_mmm_block

Plot the results. What is the maximum efficient block size? What can you deduce about the cache size?

B.1.8 Modify the file test_mmm_inter.c to obtain results for the same matrix size. How does the performance of the blocked code compare to the different loop orders?

B.1.9 Interchange the loops in the blocked matrix multiply code. What is the best performance you can obtain by combining the two techniques?

B.1.10 Submission

Submit the following:

  The modified files test_mmm_inter.c and test_mmm_block.c.
  The plots generated and your answers to the questions in B.1.2 to B.1.9.

B.2 Optimizing Matrix Multiplication for GPU

In this section you will learn how to use the CUDA toolkit to program GPUs for general-purpose computing. To make the CUDA tools available in your shell, use the following command:

    tux-1$ source ~elec3441/elec3441hw3.bashrc

Because the graphics card on tux-1 is quite old, documentation for the installed CUDA version 6.5 is no longer available online. You can find the documentation in the folder /usr/local/cuda/doc. It is recommended to read the CUDA C Programming Guide, most importantly chapters 2, 4 and 5. Note that the GPU in tux-1 offers Compute Capability 1.1, which affects many features you will see throughout the manual.

B.2.1 Structure of a CUDA program

CUDA is a SIMT (Single Instruction Multiple Thread) model. A CUDA kernel is a function that is executed simultaneously in many threads. It is launched using the following syntax:

r1.2 Page 13 of 17

    kernel_fn<<< grid_dim3, block_dim3 >>>(arguments);

The parameters grid_dim3 and block_dim3 specify the number and layout of the parallel threads to be launched. Each is of type dim3, specifying 3 dimensions. Threads are organized in up-to-3-dimensional blocks, and all the blocks of a kernel are arranged in a grid. Recent GPUs allow a 3D grid, but for Compute Capability 1.1 only 2 dimensions are allowed. During execution, each thread has access to its block index and thread index to identify itself and the data it shall work on. The main task in GPU programming is to efficiently organize memory accesses among the threads.

Examine the files matrixmul.cu and matrixmul_naive.cuh in ${HW3ROOT}/mmm_gpu. The host (CPU) code is in matrixmul.cu, and matrixmul_naive.cuh contains the kernel that will be executed on the GPU. Answer the following questions:

B.2.2 What are the block and grid dimensions for a 2048x2048 matrix?

B.2.3 What data elements in A, B and C will an individual thread touch?

B.2.4 What are we comparing the performance of the CUDA code to?

B.2.5 Why do we include the CUDA data management functions (cudaMemcpy, etc.) when we measure the time for matrix multiplication on the GPU?

B.2.6 Compile the code and run it for matrix sizes from 16x16 to 2048x2048:

    tux-1$ make
    tux-1$ ./matrixmul -length=16

Plot the CPU and GPU performances. At what size does using the GPU become more efficient than the CPU?

B.2.7 GPU Memory Hierarchy

The first step in improving the performance of a CUDA kernel is to adapt memory accesses to the GPU's memory hierarchy. Unlike in CPUs, the memory hierarchy of the GPU is mostly exposed to the programmer, and you will have to manually copy data from the more distant to the closer levels and back again. The following types of memory are available on a GPU:

Registers: Registers are the fastest memory, accessible without any latency on each clock cycle, just as on a regular CPU. A thread's registers cannot be shared with other threads.

Shared Memory: Shared memory is comparable to L1 cache memory on a regular CPU. It resides close to the multiprocessor and has very short access times. Shared memory is shared among all the threads of a given block. The CUDA C Best Practices Guide has more on shared memory optimization considerations.

Global Memory: Global memory resides on the device, but off-chip from the multiprocessors, so that access times to global memory can be 100 times greater than to shared memory. All threads in the kernel have access to all data in global memory.

r1.2 Page 14 of 17

Local Memory: Thread-specific memory stored where global memory is stored. Variables are stored in a thread's local memory if the compiler decides that there are not enough registers to hold the thread's data. This memory is slow, even though it's called "local".

Constant Memory: 64 KB of constant memory resides off-chip from the multiprocessors, and is read-only. The host code writes to the device's constant memory before launching the kernel, and the kernel may then read this memory. Constant memory access is cached: each multiprocessor can cache up to 8 KB of constant memory, so that subsequent reads from constant memory can be very fast. All threads have access to constant memory.

Texture Memory: Specialized memory for surface texture mapping, not discussed in this module.

The equivalent of cache blocking is to subdivide the matrices into tiles. Each thread block is in charge of one tile of the result matrix C. It loads the necessary tiles of A and B into shared memory, computes the result values in registers, and finally writes the result back to global memory.

B.2.8 Examine the code in matrixmul_tiling.cuh. What is the role of the for loop? For a 128x128 matrix, how many iterations will it have? Modify matrixmul.cu to make use of the tiled code:

    -#include "matrixmul_naive.cuh"
    +#include "matrixmul_tiling.cuh"

Plot the performance for matrix sizes from 16x16 to 2048x2048 and compare it to the previous version. You can comment out the line #define COMPARE_CPU at the top of matrixmul.cu to avoid re-generating the CPU performance numbers.

B.2.9 Coalescing

Similarly to the effect of cache lines on a CPU, memory accesses to consecutive memory addresses by threads with consecutive indexes are more efficient than accesses to different addresses. This is called coalescing. Read appendix G.3 of the CUDA C Programming Guide for a detailed explanation.

B.2.10 Show that the accesses to matrix A already fulfil the conditions listed in section G.3.2. Why can the accesses to matrix B not be coalesced?

B.2.11 matrixmul_coalescing.cuh contains a version that allows coalesced accesses to B. How has this been achieved?

B.2.12 Plot the performance of the coalesced version for matrix sizes from 16x16 to 2048x2048 and compare it to the previous versions.

B.2.13 Shared Memory Bank Conflicts

You will find that, instead of the expected performance improvement from coalesced memory accesses, the last version actually performs more poorly. The problem is that although we have improved the global memory access pattern, we have introduced a problem with the access patterns to shared memory, namely bank conflicts. Section G.3.3 of the appendix to the CUDA C Programming Guide explains how shared memory is structured in banks. To avoid bank conflicts, each group of 16 consecutive thread IDs needs to access 32-bit words in 16 different banks, i.e. with a different word alignment. The easiest way to achieve this is to access consecutive words, which is what has been implemented in matrixmul_nobankconflict.cuh. Explain the relevant change.

r1.2 Page 15 of 17
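To tie B.2.7, B.2.8 and B.2.13 together, here is a minimal sketch of a shared-memory tiled kernel. It is illustrative only; the handout's matrixmul_tiling.cuh will differ in details, the TILE value of 16 is an assumption, and n is assumed to be a multiple of TILE:

    #define TILE 16  /* assumed tile width; the handout's value may differ */

    /* Each thread block computes one TILE x TILE tile of C, marching
     * across the corresponding tiles of A and B. */
    __global__ void mmm_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int tx = threadIdx.x, ty = threadIdx.y;
        int row = blockIdx.y * TILE + ty;
        int col = blockIdx.x * TILE + tx;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; t++) {   /* the per-tile loop of B.2.8 */
            As[ty][tx] = A[row * n + (t * TILE + tx)];
            Bs[ty][tx] = B[(t * TILE + ty) * n + col];
            __syncthreads();                   /* both tiles fully loaded */
            for (int k = 0; k < TILE; k++)
                acc += As[ty][k] * Bs[k][tx];  /* As[ty][k] is a broadcast
                                                * within a half-warp: no
                                                * bank conflict (B.2.14) */
            __syncthreads();                   /* done with this tile pair */
        }
        C[row * n + col] = acc;
    }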

B.2.14 Why does the access AS[ty][k] not generate bank conflicts?

B.2.15 Plot the performance of this version for matrix sizes from 16x16 to 2048x2048 and compare it to the previous versions.

B.2.16 Optional: further optimizations

If you are looking for inspiration for other improvements you could make in Part C, the page explains more techniques you can try.

B.2.17 Submission

Submit the following:

  A plot comparing the performance of all the different versions of CUDA matrix multiply examined in this section.
  Your answers to the questions.

r1.2 Page 16 of 17

Part C: Open-ended Project

C.1 Convolution

In this part you will apply the techniques you have learned in Part B to a different algorithm: a convolution filter. You can find the code in the folder ${HW3ROOT}/convolutionSeparable.

A convolution filter is applied to an image to achieve some effect, for example a blur. It is defined by a convolution kernel, which is a matrix of small size (e.g. 3x3). Each output pixel is calculated based on the input pixel at the same location as well as its neighbors. For each input pixel location, an area of the same size as the kernel, centered at that location, is considered. Each input pixel value of this area is multiplied by the corresponding kernel value, and these products are summed up to produce the output pixel value.

The convolution kernel you are optimizing in this part has a further property: it is separable. This means that the kernel matrix is the product of two vectors; in our particular example it is the product of a column vector and its own transpose, K = v v^T. You may use this feature to your advantage when optimizing the code (see the sketch after C.1.1).

C.1.1 Submission

Submit your implementation of the convolution filter for CPU and GPU together with a report on how you optimized the code for each target.

r1.2 Page 17 of 17
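To illustrate why separability matters, here is a minimal CPU sketch (hypothetical helper names, not the handout's API): a (2R+1)x(2R+1) separable filter K = v v^T is applied as a horizontal 1-D pass followed by a vertical 1-D pass, reducing the work per pixel from (2R+1)^2 to 2(2R+1) multiply-adds.

    #define RADIUS 1  /* assumed filter radius, i.e. a 3x3 separable kernel */

    static inline int clampi(int x, int lo, int hi)
    {
        return x < lo ? lo : (x > hi ? hi : x);
    }

    /* Two-pass separable convolution; image borders are clamped. */
    void convolve_separable(const float *in, float *tmp, float *out,
                            int w, int h, const float v[2 * RADIUS + 1])
    {
        for (int y = 0; y < h; y++)        /* pass 1: horizontal, in -> tmp */
            for (int x = 0; x < w; x++) {
                float s = 0.0f;
                for (int k = -RADIUS; k <= RADIUS; k++)
                    s += v[k + RADIUS] * in[y * w + clampi(x + k, 0, w - 1)];
                tmp[y * w + x] = s;
            }
        for (int y = 0; y < h; y++)        /* pass 2: vertical, tmp -> out */
            for (int x = 0; x < w; x++) {
                float s = 0.0f;
                for (int k = -RADIUS; k <= RADIUS; k++)
                    s += v[k + RADIUS] * tmp[clampi(y + k, 0, h - 1) * w + x];
                out[y * w + x] = s;
            }
    }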
