Homework 3 (r1.2)
Due: Part (A) -- Apr 28, 2017, 11:55pm; Part (B) -- Apr 28, 2017, 11:55pm; Part (C) -- Apr 28, 2017, 11:55pm
Second Semester, 2016-17

Instruction: Submit your answers electronically through Moodle.

There are 3 major parts in this homework.

Part A includes questions that aim to help you with understanding the lecture materials. They resemble the kind of questions you will encounter in quizzes and the final exam. Your answers to this part will be graded on your effort.

Part B consists of hands-on exercises that require you to design and evaluate processor systems using various software and hardware tools, including Chisel and the RISC-V compilation tool chain. They are designed to help you understand real-world processor design and the use of various tools to help you along the way. This part of the homework will be graded on correctness.

Part C contains open-ended mini-project ideas. They are open-ended by nature, meaning there are no right or wrong answers. You must choose to attempt one of the several available topics. You may work individually or in groups of up to 3 for this part. If you work in groups, each of you must submit an independent report on the project.

The following table summarizes the 3 parts:

    Part  Type               Indv/Grp                      Grading
    A     Basic problem set  Individual                    Graded on effort
    B     Hands-on           Individual or group of 2 to 3 Graded on correctness
    C     Mini-project       Individual or group of 2 to 3 Graded on effort

In all cases, you are encouraged to discuss the homework problems offline or online using Piazza. However, you should not ask for or give out solutions directly, as that defeats the purpose of having homework exercises. Giving out answers or copying answers directly will likely constitute an act of plagiarism.
Part A: Problem Set

A.1 Column-Row

In class, we discussed how matrices may be stored in memory with row-major or column-major orientation. In a row-major organization, matrices are stored row-by-row in memory, while in a column-major organization, matrices are stored column-by-column. Standard C compilers organize matrices as row-major, while Matlab organizes matrices as column-major. Consider the following C code:

    #define N 128
    int a[N][N];    // a[0][0] located at 0xA
    int i, j, sum;  // i, j, sum in registers

    sum = 0;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = sum + a[i][j];
        }
    }

You are running this code on a processor with the following tiny data cache:

- Direct mapped
- 8-word block size
- 32 entries

A.1.1 What is the capacity of the cache?

A.1.2 When the code is run with an initially empty cache, list out the sequence of hits and misses that are generated. What type of misses are they? What is the miss rate?

A.1.3 Suppose now that the i and j loops are exchanged as in the following code:

    for (j = 0; j < N; j++) {
        for (i = 0; i < N; i++) {
            sum = sum + a[i][j];
        }
    }

Repeat your work in A.1.2. How does it affect your miss rate?

A.1.4 If instead the matrix is stored in column-major order, which loop order may produce better cache performance?
A.1.5 If the cache is changed to a 4-way set associative cache while keeping the same capacity, how would it change the miss rate in the above matrix accesses?

A.1.6 Now consider a different loop that copies the top half of the matrix to the bottom half:

    for (i = 0; i < N/2; i++) {
        for (j = 0; j < N; j++) {
            a[i + N/2][j] = a[i][j];
        }
    }

When executed with the original direct mapped cache, what is the sequence of hits and misses that gets generated? What is the miss rate? Assume the matrix is stored as row-major in memory. Also assume a write-back and write-allocate policy.

A.1.7 Repeat A.1.6 if a 2-way set associative cache is used. Assume a write-through and no-write-allocate policy.
A.2 Page Table & TLB

In this exercise you will experiment with the interaction between the TLB and the page table in a VM system. Assume the following system configuration:

- 16-bit virtual and physical addresses
- 256 B page size
- 4-entry, fully associative TLB, true LRU replacement policy

Initially, the TLB and page table contain the following entries. An invalid entry is marked as empty. If a page is located on hard disk, it is marked as disk.

TLB:

    Tag    PPN
    0xAC   0x07
    0x04   0xCA
    0x28   0xD0
    empty

Page Table:

    Loc    PPN or Disk
    0xFF   disk
    0xFE   empty
    0xFD   empty
    0xFC   empty
    0xFB   0xAB
    0xFA   empty
    ...    ...
    0x08   empty
    0x07   empty
    0x06   disk
    0x05   empty
    0x04   0xCA
    0x03   0x18
    0x02   disk
    0x01   empty
    0x00   0xFF

A.2.1 The following sequence of memory accesses is issued: 0xFF02, 0x00FF, 0xAC98, 0x0801, 0xFC98. Assume there is only 1 process in the system. Answer the following:

1. What is the final state of the TLB and page table after the above accesses?
2. For each memory access, is it a hit in the TLB, a hit in the page table, or a page fault?

A.2.2 Given the above VM setup, what is its TLB reach?

A.2.3 What are some of the advantages and disadvantages of a larger page size?

A.2.4 Assume now that there are 2 processes running in the system. The 2 processes generate the following list of memory references:
Process 1:
0x0700, 0x0704, 0xFA00, 0xFA04, 0x0700, 0x0704, 0xFA00, 0xFA04, 0x0708, 0xFE00, 0xFE04, 0x0708, 0x070C

Process 2:
0xFFF0, 0xFFEC, 0xFFE8, 0x070C, 0x0710, 0x0714, 0x0718, 0xFE00, 0xFE04, 0xFE0C, 0xFE04, 0xFE00, 0xFE04

Given the above accesses, how many TLB misses will be generated for Process 1 and Process 2 respectively? How many page faults are generated?

A.2.5 Describe what kind of hardware/software changes you can make to reduce the number of TLB misses in such a scenario.
A.3 Vector Co-Processor

Adapted from the 2016 final exam.

As an attempt to improve the performance of a simple 32-bit scalar processor, you are considering adding a vector co-processor to your system as shown below:

[Figure: the CPU and its data cache connect to the main memory (DRAM); the vector co-processor is attached directly to the main memory.]

As shown in the figure, the additional co-processor, called MMV, is a memory-memory vector unit that is attached directly to the system main memory. When instructed by the main processor, it fetches data from the main memory, operates on the data, and stores the results back to the main memory. Your task is to evaluate the effectiveness of this MMV unit on the following very important program:

    // int i, a;
    #define N 256
    int x[N], y[N];

    Preprocess(x);
    // compute kernel
    for (i = 0; i < N; i++) {
        y[i] = x[i] + y[i];
    }

Mathematically, the loop marked as compute kernel performs the operation y = x + y, where x and y are vectors with N elements.

A.3.1 Scalar Performance

The D-cache of the processor has the following properties:

- Cache miss takes 150 cycles
- Cache hit takes 1 cycle
- Capacity: 4 MiB
- Organization: 2-way set associative
- Cache line is 4 words
- Policy: write back, write allocate
The for loop marked as compute kernel in the above C program is compiled as follows:

    # a0 = 256
    # a1 is base address of x[]: 0xA
    # a2 is base address of y[]: 0xA0C
    00: loop: addi a0, a0, -1
    04:       lw   t1, 0(a1)
    08:       lw   t2, 0(a2)
    10:       add  t2, t1, t2
    14:       sw   t2, 0(a2)
    18:       addi a1, a1, 4
    1C:       addi a2, a2, 4
    20:       bne  a0, zero, loop

Assume the CPI of all instructions is 1, while the CPI of memory operations depends on cache performance. Assuming the cache is initially empty, how many cycles does it take to complete the above loop?
A.3.2 The vector co-processor MMV is able to perform similar vector additions by directly reading input from the main memory and storing the results to the main memory. Taking into account DRAM timing, on average it can perform 16 additions in 100 cycles, which includes the cycles needed to check the loop boundary. Now, assuming all the data are in main memory, how many cycles does MMV take to complete the loop?

A.3.3 When compared to the scalar processor, what is the speedup offered by MMV, considering the compute kernel loop only?

A.3.4 One limitation of MMV is that it operates directly on the main memory. Assume both x[] and y[] are dirty in the cache and thus require flushing before MMV can take over to accelerate the loop. Now, assume flushing each cache line requires 150 cycles. How long does it take to flush the entire vectors x[] and y[] from the cache to main memory if they are all dirty?

A.3.5 Taking into account the time to flush x[] and y[] to memory, what is now the speedup of using MMV?

A.3.6 In practice, since it is usually difficult for the OS to know in which cache line a variable resides, it often needs to flush the entire cache instead. Recall that the cache is 4 MiB; how long does it take to flush the entire cache, assuming 25% of the cache lines are dirty? Also, what is the resulting speedup of using MMV?
A.3.7 Your project partner suggests that the need to flush dirty lines from the cache can be eliminated by implementing a write-through policy. Does the use of a write-through policy eliminate the need to flush dirty lines from the cache? Does it resolve all data coherence problems between MMV and the scalar processor? Explain your answers.
A.4 Page Table Size

You are considering a system with the following parameters:

- 64-bit architecture
- 1 MiB page size
- Each page table entry is 8 bytes

A.4.1 Assume a single linear page table, with all entries pre-allocated when a process is launched. What is the size of the page table for 1 process?

A.4.2 Obviously, it is unrealistic to allocate all available space for a process. In practice, most processes reference only a few clusters of addresses, such as the addresses within the data heap, the stack, and the addresses of the program instructions. Assuming each process accesses 4 disjoint clusters of addresses, each consisting of 256 MiB of data, what is the theoretical minimum amount of storage needed for page table entries (PTEs)?

A.4.3 To get a storage requirement closer to the above theoretical minimum, a 2-level page table system is being explored. The first-level page table is preallocated when a process is launched, and contains pointers to second-level page tables. A second-level page table is allocated on demand as the 4 clusters of memory regions are allocated. The following breakdown of the address between the 2 levels is used:

    +-------------+--------------+-------------+
    | First Level | Second Level | Page Offset |
    +-------------+--------------+-------------+

If the 4 clusters of memory are located in a contiguous 1 GiB virtual address space starting from address 0, how many second-level page tables are allocated for this process? With that, what is the total size of the page tables in both levels?

A.4.4 If the 4 regions are located at 0x , 0x , 0x and 0xC instead, will it affect your answer from the previous part?

A.4.5 If a 3-level page table is used instead, can it reduce the total memory that is needed to store all page table entries, assuming the addresses are distributed as in A.4.4?

A.4.6 Assuming the addresses are distributed as in A.4.4, will having even more levels of page table be beneficial? Is there a limit on how many levels of page table may be used? What are the tradeoffs?
Part B: Hands-on Exercise

In this exercise, you will learn techniques to optimize an application for the CPU and for the GPU. The core of the matrix-matrix multiplication program is the following loop:

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                c[i*n+j] += a[i*n+k] * b[k*n+j];
            }
        }
    }

B.1 Optimizing Matrix-Matrix Multiplication for CPU

Obtain the homework files:

    tux-1$ tar xzf ~elec3441/elec3441hw3.tar.gz
    tux-1$ cd hw3
    tux-1$ export HW3ROOT=$PWD

B.1.1 Loop Interchange

There are no data or control dependencies between loop iterations, so it is possible to reorder operations arbitrarily. Depending on the order of the loops, one of the three elements a[i*n+k], b[k*n+j], c[i*n+j] stays constant during the whole inner loop. Compare the following three versions:

IJK:

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            x = 0;
            for (k = 0; k < N; k++) {
                x += a[i*n+k] * b[k*n+j];
            }
            c[i*n+j] += x;
        }
    }

IKJ:

    for (i = 0; i < N; i++) {
        for (k = 0; k < N; k++) {
            x = a[i*n+k];
            for (j = 0; j < N; j++) {
                c[i*n+j] += x * b[k*n+j];
            }
        }
    }

JKI:

    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            x = b[k*n+j];
            for (i = 0; i < N; i++) {
                c[i*n+j] += a[i*n+k] * x;
            }
        }
    }

B.1.2 Analyze the cache behavior of each version. Assume the matrices are very large, so that even one row will not fit in the cache. Matrices are stored in row-major order. One cache line fits four matrix elements. How many loads and stores does each iteration of the innermost loop need? How many cache misses per iteration will it produce for matrices A, B and C?

B.1.3 What will be the behavior for loop orders JIK, KIJ and KJI?

B.1.4 The file test_mmm_inter.c measures the time needed for the three above versions of matrix multiplication on various matrix sizes. Compile and run this code:

    tux-1$ cd ${HW3ROOT}/mmm_cpu
    tux-1$ gcc -O3 -o test_mmm_inter test_mmm_inter.c -lrt
    tux-1$ ./test_mmm_inter

It will output the time taken in nanoseconds for the loop orders ijk, kij and jki over a range of matrix sizes. Plot the time taken per iteration (time/n^3) for each of the loop orders. Does the result match your analysis in B.1.2?

B.1.5 Can you make a guess at the L1 and L2 cache sizes of the machine based on this graph? You can change the values of BASE, ITER and DELTA to get a closer look at smaller matrix sizes.

B.1.6 Blocking

Another way to apply the observation that the different iterations of the loop body can be executed in any order is cache blocking. After loading a small part of the matrices into the cache, as many operations using this data as possible should be executed before loading new data, in order to obtain the best performance. Examine the function mmm_iijjkk_blocked in test_mmm_block.c. In the following diagram, each numbered block is of size block_size x block_size. Highlight the blocks of A, B and C that will be accessed for the values ii=1 and jj=2.
What inequality between the block size and the cache size needs to be fulfilled for this method to be efficient?

B.1.7 Compile and run test_mmm_block.c:

    tux-1$ cd ${HW3ROOT}/mmm_cpu
    tux-1$ gcc -O3 -o test_mmm_block test_mmm_block.c -lrt
    tux-1$ ./test_mmm_block

Plot the results. What is the maximum efficient block size? What can you deduce about the cache size?

B.1.8 Modify test_mmm_inter.c to obtain results for the same matrix size. How does the performance of the blocked code compare to that of the different loop orders?

B.1.9 Interchange the loops in the blocked matrix multiply code. What is the best performance you can obtain by combining the two techniques?

B.1.10 Submission

Submit the following:

- The modified files test_mmm_inter.c and test_mmm_block.c.
- The plots generated and your answers to the questions in B.1.2 to B.1.9.

B.2 Optimizing Matrix Multiplication for GPU

In this section you will learn how to use the CUDA toolkit to program GPUs for general-purpose computing. To make the CUDA tools available in your shell, use the following command:

    tux-1$ source ~elec3441/elec3441hw3.bashrc

Because the graphics card on tux-1 is quite old, documentation for the installed CUDA version 6.5 is no longer available online. You can find the documentation in the folder /usr/local/cuda/doc. It is recommended to read the CUDA C Programming Guide, most importantly chapters 2, 4 and 5. Note that the GPU in tux-1 offers Compute Capability 1.1, which affects many features you will see throughout the manual.

B.2.1 Structure of a CUDA program

CUDA is a SIMT (Single Instruction Multiple Thread) model. A CUDA kernel is a function that is executed simultaneously in many threads. It is launched using the following syntax:
    kernel_fn<<< grid_dim3, block_dim3 >>>(arguments);

The parameters grid_dim3 and block_dim3 specify the number and layout of the parallel threads to be launched. Each parameter is of type dim3, specifying 3 dimensions. Threads are organized in up-to-3-dimensional blocks. All the blocks of a kernel are arranged in a grid. Recent GPUs allow a 3D grid, but for Compute Capability 1.1 only 2 dimensions are allowed. During execution, each thread has access to its block index and thread index to identify itself and the data it shall work on. The main task in GPU programming is to efficiently organize memory accesses among the threads.

Examine the files matrixmul.cu and matrixmul_naive.cuh in ${HW3ROOT}/mmm_gpu. The host (CPU) code is in matrixmul.cu, and matrixmul_naive.cuh contains the kernel that will be executed on the GPU. Answer the following questions:

B.2.2 What are the block and grid dimensions for a 2048x2048 matrix?

B.2.3 What data elements in A, B and C will an individual thread touch?

B.2.4 What are we comparing the performance of the CUDA code to?

B.2.5 Why do we include the CUDA data management functions (cudaMemcpy etc.) when we measure the time for matrix multiplication on the GPU?

B.2.6 Compile the code and run it for matrix sizes from 16x16 to 2048x2048:

    tux-1$ make
    tux-1$ ./matrixmul -length=16

Plot the CPU and GPU performances. At what size does using the GPU become more efficient than the CPU?

B.2.7 GPU Memory Hierarchy

The first step in improving the performance of a CUDA kernel is to adapt memory accesses to the GPU's memory hierarchy. Unlike in CPUs, the memory hierarchy of the GPU is mostly exposed to the programmer, and you will have to manually copy data from the more distant to the closer levels and back again. The following types of memory are available on a GPU:

Registers: Registers are the fastest memory, accessible without any latency on each clock cycle, just as on a regular CPU. A thread's registers cannot be shared with other threads.

Shared Memory: Shared memory is comparable to L1 cache memory on a regular CPU. It resides close to the multiprocessor and has very short access times. Shared memory is shared among all the threads of a given block. The section on shared memory in the CUDA C Best Practices Guide has more on shared memory optimization considerations.

Global Memory: Global memory resides on the device, but off-chip from the multiprocessors, so that access times to global memory can be 100 times greater than to shared memory. All threads in the kernel have access to all data in global memory.
Local Memory: Thread-specific memory stored where global memory is stored. Variables are stored in a thread's local memory if the compiler decides that there are not enough registers to hold the thread's data. This memory is slow, even though it's called local.

Constant Memory: 64 KB of constant memory resides off-chip from the multiprocessors, and is read-only. The host code writes to the device's constant memory before launching the kernel, and the kernel may then read this memory. Constant memory access is cached: each multiprocessor can cache up to 8 KB of constant memory, so that subsequent reads from constant memory can be very fast. All threads have access to constant memory.

Texture Memory: Specialized memory for surface texture mapping, not discussed in this module.

The equivalent of cache blocking on the GPU is to subdivide the matrices into tiles. Each thread block is in charge of one tile of the result matrix C. It loads the necessary tiles of A and B into shared memory, computes the result values in registers, and finally writes the results back to global memory.

B.2.8 Examine the code in matrixmul_tiling.cuh. What is the role of the for loop? For a 128x128 matrix, how many iterations will it have? Modify matrixmul.cu to make use of the tiled code:

    --#include "matrixmul_naive.cuh"
    ++#include "matrixmul_tiling.cuh"

Plot the performance for matrix sizes from 16x16 to 2048x2048 and compare it to the previous version. You can comment out the line #define COMPARE_CPU at the top of matrixmul.cu to avoid re-generating the CPU performance numbers.

B.2.9 Coalescing

Similarly to the effect of cache lines on a CPU, memory accesses to consecutive memory addresses by threads with consecutive indexes are more efficient than accesses to scattered addresses. This is called coalescing. Read appendix G.3 of the CUDA C Programming Guide for a detailed explanation.

B.2.10 Show that the accesses to matrix A already fulfil the conditions listed in section G.3.2. Why can the accesses to matrix B not be coalesced?

B.2.11 matrixmul_coalescing.cuh contains a version that allows coalesced accesses to B. How has this been achieved?

B.2.12 Plot the performance of the coalesced version for matrix sizes from 16x16 to 2048x2048 and compare it to the previous versions.

B.2.13 Shared Memory Bank Conflicts

You will find that, instead of the expected performance improvement from coalesced memory accesses, the last version actually performs more poorly. The problem is that although we have improved the global memory access pattern, we have introduced a problem with the access pattern to shared memory, namely bank conflicts. Section G.3.3 of the Appendix to the CUDA C Programming Guide explains how shared memory is structured in banks. To avoid bank conflicts, each group of 16 consecutive thread IDs needs to access 32-bit words in 16 different banks, i.e. with a different word alignment. The easiest way to achieve this is to access consecutive words, which is what has been implemented in matrixmul_nobankconflict.cuh. Explain the relevant change.
B.2.14 Why does the access AS[ty][k] not generate bank conflicts?

B.2.15 Plot the performance of this version for matrix sizes from 16x16 to 2048x2048 and compare it to the previous versions.

B.2.16 Optional: further optimizations

If you are looking for inspiration for other improvements you could make in part C, the page explains more techniques you can try.

B.2.17 Submission

Submit the following:

- A plot comparing the performance of all the different versions of CUDA matrix multiply examined in this section.
- Your answers to the questions.
Part C: Open-ended Project

C.1 Convolution

In this part you will apply the techniques you have learned in part B to a different algorithm: a convolution filter. You can find the code in the folder ${HW3ROOT}/convolutionSeparable.

A convolution filter is applied to an image to achieve some effect, for example a blur. It is defined by a convolution kernel, which is a matrix of small size (e.g. 3x3). Each output pixel is calculated based on the input pixel at the same location as well as its neighbors. For each input pixel location, an area of the same size as the kernel, centered at that location, is considered. Each input pixel value in this area is multiplied by the corresponding kernel value, and these products are summed up to produce the output pixel value.

The convolution kernel you are optimizing in this part has a further property: it is separable. This means that the kernel matrix is the product of two vectors; in our particular example it is the product of a column vector and its own transpose. You may use this feature to your advantage when optimizing the code.

C.1.1 Submission

Submit your implementation of the convolution filter for CPU and GPU together with a report on how you optimized the code for each target.
More informationCOSC 3406: COMPUTER ORGANIZATION
COSC 3406: COMPUTER ORGANIZATION Home-Work 5 Due Date: Friday, December 8 by 2.00 pm Instructions for submitting: Type your answers and send it by email or take a printout or handwritten (legible) on paper,
More informationCS152 Computer Architecture and Engineering Virtual Memory and Address Translation Assigned March 3 Problem Set #3 Due March 12 http://inst.eecs.berkeley.edu/~cs152/sp09 The problem sets are intended to
More information2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions
Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,
More informationVirtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. November 15, MIT Fall 2018 L20-1
Virtual Memory Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L20-1 Reminder: Operating Systems Goals of OS: Protection and privacy: Processes cannot access each other s data Abstraction:
More informationThe course that gives CMU its Zip! Memory System Performance. March 22, 2001
15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache
More information198:231 Intro to Computer Organization. 198:231 Introduction to Computer Organization Lecture 14
98:23 Intro to Computer Organization Lecture 4 Virtual Memory 98:23 Introduction to Computer Organization Lecture 4 Instructor: Nicole Hynes nicole.hynes@rutgers.edu Credits: Several slides courtesy of
More informationVirtual Memory. CS 3410 Computer System Organization & Programming
Virtual Memory CS 3410 Computer System Organization & Programming These slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer. Where are we now and
More informationEN1640: Design of Computing Systems Topic 06: Memory System
EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring
More informationPractice Exercises 449
Practice Exercises 449 Kernel processes typically require memory to be allocated using pages that are physically contiguous. The buddy system allocates memory to kernel processes in units sized according
More informationVirtual Memory 3. Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University. P & H Chapter 5.4
Virtual Memory 3 Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter 5.4 Project3 available now Administrivia Design Doc due next week, Monday, April 16 th Schedule
More informationECE 3056: Architecture, Concurrency, and Energy of Computation. Sample Problem Set: Memory Systems
ECE 356: Architecture, Concurrency, and Energy of Computation Sample Problem Set: Memory Systems TLB 1. Consider a processor system with 256 kbytes of memory, 64 Kbyte pages, and a 1 Mbyte virtual address
More informationand data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed
5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid
More informationChapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs
Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple
More informationLecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University
Lecture 12 Memory Design & Caches, part 2 Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements HW3 is due today PA2 is available on-line today Part 1 is due on 2/27
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationVirtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. April 12, 2018 L16-1
Virtual Memory Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L16-1 Reminder: Operating Systems Goals of OS: Protection and privacy: Processes cannot access each other s data Abstraction:
More informationCache Memories. Andrew Case. Slides adapted from Jinyang Li, Randy Bryant and Dave O Hallaron
Cache Memories Andrew Case Slides adapted from Jinyang Li, Randy Bryant and Dave O Hallaron 1 Topics Cache memory organiza3on and opera3on Performance impact of caches 2 Cache Memories Cache memories are
More informationCache Performance (H&P 5.3; 5.5; 5.6)
Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 18: Virtual Memory Lecture Outline Review of Main Memory Virtual Memory Simple Interleaving Cycle
More informationECE Sample Final Examination
ECE 3056 Sample Final Examination 1 Overview The following applies to all problems unless otherwise explicitly stated. Consider a 2 GHz MIPS processor with a canonical 5-stage pipeline and 32 general-purpose
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationCode Optimizations for High Performance GPU Computing
Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate
More informationChapter 8. Virtual Memory
Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:
More informationThis Unit: Main Memory. Virtual Memory. Virtual Memory. Other Uses of Virtual Memory
This Unit: Virtual Application OS Compiler Firmware I/O Digital Circuits Gates & Transistors hierarchy review DRAM technology A few more transistors Organization: two level addressing Building a memory
More informationLearning Outcomes. An understanding of page-based virtual memory in depth. Including the R3000 s support for virtual memory.
Virtual Memory 1 Learning Outcomes An understanding of page-based virtual memory in depth. Including the R3000 s support for virtual memory. 2 Memory Management Unit (or TLB) The position and function
More informationDenison University. Cache Memories. CS-281: Introduction to Computer Systems. Instructor: Thomas C. Bressoud
Cache Memories CS-281: Introduction to Computer Systems Instructor: Thomas C. Bressoud 1 Random-Access Memory (RAM) Key features RAM is traditionally packaged as a chip. Basic storage unit is normally
More informationLearning Outcomes. An understanding of page-based virtual memory in depth. Including the R3000 s support for virtual memory.
Virtual Memory Learning Outcomes An understanding of page-based virtual memory in depth. Including the R000 s support for virtual memory. Memory Management Unit (or TLB) The position and function of the
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationComputer Systems II. Memory Management" Subdividing memory to accommodate many processes. A program is loaded in main memory to be executed
Computer Systems II Memory Management" Memory Management" Subdividing memory to accommodate many processes A program is loaded in main memory to be executed Memory needs to be allocated efficiently to
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationWhat is Cache Memory? EE 352 Unit 11. Motivation for Cache Memory. Memory Hierarchy. Cache Definitions Cache Address Mapping Cache Performance
What is EE 352 Unit 11 Definitions Address Mapping Performance memory is a small, fast memory used to hold of data that the processor will likely need to access in the near future sits between the processor
More informationVirtual Memory, Address Translation
Memory Hierarchy Virtual Memory, Address Translation Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing,
More informationCS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck
Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find
More informationEITF20: Computer Architecture Part 5.1.1: Virtual Memory
EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache
More informationChangelog. Virtual Memory (2) exercise: 64-bit system. exercise: 64-bit system
Changelog Virtual Memory (2) Changes made in this version not seen in first lecture: 21 November 2017: 1-level example: added final answer of memory value, not just location 21 November 2017: two-level
More informationCache memories The course that gives CMU its Zip! Cache Memories Oct 11, General organization of a cache memory
5-23 The course that gies CMU its Zip! Cache Memories Oct, 2 Topics Generic cache memory organization Direct mapped caches Set associatie caches Impact of caches on performance Cache memories Cache memories
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationCS 3733 Operating Systems:
CS 3733 Operating Systems: Topics: Memory Management (SGG, Chapter 08) Instructor: Dr Dakai Zhu Department of Computer Science @ UTSA 1 Reminders Assignment 2: extended to Monday (March 5th) midnight:
More informationENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013
ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of
More informationCS 33. Caches. CS33 Intro to Computer Systems XVIII 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.
CS 33 Caches CS33 Intro to Computer Systems XVIII 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Cache Performance Metrics Miss rate fraction of memory references not found in cache (misses
More informationDirect Mapped Cache Hardware. Direct Mapped Cache. Direct Mapped Cache Performance. Direct Mapped Cache Performance. Miss Rate = 3/15 = 20%
Direct Mapped Cache Direct Mapped Cache Hardware........................ mem[xff...fc] mem[xff...f8] mem[xff...f4] mem[xff...f] mem[xff...ec] mem[xff...e8] mem[xff...e4] mem[xff...e] 27 8-entry x (+27+)-bit
More informationCSE 153 Design of Operating Systems
CSE 53 Design of Operating Systems Winter 28 Lecture 6: Paging/Virtual Memory () Some slides modified from originals by Dave O hallaron Today Address spaces VM as a tool for caching VM as a tool for memory
More informationWinter 2009 FINAL EXAMINATION Location: Engineering A Block, Room 201 Saturday, April 25 noon to 3:00pm
University of Calgary Department of Electrical and Computer Engineering ENCM 369: Computer Organization Lecture Instructors: S. A. Norman (L01), N. R. Bartley (L02) Winter 2009 FINAL EXAMINATION Location:
More informationlecture 18 cache 2 TLB miss TLB - TLB (hit and miss) - instruction or data cache - cache (hit and miss)
lecture 18 2 virtual physical virtual physical - TLB ( and ) - instruction or data - ( and ) Wed. March 16, 2016 Last lecture I discussed the TLB and how virtual es are translated to physical es. I only
More informationVirtual Memory Oct. 29, 2002
5-23 The course that gives CMU its Zip! Virtual Memory Oct. 29, 22 Topics Motivations for VM Address translation Accelerating translation with TLBs class9.ppt Motivations for Virtual Memory Use Physical
More informationCS 61C: Great Ideas in Computer Architecture. Virtual Memory III. Instructor: Dan Garcia
CS 61C: Great Ideas in Computer Architecture Virtual Memory III Instructor: Dan Garcia 1 Agenda Review of Last Lecture Goals of Virtual Memory Page Tables TranslaFon Lookaside Buffer (TLB) Administrivia
More informationExam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence
Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,
More informationc. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?
Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined
More informationStructure of Computer Systems
222 Structure of Computer Systems Figure 4.64 shows how a page directory can be used to map linear addresses to 4-MB pages. The entries in the page directory point to page tables, and the entries in a
More informationSolutions for Chapter 7 Exercises
olutions for Chapter 7 Exercises 1 olutions for Chapter 7 Exercises 7.1 There are several reasons why you may not want to build large memories out of RAM. RAMs require more transistors to build than DRAMs
More informationregisters data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.
Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3
More informationCS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches
CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates
More informationCOSC3330 Computer Architecture Lecture 20. Virtual Memory
COSC3330 Computer Architecture Lecture 20. Virtual Memory Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston Virtual Memory Topics Reducing Cache Miss Penalty (#2) Use
More informationCache Performance II 1
Cache Performance II 1 cache operation (associative) 111001 index offset valid tag valid tag data data 1 10 1 00 00 11 AA BB tag 1 11 1 01 B4 B5 33 44 = data (B5) AND = AND OR is hit? (1) 2 cache operation
More informationVirtual Memory. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]
Virtual Memory CS 3410 Computer System Organization & Programming [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Click any letter let me know you re here today. Instead of a DJ Clicker Question today,
More informationCarnegie Mellon. Cache Memories. Computer Architecture. Instructor: Norbert Lu1enberger. based on the book by Randy Bryant and Dave O Hallaron
Cache Memories Computer Architecture Instructor: Norbert Lu1enberger based on the book by Randy Bryant and Dave O Hallaron 1 Today Cache memory organiza7on and opera7on Performance impact of caches The
More informationVirtual Memory Review. Page faults. Paging system summary (so far)
Lecture 22 (Wed 11/19/2008) Virtual Memory Review Lab #4 Software Simulation Due Fri Nov 21 at 5pm HW #3 Cache Simulator & code optimization Due Mon Nov 24 at 5pm More Virtual Memory 1 2 Paging system
More informationCS162 Operating Systems and Systems Programming Lecture 14. Caching and Demand Paging
CS162 Operating Systems and Systems Programming Lecture 14 Caching and Demand Paging October 17, 2007 Prof. John Kubiatowicz http://inst.eecs.berkeley.edu/~cs162 Review: Hierarchy of a Modern Computer
More informationVirtual Memory, Address Translation
Memory Hierarchy Virtual Memory, Address Translation Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing,
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationLab 1 Part 1: Introduction to CUDA
Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using
More informationTopics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems
Topics: Memory Management (SGG, Chapter 08) 8.1, 8.2, 8.3, 8.5, 8.6 CS 3733 Operating Systems Instructor: Dr. Turgay Korkmaz Department Computer Science The University of Texas at San Antonio Office: NPB
More informationMemory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky
Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data
More informationCS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25
CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem
More information