Cache Memory and Performance
- Winfred Higgins
- 5 years ago
1 Cache Memory and Performance (Code and Caches)

Many of the following slides are taken with permission from the complete PowerPoint lecture notes for Computer Systems: A Programmer's Perspective (CS:APP), Randal E. Bryant and David R. O'Hallaron. The book is used explicitly in CS 2505 and CS 3214, and as a reference in this course, Computer Organization II.
2 Locality Example (1)

Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.

Question: Which of these functions has good locality?

    int sumarrayrows(int a[M][N]) {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int sumarraycols(int a[M][N]) {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }
3 Layout of C Arrays in Memory

C arrays are allocated in contiguous memory locations, with addresses ascending with the array index:

    int32_t A[10] = {0, 1, 2, 3, 4, ..., 8, 9};

(Figure: the ten 4-byte cells at consecutive ascending addresses near the top of the address space, 0x7FFF....)
4 Two-dimensional Arrays in C

In C, a two-dimensional array is an array of arrays:

    int32_t A[3][5] = { { 0,  1,  2,  3,  4},
                        {10, 11, 12, 13, 14},
                        {20, 21, 22, 23, 24} };

In fact, if we print the values as pointers, we see something like this:

    A:    0x7fff22e41d30
    A[0]: 0x7fff22e41d30
    A[1]: 0x7fff22e41d44
    A[2]: 0x7fff22e41d58

Each row occupies 5 * 4 = 20 = 0x14 bytes, so consecutive row pointers differ by 0x14.
5 Layout of C Arrays in Memory

Two-dimensional C arrays are allocated in row-major order: each row occupies contiguous memory locations.

    int32_t A[3][5] = { { 0,  1,  2,  3,  4},
                        {10, 11, 12, 13, 14},
                        {20, 21, 22, 23, 24} };

    address            value
    0x7FFF22E41D30      0
    0x7FFF22E41D34      1
    0x7FFF22E41D38      2
    0x7FFF22E41D3C      3
    0x7FFF22E41D40      4
    0x7FFF22E41D44     10
    0x7FFF22E41D48     11
    0x7FFF22E41D4C     12
    0x7FFF22E41D50     13
    0x7FFF22E41D54     14
    0x7FFF22E41D58     20
    0x7FFF22E41D5C     21
    0x7FFF22E41D60     22
    0x7FFF22E41D64     23
    0x7FFF22E41D68     24
6 Layout of C Arrays in Memory

    int32_t A[3][5] = { { 0,  1,  2,  3,  4},
                        {10, 11, 12, 13, 14},
                        {20, 21, 22, 23, 24} };

Stepping through the columns in one row:

    for (i = 0; i < 3; i++)
        for (j = 0; j < 5; j++)
            sum += A[i][j];

- accesses successive elements in memory
- if the cache block size B > 4 bytes, this exploits spatial locality: compulsory miss rate = 4 bytes / B

(The traversal covers i = 0, then i = 1, then i = 2, walking straight down the memory layout shown on the previous slide.)
7 Layout of C Arrays in Memory

    int32_t A[3][5] = { { 0,  1,  2,  3,  4},
                        {10, 11, 12, 13, 14},
                        {20, 21, 22, 23, 24} };

Stepping through the rows in one column:

    for (j = 0; j < 5; j++)
        for (i = 0; i < 3; i++)
            sum += A[i][j];

- accesses distant elements: no spatial locality!
- compulsory miss rate = 1 (i.e., 100%)

(The traversal covers j = 0, then j = 1, and so on, hopping 20 bytes between consecutive accesses in the memory layout shown above.)
8 Stride and Array Accesses

(Figure: the same 3x5 array layout, annotated with a stride-1 pattern that visits consecutive 4-byte cells and a stride-4 pattern that touches only every fourth cell.)
9 Writing Cache Friendly Code

Repeated references to variables are good (temporal locality). Stride-1 reference patterns are good (spatial locality).

Assume an initially-empty cache with 16-byte cache blocks.

    int sumarrayrows(int a[M][N]) {
        int row, col, sum = 0;
        for (row = 0; row < M; row++)
            for (col = 0; col < N; col++)
                sum += a[row][col];
        return sum;
    }

A 16-byte block holds four ints, so the accesses from i = 0, j = 0 to i = 0, j = 3 fall in one block (one miss, three hits), the accesses starting at i = 0, j = 4 fall in the next, and so on.

Miss rate = 1/4 = 25%
10 Writing Cache Friendly Code

Consider the previous slide, but assume that the cache uses a block size of 64 bytes instead of 16 bytes.

    int sumarrayrows(int a[M][N]) {
        int row, col, sum = 0;
        for (row = 0; row < M; row++)
            for (col = 0; col < N; col++)
                sum += a[row][col];
        return sum;
    }

Now one block holds sixteen ints, so only one access in sixteen misses.

Miss rate = 1/16 = 6.25%
11 Writing Cache Friendly Code

"Skipping" accesses down the rows of a column do not provide good locality:

    int sumarraycols(int a[M][N]) {
        int row, col, sum = 0;
        for (col = 0; col < N; col++)
            for (row = 0; row < M; row++)
                sum += a[row][col];
        return sum;
    }

Miss rate = 100% (That's actually somewhat pessimistic... depending on cache geometry.)
12 Layout of C Arrays in Memory

It's easy to write an array traversal and see the addresses at which the array elements are stored:

    int A[5] = {0, 1, 2, 3, 4};
    for (i = 0; i < 5; i++)
        printf("%d: %p\n", i, &A[i]);

We see there that for a 1D array, the index varies in a stride-1 pattern:

    i  address
    0: 28ABE0
    1: 28ABE4
    2: 28ABE8
    3: 28ABEC
    4: 28ABF0

stride-1: addresses differ by the size of an array cell (4 bytes, here).
13 Layout of C Arrays in Memory

    int B[3][5] = { ... };
    for (i = 0; i < 3; i++)
        for (j = 0; j < 5; j++)
            printf("%d %3d: %p\n", i, j, &B[i][j]);

We see that for a 2D array, the second index varies in a stride-1 pattern. i-j order:

    i  j  address
    0  0: 28ABA4
    0  1: 28ABA8
    0  2: 28ABAC
    0  3: 28ABB0
    0  4: 28ABB4
    1  0: 28ABB8
    1  1: 28ABBC
    1  2: 28ABC0
    ...           stride-1

But the first index does not vary in a stride-1 pattern. j-i order:

    i  j  address
    0  0: 28CC9C
    1  0: 28CCB0
    2  0: 28CCC4
    0  1: 28CCA0
    1  1: 28CCB4
    2  1: 28CCC8
    0  2: 28CCA4
    1  2: 28CCB8
    ...           stride-5 (0x14 bytes / 4 bytes per cell)
14 3D Arrays in C

    int32_t A[2][3][5] = { { {  0,   1,   2,   3,   4},
                             { 10,  11,  12,  13,  14},
                             { 20,  21,  22,  23,  24} },
                           { {  0,   1,   2,   3,   4},
                             {110, 111, 112, 113, 114},
                             {220, 221, 222, 223, 224} } };
15 Locality Example (2)

Question: Can you permute the loops so that the function scans the 3D array a[][][] with a stride-1 reference pattern (and thus has good spatial locality)?

    int sumarray3d(int a[N][N][N]) {
        int row, col, page, sum = 0;
        for (row = 0; row < N; row++)
            for (col = 0; col < N; col++)
                for (page = 0; page < N; page++)
                    sum += a[page][row][col];
        return sum;
    }
16 Layout of C Arrays in Memory

    int C[2][3][5] = { ... };
    for (i = 0; i < 2; i++)
        for (j = 0; j < 3; j++)
            for (k = 0; k < 5; k++)
                printf("%3d %3d %3d: %p\n", i, j, k, &C[i][j][k]);

We see that for a 3D array, the third index varies in a stride-1 pattern: in i-j-k order, consecutive addresses differ by 4 bytes. But if we change the order of access, we no longer have a stride-1 pattern. In k-j-i order, consecutive prints differ by 0x3C (60 bytes, one whole 3x5 plane, since the first index varies fastest); in i-k-j order they differ by 0x14 (20 bytes, one row).
17 Locality Example (2)

Question: Can you permute the loops so that the function scans the 3D array a[][][] with a stride-1 reference pattern (and thus has good spatial locality)?

    int sumarray3d(int a[N][N][N]) {
        int i, j, k, sum = 0;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    sum += a[k][i][j];
        return sum;
    }

This code does not yield good locality at all. The inner loop is varying the first index: worst case!
18 Locality Example (3)

Question: Which of these two exhibits better spatial locality?

    // struct of arrays
    struct soa {
        float *x;
        float *y;
        float *z;
        float *r;
    };

    compute_r(struct soa s) {
        for (i = 0; ...) {
            s.r[i] = s.x[i] * s.x[i]
                   + s.y[i] * s.y[i]
                   + s.z[i] * s.z[i];
        }
    }

    // array of structs
    struct aos {
        float x;
        float y;
        float z;
        float r;
    };

    compute_r(struct aos *s) {
        for (i = 0; ...) {
            s[i].r = s[i].x * s[i].x
                   + s[i].y * s[i].y
                   + s[i].z * s[i].z;
        }
    }

For the following discussion, assume a cache block size of 32 bytes, and that the cache is not capable of holding all the blocks of the relevant structure at once.
19 Locality Example (3)

    // struct of arrays
    struct soa {
        float *x;
        float *y;
        float *z;
        float *r;
    };

    struct soa s;
    s.x = malloc(1000 * sizeof(float));

Each of the four arrays occupies 4000 bytes: 4 bytes per cell, 1000 cells per array.
20 Locality Example (3)

    // array of structs
    struct aos {
        float x;
        float y;
        float z;
        float r;
    };

    struct aos s[1000];

16 bytes per cell, 1000 cells.
21 Locality Example (3)

Describe the locality exhibited by this algorithm:

    // struct of arrays
    compute_r(struct soa s) {
        for (int i = 0; i < 1000; i++) {
            s.r[i] = s.x[i] * s.x[i]
                   + s.y[i] * s.y[i]
                   + s.z[i] * s.z[i];
        }
    }

A 32-byte block holds 8 cells of one array (4 bytes per cell, 1000 cells per array):

    s.x[0] miss   s.y[0] miss   s.z[0] miss   s.r[0] miss
    s.x[1] hit    s.y[1] hit    s.z[1] hit    s.r[1] hit
    ...
    s.x[7] hit    s.y[7] hit    s.z[7] hit    s.r[7] hit
    s.x[8] miss   s.y[8] miss   s.z[8] miss   s.r[8] miss
22 Locality Example (3)

Describe the locality exhibited by this algorithm:

    // struct of arrays
    compute_r(struct soa s) {
        for (int i = 0; i < 1000; i++) {
            s.r[i] = s.x[i] * s.x[i]
                   + s.y[i] * s.y[i]
                   + s.z[i] * s.z[i];
        }
    }

    s.x[8] miss   s.y[8] miss   s.z[8] miss   s.r[8] miss
    s.x[9] hit    s.y[9] hit    s.z[9] hit    s.r[9] hit
    ...

For the arrays (each 1000 cells, 8 cells per 32-byte block, 125 blocks per array):

    Misses   = 4 * 1 * 125
    Hits     = 4 * 7 * 125
    Hit rate = 87.5%
23 Locality Example (3)

Describe the locality exhibited by this algorithm:

    // array of structs
    compute_r(struct aos *s) {
        for (int i = 0; i < 1000; i++) {
            s[i].r = s[i].x * s[i].x
                   + s[i].y * s[i].y
                   + s[i].z * s[i].z;
        }
    }

A 32-byte block holds two 16-byte structs, so eight consecutive field accesses share each block:

    s[0].x miss   s[0].y hit   s[0].z hit   s[0].r hit
    s[1].x hit    s[1].y hit   s[1].z hit   s[1].r hit
    s[2].x miss   ...

Hit rate: 7/8, or 87.5%
24 Locality Example (4)

Describe the locality exhibited by this algorithm:

    // struct of arrays
    sum_x(struct soa s) {
        sum = 0;
        for (int i = 0; i < 1000; i++) {
            sum += s.x[i];
        }
    }
25 Locality Example (4)

Describe the locality exhibited by this algorithm:

    // array of structs
    sum_x(struct aos *s) {
        sum = 0;
        for (int i = 0; i < 1000; i++) {
            sum += s[i].x;
        }
    }
26 Locality Example (5)

QTP: How would this compare to the previous two?

    // array of pointers to structs
    struct aops {
        float x;
        float y;
        float z;
        float r;
    };

    struct aops *apos[1000];

    for (i = 0; i < 1000; i++)
        apos[i] = malloc(sizeof(struct aops));
27 Writing Cache Friendly Code

Make the common case go fast:
- Focus on the inner loops of the core functions.
- Minimize the misses in the inner loops.
- Repeated references to variables are good (temporal locality).
- Stride-1 reference patterns are good (spatial locality).

Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.
28 Miss Rate Analysis for Matrix Multiply

Assume:
- Line size = 32B (big enough for four 64-bit words)
- Matrix dimension (N) is very large: approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

Analysis method: look at the access pattern of the inner loop.

(Figure: C[i][j] is produced from row i of A and column j of B, with k indexing along both.)
29 Matrix Multiplication Example

Description:
- Multiply N x N matrices
- O(N^3) total operations
- N reads per source element
- N values summed per destination
- Variable sum held in register

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
30 Matrix Multiplication (ijk)

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed.

Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
31 Matrix Multiplication (kij)

    /* kij */
    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise.

Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25
32 Matrix Multiplication (jki)

    /* jki */
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise.

Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
33 Summary of Matrix Multiplication

    /* ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25 */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

    /* kij (& ikj): 2 loads, 1 store; misses/iter = 0.5 */
    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

    /* jki (& kji): 2 loads, 1 store; misses/iter = 2.0 */
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }
34 Core i7 Matrix Multiply Performance

(Figure: cycles per inner loop iteration versus array size n. The jki/kji versions are slowest, ijk/jik are in the middle, and kij/ikj are fastest, matching the misses-per-iteration analysis.)
35 Concluding Observations

The programmer can optimize for cache performance:
- how data structures are organized
- how data are accessed (nested loop structure)
- blocking is a general technique

All systems favor cache-friendly code:
- Getting absolute optimum performance is very platform-specific (cache sizes, line sizes, associativities, etc.).
- You can get most of the advantage with generic code: keep the working set reasonably small (temporal locality) and use small strides (spatial locality).
More informationComplete Solution to Potential and E-Field of a sphere of radius R and a charge density ρ[r] = CC r 2 and r n
Complete Solution to Potential and E-Field of a sphee of adius R and a chage density ρ[] = CC 2 and n Deive the electic field and electic potential both inside and outside of a sphee of adius R with a
More informationQuery Language #1/3: Relational Algebra Pure, Procedural, and Set-oriented
Quey Language #1/3: Relational Algeba Pue, Pocedual, and Set-oiented To expess a quey, we use a set of opeations. Each opeation takes one o moe elations as input paamete (set-oiented). Since each opeation
More informationTopic 4 Root Finding
Couse Instucto D. Ramond C. Rump Oice: A 337 Phone: (915) 747 6958 E Mail: cump@utep.edu Topic 4 EE 4386/531 Computational Methods in EE Outline Intoduction Backeting Methods The Bisection Method False
More informationThe Memory Hierarchy. Computer Organization 2/12/2015. CSC252 - Spring Memory. Conventional DRAM Organization. Reading DRAM Supercell (2,1)
Computer Organization 115 The Hierarch Kai Shen Random access memor (RM) RM is traditionall packaged as a chip. Basic storage unit is normall a cell (one bit per cell). Multiple RM chips form a memor.
More informationLecture overview. Visualisatie BMT. Visualization pipeline. Data representation. Discrete data. Sampling. Data Datasets Interpolation
Viualiatie BMT Lectue oveview Data Dataet Intepolation Data epeentation Ajan Kok a.j.f.kok@tue.nl 2 Viualization pipeline Raw Data Data Enichment/Enhancement Deived Data Data epeentation Viualization data
More informationDYNAMIC STORAGE ALLOCATION. Hanan Samet
ds0 DYNAMIC STORAGE ALLOCATION Hanan Samet Compute Science Depatment and Cente fo Automation Reseach and Institute fo Advanced Compute Studies Univesity of Mayland College Pak, Mayland 074 e-mail: hjs@umiacs.umd.edu
More informationLecture 17: Memory Hierarchy and Cache Coherence Concurrent and Mul7core Programming
Lecture 17: Memory Hierarchy and Cache Coherence Concurrent and Mul7core Programming Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu www.secs.oakland.edu/~yan 1 Parallelism
More informationTwo-Dimensional Coding for Advanced Recording
Two-Dimensional Coding fo Advanced Recoding N. Singla, J. A. O Sullivan, Y. Wu, and R. S. Indec Washington Univesity Saint Louis, Missoui s Motivation: Aeal Density Pefomance: match medium, senso, pocessing
More informationA Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality
A Crash Course in Compilers for Parallel Computing Mary Hall Fall, 2008 1 Overview of Crash Course L1: Data Dependence Analysis and Parallelization (Oct. 30) L2 & L3: Loop Reordering Transformations, Reuse
More informationGARBAGE COLLECTION METHODS. Hanan Samet
gc0 GARBAGE COLLECTION METHODS Hanan Samet Compute Science Depatment and Cente fo Automation Reseach and Institute fo Advanced Compute Studies Univesity of Mayland College Pak, Mayland 07 e-mail: hjs@umiacs.umd.edu
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationECE331: Hardware Organization and Design
ECE331: Hadwae Oganization and Design Lectue 16: Pipelining Adapted fom Compute Oganization and Design, Patteson & Hennessy, UCB Last time: single cycle data path op System clock affects pimaily the Pogam
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationMAQAO hands-on exercises
MAQAO hands-on exercises Perf: generic profiler Perf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Recompile NPB-MZ with dynamic if using cray compiler #---------------------------------------------------------------------------
More informationMonitors. Lecture 6. A Typical Monitor State. wait(c) Signal and Continue. Signal and What Happens Next?
Monitos Lectue 6 Monitos Summay: Last time A combination of data abstaction and mutual exclusion Automatic mutex Pogammed conditional synchonisation Widely used in concuent pogamming languages and libaies
More informationXFVHDL: A Tool for the Synthesis of Fuzzy Logic Controllers
XFVHDL: A Tool fo the Synthesis of Fuzzy Logic Contolles E. Lago, C. J. Jiménez, D. R. López, S. Sánchez-Solano and A. Baiga Instituto de Micoelectónica de Sevilla. Cento Nacional de Micoelectónica, Edificio
More informationParallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop
Parallelizing The Matrix Multiplication 6/10/2013 LONI Parallel Programming Workshop 2013 1 Serial version 6/10/2013 LONI Parallel Programming Workshop 2013 2 X = A md x B dn = C mn d c i,j = a i,k b k,j
More informationCache Performance II 1
Cache Performance II 1 cache operation (associative) 111001 index offset valid tag valid tag data data 1 10 1 00 00 11 AA BB tag 1 11 1 01 B4 B5 33 44 = data (B5) AND = AND OR is hit? (1) 2 cache operation
More information# $!$ %&&' Thanks and enjoy! JFK/KWR. All material copyright J.F Kurose and K.W. Ross, All Rights Reserved
A note on the use of these ppt slides: We e making these slides feely available to all (faculty, students, eades). They e in PowePoint fom so you can add, modify, and delete slides (including this one)
More informationAn Adaptive Multiphase Approach for Large Unconditional and Conditional p-median Problems
An Adaptive Multiphase Appoach fo Lage Unconditional and Conditional p-median oblems Chanda Ade Iawan Said Salhi Maia aola Scapaa Cente fo Logistics & Heuistic Optimization (CLHO), Kent Buess School, Univesit
More informationMultidimensional Testing
Multidimensional Testing QA appoach fo Stoage netwoking Yohay Lasi Visuality Systems 1 Intoduction Who I am Yohay Lasi, QA Manage at Visuality Systems Visuality Systems the leading commecial povide of
More informationLecture Topics ECE 341. Lecture # 12. Control Signals. Control Signals for Datapath. Basic Processing Unit. Pipelining
EE 341 Lectue # 12 Instucto: Zeshan hishti zeshan@ece.pdx.edu Novembe 10, 2014 Potland State Univesity asic Pocessing Unit ontol Signals Hadwied ontol Datapath contol signals Dealing with memoy delay Pipelining
More informationHow many times is the loop executed? middle = (left+right)/2; if (value == arr[middle]) return true;
This lectue Complexity o binay seach Answes to inomal execise Abstact data types Stacks ueues ADTs, Stacks, ueues 1 binayseach(int[] a, int value) { while (ight >= let) { { i (value < a[middle]) ight =
More informationA New Finite Word-length Optimization Method Design for LDPC Decoder
A New Finite Wod-length Optimization Method Design fo LDPC Decode Jinlei Chen, Yan Zhang and Xu Wang Key Laboatoy of Netwok Oiented Intelligent Computation Shenzhen Gaduate School, Habin Institute of Technology
More informationA Novel Parallel Deadlock Detection Algorithm and Architecture
A Novel Paallel Deadlock Detection Aloithm and Achitectue Pun H. Shiu 2, Yudon Tan 2, Vincent J. Mooney III {ship, ydtan, mooney}@ece.atech.ed }@ece.atech.edu http://codesin codesin.ece.atech.eduedu,2
More informationIntroduction To Pipelining. Chapter Pipelining1 1
Intoduction To Pipelining Chapte 6.1 - Pipelining1 1 Mooe s Law Mooe s Law says that the numbe of pocessos on a chip doubles about evey 18 months. Given the data on the following two slides, is this tue?
More informationMatrices. Jordi Cortadella Department of Computer Science
Matrices Jordi Cortadella Department of Computer Science Matrices A matrix can be considered a two-dimensional vector, i.e. a vector of vectors. my_matrix: 3 8 1 0 5 0 6 3 7 2 9 4 // Declaration of a matrix
More informationGCC-AVR Inline Assembler Cookbook Version 1.2
GCC-AVR Inline Assemble Cookbook Vesion 1.2 About this Document The GNU C compile fo Atmel AVR isk pocessos offes, to embed assembly language code into C pogams. This cool featue may be used fo manually
More information