Cache Memory and Performance
- Winfred Higgins
- 5 years ago
1 Cache Memory and Performance (Code and Caches)

Many of the following slides are taken with permission from the complete PowerPoint lecture notes for Computer Systems: A Programmer's Perspective (CS:APP), Randal E. Bryant and David R. O'Hallaron. The book is used explicitly in CS 2505 and CS 3214, and as a reference in this course, Computer Organization II.
2 Locality Example (1)

Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.

Question: Which of these functions has good locality?

    int sumarrayrows(int a[M][N]) {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int sumarraycols(int a[M][N]) {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }
3 Layout of C Arrays in Memory

C arrays are allocated in contiguous memory locations, with addresses ascending with the array index:

    int32_t A[10] = {0, 1, 2, 3, 4, ..., 8, 9};

(Figure: the ten 4-byte cells at consecutive ascending addresses near the top of the address space, 0x7FFF....)
4 Two-dimensional Arrays in C

In C, a two-dimensional array is an array of arrays:

    int32_t A[3][5] = { { 0,  1,  2,  3,  4},
                        {10, 11, 12, 13, 14},
                        {20, 21, 22, 23, 24} };

In fact, if we print the values as pointers, we see something like this:

    A:    0x7fff22e41d30
    A[0]: 0x7fff22e41d30
    A[1]: 0x7fff22e41d44
    A[2]: 0x7fff22e41d58

Each row occupies 5 * 4 = 20 = 0x14 bytes, so consecutive row pointers differ by 0x14.
5 Layout of C Arrays in Memory

Two-dimensional C arrays are allocated in row-major order: each row occupies contiguous memory locations.

    int32_t A[3][5] = { { 0,  1,  2,  3,  4},
                        {10, 11, 12, 13, 14},
                        {20, 21, 22, 23, 24} };

    address            value
    0x7FFF22E41D30      0
    0x7FFF22E41D34      1
    0x7FFF22E41D38      2
    0x7FFF22E41D3C      3
    0x7FFF22E41D40      4
    0x7FFF22E41D44     10
    0x7FFF22E41D48     11
    0x7FFF22E41D4C     12
    0x7FFF22E41D50     13
    0x7FFF22E41D54     14
    0x7FFF22E41D58     20
    0x7FFF22E41D5C     21
    0x7FFF22E41D60     22
    0x7FFF22E41D64     23
    0x7FFF22E41D68     24
6 Layout of C Arrays in Memory

    int32_t A[3][5] = { { 0,  1,  2,  3,  4},
                        {10, 11, 12, 13, 14},
                        {20, 21, 22, 23, 24} };

Stepping through the columns in one row:

    for (i = 0; i < 3; i++)
        for (j = 0; j < 5; j++)
            sum += A[i][j];

- accesses successive elements in memory
- if the cache block size B > 4 bytes, this exploits spatial locality: compulsory miss rate = 4 bytes / B

(The traversal covers i = 0, then i = 1, then i = 2, walking straight down the memory layout shown on the previous slide.)
7 Layout of C Arrays in Memory

    int32_t A[3][5] = { { 0,  1,  2,  3,  4},
                        {10, 11, 12, 13, 14},
                        {20, 21, 22, 23, 24} };

Stepping through the rows in one column:

    for (j = 0; j < 5; j++)
        for (i = 0; i < 3; i++)
            sum += A[i][j];

- accesses distant elements: no spatial locality!
- compulsory miss rate = 1 (i.e., 100%)

(The traversal covers j = 0, then j = 1, and so on, hopping 20 bytes between consecutive accesses in the memory layout shown above.)
8 Stride and Array Accesses

(Figure: the same 3x5 array layout, annotated with a stride-1 pattern that visits consecutive 4-byte cells and a stride-4 pattern that touches only every fourth cell.)
9 Writing Cache Friendly Code

Repeated references to variables are good (temporal locality). Stride-1 reference patterns are good (spatial locality).

Assume an initially-empty cache with 16-byte cache blocks.

    int sumarrayrows(int a[M][N]) {
        int row, col, sum = 0;
        for (row = 0; row < M; row++)
            for (col = 0; col < N; col++)
                sum += a[row][col];
        return sum;
    }

A 16-byte block holds four ints, so the accesses from i = 0, j = 0 to i = 0, j = 3 fall in one block (one miss, three hits), the accesses starting at i = 0, j = 4 fall in the next, and so on.

Miss rate = 1/4 = 25%
10 Writing Cache Friendly Code

Consider the previous slide, but assume that the cache uses a block size of 64 bytes instead of 16 bytes.

    int sumarrayrows(int a[M][N]) {
        int row, col, sum = 0;
        for (row = 0; row < M; row++)
            for (col = 0; col < N; col++)
                sum += a[row][col];
        return sum;
    }

Now one block holds sixteen ints, so only one access in sixteen misses.

Miss rate = 1/16 = 6.25%
11 Writing Cache Friendly Code

"Skipping" accesses down the rows of a column do not provide good locality:

    int sumarraycols(int a[M][N]) {
        int row, col, sum = 0;
        for (col = 0; col < N; col++)
            for (row = 0; row < M; row++)
                sum += a[row][col];
        return sum;
    }

Miss rate = 100% (That's actually somewhat pessimistic... depending on cache geometry.)
12 Layout of C Arrays in Memory

It's easy to write an array traversal and see the addresses at which the array elements are stored:

    int A[5] = {0, 1, 2, 3, 4};
    for (i = 0; i < 5; i++)
        printf("%d: %p\n", i, &A[i]);

We see there that for a 1D array, the index varies in a stride-1 pattern:

    i  address
    0: 28ABE0
    1: 28ABE4
    2: 28ABE8
    3: 28ABEC
    4: 28ABF0

stride-1: addresses differ by the size of an array cell (4 bytes, here).
13 Layout of C Arrays in Memory

    int B[3][5] = { ... };
    for (i = 0; i < 3; i++)
        for (j = 0; j < 5; j++)
            printf("%d %3d: %p\n", i, j, &B[i][j]);

We see that for a 2D array, the second index varies in a stride-1 pattern. i-j order:

    i  j  address
    0  0: 28ABA4
    0  1: 28ABA8
    0  2: 28ABAC
    0  3: 28ABB0
    0  4: 28ABB4
    1  0: 28ABB8
    1  1: 28ABBC
    1  2: 28ABC0
    ...           stride-1

But the first index does not vary in a stride-1 pattern. j-i order:

    i  j  address
    0  0: 28CC9C
    1  0: 28CCB0
    2  0: 28CCC4
    0  1: 28CCA0
    1  1: 28CCB4
    2  1: 28CCC8
    0  2: 28CCA4
    1  2: 28CCB8
    ...           stride-5 (0x14 bytes / 4 bytes per cell)
14 3D Arrays in C

    int32_t A[2][3][5] = { { {  0,   1,   2,   3,   4},
                             { 10,  11,  12,  13,  14},
                             { 20,  21,  22,  23,  24} },
                           { {  0,   1,   2,   3,   4},
                             {110, 111, 112, 113, 114},
                             {220, 221, 222, 223, 224} } };
15 Locality Example (2)

Question: Can you permute the loops so that the function scans the 3D array a[][][] with a stride-1 reference pattern (and thus has good spatial locality)?

    int sumarray3d(int a[N][N][N]) {
        int row, col, page, sum = 0;
        for (row = 0; row < N; row++)
            for (col = 0; col < N; col++)
                for (page = 0; page < N; page++)
                    sum += a[page][row][col];
        return sum;
    }
16 Layout of C Arrays in Memory

    int C[2][3][5] = { ... };
    for (i = 0; i < 2; i++)
        for (j = 0; j < 3; j++)
            for (k = 0; k < 5; k++)
                printf("%3d %3d %3d: %p\n", i, j, k, &C[i][j][k]);

We see that for a 3D array, the third index varies in a stride-1 pattern: in i-j-k order, consecutive addresses differ by 4 bytes. But if we change the order of access, we no longer have a stride-1 pattern. In k-j-i order, consecutive prints differ by 0x3C (60 bytes, one whole 3x5 plane, since the first index varies fastest); in i-k-j order they differ by 0x14 (20 bytes, one row).
17 Locality Example (2)

Question: Can you permute the loops so that the function scans the 3D array a[][][] with a stride-1 reference pattern (and thus has good spatial locality)?

    int sumarray3d(int a[N][N][N]) {
        int i, j, k, sum = 0;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    sum += a[k][i][j];
        return sum;
    }

This code does not yield good locality at all. The inner loop is varying the first index: worst case!
18 Locality Example (3)

Question: Which of these two exhibits better spatial locality?

    // struct of arrays
    struct soa {
        float *x;
        float *y;
        float *z;
        float *r;
    };

    compute_r(struct soa s) {
        for (i = 0; ...) {
            s.r[i] = s.x[i] * s.x[i]
                   + s.y[i] * s.y[i]
                   + s.z[i] * s.z[i];
        }
    }

    // array of structs
    struct aos {
        float x;
        float y;
        float z;
        float r;
    };

    compute_r(struct aos *s) {
        for (i = 0; ...) {
            s[i].r = s[i].x * s[i].x
                   + s[i].y * s[i].y
                   + s[i].z * s[i].z;
        }
    }

For the following discussion, assume a cache block size of 32 bytes, and that the cache is not capable of holding all the blocks of the relevant structure at once.
19 Locality Example (3)

    // struct of arrays
    struct soa {
        float *x;
        float *y;
        float *z;
        float *r;
    };

    struct soa s;
    s.x = malloc(1000 * sizeof(float));

Each of the four arrays occupies 4000 bytes: 4 bytes per cell, 1000 cells per array.
20 Locality Example (3)

    // array of structs
    struct aos {
        float x;
        float y;
        float z;
        float r;
    };

    struct aos s[1000];

16 bytes per cell, 1000 cells.
21 Locality Example (3)

Describe the locality exhibited by this algorithm:

    // struct of arrays
    compute_r(struct soa s) {
        for (int i = 0; i < 1000; i++) {
            s.r[i] = s.x[i] * s.x[i]
                   + s.y[i] * s.y[i]
                   + s.z[i] * s.z[i];
        }
    }

A 32-byte block holds 8 cells of one array (4 bytes per cell, 1000 cells per array):

    s.x[0] miss   s.y[0] miss   s.z[0] miss   s.r[0] miss
    s.x[1] hit    s.y[1] hit    s.z[1] hit    s.r[1] hit
    ...
    s.x[7] hit    s.y[7] hit    s.z[7] hit    s.r[7] hit
    s.x[8] miss   s.y[8] miss   s.z[8] miss   s.r[8] miss
22 Locality Example (3)

Describe the locality exhibited by this algorithm:

    // struct of arrays
    compute_r(struct soa s) {
        for (int i = 0; i < 1000; i++) {
            s.r[i] = s.x[i] * s.x[i]
                   + s.y[i] * s.y[i]
                   + s.z[i] * s.z[i];
        }
    }

    s.x[8] miss   s.y[8] miss   s.z[8] miss   s.r[8] miss
    s.x[9] hit    s.y[9] hit    s.z[9] hit    s.r[9] hit
    ...

For the arrays (each 1000 cells, 8 cells per 32-byte block, 125 blocks per array):

    Misses   = 4 * 1 * 125
    Hits     = 4 * 7 * 125
    Hit rate = 87.5%
23 Locality Example (3)

Describe the locality exhibited by this algorithm:

    // array of structs
    compute_r(struct aos *s) {
        for (int i = 0; i < 1000; i++) {
            s[i].r = s[i].x * s[i].x
                   + s[i].y * s[i].y
                   + s[i].z * s[i].z;
        }
    }

A 32-byte block holds two 16-byte structs, so eight consecutive field accesses share each block:

    s[0].x miss   s[0].y hit   s[0].z hit   s[0].r hit
    s[1].x hit    s[1].y hit   s[1].z hit   s[1].r hit
    s[2].x miss   ...

Hit rate: 7/8, or 87.5%
24 Locality Example (4)

Describe the locality exhibited by this algorithm:

    // struct of arrays
    sum_x(struct soa s) {
        sum = 0;
        for (int i = 0; i < 1000; i++) {
            sum += s.x[i];
        }
    }
25 Locality Example (4)

Describe the locality exhibited by this algorithm:

    // array of structs
    sum_x(struct aos *s) {
        sum = 0;
        for (int i = 0; i < 1000; i++) {
            sum += s[i].x;
        }
    }
26 Locality Example (5)

QTP: How would this compare to the previous two?

    // array of pointers to structs
    struct aops {
        float x;
        float y;
        float z;
        float r;
    };

    struct aops *apos[1000];

    for (i = 0; i < 1000; i++)
        apos[i] = malloc(sizeof(struct aops));
27 Writing Cache Friendly Code

Make the common case go fast:
- Focus on the inner loops of the core functions.
- Minimize the misses in the inner loops.
- Repeated references to variables are good (temporal locality).
- Stride-1 reference patterns are good (spatial locality).

Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.
28 Miss Rate Analysis for Matrix Multiply

Assume:
- Line size = 32B (big enough for four 64-bit words)
- Matrix dimension (N) is very large: approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

Analysis method: look at the access pattern of the inner loop.

(Figure: C[i][j] is produced from row i of A and column j of B, with k indexing along both.)
29 Matrix Multiplication Example

Description:
- Multiply N x N matrices
- O(N^3) total operations
- N reads per source element
- N values summed per destination
- Variable sum held in register

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
30 Matrix Multiplication (ijk)

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed.

Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
31 Matrix Multiplication (kij)

    /* kij */
    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise.

Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25
32 Matrix Multiplication (jki)

    /* jki */
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise.

Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
33 Summary of Matrix Multiplication

    /* ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25 */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

    /* kij (& ikj): 2 loads, 1 store; misses/iter = 0.5 */
    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

    /* jki (& kji): 2 loads, 1 store; misses/iter = 2.0 */
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }
34 Core i7 Matrix Multiply Performance

(Figure: cycles per inner loop iteration versus array size n. The jki/kji versions are slowest, ijk/jik are in the middle, and kij/ikj are fastest, matching the misses-per-iteration analysis.)
35 Concluding Observations

The programmer can optimize for cache performance:
- how data structures are organized
- how data are accessed (nested loop structure)
- blocking is a general technique

All systems favor cache-friendly code:
- Getting absolute optimum performance is very platform-specific (cache sizes, line sizes, associativities, etc.).
- You can get most of the advantage with generic code: keep the working set reasonably small (temporal locality) and use small strides (spatial locality).
More informationComplete Solution to Potential and E-Field of a sphere of radius R and a charge density ρ[r] = CC r 2 and r n
Complete Solution to Potential and E-Field of a sphee of adius R and a chage density ρ[] = CC 2 and n Deive the electic field and electic potential both inside and outside of a sphee of adius R with a
More informationQuery Language #1/3: Relational Algebra Pure, Procedural, and Set-oriented
Quey Language #1/3: Relational Algeba Pue, Pocedual, and Set-oiented To expess a quey, we use a set of opeations. Each opeation takes one o moe elations as input paamete (set-oiented). Since each opeation
More informationTopic 4 Root Finding
Couse Instucto D. Ramond C. Rump Oice: A 337 Phone: (915) 747 6958 E Mail: cump@utep.edu Topic 4 EE 4386/531 Computational Methods in EE Outline Intoduction Backeting Methods The Bisection Method False
More informationThe Memory Hierarchy. Computer Organization 2/12/2015. CSC252 - Spring Memory. Conventional DRAM Organization. Reading DRAM Supercell (2,1)
Computer Organization 115 The Hierarch Kai Shen Random access memor (RM) RM is traditionall packaged as a chip. Basic storage unit is normall a cell (one bit per cell). Multiple RM chips form a memor.
More informationLecture overview. Visualisatie BMT. Visualization pipeline. Data representation. Discrete data. Sampling. Data Datasets Interpolation
Viualiatie BMT Lectue oveview Data Dataet Intepolation Data epeentation Ajan Kok a.j.f.kok@tue.nl 2 Viualization pipeline Raw Data Data Enichment/Enhancement Deived Data Data epeentation Viualization data
More informationDYNAMIC STORAGE ALLOCATION. Hanan Samet
ds0 DYNAMIC STORAGE ALLOCATION Hanan Samet Compute Science Depatment and Cente fo Automation Reseach and Institute fo Advanced Compute Studies Univesity of Mayland College Pak, Mayland 074 e-mail: hjs@umiacs.umd.edu
More informationLecture 17: Memory Hierarchy and Cache Coherence Concurrent and Mul7core Programming
Lecture 17: Memory Hierarchy and Cache Coherence Concurrent and Mul7core Programming Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu www.secs.oakland.edu/~yan 1 Parallelism
More informationTwo-Dimensional Coding for Advanced Recording
Two-Dimensional Coding fo Advanced Recoding N. Singla, J. A. O Sullivan, Y. Wu, and R. S. Indec Washington Univesity Saint Louis, Missoui s Motivation: Aeal Density Pefomance: match medium, senso, pocessing
More informationA Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality
A Crash Course in Compilers for Parallel Computing Mary Hall Fall, 2008 1 Overview of Crash Course L1: Data Dependence Analysis and Parallelization (Oct. 30) L2 & L3: Loop Reordering Transformations, Reuse
More informationGARBAGE COLLECTION METHODS. Hanan Samet
gc0 GARBAGE COLLECTION METHODS Hanan Samet Compute Science Depatment and Cente fo Automation Reseach and Institute fo Advanced Compute Studies Univesity of Mayland College Pak, Mayland 07 e-mail: hjs@umiacs.umd.edu
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationECE331: Hardware Organization and Design
ECE331: Hadwae Oganization and Design Lectue 16: Pipelining Adapted fom Compute Oganization and Design, Patteson & Hennessy, UCB Last time: single cycle data path op System clock affects pimaily the Pogam
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationMAQAO hands-on exercises
MAQAO hands-on exercises Perf: generic profiler Perf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Recompile NPB-MZ with dynamic if using cray compiler #---------------------------------------------------------------------------
More informationMonitors. Lecture 6. A Typical Monitor State. wait(c) Signal and Continue. Signal and What Happens Next?
Monitos Lectue 6 Monitos Summay: Last time A combination of data abstaction and mutual exclusion Automatic mutex Pogammed conditional synchonisation Widely used in concuent pogamming languages and libaies
More informationXFVHDL: A Tool for the Synthesis of Fuzzy Logic Controllers
XFVHDL: A Tool fo the Synthesis of Fuzzy Logic Contolles E. Lago, C. J. Jiménez, D. R. López, S. Sánchez-Solano and A. Baiga Instituto de Micoelectónica de Sevilla. Cento Nacional de Micoelectónica, Edificio
More informationParallelizing The Matrix Multiplication. 6/10/2013 LONI Parallel Programming Workshop
Parallelizing The Matrix Multiplication 6/10/2013 LONI Parallel Programming Workshop 2013 1 Serial version 6/10/2013 LONI Parallel Programming Workshop 2013 2 X = A md x B dn = C mn d c i,j = a i,k b k,j
More informationCache Performance II 1
Cache Performance II 1 cache operation (associative) 111001 index offset valid tag valid tag data data 1 10 1 00 00 11 AA BB tag 1 11 1 01 B4 B5 33 44 = data (B5) AND = AND OR is hit? (1) 2 cache operation
More information# $!$ %&&' Thanks and enjoy! JFK/KWR. All material copyright J.F Kurose and K.W. Ross, All Rights Reserved
A note on the use of these ppt slides: We e making these slides feely available to all (faculty, students, eades). They e in PowePoint fom so you can add, modify, and delete slides (including this one)
More informationAn Adaptive Multiphase Approach for Large Unconditional and Conditional p-median Problems
An Adaptive Multiphase Appoach fo Lage Unconditional and Conditional p-median oblems Chanda Ade Iawan Said Salhi Maia aola Scapaa Cente fo Logistics & Heuistic Optimization (CLHO), Kent Buess School, Univesit
More informationMultidimensional Testing
Multidimensional Testing QA appoach fo Stoage netwoking Yohay Lasi Visuality Systems 1 Intoduction Who I am Yohay Lasi, QA Manage at Visuality Systems Visuality Systems the leading commecial povide of
More informationLecture Topics ECE 341. Lecture # 12. Control Signals. Control Signals for Datapath. Basic Processing Unit. Pipelining
EE 341 Lectue # 12 Instucto: Zeshan hishti zeshan@ece.pdx.edu Novembe 10, 2014 Potland State Univesity asic Pocessing Unit ontol Signals Hadwied ontol Datapath contol signals Dealing with memoy delay Pipelining
More informationHow many times is the loop executed? middle = (left+right)/2; if (value == arr[middle]) return true;
This lectue Complexity o binay seach Answes to inomal execise Abstact data types Stacks ueues ADTs, Stacks, ueues 1 binayseach(int[] a, int value) { while (ight >= let) { { i (value < a[middle]) ight =
More informationA New Finite Word-length Optimization Method Design for LDPC Decoder
A New Finite Wod-length Optimization Method Design fo LDPC Decode Jinlei Chen, Yan Zhang and Xu Wang Key Laboatoy of Netwok Oiented Intelligent Computation Shenzhen Gaduate School, Habin Institute of Technology
More informationA Novel Parallel Deadlock Detection Algorithm and Architecture
A Novel Paallel Deadlock Detection Aloithm and Achitectue Pun H. Shiu 2, Yudon Tan 2, Vincent J. Mooney III {ship, ydtan, mooney}@ece.atech.ed }@ece.atech.edu http://codesin codesin.ece.atech.eduedu,2
More informationIntroduction To Pipelining. Chapter Pipelining1 1
Intoduction To Pipelining Chapte 6.1 - Pipelining1 1 Mooe s Law Mooe s Law says that the numbe of pocessos on a chip doubles about evey 18 months. Given the data on the following two slides, is this tue?
More informationMatrices. Jordi Cortadella Department of Computer Science
Matrices Jordi Cortadella Department of Computer Science Matrices A matrix can be considered a two-dimensional vector, i.e. a vector of vectors. my_matrix: 3 8 1 0 5 0 6 3 7 2 9 4 // Declaration of a matrix
More informationGCC-AVR Inline Assembler Cookbook Version 1.2
GCC-AVR Inline Assemble Cookbook Vesion 1.2 About this Document The GNU C compile fo Atmel AVR isk pocessos offes, to embed assembly language code into C pogams. This cool featue may be used fo manually
More information