Impact of cache interferences on usual numerical dense loop nests. O. Temam, C. Fricker, W. Jalby. University of Leiden, INRIA, University of Versailles.

O. Temam, University of Leiden, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
C. Fricker, INRIA, Domaine de Voluceau, 78153 Le Chesnay Cedex, France
W. Jalby, University of Versailles, MASI, 78000 Versailles, France

Abstract

In numerical codes, the regular interleaved accesses that occur within do-loop nests induce cache interference phenomena that can severely degrade program performance. Cache interferences can significantly increase the volume of memory traffic and the amount of communication in uniprocessors and multiprocessors. In this paper, we identify cache interference phenomena and determine their causes and the conditions under which they occur. Based on these results, we derive a methodology for computing an analytical expression of cache misses for most classic loop nests, which can be used for precise performance analysis and prediction. We show that cache performance is unstable, because some unexpected parameters such as array base addresses can play a significant role in interference phenomena. We also show that the impact of cache interferences can be so high that the benefits of current data locality optimization techniques can be partially, if not totally, eradicated.

Keywords: memory reference patterns, software optimization, data locality, numerical codes, modeling cache interferences, performance analysis, performance prediction.

1 Introduction

As CPU cycle time decreases, main memory and network latencies rapidly increase and cache misses become very costly. Furthermore, the increasing issue rate of processors worsens the burden on caches. Moreover, most CPU chips are now being designed for integration into a massively parallel supercomputer or a parallel workstation, and therefore minimizing memory traffic, i.e. optimizing memory hierarchy utilization, is becoming critical. For all these reasons, optimizing cache behavior has become a major issue.
To achieve such optimizations, many studies have been performed to understand the workings of cache memories and derive proper optimizations. The first category of studies [?] relies on numerous simulations of representative codes, i.e. a collection of codes which corresponds to the average workload of a computer. Such simulations provide a good summary of average cache performance and some hints at the relationships between the different cache parameters (cache size, line size, set-associativity). Furthermore, they provide nearly exact indications on the behavior of cache memories for specific examples. A major problem inherent to such techniques is to find codes which are truly representative in terms of memory referencing. A second category of studies [?] aims at building analytical models for synthesizing the behavior of caches under most circumstances. Such models provide better insight into the relationship between the different parameters. (This work was funded by the BRA Esprit III European Project APPARC, European Agency DGXIII.) These models can also be used for performance prediction,

avoiding numerous simulations. However, while such analytical models are more representative than simulation-based studies, they are generally less accurate. They are also intrinsically limited because they cannot capture, and are not designed for understanding, specific phenomena which occur within a cache. Modeling the global behavior of caches may be sufficient as far as trends are needed. However, for understanding the weaknesses of caches and deriving either software or hardware optimizations, more precise modeling is necessary. Therefore, a good understanding of a program reference pattern needs to be extracted. Some models have addressed this problem [?]. Although they are valuable tools, they still lack accuracy because they aim at characterizing the behavior of most programs. Therefore, they again sacrifice accuracy for generality and representativity.

However, there is a category of codes, numerical codes, that are emerging as some of the most demanding programs in terms of execution time and memory usage. Many architectures are targeted or at least tuned for such programs. The widespread use of numerical codes as benchmarks is a clear sign of their growing influence. Therefore, studying cache behavior under numerical workloads is critical. However, numerical codes have specific properties in terms of memory addressing (spatial and temporal locality) which hardly allow them to fit in the classic framework of general models. Fricker et al. [?] developed a model for direct-mapped and set-associative caches that is dedicated to regular (and some irregular) numerical codes. This model takes into account the fact that references within numerical codes correspond to chunks of consecutive addresses which recur periodically. The main asset of the model is to show the behavior of caches under numerical workloads, and to allow dimensioning of cache parameters for such codes.
So, although this effort allows a better understanding of cache behavior under such workloads, it does not help in unveiling hot-spot and irregular phenomena which are specific to numerical codes and alter cache performance. Furthermore, numerical codes are actually made of a limited number of typical loop nests. Therefore, efforts should be concentrated on modeling and understanding the actual and most frequent types of loop nests. This problem is twofold: the first step is identifying these typical cases and consequently restricting problem hypotheses. The second and main step is deriving a model which encompasses the majority of such cases.

Porterfield [?] developed a model dedicated to numerical codes, which does examine specific pieces of code and determine their behavior on cache. However, mostly fully-associative caches have been considered. Because such caches are unlikely to experience interference phenomena, conclusions can hardly be derived for interferences in real caches. An important step towards accurate evaluation of cache interferences has been made in [?], where blocked Matrix-Matrix multiply is carefully studied. Cross and self-interference misses are evaluated, and a model for this algorithm is provided. This paper clearly unveils that interferences can severely alter locality exploitation. However, the model still lacks accuracy and is not capable of catching some of the specific phenomena and performance fluctuations induced by the mapping of direct-mapped caches. Furthermore, new parameters (such as array base addresses) which can play an important role in interference phenomena are not taken into account. Moreover, this model is dedicated to one particular example, while a methodology suitable to many numerical algorithms would be very useful.

Indeed, many powerful software optimization techniques for exploiting the locality of numerical codes have now been designed [?,?,?]. Although they stress the possible impact of cache interferences, no real evaluation of these phenomena nor a study of their frequency of occurrence have been

performed yet. In the next sections, we will show that cache interferences can have a strong impact on performance and occur frequently. Ferrante et al. [?] propose a realistic approach to locality optimization by evaluating the number of cache lines used (as opposed to the number of elements as in most other methods). It is mentioned that evaluating cache conflicts would help refine such realistic locality optimization techniques even more. In one example, it is briefly shown how to detect self-interferences.

To address the issues related to cache interferences, a model named NUMODE (NUmerical MODEl) [?] is being developed within the APPARC project (see footnote 1). The goal of NUMODE is to provide a framework for modeling cache interferences of a given loop nest, and then deriving an analytical expression for the number of cache misses. Therefore, NUMODE is both a model and a methodology. There is no doubt that such a model cannot be as flexible as others for global performance predictions, or for extracting global trends on cache behavior. However, it is possible to precisely quantify the interference phenomena that occur in real caches, and to determine their causes and conditions of occurrence. Many such phenomena can be identified. Moreover, cache performance is shown to be actually unstable in many situations. In particular, it appears that software optimization techniques lack accuracy and can consequently lose their efficiency. It is also shown that the number of additional memory references due to cache interferences can be obtained as an analytical function of problem parameters. This result allows a precise analysis of the behavior of most classic algorithms on caches, and can then be used to design new optimization techniques or tune existing ones. Making such a function available to compilers can be profitable to code restructuring techniques. NUMODE is currently under development and will be implemented to perform extensive testing of its scope and accuracy.

In section ?? the problem is defined and hypotheses are given.
In section ??, the general method for computing the number of misses is indicated. In section ??, conclusions are drawn and further work is discussed.

Footnote 1: APPARC is a BRA Esprit III European project.

2 Problem statement

The purpose of the paper is twofold. The first goal is to show that cache interferences are not infrequent, that they can have a significant impact on the performance of numerical loop nests, and that their conditions of occurrence can be determined even though cache interferences are highly irregular. Understanding and then eliminating such interferences would allow stable cache performance. Furthermore, cache line size is kept small because large line sizes are generally considered to bring more interferences. This notion is not completely true: when line size is large, compulsory misses are much less important, so that interference misses correspond to a larger portion of total misses. Therefore, numerical loop nests are more sensitive to cache interferences when line size is large, but such interferences are not necessarily more important. It follows that exploiting a larger line size requires a good understanding of cache interferences.

The second and main goal of this paper is to introduce a method for estimating these cache interferences. General principles and main steps of the technique are indicated and illustrated with examples.

Cache architecture

Direct-mapped caches have been chosen for several reasons:

- Direct-mapped caches are more sensitive to interferences than w-way associative caches. Therefore, they are more likely to benefit from studies and optimizations on that matter.

- Since the replacement policy of direct-mapped caches is straightforward, computing interferences is easier in direct-mapped caches. However, we strongly believe the technique can be extended to w-way associative caches with moderate modifications.
- Among the three newest processor chips (DEC Alpha, MIPS R4000, SuperSPARC), two chips (DEC Alpha, MIPS R4000) include a small (8 kbytes) direct-mapped on-chip data cache. Since the frequency of such processors is very high, the cost of a cache miss is huge, making it critical to reduce the amount of interferences.
- The placement policy in the DEC Alpha, for example, is such that a data cache location can generally be determined from the data virtual address. Therefore, a study based on virtual addresses would accurately describe real cache behavior.

In the remainder of the paper, the cache size is indicated by $C_S$ and the line size by $L_S$. The unit size is 8 bytes, i.e. the size of a double-precision floating point data item. In all experiments, cache size is equal to 8 kbytes and line size is equal to 32 bytes (the characteristics of the DEC Alpha data cache), so $C_S = 1024$ and $L_S = 4$.

Codes

Let us now discuss which types of codes are considered. In numerical codes most data traffic occurs in do-loops, so only these code constructs are examined. Only array references are considered because it is probable that other variables would be stored in registers if they are frequently used, and otherwise they would induce minimal perturbations of cache behavior.

Loop Nests

A loop nest is composed of $n$ distinct loops, $j_i$ being the loop index of the $i$-th loop, and $j_n$ being the loop index of the innermost loop. Column-major storage is assumed, so, for example, the virtual address of array reference $A(j_1, j_2)$ is $a_0 + N j_1 + j_2$ where $N$ is the leading dimension of array A and $a_0$ the starting address of array A (cf figure ??).

3 Modeling numerical codes behavior

3.1 Restrictive hypotheses

    DO j_1 = 0, N_1 - 1
      DO j_2 = 0, N_2 - 1
        ...
          DO j_n = 0, N_n - 1
            ...
            $A(\alpha^A_1 j_{k_1} + \beta^A_1, \ldots, \alpha^A_p j_{k_p} + \beta^A_p)$
            $B(\alpha^B_1 j_{k_1} + \beta^B_1, \ldots, \alpha^B_p j_{k_p} + \beta^B_p)$
            ...

Figure 1: An example of loop nest considered.

A number of restrictions are imposed on the loop nests considered. First, the array subscripts must all be of the form $A(\alpha^A_1 j_{k_1} + \beta^A_1, \ldots, \alpha^A_p j_{k_p} + \beta^A_p)$ where $(j_{k_i})_{1 \le i \le p}$ (with $p \le n$) is any subset of

loop indices, and $(\alpha^A_i)_{1 \le i \le p}$ and $(\beta^A_i)_{1 \le i \le p}$ are constants. For example, subscripts such as $A(j_1 + j_2)$ are not considered. A close analysis of benchmark suites such as the Perfect Club [?] or NAS [?] and studies such as the one done by Yew et al. [?] show that such hypotheses encompass the subscripts found within most numerical loop nests. Furthermore, in dusty-deck codes, "irregular" subscripts often correspond to linearization, and therefore, in terms of memory reference patterns, are equivalent to those considered within the scope of the model. For all these reasons, these restrictions on array subscripts are considered to be reasonable constraints. Moreover, further developments of the present model may include more complex subscripts.

Another model hypothesis is that the boundaries of each loop index must be constant (after normalization, $0 \le j_i < N_i$) and the stride of all indices equal to 1. For any rectangular loop nest, the loop indices and array subscripts can be changed so as to satisfy these hypotheses. However, non-rectangular loops do not satisfy these hypotheses. Since this point is relatively restrictive, further developments of the model will mainly focus on including this kind of loop nest.

3.2 General principles

If the cache were fully-associative and replacement were optimal, cache misses would occur in two cases only. First, when data are loaded into cache for the first time; such misses are called compulsory misses. Second, when cache space is too small to store all loop nest data. Then, an element is flushed from cache each time a new element needs to be loaded; such misses are called capacity misses. However, in direct-mapped caches (and in set-associative caches as well), cache misses can occur even though cache space is sufficient, because a data element can only be mapped into one specific cache location (or w locations in w-way associative caches). Therefore, such cache misses do not occur because of capacity conflicts, but because of mapping conflicts; they are called mapping misses [?].
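The difference between capacity and mapping misses can be made concrete with a minimal sketch of a direct-mapped cache. The function name `count_misses` and the two toy access streams are illustrative assumptions, not part of the model; sizes are expressed in 8-byte elements, with the default values $C_S = 1024$ and $L_S = 4$ used in the paper.

```python
def count_misses(addresses, cache_size=1024, line_size=4):
    """Count misses of a direct-mapped cache (sizes in array elements)."""
    num_lines = cache_size // line_size
    resident = [None] * num_lines      # memory line currently held by each cache line
    misses = 0
    for addr in addresses:
        mem_line = addr // line_size   # memory line containing the element
        slot = mem_line % num_lines    # the only cache line it may occupy
        if resident[slot] != mem_line:
            misses += 1
            resident[slot] = mem_line
    return misses


# Two elements exactly one cache size apart evict each other on every access,
# although the cache is almost empty: these are mapping misses.
conflicting = [0, 1024] * 100
# The same pattern shifted by one line uses two different cache lines,
# so only the two compulsory misses remain.
friendly = [0, 1028] * 100

print(count_misses(conflicting))  # 200
print(count_misses(friendly))     # 2
```

Only two elements are live in both streams, so a fully-associative cache would incur two compulsory misses in either case; the 198 extra misses of the first stream are purely a consequence of the mapping.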
The main effect of such unexpected misses is to degrade the spatial and temporal reuse of data. Interferences can either correspond to interferences of an array with itself (self-interferences), or with another array (cross-interferences). In the remainder of the paper, these two types of interferences are analyzed.

For self-interferences the principle is to study the mapping of the set of elements of an array to be reused, and check whether these elements overlap with themselves. If so, self-interferences occur, and estimating the degree of overlapping yields the number of additional memory accesses brought by self-interferences. In general, self-interferences mostly correspond to temporal interferences.

For cross-interferences, once the set of elements of an array to be reused is identified (taking into account self-interferences), the overlapping between these elements and elements of another array is determined. Knowing the number of times the two sets of elements overlap, and the amount of overlapping each time, is sufficient to compute the number of additional memory accesses brought by cross-interferences. Cross-interferences can correspond to either temporal or spatial interferences.

3.3 Self-interferences

Theoretical reuse set

Thanks to the subscript types considered, it is easy to identify where reuse due to self-dependences occurs. If coefficient $a_i = 0$ in the virtual address expression $a_0 + a_1 j_1 + \ldots + a_n j_n$ of a reference to array A, then loop $i$ carries reuse. Let us define $l$ as the lowest loop level where reuse occurs. On all loops $k$ with $k > l$, no reuse occurs. So, on each iteration of these loops, the array elements referenced are all distinct. This set of elements is called the reuse set. During each execution of the sub loop nest $j_n, \ldots, j_{l+1}$, all elements of the reuse set are referenced.

Definition. For array reference $a_0 + a_1 j_1 + \ldots + a_n j_n$, the reuse set can be defined if there exists a $k$ such that $a_k = 0$. The loop level of a reuse set is $l = \max\{k \mid a_k = 0\}$. The theoretical reuse set is equal to $RS(A)_l = \{a_0 + a_1 j_1 + \ldots + a_n j_n ;\ (0 \le j_i \le N_i - 1)_{i>l}\}$.

Reuse can occur on different loop levels. However, reuse that is carried on loop levels higher than $l$ is at least one order of magnitude less important than the reuse on loop $l$. For example, let us consider the following loop nest:

    DO J1 = 0, N1-1
      DO J2 = 0, N2-1
        DO J3 = 0, N3-1
          DO J4 = 0, N4-1
            ... Y(J1,J3) ...

For reference $Y(j_1, j_3)$, reuse occurs on loop levels 4 and 2; here $l = 4$. On one execution of loop $j_4$, the array element $Y(j_1, j_3)$ can be reused $N_4 - 1$ times (except for the first iteration of loop $j_4$). So, during execution of the whole loop nest, $N_1 N_2 N_3 (N_4 - 1)$ reuses of elements of Y can be achieved on loop $j_4$. On one execution of loop $j_2$, $Y(j_1, 0), \ldots, Y(j_1, N_3 - 1)$ can be reused $N_2 - 1$ times. So, during execution of the whole loop nest, $N_1 (N_2 - 1) N_3$ reuses can be achieved on $j_2$. Consequently, the potential reuse on loop $j_4$ is approximately $N_4$ times more important than the potential reuse on loop $j_2$. Furthermore, the time interval between two reuses on loop $j_4$ is minimum, i.e. it is equal to one iteration of loop $j_4$, while $N_4 N_3$ iterations of loop $j_4$ are executed between two reuses on loop $j_2$. Therefore, not only is reuse more scarce on loop $j_2$, but the probability it can be exploited is also much smaller. That is why only the first reuse level is considered in general.

However, when some coefficients consecutive to $a_l$ (i.e. $a_{l-1}, a_{l-2}, \ldots$) are also equal to 0, reuse on these loop levels is considered to still be achieved (with respect to the reuse set, it is the same as if the boundary of loop $l$ were $N_l N_{l-1} N_{l-2} \ldots$ instead of $N_l$). Note that if $a_l = a_{l-1} = a_{l-2} = \ldots = a_k = 0$, the reuse set on loop $k$ is the same as on loop $l$.
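The reuse level and the reuse counts of the $Y(j_1, j_3)$ example can be checked with a short sketch. The helper names `reuse_level` and `potential_reuses`, the loop bounds and the leading dimension 30 are assumptions chosen for illustration; the counting rule simply generalizes the two products given in the text (iterations of the enclosing loops, times $N_{level} - 1$, times the number of distinct elements referenced by the inner loops).

```python
from math import prod


def reuse_level(coeffs):
    """Return l = max{k : a_k = 0} (1-based), or None if no loop carries reuse."""
    levels = [k for k, a in enumerate(coeffs, start=1) if a == 0]
    return max(levels) if levels else None


def potential_reuses(coeffs, bounds, level):
    """Potential reuses carried by loop `level` (1-based) over the whole nest."""
    outer = prod(bounds[:level - 1])   # iterations of the enclosing loops
    # distinct elements referenced by the loops nested inside `level`
    set_size = prod(n for a, n in zip(coeffs[level:], bounds[level:]) if a != 0)
    return outer * (bounds[level - 1] - 1) * set_size


# Y(j1, j3) in a 4-deep nest: address = y0 + 30*j1 + 0*j2 + 1*j3 + 0*j4
coeffs = [30, 0, 1, 0]
bounds = [10, 20, 30, 40]              # N1, N2, N3, N4

print(reuse_level(coeffs))                 # 4
print(potential_reuses(coeffs, bounds, 4)) # N1*N2*N3*(N4-1) = 234000
print(potential_reuses(coeffs, bounds, 2)) # N1*(N2-1)*N3    = 5700
```

With these bounds the ratio of the two counts is about 41, close to $N_4 = 40$, which matches the order-of-magnitude argument above.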
Size of the theoretical reuse set

Let us compute the number of cache lines corresponding to the theoretical reuse set assuming no self-interferences. First, let us determine the cache line stride of the reuse set of a reference A (where reuse occurs on loop $l$). It is equal to $\min(1, a_{\min}/L_S)$, where $a_{\min} = \min_{l+1 \le k \le n}(a_k)$. The term $a_{\min}/L_S$ is the ratio of the smallest coefficient (only considering loop levels defining the reuse set) to the line size. If this ratio is greater than 1, the stride is taken equal to 1, since at most one new cache line is referenced on each iteration. For example, the access

stride of reference $A(3 j_3)$ is $\min(1, 3/L_S)$. Let us consider another example. The virtual address of reference $A(j_3, B j_2)$ is $a_0 + N j_3 + B j_2$ where $N$ is the leading dimension of A (here reuse occurs on loop level 1). Then the access stride of the reuse set is $\min(1, \min(N, B)/L_S)$. Then, the number of cache lines in the theoretical reuse set is equal to $N_n \times \ldots \times N_{l+1} \times$ access stride.

Actual reuse set

Because self-interferences occur, not all elements of the reuse set can actually be reused. In direct-mapped caches, as soon as two elements of the reuse set compete for the same cache line, neither of the two elements can be reused: they are victims of self-interferences. The elements of the reuse set that are not victims of self-interferences belong to the actual reuse set. The actual reuse set is the set of cache lines of the theoretical reuse set where no self-interferences occur.

Characterizing self-interferences is equivalent to determining the actual reuse set, and determining the actual reuse set is equivalent to studying the mapping of the theoretical reuse set in cache. To compute the overlapping within the theoretical reuse set of a reference A, the loops $n$ to $l + 1$ are successively executed, starting with loop $n$. On each loop level $k$, the cache lines used by loops $n, \ldots, k + 1$, which are not victims of interferences, form a temporary reuse set called $RS(A)^k_l$. On loop level $k$, the interferences between $N_k$ such temporary reuse sets $RS(A)^{k+1}_l$ are determined, and the cache lines that are still not victims of interferences form the new temporary reuse set $RS(A)^k_l$.

1. Loop level $n$ (reuse occurs on loop $l$): on this loop level, the reuse set is $\{a_0 + a_1 j_1 + \ldots + a_n j_n ;\ 0 \le j_n < N_n\}$. Let $a^n_0 = a_0 + a_1 j_1 + \ldots + a_{n-1} j_{n-1}$. A temporary reuse set $RS(A)^n_l$ corresponds to $\min(1, a_n/L_S) \times N_n$ cache lines, starting at cache position $a^n_0 \bmod C_S$.
If $C_S/L_S < \min(1, a_n/L_S) \times N_n$ then capacity interferences occur, and the victim cache lines are removed from the temporary reuse set (cf example on Matrix-Vector multiply below).

2. Loop level $n - 1$: let $a^{n-1}_0 = a_0 + a_1 j_1 + \ldots + a_{n-2} j_{n-2}$. The temporary reuse sets $RS(A)^n_l$ start at cache positions $(a^{n-1}_0 + a_{n-1} j_{n-1}) \bmod C_S$. By checking whether two such temporary reuse sets interfere (depending on their relative cache positions) and evaluating the amount of overlapping, it is possible to compute the number of cache lines that are not victims of interferences. These cache lines correspond to reuse set $RS(A)^{n-1}_l$ (cf example on Matrix-Matrix multiply below).

3. All subsequent steps are identical to step 2. The process stops on loop level $l$.

4. For each iteration of loop $l$, the number of cache lines of the theoretical reuse set that are victims of self-interferences is equal to (number of cache lines of the theoretical reuse set) $-$ (number of cache lines of the actual reuse set). So the number of additional memory requests per iteration of loop $l$ is equal to that same difference. The total number of additional memory requests is equal to $N_1 \times \ldots \times N_l \times$ (number of cache lines of the theoretical reuse set $-$ number of cache lines of the actual reuse set).

This process is generally not too complex because few levels have to be considered (reuse sets of dimension 1 or 2, i.e. depending on 1 or 2 loop levels, correspond to a majority of cases). Furthermore, very often the layout of sets in cache is straightforward due to the way array elements are referenced (stride-one access to arrays), making it relatively easy to evaluate self-interferences and compute the actual reuse set.
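The outcome of this process can be cross-checked by brute force: enumerate the addresses of the theoretical reuse set, map each memory line to its cache line, and keep only the cache lines claimed by a single memory line. The sketch below does exactly that; it replaces the level-by-level procedure with plain enumeration, and the function name `reuse_set_lines` and the contiguous toy reuse sets are assumptions made for illustration.

```python
from collections import defaultdict


def reuse_set_lines(addresses, cache_size=1024, line_size=4):
    """Return (theoretical, actual) numbers of cache lines of a reuse set.

    A cache line belongs to the actual reuse set only if a single memory
    line of the reuse set maps to it (no self-interference).
    """
    num_lines = cache_size // line_size
    mem_lines = {addr // line_size for addr in addresses}
    by_slot = defaultdict(set)
    for mem_line in mem_lines:
        by_slot[mem_line % num_lines].add(mem_line)
    actual = sum(1 for lines in by_slot.values() if len(lines) == 1)
    return len(mem_lines), actual


# Reuse set made of 1536 consecutive elements (1.5 times the cache size).
theoretical, actual = reuse_set_lines(range(1536))
print(theoretical, actual)   # 384 128
print(theoretical - actual)  # 256 additional requests per iteration of loop l
```

Multiplying the last figure by $N_1 \times \ldots \times N_l$ gives the total number of additional memory requests of step 4. Enumeration costs time proportional to the size of the reuse set, which is why the analytical procedure is preferable once the layout of the sets is understood.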

Matrix-Vector multiply

Let us illustrate the previous notions with matrix-vector multiply. The do-loop nest is the following:

    DO J1 = 0, N-1
      DO J2 = 0, N-1
        Y(J1) += A(J1,J2) * X(J2)

The linear function corresponding to each array reference is the following:

    Y: 0 j_2 + j_1 + y_0
    X: j_2 + 0 j_1 + x_0
    A: j_2 + N j_1 + a_0   (where N is the leading dimension of array A)

For array Y, only loop level 2 carries reuse, and the theoretical reuse set is $RS(Y)_2 = \{Y(j_1)\}$. For array X, the theoretical reuse set is $RS(X)_1 = \{X(0), \ldots, X(N - 1)\}$. No reuse occurs for array A.

[Figure 2: Mapping of elements of X into cache. Panels: N <= Cs; Cs < N <= 2Cs; 2Cs < N. Legend: cache, elements of X, reusable elements of X.]

The theoretical reuse set of Y is equal to the actual reuse set of Y and corresponds to one cache line only; no self-interferences can occur. On the other hand, the theoretical reuse set of X contains $N$ consecutive elements (the access stride is $1/L_S$), or $N/L_S$ cache lines. Therefore, if $N > C_S$, self-interferences occur. More precisely, if $C_S < N \le 2 C_S$, $2(N - C_S)$ elements of X are overlapped by other elements of X before they can be reused. Therefore, only the reuse of $2 C_S - N$ elements of X can actually be exploited. So, the actual reuse set size is $(2 C_S - N)/L_S$ cache lines, and $2(N - C_S)/L_S$ additional memory requests are necessary on each iteration of loop $j_1$ because of self-interferences of X. If $N > 2 C_S$, no reuse occurs: the overlap between elements of X is total (cf figure ??).

This problem is well known in the domain of loop restructuring. It corresponds to capacity interferences rather than mapping interferences. The most classic method for dealing with it is to block the loop, so that the reuse set of X is smaller than cache size. However, a less obvious and

known fact is that, even when the loop is blocked, interferences can still occur. Let us consider the blocked version of Matrix-Matrix multiply.

Blocked Matrix-Matrix multiply

    DO J1 = 0, N-1, Bs
      DO J2 = 0, N-1, Bs
        DO J3 = 0, N-1
          DO J4 = J1, J1 + Bs-1
            DO J5 = J2, J2 + Bs-1
              C(J3,J5) += A(J3,J4) * B(J4,J5)

where $B_S$ is the block size (for the sake of simplicity it is assumed that $B_S$ divides $N$). Matrix-Matrix multiply is generally blocked that way so as to keep in cache a $B_S \times B_S$ submatrix of matrix B. Let us normalize the loop indices so that they fit the model hypotheses (constant boundaries and stride 1): $j_5$ is defined by $j_5 = J5 - J2$ and similarly $j_4 = J4 - J1$, $j_1 = J1/B_S$, $j_2 = J2/B_S$, and for all other $i$, $j_i = Ji$. Then, the linear function corresponding to array B is $j_5 + N_B j_4 + 0 j_3 + B_S j_2 + B_S N_B j_1 + b_0$, so its reuse set is $RS(B)_3 = \{j_5 + N_B j_4 + B_S j_2 + B_S N_B j_1 + b_0 ;\ 0 \le j_5 \le B_S - 1,\ 0 \le j_4 \le B_S - 1\}$, where $N_B$ is the leading dimension of array B.

Let us consider what happens in cache when this block is referenced. When $j_5$ varies between 0 and $B_S - 1$, an interval of $B_S$ elements of B is mapped into consecutive cache locations, or $B_S/L_S$ cache lines. These cache lines correspond to the temporary reuse set $RS(B)^5_3$. Then, $j_4$ is increased by 1, and another interval of $B_S$ elements is loaded into cache at a distance $N_B$ from the previous interval. Due to the finite size of the cache, the actual distance in cache between two consecutive intervals of size $B_S$ is $N_B \bmod C_S$ (where mod is the modulo operator) instead of $N_B$. Therefore, depending on the value of $N_B \bmod C_S$, successive intervals can overlap in cache, thereby more or less degrading the potential reuse gained from blocking (all the cache lines of the subintervals of size $B_S$ constitute the temporary reuse set $RS(B)^4_3$). For example, if $0 \le N_B \bmod C_S < B_S$, two consecutive intervals overlap by $B_S - (N_B \bmod C_S)$ cache locations or $(B_S - (N_B \bmod C_S))/L_S$ cache lines (cf figure ??).
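The effect of $N_B \bmod C_S$ on the block of B can be observed by brute force. The sketch below maps every element of a $B_S \times B_S$ block to its cache location and counts the locations touched by exactly one element, i.e. those free of self-interference. It works in cache locations (elements) rather than cache lines, and the function name `clean_locations`, the block size 16 and the three leading dimensions are illustrative assumptions.

```python
from collections import Counter


def clean_locations(block_size, leading_dim, cache_size=1024, base=0):
    """Count cache locations of a block touched by exactly one of its elements."""
    hits = Counter()
    for j4 in range(block_size):          # one interval of block_size elements per j4
        for j5 in range(block_size):
            hits[(base + leading_dim * j4 + j5) % cache_size] += 1
    return sum(1 for count in hits.values() if count == 1)


for leading_dim in (1024, 1036, 1040):
    distance = leading_dim % 1024         # distance in cache between two intervals
    print(distance, clean_locations(16, leading_dim))
# 0 0      all 16 intervals fall on the same 16 locations
# 12 136   consecutive intervals overlap by 16 - 12 = 4 locations
# 16 256   intervals are laid out back to back, no overlap
```

A change of only a few elements in the leading dimension moves the block from no reusable location at all to a fully reusable block, which is the instability described above.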
Since an interval has two neighbors, each interval overlaps by $2 (B_S - (N_B \bmod C_S))/L_S$ cache lines with its neighbors (except for the first and last interval, which have only one neighbor and consequently overlap by $(B_S - (N_B \bmod C_S))/L_S$ cache lines only). So, in each interval, only $(B_S - 2(B_S - (N_B \bmod C_S)))/L_S$ cache lines are not victims of self-interferences (except for the first and last interval, where $(B_S - (B_S - (N_B \bmod C_S)))/L_S$ cache lines are not victims of self-interferences). Therefore, the actual reuse set only corresponds to

$$(B_S - 2)\,\frac{B_S - 2(B_S - (N_B \bmod C_S))}{L_S} + 2\,\frac{B_S - (B_S - (N_B \bmod C_S))}{L_S} = \frac{(B_S - 2)(2 (N_B \bmod C_S) - B_S) + 2 (N_B \bmod C_S)}{L_S}$$

cache lines, while the theoretical reuse set (assuming no self-interferences) corresponds to $B_S \times B_S / L_S$ cache lines. Therefore the self-interferences of array B breed

$$\frac{B_S^2 - \left[(B_S - 2)(2 (N_B \bmod C_S) - B_S) + 2 (N_B \bmod C_S)\right]}{L_S}$$

additional memory requests per iteration of loop 3, and

$$N_1 N_2 N_3 \times \frac{B_S^2 - \left[(B_S - 2)(2 (N_B \bmod C_S) - B_S) + 2 (N_B \bmod C_S)\right]}{L_S}$$

additional memory requests in total. Padding can be used for matrix B so that $B_S = N_B \bmod C_S$ (where $N_B$ is the leading dimension of matrix B). In that case, placement of intervals in cache is optimum (cf figure ??).

Conclusions

For modeling purposes, such examples stress the fact that it is critical to properly evaluate the mapping of array elements in cache to obtain an accurate evaluation of self-interference misses (cf figure ??). On a broader scope, they show that precise analysis of interferences can be valuable for code optimizations such as loop restructuring, since part or all of the benefits of classic loop restructuring can be lost because of cache interferences. Moreover, such phenomena can appear frequently and with varying importance (in our example, depending on the parameter $N_B \bmod C_S$, interferences can be maximum when $N_B \bmod C_S = 0$ or minimum when $N_B \bmod C_S = B_S$).

[Figure 3: Mapping of elements of B into cache. Panels: N mod Cs = 0; N mod Cs = Bs; N mod Cs = 3Bs/4. Legend: cache; elements of B (or D); reusable elements of B (or D); actual reuse set of B (or D); actual interference set of B (or D).]

[Figure 4: Influence of parameter $N_B$ on self-interferences. Plot of the hit ratio of array B (vertical axis, 0.7 to 1) against $N_B$, the leading dimension of array B (horizontal axis, 150 to 200).]

3.4 Cross-interferences

Two main cases of cross-interferences can occur between two references: either the difference between the corresponding two virtual addresses is constant (independent of loop indices), or it

varies with the loop indices. These two cases must be distinguished because such cross-interferences are very different. In the first case, the two references are in translation: they always overlap and the amount of overlapping is constant, while in the second case the two references overlap only periodically and the amount of overlapping varies. So, detecting and estimating cross-interferences in the first case basically amounts to comparing the constant parameter of the two virtual addresses (it generally depends on array dimensions and base addresses), while in the second case, the relative movement of the two references must be analyzed. The first type of interferences is called internal cross-interferences (because a set of references in translation constitutes a kind of class of references, and such cross-interferences then occur within a class), and the second type of interferences is called external cross-interferences (interferences among two references not belonging to the same class).

3.4.1 Estimating cross-interferences

Computing the impact of cross-interferences between two references amounts to estimating how much of the reuse of one reference is lost because of cross-interferences with the other reference. So, let us consider two references $R_1, R_2$, and compute the impact of $R_2$ on the reuse of $R_1$. First, the set of elements $R_1$ can reuse must be estimated; it is the reuse set defined in section ??. Second, the set of elements of $R_2$ that can interfere with $R_1$ must be estimated as well; this set is called the interference set. Because the reuse set is computed on a given loop level $l$, the interference set should be computed on the same loop level. Recall that below loop $l$, no reuse can occur for $R_1$. Therefore, the number of additional memory requests for $R_1$ due to cross-interferences with $R_2$, on each reutilization of the reuse set (i.e. on each iteration of loop $l$), is exactly equal to the number of cache lines used by both the reuse set and the interference set.
This notion is fundamental in the computation of cross-interferences. Computing interferences this way makes it possible to abstract away time considerations, i.e. when interferences occur. It is sufficient to estimate the intersection between the set of cache lines corresponding to the reuse set and the interference set. It is important to note that, in the following sections, the reuse set considered is the actual reuse set; otherwise cross-interferences would be counted where self-interferences already occur, resulting in an overestimate of additional memory requests.

The interference set

The definition of the theoretical interference set is the same as that of the theoretical reuse set.

Definition. For array reference $a_0 + a_1 j_1 + \ldots + a_n j_n$, the theoretical interference set, on loop level $l$ (this loop level is determined by the victim reuse set), is equal to $IS(A)_l = \{a_0 + a_1 j_1 + \ldots + a_n j_n ;\ (0 \le j_i \le N_i - 1)_{i>l}\}$.

Moreover, determining the actual interference set is done in much the same way as for the actual reuse set. The actual reuse set is the subset of cache lines of the theoretical reuse set where no self-interferences occur, while the actual interference set is simply the set of cache lines used by the theoretical interference set. So if a cache line of the theoretical interference set is a victim of self-interferences, this cache line is still counted in the actual interference set (while such a cache line is rejected from the actual reuse set). Intuitively, the actual interference set corresponds to the cache surface (the number of cache lines) used by the theoretical interference set. The amount of cross-interferences is directly correlated to the size of the actual interference set (the larger the set, the higher the probability of overlapping with the reuse set). The process for determining the actual interference set is the following:

1. Loop level $n$ (for the reuse set, reuse occurs on loop $l$): on this loop level, the interference set is $\{ a_0 + a_1 j_1 + \dots + a_n j_n \;;\; 0 \le j_n < N_n \}$. Let $a^n_0 = a_0 + a_1 j_1 + \dots + a_{n-1} j_{n-1}$. A temporary interference set $IS(T)^n_l$ corresponds to $\min(1, a_n)\, N_n$ cache lines, starting at cache position $a^n_0 \bmod C_S$. If $C_S < \min(1, a_n)\, N_n$ then capacity interferences occur, and the cache lines used twice or more are only counted once.

2. Loop level $n-1$: let $a^{n-1}_0 = a_0 + a_1 j_1 + \dots + a_{n-2} j_{n-2}$. The temporary interference sets $IS(A)^n_l$ start at cache positions $(a^{n-1}_0 + a_{n-1} j_{n-1}) \bmod C_S$. So, by checking whether two such temporary interference sets interfere (depending on their relative cache positions) and evaluating the amount of overlapping, it is possible to compute the number of cache lines corresponding to the union of the two intervals. The union of the cache lines of all temporary interference sets $IS(A)^n_l$ is the temporary interference set $IS(A)^{n-1}_l$.

3. All subsequent steps are identical to step 2. The process stops on loop level $l$.

Let us consider the same example used for illustrating the computation of the actual reuse set, i.e. Matrix-Matrix multiply, except that reference $B(j_4, j_5)$ is replaced by $B(j_4, j_5) + D(j_4, j_5)$. Then, computing cross-interferences on $B$ due to $D$ implies computing the interference set of $D$ on loop 3 (since reuse occurs on loop 3 for $B$). Then, as for array $B$, if $0 \le N_D \bmod C_S < B_S$, each interval of size $B_S$ of $D$ overlaps by $2(B_S - (N_D \bmod C_S))$ cache lines with its two neighbor intervals (except for the first and the last interval). However, unlike for the actual reuse set, these cache lines are still counted in the actual interference set. Because of this overlapping, the size of the actual interference set is $(B_S - 1)(N_D \bmod C_S) + B_S$ cache lines instead of $B_S^2$ (cf figure ??).
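The size of the actual interference set in the example above can be checked by brute force. The following is a minimal sketch, assuming one array element per cache line and a cache of `cache_size` lines indexed modulo `cache_size`; the function name and its parameters are hypothetical and are not taken from the paper.

```python
def interference_set_size(start, stride, count, length, cache_size):
    """Count the distinct cache lines used by `count` intervals of `length`
    consecutive lines, the k-th interval starting at start + k * stride.

    Lines used by several intervals are counted once, as in the actual
    interference set (assumption: one array element per cache line).
    """
    lines = set()
    for k in range(count):
        base = start + k * stride
        for offset in range(length):
            lines.add((base + offset) % cache_size)
    return len(lines)


# Hypothetical parameters: C_S = 1024 lines, B_S = 16, N_D mod C_S = 5 < B_S.
C_S, B_S, N_D = 1024, 16, 1024 + 5
size = interference_set_size(0, N_D, B_S, B_S, C_S)
print(size, (B_S - 1) * (N_D % C_S) + B_S)  # both values are 91
```

With `N_D = 1024` (so that `N_D mod C_S = 0`) the same call returns `B_S` lines, and with `N_D = 1024 + 16` the intervals no longer overlap and it returns `B_S ** 2` lines.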
Unlike for the reuse set, it is preferable that overlapping occurs within the interference set: the larger the overlapping within a theoretical interference set, the smaller the corresponding actual interference set, and the less likely the actual interference set is to overlap with cache lines of the actual reuse set. Therefore, the optimal case is obtained for $N_D \bmod C_S = 0$ (note that it corresponds to the worst case for the reuse set of $D$; cf section ??).

3.4.2 Internal cross-interferences

Internal cross-interferences occur between two references in translation. Thanks to that property, the relative cache position between the reuse set of the victim reference and the interference set of the interfering reference is always the same. Therefore, if the reuse set is defined on loop level $l$ (reuse occurs on loop $l$), the total number of additional memory requests due to cross-interferences between the two references is equal to the total number of iterations of loop $l$ times the number of cache lines used by both the actual interference set and the actual reuse set. The process for computing internal cross-interferences on a reference $R_1$ due to reference $R_2$ is the following:

1. Compute the actual reuse set of $R_1$; the reuse occurs on loop $l$. Compute the actual interference set of $R_2$ on loop $l$.

2. Compute the number of cache lines $CL(R_1, R_2)$ used by both the actual reuse set and the actual interference set. This is done by simply comparing the relative starting positions of both sets.

3. The total number of additional memory requests due to internal cross-interferences between $R_1$ and $R_2$ is equal to $N_1 \cdots N_l \; CL(R_1, R_2)$.
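The three steps above can be sketched as follows. This is only an illustration under simplifying assumptions: both sets are intervals of consecutive cache lines, one array element occupies one cache line, and the helper names are hypothetical.

```python
def cache_lines(start, length, cache_size):
    """Set of cache lines used by `length` consecutive lines from `start`."""
    return {(start + k) % cache_size for k in range(length)}


def internal_cross_interferences(reuse_start, reuse_len,
                                 interf_start, interf_len,
                                 cache_size, reuse_loop_iterations):
    """Return (CL, total) for two references in translation.

    CL is the number of cache lines shared by the actual reuse set and the
    actual interference set (step 2); total is N_1 * ... * N_l * CL (step 3),
    where `reuse_loop_iterations` stands for N_1 * ... * N_l.
    """
    reuse = cache_lines(reuse_start, reuse_len, cache_size)
    interference = cache_lines(interf_start, interf_len, cache_size)
    cl = len(reuse & interference)
    return cl, reuse_loop_iterations * cl


# Hypothetical values: two intervals of 40 lines at cache distance 10,
# a cache of 1024 lines, and 200 iterations of the reuse loop.
print(internal_cross_interferences(0, 40, 1024 + 10, 40, 1024, 200))  # (30, 6000)
```

Because the two references are in translation, the intersection is computed once and simply multiplied by the number of iterations of the reuse loop.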

Elementary linear algebra operation

Let us consider the matrix operation $A = X\,{}^{t}Y$ where $A$ is an $N \times N$ matrix and $X, Y$ are two vectors of dimension $N$. Often, vectors $X, Y$ are linear combinations of several vectors. For example, if $X = X_1 + X_2$, the operation is $A = (X_1 + X_2)\,{}^{t}Y$. The corresponding loop is the following

    DO J1 = 0, N1-1
      DO J2 = 0, N2-1
        A(J1,J2) = (X1(J2) + X2(J2))*Y(J1)

We want to compute cross-interferences on $X_1$ due to $X_2$. The virtual addresses of $X_1(j_2), X_2(j_2)$ are $x^1_0 + j_2$ and $x^2_0 + j_2$. Since $x^1_0 + j_2 - (x^2_0 + j_2) = x^1_0 - x^2_0$, the difference is constant, so the two references are in translation. The cross-interferences are internal cross-interferences. For array $X_1$, reuse occurs on loop $j_1$ and the reuse set of $X_1$ corresponds to $N_2$ consecutive cache lines. On this loop level, the interference set of $X_2$ corresponds to $N_2$ consecutive cache lines. The cache distance between the two sets is equal to $(x^1_0 - x^2_0) \bmod C_S$. If $(x^1_0 - x^2_0) \bmod C_S \in [0, N_2 - 1] \cup [C_S - (N_2 - 1), C_S]$, the two sets overlap (cf figure ??). In that case, since both sets are intervals of $N_2$ cache lines, the amount of overlapping is equal to $CL(X_1, X_2) = N_2 - \min\big((x^1_0 - x^2_0) \bmod C_S,\; C_S - ((x^1_0 - x^2_0) \bmod C_S)\big)$ cache lines (cf figure ??). Therefore, internal cross-interferences between $X_1$ and $X_2$ bring $CL(X_1, X_2)$ additional memory requests per iteration of loop $j_1$, or $N_1 \cdots N_l \; CL(X_1, X_2)$ additional memory requests in total, with $l = 1$ here (cf figure ??).

[Figure 5: Internal cross-interferences between $X_1$ and $X_2$. Legend: elements of X1; elements of X2; elements victim of internal cross-interferences; cache distance $(x^1_0 - x^2_0) \bmod C_S$.]

Spatial interferences

Note that internal cross-interferences correspond to spatial interferences only if the relative cache position of the actual reuse set and the actual interference set is smaller than the line size (in the above example, $(x^1_0 - x^2_0) \bmod C_S < L_S$). So, spatial interferences have a low probability of occurring.
On the other hand, when such a case happens, little or no spatial reuse can be achieved and no temporal reuse can be achieved either. These cases are extremely costly: they correspond to "ping-pong", i.e. two arrays that translate in cache and constantly compete for the same cache location.
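The effect of the cache distance on the reuse of X1 can be reproduced with a very small simulation. The sketch below is not the authors' experimental setup: it assumes a direct-mapped cache with one array element per line, and it only simulates the references to X1 and X2 of the loop above (references to A and Y are ignored). All names are hypothetical.

```python
def x1_misses(x1_base, x2_base, n1, n2, cache_size):
    """Count the misses of X1(J2) in the loop nest
         DO J1 ; DO J2 ; ... X1(J2) + X2(J2) ...
    on a direct-mapped cache holding one element per line.

    The two arrays must not share addresses (x2_base is expected to be
    x1_base + k * cache_size + distance for some k >= 1).
    """
    cache = [None] * cache_size          # cache[line] = address currently held
    misses = 0
    for _j1 in range(n1):
        for j2 in range(n2):
            for is_x1, addr in ((True, x1_base + j2), (False, x2_base + j2)):
                line = addr % cache_size
                if cache[line] != addr:  # miss: load the element
                    cache[line] = addr
                    if is_x1:
                        misses += 1
    return misses


# Hypothetical sizes in the spirit of Figure 6: N1 = 200, N2 = 40.
N1, N2, C_S = 200, 40, 1024
for distance in (0, 10, 40):
    m = x1_misses(0, C_S + distance, N1, N2, C_S)
    print(distance, m, 1.0 - m / (N1 * N2))   # distance, misses, hit ratio of X1
```

In this model the first traversal of X1 always misses (N2 compulsory misses), and each later iteration of loop J1 adds one miss per cache line shared with X2, so the hit ratio of X1 is lowest at distance 0 and no longer depends on the distance once it reaches N2.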

[Figure 6: Impact of internal cross-interferences on the hit ratio ($N_1 = 200$, $N_2 = 40$). Curves: Total, X1, X2. Axes: hit ratio versus cache distance between X1 and X2.]

Group-dependence reuse

In this paper, mostly reuse due to self-dependences (a reference reuses itself) is analyzed. However, the reuse due to group-dependences (a reference reuses elements of another reference) can also be significant, though it is in general less important than reuse due to self-dependences. Since the way internal cross-interferences perturb group-dependence reuse is original, the phenomenon is worth illustrating with an example.

Example from FLO52

Let us consider the following simple example extracted from a version of a Perfect Club code, FLO52 [?].

    DO 10 J1=2,N1
      DO 10 J2=1,N2
        XY(J1,J2) = X(J1-1,J2,1) - X(J1,J2,1)

Array XY is actually a temporary array used for scalar expansion. The leading dimension of arrays XY and X can be considered equal to $N_2$. In this case, there is no self-dependence on array X, but there is a group-dependence between references $X(j_1, j_2, 1)$ and $X(j_1 - 1, j_2, 1)$ which benefits array reference $X(j_1 - 1, j_2, 1)$. Let us call $R_1$ the reference $X(j_1 - 1, j_2, 1)$ and $R_2$ the reference $X(j_1, j_2, 1)$. Since reuse occurs on a given loop level (loop 1), it is possible to extend the definition of the reuse set to group-dependences (except that the reuse set is not reused by the reference itself). The reuse set of $R_1$ is $RS(X)^2_1 = \{ X(j_1, 1, 1), \dots, X(j_1, N_2, 1) \}$. The dependence distance between the two references is equal to $N_2$. It is assumed that $N_2 < C_S$ so that the reuse set of $R_1$ ($N_2$ cache lines) fits in cache. Now, array XY can have internal cross-interferences with reference $R_1$. However, the way such cross-interferences occur is not straightforward. The amount of overlapping is not equal to the number of cache lines used by both the reuse set of $R_1$ and the interference set of XY.
Indeed, internal cross-interferences occur in a boolean way. Depending on the cache distance $(x_0 - xy_0) \bmod C_S$ between the interference set of XY and the reuse set of $R_1$, two cases can occur. Either $(x_0 - xy_0) \bmod C_S \in [C_S - (N_2 - 1), C_S]$ and the interference set of XY flushes no element of the reuse set of $R_1$, because a given cache line is used by $R_1$ after it has been used by XY (cf figures ?? and ??). There is no additional memory request. Or $(x_0 - xy_0) \bmod C_S \in [0, N_2 - 1]$ and the interference set of XY flushes all elements of the reuse set of $R_1$, because a given cache line is used by $R_1$ before it is used by XY. Therefore, elements referenced by $X(j_1, j_2, 1)$ are flushed before they can be reused by $X(j_1 - 1, j_2, 1)$ (cf figures ?? and ??; note also that for $(x_0 - xy_0) \bmod C_S = 0$ and $(x_0 - xy_0) \bmod C_S = N_2$, reference XY induces a ping-pong phenomenon with reference $X(j_1, j_2, 1)$ and reference $X(j_1 - 1, j_2, 1)$ respectively). The total number of additional memory requests is equal to the number of cache lines of the reuse set of $X(j_1, j_2, 1)$ that would have been reused by $X(j_1 - 1, j_2, 1)$. Since $N_2 - 1$ array elements would have been reused, the number of additional memory requests per iteration of loop 1 is $N_2 - 1$, and the total number of additional memory requests due to internal cross-interferences is equal to $(N_1 - 1)(N_2 - 1)$.

[Figure 7: Perturbation of group-dependence reuse of array X. Legend: elements of X(J1-1,J2,1); elements of X(J1,J2,1); elements of XY(J1,J2).]

[Figure 8: Influence of initial positions of X and XY on the total hit ratio; $N_1 = N_2 = 100$. Curves: Total, X(J1,J2), X(J1-1,J2), XY. Axes: hit ratio versus (XY0-X0).]

There again, by simply changing a base address (the base address of array XY), it is possible to eliminate cache interferences.
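The boolean behaviour described above can be observed with a small simulation of the FLO52 loop. This is a sketch under explicit assumptions that are not in the paper: a direct-mapped cache with one array element per line, addresses of the form base + N2*j1 + j2 for both arrays (leading dimension N2, as in the text), zero-based indices, and the access order R1, R2, XY inside each iteration. Function and variable names are hypothetical.

```python
def r1_misses(x0, xy0, n1, n2, cache_size):
    """Count the misses of R1 = X(j1-1, j2) in
         XY(j1, j2) = X(j1-1, j2) - X(j1, j2)
    for j1 = 1..n1-1 and j2 = 0..n2-1 (zero-based indices)."""
    cache = [None] * cache_size              # cache[line] = address held

    def access(addr):
        line = addr % cache_size
        hit = cache[line] == addr
        cache[line] = addr
        return hit

    misses = 0
    for j1 in range(1, n1):
        for j2 in range(n2):
            if not access(x0 + n2 * (j1 - 1) + j2):   # R1 reads X(j1-1, j2)
                misses += 1
            access(x0 + n2 * j1 + j2)                 # R2 reads X(j1, j2)
            access(xy0 + n2 * j1 + j2)                # write to XY(j1, j2)
    return misses


N1 = N2 = 100
C_S = 1024
# (x0 - xy0) mod C_S = 10, inside [0, N2 - 1]: the reuse of R1 is lost.
print(r1_misses(0, 10 * C_S - 10, N1, N2, C_S))
# (x0 - xy0) mod C_S = C_S - 10, inside [C_S - (N2 - 1), C_S]: the reuse is kept.
print(r1_misses(0, 10 * C_S + 10, N1, N2, C_S))
```

In this model the first call misses on every access of R1, while the second call only pays the N2 compulsory misses of the first row, so moving the base address of XY by 20 elements is enough to switch between the two regimes.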

Conclusions

Internal cross-interferences can be very significant because they occur on each reutilization of the reuse set. They are as frequent and can be as important as self-interferences. The above examples illustrate the fact that cross-interferences can vary significantly, though apparently randomly. Considering array base addresses is critical for detecting and estimating internal cross-interferences.

3.4.3 External cross-interferences

If two references $R_1, R_2$ are not in translation, the cross-interferences on $R_1$ due to $R_2$ are called external cross-interferences. Computing such interferences amounts to estimating the relative position of $R_1$ and $R_2$. Each time the reuse set of $R_1$ and the interference set of $R_2$ overlap, the amount of overlapping is estimated (the time unit here is one iteration of the loop where reuse occurs for $R_1$, i.e. loop $l$). The virtual addresses of references $R_1, R_2$ are $r^1_0 + r^1_1 j_1 + \dots + r^1_n j_n$ and $r^2_0 + r^2_1 j_1 + \dots + r^2_n j_n$. The positions of the interference set and the reuse set are determined by loop indices $j_i$ with $i \le l$ (loop indices $j_i$ with $i > l$ determine the sets themselves). Therefore, the relative address distance of the two sets is $(r^1_0 - r^2_0) + (r^1_1 - r^2_1) j_1 + \dots + (r^1_l - r^2_l) j_l$. Let $r_i = r^1_i - r^2_i$ for $i > 0$, $r_0 = (r^1_0 - r^2_0) \bmod C_S$, and $D = \gcd_{1 \le i \le l}(r_i)$. Then for any value of $(j_i)_{1 \le i \le l}$, there exists $\lambda \in \mathbb{Z}$ such that $r_0 + r_1 j_1 + \dots + r_l j_l = r_0 + \lambda D$. Let $\delta = \gcd(D, C_S)$. The possible relative cache positions of the two references are necessarily of the form $(r_0 + \lambda\delta) \bmod C_S$. Therefore, $RS(R_1)_l$ and $IS(R_2)_l$ have only $C_S / \delta$ possible relative cache positions. Approximately on each iteration of reuse loop $l$, a new relative position is reached. Therefore, after $C_S / \delta$ iterations of loop $l$, all possible relative positions have been reached. Consequently, the relative "movement" of the two references is periodic, of period $C_S / \delta$.
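The period of the relative movement can be checked numerically. The sketch below is illustrative only: it assumes a single varying loop index for the brute-force enumeration, and the function names and numeric values are hypothetical.

```python
from functools import reduce
from math import gcd


def movement_period(strides, cache_size):
    """Return (delta, period), where D is the gcd of the stride differences
    r_1..r_l, delta = gcd(D, C_S) and period = C_S / delta."""
    d = reduce(gcd, strides)
    delta = gcd(d, cache_size)
    return delta, cache_size // delta


def reached_positions(r0, stride, iterations, cache_size):
    """Relative cache positions (r0 + stride * j) mod C_S for j = 0..iterations-1."""
    return {(r0 + stride * j) % cache_size for j in range(iterations)}


# Hypothetical example: one stride difference of 96 and a cache of 1024 lines.
delta, period = movement_period([96], 1024)
positions = reached_positions(5, 96, period, 1024)
print(delta, period, len(positions))   # 32 32 32
```

Running more than `period` iterations reaches no new position, and every position differs from `r0` by a multiple of `delta`. With a stride difference equal to the cache size, `movement_period([1024], 1024)` gives a period of 1, i.e. a single relative position.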
The total number of iterations of loop $l$ executed is $N_1 \cdots N_l$, so there are $\frac{N_1 \cdots N_l}{C_S / \delta}$ periods. Since the period is $C_S / \delta$, for any interval of $C_S / \delta$ values of $\lambda$, $(r_0 + \lambda\delta) \bmod C_S$ describes all possible relative cache positions. Let $I$ be one such interval. Then for any value of $\lambda$, the number of cache lines used by both the reuse set and the interference set, i.e. $CL(R_1, R_2, \lambda)$, is computed the same way internal cross-interferences are evaluated, i.e. by computing the beginning and the end of each set and then deducing their overlap. Then, the total number of additional memory requests for one period is $\sum_{\lambda \in I} CL(R_1, R_2, \lambda)$ and the total number of additional memory requests is $\frac{N_1 \cdots N_l}{C_S / \delta} \sum_{\lambda \in I} CL(R_1, R_2, \lambda)$. So the process for determining external cross-interferences on $R_1$ due to $R_2$ is the following:

1. Determine the reuse set of $R_1$.

2. Determine $\delta$. The relative cache positions of the two sets are $(r_0 + \lambda\delta) \bmod C_S$. The period of the movement is $C_S / \delta$. Any interval $I$ of $C_S / \delta$ values of $\lambda$ is picked. Compute $CL(R_1, R_2, \lambda)$ for any $\lambda \in I$.

3. The total number of additional memory requests is equal to $\frac{N_1 \cdots N_l}{C_S / \delta} \sum_{\lambda \in I} CL(R_1, R_2, \lambda)$.

Blocked Matrix-Vector multiply

Let us consider the following blocked version of Matrix-Vector multiply

    DO J1 = 0, N-1, Bs
      DO J2 = 0, N-1
        DO J3 = J1, J1 + Bs-1
          Y(J2) += A(J2,J3) * X(J3)

where $B_S$ is assumed to divide $N$ for simplicity's sake. As for the example of blocked Matrix-Matrix multiply, new indices are defined: $j_3 = J3 - J1$, $j_1 = J1 / B_S$ and $j_2 = J2$. The linear functions corresponding to the subscripts of X and A are the following

    X: $j_3 + 0\,j_2 + B_S j_1 + x_0$
    A: $j_3 + N j_2 + B_S j_1 + a_0$ (where $N$ is the leading dimension of array A)

Let us consider how array A perturbs the actual reuse of array X in Matrix-Vector multiply. The reuse set $RS(X)_2$ of X is an interval of size $B_S$ cache lines (reuse occurs on loop 2). The corresponding interference set of A is an interval of $B_S$ cache lines also ($IS(A)_2 = \{ 0 + N j_2 + B_S j_1 + a_0, \dots, B_S - 1 + N j_2 + B_S j_1 + a_0 \;;\; 0 \le j_2 \le N - 1 \}$). The relative position of the two sets changes on each iteration of $j_2$. It is assumed that $B_S < C_S / 2$ so that $RS(X)_2$ and $IS(A)_2$ do not necessarily overlap in cache. The relative positions of A and X are $j_3 + N j_2 + B_S j_1 + a_0 - (j_3 + B_S j_1 + x_0) = a_0 - x_0 + N j_2$. Let $r_0 = (a_0 - x_0) \bmod C_S$. So $D = \gcd(N) = N$ and $\delta = \gcd(N, C_S)$, and there are $C_S / \delta$ relative cache positions. Intuitively, studying the relative cache positions of the two sets means that the reuse set is considered to be fixed, while the interference set is considered to be moving in cache.

General case

Let us pick an interval $I$ of $C_S / \delta$ values of $\lambda$ such that $-\frac{C_S}{2} \le r_0 + \lambda\delta \le \frac{C_S}{2}$, i.e. $I = \left[ \frac{-C_S/2 - r_0}{\delta}, \frac{C_S/2 - r_0}{\delta} \right]$. Then, external cross-interferences occur when the beginning of the interference set, i.e. $r_0 + \lambda\delta$, is located within the reuse set, i.e. within $[0, B_S]$, or when the end of the interference set, i.e. $(r_0 + \lambda\delta + B_S) \bmod C_S$, is located within the reuse set, i.e. $[0, B_S]$. This can be expressed by the constraints $-B_S \le r_0 + \lambda\delta \le B_S$. Therefore, it is possible to define $I(B_S)$, the subinterval of $I$ where overlapping occurs: $I(B_S) = \left[ \frac{-B_S - r_0}{\delta}, \frac{B_S - r_0}{\delta} \right]$.
For any value of $\lambda \in I(B_S)$, i.e. $\frac{-B_S - r_0}{\delta} \le \lambda \le \frac{B_S - r_0}{\delta}$, the overlapping is $CL(X, A, \lambda) = |B_S - (r_0 + \lambda\delta)|$ cache lines, and the number of additional memory requests per period is $\sum_{\lambda \in I(B_S)} CL(X, A, \lambda)$. The total number of additional memory requests is $\frac{N^2}{B_S \, (C_S / \delta)} \sum_{\lambda \in I(B_S)} CL(X, A, \lambda)$.

Spatial interferences

Estimating spatial interferences is done exactly the same way as for temporal interferences, except that spatial interferences occur only if $-L_S \le r_0 + \lambda\delta \le L_S$ instead of $-B_S \le r_0 + \lambda\delta \le B_S$ for temporal interferences. The number of additional memory requests due to spatial interferences is equal to $CL_s(X, A, \lambda) = B_S \, |L_S - 1 - (r_0 + \lambda\delta)|$ per value of $\lambda$ (note that $L_S - 1$ is used instead of $L_S$ because the first reference to the line is considered to be a temporal reuse, not a spatial reuse). As above, an interval $I(L_S)$ of values of $\lambda$ where spatial interferences occur can be defined. Then, the number of additional memory requests per period is $\sum_{\lambda \in I(L_S)} CL_s(X, A, \lambda)$. The total number of additional memory requests is $\frac{N^2}{B_S \, (C_S / \delta)} \sum_{\lambda \in I(L_S)} CL_s(X, A, \lambda)$. Note that spatial interferences occur very few times (or not at all) over a period. So, though there are numerous additional memory requests for each value of $\lambda$ where spatial interferences occur, there are few such values of $\lambda$, so that the total number of spatial interferences is small in general.

Particular cases

Depending on the values of $r_0$ and $\delta$, many different situations can occur. If $B_S < \delta$, there are "holes" between the different possible cache positions of the interference set. One such hole has size $\delta - B_S$, so that if $\delta > 2 B_S$, the reuse set can fit entirely within such a hole. In that case, no external cross-interferences would occur between the two arrays. Let us consider the case $\delta = C_S$, i.e. there is only one possible cache position for the interference set. Then if $r_0 \ge B_S$, no external cross-interference can occur, while if $r_0 < B_S$, external cross-interferences occur on each iteration of the reuse loop $j_2$ (cf figure ??). In that case, $B_S - r_0$ cache lines overlap, so that a total of $\frac{N^2}{B_S} (B_S - r_0)$ additional memory requests are due to such external cross-interferences. The worst case corresponds to $r_0 = 0$: total overlapping occurs, and neither temporal locality nor spatial locality can be exploited.

[Figure 9: Cross-interferences between A and X ($N \bmod C_S = 0$). Panels: $r_0$ small; $r_0$ large. Legend: elements of A; elements of X; reusable elements of X.]

Conclusions

Unlike internal cross-interferences, external cross-interferences occur periodically and with varying importance. Therefore detecting and estimating external cross-interferences is more difficult.
Still, a precise estimate can be derived by studying such interferences over a period. It is possible to compute the period of interferences and the number of cross-interferences over a period. Then, the total number of external cross-interferences is equal to the number of periods times the external cross-interferences over one period. Though external cross-interferences operate in a more irregular manner than internal cross-interferences, they can still be very damaging.

3.5 Interferences with multiple array references

The previous section provides a method for estimating how much an array can perturb the potential reuse of another array. Now, when the reuse of an array is perturbed by several