Influence of Cross-Interferences on Blocked Loops: A Case Study with Matrix-Vector Multiply

CHRISTINE FRICKER, INRIA, France, and OLIVIER TEMAM and WILLIAM JALBY, University of Versailles, France

State-of-the-art data locality optimizing algorithms are targeted for local memories rather than for cache memories. Recent work on cache interferences seems to indicate that these phenomena can severely affect the cache performance of blocked algorithms. Because of cache conflicts, it is not possible to know the precise gain brought by blocking. It is even difficult to determine for which problem sizes blocking is useful. Computing the actual optimal block size is difficult because cache conflicts are highly irregular. In this article, we illustrate the issue of precisely evaluating cross-interferences in blocked loops with blocked matrix-vector multiply. Most significant interference phenomena are captured because unusual parameters such as array base addresses are being considered. The techniques used allow us to compute the precise improvement due to blocking and the threshold value of problem parameters for which the blocked loop should be preferred. It is also possible to derive an expression of the optimal block size as a function of problem parameters. Finally, it is shown that a precise rather than an approximate evaluation of cache conflicts is sometimes necessary to obtain near-optimal performance.

Categories and Subject Descriptors: B.3.0 [Memory Structures]: General; C.4 [Computer Systems Organization]: Performance of Systems - modeling techniques; D.3.4 [Programming Languages]: Processors

General Terms: Measurement, Performance

Additional Key Words and Phrases: Blocking, cache conflicts (interferences), cache performance, data locality optimization, numerical codes

1. INTRODUCTION

To date, data locality optimizing algorithms [Eisenbeis et al. 1990; Ferrante et al. 1991; McKinley 1992; Porterfield 1989; Wolf and Lam 1991] have been concerned with decreasing capacity misses using blocking and have mostly ignored the occurrence of conflict misses.
However, previous studies [Ferrante et al. 1991; Lam et al. 1991] showed that conflict misses can significantly alter the behavior of blocked algorithms. More precisely, self-interferences in blocked loops [Lam et al. 1991] have been shown to be sensitive to the choice of the block size. A data locality optimization technique which combines tile size optimization and copying has also been proposed [Esseghir 1993] as a way to reduce self-interferences in numerical loops.

This work was funded by the DGXIII ESPRIT BRA III Project APPARC. Authors' addresses: C. Fricker, INRIA, 78153 Le Chesnay, France; email: Christine.Fricker@inria.fr; O. Temam, PRiSM, University of Versailles, 78000 Versailles, France; email: temam@prism.uvsq.fr; W. Jalby, PRiSM, University of Versailles, 78000 Versailles, France; email: jalby@prism.uvsq.fr.

    DO j1 = 0, N-1
      reg = Y(j1)
      DO j2 = 0, N-1
        reg += A(j2,j1) * X(j2)
      ENDDO
      Y(j1) = reg
    ENDDO

    DO jj2 = 0, N-1, B
      DO j1 = 0, N-1
        reg = Y(j1)
        DO j2 = jj2, min(jj2+B-1, N-1)
          reg += A(j2,j1) * X(j2)
        ENDDO
        Y(j1) = reg
      ENDDO
    ENDDO

Fig. 1. Blocked and nonblocked matrix-vector multiply (the nonblocked loop nest is listed first).

Recently, we have developed a model for evaluating conflict misses in numerical loops [Temam et al. 1994] with the purpose of understanding cache interference phenomena and predicting the cache performance of a numerical loop nest. Three different types of interference misses were distinguished: self-interferences, internal cross-interferences (cross-interferences between two references whose subscripts have identical linear expressions), and external cross-interferences (cross-interferences between any two other references). External cross-interferences are the most frequent type of interference and the most difficult to evaluate. We have mentioned in Temam et al. [1993] that two different types of evaluation can be performed, approximate or precise, but up to now we have mostly focused on the approximate evaluation. In this article, precise evaluation of external cross-interferences is shown to be sometimes necessary for computing the near-optimal block size of a numerical loop.

Most data locality optimizing algorithms barely deal with the issue of computing the optimal block size. One of the most elaborate treatments of this problem can be found in Eisenbeis et al. [1990], where the computation of the optimal block size amounts to evaluating the number of capacity misses as a function of the block size, and then finding the block size that minimizes this number. The purpose of the article is twofold: provide a detailed illustration of the technique used to derive the precise number of external cross-interference misses, and show how the precision of the evaluation of conflict misses can affect the determination of the optimal block size and, further, the performance of the loop.

Position of the Problem.
The example used to illustrate the different points developed in this article is the classic numerical algebra primitive matrix-vector multiply and its blocked version (see Figure 1). The target architecture considered is an 8KB direct-mapped cache with a line size equal to 32 bytes, which are the parameters of several current processors [Kane and Heinrich 1992; Sites 1992]. All problem parameters are expressed in double-precision floating-point numbers, i.e., 8 bytes, so that a cache size $C_S$ of 8KB corresponds to $C_S = 1024$, and a line size of 32 bytes to $L_S = 4$.

Notations. $m$ denotes the total number of cache misses. $m_t$ and $m_s$ denote the number of temporal and spatial misses. $m_i$ denotes the number of intrinsic misses. $m(T)$ denotes the total number of cache misses for array $T$. The notations $m_t(T)$, $m_s(T)$, $m_i(T)$ can also be deduced. Furthermore, $m(T_1, T_2)$ denotes the number of misses of $T_1$ due to interferences with $T_2$.
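The two loop nests of Figure 1 can be transcribed into a short Python sketch to check that blocking only reorders the accumulation and leaves the result unchanged. The function names are illustrative and not from the article, and the matrix is stored as a list of rows indexed `A[j2][j1]` to mirror the `A(j2,j1)` reference.

```python
def matvec_nonblocked(A, X, Y):
    """Nonblocked loop nest of Figure 1: Y(j1) += sum over j2 of A(j2,j1) * X(j2)."""
    n = len(X)
    Y = list(Y)  # work on a copy so the caller's vector is untouched
    for j1 in range(n):
        reg = Y[j1]
        for j2 in range(n):
            reg += A[j2][j1] * X[j2]
        Y[j1] = reg
    return Y


def matvec_blocked(A, X, Y, B):
    """Blocked loop nest of Figure 1: j2 is traversed in chunks of B elements."""
    n = len(X)
    Y = list(Y)
    for jj2 in range(0, n, B):
        for j1 in range(n):
            reg = Y[j1]
            # min(jj2 + B, n) is the exclusive bound matching min(jj2+B-1, N-1)
            for j2 in range(jj2, min(jj2 + B, n)):
                reg += A[j2][j1] * X[j2]
            Y[j1] = reg
    return Y
```

With integer inputs the two functions return identical vectors for any block size, including block sizes that do not divide N.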

Experiments. Throughout the article, the actual number of misses is obtained through simulations using a simulator developed for that purpose.

2. ESTIMATING THE NUMBER OF CACHE MISSES

Because of paper length constraints, this section is restricted to studying the external cross-interferences between array A and array X. A treatment of the other external cross-interferences in the loop can be found in Fricker et al. [1993]. External cross-interferences basically correspond to data that one reference is about to reuse being flushed from cache by another reference, where the two references have subscripts with distinct linear expressions. The set of data to be reused by the victim reference is called the reuse set, and the set of data interfering with this reuse set is called the interference set. These sets are defined on the loop level where the reuse occurs. So for arrays X and A in the blocked loop nest, the reuse loop is loop $j_1$ (for X), and the reuse set (of X) and the interference set (of A) both correspond to a set of $B$ array elements or $B/L_S$ cache lines. The problem amounts to studying the relative cache position of the two sets and to computing the size of their intersection when they overlap. When the intersection size is expressed in cache lines, it corresponds exactly to the number of conflict misses between the two references.

2.1 Interferences between X and A

Let us now study the relative cache position of the reuse set of X and the interference set of A. The positions of the beginning of these two sets are respectively

$$R_X = x_0 + j_2, \qquad R_A = a_0 + j_2 + M j_1.$$

Therefore, the relative position of the interference set with respect to the reuse set is

$$R_{XA} = a_0 - x_0 + M j_1.$$

Possible Relative Cache Positions of A and X. The first problem is to find all the possible relative positions of X and A, i.e., all the possible values of $R_{XA}$. Since $R_{XA} = a_0 - x_0 + M j_1$, the possible locations are $(a_0 - x_0 + M j_1) \bmod C_S$. Let $\delta = \gcd(M, C_S)$ and $r = (a_0 - x_0) \bmod C_S$. Then, $(a_0 - x_0 + M j_1) \bmod C_S = (r + \delta (M/\delta) j_1) \bmod C_S$. Only the residue of $r$ modulo $\delta$ affects this set of locations, so $r$ is taken in $[0, \delta - 1]$ in the computations that follow.
Therefore, the possible positions are all of the form $R_{XA} = (r + \lambda\delta) \bmod C_S$, $\lambda \in \mathbb{Z}$. The set of values of $\lambda$ corresponding to distinct cache positions is finite. The distance between two consecutive possible cache positions is $\delta$, and the number of distinct cache positions is equal to $C_S/\delta$.

Cache Positions where Interferences Occur. Let us consider the interval $I$ corresponding to $C_S/\delta$ consecutive values of $\lambda$ and defined by $-C_S/2 \le r + \lambda\delta < C_S/2$. For $\lambda \in I$, interferences occur only if $-B \le r + \lambda\delta \le B$, i.e., if the distance in cache between the beginning of the intervals of A and X belongs to $[-B, B]$ (see Figure 2(a)). The previous inequality can be rewritten as

$$\lceil (-B - r)/\delta \rceil \le \lambda \le \lfloor (B - r)/\delta \rfloor.$$

Let $B = B'\delta + b$ with $b = B \bmod \delta$. It is certain that interferences occur for $\lambda \in [-B', B' - 1]$, while for $\lambda = -(B' + 1)$ and $\lambda = B'$, interferences may occur depending on the relative values of $b$, $r$, and $\delta$ (this is due to the ceiling and floor functions of the above inequality).

Footnote: It is assumed here that $B \le C_S/2$.

Fig. 2. (a) Cross-interferences between A and X. (b) Miss ratio of X, with the predicted curve, versus dimension N (M = N).

Computing the Number of Temporal Interferences. As mentioned in the previous paragraph, the interferences between A and X recur with a period of $C_S/\delta$. Therefore, the amount of interferences needs to be computed over one period and then multiplied by the number of periods. An approximate number of periods is $N/(C_S/\delta)$. So, in this paragraph, only a chunk of $C_S/\delta$ iterations is considered, e.g., the interval $I$. For each value of $\lambda \in I$, the distance in cache between the beginning of the intervals of X and A is $|r + \lambda\delta|$. So, the overlapping (expressed in cache locations) is equal to $(B - |r + \lambda\delta|)^+$, where $(x)^+ = \max(x, 0)$.

For $\lambda \in [-B', -1]$, the overlapping is equal to $(B + r + \lambda\delta)^+ = B + r + \lambda\delta$, and for $\lambda \in [0, B' - 1]$, it is equal to $(B - r - \lambda\delta)^+ = B - r - \lambda\delta$. For $\lambda = -(B' + 1)$, the overlapping is equal to $(B + r - (B' + 1)\delta)^+ = (b + r - \delta)^+$, and for $\lambda = B'$, the overlapping is equal to $(B - r - B'\delta)^+ = (b - r)^+$. For any other value of $\lambda$ such that $-C_S/2 \le r + \lambda\delta < C_S/2$, the overlapping is equal to 0. Consequently, for one period of $C_S/\delta$ iterations, the number of cache locations that overlap is equal to

$$(b + r - \delta)^+ + (b - r)^+ + \sum_{\lambda=0}^{B'-1} (B - r - \lambda\delta) + \sum_{\lambda=-B'}^{-1} (B + r + \lambda\delta),$$

and since

$$\sum_{\lambda=0}^{B'-1} (B - r - \lambda\delta) + \sum_{\lambda=-B'}^{-1} (B + r + \lambda\delta) = -\delta B'^2 + 2BB' = \frac{B^2 - b^2}{\delta},$$

the total number of temporal interferences of X due to A, expressed in cache lines, is given by

$$m_t(X, A) = \frac{N}{B}\,\frac{N\delta}{C_S L_S}\left[(b + r - \delta)^+ + (b - r)^+ + \frac{B^2 - b^2}{\delta}\right].$$

An intuitive representation of such interferences is given in Figure 2(a) (the intervals of A which do not interfere with X are not represented).
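The closed form for one period can be checked against a direct summation of the overlaps $(B - |r + \lambda\delta|)^+$. The following Python sketch does this under the assumptions used in the text ($0 \le r < \delta$, $\delta$ divides $C_S$, $B \le C_S/2$); the function names are hypothetical, and `m_t_XA` applies the prefactor $(N/B)(N\delta/(C_S L_S))$ as reconstructed in this section.

```python
def overlap_brute_force(B, r, delta, C_S):
    """Sum of (B - |r + lam*delta|)^+ over the C_S/delta values of lam in interval I."""
    # Smallest lam with r + lam*delta >= -C_S/2 (uses -floor(x) == ceil(-x)).
    lam_min = -((C_S // 2 + r) // delta)
    total = 0
    for lam in range(lam_min, lam_min + C_S // delta):
        total += max(B - abs(r + lam * delta), 0)
    return total


def overlap_closed_form(B, r, delta):
    """(b + r - delta)^+ + (b - r)^+ + (B^2 - b^2)/delta with b = B mod delta."""
    b = B % delta
    return max(b + r - delta, 0) + max(b - r, 0) + (B * B - b * b) // delta


def m_t_XA(N, B, r, delta, C_S, L_S):
    """Temporal interferences of X due to A, in cache lines (sketch of the formula above)."""
    return (N / B) * (N * delta / (C_S * L_S)) * overlap_closed_form(B, r, delta)
```

For instance, with $\delta = 8$, $B = 20$ and $r = 3$, both functions give 49 overlapping cache locations per period.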

Average interferences. $m_t(X, A)$ can be averaged over all possible values of $r$, which may vary between 0 and $\delta - 1$. The average number of interferences is equal to

$$\frac{N}{B}\,\frac{N\delta}{C_S L_S}\,\frac{1}{\delta}\sum_{r=0}^{\delta-1}\left[(b + r - \delta)^+ + (b - r)^+ + \frac{B^2 - b^2}{\delta}\right] = \frac{N^2 B}{C_S L_S}.$$

2.2 Total Number of Cache Misses

In this section, the analytical expressions of the different sources of cache misses are presented. In theory, it is not possible, for one array, to simply add all the associated expressions because of possible redundancy between cross-interferences. However, these redundancies have been ignored because they proved to be negligible in most cases.

Array X. Because Y induces a negligible number of spatial interferences on array X, the term $m_s(X, Y)$ does not figure in the expression of $m(X)$. So, with

$$m(X) = m_i(X) + m_t(X, A) + m_s(X, A) + m_t(X, Y),$$

$$m_i(X) = \frac{N}{L_S}, \qquad m_t(X, A) = \frac{N^2 B}{C_S L_S}, \qquad m_s(X, A) = \frac{N^2 L_S}{C_S}\left(1 - \frac{1}{L_S}\right)^2,$$

we obtain

$$m(X) = \frac{N}{L_S} + \frac{N^2 B}{C_S L_S} + \frac{N^2 L_S}{C_S}\left(1 - \frac{1}{L_S}\right)^2 + m_t(X, Y).$$

[The closed form of $m_t(X, Y)$ is not recoverable from the source transcription.] The variations of $m(X)$ can be very important, essentially because of the variations of $m(X, A)$. The precision of the above estimate is illustrated in Figure 2(b).

Array Y. The expression of the total number of misses for Y is the following:

$$m(Y) = m_i(Y) + m_t(Y, Y) + \min\left(\frac{(2C_S - N)^+}{N}, 1\right)\bigl(m_t(Y, A) + m_t(Y, X)\bigr) + m_s(Y, A) + m_s(Y, X),$$

with

$$m_i(Y) = \frac{N}{L_S}, \qquad m_t(Y, Y) = \frac{N}{B}\,\frac{N - (N - 2(N - C_S)^+)^+}{L_S}, \qquad m_t(Y, A) = \frac{N^2}{B L_S}\min\left(1, \frac{2B}{\delta}\right).$$

[The expressions of $m_t(Y, X)$ and $m_s(Y, A) = m_s(Y, X)$, and the resulting closed form of $m(Y)$, are not recoverable from the source transcription.]

Array A. Because array A exhibits no temporal locality, the terms $m_t(A, X)$ and $m_t(A, Y)$ do not appear in the expression of $m(A)$. Besides, Y induces a negligible number of spatial misses on array A (the argument is the same as for array X), so the term $m_s(A, Y)$ has been removed as well. So, with

$$m(A) = m_i(A) + m_s(A, X), \qquad m_i(A) = \frac{N^2}{L_S}, \qquad m_s(A, X) = \frac{N^2 L_S}{C_S}\left(1 - \frac{1}{L_S}\right)^2,$$

we obtain

$$m(A) = \frac{N^2}{L_S} + \frac{N^2 L_S}{C_S}\left(1 - \frac{1}{L_S}\right)^2.$$

Blocked Matrix-Vector Multiply.
Regarding the whole primitive, the misses of each array are clearly cumulative; therefore the total number of misses $m$ is

$$m = m(X) + m(Y) + m(A).$$

Because the term $m_t(X, A)$ has a dominant impact on the total miss ratio, the total miss ratio is closely related to the miss ratio of X, as the comparison of Figure 3(a) with Figure 2(b) shows.

Fig. 3. (a) Total miss ratio of blocked matrix-vector multiply (r = 4), with the predicted curve, versus dimension N (M = N). (b) Influence of semiintrinsic misses on global performance: total miss ratio versus block size B ($L_S = 4$), one curve per residue 0, 1, 2, 3 modulo $L_S$.

2.3 Spatial Interferences

Temporal vs. Spatial Interferences. The main source of cache misses is temporal interferences on X due to A: $m_t(X) \simeq N^2 B/(C_S L_S)$. Similarly, for spatial interferences, $m_s(X) \simeq (N^2 L_S/C_S)(1 - 1/L_S)^2$. An upper bound for $m_s(X)$ is $N^2 L_S/C_S$. So, if $B$ is large enough, $m_s(X) \ll m_t(X)$, i.e., spatial interferences are negligible with respect to temporal interferences. Note that, unlike temporal interferences, spatial interferences are independent of $B$, and therefore they do not influence the choice of the optimal block size. As a consequence, spatial interferences will be ignored in the computations of Section 3.

Semiintrinsic Misses. In the nonblocked loop, the reference to A is $R_A = a_0 + j_2 + M j_1$ with $0 \le j_1 < N$ and $0 \le j_2 < N$, i.e., $N$ elements are accessed consecutively; then a stride of $M$ is applied (if $M = N$ all elements are consecutive). In the blocked loop, $R_A = a_0 + j_2 + M j_1 + B\,jj_2$, i.e., the stride of $M$ is applied much more frequently, every $B$ elements. If $L_S$ does not divide $B$, or if the block of $B$ elements is not aligned on a cache line, some elements of A are loaded that do not belong to this block of $B$ elements, i.e., useless elements. Since such elements will only be used after $N$ iterations of loop $j_1$ (i.e., they are unlikely to be kept in cache) or have already been used, they breed additional cache misses that can be termed semiintrinsic misses.
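The useless elements behind semiintrinsic misses can be counted with a few lines of Python. This is a minimal sketch, not the article's simulator: `start` stands for the position of the first element of a block of A, and both helper names are hypothetical.

```python
def lines_touched(start, B, L_S):
    """Number of cache lines covered by B consecutive elements beginning at `start`."""
    first_line = start // L_S
    last_line = (start + B - 1) // L_S
    return last_line - first_line + 1


def useless_elements(start, B, L_S):
    """Elements loaded with those lines that do not belong to the block itself."""
    return lines_touched(start, B, L_S) * L_S - B
```

With $L_S = 4$, an aligned block of 24 elements loads no useless element, an aligned block of 22 elements loads 2, and a block of 24 elements starting 2 elements past a line boundary covers 7 lines and loads 4.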

Fig. 4. Influence of $L_S$ on the relative importance of cache interferences (total miss ratio versus dimension N, M = N, for $L_S = 2$ and $L_S = 8$).

Even assuming $a_0 \bmod L_S = 0$ (the first element of A is aligned on a cache line), semiintrinsic misses occur if $B \bmod L_S \ne 0$ ($L_S$ does not divide $B$) and/or $M \bmod L_S \ne 0$ (a block is not always aligned on a cache line). As can be seen in Figure 3(b), the optimal performance of the blocked loop can only be reached if neither situation arises, i.e., if $L_S$ divides both $B$ and $M$. Also, the influence of $L_S$ on the number of interferences can be seen in Figure 4.

3. OPTIMAL BLOCK SIZE AND OPTIMAL GAIN

The benefit or gain of blocking for array $T$ is defined by $G(T) = m_n(T) - m_b(T)$ (where $m_n(T)$ is $m(T)$ for the nonblocked, i.e., standard, loop, and $m_b(T)$ is $m(T)$ for the blocked loop). $G$ is the total gain, i.e., $G = m_n - m_b$. For all the graphs in this section, the expression of the gain $g = m_n/m_b$ is preferred because it provides the relative instead of the absolute improvement of miss rates due to blocking. Still, $G(T)$ has been used in the computations for the sake of simplicity. Also, in the next sections the optimal block size is denoted $B_{opt}$.

In Section 3.1, the values of the optimal block size and the gain, as computed by state-of-the-art data locality optimizing algorithms, are provided. In Section 3.2, the average gain (and the associated optimal block size) derived from the expressions of Section 2 is computed. The threshold value of $N$ for which blocking is useful is computed in Section 3.3. The differences between accurate and average evaluation of interferences are highlighted in Section 3.4. In Figure 5 the curves corresponding to the different expressions of the gain are plotted. Each curve is explained in one of the following sections.

3.1 Theoretical Optimal Block Size and Theoretical Gain

To date, the most elaborate method for computing the optimal block size in any loop can be found in Eisenbeis et al. [1990], so we will start from that point. In Eisenbeis et al. [1990], for each reference, the set of data to be reused is called the reference window. The principle is to find a block size so that all windows fit in cache, and which minimizes the number of cache misses. In Eisenbeis et al. [1990], only capacity misses are considered.

Fig. 5. Optimal gain, predicted precise optimal gain, predicted average optimal gain, theoretical optimal gain (four panels plotting the gain $g = m_n/m_b$ versus N, M = N, with the curves Precise, Average, Theoretical, and the gain obtained with the theoretical block size).

Let us illustrate this process with blocked matrix-vector multiply. The reference window corresponding to array Y has a size of 1 cache line. For array X it is equal to $B/L_S$ cache lines. And there is no window for array A because it is not reused. No reuse is assumed to occur for arrays to which blocking is not applied, i.e., array Y. So the number of cache misses of array Y is equal to $(N/B)(N/L_S)$. The number of misses of array A is equal to $N^2/L_S$ (compulsory misses). Finally, since interferences are ignored, and the window of size $B$ is assumed to fit in cache, the number of misses of array X is equal to $(N/B)(B/L_S)$. Leaving aside the misses of A, which do not depend on $B$, the optimization problem is then the following:

$$\text{Minimize } m_b = \frac{N}{B}\,\frac{N}{L_S} + \frac{N}{B}\,\frac{B}{L_S} = \frac{N^2}{B L_S} + \frac{N}{L_S} \quad \text{subject to } B \le N,\ B \le C_S.$$

So, in this case, the problem is equivalent to maximizing $B$ under the constraints. If $N < C_S$, then $B_{opt} = N$, and if $N \ge C_S$, $B_{opt} = C_S$, i.e., $B_{opt} = \min(N, C_S)$.

In order to compute the gain, the number of cache misses for the nonblocked loop nest must be evaluated. In short, the number of capacity misses of X in the nonblocked loop nest is equal to $m_t(X, X) = N\,(N - (N - 2(N - C_S)^+)^+)/L_S$. So,

$$G = m_n - m_b = \frac{N\,(N - (N - 2(N - C_S)^+)^+)}{L_S} - \frac{N^2}{\min(N, C_S)\,L_S} + \frac{N}{L_S}.$$

In the remainder of the article, these values of the optimal block size and the optimal gain are termed the theoretical optimal block size and the theoretical optimal gain. In Figure 5, it can be seen that the gain obtained with the theoretical optimal block size is very low (lower than 1.2). Besides, the theoretical gain appears to be a strong misprediction of both the actual gain and even the gain obtained with the theoretical block size. The theoretical gain actually corresponds to what "should happen" if blocking behaved as predicted by the window model, i.e., if capacity misses were removed and there were no interference misses. Incidentally, the theoretical gain indicates the maximum gain that can be theoretically expected, i.e., the ideal gain. Let us compute this maximum gain. When $N > 2C_S$, keeping only the dominant terms,

$$g = \frac{m_n}{m_b} \simeq \frac{2N^2/L_S}{N^2/L_S + N^2/(C_S L_S)} = \frac{2C_S}{C_S + 1},$$

so $g \simeq 2$. The maximum gain that can be expected is 2 (i.e., blocking would divide the number of cache misses by 2): in the nonblocked loop, X exhibits at most $N^2/L_S$ cache misses, and A also exhibits $N^2/L_S$ compulsory misses, while in the blocked loop X ideally exhibits only $N/L_S$ compulsory misses in the best case, and A still exhibits $N^2/L_S$ cache misses.

3.2 Estimated Average Gain and Corresponding Optimal Block Size

3.2.1 Estimated Average Gain. For computing the average gain, the expressions of the average values of interferences are used. Such average expressions have been derived for both the blocked and the nonblocked loops. Because of paper length constraints, the details of computations have been omitted (see Fricker et al. [1993]). Three cases are distinguished: $N < C_S$, $C_S \le N < 2C_S$, and $2C_S \le N$.

[The expressions of $G = m_n - m_b$ given for these three cases are not recoverable from the source transcription.]

Because cache interferences are taken into account in the above average estimates and not in the theoretical expressions of Section 3.1, new terms appear, or existing terms are modified. For instance, in the first case ($N < C_S$), the main new term is $N^2 B/(C_S L_S)$, which corresponds to temporal interferences between A and X. Because this term is a function of $B$, it is going to affect the determination of the optimal block size. Indeed, when $N < C_S$, the expression of the theoretical number of misses of the blocked algorithm (see $m_b$ in Section 3.1) only contains one term which depends on $B$: $N^2/(B L_S)$. Consequently, this term is minimal when the block size is the largest possible; hence $B_{opt} = \min(N, C_S)$. Now, in the above average expression two terms depend on $B$: $N^2 B/(C_S L_S)$ and $(N^2/(B L_S)) \min(1, 2B/\delta)$, which respectively increases and decreases (or is constant) with $B$. Therefore, the optimal block size is either equal to a tradeoff value or to 1 (see the detailed computations in Section 3.2.2).

The curve Average in Figure 5 corresponds to the average optimal gain. It corresponds to the above expressions with $B = B_{opt}$ (except that $g$ is used instead of $G$). It is shown in Section 3.2.2 how to derive the expression of $B_{opt}$ in the different cases. As can be seen in Figure 5, the average optimal gain is usually close to the actual optimal gain. Still, when $N > C_S$, the actual gain is periodically slightly higher than the average gain, while the precise estimate of the gain correctly predicts such phenomena (see Figure 5). The main difference between precise and average estimates is that array base addresses are considered in the precise estimate. In Figure 5, the base addresses of arrays X and A have been chosen so that no intense interference phenomena related to array placement can occur ($r = 512$). But, in Section 3.4, it is shown that array base addresses can sometimes have a major influence on the number of interference misses, in which case the precision of the average estimate can be poor.
3.2.2 Estimated Optimal Block Size Based on the Average Gain. Let us first indicate the optimal block size expression obtained in each case and then provide the details of computations. When $N < C_S$, we obtain $B_{opt} = \sqrt{C_S}$ if $\delta < \sqrt{C_S}$ and $B_{opt} = 1$ if $\delta \ge \sqrt{C_S}$ (recall that $\delta = \gcd(M, C_S)$). With the theoretical expression of Section 3.1, we obtain that $B_{opt} = N$. When $N \ge C_S$, $B_{opt}$ is either equal to $\sqrt{C_S}$ or $\sqrt{2C_S(1 - C_S/N)}$, while the theoretical optimal block size is equal to $C_S$ in this case. Therefore, the theoretical expression of the optimal block size is generally a strong overestimate of the optimal block size, which is confirmed by Figure 5.

The theoretical expression of Section 3.1 implies that once an element of X is loaded into the cache, it will not be flushed. Therefore, the only constraint on $B$ is that it must fit in cache. That is why the number of misses of X ($N/L_S$) does not depend on $B$. On the other hand, the expressions computed in Sections 2 and 3.2.1 take into account the fact that the elements of X can be flushed by elements of A. Consequently, with respect to X, the block size should be selected as small as possible so that the elements of X can be reused before they can be flushed. Intuitively, it means the reuse distance should be small enough that the probability that an element of X is flushed before it can be reused is negligible. That is why the number of misses of X, $N^2 B/(C_S L_S)$, increases with $B$.

In the following paragraphs, it is shown how the expression of the optimal block size can be derived from the expression of the average gain. $G$ is now considered to be a function of $B$. It is differentiated with respect to $B$ so that its variations can be analyzed. The optimal value of $B$, i.e., $B_{opt}$, is the value that maximizes the gain. The computations are mostly detailed for the first case.

$N < C_S$. Two subcases must be distinguished: $2B/\delta < 1$ and $2B/\delta \ge 1$.

$B < \delta/2$. $\partial G/\partial B = -N^2/(C_S L_S)$, so $\partial G/\partial B < 0$ for this interval of values of $B$. Therefore the local maximum is reached when $B$ is minimum, i.e., $B_{opt1} = 1$. The corresponding value of the gain is $G_{max1} = G(B_{opt1})$.

$B \ge \delta/2$. $\partial G/\partial B = -N^2/(C_S L_S) + N^2/(B^2 L_S)$. Therefore, $\partial G/\partial B > 0$ if $B < \sqrt{C_S}$ and $\partial G/\partial B < 0$ otherwise. Thus, $G$ increases up to the value $B = \max(\sqrt{C_S}, \delta/2)$ and decreases afterward. So $B_{opt2} = \max(\sqrt{C_S}, \delta/2)$. The maximal value of the gain is $G_{max2} = G(B_{opt2})$.

The maximal gain for this interval of $N$ is the largest of the two gains, i.e., $G_{max} = \max(G_{max1}, G_{max2})$. These values must then be compared to find the global optimum $B_{opt}$ among $B_{opt1}$ and $B_{opt2}$.

If $\sqrt{C_S} < \delta/2$, then $G(1)$ and $G(\delta/2)$ should be compared. We obtain

$$G(1) - G(\delta/2) = \frac{N^2 \delta}{2 C_S L_S} - \frac{N^2}{C_S L_S}.$$

Thus $G(1) > G(\delta/2)$ if $\delta > 2$, which is assumed. Hence $B_{opt} = 1$.

If $\sqrt{C_S} \ge \delta/2$, then $G(1)$ and $G(\sqrt{C_S})$ should be compared, which gives

$$G(1) - G(\sqrt{C_S}) = \frac{N^2}{L_S}\left(\frac{2}{\sqrt{C_S}} - \frac{1}{C_S} - \frac{2}{\delta}\right).$$

Thus $G(1) > G(\sqrt{C_S})$ if $2/\sqrt{C_S} > 1/C_S + 2/\delta$, which is equivalent to $\delta > \sqrt{C_S}\,(1 - 1/(2\sqrt{C_S}))^{-1} \simeq \sqrt{C_S}$.

The optimal block size for this interval of $N$ is $B_{opt} = \sqrt{C_S}$ if $\delta < \sqrt{C_S}$ and $B_{opt} = 1$ otherwise.

$C_S \le N < 2C_S$. The same subcases must be distinguished. We obtain the following.

$B < \delta/2$. $B_{opt1} = \min(\sqrt{2(N - C_S)C_S/N}, \delta/2)$.

$B \ge \delta/2$. $B_{opt2} = \max(\sqrt{C_S}, \delta/2)$, and these two local maxima are then compared. Note that $\sqrt{2(N - C_S)C_S/N} < \sqrt{C_S}$; thus three cases must be distinguished, according to the respective positions of $\delta$ and the interval $[\sqrt{2(N - C_S)C_S/N}, \sqrt{C_S}]$.
Computations show that the optimal block size is $B_{opt} = \sqrt{C_S}$ if $\delta < h(N)$ and $B_{opt} = \sqrt{2(N - C_S) C_S / N}$ otherwise, where

h(N) = 2(N?C S )N p CS + 2C S?N N 2C S?N N 2N2 p Np 2 + N2 p? N 2(N?C S )C S ( CS CS N C N +) S

Note that $h(N) = \sqrt{C_S}$ if $N = C_S$, and $h(N) = 0$ if $N = 2 C_S$.

$2 C_S \le N$. Here there are no subcases, and $B_{opt} = \sqrt{C_S}$.
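The case analysis above can be summarized as a small selection routine. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `L_S` denotes the line size, and for $C_S \le N < 2 C_S$ both candidate block sizes are returned because the threshold function h(N) is not evaluated here.

```python
from math import gcd, sqrt


def theoretical_block_size(N, C_S):
    """Capacity-only optimum quoted in the text: N if N < C_S, else C_S."""
    return N if N < C_S else C_S


def estimated_block_size_candidates(N, M, C_S, L_S):
    """Candidate optimal block sizes according to the average-gain analysis.

    Returns a list with one value when the text gives a single answer, and
    two values for C_S <= N < 2*C_S, where the choice depends on h(N).
    """
    delta = gcd(M, C_S)
    root = sqrt(C_S)
    if N < C_S:
        # B_opt = sqrt(C_S) if delta < sqrt(C_S), and the line size otherwise.
        return [root] if delta < root else [float(L_S)]
    if N < 2 * C_S:
        # Either sqrt(C_S) or sqrt(2 (N - C_S) C_S / N); h(N) decides which.
        return [root, sqrt(2 * (N - C_S) * C_S / N)]
    # 2*C_S <= N: no subcases.
    return [root]


if __name__ == "__main__":
    C_S, L_S = 1024, 4  # L_S = 4 is an illustrative assumption
    for N in (512, 516, 1536, 4096):
        print(N, theoretical_block_size(N, C_S),
              estimated_block_size_candidates(N, N, C_S, L_S))
```

In every case the interference-aware candidates are far smaller than the capacity-only value, in line with the observation that the theoretical expression is a strong overestimate.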

Fig. 6. (a) Determining when blocking is useful: gain g = Mn / Mb as a function of N (M = N), predicted and theoretical curves. (b) Variations of interferences between X and A as a function of the relative cache distance r.

3.3 Threshold Value of N

In this section, the problem of determining the threshold value $N_{thr}$ of N for which blocking is profitable is briefly addressed. Basically, blocking is useful if $G > 0$. According to the theoretical expression of Section 3.1, $G > 0$ if $N > C_S$, i.e., $N_{thr} = C_S$. Now, according to the expressions of Section 3.2.1, if $N < C_S$, $G > 0$ as soon as N is approximately greater than $2\sqrt{C_S}$. In fact, the theoretical expression only considers capacity misses, so that blocking can only become profitable if X is larger than the cache, i.e., if $N > C_S$. However, cache interferences can occur even when capacity misses do not yet occur. As explained in Section 3.2.2, blocking has the effect of reducing the reuse distance of X, so it can be useful solely for minimizing interferences. That is why $N_{thr}$ is actually much smaller than $C_S$. This observation is confirmed by Figure 6(a) ($2\sqrt{C_S} = 64$ for $C_S = 1024$).

3.4 Estimated Precise Gain and Corresponding Optimal Block Size

In Section 2 it has been shown how to obtain a precise evaluation of cross-interferences between two arrays. The precise gain is simply obtained by cumulating such expressions for all pairs of arrays, as was done for the average gain in Section 3.2.1. Because of paper length constraints, the full expressions are not provided here. Instead, the differences between the average and precise gains are highlighted. These differences occur when $\delta = \gcd(M, C_S)$ is large. The number of possible cache positions of the blocks of size B of A is equal to $C_S/\delta$. So, if $\delta$ is large, there are few such positions. Consequently, the corresponding cache locations are heavily referenced. With respect to X, A appears to be distributed into a few cache intervals separated by holes (see Figures 2(a) and 6(b)).
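The quantities of Sections 3.3 and 3.4 are easy to evaluate numerically. The Python sketch below is illustrative only: the function names are hypothetical, the gcd parameter is called `delta` by assumption, block offsets are taken relative to the first block of A, and blocks are treated as half-open intervals, which is an assumption about how boundary cases are counted.

```python
from math import gcd, sqrt


def blocking_thresholds(C_S):
    """Threshold on N above which blocking pays off, per the two models."""
    return {"theoretical": C_S, "with_interferences": 2 * sqrt(C_S)}


def a_block_offsets(M, C_S):
    """gcd(M, C_S) and the distinct cache offsets of A blocks placed every M elements.

    The offsets j*M mod C_S are exactly the multiples of gcd(M, C_S),
    so there are C_S / gcd(M, C_S) of them.
    """
    delta = gcd(M, C_S)
    return delta, [k * delta for k in range(C_S // delta)]


def x_block_overlaps_a(B, r, delta):
    """True if a block of X of size B at relative cache distance r intersects
    one of the size-B intervals of A, which repeat every delta cache locations."""
    r = r % delta
    return r < B or r + B > delta


if __name__ == "__main__":
    C_S = 1024
    print(blocking_thresholds(C_S))
    for M in (512, 516):
        delta, offsets = a_block_offsets(M, C_S)
        print("M =", M, "delta =", delta, "positions =", len(offsets))
    for B in (4, 8, 16):
        print("B =", B, "overlap:", x_block_overlaps_a(B, 8, 512))
```

With few positions (large gcd), the holes between the intervals of A are wide, so whether a block of X lands in a hole depends directly on r, that is, on the array base addresses.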
Therefore, if array X overlaps with one such interval, interferences between A and X are very intense. Overlapping occurs if the relative cache distance r between A and X is smaller than B (see Figure 6(b), case $r < B$). Overlapping does not occur if $B < r$ and $B + r < \delta$. So array base addresses play a significant role when $\delta$ is large. This is illustrated in Figure 7, where $\delta = 512$ ($N = 512$), $r = 8$, and B is varied. The actual optimal value of B is equal to 8, which is correctly predicted by the precise estimate. On the other hand, although the average estimate correctly predicts the optimal gain that can be expected, it is independent of r, and therefore it fails to predict for which values of B interferences will occur. For a similar value of N ($N = 516$), $\delta = 4$, and performance variations are independent of r. Consequently, the average estimate remains precise for all values of B. Thus, the average estimate is poor when $\delta$ is large because array base addresses are not considered, whereas the precise estimate successfully predicts performance variations.

Fig. 7. Influence of array base addresses on the optimal block size ($\delta = 512$ and $\delta = 4$). Both panels plot the gain g = Mn / Mb against the block size B for the precise and average estimates.

Note that this particularity of external cross-interferences can be exploited. If there are holes between two intervals of A, then B should be selected small enough that an interval of X fits into one such hole. In this case, no cross-interference occurs between A and X. This cannot be achieved when $\delta$ is too small (smaller than the line size $L_S$) because the optimal block size needs to be at least equal to $L_S$ in order to exploit spatial locality. However, if $\delta$ is relatively small, it is not obvious that selecting a very small B will yield important benefits considering the tradeoffs imposed by array Y (see Section 3.2.1). Another solution is to adjust the relative base address r (by adjusting the base address of A or X) so that $B < r$. If this is not possible because of a potential negative influence on other loops in the code, another solution is simply to copy array X into another array with a suitable base address.

4. APPLICATIONS

The techniques presented at the beginning of this article consist in precisely evaluating the number of external cross-interferences between two arrays by first examining their relative cache positions and then computing the number of overlapping array elements. These techniques have been applied to the three different pairs of arrays that can be found in matrix-vector multiply and which exhibit different patterns of relative cache positions. As long as the loop nest depth is small, these techniques can be extended to any pair of arrays. If the relative position depends on many indices, the same techniques can be used for the innermost index or indices, and an average estimate can be used for the outer indices (this was done for the pair X, Y).

Accurate evaluation of cache interferences is important for checking whether restructuring techniques induce negative side-effects that degrade potential benefits. It clearly appears in this article that blocking is a delicate tradeoff which depends on loop and array parameters. Including such techniques in a compiler has not yet been achieved, but it is a possible follow-up to this study. A first implementation could be limited to average estimates, which can be derived relatively easily. Precise estimates are more difficult to implement, but a first solution is to only detect that precise estimates are needed by identifying high-risk cases. For instance, if the relative position of a pair of arrays only depends on one index with coefficient M, the test would be simply to check the value of the parameter $\delta = \gcd(M, C_S)$. If $\delta$ is large, intense interferences could occur depending on the array base addresses, and a conservative (but nonoptimal) attitude would be to select a block size equal to the line size $L_S$ in these cases.

A more immediate application of this model is the development of a linear algebra library finely tuned for caches. Though it is not possible to act on array base addresses, block size adjustment and copying provide sufficient flexibility to exploit fully the different cases detailed in this article.

5. CONCLUSIONS

Several conclusions can be drawn from this analysis of matrix-vector multiply. First, accurately evaluating external cross-interference misses and deriving an analytical expression of the number of such cache misses are tractable tasks. Second, the optimal block size, as computed by current data locality optimizing algorithms, is highly inaccurate, because only capacity misses are considered. If interference misses are taken into account in the optimization problem, the solution then becomes an accurate evaluation of the optimal block size.
Third, an average estimate of external cross-interferences is frequently but not always sufficient because, in some cases, array base addresses can strongly influence the occurrence and intensity of cache interferences.

REFERENCES

Eisenbeis, C., Jalby, W., Windheiser, D., and Bodin, F. 1990. A strategy for array management in local memory. In Proceedings of the 3rd Workshop on Programming Languages and Compilers for Parallel Computing. Irvine, California.
Esseghir, K. 1993. Improving data locality for caches. M.S. thesis, Univ. of Texas, Houston, Tex.
Ferrante, J., Sarkar, V., and Thrash, W. 1991. On estimating and enhancing cache effectiveness. In Proceedings of the 4th Workshop on Languages and Compilers for Parallel Computing. Santa Clara, California.
Fricker, C., Temam, O., and Jalby, W. 1993. Accurate evaluation of blocked algorithms cache interferences. Tech. rep., Leiden Univ., Leiden, The Netherlands. Mar.
Kane, G. and Heinrich, J. 1992. MIPS RISC Architecture. Prentice-Hall, Englewood Cliffs, N.J.
Lam, M., Rothberg, E. E., and Wolf, M. E. 1991. The cache performance of blocked algorithms. In 4th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, 63-74.
McKinley, K. S. 1992. Automatic and interactive parallelization. Ph.D. thesis, Tech. Rep. CRPC-TR9224, Rice Univ., Houston, Tex.
Porterfield, A. K. 1989. Software Methods for Improvement of Cache Performance on Supercomputer Applications. Ph.D. thesis, Tech. Rep. CRPC-TR89-93, Rice Univ., Houston, Tex.
Sites, R. L. 1992. Alpha Architecture Reference Manual. Digital Press, Bedford, Mass.
Temam, O., Fricker, C., and Jalby, W. 1993. Impact of cache interferences on usual numerical dense loop nests. In Proc. IEEE, special issue on Computer Performance Evaluation.
Temam, O., Fricker, C., and Jalby, W. 1994. Cache interference phenomena. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (Nashville, Tenn.). ACM, New York.
Wolf, M. and Lam, M. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation. SIGPLAN Not. 26, 6, 30-44.

Received January 1994; revised June 1994; accepted February 1995