

IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 2, FEBRUARY 1999

Automatic Compiler-Inserted Prefetching for Pointer-Based Applications

Chi-Keung Luk and Todd C. Mowry

C.-K. Luk is with the Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada. luk@eecg.toronto.edu. T. C. Mowry is with the Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. tcm@cs.cmu.edu.

Abstract: As the disparity between processor and memory speeds continues to grow, memory latency is becoming an increasingly important performance bottleneck. While software-controlled prefetching is an attractive technique for tolerating this latency, its success has been limited thus far to array-based numeric codes. In this paper, we expand the scope of automatic compiler-inserted prefetching to also include the recursive data structures commonly found in pointer-based applications. We propose three compiler-based prefetching schemes, and automate the most widely applicable scheme (greedy prefetching) in an optimizing research compiler. Our experimental results demonstrate that compiler-inserted prefetching can offer significant performance gains on both uniprocessors and large-scale shared-memory multiprocessors.

Keywords: Caches, prefetching, pointer-based applications, recursive data structures, compiler optimization, shared-memory multiprocessors, performance evaluation.

I. Introduction

SOFTWARE-controlled data prefetching [1], [2] offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. In recognition of this potential, a number of recent processors have added support for prefetch instructions [3], [4], [5]. While prefetching has enjoyed considerable success in array-based numeric codes [6], its potential in pointer-based applications has remained largely unexplored. This paper investigates compiler-inserted prefetching for pointer-based applications, in particular those containing recursive data structures.

Recursive Data Structures (RDSs) include familiar objects such as linked lists, trees, graphs, etc., where individual nodes are dynamically allocated from the heap, and nodes are linked together through pointers to form the overall structure. For our purposes, "recursive data structures" can be broadly interpreted to include most pointer-linked data structures (e.g., mutually-recursive data structures, or even a graph of heterogeneous objects). From a memory performance perspective, these pointer-based data structures are expected to be an important concern for the following reasons. For an application to suffer a large memory penalty due to data replacement misses, it typically must have a large data set relative to the cache size. Aside from multi-dimensional arrays, recursive data structures are one of the most common and convenient methods of building large data structures (e.g., B-trees in database applications, octrees in graphics applications, etc.). As we traverse a large RDS, we may potentially visit enough intervening nodes to displace a given node from the cache before it is revisited; hence temporal locality may be poor. Finally, in contrast with arrays, where consecutive elements are at contiguous addresses, there is little inherent spatial locality between consecutively-accessed nodes in an RDS, since they are dynamically allocated at arbitrary addresses.

To cope with the latency of accessing these pointer-based data structures, we propose three compiler-based schemes for prefetching RDSs, as described in Section II.
We implemented the most widely applicable of these schemes, greedy prefetching, in a modern research compiler (SUIF [7]), as discussed in Section III. To evaluate our schemes, we performed detailed simulations of their impact on both uniprocessor and multiprocessor systems in Sections IV and V, respectively. Finally, we present related work and conclusions in Sections VI and VII.

II. Software-Controlled Prefetching for RDSs

A key challenge in successfully prefetching RDSs is scheduling the prefetches sufficiently far in advance to fully hide the latency, while introducing minimal runtime overhead. In contrast with array-based codes, where the prefetching distance can be easily controlled using software pipelining [2], the fundamental difficulty with RDSs is that we must first dereference pointers to compute the prefetch addresses. Getting several nodes ahead in an RDS traversal typically involves following a pointer chain. However, the very act of touching these intermediate nodes along the pointer chain means that we cannot tolerate the latency of fetching more than one node ahead. To overcome this pointer-chasing problem [8], we propose three schemes for generating prefetch addresses without following the entire pointer chain. The first two schemes, greedy prefetching and history-pointer prefetching, use a pointer within the current node as the prefetching address; the difference is that greedy prefetching uses existing pointers, whereas history-pointer prefetching creates new pointers. The third scheme, data-linearization prefetching, generates prefetch addresses without pointer dereferences.
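To make the pointer-chasing problem concrete, here is a minimal sketch (not from the paper) contrasting the two cases; it assumes a GCC-style __builtin_prefetch() intrinsic and hypothetical function names. In the array loop the address d elements ahead is directly computable, while in the list loop reaching the node d hops ahead means dereferencing every intermediate node:

#include <stddef.h>

typedef struct node { int data; struct node *next; } node;

/* Array case: &a[i+d] is computable without touching any intervening
   elements, so prefetches are easily scheduled via software pipelining. */
long sum_array(const int *a, size_t n, size_t d) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + d < n)
            __builtin_prefetch(&a[i + d]);  /* address known in advance */
        sum += a[i];
    }
    return sum;
}

/* List case: to prefetch the node d hops ahead we must chase d pointers,
   loading the very nodes whose miss latency we are trying to hide. */
long sum_list(const node *l, int d) {
    long sum = 0;
    for (const node *p = l; p != NULL; p = p->next) {
        const node *q = p;
        for (int i = 0; i < d && q != NULL; i++)
            q = q->next;                    /* each hop may itself miss */
        if (q != NULL)
            __builtin_prefetch(q);
        sum += p->data;
    }
    return sum;
}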

A. Greedy Prefetching

In a k-ary RDS, each node contains k pointers to other nodes. Greedy prefetching exploits the fact that when k > 1, only one of these k neighbors can be immediately followed as the next node in the traversal, but there is often a good chance that other neighbors will be visited sometime in the future. Therefore, by prefetching all k pointers when a node is first visited, we hope that enough of these prefetches are successful that we can hide at least some fraction of the miss latency.

  preorder(treenode *t) {
    if (t != NULL) {
      prefetch(t->left);
      prefetch(t->right);
      process(t->data);
      preorder(t->left);
      preorder(t->right);
    }
  }

(a) Code with Greedy Prefetching. (b) Cache Miss Behavior: each node is either a full cache miss, a partial-latency miss, or a cache hit.

Fig. 1. Illustration of greedy prefetching.

To illustrate how greedy prefetching works, consider the pre-order traversal of a binary tree (i.e., k = 2), where Figure 1(a) shows the code with greedy prefetching added. Assuming that the computation in process() takes half as long as the cache miss latency L, we would want to prefetch two nodes ahead to fully hide the latency. Figure 1(b) shows the caching behavior of each node. We obviously suffer a full cache miss at the root node (node 1), since there was no opportunity to fetch it ahead of time. However, we would only suffer half of the miss penalty (L/2) when we visit node 2, and no miss penalty when we eventually visit node 3 (since the time to visit the subtree rooted at node 2 is greater than L). In this example, the latency is fully hidden for roughly half of the nodes, and reduced by 50% for the other half (minus the root node).

Greedy prefetching offers the following advantages: (i) it has low runtime overhead, since no additional storage or computation is needed to construct the prefetch pointers; (ii) it is applicable to a wide variety of RDSs, regardless of how they are accessed or whether their structure is modified frequently; and (iii) it is relatively straightforward to implement in a compiler; in fact, we have implemented it in the SUIF compiler, as we describe later in Section III. The main disadvantage of greedy prefetching is that it does not offer precise control over the prefetching distance, which is the motivation for our next algorithm.

B. History-Pointer Prefetching

Rather than relying on existing pointers to approximate prefetch addresses, we can potentially synthesize more accurate pointers based on the observed RDS traversal patterns. To prefetch d nodes ahead under the history-pointer prefetching scheme [8], we add a new pointer (called a history-pointer) to a node n_i to record the observed address of n_{i+d} (the node visited d nodes after n_i) on a recent traversal of the RDS. On subsequent traversals of the RDS, we prefetch the nodes pointed to by these history-pointers. This scheme is most effective when the traversal pattern does not change rapidly over time.

To construct the history-pointers, we maintain a FIFO queue of length d which contains pointers to the last d nodes that have just been visited. When we visit a new node n_i, the oldest node in the queue will be n_{i-d} (i.e., the node visited d nodes earlier), and hence we update the history-pointer of n_{i-d} to point to n_i. After the first complete traversal of the RDS, all of the history-pointers will be set. In contrast with greedy prefetching, history-pointer prefetching offers no improvement on the first traversal of an RDS, but can potentially hide all of the latency on subsequent traversals.

While history-pointer prefetching offers the potential advantage of improved latency tolerance, this comes at the expense of (i) execution overhead to construct the history-pointers, and (ii) space overhead for storing these new pointers. To minimize execution overhead, we can potentially update the history-pointers less frequently, depending on how rapidly the RDS structure changes. At one extreme, if the RDS never changes, we can set the history-pointers just once. The problem with space overhead is that it potentially worsens the caching behavior. The desire to eliminate this space overhead altogether is the motivation for our next prefetching scheme.
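A minimal sketch of this mechanism on a singly-linked list (the type, field, and helper names here are assumptions for illustration, not the paper's implementation; __builtin_prefetch() stands in for a prefetch instruction): each node carries an extra history field, and the traversal keeps the last d visited nodes in a small circular buffer so that the node visited d steps ago can be pointed at the current node.

#define PREFETCH_DIST 4                  /* d: how many nodes ahead to prefetch */

extern void process(int data);           /* assumed per-node work */

typedef struct lnode {
    int data;
    struct lnode *next;
    struct lnode *history;               /* points to the node visited d later */
} lnode;

void traverse(lnode *head) {
    lnode *queue[PREFETCH_DIST] = { 0 }; /* FIFO of the last d visited nodes */
    int idx = 0;

    for (lnode *p = head; p != NULL; p = p->next) {
        if (p->history != NULL)
            __builtin_prefetch(p->history);  /* set on an earlier traversal */

        /* The slot we are about to overwrite holds n_{i-d}, the node visited
           d steps ago; record the current node as its history target. */
        if (queue[idx] != NULL)
            queue[idx]->history = p;
        queue[idx] = p;
        idx = (idx + 1) % PREFETCH_DIST;

        process(p->data);
    }
}

On the first traversal the history fields are still NULL, so no prefetches are issued; later traversals prefetch d nodes ahead as long as the list has not been restructured in the meantime.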
C. Data-Linearization Prefetching

The idea behind data-linearization prefetching [8] is to map heap-allocated nodes that are likely to be accessed close together in time into contiguous memory locations. With this mapping, one can easily generate prefetch addresses and launch them early enough. Another advantage of this scheme is that it improves spatial locality. The major challenge, however, is how and when we can generate this data layout. In theory, one could dynamically remap the data even after the RDS has been initially constructed, but doing so may result in large runtime overheads and may also violate program semantics. Instead, the easiest time to map the nodes is at creation time, which is appropriate if either the creation order already matches the traversal order, or if it can be safely reordered to do so. Since dynamic remapping is expensive (or impossible), this scheme obviously works best if the structure of the RDS changes only slowly (or not at all). If the RDS does change radically, the program will still behave correctly, but prefetching will not improve performance.
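A minimal sketch of the idea (assumed names, not the paper's implementation): if tree nodes are allocated from a contiguous pool in the order in which they will later be traversed, then the node d positions ahead in the traversal lies at a fixed byte offset from the current node, so its prefetch address requires no pointer dereference.

#include <stddef.h>

#define POOL_NODES 4096
#define PF_DIST    4                       /* prefetch d nodes ahead */

extern void process(int data);             /* assumed per-node work */

typedef struct tnode { int data; struct tnode *left, *right; } tnode;

static tnode pool[POOL_NODES];             /* nodes laid out contiguously... */
static int   next_free = 0;                /* ...in creation (= traversal) order */

tnode *alloc_node(void) {
    return (next_free < POOL_NODES) ? &pool[next_free++] : NULL;
}

void preorder(tnode *t) {
    if (t == NULL) return;
    /* Consecutively traversed nodes are contiguous, so the node PF_DIST
       positions ahead lives at a fixed offset from t; no pointers are chased.
       Prefetching past the end of the pool is harmless on most machines,
       since prefetches are only hints and typically do not fault. */
    __builtin_prefetch((char *)t + PF_DIST * sizeof(tnode));
    process(t->data);
    preorder(t->left);
    preorder(t->right);
}

This only pays off when the creation order (roughly) matches the traversal order, which is exactly the condition stated above.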

III. Implementation of Greedy Prefetching

Of the three schemes that we propose, greedy prefetching is perhaps the most widely applicable since it does not rely on traversal history information, and it requires no additional storage or computation to construct prefetch addresses. For these reasons, we have implemented a version of greedy prefetching within the SUIF compiler [7], and we will simulate the other two algorithms by hand. Our implementation consists of an analysis phase to recognize RDS accesses, and a scheduling phase to insert prefetches.

A. Analysis: Recognizing RDS Accesses

To recognize RDS accesses, the compiler uses both type declaration information to recognize which data objects are RDSs, and control structure information to recognize when these objects are being traversed. An RDS type is a record type r containing at least one pointer that points either directly or indirectly to a record type s. (Note that r and s are not restricted to be the same type, since RDSs may be comprised of heterogeneous nodes.) For example, the type declarations in Figure 2(a) and Figure 2(b) would be recognized as RDS types, whereas Figure 2(c) would not.

  (a) RDS type:        struct T { int data; struct T *left; struct T *right; };
  (b) RDS type:        struct A { int i; struct B **kids[8]; };
  (c) Not an RDS type: struct C { int j; double f; };

Fig. 2. Examples of which types are recognized as RDS types.

After discovering data structures with the appropriate types, the compiler then looks for control structures that are used to traverse the RDSs. In particular, the compiler looks for loops or recursive procedure calls such that during each new loop iteration or procedure invocation, a pointer p to an RDS is assigned a value resulting from a dereference of p; we refer to this as a recurrent pointer update. This heuristic corresponds to how RDS codes are typically written. To detect recurrent pointer updates, the compiler propagates pointer values using a simplified (but less precise) version of earlier pointer analysis algorithms [9], [10].

Figure 3 shows some example program fragments that our compiler treats as RDS accesses. In Figure 3(a), l is updated to l->next->next inside the while-loop. In Figure 3(b), n is assigned the result of the function call f(n) inside the for-loop. (Since our implementation does not perform interprocedural analysis, it assumes that f(n) results in a value of the form n->...->next.) In Figure 3(c), two dereferences of the function argument t are passed as the parameters to two recursive calls. Figure 3(d) is similar to Figure 3(c), except that a record (rather than a pointer) is passed as the function argument.

  (a) while (l) { list *m; m = l->next; l = m->next; }
  (b) for (...) { list *n; n = f(n); }
  (c) g(tree *t) { ... g(t->left); ... g(t->right); ... }
  (d) k(tree tn) { ... k(*(tn.left)); ... k(*(tn.right)); ... }

Fig. 3. Examples of control structures recognized as RDS traversals.

Ideally, the next step would be to analyze data locality across RDS nodes to eliminate unnecessary prefetches. Although we have not automated this step in our compiler, we evaluated its potential benefits in an earlier study [8].

B. Scheduling Prefetches

Once RDS accesses have been recognized, the compiler inserts greedy prefetches as follows. At the point where an RDS object is being traversed (i.e., where the recurrent pointer update occurs), the compiler inserts prefetches of all pointers within this object that point to RDS-type objects, at the earliest points where these addresses are available within the surrounding loop or procedure body. The availability of prefetch addresses is computed by propagating the earliest generation points of pointer values along with the values themselves. Two examples of greedy prefetch scheduling are shown in Figure 4. Further details of our implementation can be found in Luk's thesis [11].

  (a) Loop:
      while (l) { work(l->data); l = l->next; }
        ==>  while (l) { prefetch(l->next); work(l->data); l = l->next; }

  (b) Procedure:
      f(tree *t) {
        tree *q;
        if (test(t->data)) q = t->left; else q = t->right;
        if (q != NULL) f(q);
      }
        ==>
      f(tree *t) {
        tree *q;
        prefetch(t->left); prefetch(t->right);
        if (test(t->data)) q = t->left; else q = t->right;
        if (q != NULL) f(q);
      }

Fig. 4. Examples of greedy prefetch scheduling.
IV. Prefetching RDSs on Uniprocessors

In this section, we quantify the impact of our prefetching schemes on uniprocessor performance. Later, in Section V, we will turn our attention to multiprocessor systems.

A. Experimental Framework

We performed detailed cycle-by-cycle simulations of the entire Olden benchmark suite [12] on a dynamically-scheduled, superscalar processor similar to the MIPS R10000 [5]. The Olden benchmark suite contains ten pointer-based applications written in C, which are briefly summarized in Table I. The rightmost column in Table I shows the amount of memory dynamically allocated to RDS nodes.

TABLE I: Benchmark characteristics.

  Benchmark  | Recursive Data Structures Used          | Input Data Set                         | Memory Allocated
  BH         | Heterogeneous octree                    | 4K bodies                              | 721 KB
  Bisort     | Binary tree                             | 250,000 integers                       | 1,535 KB
  EM3D       | Singly-linked lists                     | 2,000 H-nodes, 100 E-nodes, 75% local  | 1,671 KB
  Health     | Four-way tree and doubly-linked lists   | levels = 5, time = 500                 | 925 KB
  MST        | Array of singly-linked lists            | 512 nodes                              | 1 KB
  Perimeter  | A quadtree                              | 4K x 4K image                          | 6,445 KB
  Power      | Multi-way tree and singly-linked lists  | 10,000 customers                       | 418 KB
  TreeAdd    | Binary tree                             | 1,024K nodes                           | 12,288 KB
  TSP        | Binary tree and doubly-linked lists     | 100,000 cities                         | 5,120 KB
  Voronoi    | Binary tree                             | 20,000 points                          | 1,915 KB

Our simulation model varies slightly from the actual MIPS R10000 (e.g., we model two memory units, and we assume that all functional units are fully pipelined), but we do model the rich details of the processor, including the pipeline, register renaming, the reorder buffer, branch prediction, instruction fetching, branching penalties, the memory hierarchy (including contention), etc. Table II shows the parameters of our model.

TABLE II: Uniprocessor simulation parameters.

  Pipeline Parameters
    Issue Width: 4
    Functional Units: 2 Int, 2 FP, 2 Memory, 1 Branch
    Reorder Buffer Size: 32
    Integer Multiply: 12 cycles
    Integer Divide: 76 cycles
    All Other Integer: 1 cycle
    FP Divide: 15 cycles
    FP Square Root: 20 cycles
    All Other FP: 2 cycles
    Branch Prediction Scheme: 2-bit Counters

  Memory Parameters
    Primary Instr and Data Caches: 16KB, 2-way set-associative
    Unified Secondary Cache: 512KB, 2-way set-associative
    Line Size: 32B
    Primary-to-Secondary Miss Latency: 12 cycles
    Primary-to-Memory Miss Latency: 75 cycles
    Data Cache Miss Handlers: 8
    Data Cache Banks: 2
    Data Cache Fill Time (Requires Exclusive Access): 4 cycles
    Main Memory Bandwidth: 1 access per 20 cycles

We use pixie [13] to instrument the optimized MIPS object files produced by the compiler, and pipe the resulting trace into our simulator. To avoid misses during the initialization of dynamically-allocated objects, we used a modified version of the IRIX mallopt routine [14] whereby we prefetch allocated objects before they are initialized. Determining these prefetch addresses is straightforward, since objects of the same size are typically allocated from contiguous memory. This optimization alone led to over twofold speedups relative to using malloc for the majority of the applications, particularly those that frequently allocate small objects.
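The paper does not give code for this allocator change; the sketch below (assumed names, built on a simple fixed-size bump allocator rather than the actual IRIX mallopt) conveys the idea: because same-sized objects are carved out of a contiguous chunk, the addresses that the next few allocations will return are known now and can be prefetched before the program initializes them.

#include <stdlib.h>

#define ALLOC_PF_AHEAD 2                    /* prefetch the next few objects */

/* Simplified fixed-size bump allocator standing in for the modified
   allocator described above (no free list, no mixing of sizes). */
static char  *chunk      = NULL;
static size_t chunk_left = 0;

void *alloc_prefetched(size_t size) {
    if (chunk_left < size) {                /* grab a fresh contiguous chunk */
        size_t chunk_size = 64 * 1024;
        chunk = malloc(chunk_size);
        if (chunk == NULL) return NULL;
        chunk_left = chunk_size;
    }
    void *obj = chunk;
    chunk      += size;
    chunk_left -= size;

    /* Same-sized objects are contiguous, so the next allocations' addresses
       are already known; prefetch them now so they are (mostly) resident by
       the time the program allocates and initializes them. */
    for (int i = 0; i < ALLOC_PF_AHEAD; i++)
        __builtin_prefetch(chunk + i * size);

    return obj;
}

A production allocator would also prefetch across chunk refills and handle multiple size classes; the point here is only that the prefetch addresses fall out of the contiguous layout.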
B. Performance of Greedy Prefetching

Figure 5 shows the results of our uniprocessor experiments. The overall performance improvement offered by greedy prefetching is shown in Figure 5(a), where the two bars correspond to the cases without prefetching (N) and with greedy prefetching (G). These bars represent execution time normalized to the case without prefetching, and they are broken down into four categories explaining what happened during all potential graduation slots. (The number of graduation slots is the issue width, 4 in this case, multiplied by the number of cycles.) The bottom section (busy) is the number of slots when instructions actually graduate, the top two sections are any non-graduating slots that are immediately caused by the oldest instruction suffering either a load or store miss, and the inst stall section is all other slots where instructions do not graduate. Note that the load stall and store stall sections are only a first-order approximation of the performance loss due to cache misses, since these delays also exacerbate subsequent data dependence stalls.

Fig. 5. Performance impact of compiler-inserted greedy prefetching on a uniprocessor: (a) normalized execution time for each benchmark (N = no prefetching, G = greedy prefetching), broken down into load stall, store stall, inst stall, and busy; (b) coverage factor, i.e., the percentage of original load D-cache misses that are p_hit, p_miss, or nop_miss; (c) unnecessary prefetches, i.e., the percentage of prefetches that hit in the D-cache.

As we see in Figure 5(a), half of the applications enjoy a speedup ranging from 4% to 45%, and the other half are within 2% of their original performance. For the applications with the largest memory stall penalties, i.e., health, perimeter, and treeadd, much of this stall time has been eliminated. In the cases of bisort and mst, prefetching overhead more than offset the reduction in memory stalls (thus resulting in a slight performance degradation), but this was not a problem in the other eight applications.

To understand the performance results in greater depth, Figure 5(b) breaks down the original primary cache misses into three categories: (i) those that are prefetched and subsequently hit in the primary cache (p_hit), (ii) those that are prefetched but remain primary misses (p_miss), and (iii) those that are not prefetched (nop_miss). The sum of the p_hit and p_miss cases is also known as the coverage factor, which ideally should be 100%. For em3d, power, and voronoi, the coverage factor is quite low (under 20%) because most of their misses are caused by array or scalar references; hence prefetching RDSs yields little improvement. In all other cases, the coverage factor is above 60%, and in four cases we achieve nearly perfect coverage. If the p_miss category is large, this indicates that prefetches were not scheduled effectively: either they were issued too late to hide the latency, or else they were too early and the prefetched data was displaced from the cache before it could be referenced. This category is most prominent in mst, where the compiler is unable to prefetch early enough during the traversal of very short linked lists within a hash table. Since greedy prefetching offers little control over prefetching distance, it is not surprising that scheduling is imperfect; in fact, it is encouraging that the p_miss fractions are this low.

To help evaluate the costs of prefetching, Figure 5(c) shows the fraction of dynamic prefetches that are unnecessary because the data is found in the primary cache. For each application, we show four different bars indicating the total (dynamic) unnecessary prefetches caused by static prefetch instructions with hit rates up to a given threshold. Hence the bar labeled "100" corresponds to all unnecessary prefetches, whereas the bar labeled "99" shows the total unnecessary prefetches if we exclude prefetch instructions with hit rates over 99%, etc. This breakdown indicates the potential for reducing overhead by eliminating static prefetch instructions that are clearly of little value. For example, eliminating prefetches with hit rates over 99% would eliminate over half of the unnecessary prefetches in perimeter, thus decreasing overhead significantly. In contrast, reducing overhead with a flat distribution (e.g., bh) is more difficult, since prefetches that sometimes hit also miss at least 1% of the time; therefore, eliminating them may sacrifice some latency-hiding benefit. We found that eliminating prefetches with hit rates above 95% improves performance by 1-7% for these applications [8].

Finally, we measured the impact of greedy prefetching on memory bandwidth consumption. We observe that on average, greedy prefetching increases the traffic between the primary and secondary caches by 12.7%, and the traffic between the secondary cache and main memory by 7.8%. In our experiments, this has almost no impact on performance. Hence greedy prefetching does not appear to be suffering from memory bandwidth problems.

In summary, we have seen that automatic compiler-inserted prefetching can result in significant speedups for uniprocessor applications containing RDSs. We now investigate whether the two more sophisticated prefetching schemes can offer even larger performance gains.

C. Performance of History-Pointer Prefetching and Data-Linearization Prefetching

We applied history-pointer prefetching and data-linearization prefetching by hand to several of our applications. History-pointer prefetching is applicable to health because the list structures that are accessed by a key procedure remain unchanged across the over ten thousand times that it is called. As a result, history-pointer prefetching achieves a 4% speedup over greedy prefetching through better miss coverage and fewer unnecessary prefetches. Although history-pointer prefetching has fewer unnecessary prefetches than greedy prefetching, it has significantly higher instruction overhead due to the extra work required to maintain the history-pointers.

Data-linearization prefetching is applicable to both perimeter and treeadd, because the creation order is identical to the major subsequent traversal order in both cases. As a result, data linearization does not require changing the data layout in these cases (hence spatial locality is unaffected). By reducing the number of unnecessary prefetches (and hence prefetching overhead) while maintaining good coverage factors, data-linearization prefetching results in speedups of 9% and 18% over greedy prefetching for perimeter and treeadd, respectively. Overall, we see that both schemes can potentially offer significant improvements over greedy prefetching when applicable.

V. Prefetching RDSs on Multiprocessors

Having observed the benefits of automatic prefetching of RDSs on uniprocessors, we now investigate whether the compiler can also accelerate pointer-based applications running on multiprocessors. In earlier studies, Mowry demonstrated that the compiler can successfully prefetch parallel matrix-based codes [2], [15], but the compiler used in those studies did not attempt to prefetch pointer-based access patterns. However, through hand-inserted prefetching, Mowry was able to achieve a significant speedup in BARNES [15], which is a pointer-intensive shared-memory parallel application from the SPLASH suite [16].

BARNES performs a hierarchical n-body simulation of the evolution of galaxies. The main computation consists of a depth-first traversal of an octree structure to compute the gravitational force exerted by the given body on all other bodies in the tree. This is repeated for each body in the system, and the bodies are statically assigned to processors for the duration of each time step.
Cache misses occur whenever a processor visits a part of the octree that is not already in its cache, either due to replacements or communication. To insert prefetches by hand, Mowry used a strategy similar to greedy prefetching: upon first arriving at a node, he prefetched all immediate children before descending depth-first into the first child.
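A sketch of that strategy in the style of the force-computation traversal (the octree type, field names, and helper functions are assumptions for illustration, not the actual BARNES source; __builtin_prefetch() stands in for a prefetch instruction):

#define NCHILD 8

typedef struct cell {
    double mass, pos[3];
    struct cell *child[NCHILD];          /* octree children (NULL if absent) */
} cell;

extern int  well_separated(const cell *c, const cell *body);  /* assumed test */
extern void add_contribution(const cell *c, cell *body);      /* assumed work */

void compute_force(cell *c, cell *body) {
    if (c == NULL) return;

    /* Greedy-style prefetch: on first arriving at this cell, issue prefetches
       for all of its children, so that by the time we descend into the later
       children their lines may already be in the cache. */
    for (int i = 0; i < NCHILD; i++)
        if (c->child[i] != NULL)
            __builtin_prefetch(c->child[i]);

    if (well_separated(c, body)) {
        add_contribution(c, body);       /* treat the whole cell as one mass */
    } else {
        for (int i = 0; i < NCHILD; i++)
            compute_force(c->child[i], body);
    }
}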

To evaluate the performance of our compiler-based implementation of greedy prefetching on a multiprocessor, we compared it with hand-inserted prefetching for BARNES. For the sake of comparison, we adopted the same simulation environment used in Mowry's earlier study [15], which we now briefly summarize. We simulated a cache-coherent, shared-memory multiprocessor that resembles the DASH multiprocessor [17]. Our simulated machine consists of 16 processors, each of which has two levels of direct-mapped caches, both using 16-byte lines. Table III shows the latency for servicing an access to different levels of the memory hierarchy, in the absence of contention (our simulations did model contention, however). To make simulations feasible, we scaled down both the problem size and cache sizes accordingly (we ran 8192 bodies through 3 time steps on an 8K/64K cache hierarchy), as was done (and explained in more detail) in the original study [2].

TABLE III: Memory latencies in multiprocessor simulations.

  Destination of Access      | Read       | Write
  Primary Cache              | 1 cycle    | 1 cycle
  Secondary Cache            | 15 cycles  | 4 cycles
  Local Node                 | 29 cycles  | 17 cycles
  Remote Node                | 101 cycles | 89 cycles
  Dirty Remote, Remote Home  | 132 cycles | 120 cycles

Figure 6 shows the impact of both compiler-inserted greedy prefetching (G) and hand-inserted prefetching (H) on BARNES. The execution times in Figure 6(a) are broken down as follows: the bottom section is the amount of time spent executing instructions (including any prefetching instruction overhead), and the middle and top sections are synchronization and memory stall times, respectively. As we see in Figure 6(a), the compiler achieves nearly identical performance to hand-inserted prefetching. The compiler prefetches 90% of the original cache misses, with only 15% of the prefetches being unnecessary, as we see in Figures 6(b) and 6(c), respectively. Of the prefetched misses, the latency was fully hidden in half of the cases (p_hit), and partially hidden in the other cases (p_miss). By eliminating roughly half of the original memory stall time, the compiler was able to achieve a 16% speedup.

Fig. 6. Impact of compiler-inserted greedy prefetching on BARNES on a multiprocessor (N = no prefetching, G = compiler-inserted greedy prefetching, H = hand-inserted prefetching): (a) normalized execution time, broken into instructions, synchronization, and memory stalls; (b) coverage factor (nop_miss, p_miss, p_hit as a percentage of original D-cache misses); (c) unnecessary prefetches (percentage of prefetches that hit in the D-cache).

The compiler's greedy strategy for inserting prefetches is quite similar to what was done by hand, with the following exception. In an effort to minimize unnecessary prefetches, the compiler's default strategy is to prefetch only the first 64 bytes within a given RDS node. In the case of BARNES, the nodes are longer than 64 bytes, and we discovered that hand-inserted prefetching achieves better performance when we prefetch the entire nodes. In this case, the improved miss coverage of prefetching the entire nodes is worth the additional unnecessary prefetches, thereby resulting in a 1% speedup over compiler-inserted prefetching. Overall, however, we are quite pleased that the compiler was able to do this well, nearly matching the best performance that we could achieve by hand.
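As the discussion above notes, prefetching an entire large node rather than only its first 64 bytes improved coverage for BARNES. A minimal sketch of issuing one prefetch per cache line of a node, assuming the 16-byte lines of the simulated machine and a GCC-style prefetch intrinsic (not code from the paper):

#include <stddef.h>

#define LINE_SIZE 16   /* cache line size of the simulated multiprocessor */

/* Prefetch every cache line that a node occupies, not just its first bytes.
   For nodes larger than one line, this trades a few extra prefetches for
   better miss coverage. */
static inline void prefetch_node(const void *node, size_t node_size) {
    const char *p = (const char *)node;
    for (size_t off = 0; off < node_size; off += LINE_SIZE)
        __builtin_prefetch(p + off);
}

/* Example use at a traversal site: prefetch_node(c->child[i], sizeof(cell)); */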
VI. Related Work

Although prefetching has been studied extensively for array-based numeric codes [6], [18], relatively little work has been done on non-numeric applications. Chen et al. [19] used global instruction scheduling techniques to move address generation back as early as possible to hide a small cache miss latency (10 cycles), and found mixed results. In contrast, our algorithms focus only on RDS accesses, and can issue prefetches much earlier (across procedure and loop iteration boundaries) by overcoming the pointer-chasing problem. Zhang and Torrellas [20] proposed a hardware-assisted scheme for prefetching irregular applications in shared-memory multiprocessors. Under their scheme, programs are annotated to bind together groups of data (e.g., fields in a record or two records linked by a pointer), which are then prefetched under hardware control. Compared with our compiler-based approach, their scheme has two shortcomings: (i) annotations are inserted manually, and (ii) their hardware extensions are not likely to be applicable in uniprocessors. Joseph and Grunwald [21] proposed a hardware-based Markov prefetching scheme which prefetches multiple predicted addresses upon a primary cache miss. While Markov prefetching can potentially handle chaotic miss patterns, it requires considerably more hardware support and has less flexibility in selecting what to prefetch and controlling the prefetch distance than our compiler-based schemes.

To our knowledge, the only compiler-based pointer prefetching scheme in the literature is the SPAID scheme proposed by Lipasti et al. [22]. Based on an observation that procedures are likely to dereference any pointers passed to them as arguments, SPAID inserts prefetches for the objects pointed to by these pointer arguments at the call sites. Therefore this scheme is only effective if the interval between the start of a procedure call and its dereference of a pointer is comparable to the cache miss latency. In an earlier study [8], we found that greedy prefetching offers substantially better performance than SPAID by hiding more latency while paying less overhead.

VII. Conclusions

While automatic compiler-inserted prefetching has shown considerable success in hiding the memory latency of array-based codes, the compiler technology for successfully prefetching pointer-based data structures has thus far been lacking.

In this paper, we propose three prefetching schemes which overcome the pointer-chasing problem, we automate the most widely applicable scheme (greedy prefetching) in the compiler, and we evaluate its performance on both a modern superscalar uniprocessor (similar to the MIPS R10000) and on a large-scale shared-memory multiprocessor. Our uniprocessor experiments show that automatic compiler-inserted prefetching can accelerate pointer-based applications by as much as 45%. In addition, the more sophisticated algorithms (which we currently simulate by hand) can offer even larger performance gains. Our multiprocessor experiments demonstrate that the compiler can potentially provide equivalent performance to hand-inserted prefetching even on parallel applications. These encouraging results suggest that the latency problem for pointer-based codes may be addressed largely through the prefetch instructions that already exist in many recent microprocessors.

Acknowledgments

This work is supported by a grant from IBM Canada's Centre for Advanced Studies. Chi-Keung Luk is partially supported by a Canadian Commonwealth Fellowship. Todd C. Mowry is partially supported by a Faculty Development Award from IBM.

References

[1] D. Callahan, K. Kennedy, and A. Porterfield, "Software prefetching," in Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991, pp. 40-52.
[2] T. C. Mowry, Tolerating Latency Through Software-Controlled Data Prefetching, Ph.D. thesis, Stanford University, March 1994.
[3] D. Bernstein, D. Cohen, A. Freund, and D. E. Maydan, "Compiler techniques for data prefetching on the PowerPC," in Proceedings of the 1995 International Conference on Parallel Architectures and Compilation Techniques, June 1995, pp. 19-26.
[4] V. Santhanam, E. Gornish, and W.-C. Hsu, "Data prefetching on the HP PA-8000," in Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997, pp. 264-273.
[5] K. Yeager, "The MIPS R10000 superscalar microprocessor," IEEE Micro, pp. 28-41, April 1996.
[6] T. C. Mowry, M. S. Lam, and A. Gupta, "Design and evaluation of a compiler algorithm for prefetching," in Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992, pp. 62-73.
[7] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy, "SUIF: An infrastructure for research on parallelizing and optimizing compilers," ACM SIGPLAN Notices, vol. 29, no. 12, pp. 31-37, Dec. 1994.
[8] C.-K. Luk and T. C. Mowry, "Compiler-based prefetching for recursive data structures," in Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996, pp. 222-233.
[9] M. Emami, R. Ghiya, and L. J. Hendren, "Context-sensitive interprocedural points-to analysis in the presence of function pointers," in Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994, pp. 242-256.
[10] W. Landi, B. G. Ryder, and S. Zhang, "Interprocedural modification side effect analysis with pointer aliasing," in Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, June 1993, pp. 56-67.
[11] C.-K. Luk, Optimizing the Cache Performance of Non-Numeric Applications, Ph.D. thesis, Department of Computer Science, University of Toronto, forthcoming.
[12] A. Rogers, M. Carlisle, J. Reppy, and L. Hendren, "Supporting dynamic data structures on distributed memory machines," ACM Transactions on Programming Languages and Systems, vol. 17, no. 2, pp. 233-263, March 1995.
[13] M. D. Smith, "Tracing with pixie," Tech. Rep. CSL-TR-91-497, Stanford University, November 1991.
[14] C. J. Stephenson, "Fast fits," in Proceedings of the 9th ACM Symposium on Operating Systems Principles, October 1983, pp. 30-32.
[15] T. C. Mowry, "Tolerating latency in multiprocessors through compiler-inserted prefetching," ACM Transactions on Computer Systems, vol. 16, no. 1, pp. 55-92, 1998.
[16] J. P. Singh, W.-D. Weber, and A. Gupta, "SPLASH: Stanford parallel applications for shared memory," Tech. Rep. CSL-TR-91-469, Stanford University, April 1991.
[17] D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam, "The Stanford DASH multiprocessor," IEEE Computer, vol. 25, no. 3, pp. 63-79, March 1992.
[18] J.-L. Baer and T.-F. Chen, "An effective on-chip preloading scheme to reduce data access penalty," in Proceedings of Supercomputing '91, 1991, pp. 176-186.
[19] W. Y. Chen, S. A. Mahlke, P. P. Chang, and W. W. Hwu, "Data access microarchitectures for superscalar processors with compiler-assisted data prefetching," in Proceedings of the 24th Annual ACM/IEEE International Symposium on Microarchitecture, 1991, pp. 69-73.
[20] Z. Zhang and J. Torrellas, "Speeding up irregular applications in shared-memory multiprocessors: Memory binding and group prefetching," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pp. 188-200.
[21] D. Joseph and D. Grunwald, "Prefetching using Markov predictors," in Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997, pp. 252-263.
[22] M. H. Lipasti, W. J. Schmidt, S. R. Kunkel, and R. R. Roediger, "SPAID: Software prefetching in pointer- and call-intensive environments," in Proceedings of the 28th Annual ACM/IEEE International Symposium on Microarchitecture, 1995, pp. 231-236.

Chi-Keung Luk is a Ph.D. candidate in the Department of Computer Science at the University of Toronto, and is currently a visiting scholar at Carnegie Mellon University. He received his B.Sc. (First Class Honors) and M.Phil. degrees in computer science, both from The Chinese University of Hong Kong. His research interests are computer architecture, compiler optimizations, and programming languages, with a focus on the memory performance of non-numeric applications. He has been awarded a Canadian Commonwealth Fellowship, an IBM CAS Fellowship, and a Croucher Foundation Fellowship.

Todd C. Mowry received his B.S.E.E. from the University of Virginia in 1988, and his M.S.E.E. and Ph.D. from Stanford University in 1989 and 1994, respectively. From 1994 through 1997, he was an assistant professor in the Department of Electrical and Computer Engineering and the Department of Computer Science at the University of Toronto. Since 1997, he has been an associate professor in the Computer Science Department at Carnegie Mellon University. Dr. Mowry's research interests span architecture, compilers, and operating systems. Most recently, he has been focusing on automatically tolerating the latency of accessing and communicating data, and on automatically extracting thread-level parallelism from non-numeric applications.


Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic Fall 2011 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Reading: Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000) If memory

More information

LEGEND. Cattail Sawgrass-Cattail Mixture Sawgrass-Slough. Everglades. National. Park. Area of Enlargement. Lake Okeechobee

LEGEND. Cattail Sawgrass-Cattail Mixture Sawgrass-Slough. Everglades. National. Park. Area of Enlargement. Lake Okeechobee An Ecient Parallel Implementation of the Everlades Landscape Fire Model Usin Checkpointin Fusen He and Jie Wu Department of Computer Science and Enineerin Florida Atlantic University Boca Raton, FL 33431

More information

Low cost concurrent error masking using approximate logic circuits

Low cost concurrent error masking using approximate logic circuits 1 Low cost concurrent error maskin usin approximate loic circuits Mihir R. Choudhury, Member, IEEE and Kartik Mohanram, Member, IEEE Abstract With technoloy scalin, loical errors arisin due to sinle-event

More information

Motivation Dynamic bindin facilitates more exible and extensible software architectures, e.., { Not all desin decisions need to be known durin the ini

Motivation Dynamic bindin facilitates more exible and extensible software architectures, e.., { Not all desin decisions need to be known durin the ini The C++ Prorammin Lanuae Dynamic Bindin Outline Motivation Dynamic vs. Static Bindin Shape Example Callin Mechanisms Downcastin Run-Time Type Identication Summary Motivation When desinin a system it is

More information

Minimum-Cost Multicast Routing for Multi-Layered Multimedia Distribution

Minimum-Cost Multicast Routing for Multi-Layered Multimedia Distribution Minimum-Cost Multicast Routin for Multi-Layered Multimedia Distribution Hsu-Chen Chen and Frank Yeon-Sun Lin Department of Information Manaement, National Taiwan University 50, Lane 144, Keelun Rd., Sec.4,

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Compressing Heap Data for Improved Memory Performance

Compressing Heap Data for Improved Memory Performance Compressing Heap Data for Improved Memory Performance Youtao Zhang Rajiv Gupta Department of Computer Science Department of Computer Science The University of Texas at Dallas The University of Arizona

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N.

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N. Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses Onur Mutlu Hyesoon Kim Yale N. Patt High Performance Systems Group Department of Electrical

More information

Client Host. Server Host. Registry. Client Object. Server. Remote Object. Stub. Skeleton

Client Host. Server Host. Registry. Client Object. Server. Remote Object. Stub. Skeleton Exploitin implicit loop parallelism usin multiple multithreaded servers in Java Fabian Bre Aart Bik Dennis Gannon December 16, 1997 1 Introduction Since its introduction in the late eihties, the lobal

More information

task object task queue

task object task queue Optimizations for Parallel Computing Using Data Access Information Martin C. Rinard Department of Computer Science University of California, Santa Barbara Santa Barbara, California 9316 martin@cs.ucsb.edu

More information

The Gene Expression Messy Genetic Algorithm For. Hillol Kargupta & Kevin Buescher. Computational Science Methods division

The Gene Expression Messy Genetic Algorithm For. Hillol Kargupta & Kevin Buescher. Computational Science Methods division The Gene Expression Messy Genetic Alorithm For Financial Applications Hillol Karupta & Kevin Buescher Computational Science Methods division Los Alamos National Laboratory Los Alamos, NM, USA. Abstract

More information

Compiler-Based Prefetching for Recursive Data Structures

Compiler-Based Prefetching for Recursive Data Structures Compiler-Based Prefetching for Recursive Data Structures Chi-Keung Luk and Todd C. Mowry Department of Computer Science Department of Electrical and Computer Engineering University of Toronto Toronto,

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Ecient and Precise Modeling of Exceptions for the Analysis of Java Programs. IBM Research. Thomas J. Watson Research Center

Ecient and Precise Modeling of Exceptions for the Analysis of Java Programs. IBM Research. Thomas J. Watson Research Center Ecient and Precise Modelin of Exceptions for the Analysis of Java Prorams Jon-Deok Choi David Grove Michael Hind Vivek Sarkar IBM Research Thomas J. Watson Research Center P.O. Box 704, Yorktown Heihts,

More information

Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy

Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2006 Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy Sheela A. Doshi Louisiana State University and Agricultural

More information

in two important ways. First, because each processor processes lare disk-resident datasets, the volume of the communication durin the lobal reduction

in two important ways. First, because each processor processes lare disk-resident datasets, the volume of the communication durin the lobal reduction Compiler and Runtime Analysis for Ecient Communication in Data Intensive Applications Renato Ferreira Gaan Arawal y Joel Saltz Department of Computer Science University of Maryland, Collee Park MD 20742

More information

Reducing Network Cost of Many-to-Many Communication in Unidirectional WDM Rings with Network Coding

Reducing Network Cost of Many-to-Many Communication in Unidirectional WDM Rings with Network Coding 1 Reducin Network Cost of Many-to-Many Communication in Unidirectional WDM Rins with Network Codin Lon Lon and Ahmed E. Kamal, Senior Member, IEEE Abstract In this paper we address the problem of traffic

More information

Compiler Techniques for Energy Saving in Instruction Caches. of Speculative Parallel Microarchitectures.

Compiler Techniques for Energy Saving in Instruction Caches. of Speculative Parallel Microarchitectures. Compiler Techniques for Energy Saving in Instruction Caches of Speculative Parallel Microarchitectures Seon Wook Kim Rudolf Eigenmann School of Electrical and Computer Engineering Purdue University, West

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES

CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES Angad Kataria, Simran Khurana Student,Department Of Information Technology Dronacharya College Of Engineering,Gurgaon Abstract- Hardware trends

More information

Improving Index Performance through Prefetching

Improving Index Performance through Prefetching Improving Index Performance through Prefetching Shimin Chen School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 chensm@cs.cmu.edu Phillip B. Gibbons Information Sciences Research

More information

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

optimization agents user interface agents database agents program interface agents

optimization agents user interface agents database agents program interface agents A MULTIAGENT SIMULATION OPTIMIZATION SYSTEM Sven Hader Department of Computer Science Chemnitz University of Technoloy D-09107 Chemnitz, Germany E-Mail: sha@informatik.tu-chemnitz.de KEYWORDS simulation

More information

Topics on Compilers Spring Semester Christine Wagner 2011/04/13

Topics on Compilers Spring Semester Christine Wagner 2011/04/13 Topics on Compilers Spring Semester 2011 Christine Wagner 2011/04/13 Availability of multicore processors Parallelization of sequential programs for performance improvement Manual code parallelization:

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Linear Network Coding

Linear Network Coding IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 2, FEBRUARY 2003 371 Linear Network Codin Shuo-Yen Robert Li, Senior Member, IEEE, Raymond W. Yeun, Fellow, IEEE, Nin Cai Abstract Consider a communication

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Runtime Biased Pointer Reuse Analysis and Its Application to Energy Efficiency

Runtime Biased Pointer Reuse Analysis and Its Application to Energy Efficiency Runtime Biased Pointer Reuse Analysis and Its Application to Energy Efficiency Yao Guo, Saurabh Chheda, Csaba Andras Moritz Department of Electrical and Computer Engineering University of Massachusetts,

More information

Chapter 5 THE MODULE FOR DETERMINING AN OBJECT S TRUE GRAY LEVELS

Chapter 5 THE MODULE FOR DETERMINING AN OBJECT S TRUE GRAY LEVELS Qian u Chapter 5. Determinin an Object s True Gray evels 3 Chapter 5 THE MODUE OR DETERMNNG AN OJECT S TRUE GRAY EVES This chapter discusses the module for determinin an object s true ray levels. To compute

More information

main Entry main main pow Entry pow pow

main Entry main main pow Entry pow pow Interprocedural Path Prolin David Melski and Thomas Reps Computer Sciences Department, University of Wisconsin, 20 West Dayton Street, Madison, WI, 53706, USA, fmelski, reps@cs.wisc.edu Abstract. In path

More information

Improving Computer Security using Extended Static Checking

Improving Computer Security using Extended Static Checking Improvin Computer Security usin Extended Static Checkin Brian V. Chess Department o Computer Enineerin University o Caliornia, Santa Cruz Abstract We describe a method or indin security laws in source

More information

Data-flow-based Testing of Object-Oriented Libraries

Data-flow-based Testing of Object-Oriented Libraries Data-flow-based Testin of Object-Oriented Libraries Ramkrishna Chatterjee Oracle Corporation One Oracle Drive Nashua, NH 03062, USA ph: +1 603 897 3515 Ramkrishna.Chatterjee@oracle.com Barbara G. Ryder

More information

10. SOPC Builder Component Development Walkthrough

10. SOPC Builder Component Development Walkthrough 10. SOPC Builder Component Development Walkthrough QII54007-9.0.0 Introduction This chapter describes the parts o a custom SOPC Builder component and guides you through the process o creating an example

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

General Design of Grid-based Data Replication. Schemes Using Graphs and a Few Rules. availability of read and write operations they oer and

General Design of Grid-based Data Replication. Schemes Using Graphs and a Few Rules. availability of read and write operations they oer and General esin of Grid-based ata Replication Schemes Usin Graphs and a Few Rules Oliver Theel University of California epartment of Computer Science Riverside, C 92521-0304, US bstract Grid-based data replication

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Data Flow Analysis for Software Prefetching Linked Data Structures in Java

Data Flow Analysis for Software Prefetching Linked Data Structures in Java Data Flow Analysis for Software Prefetching Linked Data Structures in Java Brendon Cahoon and Kathryn S. McKinley Department of Computer Science University of Massachusetts Amherst, MA 01002 cahoon,mckinley

More information