

IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 2, FEBRUARY 1999

Automatic Compiler-Inserted Prefetching for Pointer-Based Applications

Chi-Keung Luk and Todd C. Mowry

C.-K. Luk is with the Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada. luk@eecg.toronto.edu. T. C. Mowry is with the Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. tcm@cs.cmu.edu.

Abstract: As the disparity between processor and memory speeds continues to grow, memory latency is becoming an increasingly important performance bottleneck. While software-controlled prefetching is an attractive technique for tolerating this latency, its success has been limited thus far to array-based numeric codes. In this paper, we expand the scope of automatic compiler-inserted prefetching to also include the recursive data structures commonly found in pointer-based applications. We propose three compiler-based prefetching schemes, and automate the most widely applicable scheme (greedy prefetching) in an optimizing research compiler. Our experimental results demonstrate that compiler-inserted prefetching can offer significant performance gains on both uniprocessors and large-scale shared-memory multiprocessors.

Keywords: Caches, prefetching, pointer-based applications, recursive data structures, compiler optimization, shared-memory multiprocessors, performance evaluation.

I. Introduction

SOFTWARE-controlled data prefetching [1], [2] offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. In recognition of this potential, a number of recent processors have added support for prefetch instructions [3], [4], [5]. While prefetching has enjoyed considerable success in array-based numeric codes [6], its potential in pointer-based applications has remained largely unexplored. This paper investigates compiler-inserted prefetching for pointer-based applications, in particular those containing recursive data structures.

Recursive Data Structures (RDSs) include familiar objects such as linked lists, trees, graphs, etc., where individual nodes are dynamically allocated from the heap, and nodes are linked together through pointers to form the overall structure. For our purposes, "recursive data structures" can be broadly interpreted to include most pointer-linked data structures (e.g., mutually-recursive data structures, or even a graph of heterogeneous objects). From a memory performance perspective, these pointer-based data structures are expected to be an important concern for the following reasons. For an application to suffer a large memory penalty due to data replacement misses, it typically must have a large data set relative to the cache size. Aside from multi-dimensional arrays, recursive data structures are one of the most common and convenient methods of building large data structures (e.g., B-trees in database applications, octrees in graphics applications, etc.). As we traverse a large RDS, we may potentially visit enough intervening nodes to displace a given node from the cache before it is revisited; hence temporal locality may be poor. Finally, in contrast with arrays, where consecutive elements are at contiguous addresses, there is little inherent spatial locality between consecutively-accessed nodes in an RDS, since they are dynamically allocated at arbitrary addresses.

To cope with the latency of accessing these pointer-based data structures, we propose three compiler-based schemes for prefetching RDSs, as described in Section II.
We implemented the most widely applicable of these schemes, greedy prefetching, in a modern research compiler (SUIF [7]), as discussed in Section III. To evaluate our schemes, we performed detailed simulations of their impact on both uniprocessor and multiprocessor systems in Sections IV and V, respectively. Finally, we present related work and conclusions in Sections VI and VII.

II. Software-Controlled Prefetching for RDSs

A key challenge in successfully prefetching RDSs is scheduling the prefetches sufficiently far in advance to fully hide the latency, while introducing minimal runtime overhead. In contrast with array-based codes, where the prefetching distance can be easily controlled using software pipelining [2], the fundamental difficulty with RDSs is that we must first dereference pointers to compute the prefetch addresses. Getting several nodes ahead in an RDS traversal typically involves following a pointer chain. However, the very act of touching these intermediate nodes along the pointer chain means that we cannot tolerate the latency of fetching more than one node ahead. To overcome this pointer-chasing problem [8], we propose three schemes for generating prefetch addresses without following the entire pointer chain. The first two schemes, greedy prefetching and history-pointer prefetching, use a pointer within the current node as the prefetching address; the difference is that greedy prefetching uses existing pointers, whereas history-pointer prefetching creates new pointers. The third scheme, data-linearization prefetching, generates prefetch addresses without pointer dereferences.
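To make the pointer-chasing problem concrete, here is a minimal sketch (not from the paper) contrasting the two cases; it assumes a GCC-style __builtin_prefetch() intrinsic and hypothetical function names. In the array loop the address d elements ahead is directly computable, while in the list loop reaching the node d hops ahead means dereferencing every intermediate node:

#include <stddef.h>

typedef struct node { int data; struct node *next; } node;

/* Array case: &a[i+d] is computable without touching any intervening
   elements, so prefetches are easily scheduled via software pipelining. */
long sum_array(const int *a, size_t n, size_t d) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + d < n)
            __builtin_prefetch(&a[i + d]);  /* address known in advance */
        sum += a[i];
    }
    return sum;
}

/* List case: to prefetch the node d hops ahead we must chase d pointers,
   loading the very nodes whose miss latency we are trying to hide. */
long sum_list(const node *l, int d) {
    long sum = 0;
    for (const node *p = l; p != NULL; p = p->next) {
        const node *q = p;
        for (int i = 0; i < d && q != NULL; i++)
            q = q->next;                    /* each hop may itself miss */
        if (q != NULL)
            __builtin_prefetch(q);
        sum += p->data;
    }
    return sum;
}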

A. Greedy Prefetching

In a k-ary RDS, each node contains k pointers to other nodes. Greedy prefetching exploits the fact that when k > 1, only one of these k neighbors can be immediately followed as the next node in the traversal, but there is often a good chance that other neighbors will be visited sometime in the future. Therefore, by prefetching all k pointers when a node is first visited, we hope that enough of these prefetches are successful that we can hide at least some fraction of the miss latency.

  preorder(treenode *t) {
    if (t != NULL) {
      prefetch(t->left);
      prefetch(t->right);
      process(t->data);
      preorder(t->left);
      preorder(t->right);
    }
  }

(a) Code with Greedy Prefetching. (b) Cache Miss Behavior: each node is either a full cache miss, a partial-latency miss, or a cache hit.

Fig. 1. Illustration of greedy prefetching.

To illustrate how greedy prefetching works, consider the pre-order traversal of a binary tree (i.e., k = 2), where Figure 1(a) shows the code with greedy prefetching added. Assuming that the computation in process() takes half as long as the cache miss latency L, we would want to prefetch two nodes ahead to fully hide the latency. Figure 1(b) shows the caching behavior of each node. We obviously suffer a full cache miss at the root node (node 1), since there was no opportunity to fetch it ahead of time. However, we would only suffer half of the miss penalty (L/2) when we visit node 2, and no miss penalty when we eventually visit node 3 (since the time to visit the subtree rooted at node 2 is greater than L). In this example, the latency is fully hidden for roughly half of the nodes, and reduced by 50% for the other half (minus the root node).

Greedy prefetching offers the following advantages: (i) it has low runtime overhead, since no additional storage or computation is needed to construct the prefetch pointers; (ii) it is applicable to a wide variety of RDSs, regardless of how they are accessed or whether their structure is modified frequently; and (iii) it is relatively straightforward to implement in a compiler; in fact, we have implemented it in the SUIF compiler, as we describe later in Section III. The main disadvantage of greedy prefetching is that it does not offer precise control over the prefetching distance, which is the motivation for our next algorithm.

B. History-Pointer Prefetching

Rather than relying on existing pointers to approximate prefetch addresses, we can potentially synthesize more accurate pointers based on the observed RDS traversal patterns. To prefetch d nodes ahead under the history-pointer prefetching scheme [8], we add a new pointer (called a history-pointer) to a node n_i to record the observed address of n_{i+d} (the node visited d nodes after n_i) on a recent traversal of the RDS. On subsequent traversals of the RDS, we prefetch the nodes pointed to by these history-pointers. This scheme is most effective when the traversal pattern does not change rapidly over time.

To construct the history-pointers, we maintain a FIFO queue of length d which contains pointers to the last d nodes that have just been visited. When we visit a new node n_i, the oldest node in the queue will be n_{i-d} (i.e., the node visited d nodes earlier), and hence we update the history-pointer of n_{i-d} to point to n_i. After the first complete traversal of the RDS, all of the history-pointers will be set. In contrast with greedy prefetching, history-pointer prefetching offers no improvement on the first traversal of an RDS, but can potentially hide all of the latency on subsequent traversals.

While history-pointer prefetching offers the potential advantage of improved latency tolerance, this comes at the expense of (i) execution overhead to construct the history-pointers, and (ii) space overhead for storing these new pointers. To minimize execution overhead, we can potentially update the history-pointers less frequently, depending on how rapidly the RDS structure changes. At one extreme, if the RDS never changes, we can set the history-pointers just once. The problem with space overhead is that it potentially worsens the caching behavior. The desire to eliminate this space overhead altogether is the motivation for our next prefetching scheme.
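A minimal sketch of this mechanism on a singly-linked list (the type, field, and helper names here are assumptions for illustration, not the paper's implementation; __builtin_prefetch() stands in for a prefetch instruction): each node carries an extra history field, and the traversal keeps the last d visited nodes in a small circular buffer so that the node visited d steps ago can be pointed at the current node.

#define PREFETCH_DIST 4                  /* d: how many nodes ahead to prefetch */

extern void process(int data);           /* assumed per-node work */

typedef struct lnode {
    int data;
    struct lnode *next;
    struct lnode *history;               /* points to the node visited d later */
} lnode;

void traverse(lnode *head) {
    lnode *queue[PREFETCH_DIST] = { 0 }; /* FIFO of the last d visited nodes */
    int idx = 0;

    for (lnode *p = head; p != NULL; p = p->next) {
        if (p->history != NULL)
            __builtin_prefetch(p->history);  /* set on an earlier traversal */

        /* The slot we are about to overwrite holds n_{i-d}, the node visited
           d steps ago; record the current node as its history target. */
        if (queue[idx] != NULL)
            queue[idx]->history = p;
        queue[idx] = p;
        idx = (idx + 1) % PREFETCH_DIST;

        process(p->data);
    }
}

On the first traversal the history fields are still NULL, so no prefetches are issued; later traversals prefetch d nodes ahead as long as the list has not been restructured in the meantime.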
C. Data-Linearization Prefetching

The idea behind data-linearization prefetching [8] is to map heap-allocated nodes that are likely to be accessed close together in time into contiguous memory locations. With this mapping, one can easily generate prefetch addresses and launch them early enough. Another advantage of this scheme is that it improves spatial locality. The major challenge, however, is how and when we can generate this data layout. In theory, one could dynamically remap the data even after the RDS has been initially constructed, but doing so may result in large runtime overheads and may also violate program semantics. Instead, the easiest time to map the nodes is at creation time, which is appropriate if either the creation order already matches the traversal order, or if it can be safely reordered to do so. Since dynamic remapping is expensive (or impossible), this scheme obviously works best if the structure of the RDS changes only slowly (or not at all). If the RDS does change radically, the program will still behave correctly, but prefetching will not improve performance.
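A minimal sketch of the idea (assumed names, not the paper's implementation): if tree nodes are allocated from a contiguous pool in the order in which they will later be traversed, then the node d positions ahead in the traversal lies at a fixed byte offset from the current node, so its prefetch address requires no pointer dereference.

#include <stddef.h>

#define POOL_NODES 4096
#define PF_DIST    4                       /* prefetch d nodes ahead */

extern void process(int data);             /* assumed per-node work */

typedef struct tnode { int data; struct tnode *left, *right; } tnode;

static tnode pool[POOL_NODES];             /* nodes laid out contiguously... */
static int   next_free = 0;                /* ...in creation (= traversal) order */

tnode *alloc_node(void) {
    return (next_free < POOL_NODES) ? &pool[next_free++] : NULL;
}

void preorder(tnode *t) {
    if (t == NULL) return;
    /* Consecutively traversed nodes are contiguous, so the node PF_DIST
       positions ahead lives at a fixed offset from t; no pointers are chased.
       Prefetching past the end of the pool is harmless on most machines,
       since prefetches are only hints and typically do not fault. */
    __builtin_prefetch((char *)t + PF_DIST * sizeof(tnode));
    process(t->data);
    preorder(t->left);
    preorder(t->right);
}

This only pays off when the creation order (roughly) matches the traversal order, which is exactly the condition stated above.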

III. Implementation of Greedy Prefetching

Of the three schemes that we propose, greedy prefetching is perhaps the most widely applicable since it does not rely on traversal history information, and it requires no additional storage or computation to construct prefetch addresses. For these reasons, we have implemented a version of greedy prefetching within the SUIF compiler [7], and we will simulate the other two algorithms by hand. Our implementation consists of an analysis phase to recognize RDS accesses, and a scheduling phase to insert prefetches.

A. Analysis: Recognizing RDS Accesses

To recognize RDS accesses, the compiler uses both type declaration information to recognize which data objects are RDSs, and control structure information to recognize when these objects are being traversed. An RDS type is a record type r containing at least one pointer that points either directly or indirectly to a record type s. (Note that r and s are not restricted to be the same type, since RDSs may be comprised of heterogeneous nodes.) For example, the type declarations in Figure 2(a) and Figure 2(b) would be recognized as RDS types, whereas Figure 2(c) would not.

  (a) RDS type:        struct T { int data; struct T *left; struct T *right; };
  (b) RDS type:        struct A { int i; struct B **kids[8]; };
  (c) Not an RDS type: struct C { int j; double f; };

Fig. 2. Examples of which types are recognized as RDS types.

After discovering data structures with the appropriate types, the compiler then looks for control structures that are used to traverse the RDSs. In particular, the compiler looks for loops or recursive procedure calls such that during each new loop iteration or procedure invocation, a pointer p to an RDS is assigned a value resulting from a dereference of p; we refer to this as a recurrent pointer update. This heuristic corresponds to how RDS codes are typically written. To detect recurrent pointer updates, the compiler propagates pointer values using a simplified (but less precise) version of earlier pointer analysis algorithms [9], [10].

Figure 3 shows some example program fragments that our compiler treats as RDS accesses. In Figure 3(a), l is updated to l->next->next inside the while-loop. In Figure 3(b), n is assigned the result of the function call f(n) inside the for-loop. (Since our implementation does not perform interprocedural analysis, it assumes that f(n) results in a value of the form n->...->next.) In Figure 3(c), two dereferences of the function argument t are passed as the parameters to two recursive calls. Figure 3(d) is similar to Figure 3(c), except that a record (rather than a pointer) is passed as the function argument.

  (a) while (l) { list *m; m = l->next; l = m->next; }
  (b) for (...) { list *n; n = f(n); }
  (c) g(tree *t) { ... g(t->left); ... g(t->right); ... }
  (d) k(tree tn) { ... k(*(tn.left)); ... k(*(tn.right)); ... }

Fig. 3. Examples of control structures recognized as RDS traversals.

Ideally, the next step would be to analyze data locality across RDS nodes to eliminate unnecessary prefetches. Although we have not automated this step in our compiler, we evaluated its potential benefits in an earlier study [8].

B. Scheduling Prefetches

Once RDS accesses have been recognized, the compiler inserts greedy prefetches as follows. At the point where an RDS object is being traversed (i.e., where the recurrent pointer update occurs), the compiler inserts prefetches of all pointers within this object that point to RDS-type objects, at the earliest points where these addresses are available within the surrounding loop or procedure body. The availability of prefetch addresses is computed by propagating the earliest generation points of pointer values along with the values themselves. Two examples of greedy prefetch scheduling are shown in Figure 4. Further details of our implementation can be found in Luk's thesis [11].

  (a) Loop:
      while (l) { work(l->data); l = l->next; }
        ==>  while (l) { prefetch(l->next); work(l->data); l = l->next; }

  (b) Procedure:
      f(tree *t) {
        tree *q;
        if (test(t->data)) q = t->left; else q = t->right;
        if (q != NULL) f(q);
      }
        ==>
      f(tree *t) {
        tree *q;
        prefetch(t->left); prefetch(t->right);
        if (test(t->data)) q = t->left; else q = t->right;
        if (q != NULL) f(q);
      }

Fig. 4. Examples of greedy prefetch scheduling.
IV. Prefetching RDSs on Uniprocessors

In this section, we quantify the impact of our prefetching schemes on uniprocessor performance. Later, in Section V, we will turn our attention to multiprocessor systems.

A. Experimental Framework

We performed detailed cycle-by-cycle simulations of the entire Olden benchmark suite [12] on a dynamically-scheduled, superscalar processor similar to the MIPS R10000 [5]. The Olden benchmark suite contains ten pointer-based applications written in C, which are briefly summarized in Table I. The rightmost column in Table I shows the amount of memory dynamically allocated to RDS nodes.

TABLE I: Benchmark characteristics.

  Benchmark  | Recursive Data Structures Used          | Input Data Set                         | Memory Allocated
  BH         | Heterogeneous octree                    | 4K bodies                              | 721 KB
  Bisort     | Binary tree                             | 250,000 integers                       | 1,535 KB
  EM3D       | Singly-linked lists                     | 2,000 H-nodes, 100 E-nodes, 75% local  | 1,671 KB
  Health     | Four-way tree and doubly-linked lists   | levels = 5, time = 500                 | 925 KB
  MST        | Array of singly-linked lists            | 512 nodes                              | 1 KB
  Perimeter  | A quadtree                              | 4K x 4K image                          | 6,445 KB
  Power      | Multi-way tree and singly-linked lists  | 10,000 customers                       | 418 KB
  TreeAdd    | Binary tree                             | 1,024K nodes                           | 12,288 KB
  TSP        | Binary tree and doubly-linked lists     | 100,000 cities                         | 5,120 KB
  Voronoi    | Binary tree                             | 20,000 points                          | 1,915 KB

Our simulation model varies slightly from the actual MIPS R10000 (e.g., we model two memory units, and we assume that all functional units are fully pipelined), but we do model the rich details of the processor, including the pipeline, register renaming, the reorder buffer, branch prediction, instruction fetching, branching penalties, the memory hierarchy (including contention), etc. Table II shows the parameters of our model.

TABLE II: Uniprocessor simulation parameters.

  Pipeline Parameters
    Issue Width: 4
    Functional Units: 2 Int, 2 FP, 2 Memory, 1 Branch
    Reorder Buffer Size: 32
    Integer Multiply: 12 cycles
    Integer Divide: 76 cycles
    All Other Integer: 1 cycle
    FP Divide: 15 cycles
    FP Square Root: 20 cycles
    All Other FP: 2 cycles
    Branch Prediction Scheme: 2-bit Counters

  Memory Parameters
    Primary Instr and Data Caches: 16KB, 2-way set-associative
    Unified Secondary Cache: 512KB, 2-way set-associative
    Line Size: 32B
    Primary-to-Secondary Miss Latency: 12 cycles
    Primary-to-Memory Miss Latency: 75 cycles
    Data Cache Miss Handlers: 8
    Data Cache Banks: 2
    Data Cache Fill Time (Requires Exclusive Access): 4 cycles
    Main Memory Bandwidth: 1 access per 20 cycles

We use pixie [13] to instrument the optimized MIPS object files produced by the compiler, and pipe the resulting trace into our simulator. To avoid misses during the initialization of dynamically-allocated objects, we used a modified version of the IRIX mallopt routine [14] whereby we prefetch allocated objects before they are initialized. Determining these prefetch addresses is straightforward, since objects of the same size are typically allocated from contiguous memory. This optimization alone led to over twofold speedups relative to using malloc for the majority of the applications, particularly those that frequently allocate small objects.
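The paper does not give code for this allocator change; the sketch below (assumed names, built on a simple fixed-size bump allocator rather than the actual IRIX mallopt) conveys the idea: because same-sized objects are carved out of a contiguous chunk, the addresses that the next few allocations will return are known now and can be prefetched before the program initializes them.

#include <stdlib.h>

#define ALLOC_PF_AHEAD 2                    /* prefetch the next few objects */

/* Simplified fixed-size bump allocator standing in for the modified
   allocator described above (no free list, no mixing of sizes). */
static char  *chunk      = NULL;
static size_t chunk_left = 0;

void *alloc_prefetched(size_t size) {
    if (chunk_left < size) {                /* grab a fresh contiguous chunk */
        size_t chunk_size = 64 * 1024;
        chunk = malloc(chunk_size);
        if (chunk == NULL) return NULL;
        chunk_left = chunk_size;
    }
    void *obj = chunk;
    chunk      += size;
    chunk_left -= size;

    /* Same-sized objects are contiguous, so the next allocations' addresses
       are already known; prefetch them now so they are (mostly) resident by
       the time the program allocates and initializes them. */
    for (int i = 0; i < ALLOC_PF_AHEAD; i++)
        __builtin_prefetch(chunk + i * size);

    return obj;
}

A production allocator would also prefetch across chunk refills and handle multiple size classes; the point here is only that the prefetch addresses fall out of the contiguous layout.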
B. Performance of Greedy Prefetching

Figure 5 shows the results of our uniprocessor experiments. The overall performance improvement offered by greedy prefetching is shown in Figure 5(a), where the two bars correspond to the cases without prefetching (N) and with greedy prefetching (G). These bars represent execution time normalized to the case without prefetching, and they are broken down into four categories explaining what happened during all potential graduation slots. (The number of graduation slots is the issue width, 4 in this case, multiplied by the number of cycles.) The bottom section (busy) is the number of slots when instructions actually graduate, the top two sections are any non-graduating slots that are immediately caused by the oldest instruction suffering either a load or store miss, and the inst stall section is all other slots where instructions do not graduate. Note that the load stall and store stall sections are only a first-order approximation of the performance loss due to cache misses, since these delays also exacerbate subsequent data dependence stalls.

Fig. 5. Performance impact of compiler-inserted greedy prefetching on a uniprocessor: (a) normalized execution time for each benchmark (N = no prefetching, G = greedy prefetching), broken down into load stall, store stall, inst stall, and busy; (b) coverage factor, i.e., the percentage of original load D-cache misses that are p_hit, p_miss, or nop_miss; (c) unnecessary prefetches, i.e., the percentage of prefetches that hit in the D-cache.

As we see in Figure 5(a), half of the applications enjoy a speedup ranging from 4% to 45%, and the other half are within 2% of their original performance. For the applications with the largest memory stall penalties, i.e., health, perimeter, and treeadd, much of this stall time has been eliminated. In the cases of bisort and mst, prefetching overhead more than offset the reduction in memory stalls (thus resulting in a slight performance degradation), but this was not a problem in the other eight applications.

To understand the performance results in greater depth, Figure 5(b) breaks down the original primary cache misses into three categories: (i) those that are prefetched and subsequently hit in the primary cache (p_hit), (ii) those that are prefetched but remain primary misses (p_miss), and (iii) those that are not prefetched (nop_miss). The sum of the p_hit and p_miss cases is also known as the coverage factor, which ideally should be 100%. For em3d, power, and voronoi, the coverage factor is quite low (under 20%) because most of their misses are caused by array or scalar references; hence prefetching RDSs yields little improvement. In all other cases, the coverage factor is above 60%, and in four cases we achieve nearly perfect coverage. If the p_miss category is large, this indicates that prefetches were not scheduled effectively: either they were issued too late to hide the latency, or else they were too early and the prefetched data was displaced from the cache before it could be referenced. This category is most prominent in mst, where the compiler is unable to prefetch early enough during the traversal of very short linked lists within a hash table. Since greedy prefetching offers little control over prefetching distance, it is not surprising that scheduling is imperfect; in fact, it is encouraging that the p_miss fractions are this low.

To help evaluate the costs of prefetching, Figure 5(c) shows the fraction of dynamic prefetches that are unnecessary because the data is found in the primary cache. For each application, we show four different bars indicating the total (dynamic) unnecessary prefetches caused by static prefetch instructions with hit rates up to a given threshold. Hence the bar labeled "100" corresponds to all unnecessary prefetches, whereas the bar labeled "99" shows the total unnecessary prefetches if we exclude prefetch instructions with hit rates over 99%, etc. This breakdown indicates the potential for reducing overhead by eliminating static prefetch instructions that are clearly of little value. For example, eliminating prefetches with hit rates over 99% would eliminate over half of the unnecessary prefetches in perimeter, thus decreasing overhead significantly. In contrast, reducing overhead with a flat distribution (e.g., bh) is more difficult, since prefetches that sometimes hit also miss at least 1% of the time; therefore, eliminating them may sacrifice some latency-hiding benefit. We found that eliminating prefetches with hit rates above 95% improves performance by 1-7% for these applications [8].

Finally, we measured the impact of greedy prefetching on memory bandwidth consumption. We observe that on average, greedy prefetching increases the traffic between the primary and secondary caches by 12.7%, and the traffic between the secondary cache and main memory by 7.8%. In our experiments, this has almost no impact on performance. Hence greedy prefetching does not appear to be suffering from memory bandwidth problems.

In summary, we have seen that automatic compiler-inserted prefetching can result in significant speedups for uniprocessor applications containing RDSs. We now investigate whether the two more sophisticated prefetching schemes can offer even larger performance gains.

C. Performance of History-Pointer Prefetching and Data-Linearization Prefetching

We applied history-pointer prefetching and data-linearization prefetching by hand to several of our applications. History-pointer prefetching is applicable to health because the list structures that are accessed by a key procedure remain unchanged across the over ten thousand times that it is called. As a result, history-pointer prefetching achieves a 4% speedup over greedy prefetching through better miss coverage and fewer unnecessary prefetches. Although history-pointer prefetching has fewer unnecessary prefetches than greedy prefetching, it has significantly higher instruction overhead due to the extra work required to maintain the history-pointers.

Data-linearization prefetching is applicable to both perimeter and treeadd, because the creation order is identical to the major subsequent traversal order in both cases. As a result, data linearization does not require changing the data layout in these cases (hence spatial locality is unaffected). By reducing the number of unnecessary prefetches (and hence prefetching overhead) while maintaining good coverage factors, data-linearization prefetching results in speedups of 9% and 18% over greedy prefetching for perimeter and treeadd, respectively. Overall, we see that both schemes can potentially offer significant improvements over greedy prefetching when applicable.

V. Prefetching RDSs on Multiprocessors

Having observed the benefits of automatic prefetching of RDSs on uniprocessors, we now investigate whether the compiler can also accelerate pointer-based applications running on multiprocessors. In earlier studies, Mowry demonstrated that the compiler can successfully prefetch parallel matrix-based codes [2], [15], but the compiler used in those studies did not attempt to prefetch pointer-based access patterns. However, through hand-inserted prefetching, Mowry was able to achieve a significant speedup in BARNES [15], which is a pointer-intensive shared-memory parallel application from the SPLASH suite [16].

BARNES performs a hierarchical n-body simulation of the evolution of galaxies. The main computation consists of a depth-first traversal of an octree structure to compute the gravitational force exerted by the given body on all other bodies in the tree. This is repeated for each body in the system, and the bodies are statically assigned to processors for the duration of each time step.
Cache misses occur whenever a processor visits a part of the octree that is not already in its cache, either due to replacements or communication. To insert prefetches by hand, Mowry used a strategy similar to greedy prefetching: upon first arriving at a node, he prefetched all immediate children before descending depth-first into the first child.
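A sketch of that strategy in the style of the force-computation traversal (the octree type, field names, and helper functions are assumptions for illustration, not the actual BARNES source; __builtin_prefetch() stands in for a prefetch instruction):

#define NCHILD 8

typedef struct cell {
    double mass, pos[3];
    struct cell *child[NCHILD];          /* octree children (NULL if absent) */
} cell;

extern int  well_separated(const cell *c, const cell *body);  /* assumed test */
extern void add_contribution(const cell *c, cell *body);      /* assumed work */

void compute_force(cell *c, cell *body) {
    if (c == NULL) return;

    /* Greedy-style prefetch: on first arriving at this cell, issue prefetches
       for all of its children, so that by the time we descend into the later
       children their lines may already be in the cache. */
    for (int i = 0; i < NCHILD; i++)
        if (c->child[i] != NULL)
            __builtin_prefetch(c->child[i]);

    if (well_separated(c, body)) {
        add_contribution(c, body);       /* treat the whole cell as one mass */
    } else {
        for (int i = 0; i < NCHILD; i++)
            compute_force(c->child[i], body);
    }
}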

To evaluate the performance of our compiler-based implementation of greedy prefetching on a multiprocessor, we compared it with hand-inserted prefetching for BARNES. For the sake of comparison, we adopted the same simulation environment used in Mowry's earlier study [15], which we now briefly summarize. We simulated a cache-coherent, shared-memory multiprocessor that resembles the DASH multiprocessor [17]. Our simulated machine consists of 16 processors, each of which has two levels of direct-mapped caches, both using 16-byte lines. Table III shows the latency for servicing an access to different levels of the memory hierarchy, in the absence of contention (our simulations did model contention, however). To make simulations feasible, we scaled down both the problem size and cache sizes accordingly (we ran 8192 bodies through 3 time steps on an 8K/64K cache hierarchy), as was done (and explained in more detail) in the original study [2].

TABLE III: Memory latencies in multiprocessor simulations.

  Destination of Access      | Read       | Write
  Primary Cache              | 1 cycle    | 1 cycle
  Secondary Cache            | 15 cycles  | 4 cycles
  Local Node                 | 29 cycles  | 17 cycles
  Remote Node                | 101 cycles | 89 cycles
  Dirty Remote, Remote Home  | 132 cycles | 120 cycles

Figure 6 shows the impact of both compiler-inserted greedy prefetching (G) and hand-inserted prefetching (H) on BARNES. The execution times in Figure 6(a) are broken down as follows: the bottom section is the amount of time spent executing instructions (including any prefetching instruction overhead), and the middle and top sections are synchronization and memory stall times, respectively. As we see in Figure 6(a), the compiler achieves nearly identical performance to hand-inserted prefetching. The compiler prefetches 90% of the original cache misses, with only 15% of the prefetches being unnecessary, as we see in Figures 6(b) and 6(c), respectively. Of the prefetched misses, the latency was fully hidden in half of the cases (p_hit), and partially hidden in the other cases (p_miss). By eliminating roughly half of the original memory stall time, the compiler was able to achieve a 16% speedup.

Fig. 6. Impact of compiler-inserted greedy prefetching on BARNES on a multiprocessor (N = no prefetching, G = compiler-inserted greedy prefetching, H = hand-inserted prefetching): (a) normalized execution time, broken into instructions, synchronization, and memory stalls; (b) coverage factor (nop_miss, p_miss, p_hit as a percentage of original D-cache misses); (c) unnecessary prefetches (percentage of prefetches that hit in the D-cache).

The compiler's greedy strategy for inserting prefetches is quite similar to what was done by hand, with the following exception. In an effort to minimize unnecessary prefetches, the compiler's default strategy is to prefetch only the first 64 bytes within a given RDS node. In the case of BARNES, the nodes are longer than 64 bytes, and we discovered that hand-inserted prefetching achieves better performance when we prefetch the entire nodes. In this case, the improved miss coverage of prefetching the entire nodes is worth the additional unnecessary prefetches, thereby resulting in a 1% speedup over compiler-inserted prefetching. Overall, however, we are quite pleased that the compiler was able to do this well, nearly matching the best performance that we could achieve by hand.
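As the discussion above notes, prefetching an entire large node rather than only its first 64 bytes improved coverage for BARNES. A minimal sketch of issuing one prefetch per cache line of a node, assuming the 16-byte lines of the simulated machine and a GCC-style prefetch intrinsic (not code from the paper):

#include <stddef.h>

#define LINE_SIZE 16   /* cache line size of the simulated multiprocessor */

/* Prefetch every cache line that a node occupies, not just its first bytes.
   For nodes larger than one line, this trades a few extra prefetches for
   better miss coverage. */
static inline void prefetch_node(const void *node, size_t node_size) {
    const char *p = (const char *)node;
    for (size_t off = 0; off < node_size; off += LINE_SIZE)
        __builtin_prefetch(p + off);
}

/* Example use at a traversal site: prefetch_node(c->child[i], sizeof(cell)); */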
VI. Related Work

Although prefetching has been studied extensively for array-based numeric codes [6], [18], relatively little work has been done on non-numeric applications. Chen et al. [19] used global instruction scheduling techniques to move address generation back as early as possible to hide a small cache miss latency (10 cycles), and found mixed results. In contrast, our algorithms focus only on RDS accesses, and can issue prefetches much earlier (across procedure and loop iteration boundaries) by overcoming the pointer-chasing problem. Zhang and Torrellas [20] proposed a hardware-assisted scheme for prefetching irregular applications in shared-memory multiprocessors. Under their scheme, programs are annotated to bind together groups of data (e.g., fields in a record or two records linked by a pointer), which are then prefetched under hardware control. Compared with our compiler-based approach, their scheme has two shortcomings: (i) annotations are inserted manually, and (ii) their hardware extensions are not likely to be applicable in uniprocessors. Joseph and Grunwald [21] proposed a hardware-based Markov prefetching scheme which prefetches multiple predicted addresses upon a primary cache miss. While Markov prefetching can potentially handle chaotic miss patterns, it requires considerably more hardware support and has less flexibility in selecting what to prefetch and controlling the prefetch distance than our compiler-based schemes.

To our knowledge, the only compiler-based pointer prefetching scheme in the literature is the SPAID scheme proposed by Lipasti et al. [22]. Based on an observation that procedures are likely to dereference any pointers passed to them as arguments, SPAID inserts prefetches for the objects pointed to by these pointer arguments at the call sites. Therefore this scheme is only effective if the interval between the start of a procedure call and its dereference of a pointer is comparable to the cache miss latency. In an earlier study [8], we found that greedy prefetching offers substantially better performance than SPAID by hiding more latency while paying less overhead.

VII. Conclusions

While automatic compiler-inserted prefetching has shown considerable success in hiding the memory latency of array-based codes, the compiler technology for successfully prefetching pointer-based data structures has thus far been lacking.

In this paper, we propose three prefetching schemes which overcome the pointer-chasing problem, we automate the most widely applicable scheme (greedy prefetching) in the compiler, and we evaluate its performance on both a modern superscalar uniprocessor (similar to the MIPS R10000) and on a large-scale shared-memory multiprocessor. Our uniprocessor experiments show that automatic compiler-inserted prefetching can accelerate pointer-based applications by as much as 45%. In addition, the more sophisticated algorithms (which we currently simulate by hand) can offer even larger performance gains. Our multiprocessor experiments demonstrate that the compiler can potentially provide equivalent performance to hand-inserted prefetching even on parallel applications. These encouraging results suggest that the latency problem for pointer-based codes may be addressed largely through the prefetch instructions that already exist in many recent microprocessors.

Acknowledgments

This work is supported by a grant from IBM Canada's Centre for Advanced Studies. Chi-Keung Luk is partially supported by a Canadian Commonwealth Fellowship. Todd C. Mowry is partially supported by a Faculty Development Award from IBM.

References

[1] D. Callahan, K. Kennedy, and A. Porterfield, "Software prefetching," in Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991, pp. 40-52.
[2] T. C. Mowry, Tolerating Latency Through Software-Controlled Data Prefetching, Ph.D. thesis, Stanford University, March 1994.
[3] D. Bernstein, D. Cohen, A. Freund, and D. E. Maydan, "Compiler techniques for data prefetching on the PowerPC," in Proceedings of the 1995 International Conference on Parallel Architectures and Compilation Techniques, June 1995, pp. 19-26.
[4] V. Santhanam, E. Gornish, and W.-C. Hsu, "Data prefetching on the HP PA-8000," in Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997, pp. 264-273.
[5] K. Yeager, "The MIPS R10000 superscalar microprocessor," IEEE Micro, pp. 28-41, April 1996.
[6] T. C. Mowry, M. S. Lam, and A. Gupta, "Design and evaluation of a compiler algorithm for prefetching," in Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992, pp. 62-73.
[7] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy, "SUIF: An infrastructure for research on parallelizing and optimizing compilers," ACM SIGPLAN Notices, vol. 29, no. 12, pp. 31-37, Dec. 1994.
[8] C.-K. Luk and T. C. Mowry, "Compiler-based prefetching for recursive data structures," in Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996, pp. 222-233.
[9] M. Emami, R. Ghiya, and L. J. Hendren, "Context-sensitive interprocedural points-to analysis in the presence of function pointers," in Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994, pp. 242-256.
[10] W. Landi, B. G. Ryder, and S. Zhang, "Interprocedural modification side effect analysis with pointer aliasing," in Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, June 1993, pp. 56-67.
[11] C.-K. Luk, Optimizing the Cache Performance of Non-Numeric Applications, Ph.D. thesis, Department of Computer Science, University of Toronto, forthcoming.
[12] A. Rogers, M. Carlisle, J. Reppy, and L. Hendren, "Supporting dynamic data structures on distributed memory machines," ACM Transactions on Programming Languages and Systems, vol. 17, no. 2, pp. 233-263, March 1995.
[13] M. D. Smith, "Tracing with pixie," Tech. Rep. CSL-TR-91-497, Stanford University, November 1991.
[14] C. J. Stephenson, "Fast fits," in Proceedings of the 9th ACM Symposium on Operating Systems Principles, October 1983, pp. 30-32.
[15] T. C. Mowry, "Tolerating latency in multiprocessors through compiler-inserted prefetching," ACM Transactions on Computer Systems, vol. 16, no. 1, pp. 55-92, 1998.
[16] J. P. Singh, W.-D. Weber, and A. Gupta, "SPLASH: Stanford parallel applications for shared memory," Tech. Rep. CSL-TR-91-469, Stanford University, April 1991.
[17] D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam, "The Stanford DASH multiprocessor," IEEE Computer, vol. 25, no. 3, pp. 63-79, March 1992.
[18] J.-L. Baer and T.-F. Chen, "An effective on-chip preloading scheme to reduce data access penalty," in Proceedings of Supercomputing '91, 1991, pp. 176-186.
[19] W. Y. Chen, S. A. Mahlke, P. P. Chang, and W. W. Hwu, "Data access microarchitectures for superscalar processors with compiler-assisted data prefetching," in Proceedings of the 24th Annual ACM/IEEE International Symposium on Microarchitecture, 1991, pp. 69-73.
[20] Z. Zhang and J. Torrellas, "Speeding up irregular applications in shared-memory multiprocessors: Memory binding and group prefetching," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pp. 188-200.
[21] D. Joseph and D. Grunwald, "Prefetching using Markov predictors," in Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997, pp. 252-263.
[22] M. H. Lipasti, W. J. Schmidt, S. R. Kunkel, and R. R. Roediger, "SPAID: Software prefetching in pointer- and call-intensive environments," in Proceedings of the 28th Annual ACM/IEEE International Symposium on Microarchitecture, 1995, pp. 231-236.

Chi-Keung Luk is a Ph.D. candidate in the Department of Computer Science at the University of Toronto, and is currently a visiting scholar at Carnegie Mellon University. He received his B.Sc. (First Class Honors) and M.Phil. degrees in computer science, both from The Chinese University of Hong Kong. His research interests are computer architecture, compiler optimizations, and programming languages, with a focus on the memory performance of non-numeric applications. He has been awarded a Canadian Commonwealth Fellowship, an IBM CAS Fellowship, and a Croucher Foundation Fellowship.

Todd C. Mowry received his B.S.E.E. from the University of Virginia in 1988, and his M.S.E.E. and Ph.D. from Stanford University in 1989 and 1994, respectively. From 1994 through 1997, he was an assistant professor in the Department of Electrical and Computer Engineering and the Department of Computer Science at the University of Toronto. Since 1997, he has been an associate professor in the Computer Science Department at Carnegie Mellon University. Dr. Mowry's research interests span architecture, compilers, and operating systems. Most recently, he has been focusing on automatically tolerating the latency of accessing and communicating data, and on automatically extracting thread-level parallelism from non-numeric applications.


Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic Fall 2011 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Reading: Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000) If memory

More information

LEGEND. Cattail Sawgrass-Cattail Mixture Sawgrass-Slough. Everglades. National. Park. Area of Enlargement. Lake Okeechobee

LEGEND. Cattail Sawgrass-Cattail Mixture Sawgrass-Slough. Everglades. National. Park. Area of Enlargement. Lake Okeechobee An Ecient Parallel Implementation of the Everlades Landscape Fire Model Usin Checkpointin Fusen He and Jie Wu Department of Computer Science and Enineerin Florida Atlantic University Boca Raton, FL 33431

More information

Low cost concurrent error masking using approximate logic circuits

Low cost concurrent error masking using approximate logic circuits 1 Low cost concurrent error maskin usin approximate loic circuits Mihir R. Choudhury, Member, IEEE and Kartik Mohanram, Member, IEEE Abstract With technoloy scalin, loical errors arisin due to sinle-event

More information

Motivation Dynamic bindin facilitates more exible and extensible software architectures, e.., { Not all desin decisions need to be known durin the ini

Motivation Dynamic bindin facilitates more exible and extensible software architectures, e.., { Not all desin decisions need to be known durin the ini The C++ Prorammin Lanuae Dynamic Bindin Outline Motivation Dynamic vs. Static Bindin Shape Example Callin Mechanisms Downcastin Run-Time Type Identication Summary Motivation When desinin a system it is

More information

Minimum-Cost Multicast Routing for Multi-Layered Multimedia Distribution

Minimum-Cost Multicast Routing for Multi-Layered Multimedia Distribution Minimum-Cost Multicast Routin for Multi-Layered Multimedia Distribution Hsu-Chen Chen and Frank Yeon-Sun Lin Department of Information Manaement, National Taiwan University 50, Lane 144, Keelun Rd., Sec.4,

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Compressing Heap Data for Improved Memory Performance

Compressing Heap Data for Improved Memory Performance Compressing Heap Data for Improved Memory Performance Youtao Zhang Rajiv Gupta Department of Computer Science Department of Computer Science The University of Texas at Dallas The University of Arizona

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N.

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N. Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses Onur Mutlu Hyesoon Kim Yale N. Patt High Performance Systems Group Department of Electrical

More information

Client Host. Server Host. Registry. Client Object. Server. Remote Object. Stub. Skeleton

Client Host. Server Host. Registry. Client Object. Server. Remote Object. Stub. Skeleton Exploitin implicit loop parallelism usin multiple multithreaded servers in Java Fabian Bre Aart Bik Dennis Gannon December 16, 1997 1 Introduction Since its introduction in the late eihties, the lobal

More information

task object task queue

task object task queue Optimizations for Parallel Computing Using Data Access Information Martin C. Rinard Department of Computer Science University of California, Santa Barbara Santa Barbara, California 9316 martin@cs.ucsb.edu

More information

The Gene Expression Messy Genetic Algorithm For. Hillol Kargupta & Kevin Buescher. Computational Science Methods division

The Gene Expression Messy Genetic Algorithm For. Hillol Kargupta & Kevin Buescher. Computational Science Methods division The Gene Expression Messy Genetic Alorithm For Financial Applications Hillol Karupta & Kevin Buescher Computational Science Methods division Los Alamos National Laboratory Los Alamos, NM, USA. Abstract

More information

Compiler-Based Prefetching for Recursive Data Structures

Compiler-Based Prefetching for Recursive Data Structures Compiler-Based Prefetching for Recursive Data Structures Chi-Keung Luk and Todd C. Mowry Department of Computer Science Department of Electrical and Computer Engineering University of Toronto Toronto,

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Ecient and Precise Modeling of Exceptions for the Analysis of Java Programs. IBM Research. Thomas J. Watson Research Center

Ecient and Precise Modeling of Exceptions for the Analysis of Java Programs. IBM Research. Thomas J. Watson Research Center Ecient and Precise Modelin of Exceptions for the Analysis of Java Prorams Jon-Deok Choi David Grove Michael Hind Vivek Sarkar IBM Research Thomas J. Watson Research Center P.O. Box 704, Yorktown Heihts,

More information

Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy

Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2006 Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy Sheela A. Doshi Louisiana State University and Agricultural

More information

in two important ways. First, because each processor processes lare disk-resident datasets, the volume of the communication durin the lobal reduction

in two important ways. First, because each processor processes lare disk-resident datasets, the volume of the communication durin the lobal reduction Compiler and Runtime Analysis for Ecient Communication in Data Intensive Applications Renato Ferreira Gaan Arawal y Joel Saltz Department of Computer Science University of Maryland, Collee Park MD 20742

More information

Reducing Network Cost of Many-to-Many Communication in Unidirectional WDM Rings with Network Coding

Reducing Network Cost of Many-to-Many Communication in Unidirectional WDM Rings with Network Coding 1 Reducin Network Cost of Many-to-Many Communication in Unidirectional WDM Rins with Network Codin Lon Lon and Ahmed E. Kamal, Senior Member, IEEE Abstract In this paper we address the problem of traffic

More information

Compiler Techniques for Energy Saving in Instruction Caches. of Speculative Parallel Microarchitectures.

Compiler Techniques for Energy Saving in Instruction Caches. of Speculative Parallel Microarchitectures. Compiler Techniques for Energy Saving in Instruction Caches of Speculative Parallel Microarchitectures Seon Wook Kim Rudolf Eigenmann School of Electrical and Computer Engineering Purdue University, West

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES

CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES CACHE-CONSCIOUS ALLOCATION OF POINTER- BASED DATA STRUCTURES Angad Kataria, Simran Khurana Student,Department Of Information Technology Dronacharya College Of Engineering,Gurgaon Abstract- Hardware trends

More information

Improving Index Performance through Prefetching

Improving Index Performance through Prefetching Improving Index Performance through Prefetching Shimin Chen School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 chensm@cs.cmu.edu Phillip B. Gibbons Information Sciences Research

More information

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

optimization agents user interface agents database agents program interface agents

optimization agents user interface agents database agents program interface agents A MULTIAGENT SIMULATION OPTIMIZATION SYSTEM Sven Hader Department of Computer Science Chemnitz University of Technoloy D-09107 Chemnitz, Germany E-Mail: sha@informatik.tu-chemnitz.de KEYWORDS simulation

More information

Topics on Compilers Spring Semester Christine Wagner 2011/04/13

Topics on Compilers Spring Semester Christine Wagner 2011/04/13 Topics on Compilers Spring Semester 2011 Christine Wagner 2011/04/13 Availability of multicore processors Parallelization of sequential programs for performance improvement Manual code parallelization:

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Linear Network Coding

Linear Network Coding IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 2, FEBRUARY 2003 371 Linear Network Codin Shuo-Yen Robert Li, Senior Member, IEEE, Raymond W. Yeun, Fellow, IEEE, Nin Cai Abstract Consider a communication

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Runtime Biased Pointer Reuse Analysis and Its Application to Energy Efficiency

Runtime Biased Pointer Reuse Analysis and Its Application to Energy Efficiency Runtime Biased Pointer Reuse Analysis and Its Application to Energy Efficiency Yao Guo, Saurabh Chheda, Csaba Andras Moritz Department of Electrical and Computer Engineering University of Massachusetts,

More information

Chapter 5 THE MODULE FOR DETERMINING AN OBJECT S TRUE GRAY LEVELS

Chapter 5 THE MODULE FOR DETERMINING AN OBJECT S TRUE GRAY LEVELS Qian u Chapter 5. Determinin an Object s True Gray evels 3 Chapter 5 THE MODUE OR DETERMNNG AN OJECT S TRUE GRAY EVES This chapter discusses the module for determinin an object s true ray levels. To compute

More information

main Entry main main pow Entry pow pow

main Entry main main pow Entry pow pow Interprocedural Path Prolin David Melski and Thomas Reps Computer Sciences Department, University of Wisconsin, 20 West Dayton Street, Madison, WI, 53706, USA, fmelski, reps@cs.wisc.edu Abstract. In path

More information

Improving Computer Security using Extended Static Checking

Improving Computer Security using Extended Static Checking Improvin Computer Security usin Extended Static Checkin Brian V. Chess Department o Computer Enineerin University o Caliornia, Santa Cruz Abstract We describe a method or indin security laws in source

More information

Data-flow-based Testing of Object-Oriented Libraries

Data-flow-based Testing of Object-Oriented Libraries Data-flow-based Testin of Object-Oriented Libraries Ramkrishna Chatterjee Oracle Corporation One Oracle Drive Nashua, NH 03062, USA ph: +1 603 897 3515 Ramkrishna.Chatterjee@oracle.com Barbara G. Ryder

More information

10. SOPC Builder Component Development Walkthrough

10. SOPC Builder Component Development Walkthrough 10. SOPC Builder Component Development Walkthrough QII54007-9.0.0 Introduction This chapter describes the parts o a custom SOPC Builder component and guides you through the process o creating an example

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

General Design of Grid-based Data Replication. Schemes Using Graphs and a Few Rules. availability of read and write operations they oer and

General Design of Grid-based Data Replication. Schemes Using Graphs and a Few Rules. availability of read and write operations they oer and General esin of Grid-based ata Replication Schemes Usin Graphs and a Few Rules Oliver Theel University of California epartment of Computer Science Riverside, C 92521-0304, US bstract Grid-based data replication

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Data Flow Analysis for Software Prefetching Linked Data Structures in Java

Data Flow Analysis for Software Prefetching Linked Data Structures in Java Data Flow Analysis for Software Prefetching Linked Data Structures in Java Brendon Cahoon and Kathryn S. McKinley Department of Computer Science University of Massachusetts Amherst, MA 01002 cahoon,mckinley

More information