Prefetching in an Intelligent Memory Architecture Using a Helper Thread

Yan Solihin, Jaejin Lee, and Josep Torrellas
University of Illinois at Urbana-Champaign
Michigan State University
jlee@cse.msu.edu

Abstract

Data prefetching is a popular technique for tolerating long memory access latencies. In this paper, we introduce a novel type of prefetching: memory-side correlation prefetching implemented in a user-level thread. The prefetching thread runs on a general-purpose processor embedded in the main memory. By allocating the correlation table in main memory, we can afford the large space required by the table. In addition, the scheme can be supported with few modifications to the L2 cache and no modification to the main processor core. We introduce a new organization of the correlation table and a new prefetching algorithm that enable fast and accurate far-ahead prefetching with high coverage. Overall, our evaluation shows that the algorithm effectively prefetches irregular applications, speeding up three applications by an average of [...]. Furthermore, our scheme can work synergistically with a conventional processor-side prefetcher to deliver an average speedup of [...].

1 Introduction

Data prefetching is a popular technique to tolerate long memory access latencies. There have been many proposals using a helper thread to help prefetching for the main thread, such as [12, 15]. These proposals have focused on either SMT or CMP platforms. In this paper, we propose a prefetching thread scheme that is suitable for implementation in an Intelligent Memory Architecture (IMA). In IMA, the memory system is augmented with one or more memory processors.

The nature of the problems in IMA is quite different than in SMT or CMP platforms. First, in SMT/CMP, Processor-Side prefetching is used, while in IMA, Memory-Side prefetching is used, because prefetch requests are generated by the processor in the main memory. Secondly, communication between the threads is cheap in SMT/CMP, while it is expensive in IMA.
Thus, a suitable prefetching scheme is one that operates autonomously and that can be effective with coarse-grain communication between the prefetching and the main threads. In this work, we implement the prefetcher as a user-level thread that can prefetch irregular applications effectively using correlation prefetching algorithms. The only communication needed by the prefetching thread is the miss address stream of the main thread.

This work was supported in part by the National Science Foundation under grants CCR [...], EIA [...], and EIA [...], by DARPA under grant F [...] C-0078, and by NCSA, Michigan State University, and gifts from IBM and Intel.

Memory-side prefetching is attractive for several reasons. First, it eliminates the overheads that prefetch requests and state bookkeeping introduce in the paths between the main processor and its caches. Secondly, it can be supported with very few modifications to the L2 cache and no modification to the main processor core. Thirdly, the prefetcher can exploit its proximity to the memory to its advantage. Memory-side prefetching has the additional attraction of riding the technology trend of increased chip integration. Indeed, popular platforms like PCs are being equipped with graphics engines in the memory system [16]. Some chipsets, like NVIDIA's nForce [13], even integrate a powerful processor in the North Bridge chip. Similar engines can be provided for prefetching, or existing graphics processors can be reused for prefetching when under-utilized. Moreover, there are proposals to integrate processing logic in DRAM chips, such as IRAM [8].

Using an engine for memory-side prefetching has been proposed elsewhere [1, 2, 4, 13, 14, 16, 18]. However, in most cases, these engines perform either very simple operations or highly-specific operations, such as prefetching linked data structures [4, 18]. Instead, what we would like is a very flexible, general-purpose prefetcher.

While a memory-side prefetcher can support a variety of prefetching algorithms, one type that is particularly suitable is Correlation Prefetching [1, 3, 5, 11]. Correlation prefetching relies on the correlation of miss addresses to predict and prefetch future misses based on the current state.
Because the only information the prefetch thread needs is the miss address stream, correlation prefetching is suitable for an IMA platform.

In the past, general correlation prefetching has been supported by hardware controllers that require a large dedicated hardware table structure [1, 3, 5, 11]. In all but one case, these controllers have been placed between the L1 and L2 caches or between the L1 and the processor. While effective, the approach has a very high hardware cost. Furthermore, it does not prefetch far enough and tends to have low coverage.

This paper introduces a novel prefetching scheme where memory-side correlation prefetching algorithms are implemented in software by using a user-level thread. The algorithms run on a general-purpose processor in the main memory system. The scheme allows prefetching algorithms to evolve with the applications, even after the computer system is shipped. In addition, the system can be supported with few modifications to the L2 cache, and no modifications to the main processor core. We introduce a new organization of the correlation table and a new correlation prefetching algorithm that enable fast and far-ahead prefetching, with high coverage and accuracy. By allocating the correlation table in main memory, we can afford the large space required by the table. We demonstrate that the software algorithm can effectively prefetch data for irregular applications. Indeed, our scheme speeds up three SPECInt2000 applications by an average of [...]. We also show that our scheme can work synergistically with a conventional processor-side prefetcher to deliver an average speedup of [...].

The rest of the paper is organized as follows: Section 2 discusses memory-side prefetching and correlation prefetching; Section 3 presents our design; Section 4 discusses our evaluation setup; Section 5 evaluates our design; and Section 6 concludes.

2 Related Issues

2.1 Memory-Side Prefetching

Memory-Side prefetching occurs when prefetching is initiated by one or a set of engines that reside in or beside the main memory (definitely beyond any memory bus). Chip manufacturers have integrated hardwired controllers that probably recognize very simple sequences like strides, such as NVIDIA's DASP engine in the North Bridge chip [13] and Intel's prefetch cache in its i860 chipset.

In this paper, we propose to use a simple general-purpose memory processor for memory-side prefetching. Although this idea is applicable to a generic memory system, we will illustrate it on a PC-like memory system depicted in Figure 1-(a). The memory processor can be placed in several places, such as in the North Bridge (Memory Controller) chip (1), or in the DRAM chips (2).
The advantages of the first case are that it is simple to support, because the DRAM interface is not modified, and that the memory processor can be employed for other uses, such as a graphics engine. The second case, although more complicated to support, has the advantage of lower memory access latency and higher memory bandwidth due to higher integration. In this paper, we study the performance potential of the DRAM case.

Memory- and processor-side prefetching are not the same as Push and Pull (or on-demand) prefetching [18], respectively. Push prefetch occurs when prefetched data is sent to a cache or processor that has not requested it, while pull prefetch is the opposite. Clearly, a memory prefetcher can act as a pull prefetcher, by simply storing the prefetched data in a local buffer and supplying it to the processor on demand. In general, however, memory-side prefetching is most interesting when it performs push prefetching to the caches of the processor, because it can hide a larger fraction of memory access latency.

[Figure 1: Architecture of the system (a), and actions of the prefetches (b). Panel (a) shows the CPU, L1 cache, L2 cache, North Bridge chip (1), and DRAM memory (2). Panel (b) shows three steps between the main processor, the memory processor, and memory: 1: miss on line i; 2: lookup; 3: prefetch of lines j, k.]

In our system, the memory processor observes the requests from the main processor that reach main memory. Based on them, and after examining some internal state, the memory processor prefetches other lines that it expects the main processor to need in the future (Figure 1-(b)).

In this paper, we concentrate on push prefetching into the L2 cache. Since the memory processor only sees L2 cache miss streams, it aims to eliminate L2 cache misses by pushing the prefetched data into the L2 cache. The L2 cache miss penalty is the largest component of memory access latency, and it is the hardest to hide, even by an out-of-order processor.

Our scheme is inexpensive to support. The main processor core does not need to be modified at all. The L2 cache needs to have the following supports. First, as in many other systems [4, 7], the L2 cache controller has to be able to accept lines from the memory system that it has not requested. To do so, the L2 has to assign unused Miss Status Handling Registers (MSHRs) [10] to such lines.
Secondly, if the L2 has a pending request for the same line when a prefetch arrives, the prefetch simply steals the MSHR and updates the cache as if it were the reply. Finally, a prefetched line arriving at L2 is dropped in the following cases: the L2 cache already has a copy of the line, the write-back queue has a copy of the line because the L2 is trying to write it back to memory, all MSHRs are full, or all the lines in the set where the prefetched line wants to go are in pending state.

2.2 Correlation Prefetching

Correlation Prefetching uses the current state of the reference or miss stream to predict and prefetch future misses. Two popular correlation schemes are the Stride-Based and Pair-Based schemes. The former tries to find stride patterns in the miss stream and prefetch all the locations that would be accessed if the pattern continues in the future. The latter tries to identify correlation between pairs of misses, for example between a miss and its immediate successor. It basically records a sequence of miss addresses in a table, and later when it encounters the head of the sequence, it looks up the table and prefetches the rest of the sequence.

What makes pair-based schemes attractive is their general applicability, i.e. they work for any miss sequences that repeat. This is true for regular applications and for a wide range of irregular applications such as those that operate on sparse matrices and linked data structures. Furthermore, the schemes can be employed without any compiler support or changes in the application binaries.

Pair-based correlation prefetching has only been studied using hardware implementations of prefetch engines [1, 3, 5, 11, 17], usually by placing the engine between the L1 and L2 cache [3, 5, 11, 17]. These studies have demonstrated the applicability of pair-based correlation prefetching on a wide variety of applications. However, they also reveal shortcomings of the approach. One critical problem is that to be effective, it needs a large storage space to match the footprints of the applications. One and two Megabytes of dedicated on-chip SRAM tables have been proposed [5, 11], while some applications with larger footprints even need a 7.6 MB off-chip SRAM table [11].

Furthermore, it does not prefetch far enough and has low coverage (unless it is tightly coupled to the main processor and uses more fine grain information [11]). For example, for each miss, Joseph and Grunwald only store immediate successors [5]. The coverage is low because it needs one miss to trigger the prefetcher to prefetch the successor of the miss. At best only half of the misses can be eliminated. This scheme uses a wide table that stores many successors per miss and continuously rebuilds the table to increase the coverage. However, it causes excessive useless prefetches.

3 Proposed Scheme

Pair-based correlation prefetching is suitable for our memory-side prefetching system to support because it has general applicability and can be supported inexpensively. We show that shortcomings of the current correlation prefetching schemes can be eliminated by improving the correlation algorithms and implementing them in software.
The algorithms described are implemented in a prefetching thread running on the memory processor. The code for the prefetching thread is written in C and hand-optimized for minimal prefetch response and occupancy time. In the following sections, we discuss the concepts (Section 3.1), the architecture (Section 3.2), pair-based correlation prefetching algorithms (Section 3.3), and conventional processor-side prefetching (Section 3.4).

3.1 Concepts

Prefetching algorithms are implemented as a user-level helper thread that we call the prefetching thread. The actions of the memory processor are determined by the behavior of the prefetching thread that we implement. The operation of the prefetching thread can be conceptually divided into two phases: learning and prefetching. In the learning phase, the prefetching thread records the L2 read and write miss patterns that it observes in a correlation table, one miss at a time. In the prefetching phase, every time that the prefetching thread sees a miss, it looks up the correlation table and prefetches several memory lines to the L2 cache of the main processor. No action is taken on a write-back memory access. In practice, as in [5], we found that combining the learning and prefetching phases enables the correlation table to quickly learn new patterns and provides the best performance in most cases (Figure 2).

[Figure 2: Timing of the prefetching thread. Starting when a miss address becomes available, the prefetching phase runs until the prefetch addresses are available (response time); the learning phase follows, and the occupancy time ends when the handler finishes processing.]

The prefetching algorithm can be characterized by its response time and occupancy time (Figure 2). The response time is defined as the time beginning when the prefetching thread obtains a miss address until the prefetching thread produces the prefetch addresses. The occupancy time is the time the prefetching thread is busy and cannot process another miss address. As can be seen in the figure, the prefetching phase is always executed before the learning phase to minimize the response time. For the software implementation to be viable, the occupancy time has to be smaller than the average time between two consecutive L2 cache misses. Also, for best performance, the response time needs to be as small as possible.
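The ordering of the two phases and the viability condition can be illustrated with a small sketch. This is not the paper's C handler: it is a minimal Python illustration that assumes a dictionary-based table with one level of successors, and the names `handle_miss`, `is_viable`, and `NUM_SUCC` are hypothetical.

```python
NUM_SUCC = 2  # successors kept per row (illustrative value, an assumption)


def handle_miss(miss, table, last_miss, issue_prefetch):
    """Process one observed L2 miss and return the new 'last miss'."""
    # Prefetching phase runs first, so the prefetch addresses are produced
    # as early as possible (this is what bounds the response time).
    for addr in table.get(miss, []):
        issue_prefetch(addr)

    # Learning phase runs afterwards; it adds to occupancy time only.
    if last_miss is not None:
        successors = table.setdefault(last_miss, [])
        if miss in successors:
            successors.remove(miss)
        successors.insert(0, miss)   # most recently used first
        del successors[NUM_SUCC:]    # evict the least recently used
    return miss


def is_viable(occupancy_cycles, miss_gaps_cycles):
    """Occupancy must stay below the average gap between L2 misses."""
    average_gap = sum(miss_gaps_cycles) / len(miss_gaps_cycles)
    return occupancy_cycles < average_gap
```

With the repeating miss trace 1, 2, 3, 1, 2, 3, the first pass only learns, and the second pass issues prefetches for 2, 3, and 1 in that order.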
By using a prefetching thread that stores the correlation table in the main memory, we eliminate the high hardware cost required by the table in the traditional implementation. We further address the inadequacies of traditional correlation prefetching, namely low prefetching coverage, and not prefetching far enough, by improving the correlation algorithms (Section 3.3).

3.2 Architecture of the System

When we integrate the memory processor in the DRAM chips, the DRAM chips and possibly the DRAM interface need to be modified. Extra complexities in handling multiple DRAM chips must also be addressed. Our goal in this paper is to study the performance potential of this case. Consequently, we abstract away the implementation complexity of integrating the processor in the DRAM by assuming a single chip main memory with a single memory processor in it (Figure 3).

[Figure 3: Microarchitecture of a DRAM chip that includes a memory processor used for correlation prefetching. The figure shows the main processor, the North Bridge chip (bus interface, memory controller, other units), and the DRAM chip (memory processor with its cache, row decoder, DRAM cells, row buffer), connected through queues 1 to 4.]

The key communication occurs through queues 1, 2, and 3. Miss requests from the main processor are deposited in queues 1 and then in 2. In the learning phase, the memory processor uses the entries in queue 2 to build its state. In the prefetching phase, the memory processor uses the entries in queue 2 and its state to generate addresses to prefetch. The lines prefetched are deposited in queue 3. If the memory processor suffers a cache miss on its correlation table structure, it accesses the DRAM directly. Queue 4 is in the replying path from memory to the main processor.

3.3 Pair-Based Correlation Algorithms

We now discuss the pair-based correlation prefetching algorithms. We consider two different organizations for the correlation table: a basic one that does not allow data replication and a more advanced one that allows replication. Their use gives rise to different algorithms. We consider them in turn.

Pair-Based Algorithms with Basic Table Organization

Each row in this table stores the tag of the miss address, and the addresses of a set of immediate successor misses stored in MRU order. We consider two algorithms that use this basic organization: Base and Chain.

Base follows the scheme proposed by Joseph and Grunwald [5]. For any given miss, Base is only interested in prefetching immediate successor misses. The parameters of the algorithm are the number of immediate successors predicted (NumSucc), the number of misses that the correlation table can store predictions for (NumRows), and the associativity of the correlation table (Assoc).

Base is illustrated in Figure 4-(a). It shows two snapshots of the correlation table at the point that the corresponding miss trace has been consumed (i and ii). In the example, NumSucc is 2, NumRows is 4, and Assoc is 1. Within a row, successors are replaced using an LRU replacement policy. As in Joseph and Grunwald's study [5], we find that an LRU replacement policy for the successors in each row works best. The figures show the successors in MRU order from left to right. In the learning phase, the processor keeps a pointer to the row of the last miss observed. When a miss occurs, its address is placed as one of the immediate successors of the last miss, and a new row is allocated for the new miss unless an entry for the address already exists.
In the prefetching phase (iii), when a miss is observed, the processor finds the corresponding row and prefetches all the NumSucc immediate successors, starting from the MRU one.

Since Base only prefetches immediate successors, its coverage and latency hiding capabilities are limited. To improve this, we propose the Chain algorithm, which for every miss prefetches multiple levels of successors. The algorithm takes one extra parameter called NumLevels, which is the number of levels of successors prefetched. The algorithm is illustrated in Figure 4-(b). In the learning phase, Chain is identical to Base (i and ii). However, Chain does more work in the prefetching phase (iii). After prefetching the row of immediate successors, it takes the most recently used successor among them and indexes the correlation table with its address. If the entry is found, it prefetches all NumSucc successors there. Then, it takes the most recently used successor in that row and repeats the process NumLevels-1 times. As an example, suppose that a miss occurs on a line that has a row in the table (iii). The memory processor first prefetches the successors stored in that row. Then, it takes the MRU entry among them, looks up the table, and prefetches that entry's successors.

While improving the coverage and far-ahead prefetching capability over Base, Chain has two limitations. One limitation is that the response time of the algorithm is high. To issue prefetches in response to a miss, it needs to make NumLevels accesses to different rows in the table, each possibly involving a low-associative search and potentially causing a cache miss. The second limitation is that it does not prefetch the correct MRU successors of each level of successors. Instead, it only prefetches successors found along the MRU path.

Pair-Based Algorithms with Replicated Table Organization

Each row in this table stores the tag of the miss address, and NumLevels levels of successors. Each level contains NumSucc addresses, which are MRU-ordered. We propose a new algorithm called Replicated that exploits this table organization. Replicated takes the same parameters as Chain. In the learning phase, NumLevels pointers to the table are kept for efficient access, pointing to the rows for the address of the last miss, second last, and so on.
When a miss occurs, its address is recorded in the correct position of MRU successors of the last few misses by using these pointers. Figure 4-(c) illustrates the algorithm. In the example, NumSucc is 2, NumRows is 4, Assoc is 1, and NumLevels is 2. The figure shows two snapshots of the correlation table in the learning phase at the point where the corresponding miss trace has been consumed (i and ii). The figure also shows the position of the two pointers, and the algorithm in the prefetching phase (iii).

Note that this organization solves the two problems of Chain. First, the response time is much shorter. We can prefetch several levels of successors with a single row access, possibly with only one cache miss. In fact, we shift some computation from the prefetching phase, which is the critical phase, to the learning phase. Now the learning phase needs to update several rows in the table. However, the rows are most likely still in the cache and, since we keep the pointers to the entries of the last few miss addresses, the associative search is avoided. Secondly, by grouping together all the successors from a given level, we can identify the correct MRU successors from that level, yielding higher accuracy.
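The table organization that stores NumLevels levels of successors in each row can be sketched as follows, next to a Chain-style lookup on a basic table for comparison. This is a hedged Python illustration, not the authors' C code: the class and function names are hypothetical, rows live in an unbounded dictionary (so NumRows, associativity, and row replacement are not modeled), and addresses are plain integers.

```python
from collections import deque


class ReplicatedTable:
    """Each row keeps num_levels MRU-ordered successor lists (a sketch)."""

    def __init__(self, num_succ=2, num_levels=2):
        self.num_succ = num_succ
        self.num_levels = num_levels
        self.rows = {}  # miss address -> list of per-level successor lists
        # Pointers to recent misses: recent[0] is the last miss,
        # recent[1] the second last, and so on.
        self.recent = deque(maxlen=num_levels)

    def prefetch(self, miss):
        """Prefetching phase: one row access yields every level."""
        row = self.rows.get(miss)
        if row is None:
            return []
        return [addr for level in row for addr in level]

    def learn(self, miss):
        """Learning phase: update the rows of the last few misses."""
        for level, prev in enumerate(self.recent):
            successors = self.rows[prev][level]
            if miss in successors:
                successors.remove(miss)
            successors.insert(0, miss)        # MRU position
            del successors[self.num_succ:]    # drop the LRU entry
        self.rows.setdefault(miss, [[] for _ in range(self.num_levels)])
        self.recent.appendleft(miss)


def chain_prefetch(base_rows, miss, num_levels):
    """Chain on a basic table: one row lookup per level, along the MRU path."""
    prefetches, current = [], miss
    for _ in range(num_levels):
        successors = base_rows.get(current)
        if not successors:
            break
        prefetches.extend(successors)
        current = successors[0]  # follow the most recently used successor
    return prefetches
```

After learning the trace 1, 2, 3, 1, 4, 3, a later miss on 1 returns 4, 2, 3 from a single row in the replicated table, whereas `chain_prefetch` needs two row lookups to produce the same list from the level-one successors.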

[Figure 4: Pair-based correlation algorithms: Base (a), Chain (b), and Replicated (c). Each panel shows the software correlation table at two points of the miss trace (i and ii) and the prefetches issued on a miss (iii). The examples use NumRows = 4 and NumSucc = 2, with NumLevels = 2 for Chain and Replicated.]

Algorithm Comparison

Table 1 compares the three pair-based schemes. From the table, we see that the Replicated algorithm tries to solve problems in current correlation prefetching algorithms: it looks far ahead by prefetching several levels of successors, thereby improving coverage, while keeping high accuracy by prefetching the correct MRU successors in each level. Its only shortcoming is its high space requirements for the correlation table. Fortunately, this is a minor issue, since the table is allocated in the main memory. The response time is better with the Replicated algorithm than with the Chain algorithm. The handler in Replicated runs very efficiently because cache lines are well utilized. Note that all the correlation algorithms could be implemented in hardware. However, Replicated is very suitable for software implementation because it has low response time, far-ahead prefetching capability, and uses cache lines well.

Characteristics                                          Base    Chain       Replicated
Levels of successors prefetched                          1       NumLevels   NumLevels
Full MRU ordering for each level?                        Yes     No          Yes
Num. row accesses in the prefetching phase (SEARCH)      1       NumLevels   1
Num. row accesses in the learning phase (NO SEARCH)      1       1           NumLevels
Response time                                            Low     High        Low
Space requirement (for constant number of prefetches)    [...]   [...]       [...]

Table 1: Comparing the different pair-based algorithms.

3.4 Conventional Prefetching

Previous studies found that placing a stride-based prefetcher as a front end of a pair-based prefetcher makes pair-based prefetching more effective [3, 17]. We exploit this finding by including processor-side prefetching in the form of a hardware multi-stream sequential prefetcher at the L1 cache.
The prefetcher has similar capabilities to stream buffers [6], except that the prefetched lines are put directly in the L1 cache. In our system, we assume that the memory controller can distinguish the prefetches issued by the processor-side prefetcher from regular misses. The memory controller chooses not to pass such prefetches to the memory processor. As a result, in general, the processor-side prefetcher targets the regular misses while the memory-side prefetcher targets the irregular ones.

4 Evaluation Environment

Applications. To evaluate our prefetching scheme, we use three mostly irregular memory-intensive applications from the SPECInt2000 suite. Irregular applications are hardly amenable to compiler-based prefetching. Consequently, they are the obvious target for the type of prefetching that we propose. We choose Gap, Mcf, and Parser. Gap uses a subset of the test input set, Mcf uses the test input set, and Parser uses a subset of the train input set.

Simulation Environment. The evaluation is performed using execution-driven simulation. Our environment is based on an extension to MINT that supports dynamic superscalar processor models with register renaming, branch prediction, and non-blocking memory operations [9]. The architecture modeled is that of a high-end PC with a memory processor that is integrated in the DRAM, following the microarchitecture of Figure 3. Table 2 shows the parameters used for each component of the architecture. The architecture is modeled cycle by cycle, including contention effects. In the simulation, both the application thread and the prefetching thread are run simultaneously. We model the contention between the two threads on memory subsystems that are shared (memory controller, DRAM channels, DRAM banks, etc.). The simulation includes all overheads incurred by running the two threads on different processors.

Main Proc
    6-issue dynamic, 1.6 GHz. Int, fp, ld/st FU: 4, 4, 2.
    Pending ld/st: 8/16. Branch penalty: 12 cycles.
    L1 data: write-back, 16 KB, 2 way, 32-B line, 3-cycle hit RT.
    L2 data: write-back, 512 KB, 4 way, 64-B line, 19-cycle hit RT.
    RT memory latency: 243 cycles (row miss), 208 cycles (row hit).
    Main memory bus: split-transaction, 8-B wide, 400 MHz, 3.2 GB/sec peak.
Mem Proc in DRAM
    2-issue dynamic, 800 MHz. Int, fp, ld/st FU: 2, 2, 1.
    Pending ld/st: 4/4. Branch penalty: 6 cycles.
    L1 data: write-back, 32 KB, 2 way, 32-B line, 4-cycle hit RT.
    RT memory latency: 56 cycles (row miss), 21 cycles (row hit).
    Internal DRAM data bus: 32-B wide, 800 MHz, 25.6 GB/sec.
DRAM parameters
    Dual channel; each channel 2-B wide, 800 MHz; total 3.2 GB/sec peak.
    Random access time (tRAC) 45 ns; from Mem Controller (tSystem) 60 ns.
Other
    Depth of queues 1 through 4: [...]

Table 2: Parameters of the simulated architecture. Latencies correspond to contention-free conditions. RT stands for round-trip from the processor. All cycles are 1.6 GHz cycles. A 512-KB L2 cache is chosen for the main processor because we run small inputs for the applications.

[Figure 5: Characterizing the predictability of misses. Y-axis: % correct prediction; bars for Seq1, Seq4, Base, and Seq4+Base; applications Gap, Mcf, Parser, and their average.]

Algorithm Parameters. Table 3 shows the default parameter values that we use for the algorithms described in Section 3.3. For the Base algorithm, we use values similar to what Joseph and Grunwald used for their system [5] to make the comparison easier. For all the algorithms, we use NumRows = 64K, which results in a table of size 1.3 MBytes, 0.66 MBytes, and 1.8 MBytes for Base, Chain, and Replicated, respectively.
These sizes are very tolerable, since the table is a plain software data structure that is stored in main memory, is dynamically allocated, and is cached by the memory processor.

The conventional prefetching discussed in Section 3.4 takes two parameters: the number of streams it is able to prefetch simultaneously (NumSeq) and the number of prefetches that it issues per miss in a sequence observed (NumPref). We implement this algorithm in hardware in the L1 cache (Conven4) and also in software running on the memory processor (Seq1 and Seq4).

Algorithm                Label      Parameter Values
Base                     Base       NumSucc = 4, Assoc = 4
Chain                    Chain      NumSucc = 2, Assoc = 2, NumLevels = 3
Replicated               Repl       NumSucc = 2, Assoc = 2, NumLevels = 3
Conventional 1-Stream    Seq1       NumSeq = 1, NumPref = 6
Conventional 4-Stream    Seq4       NumSeq = 4, NumPref = 6
Conventional 4-Stream    Conven4    NumSeq = 4, NumPref = 6

Table 3: Parameter values used in the algorithms.

5 Evaluation

To evaluate our prefetching scheme, we first characterize the behavior of applications (Section 5.1) and then compare the performance of different algorithms (Section 5.2).

5.1 Characterizing Application Behavior

For memory-side correlation prefetching to be effective, the miss address streams have to be predictable. In this experiment, we record the fraction of L2 cache misses that are correctly predicted. For a sequential scheme, this means that the upcoming address exactly matches the one predicted, while for a pair-based scheme, the upcoming address matches one of the predicted successors. The thread does not perform prefetching here and it only observes the addresses of all L2 cache misses.

The results are shown in Figure 5. We try stride-based schemes that detect up to one stream (Seq1) and four streams (Seq4), the Base algorithm, and the combination. The figure shows that the miss stream is largely predictable, with Seq4, Base, and Seq4+Base correctly predicting roughly 40%, 70%, and 80% of the misses on average, respectively. However, the predictability of each application differs. For example, Mcf does not have sequential patterns, while Parser has mostly sequential patterns, and Gap is mixed.

Seq4 always outperforms Seq1, indicating that multiple stream support is necessary for a sequential scheme. The figure shows that in all applications, Base is almost as good as the combination Seq4+Base. This is because a correlation table is able to detect both sequential and irregular patterns, as long as the patterns repeat. Once the table learns a pattern, it can predict it effectively. However, it is still beneficial to have a multi-stream sequential prefetcher at the processor-side for several reasons: it does not need learning, it can be cheaply implemented, and it can hide the full memory latency if integrated with the L1 cache. Furthermore, it splits the misses into regular and irregular streams, and by tackling the regular one, it removes some load from the memory prefetcher.

[Figure 6: Characterizing the time between consecutive misses. Y-axis: % of misses (0% to 100%); stacked bins [0,80), [80,200), [200,280), and [280,Infinity) cycles; applications Gap, Mcf, Parser, and their average.]

[Figure 7: Execution time of the different algorithms. Y-axis: normalized execution time; bars for NoPref, Conven4, Base, Chain, Repl, and Conven4+Repl, each broken down into Busy, L1toL2, and PastL2; applications Gap, Mcf, Parser, and their average.]

We now consider the time between misses. Figure 6 classifies the misses according to the number of cycles between two consecutive misses arriving at the memory. The misses are grouped in bins corresponding to [0,80) 1.6 GHz processor cycles, [80,200), etc. The most significant bins in the figure are [200,280), [280,Infinity), and [0,80), which contribute on average to 54%, 28%, and 18% of all miss distances. The misses with distances between 200 and 280 are critical as they are both frequent and hard to hide even with out-of-order processors. Furthermore, since the round-trip memory latency is between 208 and 243 cycles, dependent misses are likely to fall in this bin. This characterization suggests that, to be on the safe side, the occupancy time of the prefetching algorithm should be less than 200 cycles. The [0,80) bin contains misses that may not give enough time for our prefetching thread to respond. Fortunately, these misses are not frequent and are likely to be overlapped with each other or with computation. Thus, they harm the performance much less than the bin size implies.
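The binning of miss distances described above can be reproduced with a few lines of code. The sketch below is only an illustration in Python: the function name `classify_gaps` and the input format (a list of miss arrival times in 1.6 GHz cycles) are assumptions, while the bin boundaries are the ones named in the text.

```python
from bisect import bisect_right

# Bin boundaries in 1.6 GHz processor cycles, as used in the text.
BIN_EDGES = [80, 200, 280]
BIN_LABELS = ["[0,80)", "[80,200)", "[200,280)", "[280,Infinity)"]


def classify_gaps(arrival_cycles):
    """Return the fraction of consecutive-miss distances in each bin."""
    counts = dict.fromkeys(BIN_LABELS, 0)
    for prev, cur in zip(arrival_cycles, arrival_cycles[1:]):
        # bisect_right places a gap equal to an edge in the higher bin,
        # which matches the half-open intervals [low, high).
        counts[BIN_LABELS[bisect_right(BIN_EDGES, cur - prev)]] += 1
    total = sum(counts.values())
    if total == 0:
        return counts
    return {label: n / total for label, n in counts.items()}
```

A handler whose occupancy stays under 200 cycles would keep up with every distance that falls in the two upper bins, which is the safety margin the characterization argues for.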
5.2 Comparing the Different Algorithms

Figure 7 compares the execution time of the applications in different cases: no prefetching (NoPref), hardware processor-side L1 prefetching as shown in Table 3 (Conven4), different software memory-side prefetching schemes as shown in Table 3 (Base, Chain, and Repl), and the combination of Conven4 and Repl (Conven4+Repl). For each application and the average, the bars are normalized to NoPref. They are broken down into miss stall time past the L2 cache (PastL2), miss stall time between the L1 and L2 caches (L1toL2), and the remaining time (Busy) that represents processor computation plus various pipeline stalls.

On average, the PastL2 time is the most significant component of the execution time, contributing about 40%, while Busy and L1toL2 follow with 35% and 25%, respectively. Thus, although our software scheme can only target L2 cache misses, we are targeting the main performance bottleneck.

The conventional scheme (Conven4) performs well on applications with some sequential patterns, such as Gap and Parser, but is ineffective in the application that has purely irregular patterns (Mcf). On average, Conven4 reduces the execution time by 10%.

The pair-based schemes show mixed performance. The Base scheme, modeled after Joseph and Grunwald's, shows limited speedups because it does not prefetch far enough. Chain performs slightly better than Base, but is limited by inaccuracy and high response time. Repl is able to reduce the execution time significantly. It outperforms both Base and Chain in all applications. Its impact comes from the nice properties of the Replicated algorithm, as discussed in Section 3.

The combined scheme (Conven4+Repl) performs the best. Its impact is significant: it removes on average 60% of PastL2 stall time, providing an average speedup of [...]. Compared to processor-side prefetching only (Conven4) with an average speedup of 1.11, and memory-side prefetching only (Repl) with an average speedup of 1.28, there is a clear synergistic effect in the combined scheme. Memory-side prefetching helps processor-side prefetching in irregular patterns, while processor-side prefetching helps in regular patterns.

Workload of the Prefetching Thread

We can gain further insight by examining the work load of the prefetching thread.
Figure 8 shows the average response time and occupancy of the prefetching thread for each of the memory-side prefetching algorithms. The latencies are shown in 1.6 GHz cycles and correspond to the average of all applications. Each bar is broken down into computation time (Busy) and memory stall time (Mem). The numbers on top of each bar show the average IPC of the prefetching thread. The IPC is calculated as the number of instructions divided by the number of memory processor cycles. The figure shows that for all the algorithms, the occupancy time is less than 200 cycles, showing the viability of the software implementation. Chain and Repl have the lowest occupancy time. Due to the fewer associative searches and the better cache use, Repl has only slightly higher occupancy time compared to Chain, despite performing more table updates. The response time is very important for prefetching effectiveness. The figure shows that Repl has the lowest response time. Its value is around 30 cycles.

[Figure 8: Response and occupancy time of the prefetching thread for each of the prefetching algorithms. Y-axis: number of processor cycles; bars for Base, Chain, and Repl, split into Mem and Busy, with the average IPC shown on top of each bar.]

6 Conclusions

This paper introduced memory-side correlation-based prefetching implemented in a user-level thread. The scheme runs on a general-purpose processor in the main memory. The scheme can be supported with few modifications to the L2 cache and no modification to the main processor. We introduced a new organization of the correlation table and a new correlation prefetching algorithm that enable fast and accurate far-ahead prefetching with high coverage. Overall, our scheme effectively prefetched irregular applications, speeding up three SPECInt2000 applications by an average of [...]. Furthermore, our scheme can work synergistically with a conventional processor-side prefetcher to deliver an average speedup of [...].

References

[1] T. Alexander and G. Kedem. Distributed Predictive Cache Design for High Performance Memory Systems. In Proceedings of the 2nd HPCA, Feb.
[2] J.B. Carter, W.C. Hsieh, L.B. Stoller, M.R. Swanson, L. Zhang, E.L. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M.A. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a Smarter Memory Controller. In Proceedings of the 5th HPCA, January.
[3] M.J. Charney and A.P. Reeves. Generalized Correlation Based Hardware Prefetching. Technical Report EE-CEG-95-1, Cornell University, Feb.
[4] C.J. Hughes. Prefetching Linked Data Structures in Systems with Merged DRAM-Logic. Master's thesis, University of Illinois at Urbana-Champaign, May. URL: jhughes/jhughesmsthesis.pf.
[5] D. Joseph and D. Grunwald. Prefetching Using Markov Predictors. In Proceedings of the 24th ISCA, June.
[6] N.P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the 17th ISCA.
[7] D. Koufaty and J. Torrellas. Comparing Data Forwarding and Prefetching for Communication-Induced Misses in Shared-Memory MPs. In Proceedings of the ICS, July.
[8] C. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, and K. Yelick. Scalable Processors in the Billion-Transistor Era: IRAM. IEEE Computer, September.
[9] V. Krishnan and J. Torrellas. An Execution-Driven Framework for Fast and Accurate Simulation of Superscalar Processors. In International Conference on Parallel Architectures and Compilation Techniques (PACT), October.
[10] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In Proceedings of the 8th ISCA, pages 87 85.
[11] A.-C. Lai, C. Fide, and B. Falsafi. Dead-Block Prediction and Dead-Block Correlating Prefetchers. In Proceedings of the 28th ISCA.
[12] C.-K. Luk. Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors. In Proceedings of the 28th ISCA.
[13] NVIDIA.
[14] R. Cooksey, D. Colarelli, and D. Grunwald. Content-based Prefetching: Initial Results. In The 2nd Workshop on Intelligent Memory Systems, Nov.
[15] A. Roth and G.S. Sohi. Speculative Data-Driven Multithreading. In Proceedings of the 7th HPCA, pages 37-48, Jan.
[16] Sony Computer Entertainment Inc.
[17] T. Sherwood, S. Sair, and B. Calder. Predictor-Directed Stream Buffers. In Proceedings of the 33rd MICRO, Dec.
[18] C.-L. Yang and A.R. Lebeck. Push vs. Pull: Data Movement for Linked Data Structures. In Proceedings of the 2000 ICS, May.

Acknowledgement

We thank James Tuck, Jose F. Martinez, Jose Renau, and Michael Huang for contributions to this work.
