Using a User-Level Memory Thread for Correlation Prefetching


Yan Solihin, Jaejin Lee, Josep Torrellas
University of Illinois at Urbana-Champaign; Michigan State University

Abstract

This paper introduces the idea of using a User-Level Memory Thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: the correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide usability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases further. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup further still.

1. Introduction

Data prefetching is a popular technique to tolerate long memory access latencies. Most of the past work on data prefetching has focused on processor-side prefetching [6, 7, 8, 12, 13, 14, 15, 19, 20, 23, 25, 26, 28, 29]. In this approach, the processor or an engine in its cache hierarchy issues the prefetch requests. An interesting alternative is memory-side prefetching, where the engine that prefetches data for the processor is in the main memory system [1, 4, 9, 11, 22, 28].

Memory-side prefetching is attractive for several reasons. First, it eliminates the overheads and state bookkeeping that prefetch requests introduce in the paths between the main processor and its caches.
Second, it can be supported with a few modifications to the controller of the L2 cache and no modification to the main processor. Third, the prefetcher can exploit its proximity to the memory to its advantage, for example by storing its state in memory. Finally, memory-side prefetching has the additional attraction of riding the technology trend of increased chip integration. Indeed, popular platforms like PCs are being equipped with graphics engines in the memory system [27]. Some chipsets like NVIDIA's nForce even integrate a powerful processor in the North Bridge chip [22]. Simpler engines can be provided for prefetching, or existing graphics processors can be augmented with prefetching capabilities. Moreover, there are proposals to integrate processing logic in DRAM chips, such as IRAM [16].

Unfortunately, existing proposals for memory-side prefetching engines have a narrow scope [1, 9, 11, 22, 28]. Indeed, some designs are hardware controllers that perform simple and specific operations [1, 9, 22]. Other designs are specialized engines that are custom-designed to prefetch linked data structures [11, 28]. Instead, we would like an engine that is usable in a wide variety of workloads and that offers flexibility of use to the programmer.

While memory-side prefetching can support a variety of prefetching algorithms, one type that is particularly suited to it is Correlation prefetching [1, 6, 12, 18, 26]. Correlation prefetching uses past sequences of reference or miss addresses to predict and prefetch future misses. Since no program knowledge is needed, correlation prefetching can be easily moved to the memory side.

In the past, correlation prefetching has been supported by hardware controllers that typically require a large hardware table to keep the correlations [1, 6, 12, 18]. In all cases but one, these controllers are placed between the L1 and L2 caches, or between the processor and the L1. While effective, this approach has a high hardware cost.

(Footnote: This work was supported in part by the National Science Foundation under CCR, EIA, and CHE grants; by DARPA under grant F C-0078; by Michigan State University; and by gifts from IBM, Intel, and Hewlett-Packard.)
Furthermore, it is often unable to prefetch far ahead enough and deliver good prefetch coverage.

In this paper, we present a new scheme where correlation prefetching is performed by a User-Level Memory Thread (ULMT) running on a simple general-purpose processor in memory. Such a processor is either in the memory controller chip or in a DRAM chip, and prefetches lines to the L2 cache of the main processor. The scheme requires minimal hardware support beyond the memory processor: the correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache controller so that it can accept incoming prefetches. Moreover, our scheme has wide usability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm executed by the ULMT can be customized by the programmer on an application basis.

Using a new design of the correlation table and correlation prefetching algorithm, our scheme delivers an average speedup of 1.32 for nine mostly-irregular applications. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases further. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup further still.

This paper is organized as follows: Section 2 discusses memory-side and correlation prefetching; Section 3 presents ULMT for correlation prefetching; Section 4 discusses our evaluation setup; Section 5 evaluates our design; Section 6 discusses related work; and Section 7 concludes.

Figure 1. Memory-side prefetching: some locations where the memory processor can be placed (a), and actions under push passive (b) and push active (c) prefetching.

2. Memory-Side and Correlation Prefetching

2.1. Memory-Side Prefetching

Memory-Side prefetching occurs when prefetching is initiated by an engine that resides either close to the main memory (beyond any memory bus) or inside of it [1, 4, 9, 11, 22, 28]. Some manufacturers have built such engines. Typically, they are simple hardwired controllers that probably recognize only simple stride-based sequences and prefetch data into local buffers. Some examples are NVIDIA's DASP engine in the North Bridge chip [22] and Intel's prefetch cache in the i860 chipset.

In this paper, we propose to support memory-side prefetching with a user-level thread running on a general-purpose core. The core can be very simple and does not need to support floating point. For illustration purposes, Figure 1-(a) shows the memory system of a PC. The core can be placed in different places, such as in the North Bridge (memory controller) chip or in the DRAM chips. Placing it in the North Bridge simplifies the design because the DRAM is not modified. Moreover, some existing systems already include a core in the North Bridge for graphics processing [22], which could potentially be reused for prefetching. Placing the core in a DRAM chip complicates the design, but the resulting highly-integrated system has lower memory access latency and higher memory bandwidth. In this paper, we examine the performance potential of both designs.

Memory- and processor-side prefetching are not the same as Push and Pull (or On-Demand) prefetching [28], respectively.
Push prefetching occurs when prefetched data is sent to a cache or processor that has not requested it, while pull prefetching is the opposite. Clearly, a memory prefetcher can act as a pull prefetcher by simply buffering the prefetched data locally and supplying it to the processor on demand [1, 22]. In general, however, memory-side prefetching is most interesting when it performs push prefetching to the caches of the processor because it can hide a larger fraction of the memory access latency.

Memory-side prefetching can also be classified into Passive and Active. In passive prefetching, the memory processor observes the requests from the main processor that reach main memory. Based on them, and after examining some internal state, the memory processor prefetches other data for the main processor that it expects the latter to need in the future (Figure 1-(b)). In active prefetching, the memory processor runs an abridged version of the code that is running on the main processor. The execution of the code induces the memory processor to fetch data that the main processor will need later. The data fetched by these requests is also sent to the main processor (Figure 1-(c)).

In this paper, we concentrate on passive push memory-side prefetching into the L2 cache of the main processor. The memory processor aims to eliminate only L2 cache misses, since they are the only ones that it sees. Typically, L2 cache miss time is an important contributor to the processor stall due to memory accesses, and is usually the hardest to hide with out-of-order execution.

This approach to prefetching is inexpensive to support. The main processor core does not need to be modified at all. Its L2 cache needs to have the following supports. First, as in other systems [11, 15, 28], the L2 cache has to accept lines from the memory that it has not requested. To do so, the L2 uses free Miss Status Handling Registers (MSHRs) in such events. Secondly, if the L2 has a pending request and a prefetched line with the same address arrives, the prefetch simply steals the MSHR and updates the cache as if it were the reply.
Finally, a prefetched line arriving at L2 is dropped in the following cases: the L2 cache already has a copy of the line, the write-back queue has a copy of the line because the L2 cache is trying to write it back to memory, all MSHRs are busy, or all the lines in the set where the prefetched line wants to go are in transaction-pending state.

2.2. Correlation Prefetching

Correlation Prefetching uses past sequences of reference or miss addresses to predict and prefetch future misses [1, 6, 12, 18, 26]. Two popular correlation schemes are Stride-Based and Pair-Based schemes. Stride-based schemes find stride patterns in the address sequences and prefetch all the addresses that will be accessed if the patterns continue in the future. Pair-based schemes identify a correlation between pairs or groups of addresses, for example between a miss and a sequence of successor misses. A typical implementation of pair-based schemes uses a Correlation Table to record the addresses that are correlated. Later, when a miss is observed, all the addresses that are correlated with its address are prefetched.

Pair-based schemes are attractive because they have general applicability: they work for any miss patterns as long as miss address sequences repeat. Such behavior is common in both regular and irregular applications, including those with sparse matrices or linked data structures. Furthermore, pair-based schemes, like all correlation schemes, need neither compiler support nor changes in the application binary.

Pair-based correlation prefetching has only been studied using hardware-based implementations [1, 6, 12, 18, 26], typically by placing a custom prefetch engine and a hardware correlation table between the processor and L1 cache, or between the L1 and L2 caches. The typical correlation table, as used in [6, 12, 26], is organized as

follows. Each row stores the tag of an address that missed, and the addresses of a set of immediate successor misses. These are misses that have been seen to immediately follow the first one at different points in the application. The parameters of the table are the maximum number of immediate successors per miss (NumSucc), the maximum number of misses that the table can store predictions for (NumRows), and the associativity of the table (Assoc). According to [12], for best performance, the entries in a row should replace each other with an LRU policy.

Figure 4-(a) illustrates how the algorithm works. We call the algorithm Base. The figure shows two snapshots of the table at different points in the miss stream ((i) and (ii)). Within a row, successors are listed in MRU order from left to right. At any time, the hardware keeps a pointer to the row of the last miss observed. When a miss occurs, the table learns by placing the miss address as one of the immediate successors of the last miss, and a new row is allocated for the new miss unless it already exists. When the table is used to prefetch ((iii)), it reacts to an observed miss by finding the corresponding row and prefetching all NumSucc successors, starting from the MRU one.

The designs in [1, 18] work slightly differently. They are discussed in Section 6.

Overall, past work has demonstrated the applicability of pair-based correlation prefetching for many applications. However, it has also revealed the shortcomings of the approach. One critical problem is that, to be effective, this approach needs a large table. Proposed schemes typically need a 1-2 Mbyte on-chip SRAM table [12, 18], while some applications with large footprints even need a 7.6 Mbyte off-chip SRAM table [18]. Furthermore, the popular schemes that prefetch several potential immediate successors for each miss [6, 12, 26] have two limitations: they do not prefetch very far ahead and, intuitively, they need to observe one miss to eliminate another miss (its immediate successor). As a result, they tend to have low coverage. Coverage is the number of useful prefetches over the original number of misses [12].

3. ULMT for Correlation Prefetching

We propose to use a ULMT to eliminate the shortcomings of pair-based correlation prefetching while enhancing its advantages. In the following, we discuss the main concept (Section 3.1), the architecture of the system (Section 3.2), modified correlation prefetching algorithms (Section 3.3), and related operating system issues (Section 3.4).

3.1. Main Concept

A ULMT running on a general-purpose core in memory performs two conceptually distinct operations: learning and prefetching. Learning involves observing the misses on the main processor's L2 cache and recording them in a correlation table one miss at a time. The prefetching operation involves reacting to one such miss by looking up the correlation table and triggering the prefetching of several memory lines for the L2 cache of the main processor. No action is taken on a write-back to memory.

In practice, in agreement with past work [12], we find that combining both learning and prefetching works best: the correlation table continuously learns new patterns, while uninterrupted prefetching delivers higher performance. Consequently, the ULMT executes the infinite loop shown in Figure 2. Initially, the thread waits for a miss to be observed. When it observes one, it looks up the table and generates the addresses of the lines to prefetch (Prefetching Step). Then, it updates the table with the address of the observed miss (Learning Step). It then resumes waiting.

Figure 2. Infinite loop executed by the ULMT: Wait, Prefetching step, Learning step. The response time runs from the observed miss address to the generated prefetch addresses; the occupancy time runs until the table is updated.

Any prefetch algorithm executed by the ULMT is characterized by its Response and Occupancy times. The response time is the time from when the ULMT observes a miss address until it generates the addresses to prefetch. For best performance, the response time should be as small as possible. This is why we always execute the Prefetching step before the Learning one. Moreover, we shift as much computation as possible from the Prefetching to the Learning step, retaining only the most critical operations in the Prefetching step. The occupancy time is the time when the ULMT is busy processing a single observed miss.
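As a rough illustration of the loop in Figure 2, the following C sketch applies it to the conventional pair-based table of Section 2.2. The type and function names (`base_table_t`, `ulmt_handle_miss`, and so on), the direct-mapped lookup, and the table size are assumptions made for brevity; they are not taken from the authors' implementation, and the wait on the queue of observed misses is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_ROWS 1024u  /* hypothetical table size */
#define NUM_SUCC 4u     /* maximum immediate successors per miss */

typedef struct {
    uint32_t tag;             /* line address of the miss that owns the row */
    uint32_t succ[NUM_SUCC];  /* immediate successors, MRU first */
    uint8_t  nsucc;           /* number of valid successors */
    uint8_t  valid;
} base_row_t;

typedef struct {
    base_row_t  rows[NUM_ROWS];
    base_row_t *last;         /* row of the last miss observed, or NULL */
} base_table_t;

void base_init(base_table_t *t) {
    assert(t != NULL);
    memset(t, 0, sizeof *t);
}

/* Direct-mapped lookup (simplifying assumption: no associativity). */
base_row_t *base_find(base_table_t *t, uint32_t line) {
    base_row_t *r = &t->rows[line % NUM_ROWS];
    return (r->valid && r->tag == line) ? r : NULL;
}

/* Move or insert `line` at the MRU slot; the LRU entry drops out if full. */
void base_insert_mru(base_row_t *r, uint32_t line) {
    size_t pos = r->nsucc;
    for (size_t i = 0; i < r->nsucc; i++)
        if (r->succ[i] == line) { pos = i; break; }
    if (pos == r->nsucc) {                 /* not present yet */
        if (r->nsucc < NUM_SUCC) r->nsucc++;
        else pos = NUM_SUCC - 1;           /* overwrite the LRU slot */
    }
    for (size_t i = pos; i > 0; i--) r->succ[i] = r->succ[i - 1];
    r->succ[0] = line;
}

/* Prefetching step: one row lookup, emit all successors, MRU first. */
size_t base_prefetch_step(base_table_t *t, uint32_t miss,
                          uint32_t out[NUM_SUCC]) {
    base_row_t *r = base_find(t, miss);
    size_t n = 0;
    if (r != NULL)
        for (; n < r->nsucc; n++) out[n] = r->succ[n];
    return n;
}

/* Learning step: record `miss` as successor of the last miss, then
 * allocate a row for `miss` unless it already exists. */
void base_learn_step(base_table_t *t, uint32_t miss) {
    if (t->last != NULL) base_insert_mru(t->last, miss);
    base_row_t *r = base_find(t, miss);
    if (r == NULL) {
        r = &t->rows[miss % NUM_ROWS];
        memset(r, 0, sizeof *r);
        r->tag = miss;
        r->valid = 1;
    }
    t->last = r;
}

/* One loop iteration: the Prefetching step runs first to keep the
 * response time short; the Learning step follows. */
size_t ulmt_handle_miss(base_table_t *t, uint32_t miss,
                        uint32_t out[NUM_SUCC]) {
    size_t n = base_prefetch_step(t, miss, out);
    base_learn_step(t, miss);
    return n;
}
```

With a miss sequence such as a, b, c, a, d, c, a later miss on a returns d and then b, the two successors recorded for a in MRU order.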
For the ULMT implementation of the prefetcher to be viable, the occupancy time has to be smaller than the time between two consecutive L2 misses most of the time.

The correlation table that the ULMT reads and writes is simply a software data structure in memory. Consequently, our scheme eliminates the costly hardware table required by current implementations of correlation prefetching [12, 18]. Moreover, accesses to the software table are inexpensive because the memory processor transparently caches the table in its cache. Finally, our new scheme enables the redesign of the correlation table and prefetching algorithms (Section 3.3) to address the low-coverage and short-distance prefetching limitations of current implementations.

3.2. Architecture of the System

Figures 3-(a) and (b) show the architecture of a system that integrates the memory processor in the North Bridge chip or in a DRAM chip, respectively. The first design requires no modification to the DRAM or its interface, and is largely compatible with conventional memory systems. The second design needs changes to the DRAM chips and their interface, and needs special support to work in typical memory systems, which have multiple DRAM chips. However, since our goal is to examine the performance potential of the two designs, we abstract away some of the implementation complexity of the second design by assuming a single-chip main memory. In the following, we outline how the systems work. In our discussion, we only consider memory accesses resulting from misses; we ignore write-backs for simplicity and because they do not affect our algorithms.

In Figure 3-(a), the key communication occurs through queues 1, 2, and 3. Miss requests from the main processor are deposited in queues 1 and 2 simultaneously. The ULMT uses the entries in queue 2 to build its table and, based on it, generate the addresses to prefetch. The latter are deposited in queue 3. Queues 1 and 3 compete to access memory, although queue 3 has a lower priority than 1.

When the address of a line to prefetch is deposited in queue 3, the hardware compares it against all the entries in queue 2. If a match for address X is detected, X is removed from both queues.
We remove X from queue 3 because it is redundant: a higher-priority request for X is already in queue 1. X is removed from queue 2 to save computation in the ULMT. Note that it is unclear whether we lose the opportunity to prefetch X's successors by not processing X. The reason is that our algorithms prefetch several levels of successor misses (Section 3.3) and, as a result, some of X's successors may already be in queue 3. Processing X may help improve the state in the correlation table. However, minimizing the total occupancy of the ULMT is crucial in our scheme.

Figure 3. Architecture of a system that integrates the memory processor in the North Bridge chip (a) or in a DRAM chip (b).

Similarly, when a main-processor miss is about to be deposited in queues 1 and 2, the hardware compares its address against those in queue 3. If there is a match, the request is put only in queue 1 and the matching entry in queue 3 is removed.

It is possible that requests from the main processor arrive too fast for the ULMT to consume them and queue 2 overflows. In this case, the memory processor simply drops these requests.

Figure 3-(a) also shows the Filter module associated with queue 3. This module improves the performance of correlation prefetching, which may sometimes try to prefetch the same address several times in a short time. The Filter module drops prefetch requests directed to any address that has recently been the target of another prefetch request. The module is a fixed-size FIFO list that records the addresses of all the recently-issued requests. Before a request is issued to queue 3, the hardware checks the Filter list. If it finds its address, the request is dropped and the list is left unmodified. Otherwise, the address is added to the tail of the list. With this support, some unnecessary prefetch requests are eliminated.

For completeness, the figure shows other queues. Replies from memory to the main processor go through queue 4. In addition, the ULMT needs to access the software correlation table in main memory.
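The Filter module described above is a hardware structure, but its policy is easy to express in software. The following C sketch models it as a circular buffer; the names `filter_t` and `filter_admit` are hypothetical, and the 32-entry size follows the Filter size listed among the simulation parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILTER_ENTRIES 32u  /* Filter module size used in the evaluation */

typedef struct {
    uint32_t addr[FILTER_ENTRIES];  /* recently issued prefetch addresses */
    size_t   head;                  /* index of the oldest entry */
    size_t   count;                 /* number of valid entries */
} filter_t;

void filter_init(filter_t *f) {
    assert(f != NULL);
    memset(f, 0, sizeof *f);
}

/* Returns 1 if the prefetch may be issued to queue 3, 0 if it is dropped.
 * On a hit the list is left unmodified; on a miss the address is appended
 * at the tail, pushing out the oldest entry once the list is full. */
int filter_admit(filter_t *f, uint32_t line) {
    assert(f != NULL);
    for (size_t i = 0; i < f->count; i++)
        if (f->addr[(f->head + i) % FILTER_ENTRIES] == line)
            return 0;
    if (f->count < FILTER_ENTRIES) {
        f->addr[(f->head + f->count) % FILTER_ENTRIES] = line;
        f->count++;
    } else {
        f->addr[f->head] = line;                   /* replace the oldest */
        f->head = (f->head + 1) % FILTER_ENTRIES;  /* new entry is the tail */
    }
    return 1;
}
```

Because a hit does not refresh the entry, an address becomes eligible again after 32 other distinct addresses have been admitted, regardless of how often it was requested in between.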
Recall that the table is transparently cached by the memory processor. Logical queues 5 and 6 provide the necessary paths for the memory processor to access main memory. In practice, queues 5 and 6 are merged with the others.

If the memory processor is in the DRAM chip (Figure 3-(b)), the system works slightly differently. Miss requests from the main processor are deposited first in queue 1 and then in queue 2. The ULMT in the memory processor accesses the correlation table from its cache and, on a miss, directly from the DRAM. The addresses to prefetch are passed through the Filter module and placed in queue 3. As in Figure 3-(a), entries in queues 2 and 3 are checked against each other, and the common entries are dropped. The replies to both prefetches and main-processor requests are returned to the memory controller. As they reach the memory controller, their addresses are compared to the processor miss requests in queue 1. If a memory-prefetched line matches a miss request from the main processor, the former is considered to be the reply of the latter, and the latter is not sent to the memory chip.

Finally, in machines that include a form of processor-side prefetching, we envision our architecture to operate in two modes: Verbose and Non-Verbose. In Verbose mode, queue 2 in Figures 3-(a) and (b) receives both main-processor misses and main-processor prefetch requests. In Non-Verbose mode, queue 2 only receives main-processor misses. This mode assumes that main-processor prefetch requests are distinguishable from other requests, for example with a tag as in the MIPS R10000 [21].

The Non-Verbose mode is useful to reduce the total occupancy of the ULMT. In this case, the processor-side prefetcher can focus on the easy-to-predict sequential or regular miss patterns, while the ULMT can focus on the hard-to-predict irregular ones. The Verbose mode is also useful: the ULMT can implement a prefetch algorithm that enhances the effectiveness of the processor-side prefetcher. We present an example of this case in Section 5.

3.3. Correlation Prefetching Algorithms

Simply taking the current pair-based correlation table and algorithm and implementing them in software is not good enough.
Indeed, as indicated in Section 2.2, the Base algorithm has two limitations: it does not prefetch very far ahead and, intuitively, it needs to observe one miss to eliminate another miss (its immediate successor). As a result, it tends to have low coverage.

To increase coverage, three things need to occur. First, we need to eliminate these two limitations by storing in the table (and prefetching) several levels of successor misses per miss: immediate successors, successors of immediate successors, and so on for several levels. Second, these prefetches have to be highly accurate. Finally, the prefetcher has to take decisions early enough so that the prefetched lines reach the main processor before they are needed.

These conditions are easier to support and ensure when the correlation algorithm is implemented as a ULMT. There are two reasons for it. The first one is that storage is now cheap and, therefore, the correlation table can be inexpensively expanded to hold multiple levels of successor misses per miss, even if that means replicating information. The second reason is the Customizability provided by a software implementation of the prefetching algorithm.

In the rest of this section, we describe how a ULMT implementation of correlation prefetching can deliver high coverage. We describe three approaches: using a conventional table organization, using a table re-organized for ULMT, and exploiting customizability.

Figure 4. Pair-based correlation algorithms: Base (a), Chain (b), and Replicated (c). Each panel shows the correlation table (NumRows=4, NumSucc=2; NumLevels=2 for Chain and Replicated) at two points of the miss sequence ((i) and (ii)) and the prefetches issued on a miss ((iii)).

3.3.1. Using a Conventional Table Organization

As a first step, we attempt to improve coverage without specifically exploiting the low-cost storage or customizability advantages of ULMT. We simply take the conventional table organization of Section 2.2 and force the ULMT to prefetch multiple levels of successors for every miss. The resulting algorithm we call Chain. Chain takes the same parameters as Base plus NumLevels, which is the number of levels of successors prefetched. The algorithm is illustrated in Figure 4-(b).

Chain updates the table like Base ((i) and (ii)) but prefetches differently ((iii)). Specifically, after prefetching the row of immediate successors, it takes the MRU one among them and accesses the correlation table again with its address. If the entry is found, it prefetches all NumSucc successors there. Then, it takes the MRU successor in that row and repeats the process. This is done NumLevels-1 times. As an example, suppose that a miss on a occurs ((iii)). The ULMT first prefetches d and b. Then, it takes the MRU entry d, looks up the table, and prefetches d's successor, c.

Chain addresses the two limitations of Base, namely not prefetching very far ahead, and needing one miss to eliminate a second one. However, Chain may not deliver high coverage for two reasons: the prefetches may not be highly accurate and the ULMT may have a high response time to issue all the prefetches.

The prefetches may be inaccurate because Chain does not prefetch the true MRU successors in each level of successors. Instead, it only prefetches successors found along the MRU path. For example, consider a sequence of misses that alternates between a,b,c and b,e,b,f: a,b,c,...,b,e,b,f,...,a,b,c,...
When miss a is encountered, Chain prefetches its immediate successors (b), and then accesses the entry for b to prefetch e and f. Note that c is not prefetched.

The high response time of Chain to a miss comes from having to make NumLevels accesses to different rows in the table. Each access involves an associative search, because the table is associative, and potentially one or more cache misses.

3.3.2. Using a Table Re-Organized for ULMT

We now attempt to improve coverage by exploiting the low cost of storage in ULMT solutions. Specifically, we expand the table to allow replicated information. Each row of the table stores the tag of the miss address, and NumLevels levels of successors. Each level contains NumSucc addresses that use LRU for replacement. Using this table, we propose an algorithm called Replicated (Figure 4-(c)). Replicated takes the same parameters as Chain.

As shown in Figure 4-(c), Replicated keeps NumLevels pointers to the table. These pointers point to the entries for the address of the last miss, second last, and so on, and are used for efficient table access. When a miss occurs, these pointers are used to access the entries of the last few misses, and insert the new address as the MRU successor of the correct level ((i) and (ii)). In the figure, the NumSucc entries at each level are MRU ordered. Finally, prefetching in Replicated is simple: when a miss is seen, all the entries in the corresponding row are prefetched ((iii)).

Note that Replicated eliminates the two problems of Chain. First, prefetches are accurate because they contain the true MRU successors at each level. This is the result of grouping together all the successors from a given level, irrespective of the path taken. In the sequence shown above, a,b,c,...,b,e,b,f,...,a,b,c,..., on a miss on a, Replicated prefetches b and c. Second, the response time of Replicated is much smaller than that of Chain. Indeed, Replicated prefetches several levels of successors with a single row access, and maybe even with a single cache miss. Replicated effectively shifts some computation from the Prefetching step to the Learning one: prefetching needs a single table access, while learning a miss needs multiple table updates. This is a good trade-off because the Prefetching step is the critical one.
Furthermore, these multiple learning updates are inexpensive: the use of the pointers eliminates the need to do any associative searches on the table, and the rows to be updated are most likely still in the cache of the memory processor (since they were updated most recently).

3.3.3. Exploiting the Customizability of ULMT

We can also improve coverage by exploiting the second advantage of ULMT solutions: customizability. The programmer or system can choose to run a different algorithm in the ULMT for each application. The chosen algorithm can be highly customized to the application's needs.

One approach to customization is to use the table organizations and prefetching algorithms described above but to tune their parameters on an application basis. For example, in applications where the miss sequences are highly predictable, we can set the number of levels of successors to prefetch (NumLevels) to a high value. As a result, we will prefetch more levels of successors with high accuracy. In applications with unpredictable sequences, we can do the opposite. We can also tune the number of rows in the table (NumRows). In applications that have large footprints, we can set NumRows to a high value to hold more information in the table. In small applications, we can do the opposite to save space.

A second approach to customization is to use a different prefetching algorithm. For example, we can add support for sequential prefetching to all the algorithms described above. The resulting algorithms will have low response time for sequential miss patterns. Another approach is to adaptively decide the algorithm on-the-fly, as the application executes. In fact, this approach can also be used to execute different algorithms in different parts of one application. Such intra-application customizability may be useful in complex applications.

Finally, the ULMT can also be used for profiling purposes. It can monitor the misses of an application and infer higher-level information such as cache performance, application access patterns, or page conflicts.

3.3.4. Comparing the Algorithms

Table 1 compares the Base, Chain, and Replicated algorithms executing on a ULMT. Replicated has the highest potential for high coverage: it supports far-ahead prefetching by prefetching several levels of successors, its prefetches have high accuracy because they prefetch the true MRU successors at each level, and it has a low response time, in part because it only needs to access a single table row in the Prefetching step. Accessing a single row minimizes the associative searches and the cache misses. The only shortcoming of Replicated is the larger space that it requires for the correlation table. However, this is a minor issue since the table is a software structure allocated in main memory.

Table 1. Comparing different pair-based correlation prefetching algorithms running on a ULMT.

Characteristics | Base | Chain | Replicated
Levels of successors prefetched | 1 | NumLevels | NumLevels
True MRU ordering for each level? | Yes | No | Yes
Number of row accesses in the Prefetching step (requires SEARCH) | 1 | NumLevels | 1
Number of row accesses in the Learning step (requires NO SEARCH) | 1 | 1 | NumLevels
Response time | Low | High | Low
Space requirement (for constant number of prefetches) | | | NumLevels
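The difference in accuracy between following the MRU path and storing every level in one row can be made concrete with a small C sketch. It is a simplified model under stated assumptions: miss addresses are small integer ids that index rows directly (no tags, hashing, or associativity), NumSucc and NumLevels are both 2, and all names (`learn`, `prefetch_chain`, `prefetch_replicated`, `train_run`) are hypothetical. Level 0 of each row holds exactly what the conventional table would hold, so the MRU-path prefetcher reads level 0 only.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ROWS   8    /* hypothetical: miss ids 0..7 index rows directly */
#define NUM_SUCC   2
#define NUM_LEVELS 2
#define NONE       (-1)

typedef struct {
    int succ[NUM_LEVELS][NUM_SUCC];  /* per level, MRU first */
} row_t;

typedef struct {
    row_t rows[NUM_ROWS];
    int   last[NUM_LEVELS];  /* last[0]: last miss, last[1]: second last */
} table_t;

/* Models a gap ("...") in the miss stream between two phases. */
void forget_history(table_t *t) {
    for (int l = 0; l < NUM_LEVELS; l++) t->last[l] = NONE;
}

void table_init(table_t *t) {
    assert(t != NULL);
    for (int r = 0; r < NUM_ROWS; r++)
        for (int l = 0; l < NUM_LEVELS; l++)
            for (int s = 0; s < NUM_SUCC; s++)
                t->rows[r].succ[l][s] = NONE;
    forget_history(t);
}

/* Put `id` at the MRU slot of one level; the LRU entry drops out. */
void insert_mru(int list[NUM_SUCC], int id) {
    int pos = NUM_SUCC - 1;
    for (int i = 0; i < NUM_SUCC; i++)
        if (list[i] == id) { pos = i; break; }
    for (int i = pos; i > 0; i--) list[i] = list[i - 1];
    list[0] = id;
}

/* Learning step: the new miss becomes the level-l successor of the miss
 * seen l+1 steps ago, reached through the `last` pointers (no search). */
void learn(table_t *t, int miss) {
    assert(miss >= 0 && miss < NUM_ROWS);
    for (int l = 0; l < NUM_LEVELS; l++)
        if (t->last[l] != NONE)
            insert_mru(t->rows[t->last[l]].succ[l], miss);
    for (int l = NUM_LEVELS - 1; l > 0; l--) t->last[l] = t->last[l - 1];
    t->last[0] = miss;
}

void train_run(table_t *t, const int *run, size_t n) {
    forget_history(t);
    for (size_t i = 0; i < n; i++) learn(t, run[i]);
}

/* MRU-path prefetching: NUM_LEVELS row accesses, immediate successors only. */
size_t prefetch_chain(const table_t *t, int miss,
                      int out[NUM_LEVELS * NUM_SUCC]) {
    size_t n = 0;
    int row = miss;
    for (int l = 0; l < NUM_LEVELS && row != NONE; l++) {
        const int *s = t->rows[row].succ[0];
        for (int i = 0; i < NUM_SUCC; i++)
            if (s[i] != NONE) out[n++] = s[i];
        row = s[0];  /* follow the MRU successor to the next row */
    }
    return n;
}

/* Replicated-table prefetching: a single row access covers all levels. */
size_t prefetch_replicated(const table_t *t, int miss,
                           int out[NUM_LEVELS * NUM_SUCC]) {
    size_t n = 0;
    for (int l = 0; l < NUM_LEVELS; l++)
        for (int i = 0; i < NUM_SUCC; i++)
            if (t->rows[miss].succ[l][i] != NONE)
                out[n++] = t->rows[miss].succ[l][i];
    return n;
}
```

Training on the run a, b, c and then on b, e, b, f (with a=0, b=1, c=2, e=3, f=4) and querying a gives b, f, e along the MRU path, which misses c, whereas the single-row lookup gives b, c.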
Note that all these algorithms can also be implemented in hardware. However, Replicated is more suitable for a ULMT implementation because providing the larger space required in hardware is expensive.

3.4. Operating System Issues

There are some operating system issues that are related to ULMT operation. We outline them here.

Protection. The ULMT has its own separate address space with its instructions, the correlation table, and a few other data structures. The ULMT shares neither instructions nor data with any application. The ULMT can observe the physical addresses of the application misses. It can also issue prefetches for these addresses on behalf of the main processor. However, it can neither read from nor write to these addresses. Therefore, protection is guaranteed.

Multiprogrammed Environment. It is a poor approach to have all the applications share a single table: the table is likely to suffer a lot of interference. A better approach is to associate a different ULMT, with its own table, to each application. This eliminates interference in the tables. In addition, it enables the customization of each ULMT to its own application. If we conservatively assume a 4-Mbyte table on average per application, 8 applications require 32 Mbytes, which is only a modest fraction of today's typical main memory. If this requirement is excessive, we can save space by dynamically sizing the tables. In this case, if an application does not use the space, its table shrinks.

Scheduling. The scheduler knows the ULMT associated with each application. Consequently, the scheduler schedules and preempts both application and ULMT as a group. Furthermore, the operating system provides an interface for the application to control its ULMT.

Page Re-mapping. Sometimes, a page gets re-mapped. Since ULMTs operate on physical addresses, such events can cause some table entries to become stale. We can choose to take no action and let the table update itself automatically through learning. Alternatively, the operating system can inform the corresponding ULMT when a re-mapping occurs, passing the old and new physical page number. Then, the ULMT indexes its table for each line of the old page. If the entry is found, the ULMT relocates it and updates both the tag and any applicable successors in the row.
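The table update on a page re-mapping might look like the following C sketch. It assumes 4-Kbyte pages and 64-byte L2 lines (64 lines per page), a direct-mapped table indexed by the low bits of the line address, and hypothetical names (`remap_page`, `find_row`, `translate`); how the operating system delivers the old and new page numbers to the ULMT is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_ROWS       4096u  /* hypothetical table size */
#define NUM_SUCC       2u
#define LINES_PER_PAGE 64u    /* assumes 4-KB pages and 64-B lines */

typedef struct {
    uint32_t tag;             /* physical line address of the miss */
    uint32_t succ[NUM_SUCC];  /* physical line addresses of successors */
    uint8_t  nsucc;
    uint8_t  valid;
} row_t;

typedef struct {
    row_t rows[NUM_ROWS];
} table_t;

row_t *find_row(table_t *t, uint32_t line) {
    row_t *r = &t->rows[line % NUM_ROWS];
    return (r->valid && r->tag == line) ? r : NULL;
}

/* Map a line of the old page to the same offset in the new page;
 * lines outside the old page are returned unchanged. */
uint32_t translate(uint32_t line, uint32_t old_ppn, uint32_t new_ppn) {
    if (line / LINES_PER_PAGE != old_ppn) return line;
    return new_ppn * LINES_PER_PAGE + line % LINES_PER_PAGE;
}

/* Index the table for each line of the old page; relocate every row that
 * is found, rewriting its tag and any successors that lie in the old page. */
void remap_page(table_t *t, uint32_t old_ppn, uint32_t new_ppn) {
    assert(t != NULL);
    for (uint32_t off = 0; off < LINES_PER_PAGE; off++) {
        uint32_t old_line = old_ppn * LINES_PER_PAGE + off;
        row_t *old_row = find_row(t, old_line);
        if (old_row == NULL) continue;
        row_t moved = *old_row;
        moved.tag = translate(old_line, old_ppn, new_ppn);
        for (uint32_t i = 0; i < moved.nsucc; i++)
            moved.succ[i] = translate(moved.succ[i], old_ppn, new_ppn);
        old_row->valid = 0;
        t->rows[moved.tag % NUM_ROWS] = moved;  /* may evict a conflicting row */
    }
}
```

Rows that belong to other pages are not visited, so any successors they hold in the old page remain stale until learning overwrites them.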
Given current page sizes, we estimate the table update to take a few microseconds. Such overhead may be overlapped with the execution of the operating system page mapping handler in the main processor. Note that some other entries in the table may still keep stale successor information. Such information may cause a few useless prefetches, but the table will quickly update itself automatically.

4. Evaluation Environment

Applications. To evaluate the ULMT approach, we use nine mostly-irregular, memory-intensive applications. Irregular applications are hardly amenable to compiler-based prefetching. Consequently, they are the obvious target for ULMT correlation prefetching. The exception is CG, which is a regular application. Table 2 describes the applications. The last four columns of the table will be explained later.

Table 2. Applications used. The Correlation Table columns (NumRows in K, Size in Mbytes) and the Average row are not reproduced here because their values are missing.

Appl | Suite | Problem | Input
CG | NAS | Conjugate gradient | Class S
Equake | SpecFP2000 | Seismic wave propagation simulation | Test
FT | NAS | 3D Fourier transform | Class S
Gap | SpecInt2000 | Group theory solver | Rako (subset of test)
Mcf | SpecInt2000 | Combinatorial optimization | Test
MST | Olden | Finding minimum spanning tree | 1024 nodes
Parser | SpecInt2000 | Word processing | Subset of train
Sparse | SparseBench [10] | GMRES with compressed row storage |
Tree | Univ. of Hawaii [3] | Barnes-Hut N-body problem | 2048 bodies

Simulation Environment. The evaluation is done using an execution-driven simulation environment that supports a dynamic superscalar processor model [17]. We model a PC architecture with a simple memory processor that is integrated in either the North Bridge chip or in a DRAM chip, following the micro-architecture of Figure 3. Table 3 shows the parameters used for each component of the architecture. All cycles are 1.6 GHz cycles. The architecture is modeled cycle by cycle.

Table 3. Parameters of the simulated architecture. Latencies correspond to contention-free conditions. RT stands for round-trip from the processor. All cycles are 1.6 GHz cycles.

PROCESSOR
Main Processor: 6-issue dynamic. 1.6 GHz. Int, fp, ld/st FUs: 4, 4, 2. Pending ld, st: 8, 16. Branch penalty: 12 cycles
Memory Processor: 2-issue dynamic. 800 MHz. Int, fp, ld/st FUs: 2, 0, 1. Pending ld, st: 4, 4. Branch penalty: 6 cycles

MEMORY
Main Processor's Memory Hierarchy:
  L1 data: write-back, 16 KB, 2 way, 32-B line, 3-cycle hit RT
  L2 data: write-back, 512 KB, 4 way, 64-B line, 19-cycle hit RT
  RT memory latency: 243 cycles (row miss), 208 cycles (row hit)
  Memory bus: split-transaction, 8 B, 400 MHz, 3.2 GB/sec peak
Memory Processor's Memory Hierarchy:
  L1 data: write-back, 32 KB, 2 way, 32-B line, 4-cycle hit RT
  In North Bridge: RT mem latency: 100 cycles (row miss), 65 cycles (row hit); latency of a prefetch request to reach DRAM: 25 cycles
  In DRAM: RT mem latency: 56 cycles (row miss), 21 cycles (row hit); internal DRAM data bus: 32-B wide, 800 MHz, 25.6 GB/sec peak
DRAM Parameters (applicable to all procs):
  Dual channel. Each channel: 2 B, 800 MHz. Total: 3.2 GB/sec peak
  Random access time (tRAC): 45 ns
  Time from memory controller (tSystem): 60 ns

OTHER
Depth of queues 1 through 6: 16
Filter module: 32 entries, FIFO

We model only a uni-programmed environment with a single application and a single ULMT that execute concurrently. We model all the contention in the system, including the contention of the application thread and the ULMT on shared resources such as the memory controller, DRAM channels, and DRAM banks.

Processor-Side Prefetching. The main processor optionally includes a hardware prefetcher that can prefetch multiple streams of stride 1 or -1 into the L1 cache. The prefetcher monitors L1 cache misses and can identify and prefetch up to NumSeq sequential streams concurrently. It works as follows. When the third miss in a sequence is observed, the prefetcher recognizes a stream. Then, it prefetches the next NumPref lines in the stream into the L1 cache. Furthermore, it stores the stride and the next address expected in the stream in a special register.
If the processor later misses on the address in the register, the prefetcher prefetches the next NumPref lines in the stream and updates the register. The prefetcher contains NumSeq such registers. As we can see, while this scheme works somewhat like stream buffers [13], the prefetched lines go to L1. We choose this approach to minimize hardware complexity. A shortcoming is that the L1 cache may get polluted. For completeness, we resimulated the system with the prefetches going into separate buffers rather than into L1. We found that the performance changes very little, in part because checking the buffers on L1 misses introduces delay.

Algorithm Parameters. Table 4 lists the prefetching algorithms that we evaluate and the default parameters that we use. The sequential prefetching supported in hardware by the main processor is called Conven4, for conventional. It can also be implemented in software by a ULMT. We evaluate two such software implementations (Seq1 and Seq4). In this case, the prefetcher in memory observes L2 misses rather than L1 misses. Unless otherwise indicated, the processor-side prefetcher is off and, if it is on, the ULMT algorithms operate in Non-Verbose mode (Section 3.2). For the Base algorithm, we choose the parameter values used by Joseph and Grunwald [12] so that we can compare the work.

The last four columns of Table 2 give a conservative value for the size of the correlation table for each application. The table is two-way set-associative. We have sized the number of rows in the table (NumRows) to be the lowest power of two such that, with a trivial hashing function that simply takes the lower bits of the line address, less than 5% of the insertions replace an existing entry. This is a very generous allocation. A more sophisticated hash function can reduce NumRows significantly without increasing conflicts much. In any case, knowing that each row in Base, Chain, and Repl takes 20, 12, and 28 bytes, respectively, in a 32-bit machine, we can compute the total table size. Overall, while some applications need more space than others, the average value is tolerable: 2.7, 1.6, and 3.8 Mbytes for Base, Chain, and Repl, respectively.

ULMT Implementation. We wrote all ULMTs in C and hand-optimized them for minimal response and occupancy time.
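The table-sizing rule described above (lowest power-of-two NumRows for which fewer than 5% of insertions replace an existing entry, then rows times bytes per row) can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' code: the LRU replacement within a set, the reading of NumRows as the total number of rows across sets, and the function names `replace_fraction`, `size_table`, and `table_bytes` are all assumptions made for illustration.

```python
from collections import OrderedDict


def replace_fraction(line_addrs, num_rows, assoc=2):
    """Fraction of insertions that evict an existing entry.

    Assumptions (not from the paper): num_rows counts all rows, so there
    are num_rows // assoc sets, and each set uses LRU replacement.
    """
    num_sets = max(1, num_rows // assoc)
    sets = [OrderedDict() for _ in range(num_sets)]
    insertions = 0
    replaced = 0
    for addr in line_addrs:
        # Trivial hash: take the lower bits of the line address.
        entries = sets[addr & (num_sets - 1)]
        if addr in entries:
            entries.move_to_end(addr)      # already present: refresh LRU order
            continue
        insertions += 1
        if len(entries) >= assoc:
            entries.popitem(last=False)    # evict the least recently used row
            replaced += 1
        entries[addr] = True
    return replaced / insertions if insertions else 0.0


def size_table(line_addrs, assoc=2, threshold=0.05):
    """Lowest power-of-two row count with a replace fraction below threshold."""
    num_rows = assoc
    while replace_fraction(line_addrs, num_rows, assoc) >= threshold:
        num_rows *= 2
    return num_rows


def table_bytes(num_rows, row_bytes):
    """Total table size given the per-row size (20, 12, or 28 bytes in the text)."""
    return num_rows * row_bytes
```

For instance, a trace that cycles twice over 16 distinct line addresses needs 16 rows under these assumptions, and 128 K rows of 20 bytes each occupy 2.5 Mbytes.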
One major performance bottleneck of the implementation is frequent branches. We remove branches by unrolling loops and hardwiring all algorithm parameters. We also perform optimizations to increase the spatial locality and to reduce instruction count. None of the algorithms uses floating-point operations.

Prefetching Algorithm   Implementation                       Name      Parameter Values
Base                    Software in memory as ULMT           Base      NumSucc = 4, Assoc = 4
Chain                   Software in memory as ULMT           Chain     NumSucc = 2, Assoc = 2, NumLevels = 3
Replicated              Software in memory as ULMT           Repl      NumSucc = 2, Assoc = 2, NumLevels = 3
Sequential 1-Stream     Software in memory as ULMT           Seq1      NumSeq = 1, NumPref = 6
Sequential 4-Streams    Software in memory as ULMT           Seq4      NumSeq = 4, NumPref = 6
Sequential 4-Streams    Hardware in L1 of main processor     Conven4   NumSeq = 4, NumPref = 6

Table 4. Parameter values used for the different algorithms.

5. Evaluation

5.1. Characterizing Application Behavior

Predictability of the Miss Sequences. We start by characterizing how well our ULMT algorithms can predict the miss sequences of the applications. For that, we run each ULMT algorithm simply observing all L2 cache miss addresses without performing prefetching. We record the fraction of L2 cache misses that are correctly predicted. For a sequential prefetcher, this means that the upcoming miss address matches the next address predicted by one of the streams identified; for a pair-based prefetcher, the upcoming address matches one of the successors predicted for that level.

Figure 5 shows the results of prediction for up to three levels of successors. Given a miss, the Level 1 chart shows the predictability of the immediate successor, while Level 2 shows the predictability of the next successor, and Level 3 the successor after that one. The experiments for the pair-based schemes use large tables to ensure that practically no prediction is missed due to conflicts in the table: NumRows is 256 K, Assoc is 4, and NumSucc is 4. Under these conditions, for level 1, Chain and Repl are equivalent to Base. For levels 2 and 3, Base is not applicable. The figure also shows the effect of combining algorithms.

Figure 5. Fraction of L2 cache misses that are correctly predicted by different algorithms for different levels of successors. (Three panels, Level 1 to Level 3; y-axis: % Correct Prediction; x-axis: the nine applications and their average.)

Figure 5 shows that our ULMT algorithms can effectively predict the miss streams of the applications. For example, at level 1, Seq4 and Base can correctly predict on average 49% and 82% of the misses, respectively. Moreover, the best algorithms keep predicting correctly across several levels of successors. For example, Repl correctly predicts on average 77% and 73% of the misses for levels 2 and 3, respectively. Therefore, these algorithms have good potential.

The figure also shows that different applications have different miss behavior. For instance, applications such as Mcf and Tree do not have sequential patterns and, therefore, only pair-based algorithms can predict misses. In other applications such as CG, instead, sequential patterns dominate. As a result, sequential prefetching can predict practically all L2 misses. Most applications have a mix of both patterns. Among pair-based algorithms, Repl almost always outperforms Chain by a wide margin. This is because Chain does not maintain the true MRU successors at each level. However, while Repl is effective under all patterns, it is better when combined with multi-stream sequential prefetching (Seq4+Repl).

Time Between L2 Misses. Another important issue is the time between L2 misses.
Figure 6 classifies L2 misses according to the number of cycles between two consecutive misses arriving at the memory. The misses are grouped in bins corresponding to [0,80) cycles, [80,200) cycles, etc. The unit is 1.6 GHz processor cycles. The most significant bin is [200,280), which contributes 60% of all miss distances on average. These misses are critical beyond their numbers because their latencies are hard to hide with out-of-order execution. Indeed, since the round-trip latency to memory is between 208 and 243 cycles (Table 3), dependent misses are likely to fall in this bin. They contribute more to processor stall than the figure suggests because dependent misses cannot be overlapped with each other. Consequently, we want the ULMT to prefetch them. To make sure that the ULMT is fast enough to learn these misses, its occupancy should be less than 200 cycles. The misses in the other bins are fewer and less critical. Those in [280, ∞) are too far apart to put pressure on the ULMT's timing. Those in [0,80) may not give enough time to the ULMT to respond. Fortunately, these misses are more likely to be overlapped with each other and with computation.

Figure 6. Characterizing the time between L2 misses. (y-axis: % of Misses; x-axis: the nine applications and their average; bins: [0,80), [80,200), [200,280), [280,Infinity).)

5.2. Comparing the Different Algorithms

Figure 7 compares the execution time of the applications under different cases: no prefetching (NoPref), processor-side prefetching as listed in Table 4 (Conven4), different ULMT schemes listed in Table 4 (Base, Chain, and Repl), the combination of Conven4 and Repl (Conven4+Repl), and some customized algorithms (Custom). The results are for the case where the memory processor is integrated in the DRAM. For each application and the average, the bars are normalized to NoPref. The bars show the memory-induced processor stall time that is caused by requests between the processor and the L2 cache (UptoL2), and by requests beyond the L2 cache (BeyondL2). The remaining time (Busy) includes processor computation plus other pipeline stalls. A system with a perfect L2 cache would only have the Busy and UptoL2 times.

Figure 7. Execution time of the applications with different prefetching algorithms. (y-axis: Normalized Execution Time, broken down into BeyondL2, UptoL2, and Busy.)

On average, BeyondL2 is the most significant component of the execution time under NoPref. It accounts for 44% of the time. Thus, although our ULMT schemes only target L2 cache misses, they target the main contributor to the execution time.

Conven4 performs very well on CG because sequential patterns dominate. However, it is ineffective in applications such as Mcf and Tree that have purely irregular patterns. On average, Conven4 reduces the execution time by 17%.

The pair-based schemes show mixed performance. Base shows limited speedups, mostly because it does not prefetch far enough. On average, it reduces NoPref's execution time by 6%. Chain performs a little better, but it is limited by inaccuracy (Figure 5) and high response time (Section 3.3.1). On average, it reduces NoPref's execution time by 12%.

Repl is able to reduce the execution time significantly. It performs well in almost all applications. It outperforms both Base and Chain in all cases. Its impact comes from the nice properties of the Replicated algorithm, as discussed earlier.

Finally, Conven4+Repl performs the best. On average, it removes over half of the BeyondL2 stall time, and delivers an average application speedup of 1.46 over NoPref. If we compare the impact of processor-side prefetching only (Conven4) and memory-side prefetching only (Repl), we see that they have a constructive effect in Conven4+Repl. The reason is that the two schemes help each other. Specifically, the processor-side prefetcher prefetches and eliminates the sequential misses. The memory-side prefetcher works in Non-Verbose mode (Section 3.2) and, therefore, does not see the prefetch requests. As a result, it can fully focus on the irregular miss patterns. With the resulting reduced load, the ULMT is more effective.

Algorithm Customization.
In this first paper on ULMT prefetching, we have attempted only very simple customization for a few applications. Table 5 shows the changes. For CG, we run Seq1+Repl in Verbose mode. For MST and Mcf, we run Repl with a higher NumLevels. In all cases, Conven4 is on. The results are shown in Figure 7 as the Custom bar in the three applications.

Application   Customized ULMT Algorithm
CG            Seq1+Repl in Verbose mode
MST, Mcf      Repl with NumLevels = 4

Table 5. Customizations performed. Conven4 is also on.

The customization in CG tries to further exploit the positive interaction between processor- and memory-side prefetching. While CG only has sequential miss patterns (Figure 5), its multiple streams overwhelm the conventional prefetcher. Indeed, although processor-side prefetches are very accurate (99.8% of the prefetched lines are referenced), they are not timely enough (only 64% are timely) because some of them miss in the L2 cache. In our customization, we turn on the Verbose mode so that processor-side prefetch requests are seen by the ULMT. Furthermore, the ULMT is extended with a single-stream sequential prefetch algorithm (Seq1) that runs before Repl. In this environment, the positive interaction between the two prefetchers increases. Specifically, while the application references the different streams in an interleaved manner, the processor-side prefetcher "unscrambles" the miss sequence into chunks of same-stream prefetch requests. The Seq1 prefetcher in the ULMT then easily identifies each stream and, very efficiently, prefetches ahead. As a result, 81% of the processor-side prefetches arrive in a timely manner. With this customization, the speedup of CG improves over the 2.19 obtained with Conven4+Repl. This case demonstrates that even regular applications that are amenable to sequential processor-side prefetching can benefit from ULMT prefetching.

The customization in MST and Mcf tries to exploit predictability beyond the third level of successor misses by setting NumLevels to 4 in Repl. As shown in Figure 7, this approach is successful for MST, but it produces marginal gains in Mcf.

Overall, this initial attempt at customization shows promising results. After applying customization to three applications, the average execution speedup of the nine applications relative to NoPref becomes 1.53.
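The sequential prefetcher that underlies both the hardware processor-side prefetcher and the software Seq1 and Seq4 variants (a stream is recognized on the third miss in a sequence, the next NumPref lines are prefetched, and one register per stream holds the stride and the next expected address) can be sketched as follows. This is only an illustrative Python model, not the authors' implementation: the class name `SeqPrefetcher`, the bounded history of recent misses used to spot the third miss, and the oldest-first reuse of registers are assumptions.

```python
from collections import deque


class SeqPrefetcher:
    """Toy model of a multi-stream sequential prefetcher (stride +1 or -1)."""

    def __init__(self, num_seq=4, num_pref=6, history=16):
        self.num_pref = num_pref
        # One [stride, next_expected_line] register per stream.
        # Assumption: when all NumSeq registers are in use, the oldest is reused.
        self.regs = deque(maxlen=num_seq)
        # Assumption: a short window of recent miss lines is kept for detection.
        self.recent = deque(maxlen=history)

    def _next_lines(self, line, stride):
        return [line + stride * i for i in range(1, self.num_pref + 1)]

    def miss(self, line):
        """Observe a miss on a line address; return the lines to prefetch."""
        # Case 1: the miss hits the address stored in a stream register.
        for reg in self.regs:
            stride, expected = reg
            if line == expected:
                lines = self._next_lines(line, stride)
                reg[1] = lines[-1] + stride   # next miss expected past the prefetched run
                return lines
        # Case 2: this is the third miss in a sequence, so a new stream is recognized.
        for stride in (1, -1):
            if line - stride in self.recent and line - 2 * stride in self.recent:
                lines = self._next_lines(line, stride)
                self.regs.append([stride, lines[-1] + stride])
                return lines
        # Otherwise just remember the miss.
        self.recent.append(line)
        return []
```

With `num_seq=1` and `num_pref=6`, misses on lines 100, 101, and 102 trigger a prefetch of lines 103 to 108, and a later miss on line 109 triggers the next six. With several registers, interleaved streams are tracked independently, which is the situation the CG customization exploits.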

Figure 8. Execution time for different locations of the memory processor. (y-axis: Normalized Execution Time, broken down into BeyondL2, UptoL2, and Busy; x-axis: the nine applications and their average.)

Location of Memory Processor. Figure 8 examines the impact of where we place the memory processor (Figure 3). The first two bars for each application are taken from Figure 7: NoPref and Conven4+Repl. The last bar for each application corresponds to the Conven4+Repl algorithm with the memory processor placed in the memory controller (North Bridge) chip (Conven4+ReplMC). With the processor in the North Bridge chip, we have twice the memory access latency (100 cycles vs. 56 cycles), eight times lower memory bandwidth (3.2 GB/sec vs. 25.6 GB/sec), and an additional 25-cycle delay seen by the prefetch requests before they reach the DRAM. However, Figure 8 shows that the impact on the execution time is very small. It results in only a small decrease in the average speedup from the 1.46 obtained with the processor in the DRAM. The impact is small thanks to the ability of Repl to accurately prefetch far ahead. Only the timeliness of the immediate successor prefetches is affected, while the prefetching of further levels of successors is still timely. Overall, given these results and the hardware cost of the two designs, we conclude that putting the memory processor in the North Bridge chip is the more cost-effective design of the two.

Prefetching Effectiveness. To gain further insight into these prefetching schemes, Figure 9 examines the effectiveness of the lines prefetched into the L2 cache by the ULMT. These lines are called prefetches. The figure shows data for Sparse, Tree, and the average of the other seven applications. The figure combines both L2 misses and prefetches, and breaks them down into 5 categories: prefetches that eliminate an L2 miss (Hits), prefetches that eliminate part of the latency of an L2 miss because they arrive a bit late (DelayedHits), L2 misses that pay the full latency (NonPrefMisses), and useless prefetches. Useless prefetches are further broken down into prefetches that are brought into the L2 but that are not referenced by the time they are replaced (Replaced), and prefetches that are dropped on arrival to L2 because the same line is already in the cache (Redundant).
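The five categories above can be tallied from raw event counts and normalized to the original number of L2 misses, which is how Figure 9 is scaled. The sketch below is only an illustration in Python: the function name `prefetch_breakdown` and its argument names are hypothetical, and the derived quantities follow the definitions given in the surrounding text (coverage as Hits plus DelayedHits, useless prefetches as Replaced plus Redundant, and new conflict misses as whatever part of Hits + DelayedHits + NonPrefMisses exceeds the original miss count).

```python
def prefetch_breakdown(hits, delayed_hits, non_pref_misses,
                       replaced, redundant, original_misses):
    """Normalize prefetch/miss counts to the original number of L2 misses.

    All arguments are raw event counts; original_misses is the number of
    L2 misses observed without prefetching and must be positive.
    """
    if original_misses <= 0:
        raise ValueError("original_misses must be positive")
    n = original_misses
    # Misses above the 1.0 line: new conflict misses caused by prefetches.
    extra = max(0, hits + delayed_hits + non_pref_misses - n)
    return {
        "Hits": hits / n,
        "DelayedHits": delayed_hits / n,
        "NonPrefMisses": non_pref_misses / n,
        "Replaced": replaced / n,
        "Redundant": redundant / n,
        # Fraction of original misses fully or partially eliminated.
        "coverage": (hits + delayed_hits) / n,
        # Prefetches that brought no benefit.
        "useless": (replaced + redundant) / n,
        "new_conflict_misses": extra / n,
    }
```

As a usage example with made-up counts chosen to match the averages quoted in the text, 100 original misses with 60 Hits, 14 DelayedHits, 46 NonPrefMisses, 30 Replaced, and 20 Redundant give a coverage of 0.74, useless prefetches equal to 50% of the original misses, and new conflict misses equal to 20% of them.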
Since Coverage is the fraction of the original L2 misses that are fully or partially eliminated, it is represented by the sum of Hits and DelayedHits as shown in Figure 9. NonPrefMisses in Figure 9 is the number of L2 misses left after prefetching, relative to the original number of L2 misses. Note that NonPrefMisses can be higher than 1.0 for some algorithms. 1.0 - NonPrefMisses is the number of L2 misses eliminated relative to the original number of L2 misses. NonPrefMisses can be broken down into two groups: those misses below the 1.0 line in Figure 9 (1.0 - Hits - DelayedHits) come from the original misses, while those above the 1.0 line (Hits + DelayedHits + NonPrefMisses - 1.0) are the new L2 conflict misses caused by prefetches.

Figure 9. Breakdown of the L2 misses and lines prefetched by the ULMT (prefetches). The original misses are normalized to 1. (Categories: Hits, DelayedHits, NonPrefMisses, Replaced, Redundant; groups: Sparse, Tree, and the average for the 7 applications other than Sparse and Tree.)

(Footnote: All these cycle counts are in main-processor cycles.)

Looking at the average of the seven applications, we see why Base and Chain are not effective: their coverage is small. Base is hurt by its inability to prefetch far ahead, while Chain is hampered by its high response time and limited accuracy. The figure also shows that Repl has high coverage (0.74). However, this comes at the cost of useless prefetches (Replaced plus Redundant are equivalent to 50% of the original misses) and additional misses due to conflicts with prefetches (20% of the original misses). We can see, therefore, that advanced pair-based schemes need additional bandwidth.

Conven4+Repl seems to have low coverage, despite its high performance in Figure 7. The reason is that the prefetch requests issued by the processor-side prefetcher, while effective in eliminating L2 misses, are lumped into the NonPrefMisses category in the figure if they reach memory. Since the ULMT prefetcher is in Non-Verbose mode, it does not see these requests. Consequently, the ULMT prefetcher only focuses on the irregular miss patterns. ULMT prefetches that eliminate irregular misses appear as Hits+DelayedHits.

Finally, Figure 9 also shows why Sparse and Tree showed limited speedups in Figure 7.
They have too many conflicts in the cache, which results in many remaining NonPrefMisses. Furthermore, their prefetches are not very accurate, which results in large Replaced and Redundant categories.

Work Load of the ULMT. Figure 10 shows the average response time and occupancy time (Section 3.1) for each of the ULMT algorithms, averaged over all applications. The times are measured in 1.6 GHz cycles. Each bar is broken down into computation time (Busy) and memory stall time (Mem). The numbers on top of each bar show the average IPC of the ULMT. The IPC is calculated as the number of instructions divided by the number of memory processor cycles.

Figure 10. Average response and occupancy time of different ULMT algorithms in main-processor cycles. (y-axis: Number of Processor Cycles, broken down into Mem and Busy; panels: Response Time and Occupancy Time.)

The figure shows that, in all the algorithms, the occupancy time is less than 200 cycles. Consequently, the ULMT is fast enough to process most of the L2 misses (Figure 6). Memory stall time is roughly half of the ULMT execution time when the processor is in the DRAM, and more when the processor is in the North Bridge chip (ReplMC). Base and Chain have the lowest occupancy time. Note that Repl's occupancy is not much higher than Chain's, despite the higher number of table updates performed by Repl. The reasons are the fewer associative searches and the better cache line reuse in Repl.

The response time is most important for prefetching effectiveness. The figure shows that Repl has the lowest response time, at around 30 cycles. The response time of ReplMC is about twice as much. Fortunately, the Replicated algorithm is able to prefetch far ahead accurately and, therefore, the effectiveness of prefetching is not very sensitive to a modest increase in the response time.

Main Memory Bus Utilization. Finally, Figure 11 shows the utilization of the main memory bus for various algorithms, averaged over all applications. The increase in bus utilization induced by the advanced algorithms is divided into two parts: the increase caused naturally by the reduced execution time, and the additional increase caused by the prefetching traffic. Overall, the figure shows that the increase in bus utilization is tolerable. The utilization increases from the original 20% to only 36% in the worst case. Moreover, most of the increase comes from the faster execution; only 6% utilization is directly attributable to the prefetches. In general, the fact that memory-side prefetching only adds one-way traffic to the main memory bus limits its bandwidth needs.

Figure 11. Main memory bus utilization. (y-axis: % Utilization, broken down into: No prefetching, Due to the reduced execution time, Due to prefetching.)

6. Related Work

Memory-Side Prefetching. Some memory-side prefetchers are simple hardware controllers.
For example, the NVIDIA chipset includes the DASP controller in the North Bridge chip [22]. It seems that it is mostly targeted to stride recognition and buffers data locally. The i860 chipset from Intel is reported to have a prefetch cache, which may indicate the presence of a similar engine. Cooksey et al. [9] propose the Content-Based prefetcher, which is a hardware controller that monitors the data coming from memory. If an item appears to be an address, the engine prefetches it. Alexander and Kedem [1] propose a hardware controller that monitors requests at the main memory. If it observes repeatable patterns, it prefetches rows of data from the DRAM to an SRAM buffer inside the memory chip. Overall, our scheme is different in that we use a general-purpose processor running a prefetching algorithm as a user-level thread.

Other studies propose specialized programmable engines. For example, Hughes [11] and Yang and Lebeck [28] propose adding a specialized engine to prefetch linked data structures. While Hughes focuses on a multiprocessor processing-in-memory system, Yang and Lebeck focus on a uniprocessor and put the engine at every level of the cache hierarchy. The main processor downloads information on these engines about the linked structures and what prefetches to perform. Our scheme is different in that it has general applicability.

Another related system is Impulse, an intelligent memory controller capable of remapping physical addresses to improve the performance of irregular applications [4]. Impulse could prefetch data, but only implements next-line prefetching. Furthermore, it buffers data in the memory controller, rather than sending it to the processor.

Correlation Prefetching. Early work on correlation prefetching can be found in [2, 24]. More recently, several authors have made further contributions. Charney and Reeves study correlation prefetching and suggest combining a stride prefetcher with a general correlation prefetcher [6]. Joseph and Grunwald propose the basic correlation table organization and algorithm that we evaluate [12]. Alexander and Kedem use correlation prefetching slightly differently [1], as we indicate above. Sherwood et al. use it to help stream buffers prefetch irregular patterns [26]. Finally, Lai et al. design a slightly different correlation prefetcher [18]. Specifically, a prefetch is not triggered by a miss; instead, it is triggered by a dead-line predictor indicating that a line in the cache will not be used again and, therefore, a new line should be prefetched in. This scheme improves prefetching timeliness at the expense of tighter integration of the prefetcher with the processor, since the prefetcher needs to observe not only miss addresses, but also reference addresses and program counters.

We differ from the recent works in important ways. First, they propose hardware-only engines, which often require expensive hardware tables; we use a flexible user-level thread on a general-purpose core that stores the table as a software structure in memory. Second, except for Alexander and Kedem [1], they place their engines between the L1 and L2 caches, or between the processor and the L1; we place the prefetcher in memory and focus on L2 misses. Time intervals between L2 misses are large enough for a ULMT to be viable and effective. Finally, we propose a new table organization and prefetching algorithm that, by exploiting inexpensive memory space, increases far-ahead prefetching and prefetch coverage.

Prefetching Regular Structures. Several schemes have been proposed to prefetch sequential or strided patterns. They include the Reference Prediction table of Chen and Baer [7], and the Stream buffers of Jouppi [13], Palacharla and Kessler [23], and Sherwood et al. [26]. We base our processor-side prefetcher on these schemes.

Processor-Side Prefetching. There are many more proposals for processor-side prefetching, often for irregular applications. A tiny, non-exhaustive list includes Choi et al. [8], Karlsson et al. [14], Lipasti et al. [19], Luk and Mowry [20], Roth et al. [25], and Zhang and Torrellas [29]. Most of these schemes specifically target linked data structures. They tend to rely on program information that is available to the processor, like the addresses and sizes of data struc-
