Reducing SDRAM Energy Consumption in Embedded Systems Λ


Jelena Trajkovic, Alexander Veidenbaum
University of California, Irvine
{jelenat,alexv}@ics.uci.edu

Abstract

DRAM energy consumption in embedded systems can be very high, exceeding that of the data cache or even of the entire processor. This paper presents a scheme for reducing the energy consumption of SDRAM memory access by a combination of techniques that take advantage of SDRAM energy efficiencies in bank and row access. This is achieved by using small, cache-like structures in the memory controller to prefetch one or more additional cache blocks on reads and to combine block writes to the same DRAM row. The results quantify the DRAM energy consumption of MiBench applications and demonstrate significant savings in both the DRAM energy consumption, an average of 23%, and the energy-delay product, an average of 44%. The approach also improves performance: the CPI is reduced by 26% on average.

1. Introduction

Many embedded applications are memory intensive, especially multimedia applications. Memory access constitutes a significant portion of overall energy consumption in such applications [2, ]. This research investigates an architectural approach to reducing the memory system energy without any performance loss. In fact, it will be shown to result in a performance gain.

Much of prior research to reduce energy consumption has focused on caches. But main memory consumes orders of magnitude more energy per access than cache. As will be shown below, in some applications the total energy of main memory accesses can be an order of magnitude higher than the total data cache energy consumption. Thus it is very important to optimize DRAM access for energy consumption. Some of the techniques proposed for cache optimization can be extended for this purpose.

SDRAM memory is one of the major types of DRAM used in embedded systems. The energy of SDRAM access can be divided into two main components: the energy of a bank/row activation (activate-precharge pair) and the energy of a read or write access to an active bank/row.
The activate-precharge pair consumes 65% of the total access energy (per manufacturer's data). The SDRAM organization allows the bank/row to be left on after an access, which permits additional read/write access to such bank/row without incurring the activate/precharge cost on each access. Reading or writing twice the amount of data within the same activate-precharge cycle does not double the energy consumption but only increases it by approximately 35%.

Embedded systems use a data cache and only access memory on cache misses or write-backs. Thus the memory is read/written in (cache) blocks (lines). Therefore this paper proposes reading/writing multiple lines at a time, i.e. within a single activate-precharge pair. This will be shown to lead to significant energy savings. Multiple block reads are accomplished via hardware prefetching and writes via write combining [16] at the memory interface.

Accessing multiple lines requires intermediate storage. This research proposes to use a small amount of such storage in the memory controller. Additional lines will be prefetched into this storage on each read. Writes to the same SDRAM row will be buffered first and combined into a multiple line SDRAM write whenever possible. The main contributions of this research are adapting both prefetching and write combining to SDRAM energy reduction, in particular combining writes to the same SDRAM row.

Λ This research was supported in part by the National Science Foundation under Grant No. NSF CCR
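The arithmetic behind the two-component argument above can be sketched as follows. This is an idealized linear model, not the manufacturer's power calculator: it assumes that one activate-precharge pair costs 0.65 units and each line access costs 0.35 units (the 65%/35% split quoted in the text), and the function names are hypothetical.

```python
# Idealized two-component model of SDRAM access energy (normalized units).
# Assumption: one activate-precharge pair costs 0.65 and each line
# read/write costs 0.35, following the 65%/35% split quoted in the text.
E_ACT_PRE = 0.65
E_ACCESS = 0.35


def energy_separate(k):
    """Energy of k line accesses, each with its own activate-precharge pair."""
    return k * (E_ACT_PRE + E_ACCESS)


def energy_combined(k):
    """Energy of k line accesses sharing a single activate-precharge pair."""
    return E_ACT_PRE + k * E_ACCESS


def saving(k):
    """Fractional saving of the combined access over k separate accesses."""
    return 1.0 - energy_combined(k) / energy_separate(k)


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, f"combined={energy_combined(k):.2f}", f"saving={saving(k):.1%}")
```

Under these assumptions two combined accesses cost 1.35 units, which matches the "approximately 35%" increase stated above. The savings this model predicts are more optimistic than the 24%, 34% and 38% the paper reports in Section 3 from the manufacturer's calculator, so it should be read as an upper-bound illustration only.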

Figure 1. Write current profile

A second goal of this research is to avoid affecting the execution time or even to improve it while saving energy. The small buffers used for prefetching/combining act as a memory cache and can significantly improve read performance. Prefetching can sometimes degrade execution time by interfering with regular memory access without supplying useful data. Write buffering enables the processor to continue execution and not wait for the write access to finish.¹

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes SDRAM energy components for read and write access. Section 4 presents architectural modifications and describes the energy saving techniques of our approach. Experimental results demonstrating the benefits of the approach are presented in Section 5. Conclusions are presented in Section 6.

2. Related Work

Many architectural approaches for reducing energy consumption in embedded processors have been proposed. We are unaware of architectural solutions for SDRAM energy reduction. For a software-based solution see [10], for instance. There is a large body of prior work on prefetching, write buffers, and write combining buffers that is briefly summarized below. Techniques for reducing energy consumption in main memory are also described.

Numerous prefetching algorithms have been proposed ([16, 8, 4, 2, 17]) based on history-based prediction of future memory addresses. They differ in their way of predicting which block to fetch, how many blocks to fetch and what triggers a prefetch. Many of these schemes require a complex prediction mechanism, although the so-called one-block-lookahead schemes do not. For instance, a stream buffer [8] prefetches N consecutive lines triggered by a cache miss. None of the prefetching proposals targets energy reduction.

Write buffers [16, 9] have been used in many processors to avoid waiting for data to be written to memory and to avoid delaying cache miss fetch. Merging was introduced to improve performance of write buffers for write-through caches.
This approach combines an incoming write request within a cache line with requests already residing in the write buffer, resulting in more efficient memory usage. The technique has been implemented in many architectures: Digital's VAX 8800 [6], StrongARM [14], MIPS R4300i [13], and Intel XScale [3].

Low power mode is present in most state-of-the-art DRAMs. A significant amount of energy can be saved by setting as many memory chips as possible to sleep mode [12]. Efforts have been made in reducing memory energy consumption based on different compression techniques. For instance, [1] describes and evaluates a computer system that supports hardware main memory compression. As the compression ratio of applications dynamically changes so does the real memory size that is managed by the operating system (OS). OS changes are necessary to support main memory compression. These changes include the management of the free pages pool as a function of the physical memory utilization and the effective compression ratio, coupled with zeroing pages at free time rather than at allocation time.

Kim et al. [10] identify successive memory accesses to different rows and banks as a source of increased latency and energy consumption in memory. They use block-based layouts instead of a traditional one and determine the best block size such that the number of requests to different rows or banks is minimized.

¹ It is assumed that there is no post-L1 cache write buffer in a CPU with a write-back cache.

Figure 2. Energy per memory access (SDRAM energy per memory access in joules, split into E(ACT-PRE) and E(WR) components, for: write one, two times write one, write two, three times write one, write three, four times write one, write four)

3. SDRAM Energy Components

Let us identify potential sources of energy saving in SDRAM memories that are used in embedded devices. In order to access data from an SDRAM, a row in a particular bank has to be activated. After activation and a specified delay a read or a write is performed. When the access is completed, a row precharge operation is performed. Also, if access to data in a different row has to be performed, the current row needs to be precharged before the new row is activated. The total energy of an access consists of two main components: energy consumed by the activate-precharge pair and read or write access energy.

Micron describes the current profile for the write (or read) operation in such memory in its Technical Note [19], as shown in Figure 1² (reproduced from [19]). The first large peak in the graph corresponds to the activation command. The middle plateau corresponds to writing four words of data. Finally, a small peak can be noticed for the precharge command.
This shows that the activation-precharge pair constitutes a significant portion of the overall current and thus of the energy consumption.

Figure 2 quantifies energy components of a 64Mb SDRAM memory [18]. These were obtained using Micron's System Power Calculator [19]. Each bar on the graph shows the activate-precharge and write components of the total energy. The figure shows the total energy for writing 16 bytes (B) of data (one cache line), two separate 16B writes, and two 16B writes combined in one activation-precharge. It also presents data for 3 and 4 accesses, performed separately or combined. The figure shows the activate-precharge pair to be the dominant component. It contributes 65% to the total energy consumption for this SDRAM. The energy savings from combined read or write accesses are 24%, 34% and 38% for two, three and four combined accesses, respectively, compared to the same number of accesses performed in sequence.

² © 2001 Micron Technology, Inc. All Rights Reserved. Used with permission.

4. Proposed Approach

The memory subsystem of a typical embedded system is shown in Figure 3a. The baseline configuration consists of a CPU with a single level of cache and a main memory. The cache is write-back. Memory latency is high in terms of processor cycles and the processor has to stall for many cycles waiting for the data to arrive from the memory.

As discussed above, it would be good to fetch more than one line on each cache miss and thus save energy. This has to be done under the usual constraints of embedded systems: reduced cost, energy, and complexity. Because of that the prefetching has to be precise, cannot use very complex logic due to potential time and energy overheads, and cannot use a large buffer memory, again due to overheads. Thus, the approach chosen for this work is to use a simple one-block lookahead [?] or stream buffer-like [?] prefetching.

Figure 3. Memory subsystem architectures: a) baseline (CPU, cache, memory); b) write combining (CPU, cache, controller with WCB, memory); c) read prefetching (CPU, cache, controller with fetch buffer, memory); d) combined (CPU, cache, controller with WCB and fetch buffer, memory)

Figure 4. Architecture of write-combine buffer (entry fields: tag_row, tag_col, v, data; the incoming address feeds a read check and a write-combine check, each producing a hit signal)

Combining of multiple line writes is used in this paper with a goal of energy reduction. Thus a different type of combining or coalescing write buffer (WCB) is proposed for writes. The difference is that it should be able to combine any two writes to the same SDRAM row. It also needs to be small and simple for the same reasons as discussed above for prefetching.

Let us initially discuss write and read combining separately to understand the benefits and requirements of each of them. Next, the combined approach will be investigated and the best mechanism selected. In all cases, the sizes of buffers are studied as part of this research. For reasons that will become clearer after the architecture of each separate buffer is studied, the combined approach will actually use separate buffers as opposed to a single, mini-L2 cache in the memory interface. While conceptually the same, the implementations are quite different, with separate buffers having an advantage.

4.1 Write Combining

The memory controller for write combining is shown in Figure 3b. Figure 4 shows the write-combining buffer (WCB) architecture for combining 2 write requests. Each entry consists of a split tag, a valid bit, and data storage for one cache line. Tag bits are divided into two groups: bits that determine the row address in the memory (tag_row) and the remaining tag bits that are part of the column address (tag_col). The buffer is expected to be very small and thus full associativity is easily implementable.
An address of an incoming cache line write request is checked against all tag_row entries. LRU or pseudo-LRU replacement is used. A hit in the WCB means that the incoming write request can be combined and performed together with a valid WCB entry. Once the memory write is performed the WCB entry is freed. Notice that the incoming write is not written into the WCB in this case.
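The two-write WCB behaviour described in Section 4.1 can be sketched in software as below. This is a minimal behavioural model under stated assumptions, not the hardware design: the class and method names are hypothetical, the 1024-byte row size is chosen only for illustration, bank bits are folded into the row tag, oldest-first eviction stands in for LRU, and overwriting a buffered line when the same line is written again is an added assumption the text does not cover. Write-miss handling follows the description of storing the request for a future combine.

```python
from collections import OrderedDict


class WriteCombineBuffer:
    """Behavioural sketch of a fully associative WCB combining two line
    writes to the same SDRAM row. Sizes are illustrative assumptions."""

    def __init__(self, entries=8, line_bytes=16, row_bytes=1024):
        self.entries = entries
        self.line_bytes = line_bytes
        self.row_bytes = row_bytes      # assumed SDRAM row size
        self.buf = OrderedDict()        # tag_row -> (tag_col, data), oldest first
        self.memory_writes = []         # log of (tag_row, [(tag_col, data), ...])

    def split(self, addr):
        """Split a line-aligned address into (tag_row, tag_col)."""
        return addr // self.row_bytes, (addr % self.row_bytes) // self.line_bytes

    def write(self, addr, data):
        tag_row, tag_col = self.split(addr)
        if tag_row in self.buf:
            old_col, old_data = self.buf[tag_row]
            if old_col == tag_col:
                # Same line written again (assumption): keep only the newest data.
                self.buf[tag_row] = (tag_col, data)
                return
            # Hit: both lines go to memory in one activate-precharge pair
            # and the entry is freed; the incoming line is never buffered.
            del self.buf[tag_row]
            self.memory_writes.append(
                (tag_row, [(old_col, old_data), (tag_col, data)]))
            return
        if len(self.buf) >= self.entries:
            # Miss with a full buffer: write back the oldest single-line entry.
            victim_row, victim = self.buf.popitem(last=False)
            self.memory_writes.append((victim_row, [victim]))
        self.buf[tag_row] = (tag_col, data)

    def read(self, addr):
        """Coherence check on reads: match the full line address."""
        tag_row, tag_col = self.split(addr)
        entry = self.buf.get(tag_row)
        if entry is not None and entry[0] == tag_col:
            return entry[1]
        return None


if __name__ == "__main__":
    wcb = WriteCombineBuffer(entries=2)
    wcb.write(0x0000, "A")
    wcb.write(0x0040, "B")   # same row as 0x0000, so the two writes combine
    print(wcb.memory_writes)
```

Keying the buffer by tag_row is enough here because, with two-write combining, a second write to a buffered row flushes the entry immediately, so two entries never share a row.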

A WCB write miss causes the incoming request to be stored in the WCB and to be potentially combined with a future write. A write to the WCB may cause a replacement and write-back from the WCB of a single-block entry.

This architecture can be extended to combine more than 2 accesses. To combine N+1 writes, N tag_col sub-tags, N valid bits, and N data store blocks are stored with each tag_row entry. The entry contains a counter to show how many writes to this SDRAM row are already present. An incoming write causes a write to memory on a hit if the entry counter has a value of N. On replacement fewer than N entries may be written to memory, as specified by the counter value.

To summarize, the WCB differs from the traditional write buffer or even a coalescing write buffer because it can merge data that is anywhere in a given SDRAM row. As a result it writes data to the memory in units of N+1 cache blocks or less. The goal is thus to write N+1 entries as often as possible. A traditional write buffer, on the other hand, can only coalesce individual words (or sub-words) within a cache line and writes data to memory when the memory is not busy.

A major advantage of this new form of write combining is that it cannot incur any energy losses. The total number of writes is the same as in the baseline case but those writes are potentially grouped in a different way. In addition, the WCB reduces the processor CPI by allowing the CPU to continue execution as soon as data are written to the WCB (as opposed to waiting for the SDRAM write to complete).

The presence of the WCB creates a coherence problem on reads. It is solved as follows: every read address is checked against the full line address (i.e. both tag_row and tag_col bits) of every line in the WCB. A read hit implies the needed data is in the WCB and the matched line is sent to the CPU cache. This results in miss latency reduction.

4.2 Read Prefetching

The goal of read combining is to perform multi-line DRAM reads.
However, since there is only one read miss at any given time in an embedded system (with an in-order CPU), there is nothing to combine it with. Thus the only way to read-combine is to generate an additional address speculatively via prediction. This is what other sequential prefetching mechanisms mentioned above do. The difference is that our prefetching is aimed at main memory energy reduction.

It is possible to prefetch non-adjacent lines within the same row, in a way similar to write combining. This would, however, require a very sophisticated address predictor that would be both large and complex (see [11]). This is why only simple, sequential prefetching is considered here.

The memory controller for read combining is shown in Figure 3c. It fetches N additional cache lines on a read miss. The lines are stored in a fetch buffer (FB): a small, cache-like structure with a tag on each entry. Each cache read miss is checked against the FB. On a hit, the line from the FB is sent to the CPU cache. On an FB miss the line is read from the DRAM together with N additional lines, which are stored in the FB. The missed line is read first and sent to the CPU cache. All N+1 read accesses are performed in the same activate-precharge cycle. The rest of this paper will deal with N = 1, 2, and 3. As will be shown, a small, fully associative FB is sufficient to achieve significant energy reduction. In addition, the performance is also improved due to the FB's caching effect.

4.3 Read and Write Combining

Write combining and read combining each have their own advantages. They are largely independent of each other and thus can be deployed together for an additive energy reduction as well as performance improvement. The question is what is the best architecture to perform the read and the write combining at the same time. The architecture advocated by this paper is shown in Figure 3d. This solution basically integrates the separate fetch buffer (FB) and write-combine buffer (WCB). While a single, cache-like structure can be designed, it will have two major disadvantages.
First, it will likely require that N, the number of cache lines to combine, be the same for reads and writes. As will be shown in this paper this is not desirable. Second, and more importantly, it will make it more difficult to perform both sequential read combining and write combining anywhere within an SDRAM row, which is very important. In addition, there will be interference and replacements of write lines by read prefetches and vice versa in this case. Also, the split tag is not required for read combining.

Merging the WCB and FB designs is not very difficult since each will continue to operate independently and has its own control. Thus the write-combining operation in the WCB remains the same and the read combining (prefetching) operation in the FB remains the same. Recall that the WCB was already checked on each read miss and could supply data to the CPU cache.

There is one change that is required for the merged organization. Additional coherency checks have to be performed between writes and prefetches. First, prefetched data can be invalidated by an incoming write from the CPU cache. Second, there is no point in prefetching lines already in the write combining buffer.
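The fetch buffer and the two coherency checks just described can be sketched as follows. This is an illustrative model only: the names `prefetch_addresses` and `FetchBuffer` are hypothetical, the 1024-byte row size is an assumption, the miss address is assumed to be line-aligned, and LRU replacement in the FB is an assumption since the text does not name the FB replacement policy.

```python
from collections import OrderedDict

LINE_BYTES = 16     # cache line size used in the evaluated configuration
ROW_BYTES = 1024    # assumed SDRAM row size, for illustration only


def prefetch_addresses(miss_addr, n, wcb_lines):
    """Return the sequential line addresses to fetch along with miss_addr.

    Candidates beyond the SDRAM row boundary are dropped, as are lines
    already held in the write-combine buffer (wcb_lines is a set of
    line addresses)."""
    row = miss_addr // ROW_BYTES
    selected = []
    for i in range(1, n + 1):
        addr = miss_addr + i * LINE_BYTES
        if addr // ROW_BYTES != row:
            break                   # later candidates are outside the row too
        if addr in wcb_lines:
            continue                # newest copy already sits in the WCB
        selected.append(addr)
    return selected


class FetchBuffer:
    """Small fully associative buffer of prefetched lines (LRU assumed)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.lines = OrderedDict()  # line address -> data, least recent first

    def lookup(self, addr):
        """Return the buffered line on a hit, or None on a miss."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return self.lines[addr]
        return None

    def fill(self, addr, data):
        """Store a prefetched line, evicting the least recently used one."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
        elif len(self.lines) >= self.entries:
            self.lines.popitem(last=False)
        self.lines[addr] = data

    def invalidate(self, addr):
        """Drop a prefetched line made stale by a write from the CPU cache."""
        self.lines.pop(addr, None)


if __name__ == "__main__":
    print(prefetch_addresses(0x3E0, 3, set()))
    print(prefetch_addresses(0x000, 2, {0x010}))
```

In the first call only one candidate survives because the following line would cross the assumed row boundary; in the second the first candidate is skipped because it is already buffered for writing.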

Table 1. Memory energy relative to the energy of data cache (columns: Benchmark; Emem relative to Ecache [%]). Surviving entries, in their original order: d_FFT, d_rijndael, e_rijndael, e_susan, 6.5, d_jpeg, c_jpeg, d_blowfish, e_blowfish, 3., i_FFT

Briefly, the solution is two-fold. First, any incoming data cache write request is checked against both the FB and the WCB (in parallel). A matching FB entry is invalidated. Second, every prefetch address is checked against the WCB first, then sent to the DRAM only if there was no match. The small size of the WCB guarantees that this additional fully associative search has low energy overhead and does not cause slowdown. The algorithm that the controller implements to keep coherency between buffers is:

On outgoing cache request
  if replacement then
    check write in WCB
    invalidate FB entry if exists
  else
    check read in WCB and FB in parallel (hit is possible in only one of buffers)
    if hit in WCB then
      supply data to the CPU
    end if
    if hit in FB then
      supply data to the CPU
    else
      generate N prefetch addresses
      check for the same row/bank (drop ones that exceed row/bank boundary); N1 of the N addresses remain
      check for existence in WCB (drop matched); N2 of the N1 addresses remain
      fetch N2 + 1 addresses
    end if
  end if

The only potential drawback of the combined operation is over-utilization of the limited memory bandwidth. A combination of write combining and read prefetching can use up all of the available bandwidth. A read miss may thus be delayed and cause a slowdown. The evaluation of access combining is presented in the next section. It will show that the energy and/or performance loss can be avoided in almost all cases. When it does happen it can be minimized by a proper choice of architectural parameters.

5. Evaluation Methodology

The system modeled in this paper consists of an in-order processor and a single, large SDRAM memory chip. One can think of a mobile phone as an example of such a system. The processor is a single-issue, 32b embedded processor resembling Intel's XScale. It has 8KB, 4-way set associative instruction and data caches with a 16-byte line and a 2-cycle latency.
The data cache implements a write-allocate, write-back policy. The CPU operating frequency is 400MHz. The CPU memory bus is a 100MHz, 32b bus. The baseline cache miss latency is 36 processor cycles for the first word to arrive, and an additional 4 processor cycles for each consecutive word. The main memory with the modified controller has a latency of 40 processor cycles for the delivery of the first word, and 4 cycles for each additional consecutive word. Both baseline and modified architectures use the same SDRAM (see data sheet [18]). The extra 4 cycles (10 ns) in the access time to modified memory are due to FB and WCB access delays. The SDRAM clock rate is MHz (speed grade -6).

The evaluation is performed using the SimpleScalar 3.0 simulator [5] executing PISA binaries. SimpleScalar's bus and memory model are modified to match this architecture. Both FB and WCB are fully associative, with 16-byte lines and a latency of 12 processor cycles. The WCB can be configured to store N lines per entry. The FB and WCB sizes were limited to avoid overhead and reduce cost. As a result, the FB consumes 0.75% and the WCB consumes 4% of the data cache energy when both buffers are at full capacity and assuming a 0.18 micron process technology is used.

Dynamic energy consumption of the cache, WCB, and FB is modeled using a modified CACTI 3.2 [15] for 0.18 micron technology. One of the main changes is in the sense amplifier energy model, which was overestimated in the original model. Main memory energy is modeled using Micron's System Power Calculator [19].

Benchmarks from MiBench [7] are used in this study. All benchmarks are simulated using large input sets.

Figure 5. Memory energy reduction for read combining for different buffer sizes (E improvement [%]; series: Fetch2_16, Fetch2_32, Fetch2_64, Fetch2_128)

Figure 6. Memory ED product reduction for read combining for different buffer sizes (ED improvement [%]; series: Fetch2_16, Fetch2_32, Fetch2_64, Fetch2_128)

5.1 Results

The impact of the proposed architecture is evaluated by comparing the memory energy consumption, energy-delay product, and CPI relative to the baseline configuration.
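The three metrics are related, and a rough consistency check on the averages quoted in the abstract can be sketched as follows. The sketch assumes that delay scales with CPI (same instruction count and clock rate), which the paper does not state explicitly, and the helper names are hypothetical.

```python
def improvement(baseline, new):
    """Fractional improvement of a lower-is-better metric over its baseline."""
    return 1.0 - new / baseline


def ed_improvement(energy_impr, delay_impr):
    """Improvement in the energy-delay product implied by separate
    fractional improvements in energy and in delay."""
    return 1.0 - (1.0 - energy_impr) * (1.0 - delay_impr)


if __name__ == "__main__":
    # Averages from the abstract: 23% energy saving, 26% CPI reduction.
    print(f"{ed_improvement(0.23, 0.26):.1%}")
```

With the averages from the abstract this gives roughly 43%, close to the reported 44% average. An exact match is not expected, since an average of per-benchmark products generally differs from the product of the averages.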
The following legend is used:

• FetchN_M for read fetch of N lines with a buffer of M lines (N-1 lines are prefetched);
• WCB_P for write combining of 2 accesses with a buffer of P lines;
• WCB_PxQ for write combining of (Q+1) accesses with a buffer of P entries x Q lines;
• FetchN_M+WCB_PxQ for a hybrid configuration.

Figure 7. CPI improvement for read combining for different buffer sizes (CPI reduction [%]; series: Fetch2_16, Fetch2_32, Fetch2_64, Fetch2_128)

Figure 8. Memory energy reduction for read combining for different fetch and buffer sizes (E improvement [%]; series: Fetch2_32, Fetch2_64, Fetch2_128, Fetch3_32, Fetch3_64, Fetch3_128, Fetch4_32, Fetch4_64, Fetch4_128)

Table 1 shows memory energy per benchmark relative to the data cache energy consumption for the baseline model. This is with an average cache miss rate of 3.5%. [14] showed the data cache consuming 16% of the overall processor energy. For MiBench benchmarks the main memory consumes, on average, 2.6 times the energy of the data cache. The worst case difference is 15x.

Read Prefetch: the Effect of Fetch and Buffer Size

First, let us evaluate read combining and its effect on memory energy consumption. Figure 5 shows energy reduction for different fetch buffer sizes relative to the baseline configuration. Buffer sizes of 16, 32, 64 and 128 entries are used, fetching two 16B blocks. The average memory energy savings are 12% to 17%. The smallest buffer already obtains a significant reduction, with each doubling of the size producing a small (1% to 2%) increase. Two benchmarks, d_rijndael and e_rijndael, have a noticeable increase in memory energy consumption for a 16-entry FB. With 32 entries there are basically no energy increases, making it a good choice.

The energy-delay (ED) product is shown in Figure 6. It is reduced by as much as 68% and by 38% on average, with buffer size having almost no impact. It can be seen that both benchmarks that have an energy increase still obtain a significant ED product saving. The energy-delay reduction is large due to an improved average memory latency. The effect of latency reduction can be seen in the CPI improvement shown in Figure 7. CPI is also insensitive to the buffer size change. The read combining technique reduces CPI by as much as 59% and by 27% on average.

Figure 9. Memory energy-delay product for read combining for different fetch and buffer sizes (ED improvement [%]; series: Fetch2_32, Fetch2_64, Fetch2_128, Fetch3_32, Fetch3_64, Fetch3_128, Fetch4_32, Fetch4_64, Fetch4_128)

Figure 10. CPI reduction for read combining for different fetch and buffer sizes (CPI reduction [%]; same series as Figure 9)

If we consider energy as the main factor, the smallest buffer that provides savings with no overhead is one with 32 entries. On the other hand, if we consider ED product, it is the buffer with 16 entries that brings the same savings as the largest buffer in the majority of cases. The only two benchmarks that have a significant difference in ED product saving (15%) are d_rijndael and e_rijndael. Therefore, for the combined technique, we will explore read-combine buffers of 16 and 32 entries.

Let us now explore the use of different fetch sizes. Figure 8 shows the energy reduction relative to the baseline configuration. We have considered fetching 2, 3 and 4 lines and buffer sizes of 32, 64 and 128 entries. It can be seen that a fetch size larger than 2 is not beneficial. Generally, a larger fetch size increases energy consumption because many prefetched lines are not used. Fetch size 3 in some cases reduces energy consumption, but not as much as fetch size 2, while fetch size 4 always has too much overhead. On average, a fetch size of 2 saves 13% to 16% of energy, fetching 3 lines saves from -1% to 8% and fetching 4 increases energy consumption by 16% to 37%.

Figures 9 and 10 present ED product and CPI savings. It can be seen that for ED product savings, fetch size 2 uniformly outperforms other fetch sizes. The only exception is for fetch 3 where the difference is just 1%. The largest saving is 68% and the average improvement ranges from 5% to 39%. CPI saving is not significantly affected by any parameter change except for , , , and , where the difference is less than 5%.
In the best case, a 65% improvement is obtained, and on average the improvement ranges from 27% to 29%.

Write Combining: the Effect of Combining and Buffer Size

Figure 11 shows the relative memory energy for write combining. Buffer sizes of 2, 4, and 8 entries are used, with 2, 3, and 4-line combining. The buffer configurations are chosen to have approximately the same size in all cases. On average, the improvement ranges from 8% to 11%, which is smaller than for read combining. Buffer size has little impact (the left three bars), but additional energy savings are obtained when combining 3 or 4 lines. Write combining achieves up to an 80% reduction of ED product (see Figure 12) with a 40% average. CPI savings (Figure 13) are also not affected by size or configuration. Write combining achieves up to a 76% CPI improvement, with a 33% average.

Figure 11. Memory energy reduction for different write-combining and buffer sizes (E improvement [%]; series: WCB_2, WCB_4, WCB_8, WCB_4x2, WCB_2x3)

Figure 12. Memory ED product reduction for different write-combining and buffer sizes (ED improvement [%]; series: WCB_2, WCB_4, WCB_8, WCB_4x2, WCB_2x3)

Hybrid Configurations

Figure 14 shows the effect of both write and read combining. The fetch buffer with 16 entries is used with write combining buffers of size 8, configured to combine either 2, 3 or 4 writes. The results show that combining 3 lines is the best configuration. On average 21.5% to 23.5% energy savings are obtained. As seen in Figures 15 and 16, the difference in ED product and CPI savings for different configurations is not more than 2%. ED product is reduced by 71% in the best case and by 44% on average. CPI is improved by up to 56%, and by 26% on average.

Figure 17 shows the effect of both write and read combining when the fetch buffer with 32 entries is used with write combining buffers of size 8, configured to combine either 2, 3 or 4 writes. Still, combining 3 lines gives the best configuration. On average 22% to 24% energy savings are obtained. The difference in ED product and CPI savings is negligible (2%) for different configurations, as seen in Figures 18 and 19 respectively. ED product and CPI reduction are the same as for the configuration with the 16-entry read-combining buffer.

6. Conclusions

This research developed a technique for reducing energy consumption for SDRAM memory access in embedded systems.
We introduced architectural additions to the memory controller in the form of a fully parameterizable unit that consists of a small high-speed fetch buffer and a write-combine buffer. This allowed read prefetching and combined write access to the main memory. Since prefetched data resides in a fast and small cache-like memory, an access to it is significantly cheaper, both in terms of time and energy consumption. Combining write accesses leads to gains without any penalty.

The technique was evaluated using the SimpleScalar simulator of an XScale-like embedded processor. The results demonstrate that a significant reduction in memory energy consumption and delays can be achieved by read prefetching and write combining. Even with small buffers, 256B/512B for prefetching and 128B for write combining, an average 23% energy reduction is achieved. The energy-delay product is improved, on average, by over 40%. The CPI is reduced by 26%, on average. Prefetching or write combining can be powered down individually to better tune them to a given application. The proposed approach requires simple hardware suitable for embedded systems. In the resource-constrained environment of embedded systems running multimedia applications these energy savings provide a significant benefit.

Figure 13. CPI improvement for different write-combining and buffer sizes (CPI reduction [%]; series: WCB_2, WCB_4, WCB_8, WCB_4x2, WCB_2x3)

Figure 14. Memory energy reduction for different combined configurations, for 16-entry FB (E improvement [%]; series: Fetch2_16+WCB_8, Fetch2_16+WCB_4x2, Fetch2_16+WCB_2x3)

Figure 15. Memory ED product reduction for different combined configurations, for 16-entry FB (ED improvement [%])

Figure 16. CPI improvement for different combined configurations, for 16-entry FB (CPI reduction [%])

Figure 17. Memory energy reduction for different combined configurations, for 32-entry FB (E improvement [%]; series: Fetch2_32+WCB_8, Fetch2_32+WCB_4x2, Fetch2_32+WCB_2x3)

Figure 18. Memory ED product reduction for different combined configurations, for 32-entry FB (ED improvement [%])

Figure 19. CPI improvement for different combined configurations, for 32-entry FB (CPI reduction [%])

References

[1] B. Abali and H. Franke. Operating system support for fast hardware compression of main memory contents. In Memory Wall Workshop, the 27th Ann. Int. Sym. on Computer Architecture, 2000.
[2] K. Barr and K. Asanovic. Energy aware lossless data compression. In Proceedings of the First International Conference on Mobile Systems, Applications, and Services (MobiSys 2003), San Francisco, CA, 2003.
[3] L. T. Clark et al. An embedded 32b microprocessor core for low-power and high-performance applications. IEEE JSSC, 36(11), Nov. 2001.
[4] F. Dahlgren, M. Dubois, and P. Stenstrom. Fixed and adaptive sequential prefetching in shared-memory multiprocessors. In Proceedings of the 1993 International Conference on Parallel Processing, pages 56-63.
[5] D. Burger and T. Austin. The SimpleScalar tool set, version 2.0. Technical report, University of Wisconsin-Madison.
[6] J. Fu, J. Keller, and K. Haduch. Aspects of the VAX 8800 C Box design. Digital Technical Journal, Number 4, February.
[7] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, pages 83-94, 2001.
[8] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture. ACM Press, 1990.
[9] R. Kessler, E. McLellan, and D. Webb. The Alpha microprocessor architecture. In ACM SIGPLAN Notices.
[10] H. S. Kim, N. Vijaykrishnan, M. Kandemir, E. Brockmeyer, F. Catthoor, and M. J. Irwin. Estimating influence of data layout optimizations on SDRAM energy consumption. In Proceedings of the 2003 International Symposium on Low Power Electronics and Design. ACM Press, 2003.
[11] S. Kumar and C. Wilkerson. Exploiting spatial locality in data cache using spatial footprint. In International Symposium on Computer Architecture.
[12] A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power aware page allocation. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), November 2000.
[13] MIPS Technologies, Inc. R Series Documents.
[14] J. Montanaro et al. A 160 MHz, 32 b, 0.5 W CMOS RISC microprocessor. IEEE JSSC, 31(11), Nov.
[15] P. Shivakumar and N. Jouppi. CACTI 3.0: An integrated cache timing, power, and area model. Technical report, Digital Equipment Corporation, COMPAQ Western Research Lab, 199.
[16] A. J. Smith. Cache memories. ACM Comput. Surv., 14(3):473-530.
[17] Y. Solihin, J. Torrellas, and J. Lee. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.
[18] Micron. Synchronous DRAM 64Mb x32. Part number: MT48LC2M32B2.
[19] Micron. System-Power Calculator.
[20] A. Veidenbaum, W. Tang, R. Gupta, A. Nicolau, and X. Ji. Adapting cache line size to application behavior. In Int'l Conf. Supercomputing.
