Multiple Sub-Row Buffers in DRAM: Unlocking Performance and Energy Improvement Opportunities

Nagendra Gulur, Texas Instruments (India), nagendra@ti.com
Mahesh Mehendale, Texas Instruments (India), m-mehendale@ti.com

ABSTRACT
The twin demands of energy-efficiency and higher performance on DRAM are highly emphasized in multicore architectures. A variety of schemes have been proposed to address either the latency or the energy consumption of DRAMs. These schemes typically require non-trivial hardware changes and end up improving latency at the cost of energy or vice-versa. One specific DRAM performance problem in multicores is that interleaved accesses from different cores can potentially degrade row-buffer locality. In this paper, based on the temporal and spatial locality characteristics of memory accesses, we propose a reorganization of the existing single large row-buffer in a DRAM bank into multiple sub-row buffers (MSRB). This reorganization not only improves row hit rates, and hence the average memory latency, but also brings down the energy consumed by the DRAM. The first major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves weighted speedup by 35.8%, 14.5% and 21.6% in quad, eight and sixteen core workloads, along with a 42%, 28% and 31% reduction in DRAM energy. The proposed MSRB organization enables opportunities for the management of multiple row-buffers at the memory controller level. As the memory controller is aware of the behaviour of individual cores, it allows us to implement coordinated buffer allocation schemes for different cores that take into account program behaviour. We demonstrate two such schemes, namely Fairness Oriented Allocation and Performance Oriented Allocation, which show the flexibility that memory controllers can now exploit in our MSRB organization to improve overall performance and/or fairness.
Further, the MSRB organization enables additional opportunities for DRAM intra-bank parallelism and selective early precharging of the LRU row-buffer to further improve memory access latencies. These two optimizations together provide an additional 5.9% performance improvement.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS'12, June 25-29, 2012, San Servolo Island, Venice, Italy. Copyright 2012 ACM /12/06...$

R Manikantan, Indian Institute of Science, rmani@csa.iisc.ernet.in
R Govindarajan, Indian Institute of Science, govind@serc.iisc.ernet.in

Categories and Subject Descriptors
C.1.2 [PROCESSOR ARCHITECTURES]: Multiple Data Stream Architectures (Multiprocessors)

Keywords
DRAM, Memory Performance, Multi-Core Architecture

1. INTRODUCTION
With the widening gap between processor and memory performance, the memory performance can impact the overall performance of a multicore system in a significant way. Further, the energy consumption of DRAM memory accounts for a non-trivial (greater than 30%) part of the system energy [10], [11]. The importance of energy efficiency and performance of DRAM is emphasized by current trends in high-performance computing which achieve performance scaling via multicores. In both server-class and personal computers, multicore configurations are becoming widespread, with several cores sharing DRAM memory that is accessed via one or more memory controllers. Thus a recent research focus has been on improving the performance and energy efficiency of DRAM design, which has traditionally been architected to address high density and low cost. A key component of DRAM that impacts both performance and energy is the row-buffer.
When an access request is made to a bank in DRAM, it fetches a large row of data (typically 8KB-16KB, across all devices accessed in parallel) into the row-buffer. This operation is known as row-activate. In the presence of spatial locality, future requests to the same row hit in the row-buffer. In these cases, the row-buffer provides the data, which reduces both access latency and the energy consumed as the row-activate operation is eliminated. A request to a different row in the same bank replaces the row-buffer with the contents of the new row, after the current row-buffer is written back (precharge). The spatial locality exploited in the row-buffer is greatly reduced when accesses from multiple cores get interleaved at the DRAM [12]. A number of memory access re-ordering schemes [1, 2, 3, 25, 4], varying in complexity from the simple FR-FCFS (First Ready-First Come First Served) [5] to the more complex reinforcement learning based approach [25], have been proposed to improve row-buffer hits. Further, address re-mapping in hardware or software [8] and memory access re-ordering schemes that are prefetch-aware have also been proposed [9]. While these schemes are effective in improving performance, they do not address the issue of energy reduction directly. Further, they require non-trivial modifications and complex scheduling policies to be implemented at the memory controller.

One significant contributor to the high energy consumption inside DRAMs is the frequent row Activate and Precharge operations to move data back and forth between the row-buffer and the DRAM core. Common methods proposed to reduce DRAM energy include smaller row-buffers [12], storage re-organization [13], and exploiting opportunities for power-down modes of operation [14]. However, these methods generally incur performance degradation and also introduce hardware complexity. For instance, smaller row-buffers reduce the energy consumption at the expense of lower hit rates and a small performance loss. Though there is a lack of spatial locality, multi-programmed workloads do exhibit considerable temporal locality among memory accesses at the level of DRAM pages/rows [15]. The observed temporal and spatial locality characteristics make a strong case for having multiple smaller row-buffers per bank instead of a single row-buffer. Such a configuration has been known to provide benefits in the context of Phase Change Memory [15]. However, such a row-buffer reorganization and its impact on performance and energy reduction has not been studied in the context of DRAMs. This is important as the challenge is to accomplish this reorganization under a fairly rigid JEDEC standard [17]. Further, the proposed organization enables a set of optimization opportunities which have not been explored thus far and are relevant specifically for DRAMs. Our first contribution is to propose a practical design to incorporate multiple sub-row buffers (MSRBs) in DRAMs with minimal changes to the existing DRAM specifications. A study of multiple narrow row-buffers in the context of DRAM shows that it can significantly improve both performance and energy in multi-cores. The performance gains (in terms of weighted speedup) over the baseline are 35.8%, 14.5% and 21% for quad, eight and sixteen cores respectively. This gain in performance is achieved along with an energy reduction of 42%, 28% and 31%. We refer to this organization as MSRB.
The necessary controls for implementing MSRB are incorporated at the memory controller. While this allows the DRAM design changes to be minimal with no changes to the pin interface, it also opens up opportunities for further optimizations to effectively utilize the row-buffers to achieve performance and/or fairness goals. This is because, unlike the DRAM, an on-chip memory controller can observe the behavior of various cores and hence can manage the allocation of row buffers to suit the observed behavior. We demonstrate two such buffer allocation strategies: Fairness Oriented Allocation and Performance Oriented Allocation. Fairness-oriented Allocation allocates dedicated row-buffers to cores that suffer the most interference, thereby improving both performance and fairness. This allocation scheme improves fairness by 43%. Performance-oriented Allocation takes into account the differing memory bandwidth requirements of the various cores and tries to allocate row buffers to cores in line with their demand. Results show that this scheme further improves performance. Our third contribution is to examine the additional optimizations enabled by our multiple row-buffer design. On a row-buffer miss, it is essential to write back the currently open row before the newly requested data can be brought into the row-buffer. This precharge latency is typically in the critical path of memory requests. In a multiple row-buffer configuration, it is possible to do an eager precharge of the least recently used row-buffer. This proactive approach, which we refer to as Early Precharge, hides the precharge latency that will be experienced by a future request that misses in the row-buffer. This optimization improves performance by an additional 1.8% over the MSRB design. A second optimization enabled by the new design is the ability to simultaneously service row-hit requests from one row-buffer while a different row-buffer is being activated or precharged. This introduces Intra-Bank Parallelism¹ and improves performance by an additional 4.7% over MSRB.
The Early Precharge and Intra-bank Parallelism optimizations taken together yield an additional performance improvement of 5.9%. Last, we compare our row-buffer reorganization with two best-in-class memory controller schedulers (Thread Cluster Memory (TCM) scheduling [1] and Parallelism Aware Batch Scheduling (PARBS) [2]) and demonstrate that our design produces far greater system throughput than what is achieved via scheduler optimizations alone. Further, we compare our results against a hypothetical DRAM device which is highly banked (a 32-bank and 256-columns-per-bank structure). We observe that our proposed MSRB organization is more effective, in terms of both performance and energy, than a highly banked DRAM with just one small row-buffer per bank.

2. BACKGROUND AND MOTIVATION
In this section, we provide the necessary background and the required motivation for our work.

2.1 Background
We consider the popular JEDEC-style ([17]) DRAM as the baseline architecture throughout the paper. DRAM devices are packaged as Dual In-line Memory Modules (DIMMs) which are typically interfaced to an on-chip memory controller (refer Figure 1). DIMMs contain one or more ranks. A rank is a collection of DRAM devices that operate in parallel. Each DRAM device typically serves up a few bits at the specified (row, column) location. Operating together, the devices in a rank match the data bus width. For example, an x16 device supplies 16 bits of data and 4 such devices making up a rank can supply data needed to match the 64-bit interface of the memory controller. Each device in a rank is organized into a number (4, 8 or 16) of logically independent banks. Each bank consists of multiple rows (also called pages) of data. Banks within the same rank can operate in parallel and this provides for some degree of memory level parallelism. Figure 1 shows a diagrammatic description of this organization. A typical DRAM read request has to first Activate the corresponding row by bringing the row data to the row-buffer.
This is followed by a column read/write that reads/writes the selected words from/to the row-buffer. Finally, Precharge writes back the row-buffer to the appropriate row. Precharge is needed even on rows that were only read, since row activation depletes the charge in the corresponding row in the DRAM device. Each bank is equipped with its own row buffer and logic to perform row Activate (termed RAS), column read/write (termed CAS) and Precharge (termed PRE) operations. An open page [5] policy delays precharging until just before the next row activate has to be performed, while the closed page policy eagerly precharges the row soon after the first read/write operation to this row. The memory controller is responsible for efficient scheduling of requests, as well as for the implementation of the DRAM access protocol. Each cycle, the controller selects a valid command by examining the set of pending requests and issues it to the memory.

2.2 Motivation
A typical DRAM bank is equipped with a large row buffer comprising 1024 to 2048 columns. This large buffer size is an artifact of the original use-model it was intended for, viz., exploiting spatial locality while serving requests from a single core processor. In the multicore scenario wherein requests from multiple workloads are

1 Note that this is in addition to bank-level, rank-level and controller-level parallelism supported by DRAM memory systems.

Figure 1: DRAM organization overview
Figure 4: Multiple Sub-Row Buffer Organization
Figure 2: Cumulative stack distance histogram of hits with large (1024 columns) row buffers
Figure 3: Cumulative stack distance histogram of hits with small (256 columns) row buffers

interleaved by the memory controller, there is insufficient reuse of the open row buffer. Further, modern programs have large working sets and even single-core programs could access multiple rows in succession in a bank. These scenarios favor the closed page policy (which precharges and closes the page immediately after the first use) as the de-facto policy, incurring high access time and energy by bringing in a large row for a single word or cacheline. The above observations indicate a potential benefit in using multiple small row-buffers for each bank in the DRAM. Figure 2 shows cumulative stack distance histograms of row hit rates achieved in the top 1, 2, and 4 (Most Recently Used) positions for a representative subset of SPEC 2006 CPU workloads with 1024-column wide row buffers, while Figure 3 shows the same data for 256-column wide row buffers. A detailed description of the experimental methodology, simulated configuration, and workloads is included in Section 5. The graph in Figure 2 shows that, on the average, the hit rate more than doubles as we go from one row-buffer to even just two buffers. This is especially observed in memory intensive programs, e.g. milc, gromacs, soplex, and calculix. Two (or more) row buffers appear to work significantly better at capturing temporal/spatial locality in programs. The graph in Figure 3 shows the crucial observation for our motivation: small buffers do nearly as well as their larger counterparts in exploiting spatial locality. On average, we observed that small buffers captured over 90% of the hits seen with larger ones. This study reveals that multiple small row-buffers per bank help improve temporal locality without significant loss in spatial locality. Incorporating this into DRAMs within the existing DRAM standards poses certain challenges.
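The stack-distance measurement described above can be sketched in a few lines; the following minimal Python model (function and trace names are ours, not from the paper) counts how many accesses would hit if a bank kept its k most recently used rows buffered:

```python
from collections import defaultdict

def topk_hit_rates(row_trace, ks=(1, 2, 4)):
    """For each k, compute the fraction of accesses whose row sits
    within the k most recently used rows of its bank (an LRU
    stack-distance model of multiple row-buffers)."""
    stacks = defaultdict(list)          # bank -> LRU stack of rows (MRU first)
    hits = {k: 0 for k in ks}
    total = 0
    for bank, row in row_trace:
        stack = stacks[bank]
        total += 1
        if row in stack:
            depth = stack.index(row)    # 0 == MRU position
            for k in ks:
                if depth < k:
                    hits[k] += 1
            stack.remove(row)
        stack.insert(0, row)            # promote to MRU
    return {k: hits[k] / total for k in ks}

# Two interleaved request streams ping-ponging between rows of one bank
# defeat a single buffer but are fully captured by two:
trace = [(0, 10), (0, 20)] * 8
rates = topk_hit_rates(trace)
assert rates[1] == 0.0 and rates[2] == rates[4] > 0.8
```

The toy trace illustrates the paper's observation that interleaving destroys depth-1 reuse while depth-2 reuse survives; real SPEC traces would replace the synthetic one.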
At the same time, the MSRB also offers a few additional opportunities such as implementing different row-buffer management policies for performance and fairness, and optimizations such as Early Precharging and Intra-Bank Parallel accesses. We address these topics in the following sections.

3. MULTIPLE SUB-ROW BUFFER ORGANIZATION
In this section, we describe our MSRB organization for DRAM. Specifically, we replace the one large row-buffer in each bank by four small row-buffers, each one-fourth the size of the original. This organization consists of three components, namely (i) Sub-row activation, (ii) Row-buffer selection, and (iii) Row-buffer allocation. We describe the first two components in detail below while the row-buffer allocation is deferred to Section 4. Figure 4 provides an overview of a DRAM bank organized to support multiple small row buffers.

3.1 Sub-Row Activation
In order to select the appropriate sub-row, the DRAM needs to have access to both the row address and a part of the column address. Once both are available, the DRAM logic decodes and activates both row_select_i and sub_row_select_j lines and fetches the selected columns to a row buffer. While row_select lines run across the entire length of the row, they need only participate in the sub-row decoding and as such have very little load on them. At the selected sub-row, the traditional wordline is activated to access the entire sub-row. Our implementation of sub-row activation is similar to that described in [12]. To perform sub-row activation, the DRAM needs both the row address (from the RAS command) and a few address bits from the column access command (CAS), since the sub-row selection is dependent on a few column address bits: 2 bits if the size of each sub-row is 1/4-th the size of the row. In a traditional DRAM interface, the RAS command is issued first and the CAS command follows it a few cycles later. This is done in order to multiplex row and column addresses onto the same set of pins.
In the JEDEC standard, a timing parameter termed tRCD specifies the gap between the two commands that has to be met [17], and is typically of the order of nano-seconds. With such a tRCD delay, a naive scheme which

waits for the column address would introduce intolerably high latency to sub-row activation. In order to avoid this, we discuss at least three alternatives that could get the sub-row select address bits to the DRAM without incurring the RAS-to-CAS delay:

- Expanding the address pins by additional sub-row-select pins would address the sub-row selection problem in a straightforward way. Typically, this is an additional 2 to 3 pins depending on the size of the sub-row relative to the full row. Though simple, given the slow growth in pin-count, we do not consider this a feasible option.

- Issue RAS and CAS commands in back-to-back cycles. The DRAM is expected to latch the addresses issued in these two commands and use them at appropriate times internally. This scheme is termed Posted-RAS in [12] and Posted-CAS in [24]. There is a 1-cycle delay incurred in this scheme. We chose not to use this scheme due to this 1-cycle latency addition to an already large DRAM access latency.

- Modern DRAMs support double-data-rate transfers on the data pins (hence termed DDR). That is, both the DRAM and the memory controller have the capability to transmit/receive data at twice the bus clock. One could extend this capability to address pins as well. A RAS command issued on the rising edge of the bus clock followed by the sub-row-select command issued on the falling edge of that bus clock serves to transfer all the necessary address bits in one clock cycle and thus incurs no latency in sub-row activation. We term this scheme Double-Address-Rate. This is the scheme we assume in our detailed simulations. A similar proposal for fast signalling of address bits appears in [26].

3.2 Row-Buffer Selection
The introduction of multiple row-buffers necessitates a few additional changes in the overall memory organization: (i) the memory controller needs to remember the sub-rows in each bank that are currently available in the row-buffers. This is required so that no RAS command is issued for a sub-row that is already available in a row-buffer.
(ii) When a new sub-row is activated, allocate a row buffer that will store the newly fetched sub-row data. (iii) Ensure precharge for the sub-row that is being replaced. Responsibilities (i) and (ii) are necessarily handled by the memory controller even in the case of a single row-buffer. We feel it is natural to let the memory controller handle these responsibilities even for multiple row-buffers. Further, as the controller maintains the book-keeping information regarding open sub-rows, responsibility (iii) can also be handled by the controller. This decision helps in keeping the DRAM logic simple. Essentially, the controller has to maintain cache-like metadata for these buffers: valid bits, row and sub-row tags, and recency bits. We observe that this decision of letting the memory controller manage row-buffer usage not only keeps DRAM logic simple but has several additional benefits, including:

- Enforcement of different row-buffer allocation policies (for instance, the controller could enforce specialized fairness oriented or performance oriented row-buffer allocation policies to suit the memory access characteristics of workloads)

- Holistic management of the pool of row-buffers available across all the DRAM banks in all the DRAM ranks.

In each of the three DRAM access operations (Activate, Precharge, Column Access), a row buffer is accessed. Therefore we look at each operation and discuss the signaling and timing issues involved in specifying row-buffer selection information:

- Activate: Row buffer selection is not in the timing critical path for this operation since the DRAM has to first activate the sub-row and start discharging column data onto bit lines. Thus, the row buffer selection should only be ready by the time bit lines have been driven from the storage cells. Thus a simple mechanism such as signaling the row buffer specification bits in the cycle following the RAS command suffices. We call this Posted-Buffer-Selection. This requires no additional pins. Alternatively, the buffer selection bits could be driven along with the sub-row selection bits using the Double-Address-Rate scheme.
- Precharge: In this case, we assume that the row-buffer selection is timing critical. For this operation, the controller needs to specify both the sub-row selection bits as well as the row-buffer selection bits. We propose to use the Double-Address-Rate scheme to accomplish this transfer without adding latency.

- Column Access: We assume that row-buffer selection is timing critical to column accesses as well and use the Double-Address-Rate mechanism to issue these bits quickly.

The Double-Address-Rate scheme thus takes care of signalling both the sub-row selection bits as well as the row-buffer selection bits without incurring additional latency.

3.3 Row-Buffer Allocation
Our MSRB organization requires the memory controller to decide which of the buffers to allocate for a new row activation. For our default configuration, we employ the commonly used Least Recently Used (LRU) policy in the memory controller to make this allocation decision. We defer a more detailed discussion of alternative allocation policies to Section 4.

3.4 Sources of Energy Reduction
Energy savings in MSRB are obtained as a combination of:

1. Reduction in the energy consumed by each Activate and Precharge operation due to smaller rows: Since fewer capacitors have to be charged and discharged, and fewer bits have to be latched in the sense amps, the energy consumed reduces.

2. Reduction in the number of Activate and Precharge operations due to fewer row misses: Multiple row-buffers offer higher data retentivity in the buffers, thereby reducing the number of times that rows are activated and precharged. Every additional row hit saves the energy that would have been expended in precharging one row and activating another.

Our scheme results in additional energy reduction due to the fact that more row hits lead to fewer total memory cycles, thereby saving additional background power.
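The controller-side bookkeeping of Section 3.2 (valid bits, row and sub-row tags, recency bits) and the default LRU allocation of Section 3.3 can be sketched as follows; this is a behavioral model with class and method names of our own choosing, not the controller's actual state machine:

```python
class BankBufferState:
    """Controller-side metadata for one bank's sub-row buffers:
    a (row, sub_row) tag per buffer (None == invalid) plus an
    MRU-first recency list. A sketch, not the paper's RTL."""
    def __init__(self, num_buffers=4):
        self.tags = [None] * num_buffers      # valid bit folded into None
        self.lru = list(range(num_buffers))   # MRU first, LRU last

    def lookup(self, row, sub_row):
        """Return the buffer index on a sub-row hit, else None.
        A hit means no RAS command needs to be issued."""
        tag = (row, sub_row)
        if tag in self.tags:
            idx = self.tags.index(tag)
            self.lru.remove(idx)
            self.lru.insert(0, idx)           # promote to MRU
            return idx
        return None

    def allocate(self, row, sub_row):
        """On a miss: victimize the LRU buffer (its current contents
        must be precharged before the new sub-row is activated)."""
        victim = self.lru.pop()               # LRU buffer
        self.tags[victim] = (row, sub_row)
        self.lru.insert(0, victim)
        return victim

bank = BankBufferState()
bank.allocate(5, 2)                      # activate row 5, sub-row 2
assert bank.lookup(5, 2) is not None     # later access: row-buffer hit
assert bank.lookup(5, 3) is None         # different sub-row: miss
```

With 4 buffers per bank and tags of a few bytes each, this structure stays within the under-16-bytes-per-bank overhead quoted in Section 3.5.2.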
3.5 Area Impact
Area overheads comprise the MSRB re-organization overhead in DRAM as well as the book-keeping overhead inside the memory controller, and we discuss each below.

3.5.1 DRAM Area Overhead
In the following discussion, we assume an MSRB organization comprising 4 small row-buffers per bank. Our estimate of the area

overhead in MSRB includes: additional decoders for sub-row selection, running additional sub-row selection lines (wires), additional decoders for row-buffer selection, additional multiplexers to control data routing between the selected sub-row & selected row-buffer, and additional wiring for data routing.

- Sub_row_select lines and AND gates: Since the size of each small row buffer is 1/4-th the size of the full row buffer, we need to run 4 sub_row_select lines and add (4 × number of rows) AND gates. Each gate adds about 6 additional transistors to the decode logic. As in [12], this was implemented using hierarchical word lines [28] and modeled analytically in CACTI [27]. The CACTI model is set up for exploring the highest density implementation, as is the case with commodity DRAMs. With these, we obtain an area overhead of 4.9% to support splitting each row into 4 sub-rows.

- Row-buffer selection demultiplexers, and buffer_select_n lines: Since each operation has to access one of the 4 available row-buffers, additional decode circuitry is added to decode two buffer selection bits and drive the appropriate buffer selection lines. While the buffers themselves are sense amps that have a much larger transistor size, the decoder logic transistors are of a smaller transistor size and thus do not significantly increase area. We lay out the 4 small buffers in a 2 × 2 configuration allowing for efficient wiring of buffer_select_n lines. Modeling these overheads in CACTI, we obtain an area overhead of 1.9%.

Note that in our design, the total storage capacity of the row-buffers (4 buffers of one-fourth the size compared to a single large buffer) does not increase. Thus our design is buffer-capacity-neutral. The total area overhead of the proposed re-organization is therefore 6.8% per DRAM bank.

3.5.2 Area Overhead in the Memory Controller
The memory controller has to maintain certain metadata (tags, valid bits, dirty bits, and recency bits) which constitutes the overhead. This overhead is less than 16 bytes for the 4 row-buffers in a bank. For a 4 GB RAM organized as 4 ranks and 8 banks, this overhead would be 32 banks × 16 bytes = 512 bytes.
For the baseline with one row buffer per bank, the memory controller still incurs one-fourth of this overhead (128 bytes) since it has to maintain this state information anyway. We consider the additional storage overhead negligible.

4. UNLOCKING PERFORMANCE, ENERGY AND FAIRNESS OPPORTUNITIES
Our MSRB organization opens up the design space for row-buffer allocation and management policies for allocation of these resources across the cores for performance and/or fairness benefits. As discussed in Section 3.3, the row-buffer allocation decision is done at the memory controller. While a detailed exploration of allocation policies is outside the scope of this paper, we present two simple schemes below to illustrate the flexibility that row-buffer reorganization facilitates. The first, Fairness oriented Buffer Allocation, improves fairness via judicious buffer allocation. The Performance oriented Buffer Allocation improves performance of programs with high miss rates. In addition, we discuss a pair of scheduling and hardware optimizations, namely Early Precharge and Intra-Bank Parallelism, that are enabled by MSRB. The net effect of improving performance (via increased row hits) is to also reduce energy consumption. All of these optimizations are implemented at the memory controller.

2 We note here that commercial DRAM implementations are highly optimized and the area estimates and overheads calculated using tools such as CACTI may differ from these. However, the commercial designs are proprietary and are (almost) never available for research studies (even if they are available, such data is seldom published due to business and other reasons).

4.1 Fairness oriented Buffer Allocation
Here, the intuition is that in a typical multicore workload, some cores stand to benefit a lot more from higher row-buffer hit rates than others. Since the interleaving of requests from multiple cores causes cores with lower arrival rates to suffer greater row-buffer misses, this allocation scheme counters the disproportionate increase in miss rate by allocating dedicated buffers for such cores.
The scheme works by maintaining, on a per-core basis, the actual row-buffer hit rate as well as an estimate of the hit rate had the core been running alone (referred to as the standalone hit rate). The standalone hit rate is obtained by keeping a shadow row buffer (one per core) in the memory controller which is updated only with the requests from the given core. The difference between the standalone hit rate and the shared (actual) hit rate provides an estimate of the loss suffered by the core due to interference. If this difference exceeds a threshold (in our case, it is set to 0.5), we classify the core as suffering unfairness. The scheduler then attempts to allocate dedicated buffers to cores suffering unfairness. For instance, if one of 4 cores in a quad-core configuration is unfairly suffering, then the scheduler dedicates one of the 4 row-buffers to this core, while the other 3 row-buffers are made available to all the cores. For each core c and bank b, the scheme computes the difference d = (standalone hit rate - shared hit rate). Higher values of this measure suggest higher benefits from dedicating buffers to such cores. At each bank, this metric is used to classify the core:

Type-1: Core with d ≥ threshold
Type-2: Core with d < threshold

Since the classification is done per-bank, this scheme can inherently self-adjust to variations in bank utilization. The same core could be classified Type-1 in one bank while classified Type-2 in another. This classification is done periodically so as to adapt to program behavior changes. The controller then allocates dedicated row-buffers for Type-1 cores. In our implementation, we chose the scheme below:

- Only 1 Type-1 core in the workload: the Type-1 core gets one dedicated buffer; all cores can access the remaining 3 buffers.
- Two Type-1 cores in the workload: each Type-1 core gets one dedicated buffer; all cores can access the remaining 2 buffers.
- Three or more Type-1 cores in the workload: the scheme defaults to LRU.
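The per-bank classification and buffer dedication described above can be sketched as follows; a minimal model with our own function signature and dict-based interfaces (the 0.5 threshold and the dedication rules are from the paper):

```python
def classify_and_dedicate(standalone_hit, shared_hit, num_buffers=4,
                          threshold=0.5):
    """Fairness-oriented allocation for one bank (a sketch).
    standalone_hit / shared_hit: per-core hit-rate estimates.
    Cores whose standalone-minus-shared gap is >= threshold are
    Type-1 and get a dedicated buffer; with three or more Type-1
    cores the scheme falls back to plain LRU over all buffers.
    Returns {core: list of buffer ids the core may use}."""
    cores = list(standalone_hit)
    type1 = [c for c in cores
             if standalone_hit[c] - shared_hit[c] >= threshold]
    if len(type1) >= 3:                             # default to LRU
        return {c: list(range(num_buffers)) for c in cores}
    shared = list(range(len(type1), num_buffers))   # buffers open to all
    alloc = {c: list(shared) for c in cores}
    for i, c in enumerate(type1):
        alloc[c] = [i] + alloc[c]                   # plus a dedicated buffer
    return alloc

# Core 0's gap is 0.6 >= 0.5, so buffer 0 becomes its dedicated buffer
# while buffers 1-3 remain shared by everyone:
alloc = classify_and_dedicate({0: 0.9, 1: 0.4, 2: 0.3, 3: 0.5},
                              {0: 0.3, 1: 0.3, 2: 0.2, 3: 0.4})
assert alloc[0] == [0, 1, 2, 3]
assert alloc[1] == [1, 2, 3]
```

When a core then activates a new row, the controller would pick the LRU buffer from that core's allowed list, as the paper describes.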
Whenever a core needs to activate a new row, the controller looks up the allocated buffers for that core and chooses the LRU buffer from amongst these to bring the new row into.

3 intrabp requires minor DRAM interface changes to permit multiple outstanding accesses to each bank.

4.2 Performance oriented Buffer Allocation
This scheme works by dynamically adapting buffer allocations to the most demanding cores. The scheme estimates the needs of each core periodically on a per-bank basis by taking into account the

number of memory requests (i.e., misses from the last-level cache) from each core, and the row-buffer miss rates suffered by each core at each bank. For each core c and bank b, it computes a rate product defined as: rate_product[c][b] = num_memory_requests[c][b] × row_miss_rate[c][b]. Higher values of this measure suggest higher benefits from improving their hit rates. At each bank, rate products are used to classify cores into one of two types:

Type-1: Core with rate_product ≥ threshold
Type-2: Core with rate_product < threshold

Several variations of this scheme are possible. We use a simple scheme which defaults to LRU allocation when the number of Type-1 workloads for a bank exceeds half the number of row-buffers per bank (2 in our case). Otherwise, the allocation scheme allows each Type-1 workload to have exclusive use of certain row-buffers (up to 2 in our case) and shared use of the remaining row-buffers by all workloads, similar to the previous scheme. For both the above schemes, the storage overhead in the memory controller is negligibly small (of the order of R × B × M × N bytes for R ranks, B banks per rank, M buffers per bank, and N cores).

4.3 Early Precharge Scheduling
Traditionally, memory controllers use either an open page or closed page policy for precharging rows. The policy essentially determines whether to precharge an open page eagerly (closed page) or lazily (open page). Eager policies work better in situations where it is highly likely that the next request would cause a row buffer eviction. Multiple row-buffers open up the possibility to precharge different row-buffers using different policies. We implemented selective early precharge scheduling wherein only the LRU row-buffer is precharged early while the rest of the row-buffers follow the open-page policy. The rationale for this is that the LRU row-buffer is most likely to be the candidate for eviction and is better off precharged early, while the other row-buffers are more likely to see additional row-hits and are therefore kept open.
Eagerly precharging the LRU buffer helps to reduce the latency of a subsequent row-buffer miss. In our implementation, the memory controller looks for idle cycles and inserts precharge operations for the LRU row-buffer in each bank. While this scheduling is orthogonal to the row-buffer allocation schemes described in the earlier section, we only implemented it over the baseline LRU policy for our experiments.

4.4 Intra-Bank Parallelism
Multiple row-buffers permit parallel operations within a bank⁴: column accesses on one row-buffer could occur in parallel with an activate or precharge operation on another. While each individual memory access follows the standard DRAM access protocol, this optimization (abbreviated intrabp) allows pipelining of operations at each bank. It improves the efficiency of data bus utilization by allowing us to issue column accesses faster. Inside each DRAM bank, the necessary circuitry to support this parallelism is already available. The memory scheduler can easily incorporate this enhancement into its scheduling of sub-commands. Figure 5 shows an example timing diagram indicating this parallelism. This helps to hide some of the activate and precharge latencies, and it effectively translates to higher bandwidth and lower latency. In particular, programs that have low bank-level parallelism benefit greatly from this feature since the scheduler is unable to keep multiple banks busy in parallel. Note that exploiting intrabp is not possible without multiple row-buffers.
4 This parallelism is in addition to the bank-level, rank-level and memory-controller-level parallelism present in DRAM memory systems.

Figure 5: Example of intra-bank parallelism

Processor: 3.2 GHz OOO, Alpha ISA
L1I Cache: 32kB private, 64-byte blocks, direct-mapped, 3-cycle hit latency
L1D Cache: 32kB private, 64-byte blocks, 2-way set-associative, 3-cycle hit latency
L2 Cache: for 1/4/8/16 cores: 1MB/4MB/8MB/16MB, 4-way/8-way/16-way/32-way, 32/128/256/512 MSHRs; 64-byte blocks, 15-cycle hit latency
Controller: on-chip; 64-bit interface to DRAM; 256-entry command queue; FR-FCFS scheduling [5], open-page policy; address interleaving: rank-bank-row-column; number of memory controllers for 1/4/8/16 cores: 1/1/2/4
DRAM: DDR3-1600H, BL=8, CL-nRCD-nRP=9-9-9; a rank comprises four 1GB x16 devices; each device has 8 banks, each bank 1024 columns per row

Table 1: CMP configuration

5. EXPERIMENTAL METHODOLOGY

5.1 Simulation Setup

We evaluate our design using the M5 simulator [18] integrated with a detailed in-house DRAM simulator. The DRAM simulator faithfully models both the memory controller and the DRAM with accurate timing. Each program in the workload is executed in fast-forward mode for 9 billion instructions, then in warm-up mode for 500 million instructions and, finally, in detailed cycle-accurate mode for 250 million instructions. Multi-core simulations are run until all the programs complete 250 million instructions. 5 As is the standard practice, programs that finish early continue to execute, but the performance of only the first 250 million instructions is considered for each core. The baseline machine configuration used in our studies is shown in Table 1. L2 is the last-level cache and is shared across all the cores. The baseline configuration (1×1024) has a single large row-buffer, 1024 columns wide. Our MSRB configuration (4×256) uses 4 narrow row-buffers, each 256 columns wide. MSRB is managed using an LRU buffer allocation policy unless specified otherwise. While the quad-core system has one memory controller, the eight- and sixteen-core systems have two and four memory controllers respectively.
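Table 1's rank-bank-row-column address interleaving amounts to slicing a physical block address into bit fields. A small sketch is below; the field widths are assumptions derived from the configuration in Table 1 (8 banks, 1024 columns), and the exact layout used by the simulator is not given in the paper.

```python
# Hypothetical rank-bank-row-column address decomposition, ordered from
# most- to least-significant bits. Field widths are illustrative
# assumptions (1 rank bit, 3 bank bits for 8 banks, 16 row bits,
# 10 column bits for 1024 columns).
FIELDS = [("rank", 1), ("bank", 3), ("row", 16), ("column", 10)]

def decode(addr):
    """Split a block-aligned physical address into DRAM coordinates."""
    out = {}
    shift = sum(width for _, width in FIELDS)
    for name, width in FIELDS:
        shift -= width
        out[name] = (addr >> shift) & ((1 << width) - 1)
    return out
```

With this mapping, consecutive blocks walk through the columns of one row first, which is what makes spatial locality in the access stream turn into row-buffer hits.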
5.2 Power Estimation

We estimate the impact of our proposed row-buffer reorganization on DRAM power consumption using Micron's Power Calculator spreadsheet [19]. The spreadsheet models the power consumption of a DRAM configuration by allowing the user to input DRAM configuration parameters and system usage values. We obtain the power requirements for both the baseline and the multiple row-buffer organization using this spreadsheet. For the baseline, we use the default settings provided for the -125E speed grade. The change to activate and precharge power is modeled by adjusting the value of IDD0, the foreground current that drives these operations. Using the equation given in [20], and conservatively estimating a reduction of IDD0 from 120mA to 95mA, we compute the activate and precharge power dissipation for the new organization. We also assume that column access power increases by 5% owing to the addition of row-buffer selection logic. In our studies, we separately estimate the power reductions coming from the two independent factors, namely smaller rows requiring less power and fewer activate/precharge operations.

5 Although we run only 250M instructions per core in cycle-accurate mode, our 4-, 8- and 16-core simulations each run for a total of 1 billion to 4 billion instructions in cycle-accurate mode.

Quad-Core Workloads: Q1:(462,459,470,433), Q2:(429,183,462,459), Q3:(181,435,197,473), Q4:(429,462,471,464), Q5:(470,437,187,300), Q6:(462,470,473,300), Q7:(459,464,183,433), Q8:(410,464,445,433), Q9:(462,459,445,410), Q10:(429,456,450,459), Q11:(181,186,300,177), Q12:(168,401,435,464)
Eight-Core Workloads: E1:(462,459,433,456,464,473,450,445), E2:(300,456,470,445,179,464,473,450), E3:(168,183,437,401,450,435,445,458), E4:(187,172,173,410,470,433,444,177), E5:(434,435,450,453,462,471,164,186), E6:(181,473,401,172,177,178,179,435), E7:(437,459,445,454,456,465,171,197), E8:(429,416,433,454,464,435,444,458)
Sixteen-Core Workloads:
S1:(462,459,433,179,183,473,450,445,444,470,429,171,168,172,435,458)
S2:(401,433,434,435,444,445,450,300,459,470,471,473,171,181,179,183)
S3:(178,177,168,172,173,187,191,410,429,434,462,473,465,458,464,445)
S4:(186,454,458,482,181,429,255,254,178,197,179,187,173,401,410,437)

Table 2: Workloads

Weighted Speedup (WS) = Σ_i (IPC_i^shared / IPC_i^alone)
Harmonic Speedup (HS) = N / Σ_i (IPC_i^alone / IPC_i^shared)
Minimum slowdown = min_i (IPC_i^alone / IPC_i^shared)
Maximum slowdown = max_i (IPC_i^alone / IPC_i^shared)
Fairness = Minimum slowdown / Maximum slowdown

Table 3: Performance and Fairness metrics
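The metrics in Table 3 can be computed directly from per-core IPC values measured alone and shared; a minimal sketch:

```python
# Compute the Table 3 metrics from per-core IPC values measured when each
# program runs alone vs. shared with the other cores.
def metrics(ipc_alone, ipc_shared):
    n = len(ipc_alone)
    speedups = [s / a for a, s in zip(ipc_alone, ipc_shared)]
    slowdowns = [a / s for a, s in zip(ipc_alone, ipc_shared)]
    ws = sum(speedups)                        # Weighted Speedup
    hs = n / sum(slowdowns)                   # Harmonic Speedup
    fairness = min(slowdowns) / max(slowdowns)
    return ws, hs, fairness
```

A workload is perfectly fair when every core slows down by the same factor (fairness = 1); fairness falls toward 0 as the most-penalized core's slowdown grows relative to the least-penalized one.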
The power numbers, along with the execution time, are used to compute the DRAM energy consumption. We do not model the additional power consumption in the memory controller, as the added logic there is marginal and relatively simple.

5.3 Workloads and Metrics

We use multi-programmed workloads comprising programs from the SPEC [23] 2000 and SPEC 2006 suites to evaluate our proposal. The workloads are typically a mix of programs with varying levels of memory intensity 6. The workload mixes used in our studies are presented in Table 2. We use weighted speedup [21] and harmonic speedup [22] to summarize performance. We report fairness using the ratio of minimum slowdown to maximum slowdown. These terms are defined in Table 3.

6 We use the L2 MPKI to identify memory-intensive programs.

6. RESULTS

In this section, we evaluate the impact of MSRB on system performance and the energy benefits 7 it provides. Further, we compare it with state-of-the-art memory access scheduling methods designed to improve row hit rate. A study of the performance improvements due to memory-controller-side buffer allocation is also presented.

6.1 Performance Benefits of MSRB

The performance of MSRB for the quad-core case is summarized in Figure 6. As can be seen from Figure 6(a), MSRB improves the weighted speedup by 35.8% over the baseline. The performance gain in terms of harmonic speedup, as shown in Figure 6(b), is 27.5%. All the workloads show improved performance with MSRB. This shows the importance of focusing on temporal locality (multiple row-buffers) at the cost of spatial locality (narrow row-buffers). MSRB achieves a significantly higher row-buffer hit rate of 0.6 (Figure 6(c)) compared to the 0.2 observed in the baseline case. The observed gains in performance are typically in line with the improvement in row-buffer hit rate. Figure 7 shows the performance improvement in terms of weighted speedup for 8- and 16-core workloads. MSRB provides 14.5% and 21.6% improvement in performance for 8- and 16-core workloads respectively over the baseline.
An interesting case study is the row-buffer hit rates experienced by the individual programs in workload E1. Figure 7(b) shows the hit rates experienced by the individual programs for the baseline and MSRB. It can be seen that the row-buffer hit rate improves with MSRB for all the individual programs. Last, MSRB improved the performance in terms of IPC of single-core SPEC2000 and SPEC2006 workloads by 7.1% on average (refer to Figure 10 for the IPC values obtained for a subset of benchmarks). Observe that programs such as 459.GemsFDTD and 462.libquantum show considerably high gains (148% and 21%).

6.2 Energy Benefits of MSRB

Improved row hits not only boost performance but also translate into energy savings due to the reduced number of activations and precharges. The smaller size of the row-buffers also reduces the energy required for activate and precharge operations. The energy consumption is computed using the methodology described in Section 5.2. Figure 8 shows the DRAM energy gains provided by MSRB for the quad-core workloads. On average, MSRB reduces the energy requirements by 40% compared to the baseline. It is interesting to note that high energy gains are obtained not only for workloads showing high performance gains but also for the others, owing to the narrow row-buffers used. Further, as expected, improved hit rate translates not only into high performance gains but also into significant savings in energy consumed. The total activate power reduced on average from 357mW to 127mW. For 8-core and 16-core workloads, the energy gains are 28% and 31% respectively over the baseline.

6.3 Comparison with Memory-Access Scheduling Schemes

In this section, we compare MSRB with two state-of-the-art memory access scheduling methods, PARallelism-aware Batch Scheduling (PARBS) [2] and Thread Cluster Memory scheduling (TCM) [1]. We used the same values as in the original papers for the various threshold parameters associated with these schemes. We also compare our scheme to a hypothetical 32-bank DRAM configuration (32-Bank).
This is a configuration with a high number of banks, with each bank having a 256-column-wide row-buffer. 32-Bank has the same number of row-buffers as our MSRB configuration, which has 8 DRAM banks and 4 row-buffers per bank. In addition, as our scheme is orthogonal to memory-access scheduling schemes that improve row-hit rates, it is possible to have multiple row-buffers in TCM and PARBS. We refer to these configurations of TCM and PARBS enhanced with MSRB (four 256-column-wide row-buffers per bank) as TCM+MSRB and PARBS+MSRB respectively.

7 All energy numbers reported in this paper refer to DRAM energy.

Figure 6: Quad-Core Performance and Row-Buffer Hit Rates — (a) Weighted Speedup, (b) Harmonic Speedup, (c) Row-Buffer Hit Rate
Figure 7: Eight and Sixteen Core Performance — (a) Weighted Speedup, (b) Row-Buffer Hit Rate in E1
Figure 8: DRAM Energy Consumption
Figure 9: Performance Comparison with Other Scheduling Schemes — (a) Normalized Weighted Speedup, (b) Row-Buffer Hit Rate

Figure 9(a) shows the performance in terms of weighted speedup (normalized to the baseline) for PARBS, TCM, 32-Bank, MSRB, PARBS+MSRB and TCM+MSRB for quad-core workloads. For all the workloads, it can be seen that MSRB performs better than PARBS and TCM. On average, PARBS and TCM provide gains of only 8.5% and 10.5% over the baseline, while MSRB improves performance by 35.8%. The interesting thing to note is that MSRB can greatly aid the performance of TCM and PARBS, as can be seen from the significantly better speedups experienced by the TCM+MSRB and PARBS+MSRB schemes. The observed trend in performance is also reflected in the row-buffer hit rates exhibited by the various schemes. As can be seen from Figure 9(b), MSRB is more effective at improving row-buffer hit rates than PARBS and TCM applied on top of a single large row-buffer. It is interesting to note that the 32-Bank configuration gives only a 5.9% improvement in performance (weighted speedup) over the baseline. This is primarily because increasing the number of banks can exploit only a fraction of the temporal locality, after which the limitation of having only one row-buffer per bank shows up.
6.4 Sensitivity Study

We reorganized the baseline row-buffer into MSRB in our study and it outperformed all other configurations. But are there other MSRB configurations that can yield even better performance? For example, how do configurations with a different number or width of buffers perform? As row-buffer sizes below 256 columns resulted in noticeable losses even in the case of single cores, we do not evaluate them in detail. Figure 11 shows the row-buffer hit rates experienced by quad-core workloads for the MSRB configurations studied: an alternative configuration provides a row hit rate of 0.48, while our 4×256 configuration achieves 0.6. In terms of weighted speedup, it provided a gain of 34.5% over the baseline configuration. We also simulated an eight-core system with one memory controller. In the case of a single memory controller, the performance gains provided by MSRB are further enhanced, resulting in a performance improvement of 16% in terms of weighted speedup over the baseline.

Figure 12: Performance Impact of intraBP and EarlyPrecharge

6.5 Benefits of Early Precharge and Intra-Bank Parallelism

Figure 12 shows the performance in terms of weighted speedup (normalized to the baseline) for MSRB, EarlyPrecharge, IntraBP, and MSRB with both EarlyPrecharge and IntraBP for quad-core workloads. Also included is the performance of the baseline with a closed-page policy; this scheme is equivalent to EarlyPrecharge with a single row-buffer per bank. It can be observed that EarlyPrecharge with MSRB provides a gain of 40% over the baseline on average. Enabling EarlyPrecharge improves performance by 30% in workload Q9 compared to MSRB. EarlyPrecharge sacrifices some of the row hits to reduce the latency of a future row-buffer miss. In cases where the reduced latency for a row-buffer miss is not sufficient to offset the loss of row-buffer hits, as in Q5, EarlyPrecharge shows a drop in performance compared to MSRB. In summary, EarlyPrecharge yields an additional performance improvement of 1.8% on top of MSRB. IntraBP has a positive effect on every workload, as is to be expected. By utilizing data-bus cycles more efficiently, it achieves an average additional improvement of 4.7% on top of MSRB. While we do not present detailed results here due to lack of space, the average latency of each memory access reduces by 19% due to the increased parallelism.
When EarlyPrecharge and IntraBP are both enabled, we observe an average additional improvement of 5.9%. It may also be observed that while EarlyPrecharge fared poorly on workload Q5, the combined optimization restores it to the baseline. Similarly, in workload Q9, EarlyPrecharge gains significantly and that gain is retained in the combined optimization.

6.6 Impact of Fairness-oriented and Performance-oriented Allocation

Figure 13(a) plots the fairness of the baseline, MSRB, and MSRB with Fairness-oriented Allocation (MSRB+Fair) for quad-core workloads. As observed, MSRB improves fairness over the baseline by 20%, while MSRB+Fair improves fairness by an average of 43%. While detailed results are not included here, we observed that in several mixes, latency-sensitive cores that suffered significant unfairness got a boost from being allocated dedicated buffers. Performance-oriented Allocation ensures more row-buffers for memory-intensive programs. Figure 13(b) plots the weighted speedup for the baseline, MSRB, and MSRB with Performance-oriented Allocation (MSRB+Perf) for quad-core workloads. MSRB+Perf improves performance by 40.9% over the baseline. This corresponds to an additional improvement of 1.9% over MSRB. In workloads with more memory-intensive benchmarks, MSRB+Perf can improve performance by as much as 25%. Figure 13(c) shows the IPCs of the individual programs in workload Q9 with MSRB as well as MSRB+Perf. It can be seen that MSRB+Perf improves the performance of the memory-intensive programs 459.GemsFDTD and 462.libquantum without affecting the performance of programs like 445.gobmk and 410.bwaves, which are relatively less memory intensive.

7. RELATED WORK

There is a large body of work on intelligent memory scheduling to improve row-buffer locality and bank-level parallelism ([1], [2], [3], [4], [9]). These methods generally require fairly sophisticated tracking of memory access patterns in the memory controller to drive scheduling decisions.
Work on page coloring [6] and address-mapping techniques ([8], [7]) attempts to redistribute pages so as to improve performance ([8]) or reduce power ([6], [14]). Our proposed MSRB organization is orthogonal to all of the above schemes and can complement them. Work on phase-change memories in [15] discusses multiple row buffers as a mechanism to render PCMs a viable alternative to DRAMs. Though conceptually similar, we propose a practical implementation of this scheme in the context of widely used DRAM memories without requiring major changes to the rigid JEDEC standard. Further, we illustrate other optimization opportunities enabled by the MSRB organization. The work in [16] explores the benefit of building a more full-fledged SRAM cache in front of the DRAM array to catch more accesses in the cache. However, it necessitates a significant logic addition to the density-optimized DRAM design. Similarly, VCM memory [30], introduced briefly in the 90s by NEC, added a set of buffers shared across all the DRAM modules and introduced the notion of foreground and background operations. However, it introduced significant changes to the DRAM access standard, and the issue of buffer management was not systematically addressed. Smaller row-buffers for energy efficiency have received recent attention: smaller row-buffers result in reduced energy consumption with minimal impact on performance [12]. Mini-Rank [13], MC-DIMM [29] and Adaptive-Granularity [26] propose alternate data storage organizations to reduce energy consumption while attempting to maintain performance. In contrast, our MSRB organization achieves performance improvement along with energy reduction and fairness improvements.

8. CONCLUSIONS

In this paper we have proposed a row-buffer reorganization for DRAMs which offers significant energy reduction while simultaneously improving performance. These dual benefits make the proposed DRAM architecture an attractive solution for today's DRAM energy and performance issues. We discuss a feasible implementation of this architecture with minimal impact on existing DRAM standards.
Our implementation opens up new optimization opportunities such as Intra-Bank Parallelism and selective Early Precharge. Further, with MSRB, different row-buffer allocation schemes can be used to implement performance and fairness policies. We illustrated this flexibility using a pair of schemes: Fairness-oriented Allocation and Performance-oriented Allocation.

Figure 10: Single-Core IPC
Figure 11: Sensitivity Results for MSRB
Figure 13: Fairness and Performance improvements via buffer allocation schemes — (a) Fairness, (b) Weighted Speedup, (c) Workload Q9

MSRB showed a performance improvement of 35.8% for quad-core workloads. This improvement was accompanied by an energy reduction of 43% in the DRAM. In comparison, the state-of-the-art memory access scheduling schemes TCM [1] and PARBS [2] were able to improve the baseline performance by only 10.5% and 8.5%. Further, the additional performance optimizations, namely Early Precharging and Intra-Bank Parallelism, improved system performance by an additional 5.9%.

9. REFERENCES
[1] Y. Kim, M. Papamichael, O. Mutlu and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In MICRO.
[2] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA.
[3] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO.
[4] H. Zheng, J. Lin, Z. Zhang, Z. Zhu. Memory Access Scheduling Schemes for Systems with Multi-Core Processors. In ICPP.
[5] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. In ISCA.
[6] K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi, R. Balasubramonian and A. Davis. Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement. In ASPLOS.
[7] M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, and A. Davis. Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers. In PACT.
[8] Z. Zhang, Z. Zhu and X. Zhang. A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality. In MICRO.
[9] C. J. Lee, O. Mutlu, V. Narasiman and Y. N. Patt. Prefetch-Aware DRAM Controllers. In MICRO.
[10] L. Barroso and U. Holzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool.
[11] D. Meisner, B. Gold, and T. Wenisch. PowerNap: Eliminating Server Idle Power. In ASPLOS.
[12] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis, and N. P. Jouppi. Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores. In ISCA.
[13] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu. Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency. In MICRO.
[14] H. Huang, K. G. Shin, C. Lefurgy and T. Keller. Improving Energy Efficiency by Making DRAM Less Randomly Accessed. In ISLPED.
[15] B. C. Lee, E. Ipek, O. Mutlu and D. Burger. Architecting Phase Change Memory as a Scalable DRAM Alternative. In ISCA.
[16] W. Wong and J-L. Baer. DRAM Caching. Dept. of CS and Engg., University of Washington, Tech Report UW-CSE.
[17] The JEDEC consortium.
[18] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 Simulator: Modeling Networked Systems. In IEEE Micro.
[19] Micron. misc/ddr3_power_calc.xls.
[20] Micron. Calculating Memory System Power for DDR3. TN41_01DDR3%20Power.pdf.
[21] A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Processor. In ASPLOS 2000.
[22] K. Luo, J. Gummaraju, and M. Franklin. Balancing Throughput and Fairness in SMT Processors. In ISPASS 2001.
[23] The SPEC Consortium.
[24] O. La. SDRAM having posted CAS function of JEDEC standard. United States Patent.
[25] E. Ipek, O. Mutlu, J. F. Martinez, and R. Caruana. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. In ISCA.
[26] D. H. Yoon, M. K. Jeong, and M. Erez. Adaptive Granularity Memory Systems: A Tradeoff between Storage Efficiency and Throughput. In ISCA.
[27] S. Thoziyoor, N. Muralimanohar, and N. Jouppi. CACTI 5.0. Technical report, HP Laboratories.
[28] K. Itoh. VLSI Memory Chip Design. Springer.
[29] J. H. Ahn et al. Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs. In IEEE Computer Architecture Letters.
[30] S. Rixner. Memory Controller Optimizations for Web Servers. In IEEE Micro 2004.


AADL : about scheduling analysis AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution Dynamc Voltage Scalng of Supply and Body Bas Explotng Software Runtme Dstrbuton Sungpack Hong EE Department Stanford Unversty Sungjoo Yoo, Byeong Bn, Kyu-Myung Cho, Soo-Kwan Eo Samsung Electroncs Taehwan

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011 9/8/2 2 Outlne Appendx C: The Bascs of Logc Desgn TDT4255 Computer Desgn Case Study: TDT4255 Communcaton Module Lecture 2 Magnus Jahre 3 4 Dgtal Systems C.2: Gates, Truth Tables and Logc Equatons All sgnals

More information

3. CR parameters and Multi-Objective Fitness Function

3. CR parameters and Multi-Objective Fitness Function 3 CR parameters and Mult-objectve Ftness Functon 41 3. CR parameters and Mult-Objectve Ftness Functon 3.1. Introducton Cogntve rados dynamcally confgure the wreless communcaton system, whch takes beneft

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Memory Modeling in ESL-RTL Equivalence Checking

Memory Modeling in ESL-RTL Equivalence Checking 11.4 Memory Modelng n ESL-RTL Equvalence Checkng Alfred Koelbl 2025 NW Cornelus Pass Rd. Hllsboro, OR 97124 koelbl@synopsys.com Jerry R. Burch 2025 NW Cornelus Pass Rd. Hllsboro, OR 97124 burch@synopsys.com

More information

A Hybrid Genetic Algorithm for Routing Optimization in IP Networks Utilizing Bandwidth and Delay Metrics

A Hybrid Genetic Algorithm for Routing Optimization in IP Networks Utilizing Bandwidth and Delay Metrics A Hybrd Genetc Algorthm for Routng Optmzaton n IP Networks Utlzng Bandwdth and Delay Metrcs Anton Redl Insttute of Communcaton Networks, Munch Unversty of Technology, Arcsstr. 21, 80290 Munch, Germany

More information

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach Dstrbuted Resource Schedulng n Grd Computng Usng Fuzzy Approach Shahram Amn, Mohammad Ahmad Computer Engneerng Department Islamc Azad Unversty branch Mahallat, Iran Islamc Azad Unversty branch khomen,

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management.

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management. //7 Prnceton Unversty Computer Scence 7: Introducton to Programmng Systems Goals of ths Lecture Storage Management Help you learn about: Localty and cachng Typcal storage herarchy Vrtual memory How the

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Video Proxy System for a Large-scale VOD System (DINA)

Video Proxy System for a Large-scale VOD System (DINA) Vdeo Proxy System for a Large-scale VOD System (DINA) KWUN-CHUNG CHAN #, KWOK-WAI CHEUNG *# #Department of Informaton Engneerng *Centre of Innovaton and Technology The Chnese Unversty of Hong Kong SHATIN,

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems Relablty and Energy-aware Cache Reconfguraton for Embedded Systems Yuanwen Huang and Prabhat Mshra Department of Computer and Informaton Scence and Engneerng Unversty of Florda, Ganesvlle FL 326-62, USA

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Convolutional interleaver for unequal error protection of turbo codes

Convolutional interleaver for unequal error protection of turbo codes Convolutonal nterleaver for unequal error protecton of turbo codes Sna Vaf, Tadeusz Wysock, Ian Burnett Unversty of Wollongong, SW 2522, Australa E-mal:{sv39,wysock,an_burnett}@uow.edu.au Abstract: Ths

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

Efficient Broadcast Disks Program Construction in Asymmetric Communication Environments

Efficient Broadcast Disks Program Construction in Asymmetric Communication Environments Effcent Broadcast Dsks Program Constructon n Asymmetrc Communcaton Envronments Eleftheros Takas, Stefanos Ougaroglou, Petros copoltds Department of Informatcs, Arstotle Unversty of Thessalonk Box 888,

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems

FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems FIRM: Far and Hgh-Performance Memory Control for Persstent Memory Systems Jshen Zhao, Onur Mutlu, Yuan Xe Pennsylvana State Unversty, Carnege Mellon Unversty, Unversty of Calforna, Santa Barbara, Hewlett-Packard

More information

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems An Investgaton nto Server Parameter Selecton for Herarchcal Fxed Prorty Pre-emptve Systems R.I. Davs and A. Burns Real-Tme Systems Research Group, Department of omputer Scence, Unversty of York, YO10 5DD,

More information

MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices

MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices MQSm: A Framework for Enablng Realstc Studes of Modern Mult-Queue SSD Devces Arash Tavakkol, Juan Gómez-Luna, and Mohammad Sadrosadat, ETH Zürch; Saugata Ghose, Carnege Mellon Unversty; Onur Mutlu, ETH

More information

A fair buffer allocation scheme

A fair buffer allocation scheme A far buffer allocaton scheme Juha Henanen and Kalev Klkk Telecom Fnland P.O. Box 228, SF-330 Tampere, Fnland E-mal: juha.henanen@tele.f Abstract An approprate servce for data traffc n ATM networks requres

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems:

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems: Speed/RAP/CODA Presented by Octav Chpara Real-tme Systems Many wreless sensor network applcatons requre real-tme support Survellance and trackng Border patrol Fre fghtng Real-tme systems: Hard real-tme:

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.15 No.10, October 2015 1 Evaluaton of an Enhanced Scheme for Hgh-level Nested Network Moblty Mohammed Babker Al Mohammed, Asha Hassan.

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

Energy-Efficient Workload Placement in Enterprise Datacenters

Energy-Efficient Workload Placement in Enterprise Datacenters COVER FEATURE CLOUD COMPUTING Energy-Effcent Workload Placement n Enterprse Datacenters Quan Zhang and Wesong Sh, Wayne State Unversty Power loss from an unnterruptble power supply can account for 15 percent

More information

Cache Sharing Management for Performance Fairness in Chip Multiprocessors

Cache Sharing Management for Performance Fairness in Chip Multiprocessors Cache Sharng Management for Performance Farness n Chp Multprocessors Xng Zhou Wenguang Chen Wemn Zheng Dept. of Computer Scence and Technology Tsnghua Unversty, Bejng, Chna zhoux07@mals.tsnghua.edu.cn,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations* Confguraton Management n Mult-Context Reconfgurable Systems for Smultaneous Performance and Power Optmzatons* Rafael Maestre, Mlagros Fernandez Departamento de Arqutectura de Computadores y Automátca Unversdad

More information

THere are increasing interests and use of mobile ad hoc

THere are increasing interests and use of mobile ad hoc 1 Adaptve Schedulng n MIMO-based Heterogeneous Ad hoc Networks Shan Chu, Xn Wang Member, IEEE, and Yuanyuan Yang Fellow, IEEE. Abstract The demands for data rate and transmsson relablty constantly ncrease

More information

WIRELESS communication technology has gained widespread

WIRELESS communication technology has gained widespread 616 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 4, NO. 6, NOVEMBER/DECEMBER 2005 Dstrbuted Far Schedulng n a Wreless LAN Ntn Vadya, Senor Member, IEEE, Anurag Dugar, Seema Gupta, and Paramvr Bahl, Senor

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

CHAPTER 4 PARALLEL PREFIX ADDER

CHAPTER 4 PARALLEL PREFIX ADDER 93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information

3D vector computer graphics

3D vector computer graphics 3D vector computer graphcs Paolo Varagnolo: freelance engneer Padova Aprl 2016 Prvate Practce ----------------------------------- 1. Introducton Vector 3D model representaton n computer graphcs requres

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements Explct Formulas and Effcent Algorthm for Moment Computaton of Coupled RC Trees wth Lumped and Dstrbuted Elements Qngan Yu and Ernest S.Kuh Electroncs Research Lab. Unv. of Calforna at Berkeley Berkeley

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

Advanced Computer Networks

Advanced Computer Networks Char of Network Archtectures and Servces Department of Informatcs Techncal Unversty of Munch Note: Durng the attendance check a stcker contanng a unque QR code wll be put on ths exam. Ths QR code contans

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information