Multiple Sub-Row Buffers in DRAM: Unlocking Performance and Energy Improvement Opportunities

Nagendra Gulur, Texas Instruments (India), nagendra@ti.com
Mahesh Mehendale, Texas Instruments (India), m-mehendale@ti.com

ABSTRACT
The twin demands of energy-efficiency and higher performance on DRAM are highly emphasized in multicore architectures. A variety of schemes have been proposed to address either the latency or the energy consumption of DRAMs. These schemes typically require non-trivial hardware changes and end up improving latency at the cost of energy or vice-versa. One specific DRAM performance problem in multicores is that interleaved accesses from different cores can potentially degrade row-buffer locality. In this paper, based on the temporal and spatial locality characteristics of memory accesses, we propose a reorganization of the existing single large row-buffer in a DRAM bank into multiple sub-row buffers (MSRB). This reorganization not only improves row hit rates, and hence the average memory latency, but also brings down the energy consumed by the DRAM. The first major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves weighted speedup by 35.8%, 14.5% and 21.6% in quad, eight and sixteen core workloads, along with a 42%, 28% and 31% reduction in DRAM energy. The proposed MSRB organization enables opportunities for the management of multiple row-buffers at the memory controller level. As the memory controller is aware of the behaviour of individual cores, it allows us to implement coordinated buffer allocation schemes for different cores that take into account program behaviour. We demonstrate two such schemes, namely Fairness Oriented Allocation and Performance Oriented Allocation, which show the flexibility that memory controllers can now exploit in our MSRB organization to improve overall performance and/or fairness.
Further, the MSRB organization enables additional opportunities for DRAM intra-bank parallelism and selective early precharging of the LRU row-buffer to further improve memory access latencies. These two optimizations together provide an additional 5.9% performance improvement.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS'12, June 25-29, 2012, San Servolo Island, Venice, Italy. Copyright 2012 ACM /12/06...$

R Manikantan, Indian Institute of Science, rmani@csa.iisc.ernet.in
R Govindarajan, Indian Institute of Science, govind@serc.iisc.ernet.in

Categories and Subject Descriptors
C.1.2 [PROCESSOR ARCHITECTURES]: Multiple Data Stream Architectures (Multiprocessors)

Keywords
DRAM, Memory Performance, Multi-Core Architecture

1. INTRODUCTION
With the widening gap between processor and memory performance, the memory performance can impact the overall performance of a multicore system in a significant way. Further, the energy consumption of DRAM memory accounts for a non-trivial (greater than 30%) part of the system energy [10], [11]. The importance of energy efficiency and performance of DRAM is emphasized by current trends in high-performance computing which achieve performance scaling via multicores. In both server-class and personal computers, multicore configurations are becoming widespread, with several cores sharing DRAM memory that is accessed via one or more memory controllers. Thus a recent research focus has been on improving the performance and energy efficiency of DRAM design, which has traditionally been architected to address high density and low cost. A key component of DRAM that impacts both performance and energy is the row-buffer.
When an access request is made to a bank in DRAM, it fetches a large row of data (typically 8KB-16KB, across all devices accessed in parallel) into the row-buffer. This operation is known as row-activate. In the presence of spatial locality, future requests to the same row hit in the row-buffer. In these cases, the row-buffer provides the data, which reduces both access latency and the energy consumed as the row-activate operation is eliminated. A request to a different row in the same bank replaces the row-buffer with the contents of the new row, after the current row-buffer is written back (precharge). The spatial locality exploited in the row-buffer is greatly reduced when accesses from multiple cores get interleaved at the DRAM [12]. A number of memory access re-ordering schemes [1, 2, 3, 25, 4], varying in complexity from the simple FR-FCFS (First Ready-First Come First Served) [5] to the more complex reinforcement learning based approach [25], have been proposed to improve row-buffer hits. Further, address re-mapping in hardware or software [8] and memory access re-ordering schemes that are prefetch-aware have also been proposed [9]. While these schemes are effective in improving performance, they do not address the issue of energy reduction directly. Further, they require non-trivial modifications and complex scheduling policies to be implemented at the memory controller.

One significant contributor to the high energy consumption inside DRAMs is the frequent row Activate and Precharge operations to move data back and forth between the row-buffer and the DRAM core. Common methods proposed to reduce DRAM energy include smaller row-buffers [12], storage re-organization [13], and exploiting opportunities for power-down modes of operation [14]. However, these methods generally incur performance degradation and also introduce hardware complexity. For instance, smaller row-buffers reduce the energy consumption at the expense of lower hit rates and a small performance loss. Though there is a lack of spatial locality, multi-programmed workloads do exhibit considerable temporal locality among memory accesses at the level of DRAM pages/rows [15]. The observed temporal and spatial locality characteristics make a strong case for having multiple smaller row-buffers per bank instead of a single row-buffer. Such a configuration has been known to provide benefits in the context of Phase Change Memory [15]. However, such a row-buffer reorganization and its impact on performance and energy reduction has not been studied in the context of DRAMs. This is important as the challenge is to accomplish this reorganization under a fairly rigid JEDEC standard [17]. Further, the proposed organization enables a set of optimization opportunities which have not been explored thus far and are relevant specifically for DRAMs. Our first contribution is to propose a practical design to incorporate multiple sub-row buffers (MSRBs) in DRAMs with minimal changes to the existing DRAM specifications. A study of multiple narrow row-buffers in the context of DRAM shows that it can significantly improve both performance and energy in multi-cores. The performance gains (in terms of weighted speedup) over the baseline are 35.8%, 14.5% and 21% for quad, eight and sixteen cores respectively. This gain in performance is achieved along with an energy reduction of 42%, 28% and 31%. We refer to this organization as MSRB.
The necessary controls for implementing MSRB are incorporated at the memory controller. While this allows the DRAM design changes to be minimal with no changes to the pin interface, it also opens up opportunities for further optimizations to effectively utilize the row-buffers to achieve performance and/or fairness goals. This is because, unlike the DRAM, an on-chip memory controller can observe the behavior of various cores and hence can manage the allocation of row buffers to suit the observed behavior. We demonstrate two such buffer allocation strategies: Fairness Oriented Allocation and Performance Oriented Allocation. Fairness-oriented Allocation allocates dedicated row-buffers to cores that suffer the most interference, thereby improving both performance and fairness. This allocation scheme improves fairness by 43%. Performance-oriented Allocation takes into account the differing memory bandwidth requirements of the various cores and tries to allocate row buffers to cores in line with their demand. Results show that this scheme further improves performance. Our third contribution is to examine the additional optimizations enabled by our multiple row-buffer design. On a row-buffer miss, it is essential to write back the currently open row before the newly requested data can be brought into the row-buffer. This precharge latency is typically in the critical path of memory requests. In a multiple row-buffer configuration, it is possible to do an eager precharge of the least recently used row-buffer. This proactive approach, which we refer to as Early Precharge, hides the precharge latency that will be experienced by a future request that misses in the row-buffer. This optimization improves performance by an additional 1.8% over the MSRB design. A second optimization enabled by the new design is the ability to simultaneously service row-hit requests from one row-buffer while a different row-buffer is being activated or precharged. This introduces Intra-Bank Parallelism¹ and improves performance by an additional 4.7% over MSRB.
The Early Precharge and Intra-bank Parallelism optimizations taken together yield an additional performance improvement of 5.9%. Last, we compare our row-buffer reorganization with two best-in-class memory controller schedulers (Thread Cluster Memory (TCM) scheduling [1] and Parallelism Aware Batch Scheduling (PARBS) [2]) and demonstrate that our design produces far greater system throughput than what is achieved via scheduler optimizations alone. Further, we compare our results against a hypothetical DRAM device which is highly banked (a 32-bank and 256-columns-per-bank structure). We observe that our proposed MSRB organization is more effective, in terms of both performance and energy, than a highly banked DRAM with just one small row-buffer per bank.

2. BACKGROUND AND MOTIVATION
In this section, we provide the necessary background and the required motivation for our work.

2.1 Background
We consider the popular JEDEC-style ([17]) DRAM as the baseline architecture throughout the paper. DRAM devices are packaged as Dual In-line Memory Modules (DIMMs) which are typically interfaced to an on-chip memory controller (refer Figure 1). DIMMs contain one or more ranks. A rank is a collection of DRAM devices that operate in parallel. Each DRAM device typically serves up a few bits at the specified (row, column) location. Operating together, the devices in a rank match the data bus width. For example, an x16 device supplies 16 bits of data and 4 such devices making up a rank can supply data needed to match the 64-bit interface of the memory controller. Each device in a rank is organized into a number (4, 8 or 16) of logically independent banks. Each bank consists of multiple rows (also called pages) of data. Banks within the same rank can operate in parallel and this provides for some degree of memory level parallelism. Figure 1 shows a diagrammatic description of this organization. A typical DRAM read request has to first Activate the corresponding row by bringing the row data to the row-buffer.
This is followed by a column read/write that reads/writes the selected words from/to the row-buffer. Finally, Precharge writes back the row-buffer to the appropriate row. Precharge is needed even on rows that were only read, since row activation depletes the charge in the corresponding row in the DRAM device. Each bank is equipped with its own row buffer and logic to perform row Activate (termed RAS), column read/write (termed CAS) and Precharge (termed PRE) operations. An open page [5] policy delays precharging until just before the next row activate has to be performed, while the closed page policy eagerly precharges the row soon after the first read/write operation to this row. The memory controller is responsible for efficient scheduling of requests, as well as for the implementation of the DRAM access protocol. Each cycle, the controller selects a valid command by examining the set of pending requests and issues it to the memory.

2.2 Motivation
A typical DRAM bank is equipped with a large row buffer comprising 1024 to 2048 columns. This large buffer size is an artifact of the original use-model it was intended for, viz., exploiting spatial locality while serving requests from a single core processor. In the multicore scenario wherein requests from multiple workloads are

1 Note that this is in addition to bank-level, rank-level and controller-level parallelism supported by DRAM memory systems.

Figure 1: DRAM organization overview
Figure 4: Multiple Sub-Row Buffer Organization
Figure 2: Cumulative stack distance histogram of hits with large (1024 columns) row buffers
Figure 3: Cumulative stack distance histogram of hits with small (256 columns) row buffers

interleaved by the memory controller, there is insufficient reuse of the open row buffer. Further, modern programs have large working sets and even single-core programs could access multiple rows in succession in a bank. These scenarios favor the closed page policy (which precharges and closes the page immediately after the first use) as the de-facto policy, incurring high access time and energy by bringing in a large row for a single word or cacheline. The above observations indicate a potential benefit in using multiple small row-buffers for each bank in the DRAM. Figure 2 shows cumulative stack distance histograms of row hit rates achieved in the top 1, 2, and 4 (Most Recently Used) positions for a representative subset of SPEC 2006 CPU workloads with 1024-column wide row buffers, while Figure 3 shows the same data for 256-column wide row buffers. A detailed description of the experimental methodology, simulated configuration, and workloads is included in Section 5. The graph in Figure 2 shows that, on the average, the hit rate more than doubles as we go from one row-buffer to even just two buffers. This is especially observed in memory intensive programs, e.g. milc, gromacs, soplex, and calculix. Two (or more) row buffers appear to work significantly better at capturing temporal/spatial locality in programs. The graph in Figure 3 shows the crucial observation for our motivation: small buffers do nearly as well as their larger counterparts in exploiting spatial locality. On average, we observed that small buffers captured over 90% of the hits seen with larger ones. This study reveals that multiple small row-buffers per bank help improve temporal locality without significant loss in spatial locality. Incorporating this into DRAMs within the existing DRAM standards poses certain challenges.
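The stack-distance measurement described above can be sketched in a few lines; the following minimal Python model (function and trace names are ours, not from the paper) counts how many accesses would hit if a bank kept its k most recently used rows buffered:

```python
from collections import defaultdict

def topk_hit_rates(row_trace, ks=(1, 2, 4)):
    """For each k, compute the fraction of accesses whose row sits
    within the k most recently used rows of its bank (an LRU
    stack-distance model of multiple row-buffers)."""
    stacks = defaultdict(list)          # bank -> LRU stack of rows (MRU first)
    hits = {k: 0 for k in ks}
    total = 0
    for bank, row in row_trace:
        stack = stacks[bank]
        total += 1
        if row in stack:
            depth = stack.index(row)    # 0 == MRU position
            for k in ks:
                if depth < k:
                    hits[k] += 1
            stack.remove(row)
        stack.insert(0, row)            # promote to MRU
    return {k: hits[k] / total for k in ks}

# Two interleaved request streams ping-ponging between rows of one bank
# defeat a single buffer but are fully captured by two:
trace = [(0, 10), (0, 20)] * 8
rates = topk_hit_rates(trace)
assert rates[1] == 0.0 and rates[2] == rates[4] > 0.8
```

The toy trace illustrates the paper's observation that interleaving destroys depth-1 reuse while depth-2 reuse survives; real SPEC traces would replace the synthetic one.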
At the same time, the MSRB also offers a few additional opportunities such as implementing different row-buffer management policies for performance and fairness, and optimizations such as Early Precharging and Intra-Bank Parallel accesses. We address these topics in the following sections.

3. MULTIPLE SUB-ROW BUFFER ORGANIZATION
In this section, we describe our MSRB organization for DRAM. Specifically, we replace the one large row-buffer in each bank by four small row-buffers, each one-fourth the size of the original. This organization consists of three components, namely (i) Sub-row activation, (ii) Row-buffer selection, and (iii) Row-buffer allocation. We describe the first two components in detail below while the row-buffer allocation is deferred to Section 4. Figure 4 provides an overview of a DRAM bank organized to support multiple small row buffers.

3.1 Sub-Row Activation
In order to select the appropriate sub-row, the DRAM needs to have access to both the row address and a part of the column address. Once both are available, the DRAM logic decodes and activates both row_select_i and sub_row_select_j lines and fetches the selected columns to a row buffer. While row_select lines run across the entire length of the row, they need only participate in the sub-row decoding and as such have very little load on them. At the selected sub-row, the traditional wordline is activated to access the entire sub-row. Our implementation of sub-row activation is similar to that described in [12]. To perform sub-row activation, the DRAM needs both the row address (from the RAS command) and a few address bits from the column access command (CAS), since the sub-row selection is dependent on a few column address bits: 2 bits if the size of each sub-row is 1/4-th the size of the row. In a traditional DRAM interface, the RAS command is issued first and the CAS command follows it a few cycles later. This is done in order to multiplex row and column addresses onto the same set of pins.
In the JEDEC standard, a timing parameter termed tRCD specifies the gap between the two commands that has to be met [17], and is typically of the order of nano-seconds. With such a tRCD delay, a naive scheme which

waits for the column address would introduce intolerably high latency to sub-row activation. In order to avoid this, we discuss at least three alternatives that could get the sub-row select address bits to the DRAM without incurring the RAS-to-CAS delay:

- Expanding the address pins by additional sub-row-select pins would address the sub-row selection problem in a straightforward way. Typically, this is an additional 2 to 3 pins depending on the size of the sub-row relative to the full row. Though simple, given the slow growth in pin-count, we do not consider this a feasible option.

- Issue RAS and CAS commands in back-to-back cycles. The DRAM is expected to latch the addresses issued in these two commands and use them at appropriate times internally. This scheme is termed Posted-RAS in [12] and Posted-CAS in [24]. There is a 1-cycle delay incurred in this scheme. We chose not to use this scheme due to this 1-cycle latency addition to an already large DRAM access latency.

- Modern DRAMs support double-data-rate transfers on the data pins (hence termed DDR). That is, both the DRAM and the memory controller have the capability to transmit/receive data at twice the bus clock. One could extend this capability to address pins as well. A RAS command issued on the rising edge of the bus clock followed by the sub-row-select command issued on the falling edge of that bus clock serves to transfer all the necessary address bits in one clock cycle and thus incurs no latency in sub-row activation. We term this scheme Double-Address-Rate. This is the scheme we assume in our detailed simulations. A similar proposal for fast signalling of address bits appears in [26].

3.2 Row-Buffer Selection
The introduction of multiple row-buffers necessitates a few additional changes in the overall memory organization: (i) the memory controller needs to remember the sub-rows in each bank that are currently available in the row-buffers. This is required so that no RAS command is issued for a sub-row that is already available in a row-buffer.
(ii) When a new sub-row is activated, allocate a row buffer that will store the newly fetched sub-row data. (iii) Ensure precharge for the sub-row that is being replaced. Responsibilities (i) and (ii) are necessarily handled by the memory controller even in the case of a single row-buffer. We feel it is natural to let the memory controller handle these responsibilities even for multiple row-buffers. Further, as the controller maintains the book-keeping information regarding open sub-rows, responsibility (iii) can also be handled by the controller. This decision helps in keeping the DRAM logic simple. Essentially, the controller has to maintain cache-like metadata for these buffers: valid bits, row and sub-row tags, and recency bits. We observe that this decision of letting the memory controller manage row-buffer usage not only keeps DRAM logic simple but has several additional benefits, including:

- Enforcement of different row-buffer allocation policies (for instance, the controller could enforce specialized fairness oriented or performance oriented row-buffer allocation policies to suit the memory access characteristics of workloads)

- Holistic management of the pool of row-buffers available across all the DRAM banks in all the DRAM ranks.

In each of the three DRAM access operations (Activate, Precharge, Column Access), a row buffer is accessed. Therefore we look at each operation and discuss the signaling and timing issues involved in specifying row-buffer selection information:

- Activate: Row buffer selection is not in the timing critical path for this operation since the DRAM has to first activate the sub-row and start discharging column data onto bit lines. Thus, the row buffer selection should only be ready by the time bit lines have been driven from the storage cells. Thus a simple mechanism such as signaling the row buffer specification bits in the cycle following the RAS command suffices. We call this Posted-Buffer-Selection. This requires no additional pins. Alternatively, the buffer selection bits could be driven along with the sub-row selection bits using the Double-Address-Rate scheme.
- Precharge: In this case, we assume that the row-buffer selection is timing critical. For this operation, the controller needs to specify both the sub-row selection bits as well as the row-buffer selection bits. We propose to use the Double-Address-Rate scheme to accomplish this transfer without adding latency.

- Column Access: We assume that row-buffer selection is timing critical to column accesses as well and use the Double-Address-Rate mechanism to issue these bits quickly.

The Double-Address-Rate scheme thus takes care of signalling both the sub-row selection bits as well as the row-buffer selection bits without incurring additional latency.

3.3 Row-Buffer Allocation
Our MSRB organization requires the memory controller to decide which of the buffers to allocate for a new row activation. For our default configuration, we employ the commonly used Least Recently Used (LRU) policy in the memory controller to make this allocation decision. We defer a more detailed discussion of alternative allocation policies to Section 4.

3.4 Sources of Energy Reduction
Energy savings in MSRB are obtained as a combination of:

1. Reduction in the energy consumed by each Activate and Precharge operation due to smaller rows: Since fewer capacitors have to be charged and discharged, and fewer bits have to be latched in the sense amps, the energy consumed reduces.

2. Reduction in the number of Activate and Precharge operations due to fewer row misses: Multiple row-buffers offer higher data retentivity in the buffers, thereby reducing the number of times that rows are activated and precharged. Every additional row hit saves the energy that would have been expended in precharging one row and activating another.

Our scheme results in additional energy reduction due to the fact that more row hits lead to fewer total memory cycles, thereby saving additional background power.
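The controller-side bookkeeping of Section 3.2 (valid bits, row and sub-row tags, recency bits) and the default LRU allocation of Section 3.3 can be sketched as follows; this is a behavioral model with class and method names of our own choosing, not the controller's actual state machine:

```python
class BankBufferState:
    """Controller-side metadata for one bank's sub-row buffers:
    a (row, sub_row) tag per buffer (None == invalid) plus an
    MRU-first recency list. A sketch, not the paper's RTL."""
    def __init__(self, num_buffers=4):
        self.tags = [None] * num_buffers      # valid bit folded into None
        self.lru = list(range(num_buffers))   # MRU first, LRU last

    def lookup(self, row, sub_row):
        """Return the buffer index on a sub-row hit, else None.
        A hit means no RAS command needs to be issued."""
        tag = (row, sub_row)
        if tag in self.tags:
            idx = self.tags.index(tag)
            self.lru.remove(idx)
            self.lru.insert(0, idx)           # promote to MRU
            return idx
        return None

    def allocate(self, row, sub_row):
        """On a miss: victimize the LRU buffer (its current contents
        must be precharged before the new sub-row is activated)."""
        victim = self.lru.pop()               # LRU buffer
        self.tags[victim] = (row, sub_row)
        self.lru.insert(0, victim)
        return victim

bank = BankBufferState()
bank.allocate(5, 2)                      # activate row 5, sub-row 2
assert bank.lookup(5, 2) is not None     # later access: row-buffer hit
assert bank.lookup(5, 3) is None         # different sub-row: miss
```

With 4 buffers per bank and tags of a few bytes each, this structure stays within the under-16-bytes-per-bank overhead quoted in Section 3.5.2.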
3.5 Area Impact
Area overheads comprise the MSRB re-organization overhead in DRAM as well as the book-keeping overhead inside the memory controller, and we discuss each below.

3.5.1 DRAM Area Overhead
In the following discussion, we assume an MSRB organization comprising 4 small row-buffers per bank. Our estimate of the area

overhead in MSRB includes: additional decoders for sub-row selection, running additional sub-row selection lines (wires), additional decoders for row-buffer selection, additional multiplexers to control data routing between the selected sub-row & selected row-buffer, and additional wiring for data routing.

- Sub_row_select lines and AND gates: Since the size of each small row buffer is 1/4-th the size of the full row buffer, we need to run 4 sub_row_select lines and add (4 × number of rows) AND gates. Each gate adds about 6 additional transistors to the decode logic. As in [12], this was implemented using hierarchical word lines [28] and modeled analytically in CACTI [27]. The CACTI model is set up for exploring the highest density implementation, as is the case with commodity DRAMs. With these, we obtain an area overhead of 4.9% to support splitting each row into 4 sub-rows.

- Row-buffer selection demultiplexers, and buffer_select_n lines: Since each operation has to access one of the 4 available row-buffers, additional decode circuitry is added to decode two buffer selection bits and drive the appropriate buffer selection lines. While the buffers themselves are sense amps that have a much larger transistor size, the decoder logic transistors are of a smaller transistor size and thus do not significantly increase area. We lay out the 4 small buffers in a 2 × 2 configuration allowing for efficient wiring of buffer_select_n lines. Modeling these overheads in CACTI, we obtain an area overhead of 1.9%.

Note that in our design, the total storage capacity of the row-buffers (4 buffers of one-fourth the size compared to a single large buffer) does not increase. Thus our design is buffer-capacity-neutral. The total area overhead of the proposed re-organization is therefore 6.8% per DRAM bank.

3.5.2 Area Overhead in the Memory Controller
The memory controller has to maintain certain metadata (tags, valid bits, dirty bits, and recency bits) which constitutes the overhead. This overhead is less than 16 bytes for the 4 row-buffers in a bank. For a 4 GB RAM organized as 4 ranks and 8 banks, this overhead would be 32 banks × 16 bytes = 512 bytes.
For the baseline with one row buffer per bank, the memory controller still incurs one-fourth of this overhead (128 bytes) since it has to maintain this state information anyway. We consider the additional storage overhead negligible.

4. UNLOCKING PERFORMANCE, ENERGY AND FAIRNESS OPPORTUNITIES
Our MSRB organization opens up the design space for row-buffer allocation and management policies for allocation of these resources across the cores for performance and/or fairness benefits. As discussed in Section 3.3, the row-buffer allocation decision is done at the memory controller. While a detailed exploration of allocation policies is outside the scope of this paper, we present two simple schemes below to illustrate the flexibility that row-buffer reorganization facilitates. The first, Fairness oriented Buffer Allocation, improves fairness via judicious buffer allocation. The Performance oriented Buffer Allocation improves performance of programs with high miss rates. In addition, we discuss a pair of scheduling and hardware optimizations, namely Early Precharge and Intra-Bank Parallelism, that are enabled by MSRB. The net effect of improving performance (via increased row hits) is to also reduce energy consumption. All of these optimizations are implemented at the memory controller.

2 We note here that commercial DRAM implementations are highly optimized and the area estimates and overheads calculated using tools such as CACTI may differ from these. However, the commercial designs are proprietary and are (almost) never available for research studies (even if they are available, such data is seldom published due to business and other reasons).

4.1 Fairness oriented Buffer Allocation
Here, the intuition is that in a typical multicore workload, some cores stand to benefit a lot more from higher row-buffer hit rates than others. Since the interleaving of requests from multiple cores causes cores with lower arrival rates to suffer greater row-buffer misses, this allocation scheme counters the disproportionate increase in miss rate by allocating dedicated buffers for such cores.
The scheme works by maintaining, on a per-core basis, the actual row-buffer hit rate as well as an estimate of the hit rate had the core been running alone (referred to as the standalone hit rate). The standalone hit rate is obtained by keeping a shadow row buffer (one per core) in the memory controller which is updated only with the requests from the given core. The difference between the standalone hit rate and the shared (actual) hit rate provides an estimate of the loss suffered by the core due to interference. If this difference exceeds a threshold (in our case, it is set to 0.5), we classify the core as suffering unfairness. The scheduler then attempts to allocate dedicated buffers to cores suffering unfairness. For instance, if one of 4 cores in a quad-core configuration is unfairly suffering, then the scheduler dedicates one of the 4 row-buffers to this core, while the other 3 row-buffers are made available to all the cores. For each core c and bank b, the scheme computes the difference d = (standalone hit rate - shared hit rate). Higher values of this measure suggest higher benefits from dedicating buffers to such cores. At each bank, this metric is used to classify the core:

Type-1: Core with d ≥ threshold
Type-2: Core with d < threshold

Since the classification is done per-bank, this scheme can inherently self-adjust to variations in bank utilization. The same core could be classified Type-1 in one bank while classified Type-2 in another. This classification is done periodically so as to adapt to program behavior changes. The controller then allocates dedicated row-buffers for Type-1 cores. In our implementation, we chose the scheme below:

- Only 1 Type-1 core in the workload: the Type-1 core gets one dedicated buffer; all cores can access the remaining 3 buffers.
- Two Type-1 cores in the workload: each Type-1 core gets one dedicated buffer; all cores can access the remaining 2 buffers.
- Three or more Type-1 cores in the workload: the scheme defaults to LRU.
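The per-bank classification and buffer dedication described above can be sketched as follows; a minimal model with our own function signature and dict-based interfaces (the 0.5 threshold and the dedication rules are from the paper):

```python
def classify_and_dedicate(standalone_hit, shared_hit, num_buffers=4,
                          threshold=0.5):
    """Fairness-oriented allocation for one bank (a sketch).
    standalone_hit / shared_hit: per-core hit-rate estimates.
    Cores whose standalone-minus-shared gap is >= threshold are
    Type-1 and get a dedicated buffer; with three or more Type-1
    cores the scheme falls back to plain LRU over all buffers.
    Returns {core: list of buffer ids the core may use}."""
    cores = list(standalone_hit)
    type1 = [c for c in cores
             if standalone_hit[c] - shared_hit[c] >= threshold]
    if len(type1) >= 3:                             # default to LRU
        return {c: list(range(num_buffers)) for c in cores}
    shared = list(range(len(type1), num_buffers))   # buffers open to all
    alloc = {c: list(shared) for c in cores}
    for i, c in enumerate(type1):
        alloc[c] = [i] + alloc[c]                   # plus a dedicated buffer
    return alloc

# Core 0's gap is 0.6 >= 0.5, so buffer 0 becomes its dedicated buffer
# while buffers 1-3 remain shared by everyone:
alloc = classify_and_dedicate({0: 0.9, 1: 0.4, 2: 0.3, 3: 0.5},
                              {0: 0.3, 1: 0.3, 2: 0.2, 3: 0.4})
assert alloc[0] == [0, 1, 2, 3]
assert alloc[1] == [1, 2, 3]
```

When a core then activates a new row, the controller would pick the LRU buffer from that core's allowed list, as the paper describes.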
Whenever a core needs to activate a new row, the controller looks up the allocated buffers for that core and chooses the LRU buffer from amongst these to bring the new row into.

3 intrabp requires minor DRAM interface changes to permit multiple outstanding accesses to each bank.

4.2 Performance oriented Buffer Allocation
This scheme works by dynamically adapting buffer allocations to the most demanding cores. The scheme estimates the needs of each core periodically on a per-bank basis by taking into account the

number of memory requests (i.e., misses from the last-level cache) from each core, and the row-buffer miss rates suffered by each core at each bank. For each core c and bank b, it computes a rate product defined as: rate_product[c][b] = num_memory_requests[c][b] × row_miss_rate[c][b]. Higher values of this measure suggest higher benefits from improving their hit rates. At each bank, rate products are used to classify cores into one of two types:

Type-1: Core with rate_product ≥ threshold
Type-2: Core with rate_product < threshold

Several variations of this scheme are possible. We use a simple scheme which defaults to LRU allocation when the number of Type-1 workloads for a bank exceeds half the number of row-buffers per bank (2 in our case). Otherwise, the allocation scheme allows each Type-1 workload to have exclusive use of certain row-buffers (up to 2 in our case) and shared use of the remaining row-buffers by all workloads, similar to the previous scheme. For both the above schemes, the storage overhead in the memory controller is negligibly small (of the order of R × B × M × N bytes for R ranks, B banks per rank, M buffers per bank, and N cores).

4.3 Early Precharge Scheduling
Traditionally, memory controllers use either an open page or closed page policy for precharging rows. The policy essentially determines whether to precharge an open page eagerly (closed page) or lazily (open page). Eager policies work better in situations where it is highly likely that the next request would cause a row buffer eviction. Multiple row-buffers open up the possibility to precharge different row-buffers using different policies. We implemented selective early precharge scheduling wherein only the LRU row-buffer is precharged early while the rest of the row-buffers follow the open-page policy. The rationale for this is that the LRU row-buffer is most likely to be the candidate for eviction and is better off precharged early, while the other row-buffers are more likely to see additional row-hits and are therefore kept open.
Eagerly precharging the LRU buffer helps to reduce the latency of a subsequent row-buffer miss. In our implementation, the memory controller looks for idle cycles and inserts precharge operations for the LRU row-buffer in each bank. While this scheduling is orthogonal to the row-buffer allocation schemes described in the earlier section, we only implemented it over the baseline LRU policy for our experiments.

4.4 Intra-Bank Parallelism
Multiple row-buffers permit parallel operations within a bank⁴: column accesses on one row-buffer could occur in parallel with an activate or precharge operation on another. While each individual memory access follows the standard DRAM access protocol, this optimization (abbreviated intrabp) allows pipelining of operations at each bank. It improves the efficiency of data bus utilization by allowing us to issue column accesses faster. Inside each DRAM bank, the necessary circuitry to support this parallelism is already available. The memory scheduler can easily incorporate this enhancement into its scheduling of sub-commands. Figure 5 shows an example timing diagram indicating this parallelism. This helps to hide some of the activate and precharge latencies, and it effectively translates to higher bandwidth and lower latency. In particular, programs that have low bank-level parallelism benefit greatly from this feature since the scheduler is unable to keep multiple banks busy in parallel. Note that exploiting intrabp is not possible without multiple row-buffers.
4 This parallelism is in addition to the bank-level, rank-level and memory-controller-level parallelism present in DRAM memory systems.

Figure 5: Example of intra-bank parallelism

Processor: 3.2 GHz OOO, Alpha ISA
L1I Cache: 32kB private, 64-byte blocks, direct-mapped, 3-cycle hit latency
L1D Cache: 32kB private, 64-byte blocks, 2-way set-associative, 3-cycle hit latency
L2 Cache: for 1/4/8/16 cores: 1MB/4MB/8MB/16MB, 4-way/8-way/16-way/32-way, 32/128/256/512 MSHRs; 64-byte blocks, 15-cycle hit latency
Controller: on-chip; 64-bit interface to DRAM; 256-entry command queue; FR-FCFS scheduling [5], open-page policy; address interleaving: rank-bank-row-column; number of memory controllers for 1/4/8/16 cores: 1/1/2/4
DRAM: DDR3-1600H, BL=8, CL-nRCD-nRP=9-9-9; a rank comprises four 1GB x16 devices; each device has 8 banks, each bank 1024 columns per row

Table 1: CMP configuration

5. EXPERIMENTAL METHODOLOGY

5.1 Simulation Setup

We evaluate our design using the M5 simulator [18] integrated with a detailed in-house DRAM simulator. The DRAM simulator faithfully models both the memory controller and the DRAM with accurate timing. Each program in the workload is executed in fast-forward mode for 9 billion instructions, then in warm-up mode for 500 million instructions and, finally, in detailed cycle-accurate mode for 250 million instructions. Multi-core simulations are run until all the programs complete 250 million instructions. 5 As is the standard practice, programs that finish early continue to execute, but the performance of only the first 250 million instructions is considered for each core. The baseline machine configuration used in our studies is shown in Table 1. L2 is the last-level cache and is shared across all the cores. The baseline configuration (1×1024) has a single large row-buffer, 1024 columns wide. Our MSRB configuration (4×256) uses 4 narrow row-buffers, each 256 columns wide. MSRB is managed using an LRU buffer allocation policy unless specified otherwise. While the quad-core system has one memory controller, the eight- and sixteen-core systems have two and four memory controllers respectively.
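Table 1's rank-bank-row-column address interleaving amounts to slicing a physical block address into bit fields. A small sketch is below; the field widths are assumptions derived from the configuration in Table 1 (8 banks, 1024 columns), and the exact layout used by the simulator is not given in the paper.

```python
# Hypothetical rank-bank-row-column address decomposition, ordered from
# most- to least-significant bits. Field widths are illustrative
# assumptions (1 rank bit, 3 bank bits for 8 banks, 16 row bits,
# 10 column bits for 1024 columns).
FIELDS = [("rank", 1), ("bank", 3), ("row", 16), ("column", 10)]

def decode(addr):
    """Split a block-aligned physical address into DRAM coordinates."""
    out = {}
    shift = sum(width for _, width in FIELDS)
    for name, width in FIELDS:
        shift -= width
        out[name] = (addr >> shift) & ((1 << width) - 1)
    return out
```

With this mapping, consecutive blocks walk through the columns of one row first, which is what makes spatial locality in the access stream turn into row-buffer hits.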
5.2 Power Estimation

We estimate the impact of our proposed row-buffer reorganization on DRAM power consumption using Micron's Power Calculator spreadsheet [19]. The spreadsheet models the power consumption of a DRAM configuration by allowing the user to input DRAM configuration parameters and system usage values. We obtain the power requirements for both the baseline and the multiple row-buffer organization using this spreadsheet. For the baseline, we use the default settings provided for the -125E speed grade. The change to activate and precharge power is modeled by adjusting the value of IDD0, the foreground current that drives these operations. Using the equation given in [20], and conservatively estimating a reduction of IDD0 from 120mA to 95mA, we compute the activate and precharge power dissipation for the new organization. We also assume that column access power increases by 5% owing to the addition of row-buffer selection logic. In our studies, we separately estimate the power reductions coming from the two independent factors, namely smaller rows requiring less power and fewer activate/precharge operations.

5 Although we run only 250M instructions per core in cycle-accurate mode, our 4-, 8- and 16-core simulations each run for a total of 1 billion to 4 billion instructions in cycle-accurate mode.

Quad-Core Workloads: Q1:(462,459,470,433), Q2:(429,183,462,459), Q3:(181,435,197,473), Q4:(429,462,471,464), Q5:(470,437,187,300), Q6:(462,470,473,300), Q7:(459,464,183,433), Q8:(410,464,445,433), Q9:(462,459,445,410), Q10:(429,456,450,459), Q11:(181,186,300,177), Q12:(168,401,435,464)
Eight-Core Workloads: E1:(462,459,433,456,464,473,450,445), E2:(300,456,470,445,179,464,473,450), E3:(168,183,437,401,450,435,445,458), E4:(187,172,173,410,470,433,444,177), E5:(434,435,450,453,462,471,164,186), E6:(181,473,401,172,177,178,179,435), E7:(437,459,445,454,456,465,171,197), E8:(429,416,433,454,464,435,444,458)
Sixteen-Core Workloads:
S1:(462,459,433,179,183,473,450,445,444,470,429,171,168,172,435,458)
S2:(401,433,434,435,444,445,450,300,459,470,471,473,171,181,179,183)
S3:(178,177,168,172,173,187,191,410,429,434,462,473,465,458,464,445)
S4:(186,454,458,482,181,429,255,254,178,197,179,187,173,401,410,437)

Table 2: Workloads

Weighted Speedup (WS) = Σ_i (IPC_i^shared / IPC_i^alone)
Harmonic Speedup (HS) = N / Σ_i (IPC_i^alone / IPC_i^shared)
Minimum slowdown = min_i (IPC_i^alone / IPC_i^shared)
Maximum slowdown = max_i (IPC_i^alone / IPC_i^shared)
Fairness = Minimum slowdown / Maximum slowdown

Table 3: Performance and Fairness metrics
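The metrics in Table 3 can be computed directly from per-core IPC values measured alone and shared; a minimal sketch:

```python
# Compute the Table 3 metrics from per-core IPC values measured when each
# program runs alone vs. shared with the other cores.
def metrics(ipc_alone, ipc_shared):
    n = len(ipc_alone)
    speedups = [s / a for a, s in zip(ipc_alone, ipc_shared)]
    slowdowns = [a / s for a, s in zip(ipc_alone, ipc_shared)]
    ws = sum(speedups)                        # Weighted Speedup
    hs = n / sum(slowdowns)                   # Harmonic Speedup
    fairness = min(slowdowns) / max(slowdowns)
    return ws, hs, fairness
```

A workload is perfectly fair when every core slows down by the same factor (fairness = 1); fairness falls toward 0 as the most-penalized core's slowdown grows relative to the least-penalized one.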
The power numbers, along with the execution time, are used to compute the DRAM energy consumption. We do not model the additional power consumption in the memory controller, as the added logic there is marginal and relatively simple.

5.3 Workloads and Metrics

We use multi-programmed workloads comprising programs from the SPEC [23] 2000 and SPEC 2006 suites to evaluate our proposal. The workloads are typically a mix of programs with varying levels of memory intensity 6. The workload mixes used in our studies are presented in Table 2. We use weighted speedup [21] and harmonic speedup [22] to summarize performance. We report fairness using the ratio of minimum slowdown to maximum slowdown. These terms are defined in Table 3.

6 We use the L2 MPKI to identify memory-intensive programs.

6. RESULTS

In this section, we evaluate the impact of MSRB on system performance and the energy benefits 7 it provides. Further, we compare it with state-of-the-art memory access scheduling methods designed to improve row hit rate. A study of the performance improvements due to memory-controller-side buffer allocation is also presented.

6.1 Performance Benefits of MSRB

The performance of MSRB for the quad-core case is summarized in Figure 6. As can be seen from Figure 6(a), MSRB improves the weighted speedup by 35.8% over the baseline. The performance gain in terms of harmonic speedup, as shown in Figure 6(b), is 27.5%. All the workloads show improved performance with MSRB. This shows the importance of focusing on temporal locality (multiple row-buffers) at the cost of spatial locality (narrow row-buffers). MSRB achieves a significantly higher row-buffer hit rate of 0.6 (Figure 6(c)) compared to the 0.2 observed in the baseline case. The observed gains in performance are typically in line with the improvement in row-buffer hit rate. Figure 7 shows the performance improvement in terms of weighted speedup for 8- and 16-core workloads. MSRB provides 14.5% and 21.6% improvement in performance for 8- and 16-core workloads respectively over the baseline.
An interesting case study is the row-buffer hit rates experienced by the individual programs in workload E1. Figure 7(b) shows the hit rates experienced by the individual programs for the baseline and MSRB. It can be seen that the row-buffer hit rate improves with MSRB for all the individual programs. Last, MSRB improved the performance in terms of IPC of single-core SPEC2000 and SPEC2006 workloads by 7.1% on average (refer to Figure 10 for the IPC values obtained for a subset of benchmarks). Observe that programs such as 459.GemsFDTD and 462.libquantum show considerably high gains (148% and 21%).

6.2 Energy Benefits of MSRB

Improved row hits not only boost performance but also translate into energy savings due to the reduced number of activations and precharges. The smaller size of the row-buffers also reduces the energy required for activate and precharge operations. The energy consumption is computed using the methodology described in Section 5.2. Figure 8 shows the DRAM energy gains provided by MSRB for the quad-core workloads. On average, MSRB reduces the energy requirements by 40% compared to the baseline. It is interesting to note that high energy gains are obtained not only for workloads showing high performance gains but also for the others, owing to the narrow row-buffers used. Further, as expected, improved hit rate translates not only into high performance gains but also into significant savings in energy consumed. The total activate power reduced on average from 357mW to 127mW. For 8-core and 16-core workloads, the energy gains are 28% and 31% respectively over the baseline.

6.3 Comparison with Memory-Access Scheduling Schemes

In this section, we compare MSRB with two state-of-the-art memory access scheduling methods, PARallelism-aware Batch Scheduling (PARBS) [2] and Thread Cluster Memory scheduling (TCM) [1]. We used the same values as in the original papers for the various threshold parameters associated with these schemes. We also compare our scheme to a hypothetical 32-bank DRAM configuration (32-Bank).
This is a configuration with a high number of banks, with each bank having a 256-column-wide row-buffer. 32-Bank has the same number of row-buffers as our MSRB configuration, which has 8 DRAM banks and 4 row-buffers per bank. In addition, as our scheme is orthogonal to memory-access scheduling schemes that improve row-hit rates, it is possible to have multiple row-buffers in TCM and PARBS. We refer to these configurations of TCM and PARBS enhanced with MSRB (four 256-column-wide row-buffers per bank) as TCM+MSRB and PARBS+MSRB respectively.

7 All energy numbers reported in this paper refer to DRAM energy.

Figure 6: Quad-Core Performance and Row-Buffer Hit Rates — (a) Weighted Speedup, (b) Harmonic Speedup, (c) Row-Buffer Hit Rate
Figure 7: Eight and Sixteen Core Performance — (a) Weighted Speedup, (b) Row-Buffer Hit Rate in E1
Figure 8: DRAM Energy Consumption
Figure 9: Performance Comparison with Other Scheduling Schemes — (a) Normalized Weighted Speedup, (b) Row-Buffer Hit Rate

Figure 9(a) shows the performance in terms of weighted speedup (normalized to the baseline) for PARBS, TCM, 32-Bank, MSRB, PARBS+MSRB and TCM+MSRB for quad-core workloads. For all the workloads, it can be seen that MSRB performs better than PARBS and TCM. On average, PARBS and TCM provide gains of only 8.5% and 10.5% over the baseline, while MSRB improves performance by 35.8%. The interesting thing to note is that MSRB can greatly aid the performance of TCM and PARBS, as can be seen from the significantly better speedups experienced by the TCM+MSRB and PARBS+MSRB schemes. The observed trend in performance is also reflected in the row-buffer hit rates exhibited by the various schemes. As can be seen from Figure 9(b), MSRB is more effective at improving row-buffer hit rates than PARBS and TCM applied on top of a single large row-buffer. It is interesting to note that the 32-Bank configuration gives only a 5.9% improvement in performance (weighted speedup) over the baseline. This is primarily because increasing the number of banks can exploit only a fraction of the temporal locality, after which the limitation of having only one row-buffer per bank shows up.
6.4 Sensitivity Study

We reorganized the baseline row-buffer into MSRB in our study and it outperformed all other configurations. But are there other MSRB configurations that can yield even better performance? For example, how do configurations with a different number or width of buffers perform? As row-buffer sizes below 256 columns resulted in noticeable losses even in the case of single cores, we do not evaluate them in detail. Figure 11 shows the row-buffer hit rates experienced by quad-core workloads for the MSRB configurations studied: an alternative configuration provides a row hit rate of 0.48, while our 4×256 configuration achieves 0.6. In terms of weighted speedup, it provided a gain of 34.5% over the baseline configuration. We also simulated an eight-core system with one memory controller. In the case of a single memory controller, the performance gains provided by MSRB are further enhanced, resulting in a performance improvement of 16% in terms of weighted speedup over the baseline.

Figure 12: Performance Impact of intraBP and EarlyPrecharge

6.5 Benefits of Early Precharge and Intra-Bank Parallelism

Figure 12 shows the performance in terms of weighted speedup (normalized to the baseline) for MSRB, EarlyPrecharge, IntraBP, and MSRB with both EarlyPrecharge and IntraBP for quad-core workloads. Also included is the performance of the baseline with a closed-page policy; this scheme is equivalent to EarlyPrecharge with a single row-buffer per bank. It can be observed that EarlyPrecharge with MSRB provides a gain of 40% over the baseline on average. Enabling EarlyPrecharge improves performance by 30% in workload Q9 compared to MSRB. EarlyPrecharge sacrifices some of the row hits to reduce the latency of a future row-buffer miss. In cases where the reduced latency for a row-buffer miss is not sufficient to offset the loss of row-buffer hits, as in Q5, EarlyPrecharge shows a drop in performance compared to MSRB. In summary, EarlyPrecharge yields an additional performance improvement of 1.8% on top of MSRB. IntraBP has a positive effect on every workload, as is to be expected. By utilizing data-bus cycles more efficiently, it achieves an average additional improvement of 4.7% on top of MSRB. While we do not present detailed results here due to lack of space, the average latency of each memory access reduces by 19% due to the increased parallelism.
When EarlyPrecharge and IntraBP are both enabled, we observe an average additional improvement of 5.9%. It may also be observed that while EarlyPrecharge fared poorly on workload Q5, the combined optimization restores it to the baseline. Similarly, in workload Q9, EarlyPrecharge gains significantly and that gain is retained in the combined optimization.

6.6 Impact of Fairness-oriented and Performance-oriented Allocation

Figure 13(a) plots the fairness of the baseline, MSRB, and MSRB with Fairness-oriented Allocation (MSRB+Fair) for quad-core workloads. As observed, MSRB improves fairness over the baseline by 20%, while MSRB+Fair improves fairness by an average of 43%. While detailed results are not included here, we observed that in several mixes, latency-sensitive cores that suffered significant unfairness got a boost from being allocated dedicated buffers. Performance-oriented Allocation ensures more row-buffers for memory-intensive programs. Figure 13(b) plots the weighted speedup for the baseline, MSRB, and MSRB with Performance-oriented Allocation (MSRB+Perf) for quad-core workloads. MSRB+Perf improves performance by 40.9% over the baseline. This corresponds to an additional improvement of 1.9% over MSRB. In workloads with more memory-intensive benchmarks, MSRB+Perf can improve performance by as much as 25%. Figure 13(c) shows the IPCs of the individual programs in workload Q9 with MSRB as well as MSRB+Perf. It can be seen that MSRB+Perf improves the performance of the memory-intensive programs 459.GemsFDTD and 462.libquantum without affecting the performance of programs like 445.gobmk and 410.bwaves, which are relatively less memory intensive.

7. RELATED WORK

There is a large body of work on intelligent memory scheduling to improve row-buffer locality and bank-level parallelism ([1], [2], [3], [4], [9]). These methods generally require fairly sophisticated tracking of memory access patterns in the memory controller to drive scheduling decisions.
Work on page coloring [6] and address-mapping techniques ([8], [7]) attempts to redistribute pages so as to improve performance ([8]) or reduce power ([6], [14]). Our proposed MSRB organization is orthogonal to all of the above schemes and can complement them. Work on phase-change memories in [15] discusses multiple row buffers as a mechanism to render PCMs a viable alternative to DRAMs. Though conceptually similar, we propose a practical implementation of this scheme in the context of widely used DRAM memories without requiring major changes to the rigid JEDEC standard. Further, we illustrate other optimization opportunities enabled by the MSRB organization. The work in [16] explores the benefit of building a more full-fledged SRAM cache in front of the DRAM array to catch more accesses in the cache. However, it necessitates a significant logic addition to the density-optimized DRAM design. Similarly, VCM memory [30], introduced briefly in the 90s by NEC, added a set of buffers shared across all the DRAM modules and introduced the notion of foreground and background operations. However, it introduced significant changes to the DRAM access standard, and the issue of buffer management was not systematically addressed. Smaller row-buffers for energy efficiency have received recent attention: smaller row-buffers result in reduced energy consumption with minimal impact on performance [12]. Mini-Rank [13], MC-DIMM [29] and Adaptive-Granularity [26] propose alternate data storage organizations to reduce energy consumption while attempting to maintain performance. In contrast, our MSRB organization achieves performance improvement along with energy reduction and fairness improvements.

8. CONCLUSIONS

In this paper we have proposed a row-buffer reorganization for DRAMs which offers significant energy reduction while simultaneously improving performance. These dual benefits make the proposed DRAM architecture an attractive solution for today's DRAM energy and performance issues. We discuss a feasible implementation of this architecture with minimal impact on existing DRAM standards.
Our implementation opens up new optimization opportunities such as Intra-Bank Parallelism and selective Early Precharge. Further, with MSRB, different row-buffer allocation schemes can be used to implement performance and fairness policies. We illustrated this flexibility using a pair of schemes: Fairness-oriented Allocation and Performance-oriented Allocation.

Figure 10: Single-Core IPC
Figure 11: Sensitivity Results for MSRB
Figure 13: Fairness and Performance improvements via buffer allocation schemes — (a) Fairness, (b) Weighted Speedup, (c) Workload Q9

MSRB showed a performance improvement of 35.8% for quad-core workloads. This improvement was accompanied by an energy reduction of 43% in the DRAM. In comparison, the state-of-the-art memory access scheduling schemes TCM [1] and PARBS [2] were able to improve the baseline performance by only 10.5% and 8.5%. Further, the additional performance optimizations, namely Early Precharging and Intra-Bank Parallelism, improved system performance by an additional 5.9%.

9. REFERENCES
[1] Y. Kim, M. Papamichael, O. Mutlu and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In MICRO.
[2] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA.
[3] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO.
[4] H. Zheng, J. Lin, Z. Zhang, Z. Zhu. Memory Access Scheduling Schemes for Systems with Multi-Core Processors. In ICPP.
[5] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. In ISCA.
[6] K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi, R. Balasubramonian and A. Davis. Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement. In ASPLOS.
[7] M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, and A. Davis. Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers. In PACT.
[8] Z. Zhang, Z. Zhu and X. Zhang. A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality. In MICRO.
[9] C. J. Lee, O. Mutlu, V. Narasiman and Y. N. Patt. Prefetch-Aware DRAM Controllers. In MICRO.
[10] L. Barroso and U. Holzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool.
[11] D. Meisner, B. Gold, and T. Wenisch. PowerNap: Eliminating Server Idle Power. In ASPLOS.
[12] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis, and N. P. Jouppi. Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores. In ISCA.
[13] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu. Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency. In MICRO.
[14] H. Huang, K. G. Shin, C. Lefurgy and T. Keller. Improving Energy Efficiency by Making DRAM Less Randomly Accessed. In ISLPED.
[15] B. C. Lee, E. Ipek, O. Mutlu and D. Burger. Architecting Phase Change Memory as a Scalable DRAM Alternative. In ISCA.
[16] W. Wong and J-L. Baer. DRAM Caching. Dept. of CS and Engg., University of Washington, Tech Report UW-CSE.
[17] The JEDEC consortium.
[18] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 Simulator: Modeling Networked Systems. In IEEE Micro.
[19] Micron. misc/ddr3_power_calc.xls.
[20] Micron. Calculating Memory System Power for DDR3. TN41_01DDR3%20Power.pdf.
[21] A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Processor. In ASPLOS 2000.
[22] K. Luo, J. Gummaraju, and M. Franklin. Balancing Throughput and Fairness in SMT Processors. In ISPASS 2001.
[23] The SPEC Consortium.
[24] O. La. SDRAM having posted CAS function of JEDEC standard. United States Patent.
[25] E. Ipek, O. Mutlu, J. F. Martinez, and R. Caruana. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. In ISCA.
[26] D. H. Yoon, M. K. Jeong, and M. Erez. Adaptive Granularity Memory Systems: A Tradeoff between Storage Efficiency and Throughput. In ISCA.
[27] S. Thoziyoor, N. Muralimanohar, and N. Jouppi. CACTI 5.0. Technical report, HP Laboratories.
[28] K. Itoh. VLSI Memory Chip Design. Springer.
[29] J. H. Ahn et al. Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs. In IEEE Computer Architecture Letters.
[30] S. Rixner. Memory Controller Optimizations for Web Servers. In IEEE Micro 2004.


AADL : about scheduling analysis AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution Dynamc Voltage Scalng of Supply and Body Bas Explotng Software Runtme Dstrbuton Sungpack Hong EE Department Stanford Unversty Sungjoo Yoo, Byeong Bn, Kyu-Myung Cho, Soo-Kwan Eo Samsung Electroncs Taehwan

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011 9/8/2 2 Outlne Appendx C: The Bascs of Logc Desgn TDT4255 Computer Desgn Case Study: TDT4255 Communcaton Module Lecture 2 Magnus Jahre 3 4 Dgtal Systems C.2: Gates, Truth Tables and Logc Equatons All sgnals

More information

3. CR parameters and Multi-Objective Fitness Function

3. CR parameters and Multi-Objective Fitness Function 3 CR parameters and Mult-objectve Ftness Functon 41 3. CR parameters and Mult-Objectve Ftness Functon 3.1. Introducton Cogntve rados dynamcally confgure the wreless communcaton system, whch takes beneft

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Memory Modeling in ESL-RTL Equivalence Checking

Memory Modeling in ESL-RTL Equivalence Checking 11.4 Memory Modelng n ESL-RTL Equvalence Checkng Alfred Koelbl 2025 NW Cornelus Pass Rd. Hllsboro, OR 97124 koelbl@synopsys.com Jerry R. Burch 2025 NW Cornelus Pass Rd. Hllsboro, OR 97124 burch@synopsys.com

More information

A Hybrid Genetic Algorithm for Routing Optimization in IP Networks Utilizing Bandwidth and Delay Metrics

A Hybrid Genetic Algorithm for Routing Optimization in IP Networks Utilizing Bandwidth and Delay Metrics A Hybrd Genetc Algorthm for Routng Optmzaton n IP Networks Utlzng Bandwdth and Delay Metrcs Anton Redl Insttute of Communcaton Networks, Munch Unversty of Technology, Arcsstr. 21, 80290 Munch, Germany

More information

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach Dstrbuted Resource Schedulng n Grd Computng Usng Fuzzy Approach Shahram Amn, Mohammad Ahmad Computer Engneerng Department Islamc Azad Unversty branch Mahallat, Iran Islamc Azad Unversty branch khomen,

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management.

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management. //7 Prnceton Unversty Computer Scence 7: Introducton to Programmng Systems Goals of ths Lecture Storage Management Help you learn about: Localty and cachng Typcal storage herarchy Vrtual memory How the

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Video Proxy System for a Large-scale VOD System (DINA)

Video Proxy System for a Large-scale VOD System (DINA) Vdeo Proxy System for a Large-scale VOD System (DINA) KWUN-CHUNG CHAN #, KWOK-WAI CHEUNG *# #Department of Informaton Engneerng *Centre of Innovaton and Technology The Chnese Unversty of Hong Kong SHATIN,

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems Relablty and Energy-aware Cache Reconfguraton for Embedded Systems Yuanwen Huang and Prabhat Mshra Department of Computer and Informaton Scence and Engneerng Unversty of Florda, Ganesvlle FL 326-62, USA

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Convolutional interleaver for unequal error protection of turbo codes

Convolutional interleaver for unequal error protection of turbo codes Convolutonal nterleaver for unequal error protecton of turbo codes Sna Vaf, Tadeusz Wysock, Ian Burnett Unversty of Wollongong, SW 2522, Australa E-mal:{sv39,wysock,an_burnett}@uow.edu.au Abstract: Ths

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

Efficient Broadcast Disks Program Construction in Asymmetric Communication Environments

Efficient Broadcast Disks Program Construction in Asymmetric Communication Environments Effcent Broadcast Dsks Program Constructon n Asymmetrc Communcaton Envronments Eleftheros Takas, Stefanos Ougaroglou, Petros copoltds Department of Informatcs, Arstotle Unversty of Thessalonk Box 888,

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems

FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems FIRM: Far and Hgh-Performance Memory Control for Persstent Memory Systems Jshen Zhao, Onur Mutlu, Yuan Xe Pennsylvana State Unversty, Carnege Mellon Unversty, Unversty of Calforna, Santa Barbara, Hewlett-Packard

More information

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems An Investgaton nto Server Parameter Selecton for Herarchcal Fxed Prorty Pre-emptve Systems R.I. Davs and A. Burns Real-Tme Systems Research Group, Department of omputer Scence, Unversty of York, YO10 5DD,

More information

MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices

MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices MQSm: A Framework for Enablng Realstc Studes of Modern Mult-Queue SSD Devces Arash Tavakkol, Juan Gómez-Luna, and Mohammad Sadrosadat, ETH Zürch; Saugata Ghose, Carnege Mellon Unversty; Onur Mutlu, ETH

More information

A fair buffer allocation scheme

A fair buffer allocation scheme A far buffer allocaton scheme Juha Henanen and Kalev Klkk Telecom Fnland P.O. Box 228, SF-330 Tampere, Fnland E-mal: juha.henanen@tele.f Abstract An approprate servce for data traffc n ATM networks requres

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems:

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems: Speed/RAP/CODA Presented by Octav Chpara Real-tme Systems Many wreless sensor network applcatons requre real-tme support Survellance and trackng Border patrol Fre fghtng Real-tme systems: Hard real-tme:

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.15 No.10, October 2015 1 Evaluaton of an Enhanced Scheme for Hgh-level Nested Network Moblty Mohammed Babker Al Mohammed, Asha Hassan.

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

Energy-Efficient Workload Placement in Enterprise Datacenters

Energy-Efficient Workload Placement in Enterprise Datacenters COVER FEATURE CLOUD COMPUTING Energy-Effcent Workload Placement n Enterprse Datacenters Quan Zhang and Wesong Sh, Wayne State Unversty Power loss from an unnterruptble power supply can account for 15 percent

More information

Cache Sharing Management for Performance Fairness in Chip Multiprocessors

Cache Sharing Management for Performance Fairness in Chip Multiprocessors Cache Sharng Management for Performance Farness n Chp Multprocessors Xng Zhou Wenguang Chen Wemn Zheng Dept. of Computer Scence and Technology Tsnghua Unversty, Bejng, Chna zhoux07@mals.tsnghua.edu.cn,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations* Confguraton Management n Mult-Context Reconfgurable Systems for Smultaneous Performance and Power Optmzatons* Rafael Maestre, Mlagros Fernandez Departamento de Arqutectura de Computadores y Automátca Unversdad

More information

THere are increasing interests and use of mobile ad hoc

THere are increasing interests and use of mobile ad hoc 1 Adaptve Schedulng n MIMO-based Heterogeneous Ad hoc Networks Shan Chu, Xn Wang Member, IEEE, and Yuanyuan Yang Fellow, IEEE. Abstract The demands for data rate and transmsson relablty constantly ncrease

More information

WIRELESS communication technology has gained widespread

WIRELESS communication technology has gained widespread 616 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 4, NO. 6, NOVEMBER/DECEMBER 2005 Dstrbuted Far Schedulng n a Wreless LAN Ntn Vadya, Senor Member, IEEE, Anurag Dugar, Seema Gupta, and Paramvr Bahl, Senor

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

CHAPTER 4 PARALLEL PREFIX ADDER

CHAPTER 4 PARALLEL PREFIX ADDER 93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information

3D vector computer graphics

3D vector computer graphics 3D vector computer graphcs Paolo Varagnolo: freelance engneer Padova Aprl 2016 Prvate Practce ----------------------------------- 1. Introducton Vector 3D model representaton n computer graphcs requres

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements Explct Formulas and Effcent Algorthm for Moment Computaton of Coupled RC Trees wth Lumped and Dstrbuted Elements Qngan Yu and Ernest S.Kuh Electroncs Research Lab. Unv. of Calforna at Berkeley Berkeley

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

Advanced Computer Networks

Advanced Computer Networks Char of Network Archtectures and Servces Department of Informatcs Techncal Unversty of Munch Note: Durng the attendance check a stcker contanng a unque QR code wll be put on ths exam. Ths QR code contans

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information