Utility-Based Hybrid Memory Management


Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu
Carnegie Mellon University, Dankook University, Beihang University, ETH Zürich

While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories.
For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.

1. Introduction

Modern large-scale computing clusters continue to employ dynamic random access memory (DRAM) as the main memory system within each server. However, as the amount of memory consumed by the applications running on these clusters (e.g., high-performance computing workloads, large-scale data analytics) grows, traditional DRAM-based memory systems are unlikely to be able to keep up with this growth. DRAM scaling is expected to become increasingly difficult [90, 91] due to increasing cell leakage current [42, 65, 66, 97], reduced cell reliability [46, 76, 91, 113], and increasing manufacturing complexity [37, 41, 46, 74, 90, 91, 96, 107]. As a result, other memory solutions have emerged to offer low-latency, low-power, or high-capacity substrates without heavily relying on DRAM scaling. New DRAM products such as 3D-stacked DRAM [3, 45, 60, 61, 99], reduced-latency DRAM (RLDRAM) [80], and low-power DRAM (LPDRAM) [82] make use of novel DRAM circuit designs, architectures, and interfaces to better cater to applications such as scientific computing, data mining, network traffic, and mobile computing. In addition, emerging non-volatile memory (NVM) technologies (e.g., PCM [53, 54, 55, 104, 124], STT-RAM [52], ReRAM [68], and 3D XPoint [83]) have shown promise for future main memory system designs to meet the increasing memory capacity demands of data-intensive workloads. With projected scaling trends, NVM cells can be manufactured more easily at smaller feature sizes than DRAM cells, achieving high density and capacity [14, 15, 52, 53, 54, 55, 68, 104, 107, 120, 124, 131]. However, these new memory technologies are unlikely to fully replace commodity DRAM in main memory systems. For example, 3D-stacked DRAM is limited in capacity [12]. RLDRAM has a higher cost-per-bit than commodity DRAM [8, 49, 58, 59].
Most NVMs incur high access latency and high dynamic energy consumption, and some NVM technologies have limited write endurance. To address these weaknesses, hybrid memory systems (or heterogeneous memory systems), comprised of both commodity DRAM and one of these alternative memory technologies, have been proposed. A hybrid memory system aims to combine the benefits of both of its component memory types in a cost-effective manner [104, 126]. For example, commodity DRAM is faster than NVM, but has a higher cost per bit. A hybrid memory with both commodity DRAM and NVM utilizes a small amount of DRAM and a large amount of NVM, to provide the illusion that the system has large memory capacity (of NVM), and that all data can be accessed at low latency (of DRAM). Hybrid memory systems can potentially meet both the performance and memory capacity (as well as memory energy efficiency) needs of large-scale computing clusters [4, 5, 31, 33, 64, 73, 75, 98, 100, 104, 126]. In order to successfully deliver high memory capacity at low latency, hybrid memory systems must make intelligent data placement decisions, choosing whether each page should be placed in the high-capacity memory or in the fast memory. Previous data management proposals for hybrid memories consider only a limited number of characteristics, using these few data points to construct a placement heuristic that is specific to the memory types being used in the system. For example, the majority of prior work on hybrid DRAM-NVM main memory systems either treats DRAM as a conventional cache [104] or places data with high access frequency, high write intensity, and/or low row buffer locality in DRAM [20, 39, 106, 126, 129], while placing the remaining data in NVM, as the access latency of NVM is generally higher than that of DRAM [53, 104]. A mechanism for combining commodity DRAM with 3D-stacked DRAM organizes the faster 3D-stacked DRAM as a page-granularity cache of the commodity DRAM, but identifies and places only the cache blocks that will be accessed in 3D-stacked DRAM [38]. Work on combining RLDRAM with commodity DRAM identifies and places only critical data words into the RLDRAM to reduce access latency [11]. These heuristic-based approaches do not directly capture the overall system performance benefits of data placement decisions (as we will show in Section 3). Therefore, they can only indirectly optimize system performance, which sometimes leads to sub-optimal data placement decisions. For example, let us consider a memory manager that migrates memory pages that are accessed frequently [39] and that inherently have a high access latency (i.e., they have low row buffer locality) [126] from the slower NVM to the faster commodity DRAM. A page migration based on only these two heuristics may not improve system performance if, for instance, accesses to the page being migrated are completely overlapped with other requests from the same application that continue to access the slower NVM. In such a case, the latency reduction for accesses to the migrated page would not reduce the application's execution time, as the application still needs to wait for the accesses to the slower NVM to complete. The example memory manager is unable to capture this overlap with its simple heuristics, and thus incorrectly decides to migrate the page in this example. Our goal in this work is to devise a generalized mechanism that directly estimates the overall system performance benefit of migrating a page between any two types of memory, and places only the performance-critical data in the fastest memory within the hybrid main memory system.
To this end, we propose utility-based hybrid memory management (UH-MEM), a new hardware mechanism that estimates the marginal performance utility of each page (i.e., the system performance benefit of migrating the page to a faster memory type), and migrates only those pages with the greatest utility. UH-MEM employs two steps. First, it determines how much migrating a page belonging to an individual application would improve that application's performance. To do this, UH-MEM uses a new performance model that considers several factors, including how frequently each page is accessed, whether row buffer locality impacts the performance benefits of migration, and how much the page access latency is hidden by overlapping requests (i.e., the level of memory-level parallelism, or MLP [13, 57, 87, 92, 93, 94]). Second, UH-MEM estimates how much the improvement of a single application's performance benefits the overall system performance, as different workloads have different amounts of impact on overall system performance. UH-MEM migrates those pages with the greatest estimated system-level performance benefit from slow memory into fast memory.

Key Results. We extensively evaluate UH-MEM using a wide range of hybrid memory configurations, and show that it is effective at improving system performance over state-of-the-art hybrid memory managers. We quantitatively show that for a memory system with both conventional DRAM and NVM, UH-MEM improves system performance by 14% on average (and up to 26%) compared to the best of three state-of-the-art mechanisms that we evaluate (a conventional cache insertion mechanism [104], an access frequency based mechanism [39, 106], and a row buffer locality based mechanism [126]), for a large number of data-intensive workloads. We also show that the hardware cost of UH-MEM is very modest (~40KB in our baseline system).

In this paper, we make three main contributions:

- We propose the first general utility metric to estimate the potential system performance benefit of migrating a page between the different memories within a hybrid main memory system. This utility metric represents the system performance benefit as a function of (1) an application's stall time reduction if the accessed page is migrated to a faster type of memory, and (2) how an improvement to a single application's stall time impacts overall system performance.
- We propose a new performance model that can be implemented in hardware, which comprehensively considers the access frequency, row buffer locality, and MLP of a page to systematically estimate an application's stall time reduction from migrating the page. This is the first work to consider MLP in addition to access frequency, row buffer locality, and write intensity, and to model the interactions between them, for page placement decisions.
- Based on our new metric and new performance model, we propose the first utility-based hybrid memory management mechanism, UH-MEM, which selectively places pages that are most beneficial to overall system performance in fast memory within a hybrid memory system. Our mechanism is general, and works with a wide variety of memory types that can be used in a hybrid memory system. We quantitatively demonstrate that UH-MEM outperforms three state-of-the-art hybrid memory management techniques.

2. Background

In this section, we provide background on the organization and management of hybrid memory systems. Figure 1 shows an example hybrid memory system. This hybrid memory system has two different types of memory, which we call Memory A and Memory B. One of these memories (we arbitrarily choose Memory A) is faster than the other, while the other memory (Memory B) has a greater capacity due to its higher density. The goal of a hybrid memory system is to provide the large main memory capacity of Memory B, while providing the fast access latencies of Memory A for memory accesses that affect execution time.

[Figure 1: A typical hybrid memory system. Cores/caches connect through memory controllers to Memory A (fast, small) over Channel A and to Memory B (large, slow) over Channel B; each memory contains banks with row buffers.]

When a memory request is issued by a processor (e.g., the CPU), the memory controllers determine whether the request should be sent to Memory A or Memory B. Each memory has its own memory channel (i.e., a bus that connects the memory to its respective memory controller), and is internally organized similarly to today's DRAM.[1] Each memory consists of multiple banks, where each bank is a two-dimensional array of memory cells organized into rows and columns. Each bank can operate in parallel, but all banks within a channel share the address, data, and command buses. Within each bank, there is an internal buffer called the row buffer. When data is accessed from a bank, the entire row containing the data is brought into the row buffer. Hence, a subsequent access to data from the same row can be served from the row buffer and need not access the array. Such an access is called a row buffer hit. If a subsequent access is to data in a different row, the contents of the row buffer need to be written back to the row, and the new row's contents need to be brought into the row buffer. Such an access is called a row buffer conflict (or row buffer miss). A row buffer miss incurs a much higher latency than a row buffer hit. Previous works on hybrid memory systems observe that the latency of a row buffer hit is similar across memory types, while the latency of a row buffer conflict/miss is generally much higher in denser memories [53, 54, 55, 78, 79, 126]. The fraction of row buffer hits out of all memory accesses to a row is called row buffer locality. We can expect that migrating a page with low row buffer locality to the fast memory benefits performance, as a low-locality page experiences more row buffer misses, and such misses are serviced at a lower latency in the fast memory.
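The row buffer behavior described above can be sketched in a few lines. The following toy model (our illustration, not hardware from the paper) classifies each access as a row buffer hit or miss by tracking the currently open row of a bank, and computes row buffer locality as the hit fraction; the `Bank` class and access stream are illustrative assumptions.

```python
# Toy model of a memory bank's row buffer: classify accesses as hits/misses
# and compute row buffer locality (RBL). Illustrative sketch only.
class Bank:
    def __init__(self):
        self.open_row = None  # no row in the row buffer initially

    def access(self, row):
        """Return 'hit' if `row` is already open, else open it and return 'miss'."""
        if self.open_row == row:
            return "hit"
        self.open_row = row   # row buffer conflict/miss: open the new row
        return "miss"

def row_buffer_locality(rows_accessed, bank):
    """RBL = fraction of accesses that hit in the row buffer."""
    hits = sum(1 for row in rows_accessed if bank.access(row) == "hit")
    return hits / len(rows_accessed)

# Row 7 misses once then hits twice; row 9 misses: RBL = 2/4.
print(row_buffer_locality([7, 7, 7, 9], Bank()))  # 0.5
```

A page whose accesses alternate between rows of the same bank would score an RBL of 0, making it (by the reasoning above) a good candidate for migration to fast memory.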
Conversely, we can expect that migrating a page with high row buffer locality does not benefit performance much, as most of the accesses to such a high-locality page hit in the row buffer, and a row buffer hit has a similar latency in both the fast memory and the slow memory [126]. An important issue for a hybrid memory system is how to manage data stored in different memory devices. In our study, we adopt the configuration proposed by Qureshi et al. [104], and organize the fast, small memory (Memory A) as a cache for the pages in the large, slow memory (Memory B). We assume that all pages are initially in Memory B. Instead of unconditionally migrating a page when the page is accessed [69, 77, 102, 104], we selectively migrate pages into Memory A based on some metric, which is the utility of the page in our proposal. This migration may trigger the eviction of a victim page cached in Memory A, which is handled by the cache replacement policy of Memory A. We discuss our migration mechanism in Section 4.1. The migration process between memory devices is fully managed by hardware, and is transparent to the OS.

[1] We refer the reader to prior works for the detailed internal operation, organization, and control of DRAM [9, 10, 34, 46, 49, 59, 66, 88, 93, 111].

3. Motivation

In systems that can issue multiple memory requests in parallel (e.g., out-of-order execution processors, multicore processors, runahead processors), the number of cycles saved for a single memory request does not directly translate into a reduction in the application's execution time. In order to estimate the true utility of a page (i.e., the impact that migrating that page has on system performance), we need to estimate (1) by how much the latency reduction from migration would reduce the individual application's execution time (i.e., the application's stall time reduction), and (2) by how much the application's stall time reduction translates to an improvement in overall system performance (i.e., the sensitivity of overall system performance to each application's stall time).
In this section, we first demonstrate that we need to comprehensively consider three major factors, i.e., access frequency, row buffer locality, and memory-level parallelism (MLP), to estimate the stall time reduction a page provides when migrated. These factors were not fully captured in prior works [20, 39, 106, 126, 129], none of which try to estimate the effect of migration on application or system performance. Then, we show that overall system performance exhibits different sensitivity to different applications' stall time reductions, and that we want to migrate pages from applications with high sensitivity to maximize overall system performance.

3.1. Comprehensive Stall Time Estimation of an Application

To the first order, an application's stall time reduction depends on two parts: (1) how much the latency for accessing the page can be reduced, and (2) how this latency overlaps with the latencies of other memory requests from the application. For the first part, since only the row buffer miss accesses can achieve shorter latency after the migration, we need to comprehensively consider the access frequency and row buffer locality of the page (i.e., we can count the number of row buffer misses to the page) to estimate the latency reduction for the memory requests to the page. The second part depends on the parallelism of memory requests from an application (MLP). MLP is the number of concurrent outstanding requests (i.e., the in-flight memory requests that are yet to be completed) from the same application [13, 30, 87, 92, 93, 94]. In our mechanism, we consider the MLP for each page, and check how many concurrent requests from the same application typically exist when the page is accessed. If there are many concurrent requests, the access latency to the page is likely to overlap with the access latency to other pages, and therefore migrating the page to fast memory, while it may reduce its access latency, will likely result in only a limited or small reduction in the application's stall time. We illustrate this MLP effect using the conceptual example in Figure 2. Pages 0, 1, and 2 all have the same number of row buffer miss requests. Requests to Page 0 are not overlapped with other requests from the same application, while requests to Pages 1 and 2 are overlapped. We would like to see by how much the application's stall time would be reduced if we migrate each of these pages from slow memory to fast memory.

[Figure 2: Conceptual example showing that the MLP of a page influences how much effect its migration to fast memory has on the application stall time. (a) Alone request: migrating Page 0 reduces the application stall time by ΔT. (b) Overlapped requests: migrating Pages 1 and 2 together also reduces the application stall time by only ΔT.]

Suppose we migrate Page 0 to fast memory (Figure 2a). As there is no other request that overlaps with the request to Page 0, the request to Page 0 is likely to be stalling at the head of the processor reorder buffer (ROB), which often stalls the entire application [29, 51, 87, 92, 94, 95, 103]. The requests to Page 0 will complete faster upon migration, thereby decreasing the application's stall time and thus being more likely to improve application performance [29, 51, 87, 92, 94, 95, 103]. On the other hand, if we migrate both Pages 1 and 2 to fast memory (Figure 2b), requests to both pages also complete faster, but the application's overall stall time will be reduced by roughly the same amount as that enabled by migrating only Page 0, since the access latencies to Pages 1 and 2 are overlapped. In other words, despite incurring double the number of migrations and consuming double the amount of limited fast memory capacity by migrating two overlapping pages (Pages 1 and 2), we achieve only the same performance benefit enabled by migrating only a single page that is serviced alone (Page 0).
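The overlap effect in Figure 2 can be captured with a toy model (our simplifying assumption, not the paper's hardware): treat the stall-time reduction from migrating a page as its latency reduction scaled by 1/MLP, the fraction of its latency not hidden by concurrent requests.

```python
# Toy model of the MLP overlap effect from Figure 2 (illustrative only):
# a page's stall-time contribution is its latency reduction divided by MLP.
def stall_time_saved(latency_reduction, mlp):
    return latency_reduction / mlp

dT = 100.0  # latency saved by migrating one page (arbitrary units)

# Page 0 is serviced alone (MLP = 1): the full dT is saved.
print(stall_time_saved(dT, mlp=1))                         # 100.0
# Pages 1 and 2 overlap each other (MLP = 2): two migrations, same total.
print(stall_time_saved(dT, 2) + stall_time_saved(dT, 2))   # 100.0
```

Under this model, migrating both overlapped pages consumes twice the fast-memory capacity for the same total benefit as migrating the single alone-serviced page, which is exactly the inefficiency the text describes.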
Unfortunately, without MLP, we are unable to build a comprehensive model that distinguishes between these two scenarios, and mechanisms that consider only row buffer locality and access frequency may migrate pages like Pages 1 and 2 that contribute less to reducing the application's stall time.[2] Figure 3 shows the distribution of MLP across all memory pages for three representative benchmarks: soplex, xalancbmk, and YCSB-B [16, 35].[3] We can see that different pages within an application have very different MLP. Other benchmarks in our evaluation exhibit similar MLP diversity across their pages. Hence, we can take advantage of this diversity to optimize system performance.

[2] In fact, if a mechanism migrates only one of the overlapping pages (either Page 1 or Page 2), it is unlikely that it will reduce stall time at all, as the non-migrated page would still stall the CPU. A similar observation is made by Qureshi et al. in the context of caching [103].

[Figure 3: MLP distribution for all pages in three workloads: (a) soplex, (b) xalancbmk, (c) YCSB-B. Each panel plots the frequency (%) of pages at each MLP value.]

In order to quantify the impact of different factors on an application's stall time, we measure the stall time contribution of each page (i.e., the time that the outstanding memory requests to the page cause the processor to stall) for every benchmark in our evaluation. Table 1 shows the correlation coefficients between the average stall time per page and three different page-level access characteristic metrics (i.e., access frequency, row buffer locality, and MLP), along with combinations of the three.[4] This shows that independently, access frequency, row buffer locality, and MLP all correlate somewhat with a page's stall time contribution. However, this correlation becomes very strong when we comprehensively consider all three factors together (correlation coefficient = 0.92). We see that the two factors considered together in prior work (access frequency and row buffer locality) [126] do not correlate nearly as strongly (correlation coefficient = 0.76).
Therefore, we conclude that access frequency, row buffer locality, and MLP are all indispensable factors to comprehensively model the performance impact of data placement.

[Table 1: Absolute Spearman correlation coefficients between the average stall time per page and different factors (AF: access frequency; RBL: row buffer locality; MLP: memory-level parallelism), for AF, RBL, and MLP individually and for the combinations AF+RBL, AF+MLP, and AF+RBL+MLP. The correlation coefficients are between 0 and 1, where 0 = no correlation, and 1 = perfect correlation.]

[3] We run each workload separately on a system that is similar to the configuration shown in Section 5, though we use a single-core processor for the experiments shown here. When a page in the workload is accessed by a memory request, we measure how many outstanding memory requests with the same type (i.e., either read or write) exist in the workload, and use that number as the current MLP of the page. We then calculate the average MLP of each page, and report the distribution of average MLP across all of the pages in these figures.

[4] For each benchmark, we divide all of its pages into several bins, sorted by the values of the factors under consideration. We then calculate the average stall time per page for each bin. We analyze the correlation between the average stall time and the factors, and obtain the correlation coefficient. We report the average correlation coefficient over all of our benchmarks.
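The binning-and-correlation methodology described above can be reproduced in miniature: Spearman correlation is simply the Pearson correlation of ranks. The sketch below is stdlib-free, and the input values are hypothetical per-bin numbers, not the paper's measurements.

```python
# Miniature version of the correlation analysis behind Table 1:
# Spearman correlation = Pearson correlation of the rank-transformed data.
def ranks(xs):
    """Rank values 1..n, assigning average ranks to ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[j]]:
            j += 1
        avg = (i + j) / 2 + 1                # average rank of a tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# A perfectly monotonic relation yields a coefficient of 1.0:
print(spearman([1, 2, 3, 4], [10, 20, 40, 80]))  # 1.0
```

In the paper's setting, `xs` would be a combined-factor score per bin and `ys` the measured average stall time per page in that bin.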

3.2. Estimating Effect on Overall System Performance

Prior proposals for hybrid memory page management, which use only heuristics that are, as we have shown in Section 3.1, only somewhat correlated to application performance [11, 20, 38, 39, 104, 106, 126, 127, 128, 129], fail to capture how the stall time of a single application affects overall system performance. We find that this impact is not uniform across the applications within a multiprogrammed workload. There are several different metrics that can be used to express system performance, as has been discussed in a number of prior works [6, 26, 71, 112] (e.g., weighted speedup, harmonic speedup). These metrics express overall system performance by weighing the performance of each application within the workload differently, based on some application characteristics. For example, weighted speedup normalizes the performance of each application to its performance when running alone, in order to capture the effects of system interference between applications [26, 112]. For two applications with an equal amount of stall time reduction (in terms of absolute cycle count), the reduction for the application with a greater weight will result in a greater system performance improvement. As prior page management mechanisms are oblivious to the unequal impact of application performance benefits on overall system performance, they can migrate pages that are less important for overall system performance into the fast memory. We, therefore, incorporate the relation between application performance and overall system performance directly into our mechanism, using application weighting to prioritize pages from applications that impact the overall system performance the most. In this work, we use weighted speedup [112], which has been shown to correspond to system throughput for multiprogrammed workloads [26]. However, system designers with other target objectives can use different system performance metrics, by simply modifying the system performance estimation hardware within our proposed mechanism.

4. UH-MEM: Utility-Based Hybrid Memory Management

In this section, we introduce utility-based hybrid memory management (UH-MEM). UH-MEM is a hardware mechanism that resides within the memory controller. It performs interval-based calculations to determine which pages should be migrated from slow memory to fast memory, where fast memory is treated as a set-associative (16-way) page cache with an LRU cache replacement policy, similar to prior work [77, 104, 126]. During each interval (1 million cycles in our experiments, determined empirically), pages are selected for migration by UH-MEM, and a migration mechanism caches the data in the fast memory by copying the data first to the migration buffer in the memory controller, and then to the fast memory. Once a page is migrated to fast memory, it is inserted into a tag store within the memory controller. Whenever a request misses in the last-level on-chip cache, it looks up the tag store and the migration buffer, to see if the requested data resides in fast memory or in the migration buffer. The request is then dispatched to the appropriate location based on this lookup. As with on-chip caches, UH-MEM's operations are transparent to the OS.

4.1. Mechanism Overview

UH-MEM comprehensively estimates how the migration of each page would improve overall system performance, which we define as the utility of each page (see Section 3). The page utility calculation, as performed in hardware, is described in detail in Section 4.2. During each interval, when a page is accessed in slow memory, UH-MEM migrates the page to fast memory if its utility is greater than the migration threshold. It is not beneficial to move every accessed page into fast memory, because (1) migration operations take time to complete, and (2) doing so would cause the slow memory bandwidth to go unused. We include a mechanism to dynamically set the migration threshold at the end of each interval, which we discuss in Section 4.3. When a page is selected for migration, we first check the tag store of the fast memory to see if we need to evict another page in the destination fast memory cache set.
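As a rough illustration of the fast-memory organization described above, here is a minimal sketch of a set-associative page cache with LRU replacement; the set-indexing scheme (page number modulo the number of sets) and the software data structures are our own simplifying assumptions, standing in for the hardware tag store.

```python
# Minimal sketch of fast memory as a 16-way set-associative page cache
# with LRU replacement (illustrative model of the tag store, not RTL).
from collections import OrderedDict

class FastMemoryCache:
    def __init__(self, num_sets, ways=16):
        self.num_sets = num_sets
        self.ways = ways
        # One LRU-ordered tag store per set (oldest entry first).
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def lookup(self, page):
        """True if the page resides in fast memory; refreshes its LRU position."""
        s = self.sets[page % self.num_sets]
        if page in s:
            s.move_to_end(page)       # mark as most recently used
            return True
        return False

    def insert(self, page):
        """Migrate a page in, returning the evicted LRU victim (or None)."""
        s = self.sets[page % self.num_sets]
        victim = None
        if len(s) >= self.ways:
            victim, _ = s.popitem(last=False)   # evict least recently used
        s[page] = None
        return victim
```

Note that `lookup` does not insert on a miss: consistent with the selective policy above, a page enters fast memory only when UH-MEM explicitly calls `insert` for a page whose utility exceeds the migration threshold.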
We implement a migration buffer within the memory controller to temporarily hold the migrating page(s). Each cache block in the buffer includes two migration status bits to determine where the cache block currently resides (i.e., in either of the memories, or in the buffer). The status bits allow UH-MEM to direct incoming memory requests for a migrating page to the correct place. After completing the data movement, the corresponding metadata information in the tag store is updated.

4.2. Computing Page Utility

The utility of a page depends on (1) the stall time reduction of an application due to migration of the page to the fast memory, and (2) the system performance sensitivity to the application.[5] Suppose that one page of Application i is migrated to fast memory, such that the application stall time is reduced by ΔStallTime_i. The utility of that page (U) can be expressed as:

U = ΔStallTime_i × Sensitivity_i    (1)

4.2.1. Estimating Application Stall Time Reduction. The stall time reduction due to a page migration is dependent on two factors: (1) the access latency reduction for that page, and (2) the degree to which the page's access latency is masked (i.e., overlapped) by the access latency of other concurrent requests for the same application. The degree to which a page's total access latency is reduced can be determined by using a combination of the page's access frequency and row buffer locality. If a page is migrated from slow memory to fast memory, the latency of row buffer misses decreases, while row buffer hits still achieve a similar latency. Therefore, the expected decrease in access latency is proportional to the total number of row buffer misses for that page, which is a function of access frequency and row buffer locality. We can estimate this decrease as:

ΔReadLatency = #ReadMiss × (t_slow,read − t_fast,read)    (2)
ΔWriteLatency = #WriteMiss × (t_slow,write − t_fast,write)

where #ReadMiss and #WriteMiss are the number of row buffer read and write misses, respectively, and t_fast,read, t_fast,write, t_slow,read, and t_slow,write are the device-specific read/write latencies incurred on a row buffer miss for fast memory and slow memory, respectively. In order to quantify the degree of access latency masking, we sample the total number of outstanding memory requests for that same application to model the overlap effect. Specifically, we define the MLP ratio of an application to be the reciprocal of the outstanding memory request count.[6] Intuitively, if there are fewer outstanding requests, then there is less memory-level parallelism available to overlap the page's access latency. As such, we use the reciprocal of the number of outstanding memory requests so that the MLP ratio represents the fraction of the access latency that impacts the application's performance. During a sampling period t, the MLP ratio for an application with N_read,t / N_write,t outstanding read/write requests is as follows, respectively for reads and writes:

MLPRatio_read,t = 1 / N_read,t        MLPRatio_write,t = 1 / N_write,t    (3)

We can use the MLP ratio of the application to determine the MLP ratio for individual pages. For most applications, different pages do not typically have equal amounts of MLP. Therefore, we approximate an average MLP ratio for each page across all of the sampling periods that have taken place so far in the current interval. We compute two values, PageMLPRatio_read and PageMLPRatio_write, which are the average MLP ratio of a page during the interval for outstanding read and write requests, respectively, to that page.

[5] Without loss of generality, we use the term application to refer to a hardware thread context executing an application.
We can model PageMLPRatio_read and PageMLPRatio_write as:

PageMLPRatio_read = (Σ_t MLPRatio_read,t × m_read,t) / (Σ_t m_read,t) = (Σ_t m_read,t / N_read,t) / (Σ_t m_read,t)    (4)
PageMLPRatio_write = (Σ_t MLPRatio_write,t × m_write,t) / (Σ_t m_write,t) = (Σ_t m_write,t / N_write,t) / (Σ_t m_write,t)

To calculate PageMLPRatio_read, we start with the overall application MLP ratio at each sampling period t (MLPRatio_read,t). We determine the total contribution of the page to the application's MLP during sampling period t by multiplying MLPRatio_read,t with the number of outstanding read requests during the sampling period to the page (m_read,t). We then sum up the page's MLP contributions over all of the sampling periods so far in the current interval, and divide it by the total number of outstanding read requests to the page during these sampling periods. This, in effect, gives us the average MLP contribution of each outstanding read request for the page. We repeat the same calculation for write requests. We can now combine the latency reduction (Equation 2) and the average MLP ratio (Equation 4) to determine the stall time reduction for Application i as a result of migrating a particular page:

ΔStallTime_i = ΔReadLatency × PageMLPRatio_read + p × ΔWriteLatency × PageMLPRatio_write    (5)

where p represents the probability that the write requests appear on the critical path. Prior work [130] has shown that this probability is dependent on an application's write access pattern, and is generally larger if the application has a large number of write requests. For simplicity, we choose to set p = 1, though using an online iterative approach to determine p [130] may yield better performance, since it can enhance the accuracy of the stall time estimation. Equation 5 shows that the stall time reduction due to a page migration from slow memory to fast memory can be determined by using a combination of access frequency, row buffer locality, and MLP for each page. Intuitively, a high access frequency and low row buffer locality increase the number of total row buffer misses, thus enlarging the benefits of migrating to fast memory.
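Equations 2-5 can be sketched in software as follows; the paper implements them with hardware counters, and the latency values and sampling data below are illustrative assumptions.

```python
# Software rendering of Equations 2-5 (the paper uses hardware counters).
def page_mlp_ratio(samples):
    """Equations 3-4. samples: one (m_t, N_t) pair per sampling period t,
    where m_t = outstanding requests to this page and N_t = outstanding
    requests of the whole application (so MLPRatio_t = 1 / N_t)."""
    num = sum(m / n for m, n in samples if n > 0)   # sum_t m_t * MLPRatio_t
    den = sum(m for m, _ in samples)                # sum_t m_t
    return num / den if den else 0.0

def stall_time_reduction(n_read_miss, n_write_miss,
                         read_samples, write_samples,
                         t_slow_rd, t_fast_rd, t_slow_wr, t_fast_wr, p=1.0):
    """Equations 2 and 5: row-buffer-miss latency reduction, scaled by the
    page's average MLP ratios (p = write critical-path probability)."""
    d_read = n_read_miss * (t_slow_rd - t_fast_rd)       # Equation 2
    d_write = n_write_miss * (t_slow_wr - t_fast_wr)
    return (d_read * page_mlp_ratio(read_samples)        # Equation 5
            + p * d_write * page_mlp_ratio(write_samples))

# A page whose read requests are always serviced alone (m_t = N_t = 1)
# receives the full latency reduction: 10 misses * (150 - 50) = 1000.
print(stall_time_reduction(10, 0, [(1, 1)] * 4, [], 150, 50, 0, 0))  # 1000.0
```

If the same page instead always had three other outstanding requests alongside it (N_t = 4), its MLP ratio would fall to 0.25 and the estimated stall-time reduction would shrink accordingly, matching the intuition of Section 3.1.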
Likewise, poor MLP, with fewer concurrent outstanding requests, increases the average MLP ratio due to the low likelihood of overlapping the request latency, and also increases the benefits from migration.

4.2.2. Estimating System Performance Sensitivity. For multiprogrammed workloads, we use the weighted speedup metric [27, 112] to characterize system performance.[7] For each application, the speedup component of Application i is the ratio of its execution time when running alone, i.e., without interference from other applications (T_alone,i), to that when running together with other applications (T_shared,i):

System Performance = Σ_i Speedup_i = Σ_i T_alone,i / T_shared,i    (6)

[6] We calculate the MLP ratio separately for reads and writes, to account for their different behavior in main memory. While reads are often serviced as soon as possible (as they can fall along the critical path of execution), writes are deferred, and are eventually drained in batches [56, 110]. Distinguishing between reads and writes allows us to more accurately determine the MLP behavior affecting each type of request.

[7] UH-MEM can be adapted to use different system performance or fairness metrics [22, 24, 32, 47, 48, 86, 88, 93, 116, 117, 121, 125]. In order to support different system performance metrics, we can implement logic to estimate the sensitivity for each metric, and let the OS choose the most suitable metric to optimize based on the applications currently running within the system and the user's preferences.
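A minimal sketch of the weighted speedup metric in Equation 6, together with the per-application sensitivity it induces (Speedup_i / T_shared,i, derived as Equation 9); the execution times below are made-up illustrative values.

```python
# Sketch of Equation 6 (weighted speedup) and the per-application
# sensitivity Speedup_i / T_shared,i (Equation 9). Times are illustrative.
def weighted_speedup(t_alone, t_shared):
    """Equation 6: sum over applications of T_alone,i / T_shared,i."""
    return sum(a / s for a, s in zip(t_alone, t_shared))

def sensitivity(speedup_i, t_shared_i):
    """Equation 9: system performance gained per cycle of stall-time
    reduction for Application i."""
    return speedup_i / t_shared_i

t_alone = [100.0, 300.0]
t_shared = [200.0, 400.0]
print(weighted_speedup(t_alone, t_shared))                  # 1.25
# The first app gains more system performance per saved stall cycle:
print(sensitivity(0.5, 200.0) > sensitivity(0.75, 400.0))   # True
```

This illustrates the non-uniformity argued in Section 3.2: an identical absolute stall-time reduction is worth more to the system when granted to the application with the higher sensitivity.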

When Application i migrates a page to fast memory, the speedup of that application improves by ΔSpeedup_i:

$$Speedup_i' = \frac{T_{alone,i}}{T_{shared,i} - \Delta StallTime_i} \qquad (7)$$

Since the stall time reduction due to page migration is generally much smaller than the execution time (ΔStallTime_i ≪ T_alone,i, T_shared,i), we can perform a Taylor expansion to find the change in speedup:

$$\Delta Speedup_i = \frac{T_{alone,i}}{T_{shared,i} - \Delta StallTime_i} - \frac{T_{alone,i}}{T_{shared,i}} = \frac{T_{alone,i}\,\Delta StallTime_i}{(T_{shared,i} - \Delta StallTime_i)\,T_{shared,i}} \approx \frac{Speedup_i}{T_{shared,i}}\,\Delta StallTime_i \qquad (8)$$

We defined the performance sensitivity of the system to an application in Section 3.1 as the measure of how the change in an application's stall time impacts the overall system performance. We can thus estimate it using Equation 9 (by plugging in Equation 8 at the appropriate place):

$$Sensitivity_i = \frac{\Delta Performance}{\Delta StallTime_i} = \frac{\Delta Speedup_i}{\Delta StallTime_i} = \frac{Speedup_i}{T_{shared,i}} \qquad (9)$$

We calculate the performance sensitivity using an interval-based approach, where the speedup (Speedup_i) and execution time (T_shared,i) obtained in the last interval are used to estimate performance sensitivity in the current interval. The execution time of each application running on the system is equal to the length of an interval. We need to estimate the speedup of the application (Speedup_i) during the interval. This speedup estimate can be obtained by using prior proposals [22, 23, 84, 88, 118, 119]. These works consider the impact of memory interference and/or cache contention on the speedup of an application. In our implementation, we estimate speedup based on the approach in [88].

Equations 5 and 9 are combined using Equation 1 to give us the overall utility of migrating the page in question. A few measurements are required to obtain this utility calculation, and we discuss the implementation details of these mechanisms in Section 4.4.

4.3. Performing Page Migration

Algorithm 1 summarizes how UH-MEM decides which pages it should move to the fast memory. Whenever an outstanding memory request completes, UH-MEM (1) updates counters that hold statistics for the page accessed by the request, (2) recalculates the utility of the page, and (3) compares the calculated utility with the migration threshold.
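This per-request decision, together with the end-of-interval hill-climbing threshold adjustment described below, can be sketched as follows. Equation 1 is not reproduced in this excerpt; this sketch assumes it multiplies the stall time reduction (Equation 5) by the sensitivity (Equation 9). All parameter values (initial threshold, step size) are illustrative, not values from the paper.

```python
class UhMemPolicy:
    """Sketch of UH-MEM's migration decision (Algorithm 1), assuming
    Utility = dStallTime_i * Sensitivity_i as the form of Equation 1."""

    def __init__(self, threshold=100.0, step=10.0):
        self.threshold = threshold
        self.step = step               # signed hill-climbing step
        self.prev_total_stall = None

    def sensitivity(self, speedup_last, t_shared_last):
        # Equation 9, using values measured in the previous interval.
        return speedup_last / t_shared_last

    def utility(self, d_stall_time, speedup_last, t_shared_last):
        return d_stall_time * self.sensitivity(speedup_last, t_shared_last)

    def should_migrate(self, d_stall_time, speedup_last, t_shared_last):
        # Migrate only if the utility exceeds the migration threshold.
        return self.utility(d_stall_time, speedup_last, t_shared_last) > self.threshold

    def end_of_interval(self, total_stall_time):
        # Hill climbing: keep direction if total stall time dropped, else reverse.
        if self.prev_total_stall is not None and total_stall_time > self.prev_total_stall:
            self.step = -self.step
        self.threshold = max(0.0, self.threshold + self.step)
        self.prev_total_stall = total_stall_time
```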
The page will only be migrated from slow memory to fast memory if the utility exceeds the migration threshold. At the end of each interval, UH-MEM adjusts the migration threshold to account for transient application behavior, and clears the page statistics counters.

Algorithm 1: Migrating pages with UH-MEM.
1: for every interval do
2:   for every completed memory request do
3:     Update the corresponding page's statistics counters
4:     Calculate the page's utility (Section 4.2)
5:     if the page's utility exceeds the migration threshold then
6:       Migrate the page to the fast memory
7:     end if
8:   end for
9:   if at the end of the interval then
10:    Adjust the migration threshold (Section 4.3)
11:    Estimate speedup for each application (Section 4.2.2)
12:    Reset all counters to zero
13:  end if
14: end for

A key question is how to determine this migration threshold. We choose to use a hill-climbing-based approach to determine this threshold dynamically, similar to the policy used by Yoon et al. [126]. We use the total stall time of all applications in each interval to reflect the system performance. At the end of each interval, the total stall time is recalculated. We then compare the current total stall time with the total stall time from the previous interval, and determine whether the previous threshold adjustment yielded a system performance improvement. If the total stall time of the current interval is lower (meaning that the threshold adjustment improved system performance), we continue to adjust the threshold in the same direction. Otherwise, since the previous adjustment degraded performance, we move the threshold in the opposite direction.

4.4. Hardware Structures

UH-MEM performs the calculations described in Section 4.2 in hardware. We first discuss the various hardware components required for UH-MEM to calculate the MLP ratios and page utility. Then, we summarize the total cost of the hardware.

MLP Ratio Calculation. To calculate the MLP ratios from Equation 4, we must maintain four temporary counters for every page with outstanding requests in the memory controller.
Two of the counters, MLPAcc_read and MLPAcc_write, accumulate the numerator from Equation 4, while the other two counters, MLPWeight_read and MLPWeight_write, accumulate the denominator of the equation, as follows:

$$MLPAcc_{read} = \sum_t \frac{m_{read,t}}{N_{read,t}} \qquad MLPWeight_{read} = \sum_t m_{read,t} \qquad MLPAcc_{write} = \sum_t \frac{m_{write,t}}{N_{write,t}} \qquad MLPWeight_{write} = \sum_t m_{write,t} \qquad (10)$$

For every sampling period (30 cycles in our experiments), we monitor both the outstanding read/write requests N_read and N_write for each application, as well as the outstanding requests m_read and m_write for each page, and update the corresponding counters.

When all the outstanding requests to a page have completed, the contents of the page's temporary counters are added to its corresponding counters in a statistics store (i.e., stats store), and are then reset. The stats store is a 32-way set-associative cache with an LRU replacement policy, residing in the memory controller. Each stats store entry corresponds to a page, and consists of six counters that record the number of row buffer misses, the sum of weighted MLP ratios (MLPAcc), and the sum of weights for the MLP ratios (MLPWeight), separately for read and write requests. We can use the ratio of MLPAcc to MLPWeight to calculate the average MLP ratio of the page (PageMLPRatio), respectively for read and write requests. When a page in slow memory is accessed, if it has an existing entry in the stats store, the content of its entry is updated; otherwise, an entry is allocated, which may evict the entry of the least recently used page within the set. The access latency to the stats store is not on the critical path, as we update the stats store in the background.

When a system has multiple memory controllers, the stats store and the counters used to calculate MLP ratios need to be shared by these memory controllers. Different memory controllers need to communicate with each other to maintain the information, such as the number of outstanding requests, as done in prior works [17, 36, 47, 85, 86].

Utility Calculation for Shared Pages. For pages shared by multiple applications, we can use separate entries in the stats store to record the statistical information of the page with respect to each application. We can use our previous method to calculate the page utility for each application, and then add these utility values to obtain the aggregate utility for the page. The insight is that the total system performance improvement correlates with the sum of the performance improvements of each application. Therefore, summing up the page utility for each application (i.e., its performance improvement) should reflect the system performance improvement.

Hardware Cost. Table 2 describes the main hardware costs for UH-MEM. The largest component is the stats store.
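As a software analogue of the stats store, the sketch below models a 32-way set-associative structure with LRU replacement. The set-indexing function and the dictionary entry fields are simplifications we assume for illustration; the hardware holds fixed-width counters instead.

```python
from collections import OrderedDict

class StatsStore:
    """Sketch of the 32-way set-associative, LRU-replaced stats store."""
    WAYS = 32

    def __init__(self, entries=2048):
        self.num_sets = entries // self.WAYS
        # Each set is an OrderedDict ordered from LRU (front) to MRU (back).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def lookup(self, page):
        s = self.sets[page % self.num_sets]   # assumed set-index function
        if page in s:
            s.move_to_end(page)               # refresh LRU position
            return s[page]
        if len(s) >= self.WAYS:
            s.popitem(last=False)             # evict least recently used entry
        # Six per-page counters, mirroring the entry fields in Table 2.
        s[page] = {"rb_miss_rd": 0, "rb_miss_wr": 0,
                   "mlp_acc_rd": 0.0, "mlp_acc_wr": 0.0,
                   "mlp_wt_rd": 0, "mlp_wt_wr": 0}
        return s[page]
```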
We use a 2048-entry stats store (organized as a 32-way set-associative cache), as it leads to negligible performance degradation compared with an unlimited-size stats store. The main hardware cost of UH-MEM is 42.87KB,[8] which is only approximately 2% of our baseline system's L2 cache size.

UH-MEM also requires hardware logic to calculate the MLP ratios. For each page with outstanding requests in slow memory (96 at most, limited by the read request queue size and write buffer), we need to perform 4 25-bit additions and 2 fast divisions every 30 cycles to compute the MLP ratios.[9] We achieve this by pipelining the logic, and making it 3-way superscalar. We can implement fast division using a ROM table that contains the precomputed results of the division, since both the numerator and denominator of the division are limited by the MSHR size of the last-level cache. As each quotient is 10 bits wide, the total size of such a ROM table is 1.25KB.

UH-MEM does not require any modifications to the operating system to support page migration. This is because UH-MEM does not use the virtual or physical address of a page to determine whether the page resides in fast memory or slow memory. Instead, UH-MEM uses a dedicated hardware tag store in the memory controller to determine whether the page has been migrated to the fast memory.

5. Evaluation Methodology

Similar to prior works [39, 104, 106, 126], we evaluate our proposed UH-MEM mechanism using a cycle-accurate x86 multicore simulator [2], whose front end is based on Pin [70]. We released our simulator [2, 109]. This in-house developed simulator is similar to Ramulator [1, 50], which is a widely-accepted open-source multicore simulator that models the main memory system in detail. In our simulator, page migrations between fast and slow memories are modeled as additional read requests to the memory device where the page is currently located, to read the entire page from it, followed by additional write requests in the destination memory device to write the entire page. The latency for determining whether a page resides in fast or slow memory is modeled as six cycles.
Table 3 summarizes the major parameters of the baseline system consisting of DRAM and NVM.

Footnote 8: This does not include the hardware used to determine whether a page resides in fast memory or slow memory, as this hardware is required by most hybrid memory management mechanisms [104, 106, 126], and the implementation of UH-MEM is orthogonal to the implementation of this structure.

Footnote 9: We determined all values empirically and did not optimize heavily. Reduction in hardware cost is possible with careful optimization.

Each entry below lists a structure's name, its purpose, its organization (number of bits in parentheses), and its size:

Stats store. Purpose: tracks statistical information for recently-accessed pages. Structure: 2048 entries; each entry consists of read row buffer miss count (14), write row buffer miss count (14), MLPAcc_read (30), MLPAcc_write (30), MLPWeight_read (21), MLPWeight_write (21), and page number tag (30). Size: 40.00KB.

Counters for outstanding pages in slow memory. Purpose: record updates of MLPAcc and MLPWeight for pages with outstanding requests. Structure: for each page with outstanding requests in slow memory (96 at most), MLPAcc_read (30), MLPAcc_write (30), MLPWeight_read (21), MLPWeight_write (21), and page number (36). Size: 1.62KB.

ROM table for MLP ratios. Purpose: stores precomputed results of the division used to calculate MLP ratios. Structure: 32 x 32 entries; each entry consumes 10 bits. Size: 1.25KB.

Total hardware cost (for our evaluated system in Table 3): 42.87KB.

Table 2: Main hardware cost of UH-MEM.
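The ROM-table row above can be illustrated in software. The excerpt specifies only the geometry (32 x 32 entries of 10 bits each); encoding each quotient as a 10-bit fixed-point fraction is our assumption for this sketch.

```python
def build_div_rom(max_val=32, qbits=10):
    """Sketch of a precomputed division table: rom[num][den] approximates
    num/den as a qbits-wide fixed-point value (assumed encoding)."""
    scale = (1 << qbits) - 1                  # 1023: largest 10-bit value
    rom = [[0] * max_val for _ in range(max_val)]
    for num in range(max_val):
        for den in range(1, max_val):         # den = 0 cells stay 0
            rom[num][den] = min(scale, (num * scale) // den)
    return rom
```

In hardware, the numerator and denominator (both bounded by the last-level cache MSHR size) simply index this table, so no divider circuit is needed.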

Processor: 8 cores, 2.67GHz, 3-wide issue, 128-entry instruction window.
L1 Cache: 32KB per core, 4-way, 64B cache block.
L2 Cache: 256KB per core, 8-way, 32 MSHR entries per core, 64B cache block.
Fast Memory Controller: 64-bit channel, 64-entry read request queue, 32-entry write buffer, FR-FCFS scheduling policy [108, 132].
Slow Memory Controller: 64-bit channel, 64-entry read request queue, 32-entry write buffer, FR-FCFS scheduling policy [108, 132].
Baseline Fast Memory System: 512MB DRAM, 1 rank (8 banks), tCLK = 1.875ns, tCL = 15ns, tRCD = 15ns, tRP = 15ns, tWR = 15ns, array read (write) energy = 1.17 (0.39) pJ/bit, row buffer read (write) energy = 0.93 (1.02) pJ/bit.
Baseline Slow Memory System: 16GB NVM, 1 rank (8 banks), tCLK = 1.875ns, tCL = 15ns, tRCD = 67.5ns, tRP = 15ns, tWR = 180ns, array read (write) energy = 2.47 (16.82) pJ/bit, row buffer read (write) energy = 0.93 (1.02) pJ/bit.

Table 3: Baseline system parameters.

The detailed DRAM and NVM timing and energy parameters are based on prior studies [53, 54, 78, 79, 81]. We calculate the static power of the hybrid memory system to be 5.6W [53]. In order to evaluate different types of hybrid memory systems, such as DRAM-RLDRAM and DRAM-NVM memories, we vary the size of the fast memory and the read/write latency ratios of slow memory to fast memory. We also measure the performance of our evaluated page placement mechanisms under these different configurations.

5.1. Workloads

We use 30 benchmarks chosen from SPEC CPU2006 [35] and the Yahoo Cloud Serving Benchmark (YCSB) suite [16]. We classify them as memory-intensive or non-memory-intensive based on their last-level cache misses per 1K instructions (MPKI) when running alone. Each experiment runs an eight-application workload on the system, with one application running on each core. The memory intensity category of the workload is determined by the percentage of memory-intensive benchmarks within the workload. For example, a workload has 75% intensity if it consists of six memory-intensive benchmarks and two non-memory-intensive benchmarks.
We generate 40 workloads, eight for each category of workload memory intensity (0%, 25%, 50%, 75%, 100%). In each experiment, every benchmark was warmed up for 500 million instructions, and then executed for another 500 million instructions. A benchmark in a multiprogrammed workload is restarted after it completes, until all the benchmarks in the workload complete once.

5.2. Metrics

We use weighted speedup (WSpeedup) [26, 112] and maximum slowdown (MaxSlowdown) [6, 17, 18, 43, 44, 47, 48, 86, 116, 117, 119, 121, 123] to evaluate system performance and unfairness, respectively, using the equations shown below. N is the number of cores; IPC_alone,i and IPC_shared,i are the instructions completed per cycle (IPC) when Application i is running alone and running with other applications, respectively. Weighted speedup (see Section 4.2) first weighs the performance of each application (when it is running with others; IPC_shared,i) by the reciprocal of its performance while running alone (IPC_alone,i), reflecting the speedup of the application. Then, weighted speedup sums up the speedups of all the applications, reflecting the overall system performance. Weighted speedup is a widely-used multiprogrammed system performance metric in computer architecture evaluation [26]. It quantifies system throughput [26]. For unfairness, we use maximum slowdown to quantify the worst-case slowdown of any application in a multiprogrammed workload. Both weighted speedup and maximum slowdown use normalized IPC ratios, instead of the IPC itself, to avoid biasing either metric in favor of high-IPC or low-IPC applications.

$$WSpeedup = \sum_{i=0}^{N-1} \frac{IPC_{shared,i}}{IPC_{alone,i}} \qquad MaxSlowdown = \max_i \left( \frac{IPC_{alone,i}}{IPC_{shared,i}} \right)$$

6. Experimental Results

We evaluate our proposed UH-MEM mechanism across a wide variety of system configurations, covering several fast memory sizes and latency ratios of slow memory to fast memory. Throughout our evaluation, we compare UH-MEM to three other state-of-the-art mechanisms:

ALL: a conventional cache insertion mechanism.
This mechanism treats fast memory as a cache to slow memory, and inserts all the pages accessed in slow memory into fast memory using the LRU replacement policy. This is similar to the proposal by Qureshi et al. [104].

FREQ: an access-frequency-based mechanism. This mechanism migrates pages with high access frequency to fast memory. It is similar to two proposals that try to improve the temporal locality in fast memory and reduce the number of accesses to slow memory [39, 106].

RBLA: a row-buffer-locality-based mechanism [126]. This mechanism migrates pages that have experienced a large number of row buffer misses in slow memory to fast memory. The intuition is that only the latency of row buffer miss requests can be reduced when the page is migrated to fast memory.

6.1. Results on the Baseline System Configuration

Figure 4 shows the normalized weighted speedup of the four evaluated mechanisms on the baseline system configuration, averaged for each workload intensity category. UH-MEM outperforms the best previous proposal, RBLA, in all workload categories with non-zero memory intensity. For the most memory-intensive category, UH-MEM provides a 14%

[Figures 4 through 9 appear here, each comparing ALL, FREQ, RBLA, and UH-MEM. Figure 4: Normalized weighted speedup for the baseline configuration. Figure 5: Average application stall time for the baseline configuration. Figure 6: Normalized unfairness for the baseline configuration. Figure 7: Memory energy consumption for the baseline configuration. Figure 8: Weighted speedup for various fast memory sizes. Figure 9: Weighted speedup for various slow-to-fast memory latency ratios for tRCD and tWR.]

average performance improvement over RBLA. The maximum performance gain of UH-MEM over RBLA for a single workload is 26%. UH-MEM's performance advantage is twofold. First, UH-MEM not only considers the latency of each individual request (as FREQ and RBLA do), but also takes into account the memory-level parallelism between requests to estimate each request's individual contribution to the application's overall stall time. Therefore, UH-MEM can reduce stall time more effectively than those prior proposals, by selecting and caching those pages that are more likely to stall the processor. This is demonstrated by Figure 5, which shows that each application within a workload stalls for less time with UH-MEM than with RBLA. Second, UH-MEM is aware of which applications impact the system performance the most, as it estimates system performance sensitivity to different applications, and prioritizes page migrations from those applications that are likely to benefit system performance the most.
Figure 6 shows the normalized unfairness of the four evaluated mechanisms on the baseline system configuration. We can see that UH-MEM achieves equivalent or improved fairness compared to all prior proposals.

We also study the energy efficiency of the four mechanisms on the baseline system configuration. Figure 7 shows the memory energy consumption of the four mechanisms on workloads with varying memory intensities. We observe that energy consumption grows with the memory intensity of the workload. Compared to prior mechanisms, UH-MEM consumes similar energy for non-memory-intensive workloads, and uses less energy for memory-intensive workloads. For the memory-intensive workloads, UH-MEM reduces static energy consumption as a result of its shorter execution time. UH-MEM also reduces the dynamic energy consumed due to page migrations, as it selectively migrates the important pages to DRAM instead of migrating less important pages as the baseline mechanisms do. We conclude that UH-MEM improves performance and lowers energy consumption compared to three state-of-the-art hybrid memory management mechanisms, because it can effectively gauge the system performance benefit of each page migration.

6.2. Sensitivity to Fast Memory Size

The fast memory size determines the room for performance optimization in hybrid memory systems. A larger fast memory can allow more pages to migrate from slow memory, thereby likely offering greater system performance. However, the fast memory size, in practice, cannot be too large, and can therefore limit the scalability of hybrid memory systems. In this section, we evaluate how each mechanism performs across a range of fast memory sizes (256MB, 512MB, 1GB, and 2GB). Figure 8 shows the weighted speedup of workloads with 100% memory intensity under various fast memory sizes. We observe that system performance increases with fast memory size. Under the four evaluated sizes, UH-MEM outperforms RBLA by 14%, 14%, 12%, and 12%, respectively.
Even for a 256MB fast memory, which offers less opportunity for optimization, UH-MEM achieves a weighted speedup of 3.30, which is larger than RBLA's weighted speedup of 3.04 for a 2GB fast memory. In other words, UH-MEM can exceed RBLA's performance even with only an eighth of the fast memory capacity. This implies that, by estimating the system performance benefit of each page and selectively placing only critical pages in fast memory, UH-MEM can greatly shrink the fast memory size (while achieving higher performance), and thereby improve hybrid memory scalability.


More information

An Efficient Delivery Scheme for Coded Caching

An Efficient Delivery Scheme for Coded Caching 201 27h Inernaional Teleraffic Congress An Efficien Delivery Scheme for Coded Caching Abinesh Ramakrishnan, Cedric Wesphal and Ahina Markopoulou Deparmen of Elecrical Engineering and Compuer Science, Universiy

More information

Opportunistic Flooding in Low-Duty-Cycle Wireless Sensor Networks with Unreliable Links

Opportunistic Flooding in Low-Duty-Cycle Wireless Sensor Networks with Unreliable Links 1 in Low-uy-ycle Wireless Sensor Neworks wih Unreliable Links Shuo uo, Suden Member, IEEE, Liang He, Member, IEEE, Yu u, Member, IEEE, o Jiang, Suden Member, IEEE, and Tian He, Member, IEEE bsrac looding

More information

Y. Tsiatouhas. VLSI Systems and Computer Architecture Lab

Y. Tsiatouhas. VLSI Systems and Computer Architecture Lab CMOS INEGRAED CIRCUI DESIGN ECHNIQUES Universiy of Ioannina Clocking Schemes Dep. of Compuer Science and Engineering Y. siaouhas CMOS Inegraed Circui Design echniques Overview 1. Jier Skew hroughpu Laency

More information

Sam knows that his MP3 player has 40% of its battery life left and that the battery charges by an additional 12 percentage points every 15 minutes.

Sam knows that his MP3 player has 40% of its battery life left and that the battery charges by an additional 12 percentage points every 15 minutes. 8.F Baery Charging Task Sam wans o ake his MP3 player and his video game player on a car rip. An hour before hey plan o leave, he realized ha he forgo o charge he baeries las nigh. A ha poin, he plugged

More information

FIELD PROGRAMMABLE GATE ARRAY (FPGA) AS A NEW APPROACH TO IMPLEMENT THE CHAOTIC GENERATORS

FIELD PROGRAMMABLE GATE ARRAY (FPGA) AS A NEW APPROACH TO IMPLEMENT THE CHAOTIC GENERATORS FIELD PROGRAMMABLE GATE ARRAY (FPGA) AS A NEW APPROACH TO IMPLEMENT THE CHAOTIC GENERATORS Mohammed A. Aseeri and M. I. Sobhy Deparmen of Elecronics, The Universiy of Ken a Canerbury Canerbury, Ken, CT2

More information

IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. XX, NO. XX, XX XXXX 1

IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. XX, NO. XX, XX XXXX 1 This is he auhor's version of an aricle ha has been published in his journal. Changes were made o his version by he publisher prior o publicaion. IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. XX,

More information

Adaptive Workflow Scheduling on Cloud Computing Platforms with Iterative Ordinal Optimization

Adaptive Workflow Scheduling on Cloud Computing Platforms with Iterative Ordinal Optimization Adapive Workflow Scheduling on Cloud Compuing Plaforms wih Ieraive Ordinal Opimizaion Fan Zhang, Senior Member, IEEE; Junwei Cao, Senior Member, IEEE; Kai Hwang, Fellow, IEEE; Keqin Li, Senior Member,

More information

Optimal Crane Scheduling

Optimal Crane Scheduling Opimal Crane Scheduling Samid Hoda, John Hooker Laife Genc Kaya, Ben Peerson Carnegie Mellon Universiy Iiro Harjunkoski ABB Corporae Research EWO - 13 November 2007 1/16 Problem Track-mouned cranes move

More information

Michiel Helder and Marielle C.T.A Geurts. Hoofdkantoor PTT Post / Dutch Postal Services Headquarters

Michiel Helder and Marielle C.T.A Geurts. Hoofdkantoor PTT Post / Dutch Postal Services Headquarters SHORT TERM PREDICTIONS A MONITORING SYSTEM by Michiel Helder and Marielle C.T.A Geurs Hoofdkanoor PTT Pos / Duch Posal Services Headquarers Keywords macro ime series shor erm predicions ARIMA-models faciliy

More information

Outline. EECS Components and Design Techniques for Digital Systems. Lec 06 Using FSMs Review: Typical Controller: state

Outline. EECS Components and Design Techniques for Digital Systems. Lec 06 Using FSMs Review: Typical Controller: state Ouline EECS 5 - Componens and Design Techniques for Digial Sysems Lec 6 Using FSMs 9-3-7 Review FSMs Mapping o FPGAs Typical uses of FSMs Synchronous Seq. Circuis safe composiion Timing FSMs in verilog

More information

Dimmer time switch AlphaLux³ D / 27

Dimmer time switch AlphaLux³ D / 27 Dimmer ime swich AlphaLux³ D2 426 26 / 27! Safey noes This produc should be insalled in line wih insallaion rules, preferably by a qualified elecrician. Incorrec insallaion and use can lead o risk of elecric

More information

CENG 477 Introduction to Computer Graphics. Modeling Transformations

CENG 477 Introduction to Computer Graphics. Modeling Transformations CENG 477 Inroducion o Compuer Graphics Modeling Transformaions Modeling Transformaions Model coordinaes o World coordinaes: Model coordinaes: All shapes wih heir local coordinaes and sies. world World

More information

Quick Verification of Concurrent Programs by Iteratively Relaxed Scheduling

Quick Verification of Concurrent Programs by Iteratively Relaxed Scheduling Quick Verificaion of Concurren Programs by Ieraively Relaxed Scheduling Parick Mezler, Habib Saissi, Péer Bokor, Neeraj Suri Technische Univerisä Darmsad, Germany {mezler, saissi, pbokor, suri}@deeds.informaik.u-darmsad.de

More information

Visual Indoor Localization with a Floor-Plan Map

Visual Indoor Localization with a Floor-Plan Map Visual Indoor Localizaion wih a Floor-Plan Map Hang Chu Dep. of ECE Cornell Universiy Ihaca, NY 14850 hc772@cornell.edu Absrac In his repor, a indoor localizaion mehod is presened. The mehod akes firsperson

More information

MATH Differential Equations September 15, 2008 Project 1, Fall 2008 Due: September 24, 2008

MATH Differential Equations September 15, 2008 Project 1, Fall 2008 Due: September 24, 2008 MATH 5 - Differenial Equaions Sepember 15, 8 Projec 1, Fall 8 Due: Sepember 4, 8 Lab 1.3 - Logisics Populaion Models wih Harvesing For his projec we consider lab 1.3 of Differenial Equaions pages 146 o

More information

An efficient approach to improve throughput for TCP vegas in ad hoc network

An efficient approach to improve throughput for TCP vegas in ad hoc network Inernaional Research Journal of Engineering and Technology (IRJET) e-issn: 395-0056 Volume: 0 Issue: 03 June-05 www.irje.ne p-issn: 395-007 An efficien approach o improve hroughpu for TCP vegas in ad hoc

More information

A Tool for Multi-Hour ATM Network Design considering Mixed Peer-to-Peer and Client-Server based Services

A Tool for Multi-Hour ATM Network Design considering Mixed Peer-to-Peer and Client-Server based Services A Tool for Muli-Hour ATM Nework Design considering Mied Peer-o-Peer and Clien-Server based Services Conac Auhor Name: Luis Cardoso Company / Organizaion: Porugal Telecom Inovação Complee Mailing Address:

More information

Motor Control. 5. Control. Motor Control. Motor Control

Motor Control. 5. Control. Motor Control. Motor Control 5. Conrol In his chaper we will do: Feedback Conrol On/Off Conroller PID Conroller Moor Conrol Why use conrol a all? Correc or wrong? Supplying a cerain volage / pulsewidh will make he moor spin a a cerain

More information

USBFC (USB Function Controller)

USBFC (USB Function Controller) USBFC () EIFUFAL501 User s Manual Doc #: 88-02-E01 Revision: 2.0 Dae: 03/24/98 (USBFC) 1. Highlighs... 4 1.1 Feaures... 4 1.2 Overview... 4 1.3 USBFC Block Diagram... 5 1.4 USBFC Typical Sysem Block Diagram...

More information

Design Alternatives for a Thin Lens Spatial Integrator Array

Design Alternatives for a Thin Lens Spatial Integrator Array Egyp. J. Solids, Vol. (7), No. (), (004) 75 Design Alernaives for a Thin Lens Spaial Inegraor Array Hala Kamal *, Daniel V azquez and Javier Alda and E. Bernabeu Opics Deparmen. Universiy Compluense of

More information

The Difference-bit Cache*

The Difference-bit Cache* The Difference-bi Cache* Toni Juan, Tomas Lang~ and Juan J. Navarro Deparmen of Compuer Archiecure Deparmen of Elecrical and Universia Poli&cnica de Caalunya Compuer Engineering Gran CapiiJ s/n, Modul

More information

4.1 3D GEOMETRIC TRANSFORMATIONS

4.1 3D GEOMETRIC TRANSFORMATIONS MODULE IV MCA - 3 COMPUTER GRAPHICS ADMN 29- Dep. of Compuer Science And Applicaions, SJCET, Palai 94 4. 3D GEOMETRIC TRANSFORMATIONS Mehods for geomeric ransformaions and objec modeling in hree dimensions

More information

A Progressive-ILP Based Routing Algorithm for Cross-Referencing Biochips

A Progressive-ILP Based Routing Algorithm for Cross-Referencing Biochips 16.3 A Progressive-ILP Based Rouing Algorihm for Cross-Referencing Biochips Ping-Hung Yuh 1, Sachin Sapanekar 2, Chia-Lin Yang 1, Yao-Wen Chang 3 1 Deparmen of Compuer Science and Informaion Engineering,

More information

A GRAPHICS PROCESSING UNIT IMPLEMENTATION OF THE PARTICLE FILTER

A GRAPHICS PROCESSING UNIT IMPLEMENTATION OF THE PARTICLE FILTER A GRAPHICS PROCESSING UNIT IMPLEMENTATION OF THE PARTICLE FILTER Gusaf Hendeby, Jeroen D. Hol, Rickard Karlsson, Fredrik Gusafsson Deparmen of Elecrical Engineering Auomaic Conrol Linköping Universiy,

More information

A Numerical Study on Impact Damage Assessment of PC Box Girder Bridge by Pounding Effect

A Numerical Study on Impact Damage Assessment of PC Box Girder Bridge by Pounding Effect A Numerical Sudy on Impac Damage Assessmen of PC Box Girder Bridge by Pounding Effec H. Tamai, Y. Sonoda, K. Goou and Y.Kajia Kyushu Universiy, Japan Absrac When a large earhquake occurs, displacemen response

More information

MOBILE COMPUTING 3/18/18. Wi-Fi IEEE. CSE 40814/60814 Spring 2018

MOBILE COMPUTING 3/18/18. Wi-Fi IEEE. CSE 40814/60814 Spring 2018 MOBILE COMPUTING CSE 40814/60814 Spring 2018 Wi-Fi Wi-Fi: name is NOT an abbreviaion play on Hi-Fi (high fideliy) Wireless Local Area Nework (WLAN) echnology WLAN and Wi-Fi ofen used synonymous Typically

More information

MOBILE COMPUTING. Wi-Fi 9/20/15. CSE 40814/60814 Fall Wi-Fi:

MOBILE COMPUTING. Wi-Fi 9/20/15. CSE 40814/60814 Fall Wi-Fi: MOBILE COMPUTING CSE 40814/60814 Fall 2015 Wi-Fi Wi-Fi: name is NOT an abbreviaion play on Hi-Fi (high fideliy) Wireless Local Area Nework (WLAN) echnology WLAN and Wi-Fi ofen used synonymous Typically

More information

Why not experiment with the system itself? Ways to study a system System. Application areas. Different kinds of systems

Why not experiment with the system itself? Ways to study a system System. Application areas. Different kinds of systems Simulaion Wha is simulaion? Simple synonym: imiaion We are ineresed in sudying a Insead of experimening wih he iself we experimen wih a model of he Experimen wih he Acual Ways o sudy a Sysem Experimen

More information

IntentSearch:Capturing User Intention for One-Click Internet Image Search

IntentSearch:Capturing User Intention for One-Click Internet Image Search JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO. 1, JANUARY 2010 1 InenSearch:Capuring User Inenion for One-Click Inerne Image Search Xiaoou Tang, Fellow, IEEE, Ke Liu, Jingyu Cui, Suden Member, IEEE, Fang

More information

Who thinks who knows who? Socio-Cognitive Analysis of an Network

Who thinks who knows who? Socio-Cognitive Analysis of an  Network Who hinks who knows who? Socio-Cogniive Analysis of an Email Nework Nishih Pahak Deparmen of Compuer Science Universiy of Minnesoa Minneapolis, MN, USA npahak@cs.umn.edu Sandeep Mane Deparmen of Compuer

More information

Announcements. TCP Congestion Control. Goals of Today s Lecture. State Diagrams. TCP State Diagram

Announcements. TCP Congestion Control. Goals of Today s Lecture. State Diagrams. TCP State Diagram nnouncemens TCP Congesion Conrol Projec #3 should be ou onigh Can do individual or in a eam of 2 people Firs phase due November 16 - no slip days Exercise good (beer) ime managemen EE 122: Inro o Communicaion

More information

Improved TLD Algorithm for Face Tracking

Improved TLD Algorithm for Face Tracking Absrac Improved TLD Algorihm for Face Tracking Huimin Li a, Chaojing Yu b and Jing Chen c Chongqing Universiy of Poss and Telecommunicaions, Chongqing 400065, China a li.huimin666@163.com, b 15023299065@163.com,

More information

Packet Scheduling in a Low-Latency Optical Interconnect with Electronic Buffers

Packet Scheduling in a Low-Latency Optical Interconnect with Electronic Buffers Packe cheduling in a Low-Laency Opical Inerconnec wih Elecronic Buffers Lin Liu Zhenghao Zhang Yuanyuan Yang Dep Elecrical & Compuer Engineering Compuer cience Deparmen Dep Elecrical & Compuer Engineering

More information

Protecting User Privacy in a Multi-Path Information-Centric Network Using Multiple Random-Caches

Protecting User Privacy in a Multi-Path Information-Centric Network Using Multiple Random-Caches Chu WB, Wang LF, Jiang ZJ e al. Proecing user privacy in a muli-pah informaion-cenric nework using muliple random-caches. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 32(3): 585 598 May 27. DOI.7/s39-7-73-2

More information

Reinforcement Learning by Policy Improvement. Making Use of Experiences of The Other Tasks. Hajime Kimura and Shigenobu Kobayashi

Reinforcement Learning by Policy Improvement. Making Use of Experiences of The Other Tasks. Hajime Kimura and Shigenobu Kobayashi Reinforcemen Learning by Policy Improvemen Making Use of Experiences of The Oher Tasks Hajime Kimura and Shigenobu Kobayashi Tokyo Insiue of Technology, JAPAN genfe.dis.iech.ac.jp, kobayasidis.iech.ac.jp

More information

SEINA: A Stealthy and Effective Internal Attack in Hadoop Systems

SEINA: A Stealthy and Effective Internal Attack in Hadoop Systems SEINA: A Sealhy and Effecive Inernal Aack in Hadoop Sysems Jiayin Wang, Teng Wang, Zhengyu Yang, Ying ao, Ningfang i, and Bo Sheng Deparmen of Compuer Science, Universiy of assachuses Boson, 1 orrissey

More information

Low-Cost WLAN based. Dr. Christian Hoene. Computer Science Department, University of Tübingen, Germany

Low-Cost WLAN based. Dr. Christian Hoene. Computer Science Department, University of Tübingen, Germany Low-Cos WLAN based Time-of-fligh fligh Trilaeraion Precision Indoor Personnel Locaion and Tracking for Emergency Responders Third Annual Technology Workshop, Augus 5, 2008 Worceser Polyechnic Insiue, Worceser,

More information

Rule-Based Multi-Query Optimization

Rule-Based Multi-Query Optimization Rule-Based Muli-Query Opimizaion Mingsheng Hong Dep. of Compuer cience Cornell Universiy mshong@cs.cornell.edu Johannes Gehrke Dep. of Compuer cience Cornell Universiy johannes@cs.cornell.edu Mirek Riedewald

More information

Who Thinks Who Knows Who? Socio-cognitive Analysis of Networks. Technical Report

Who Thinks Who Knows Who? Socio-cognitive Analysis of  Networks. Technical Report Who Thinks Who Knows Who? Socio-cogniive Analysis of Email Neworks Technical Repor Deparmen of Compuer Science and Engineering Universiy of Minnesoa 4-192 EECS Building 200 Union Sree SE Minneapolis, MN

More information

M(t)/M/1 Queueing System with Sinusoidal Arrival Rate

M(t)/M/1 Queueing System with Sinusoidal Arrival Rate 20 TUTA/IOE/PCU Journal of he Insiue of Engineering, 205, (): 20-27 TUTA/IOE/PCU Prined in Nepal M()/M/ Queueing Sysem wih Sinusoidal Arrival Rae A.P. Pan, R.P. Ghimire 2 Deparmen of Mahemaics, Tri-Chandra

More information

Time Expression Recognition Using a Constituent-based Tagging Scheme

Time Expression Recognition Using a Constituent-based Tagging Scheme Track: Web Conen Analysis, Semanics and Knowledge Time Expression Recogniion Using a Consiuen-based Tagging Scheme Xiaoshi Zhong and Erik Cambria School of Compuer Science and Engineering Nanyang Technological

More information

A Web Browsing Traffic Model for Simulation: Measurement and Analysis

A Web Browsing Traffic Model for Simulation: Measurement and Analysis A Web Browsing Traffic Model for Simulaion: Measuremen and Analysis Lourens O. Walers Daa Neworks Archiecure Group Universiy of Cape Town Privae Bag, Rondebosch, 7701 Tel: (021) 650 2663, Fax: (021) 689

More information

Chapter 3 MEDIA ACCESS CONTROL

Chapter 3 MEDIA ACCESS CONTROL Chaper 3 MEDIA ACCESS CONTROL Overview Moivaion SDMA, FDMA, TDMA Aloha Adapive Aloha Backoff proocols Reservaion schemes Polling Disribued Compuing Group Mobile Compuing Summer 2003 Disribued Compuing

More information

Delay in Packet Switched Networks

Delay in Packet Switched Networks 1 Delay in Packe Swiched Neworks Required reading: Kurose 1.5 and 1.6 CSE 4213, Fall 2006 Insrucor: N. Vlajic Delay in Packe-Swiched Neworks 2 Link/Nework Performance Measures: hroughpu and delay Link

More information

1. Function 1. Push-button interface 4g.plus. Push-button interface 4-gang plus. 2. Installation. Table of Contents

1. Function 1. Push-button interface 4g.plus. Push-button interface 4-gang plus. 2. Installation. Table of Contents Chaper 4: Binary inpus 4.6 Push-buon inerfaces Push-buon inerface Ar. no. 6708xx Push-buon inerface 2-gang plus Push-buon inerfacechaper 4:Binary inpusar. no.6708xxversion 08/054.6Push-buon inerfaces.

More information

EVALUATING ACCURACY OF A TIME ESTIMATOR IN A PROJECT

EVALUATING ACCURACY OF A TIME ESTIMATOR IN A PROJECT EVALUATING ACCURACY OF A TIME ESTIMATOR IN A PROJECT Thanh-Lam Nguyen, Graduae Insiue of Mechanical and Precision Engineering Wei-Ju Hung, Deparmen of Indusrial Engineering and Managemen Ming-Hung Shu,

More information

Nonparametric CUSUM Charts for Process Variability

Nonparametric CUSUM Charts for Process Variability Journal of Academia and Indusrial Research (JAIR) Volume 3, Issue June 4 53 REEARCH ARTICLE IN: 78-53 Nonparameric CUUM Chars for Process Variabiliy D.M. Zombade and V.B. Ghue * Dep. of aisics, Walchand

More information

Video Content Description Using Fuzzy Spatio-Temporal Relations

Video Content Description Using Fuzzy Spatio-Temporal Relations Proceedings of he 4s Hawaii Inernaional Conference on Sysem Sciences - 008 Video Conen Descripion Using Fuzzy Spaio-Temporal Relaions rchana M. Rajurkar *, R.C. Joshi and Sananu Chaudhary 3 Dep of Compuer

More information

Adaptive VM Management with Two Phase Power Consumption Cost Models in Cloud Datacenter

Adaptive VM Management with Two Phase Power Consumption Cost Models in Cloud Datacenter Mobile New Appl (2016) 21:793 805 DOI 10.1007/s11036-016-0690-z Adapive VM Managemen wih Two Phase Power Consumpion Cos Models in Cloud Daacener Dong-Ki Kang 1 & Fawaz Al-Hazemi 1 & Seong-Hwan Kim 1 &

More information

STRING DESCRIPTIONS OF DATA FOR DISPLAY*

STRING DESCRIPTIONS OF DATA FOR DISPLAY* SLAC-PUB-383 January 1968 STRING DESCRIPTIONS OF DATA FOR DISPLAY* J. E. George and W. F. Miller Compuer Science Deparmen and Sanford Linear Acceleraor Cener Sanford Universiy Sanford, California Absrac

More information

Video streaming over Vajda Tamás

Video streaming over Vajda Tamás Video sreaming over 802.11 Vajda Tamás Video No all bis are creaed equal Group of Picures (GoP) Video Sequence Slice Macroblock Picure (Frame) Inra (I) frames, Prediced (P) Frames or Bidirecional (B) Frames.

More information

Improving Explicit Congestion Notification with the Mark-Front Strategy

Improving Explicit Congestion Notification with the Mark-Front Strategy Improving Explici Congesion Noificaion wih he Mark-Fron Sraegy Chunlei Liu Raj Jain Deparmen of Compuer and Informaion Science Chief Technology Officer, Nayna Neworks, Inc. The Ohio Sae Universiy, Columbus,

More information

NRMI: Natural and Efficient Middleware

NRMI: Natural and Efficient Middleware NRMI: Naural and Efficien Middleware Eli Tilevich and Yannis Smaragdakis Cener for Experimenal Research in Compuer Sysems (CERCS), College of Compuing, Georgia Tech {ilevich, yannis}@cc.gaech.edu Absrac

More information

NEWTON S SECOND LAW OF MOTION

NEWTON S SECOND LAW OF MOTION Course and Secion Dae Names NEWTON S SECOND LAW OF MOTION The acceleraion of an objec is defined as he rae of change of elociy. If he elociy changes by an amoun in a ime, hen he aerage acceleraion during

More information

An HTTP Web Traffic Model Based on the Top One Million Visited Web Pages

An HTTP Web Traffic Model Based on the Top One Million Visited Web Pages An HTTP Web Traffic Model Based on he Top One Million Visied Web Pages Rasin Pries, Zsol Magyari, Phuoc Tran-Gia Universiy of Würzburg, Insiue of Compuer Science, Germany Email: {pries,rangia}@informaik.uni-wuerzburg.de

More information

MoBAN: A Configurable Mobility Model for Wireless Body Area Networks

MoBAN: A Configurable Mobility Model for Wireless Body Area Networks MoBAN: A Configurable Mobiliy Model for Wireless Body Area Neworks Majid Nabi 1, Marc Geilen 1, Twan Basen 1,2 1 Deparmen of Elecrical Engineering, Eindhoven Universiy of Technology, he Neherlands 2 Embedded

More information

Autonomic Cognitive-based Data Dissemination in Opportunistic Networks

Autonomic Cognitive-based Data Dissemination in Opportunistic Networks Auonomic Cogniive-based Daa Disseminaion in Opporunisic Neworks Lorenzo Valerio, Marco Coni, Elena Pagani and Andrea Passarella IIT-CNR, Pisa, Ialy Email: {marco.coni,andrea.passarella,lorenzo.valerio}@ii.cnr.i

More information

In fmri a Dual Echo Time EPI Pulse Sequence Can Induce Sources of Error in Dynamic Magnetic Field Maps

In fmri a Dual Echo Time EPI Pulse Sequence Can Induce Sources of Error in Dynamic Magnetic Field Maps In fmri a Dual Echo Time EPI Pulse Sequence Can Induce Sources of Error in Dynamic Magneic Field Maps A. D. Hahn 1, A. S. Nencka 1 and D. B. Rowe 2,1 1 Medical College of Wisconsin, Milwaukee, WI, Unied

More information

CAMERA CALIBRATION BY REGISTRATION STEREO RECONSTRUCTION TO 3D MODEL

CAMERA CALIBRATION BY REGISTRATION STEREO RECONSTRUCTION TO 3D MODEL CAMERA CALIBRATION BY REGISTRATION STEREO RECONSTRUCTION TO 3D MODEL Klečka Jan Docoral Degree Programme (1), FEEC BUT E-mail: xkleck01@sud.feec.vubr.cz Supervised by: Horák Karel E-mail: horak@feec.vubr.cz

More information

Performance and Availability Assessment for the Configuration of Distributed Workflow Management Systems

Performance and Availability Assessment for the Configuration of Distributed Workflow Management Systems Absrac Performance and Availabiliy Assessmen for he Configuraion of Disribued Workflow Managemen Sysems Michael Gillmann 1, Jeanine Weissenfels 1, Gerhard Weikum 1, Achim Kraiss 2 1 Universiy of he Saarland,

More information