FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems


Jishen Zhao, Onur Mutlu, Yuan Xie
Pennsylvania State University, Carnegie Mellon University, University of California, Santa Barbara, Hewlett-Packard Labs

Abstract

Byte-addressable nonvolatile memories promise a new technology, persistent memory, which incorporates desirable attributes from both traditional main memory (byte-addressability and fast interface) and traditional storage (data persistence). To support data persistence, a persistent memory system requires sophisticated data duplication and ordering control for write requests. As a result, applications that manipulate persistent memory (persistent applications) have very different memory access characteristics than traditional (non-persistent) applications, as shown in this paper. Persistent applications introduce heavy write traffic to contiguous memory regions at a memory channel, which cannot concurrently service read and write requests, leading to memory bandwidth underutilization due to low bank-level parallelism, frequent write queue drains, and frequent bus turnarounds between reads and writes. These characteristics undermine the high performance and fairness offered by conventional memory scheduling schemes designed for non-persistent applications. Our goal in this paper is to design a fair and high-performance memory control scheme for a persistent memory based system that runs both persistent and non-persistent applications. Our proposal, FIRM, consists of three key ideas. First, FIRM categorizes request sources as non-intensive, streaming, random and persistent, and forms batches of requests for each source. Second, FIRM strides persistent memory updates across multiple banks, thereby improving bank-level parallelism and hence memory bandwidth utilization of persistent memory accesses. Third, FIRM schedules read and write request batches from different sources in a manner that minimizes bus turnarounds and write queue drains.
Our detailed evaluations show that, compared to five previous memory scheduler designs, FIRM provides significantly higher system performance and fairness.

Index Terms: memory scheduling; persistent memory; fairness; memory interference; nonvolatile memory; data persistence

1. INTRODUCTION

For decades, computer systems have adopted a two-level storage model consisting of: 1) a fast, byte-addressable main memory that temporarily stores applications' working sets, which is lost on a system halt/reboot/crash, and 2) a slow, block-addressable storage device that permanently stores persistent data, which can survive across system boots/crashes. Recently, this traditional storage model has been enriched by the new persistent memory technology, a new tier between traditional main memory and storage with attributes from both [2, 9, 47, 54, 59]. Persistent memory allows applications to perform loads and stores to manipulate persistent data, as if they are accessing traditional main memory. Yet, persistent memory is the permanent home of persistent data, which is protected by versioning (e.g., logging and shadow updates) [17, 54, 88, 90] and write-order control [17, 54, 66], borrowed from databases and file systems to provide consistency of data, as if data is stored in traditional storage devices (i.e., hard disks or flash memory). By enabling data persistence in main memory, applications can directly access persistent data through a fast memory interface without paging data blocks in and out of slow storage devices or performing context switches for page faults. As such, persistent memory can dramatically boost the performance of applications that demand high reliability, such as databases and file systems, and enable the design of more robust systems at high performance. As a result, persistent memory has recently drawn significant interest from both academia and industry [1, 2, 16, 17, 37, 54, 60, 66, 70, 71, 88, 90]. Recent works [54, 92] even demonstrated a persistent memory system with performance close to that of a system without persistence support in memory.
Various types of physical devices can be used to build persistent memory, as long as they appear byte-addressable and nonvolatile to applications. Examples of such byte-addressable nonvolatile memories (BA-NVMs) include spin-transfer torque RAM (STT-MRAM) [31, 93], phase-change memory (PCM) [75, 81], resistive random-access memory (ReRAM) [14, 21], battery-backed DRAM [13, 18, 28], and nonvolatile dual in-line memory modules (NV-DIMMs) [89].¹

As it is in its early stages of development, persistent memory especially serves applications that can benefit from reducing storage (or, persistent data) access latency with relatively few or lightweight changes to application programs, system software, and hardware [10]. Such applications include databases [90], file systems [1, 17], key-value stores [16], and persistent file caches [8, 10]. Other types of applications may not directly benefit from persistent memory, but can still use BA-NVMs as their working memory (nonvolatile main memory without persistence) to leverage the benefits of large capacity and low stand-by power [45, 73]. For example, a large number of recent works aim to fit BA-NVMs as part of main memory in the traditional two-level storage model [22, 23, 33, 45, 46, 57, 72, 73, 74, 91, 94]. Several very recent works [38, 52, 59] envision that BA-NVMs can be simultaneously used as persistent memory and working memory. In this paper, we call applications that leverage BA-NVMs to manipulate persistent data persistent applications, and those that use BA-NVMs solely as working memory non-persistent applications.²

Most prior work focused on designing memory systems to accommodate either type of application, persistent or non-persistent. Strikingly little attention has been paid to studying the cases when these two types of applications concurrently run in a system. Persistent applications require the memory system to support crash consistency, or the persistence property, typically supported in traditional storage systems.
1 STT-MRAM, PCM, and ReRAM are collectively called nonvolatile random-access memories (NVRAMs) or storage-class memories (SCMs) in recent studies [1, 52, 90].
2 A system with BA-NVMs may also employ volatile DRAM, controlled by a separate memory controller [22, 57, 73, 74, 91]. As we show in this paper, significant resource contention exists at the BA-NVM memory interface of persistent memory systems between persistent and non-persistent applications. We do not focus on the DRAM interface.

This property guarantees that the system's data will be in a consistent state after a system or application crash, by ensuring that persistent memory updates are done carefully such that incomplete updates are recoverable. Doing so requires data duplication and careful control over the ordering of writes arriving at memory (Section 2.2). The sophisticated designs to support persistence lead to new memory access characteristics for persistent applications. In particular, we find that these applications have very high write intensity and very low memory bank parallelism due to frequent streaming writes to persistent data in memory (Section 3.1). These characteristics lead to substantial resource contention between reads and writes at the
shared memory interface for a system that concurrently runs persistent and non-persistent applications, unfairly slowing down either or both types of applications. Previous memory scheduling schemes, designed solely for non-persistent applications, become inefficient and low-performance under this new scenario (Section 3.2). We find that this is because the heavy write intensity and low bank parallelism of persistent applications lead to three key problems not handled well by past schemes: 1) frequent write queue drains in the memory controller, 2) frequent bus turnarounds between reads and writes, both of which lead to wasted cycles on the memory bus, and 3) low memory bandwidth utilization during writes to memory due to low memory bank parallelism, which leads to long periods during which memory reads are delayed (Section 3).

Our goal is to design a memory control scheme that achieves both fair memory access and high system throughput in a system concurrently running persistent and non-persistent applications. We propose FIRM, a fair and high-performance memory control scheme, which 1) improves the bandwidth utilization of persistent applications and 2) balances the bandwidth usage between persistent and non-persistent applications. FIRM achieves this using three components. First, it categorizes memory request sources as non-intensive, streaming, random and persistent, to ensure fair treatment across different sources, and forms batches of requests for each source in a manner that preserves row buffer locality. Second, FIRM strides persistent memory updates across multiple banks, thereby improving bank-level parallelism and hence memory bandwidth utilization of persistent memory accesses. Third, FIRM schedules read and write request batches from different sources in a manner that minimizes bus turnarounds and write queue drains. Compared to five previous memory scheduler designs, FIRM provides significantly higher system performance and fairness.
This paper makes the following contributions:

We identify new problems related to resource contention at the shared memory interface when persistent and non-persistent applications concurrently access memory. The key fundamental problems, caused by memory access characteristics of persistent applications, are: 1) frequent write queue drains, 2) frequent bus turnarounds, both due to high memory write intensity, and 3) memory bandwidth underutilization due to low memory write parallelism. We describe the ineffectiveness of prior memory scheduling designs in handling these problems. (Section 3)

We propose a new strided writing mechanism to improve the bank-level parallelism of persistent memory updates. This technique improves memory bandwidth utilization of memory writes and reduces the stall time of non-persistent applications' read requests. (Section 4.3)

We propose a new persistence-aware memory scheduling policy between read and write requests of persistent and non-persistent applications to minimize memory interference and reduce unfair application slowdowns. This technique reduces the overhead of switching the memory bus between reads and writes by reducing bus turnarounds and write queue drains. (Section 4.4)

We comprehensively compare the performance and fairness of our proposed persistent memory control mechanism, FIRM, to five prior memory schedulers across a variety of workloads and system configurations. Our results show that 1) FIRM provides the highest system performance and fairness on average and for all evaluated workloads, 2) FIRM's benefits are robust across system configurations, and 3) FIRM minimizes the bus turnaround overhead present in prior scheduler designs. (Section 7)

2.
BACKGROUND

In this section, we provide background on existing memory scheduling schemes, the principles and mechanics of persistent memory, and the memory requests generated by persistent applications.

2.1. Conventional Memory Scheduling Mechanisms

A memory controller employs memory request buffers, physically or logically separated into a read and a write queue, to store the memory requests waiting to be scheduled for service. It also utilizes a memory scheduler to decide which memory request should be scheduled next. A large body of previous work developed various memory scheduling policies [7, 26, 27, 34, 41, 42, 48, 49, 61, 62, 63, 64, 65, 67, 76, 77, 84, 85, 95]. Traditional commodity systems employ a variant of the first-ready first-come-first-serve (FR-FCFS) scheduling policy [76, 77, 95], which prioritizes memory requests that are row-buffer hits over others and, after that, older memory requests over others. Because of this, it can unfairly deprioritize applications that have low row-buffer hit rate and that are not memory intensive, hurting both fairness and overall system throughput [61, 64]. Several designs [41, 42, 63, 64, 65, 67, 84, 85] aim to improve either system performance or fairness, or both. PAR-BS [65] provides fairness and starvation freedom by batching requests from different applications based on their arrival times and prioritizing the oldest batch over others. It also improves system throughput by preserving the bank-level parallelism of each application via the use of rank-based scheduling of applications. ATLAS [41] improves system throughput by prioritizing applications that have received the least memory service. However, it may unfairly deprioritize and slow down memory-intensive applications due to the strict ranking it employs between memory-intensive applications [41, 42]. To address this issue, TCM [42] dynamically classifies applications into two clusters, low and high memory-intensity, and employs heterogeneous scheduling policies across the clusters to optimize for both system throughput and fairness.
TCM prioritizes the applications in the low-memory-intensity cluster over others, improving system throughput, and shuffles thread ranking between applications in the high-memory-intensity cluster, improving fairness and system throughput. While shown to be effective in a system that executes only non-persistent applications, unfortunately, none of these scheduling schemes address the memory request scheduling challenges posed by concurrently-running persistent and non-persistent applications, as we discuss in Section 3 and evaluate in detail in Section 7.³

3 The recently developed BLISS scheduler [84] was shown to be more effective than TCM while providing low cost. Even though we do not evaluate BLISS, it also does not take into account the nature of interference caused by persistent applications.

2.2. Persistent Memory

Most persistent applications stem from traditional storage system workloads (databases and file systems), which require persistent memory [1, 2, 16, 17, 88, 90, 92] to support crash consistency [6], i.e., the persistence property. The persistence property guarantees that the critical data (e.g., database records, files, and the corresponding metadata) stored in nonvolatile devices retains a consistent state in case of power loss or a program crash, even when all the data in volatile devices may be lost. Achieving persistence in BA-NVM is nontrivial, due to the presence of volatile processor caches and memory write reordering performed by the write-back caches and memory controllers. For instance, a power outage may occur while a persistent application is inserting a node into a linked list stored in BA-NVM. Processor caches and memory controllers may reorder
the write requests, writing the pointer into BA-NVM before writing the values of the new node. The linked list can lose consistency with dangling pointers, if values of the new node remaining in processor caches are lost due to a power outage, which may lead to unrecoverable data corruption. To avoid such inconsistency problems, most persistent memory designs borrow the ACID (atomicity, consistency, isolation, and durability) concepts from the database and file system communities [17, 54, 88, 90, 92]. Enforcing these concepts, as explained below, leads to additional memory requests, which affect the memory access behavior of persistent applications.

Versioning and Write Ordering. While durability can be guaranteed by BA-NVMs' non-volatile nature, atomicity and consistency are supported by storing multiple versions of the same piece of data and carefully controlling the order of writes into persistent memory (please refer to prior studies for details [17, 54, 88, 90, 92]). Figure 1 shows a persistent tree data structure as an example to illustrate the different methods to maintain versions and ordering. Assume nodes N3 and N4 are updated. We discuss two commonly-used methods to maintain multiple versions and ordering. The first one is redo logging [16, 90]. With this method, new values of the two nodes, along with their addresses, are written into a log (logN3' and logN4') before their original locations are updated in memory (Figure 1(a)). If a system loses power before logging is completed, persistent memory can always recover, using the intact original data in memory. A memory barrier is employed between the writes to the log and the writes to the original locations in memory. This ordering control, with enough information kept in the log, ensures that the system can recover to a consistent state even if it crashes before all original locations are updated. The second method, illustrated in Figure 1(b), is the notion of shadow updates (copy-on-write) [17, 88]. Instead of storing logs, a temporary data buffer is allocated to store new values (shadow copies) of the nodes.
Note that the parent node N1 is also shadow-copied, with the new pointer N1' pointing to the shadow copies N3' and N4'. Ordering control (shown as a memory barrier in Figure 1(b)) ensures that the root pointer is not updated until writes to the shadow copies are completed in persistent memory.

[Figure 1 not reproduced here.] Fig. 1. Example persistent writes with (a) redo logging and (b) shadow updates, when nodes N3 and N4 in a tree data structure are updated.

Relaxed Persistence. Strict persistence [53, 54, 70] requires maintaining the program order of every write request, even within a single log update. Pelley et al. recently introduced a relaxed persistence model that minimizes the ordering control by buffering and coalescing writes to the same data [70].⁴ Our design adopts their relaxed persistence model. For example, we only enforce the ordering between the writes to the shadow copies and the write to the root pointer, as shown in Figure 1(b). Another recent work, Kiln [92], relaxed versioning, eliminating the use of logging or shadow updates by implementing a nonvolatile last-level cache (NV cache). However, due to the limited capacity and associativity of the NV cache, the design cannot efficiently accommodate large-granularity persistent updates in database and file system applications. Consequently, we envision that logging, shadow updates, and Kiln-like designs will coexist in persistent memory designs in the near future.

4 More recently, Lu et al. [54] proposed the notion of loose-ordering consistency, which relaxes the ordering of persistent memory writes even more by performing them speculatively.

2.3. Memory Requests of Persistent Applications

Persistent Writes. We define the writes performed for critical data updates that need to be persistent (including updates to original data locations, log updates, and shadow-copy updates) as persistent writes.
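The log-first ordering described above (write the log, enforce a barrier, then update in place) can be sketched in a few lines. This is a minimal illustration with hypothetical names; `persist_barrier` merely marks where a real system would flush dirty cache lines and fence, and the dictionary stands in for BA-NVM locations.

```python
# Minimal sketch of redo logging with explicit ordering (hypothetical API).
class RedoLogStore:
    def __init__(self):
        self.data = {}            # "original locations" in persistent memory
        self.log = []             # redo log: (address, new_value) entries
        self.log_committed = False

    def persist_barrier(self):
        # In a real persistent memory system this would flush dirty cache
        # lines and fence; here it only marks the durability point.
        pass

    def atomic_update(self, updates):
        # Step 1: write new values and their addresses into the log.
        for addr, value in updates.items():
            self.log.append((addr, value))
        self.persist_barrier()      # log entries must be durable first
        self.log_committed = True   # commit record
        self.persist_barrier()
        # Step 2: only now update the original locations in place.
        for addr, value in updates.items():
            self.data[addr] = value
        self.log = []               # truncate the log after completion
        self.log_committed = False

    def recover(self):
        # After a crash: replay the log only if it was fully committed;
        # otherwise the intact original data is already consistent.
        if self.log_committed:
            for addr, value in self.log:
                self.data[addr] = value
        self.log = []
        self.log_committed = False
```

Note that every `atomic_update` issues both log writes and in-place writes, which is precisely the write amplification that makes persistent applications write-intensive.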
Each critical data update may generate an arbitrary number of persistent writes depending on the granularity of the update. For example, in a key-value store, an update may be the addition of a new value of several bytes, several kilobytes, several megabytes, or larger. Note that persistent memory architectures typically either flush persistent writes (i.e., dirty blocks) out of processor caches at the point of memory barriers, or implement persistent writes as uncacheable (UC) writes [54, 88, 90, 92].

Non-persistent Writes. Non-critical data, such as stacks and data buffers, are not required to survive system failures. Typically, persistent memory does not need to perform versioning or ordering control over these writes. As such, persistent applications perform not only persistent writes but also non-persistent writes.

Reads. Persistent applications also perform reads of in-flight persistent writes and other independent reads. Persistent memory can relax the ordering of independent reads without violating the persistence requirement. However, doing so can impose substantial performance penalties (Section 3.2). Reads of in-flight persistent updates need to wait until these persistent writes arrive at BA-NVMs. Conventional memory controller designs provide read-after-write ordering by servicing reads of in-flight writes from write buffers. With volatile memory, such behavior does not affect memory consistency. With nonvolatile memory, however, power outages or program crashes can destroy in-flight persistent writes before they are written to persistent memory. Speculative reads of in-flight persistent updates can lead to incorrect ordering and potential resulting inconsistency: if a read has already obtained the value of an in-flight write that would disappear on a crash, wrong data may eventually propagate to persistent memory as a result of the read.

3. MOTIVATION: HANDLING PERSISTENT MEMORY ACCESSES

Conventional memory scheduling schemes are designed based on the assumption that main memory is used as working memory, i.e., a file cache for storage systems.
This assumption no longer holds when main memory also supports data persistence, by accommodating persistent applications that access memory differently from traditional non-persistent applications. This is because persistent memory writes have different consistency requirements than working memory writes, as we described in Sections 2.2 and 2.3. In this section, we study the performance implications caused by this different memory access behavior of persistent applications (Section 3.1), discuss the problems of directly adopting existing memory scheduling methods to handle persistent memory accesses (Section 3.2), and describe why naïvely extending past memory schedulers does not solve the problem (Section 3.3).

3.1. Memory Access Characteristics of Persistent Applications

An application's memory access characteristics can be evaluated using four metrics: a) memory intensity, measured as the number of
last-level cache misses per thousand instructions (MPKI) [19, 42]; b) write intensity, measured as the portion of write misses (WR%) out of all cache misses; c) bank-level parallelism (BLP), measured as the average number of banks with outstanding memory requests, when at least one outstanding request exists [50, 65]; d) row-buffer locality (RBL), measured as the average hit rate of the row buffer across all banks [64, 77].

To illustrate the different memory access characteristics of persistent and non-persistent applications, we studied the memory accesses of three representative micro-benchmarks: streaming, random, and KVStore. Streaming and random [42, 61] are both memory-intensive, non-persistent applications, performing streaming and random accesses to a large array, respectively. They serve as the two extreme cases with dramatically different BLP and RBL. The persistent application KVStore performs inserts and deletes to key-value pairs (25-byte keys and 2K-byte values) of an in-memory B+ tree data structure. The sizes of keys and values were specifically chosen so that KVStore had the same memory intensity as the other two micro-benchmarks. We build this benchmark by implementing a redo logging (i.e., writing new updates to a log while keeping the original data intact) interface on top of STX B+ Tree [12] to provide persistence support. Redo logging behaves very similarly to shadow updates (Section 2.2), which perform the updates in a shadow version of the data structure instead of logging them in a log space. Our experiments (not shown here) show that the performance implications of KVStore with shadow updates are similar to those of KVStore with redo logging, which we present here.

Table 1 lists the memory access characteristics of the three micro-benchmarks running separately. The persistent application KVStore, especially in its persistence phase when it performs persistent writes, has three major discrepant memory access characteristics in comparison to the two non-persistent applications.
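The four metrics above can be made concrete with a small sketch. The function names and input formats below are ours, not the paper's; the numbers in the comments simply mirror the definitions in the text.

```python
# Illustrative computation of the four memory access metrics.

def mpki(misses, instructions):
    # memory intensity: last-level cache misses per thousand instructions
    return 1000.0 * misses / instructions

def write_intensity(write_misses, total_misses):
    # WR%: portion of write misses out of all cache misses, in percent
    return 100.0 * write_misses / total_misses

def blp(snapshots):
    # snapshots: per-cycle sets of banks with outstanding requests;
    # BLP averages the bank count over cycles with any request outstanding
    busy = [len(s) for s in snapshots if s]
    return sum(busy) / len(busy) if busy else 0.0

def rbl(per_bank_stats):
    # per_bank_stats: list of (row_hits, total_accesses) per bank;
    # RBL is the average row-buffer hit rate across banks
    rates = [hits / total for hits, total in per_bank_stats if total]
    return sum(rates) / len(rates) if rates else 0.0
```

For instance, an application with 200,000 misses over 2 million instructions has an MPKI of 100, matching the "High" memory intensity rows of Table 1.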
Table 1. Memory access characteristics of three applications running individually. The last row shows the memory access characteristics of KVStore when it performs persistent writes.

                               MPKI        WR%        BLP         RBL
  Streaming                    100/High    47%/Low    0.05/Low    96%/High
  Random                       100/High    46%/Low    6.3/High    0.4%/Low
  KVStore                      100/High    77%/High   0.05/Low    71%/High
  Persistence Phase (KVStore)  675/High    92%/High   0.01/Low    97%/High

1. High write intensity. While the three applications have the same memory intensity, KVStore has much higher write intensity than the other two. This is because each insert or delete operation triggers a redo log update, which appends a log entry containing the addresses and the data of the modified key-value pair. The log updates generate extra write traffic in addition to the original location updates.

2. Higher memory intensity with persistent writes. The last row of Table 1 shows that while the KVStore application is in its persistence phase (i.e., when it is performing persistent writes and flushing these writes out of processor caches), it causes much higher memory traffic (MPKI is 675). During this phase, writes make up almost all (92%) of the memory traffic.

3. Low BLP and high RBL with persistent writes. KVStore, especially while performing persistent writes, has low BLP and high RBL. KVStore's log is implemented as a circular buffer, similar to those used in prior persistent memory designs [90], by allocating (as much as possible) one or more contiguous regions in the physical address space. As a result, the log updates lead to consecutive writes to contiguous locations in the same bank, i.e., an access pattern that can be characterized as streaming writes. This makes KVStore's write behavior similar to that of streaming's reads: low BLP and high RBL. However, the memory bus can only service either reads or writes (to any bank) at any given time because the bus can be driven in only one direction [49], which causes a fundamental difference (and conflict) between handling streaming reads and streaming writes.
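The low-BLP effect of a contiguous circular log can be seen from a toy address-to-bank mapping. The row size, bank count, and mapping below are illustrative assumptions, not the paper's actual configuration; the contrast with a cache-line-interleaved ("strided") layout previews why striding persistent writes raises BLP.

```python
# Why contiguous log appends yield low BLP, and how striding changes it.
ROW_SIZE = 8192   # bytes per row buffer (assumed)
NUM_BANKS = 8     # banks per channel (assumed)
LINE = 64         # bytes per write request

def row_interleaved_bank(addr):
    # consecutive rows rotate across banks: a contiguous region fills an
    # entire 8 KB row in one bank before moving on to the next bank
    return (addr // ROW_SIZE) % NUM_BANKS

def strided_bank(addr):
    # cache-line interleaving: consecutive writes rotate across banks
    return (addr // LINE) % NUM_BANKS

appends = [i * LINE for i in range(256)]        # 16 KB of log appends
contiguous_banks = {row_interleaved_bank(a) for a in appends}
strided_banks = {strided_bank(a) for a in appends}
# 16 KB of contiguous appends touch only 2 of 8 banks (streaming writes,
# low BLP); the same appends strided across banks touch all 8.
```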
We conclude that persistent writes cause persistent applications to have widely different memory access characteristics than non-persistent applications. As we show next, the high write intensity and low bank-level parallelism of writes in persistent applications cause a fundamental challenge to existing memory scheduler designs for two reasons: 1) the high write intensity causes frequent switching of the memory bus between reads and writes, causing bus turnaround delays; 2) the low write BLP causes underutilization of memory bandwidth while writes are being serviced, which delays any reads in the memory request buffer. These two problems become exacerbated when persistent applications run together with non-persistent ones, a scenario where both reads and persistent writes are frequently present in the memory request buffer.

3.2. Inefficiency of Prior Memory Scheduling Schemes

As we mentioned above, the memory bus can service either reads or writes (to any bank) at any given time because the bus can be driven in only one direction [49]. Prior memory controllers (e.g., [26, 27, 41, 42, 49, 63, 64, 65, 76, 77, 95]) buffer writes in a write queue to allow read requests to aggressively utilize the memory bus. When the write queue is full or is filled to a predefined level, the memory scheduler switches to a write drain mode where it drains the write queue either fully or to a predetermined level [49, 78, 83], in order to prevent stalling the entire processor pipeline. During the write drain mode, the memory bus can service only writes. In addition, switching into and out of the write drain mode from the read mode induces additional penalty in the DRAM protocol (called read-to-write and write-to-read turnaround delays, tRTW and tWTR, approximately 7.5ns and 15ns, respectively [43]) during which no read or write commands can be scheduled on the bus, causing valuable memory bus cycles to be wasted.
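The drain behavior and its turnaround cost can be captured in a deliberately simplified model. The queue size, low watermark, and useful-time interval below are illustrative assumptions (not FIRM's or any real controller's parameters); only the turnaround delays come from the text above.

```python
# Simplified model of write drain mode and its bus turnaround cost.

def bus_turnarounds(requests, queue_size=64, drain_to=16):
    """Count bus turnarounds for a stream of 'R'/'W' requests: when the
    write queue fills, the bus switches to writes, drains to the low
    watermark, and switches back (two turnarounds per drain). Reads are
    assumed to be serviced immediately in read mode."""
    write_q = 0
    turnarounds = 0
    for req in requests:
        if req == "W":
            write_q += 1
            if write_q >= queue_size:
                turnarounds += 2      # read-to-write plus write-to-read
                write_q = drain_to
    return turnarounds

def turnaround_overhead(switches, t_rtw_ns=7.5, t_wtr_ns=15.0,
                        useful_ns=10000.0):
    """Fraction of bus time lost to turnarounds, using the tRTW/tWTR
    delays quoted above; useful_ns (assumed) is the bus time spent
    actually servicing requests."""
    wasted = (switches / 2) * (t_rtw_ns + t_wtr_ns)
    return wasted / (useful_ns + wasted)
```

Under these assumptions, a write-heavy stream like KVStore's persistence phase triggers a drain roughly every few dozen writes, while a read-heavy stream triggers none; frequent drains push the wasted-cycle fraction into the double-digit percent range, consistent in magnitude with the measurements discussed next.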
Therefore, frequent switches into the write drain mode and long time spent in the write drain mode can significantly slow down reads and can harm the performance of read-intensive applications and the entire system [49].

This design of conventional memory schedulers is based on two assumptions, which are generally sound for non-persistent applications. First, reads are on the critical path of application execution whereas writes are usually not. This is sound when most non-persistent applications abound with read-dependent arithmetic, logic, and control flow operations and writes can be serviced from write buffers in caches and in the memory controller. Therefore, most prior memory scheduling schemes prioritize reads over writes. Second, applications are usually read-intensive, and memory controllers can delay writes without frequently filling up the write queues. Therefore, optimizing the performance of writes is not as critical to performance in many workloads, as the write queues are large enough for such read-intensive applications.

Unfortunately, these assumptions no longer hold when persistent writes need to go through the same shared memory interface as non-persistent requests. First, the ordering control of persistent writes requires the serialization of the persistent write traffic to main memory (e.g., via the use of memory barriers, as described in Section 2.2). This causes the persistent writes, reads of in-flight persistent writes, and computations dependent on these writes (and
potentially all computations after the persistent writes, depending on the implementation) to be serialized. As such, persistent writes are also on the critical execution path. As a result, simply prioritizing read requests over persistent write requests can hurt system performance. Second, persistent applications are write-intensive as opposed to read-intensive. This is due to not only the persistent nature of data manipulation, which might lead to more frequent memory updates, but also the way persistence is guaranteed using multiple persistent updates (i.e., to the original location as well as the alternate version of the data in a redo log or a shadow copy, as explained in Section 2.2).

Because of these characteristics of persistent applications, existing memory controllers are inefficient in handling them concurrently with non-persistent applications. Figure 2 illustrates this inefficiency in a system that concurrently runs KVStore with either the streaming or the random application. This figure shows the fraction of memory access cycles that are spent due to delays related to bus turnaround between reads and writes as a function of the number of write queue entries.⁵ The figure shows that up to 17% of memory bus cycles are wasted due to frequent bus turnarounds, with a commonly-used 64-entry write queue. We found that this is mainly because persistent writes frequently overflow the write queue and force the memory controller to drain the writes. Typical schedulers in modern processors have only 32 to 64 write queue entries to buffer memory requests [30]. Simply increasing the number of write queue entries in the scheduler is not a scalable solution [7].

[Figure 2 not reproduced here; the bus turnaround overhead reaches 17% for KVStore+Streaming and 8% for KVStore+Random.] Fig. 2. Fraction of memory access cycles wasted due to delays related to bus turnaround between reads and writes.

In summary, conventional memory scheduling schemes, which prioritize reads over persistent writes, become inefficient when persistent and non-persistent applications share the memory interface.
This causes relatively low performance and fairness (as we show next).

3.3. Analysis of Prior and Naïve Scheduling Policies

We have observed, in Section 2.3, that persistent applications' (e.g., KVStore's) writes behave similarly to streaming reads. As such, a natural idea would be to assign these persistent writes the same priority as read requests, instead of deprioritizing them below read requests, to ensure that persistent applications are not unfairly penalized. This is a naïve (yet simple) method of extending past schedulers to potentially deal with persistent writes. In this section, we provide a case study analysis of fairness and performance of both prior schedulers (FR-FCFS [76, 77, 95] and TCM [42]) and naïve extensions of these prior schedulers (FRFCFS-modified and TCM-modified) that give equal priority to persistent writes and reads.⁶ Figure 3 illustrates fairness and system performance of these schedulers for two workloads where KVStore is run together with streaming or random. To evaluate fairness, we consider both the individual slowdown of each application [48] and the maximum slowdown [20, 41, 42, 87] across both applications in a workload. We make several major observations.

5 Section 6 explains our system setup and methodology.
6 Note that we preserve all the other ordering rules of FR-FCFS and TCM in FRFCFS-modified and TCM-modified. Within each prioritization level, reads and persistent writes are prioritized over non-persistent writes. For example, with FRFCFS-modified, the highest priority requests are row-buffer-hit read and persistent write requests, and the second highest priority requests are row-buffer-hit non-persistent write requests.

[Figure 3 not reproduced here; panels (a) and (b) show per-application and maximum slowdowns for workloads L1 (KVStore+Streaming) and L2 (KVStore+Random), and panel (c) shows weighted speedup.] Fig. 3. Performance and fairness of prior and naïve scheduling methods.

Case Study 1 (L1 in Figure 3(a) and (c)): When KVStore is run together with streaming, prior scheduling policies (FR-FCFS and TCM) unfairly slow down the persistent KVStore.
Because these policies delay writes behind reads, and streaming's reads with high row-buffer locality capture a memory bank for a long time, KVStore's writes need to wait for long time periods even though they also have high row buffer locality. When the naïve policies are employed, the effect is reversed: FRFCFS-modified and TCM-modified reduce the slowdown of KVStore but increase the slowdown of streaming compared to FR-FCFS and TCM. KVStore's performance improves because, as persistent writes have the same priority as reads, its frequent writes are not delayed too long behind streaming's reads. Streaming slows down greatly due to two major reasons. First, its read requests are interfered with much more frequently by the write requests of KVStore. Second, due to equal read and persistent write priorities, the memory bus has to be frequently switched between persistent writes and streaming reads, leading to high bus turnaround latencies where no request gets scheduled on the bus. These delays slow down both applications but affect streaming a lot more because almost all accesses of streaming are reads, are on the critical path, and are affected by both read-to-write and write-to-read turnaround delays, whereas KVStore's writes are less affected by write-to-read turnaround delays. Figure 3(c) shows that the naïve policies greatly degrade overall system performance on this workload, even though they improve KVStore's performance. We find this system performance degradation is mainly due to the frequent bus turnarounds.

Case Study 2 (L2 in Figure 3(b) and (c)): KVStore and random are two applications with almost exactly opposite BLP, RBL, and write intensity. When these two run together, random slows down the most with all four of the evaluated scheduling policies. This is because random is more vulnerable to interference than the mostly-streaming KVStore due to its high BLP, as also observed in previous studies [42]. FRFCFS-modified slightly improves KVStore's performance while largely degrading random's performance due to the same reason described for L1.
TCM-modified does not significantly affect either application's performance because three competing effects end up canceling any benefits. First, TCM-modified ends up prioritizing the random-access random over the streaming KVStore in some time intervals, as it is aware of the high vulnerability of random due to its high BLP and low RBL. Second, at other times, it prioritizes the frequent persistent write requests of KVStore over the read requests of random due to the equal priority of reads and persistent writes. Third, frequent bus turnarounds (as discussed above for L1) degrade both applications' performance. Figure 3(c) shows that the naïve policies slightly degrade or do not affect overall system performance on this workload.

3.4. Summary and Our Goal

In summary, neither conventional scheduling policies nor their naïve extensions that take into account persistent writes provide high fairness and high system performance. This is because they lead to 1) frequent entries into write drain mode due to the high intensity of persistent writes, 2) resulting frequent bus turnarounds between read and write requests that cause wasted bus cycles, and 3) memory bandwidth underutilization during write drain mode due to the low BLP of persistent writes. These three problems are pictorially illustrated in Figure 4(a) and (b), which depict the service timeline of memory requests with conventional scheduling and its naïve extension. This illustration shows that 1) persistent writes heavily access Bank-1, leading to high bandwidth underutilization with both schedulers, 2) both schedulers lead to frequent switching between reads and writes, and 3) the naïve scheduler delays read requests significantly because it prioritizes persistent writes, and it does not reduce the bus turnarounds. Our evaluation of 18 workload combinations (in Section 7) shows that various conventional and naïve scheduling schemes lead to low system performance and fairness due to these three reasons. Therefore, a new memory scheduler design is needed to overcome these challenges and provide high performance and fairness in a system where the memory interface is shared between persistent and non-persistent applications. Our goal in this work is to design such a scheduler (Section 4).

[Fig. 4. Example comparing conventional (a), naïve (b), and proposed (c) schemes: service timelines of streaming reads, random reads, and persistent writes on two banks, showing bus turnarounds, write queue drains, and the time saved by FIRM's two techniques: 1) persistent write striding, which increases the BLP of persistent writes, and 2) persistence-aware memory scheduling, which reduces write queue drains and bus turnarounds.]

4. FIRM DESIGN

Overview.
We propose FIRM, a memory control scheme that aims to serve requests from persistent and non-persistent applications in a fair and high-throughput manner at the shared memory interface. FIRM introduces two novel design principles to achieve this goal, which are illustrated conceptually in Figure 4(c). First, persistent write striding ensures that persistent writes to memory have high BLP such that memory bandwidth is well utilized during write drain mode. It does so by ensuring that consecutively-issued groups of writes to the log or shadow copies in persistent memory are mapped to different memory banks. This reduces not only the duration of write drain mode but also the frequency of entry into write drain mode compared to prior methods, as shown in Figure 4(c). Second, persistence-aware memory scheduling minimizes the frequency of write queue drains and bus turnarounds by scheduling the queued-up reads and writes in a fair manner. It does so by balancing the amount of time spent in write drain mode and read mode, while ensuring that the time spent in each mode is long enough that the wasted cycles due to bus turnaround delays are minimized. Persistence-aware memory scheduling therefore reduces: 1) the latency of servicing the persistent writes, 2) the amount of time persistent writes block outstanding reads, and 3) the frequency of entry into write queue drain mode. The realization of these two principles leads to higher performance and efficiency than conventional and naïve scheduler designs, as shown in Figure 4. The FIRM design consists of four components: 1) request batching, which forms separate batches of read and write requests that go to the same row, to maximize row-buffer locality; 2) source categorization, which categorizes the request sources for effective scheduling by distinguishing various access patterns of applications; 3) persistent write striding, which maximizes the BLP of persistent requests; and 4) persistence-aware memory scheduling, which maximizes performance and fairness by appropriately adjusting the number of read and write batches to be serviced at a time.
Figure 5(a) depicts an overview of these components, which we describe next.

4.1. Request Batching

The goal of request batching is to group together the set of requests to the same memory row from each source (i.e., process or hardware thread context, as described below in Section 4.2). Batches are formed per source, similarly to previous work [7, 65], separately for reads and writes. If scheduled consecutively, all requests in a read or write batch (except for the first one) will hit in the row buffer, minimizing latency and maximizing memory data throughput. A batch is considered to be formed when the next memory request in the request buffer of a source is to a different row [7].

4.2. Source Categorization

To apply appropriate memory control over requests with various characteristics, FIRM dynamically classifies the sources of memory requests into four categories: non-intensive, streaming, random, and persistent. A source is defined as a process or thread during a particular time period, when it is generating memory requests in a specific manner. For example, a persistent application is considered a persistent source when it is performing persistent writes. It may also be a non-intensive, a streaming, or a random source in other time periods. FIRM categorizes sources on an interval basis. At the end of an interval, each source is categorized based on its memory intensity, RBL, BLP, and persistence characteristics during the interval, predicting that it will exhibit similar behavior in the next interval.7 The main new feature of FIRM's source categorization is its detection of a persistent source (inspired by the discrepant characteristics of persistent applications described in Section 2.3). Table 2 depicts the rules FIRM employs to categorize a source as persistent. FIRM uses program hints (with the software interface described in Section 5) to determine whether a hardware context belongs to a persistent application. This ensures that a non-persistent application does not get classified as a persistent source.
If a hardware context belonging to such an application is generating write batches that are larger than a pre-defined threshold (i.e., has an average write batch size greater than 30 in the previous interval) and if it inserts memory barriers between memory requests (i.e., has inserted at least one memory barrier between write requests in the previous interval), FIRM categorizes it as a persistent source.

Footnote 7: We use an interval size of one million cycles, which we empirically find to provide a good tradeoff between prediction accuracy, adaptivity to workload behavior, and overhead of categorization.
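Concretely, the persistent-source rule above can be sketched as a small predicate applied to per-interval statistics. The 30-request batch-size threshold and the one-barrier minimum are the values given in the text; the function and argument names are illustrative, not FIRM's actual hardware implementation.

```python
WRITE_BATCH_SIZE_THRESHOLD = 30   # pre-defined threshold from the text
MIN_MEMORY_BARRIERS = 1           # at least one barrier between write requests

def is_persistent_source(hinted_persistent: bool,
                         avg_write_batch_size: float,
                         barriers_between_writes: int) -> bool:
    """Apply the three rules of Table 2 to one hardware context, using
    statistics collected over the previous interval."""
    return (hinted_persistent                                      # rule 1: program hint
            and avg_write_batch_size > WRITE_BATCH_SIZE_THRESHOLD  # rule 2: large write batches
            and barriers_between_writes >= MIN_MEMORY_BARRIERS)    # rule 3: memory barriers
```

Note that rule 1 acts as a gate: without the program hint, a write-heavy non-persistent context can never be misclassified as persistent, which is exactly the guarantee the text describes.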

[Fig. 5. Overview of the FIRM design and its two key techniques: (a) the FIRM components (request batching, source categorization into non-intensive, random, streaming, and persistent sources, persistent write striding, and persistence-aware memory scheduling); (b) persistent write striding, which maps persistent writes issued to a contiguous memory space (the data buffer holding the log or shadow copies, divided into row-buffer-sized buffer groups) to strided locations across banks; (c) the persistence-aware memory scheduling policy, which, given ready read batches R1, R2, R3 and ready persistent write batches W1, W2 across banks, chooses one read batch group and one persistent write batch group to schedule in each time interval.]

Table 2. Rules used to identify persistent sources. A thread is identified as a persistent source if it
1: belongs to a persistent application;
2: is generating write batches that are larger than a pre-defined threshold in the past interval;
3: inserts memory barriers between memory requests.

Sources that are not persistent are classified into non-intensive, streaming, and random based on three metrics: MPKI (memory intensity), BLP, and RBL. This categorization is inspired by previous studies [42, 63], which show the varying characteristics of such sources. A non-intensive source has low memory intensity. We identify these sources to prioritize their batches over batches of other sources; this maximizes system performance, as such sources are latency-sensitive [42]. Streaming and random sources are typically read-intensive, having opposite BLP and RBL characteristics (Table 1).
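As a minimal sketch, the end-of-interval categorization can be expressed as follows, using the thresholds reported in the experimental setup (MPKI < 1 for non-intensive; MPKI > 1, BLP < 4, and RBL > 70% for streaming); the function name and the convention of passing the Table 2 decision in as a flag are illustrative assumptions.

```python
def categorize_source(mpki: float, blp: float, rbl: float,
                      is_persistent: bool) -> str:
    """End-of-interval source categorization. `is_persistent` is the result
    of applying the Table 2 rules; the remaining thresholds follow the
    experimental settings in the text (RBL as a fraction, so 0.70 = 70%)."""
    if is_persistent:
        return "persistent"
    if mpki < 1:
        return "non-intensive"   # latency-sensitive; its batches are prioritized first
    if blp < 4 and rbl > 0.70:
        return "streaming"       # high row-buffer locality, low bank-level parallelism
    return "random"              # high bank-level parallelism, low row-buffer locality
```

The category then feeds the next interval's scheduling decisions, on the prediction that the source keeps behaving the same way.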
This streaming and random source classification is used later by the underlying scheduling policy FIRM borrows from past works to maximize system performance and fairness (e.g., TCM [42]).

Footnote 8: In our experiments, a hardware context is classified as non-intensive if its MPKI < 1. A hardware context is classified as streaming if its MPKI > 1, BLP < 4, and RBL > 70%. All other hardware contexts that are not persistent are classified as random.

4.3. Persistent Write Striding

The goal of this mechanism is to reduce the latency of servicing consecutively-issued persistent writes by ensuring they have high BLP and thus fully utilize memory bandwidth. We achieve this goal by striding the persistent writes across multiple memory banks via hardware or software support. The basic idea of persistent write striding is simple: instead of mapping consecutive groups of row-buffer-sized persistent writes to consecutive row-buffer-sized locations in a persistent data buffer (which is used for the redo log or shadow copies in a persistent application), which causes them to map to the same memory bank, change the mapping such that they are strided by an offset that ensures they map to different memory banks. Figure 5(b) illustrates this idea. A persistent application can still allocate a contiguous memory space for the persistent data buffer. Our method maps the accesses to the data buffer to different banks in a strided manner. Contiguous persistent writes of less than or equal to the row-buffer size are still mapped to a contiguous data buffer space of a row-buffer size (called a buffer group) to achieve high RBL. However, contiguous persistent writes beyond the size of the row buffer are strided by an offset.

[Fig. 6. Physical address to bank mapping example, showing the higher-order address bits and the position of the bank index bits.]

The value of the offset is determined by the position of the bank index bits used in the physical
For example, wth the address mappng scheme n Fgure 6, the offset should be 128K bytes f we want to fully utlze all eght banks wth persstent wrtes (because a contguous memory chunk of 16KB gets mapped to the same bank wth ths address mappng scheme,.e., the memory nterleavng granularty s 16KB across banks). Ths persstent wrte strdng mechansm can be mplemented n ether the memory controller hardware or a user-mode lbrary, as we descrbe n Secton 5. Note that the persstent wrte strdng mechansm provdes a determnstc (re)mappng of persstent data buffer physcal addresses to physcal memory addresses n a strded manner. The remapped physcal addresses wll not exceed the boundary of the orgnal data buffer. As a result, re-accessng or recoverng data at any tme from the persstent data buffer s not an ssue: all accesses to the buffer go through ths remappng. Alternatve Methods. Note that commodty memory controllers randomze hgher-order address bts to mnmze bank conflcts (Fgure 6). However, they can stll fal to map persstent wrtes to dfferent banks because as we showed n Secton 3.1, persstent wrtes are usually streamng and hence they are lkely to map to the same bank. It s mpractcal to mprove the BLP of persstent wrtes by aggressvely bufferng them due to two reasons: 1) The large bufferng capacty requred. For example, we mght need a wrte queue as large as 128KB to utlze all eght banks of a DDR3 channel wth the address mappng shown n Fgure 6. 2) The regon of concurrent contguous wrtes may not be large enough to cover multple banks (.e., there may not be enough wrtes present to dfferent banks). Alternatvely, kernel-level memory access randomzaton [69] may dstrbute wrtes to multple banks durng persstent applcaton executon. However, the address mappng nformaton can be lost when the system reboots, leavng the BA-NVM wth unrecoverable data. 
Finally, it is also prohibitively complex to randomize the bank mapping of only persistent writes by choosing a different set of address bits as their bank indexes, i.e., maintaining multiple address mapping schemes in a single memory system. Doing so requires complex bookkeeping mechanisms to ensure correct mapping of memory addresses. For these very reasons, we have developed the persistent write striding technique we have described.
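To make the remapping concrete, here is one possible bijective striding layout for the Figure 6 parameters (16KB interleaving granularity, eight banks). The 8KB buffer-group size and the exact permutation are illustrative assumptions rather than FIRM's specified layout, but they preserve the two properties the text requires: consecutive buffer groups land on different banks, and remapped addresses stay inside the original contiguous buffer.

```python
def stride_remap(buf_offset: int, buf_size: int,
                 row_buf: int = 8 * 1024,      # assumed buffer-group (row-buffer) size
                 interleave: int = 16 * 1024,  # bytes mapped to one bank (Fig. 6 example)
                 num_banks: int = 8) -> int:
    """Deterministically remap an offset inside a contiguous persistent data
    buffer so consecutive row-buffer-sized groups cycle through all banks.
    The map is a bijection on [0, buf_size), so recovery simply re-applies it."""
    assert buf_size % (num_banks * interleave) == 0
    groups_per_chunk = interleave // row_buf
    groups_per_region = num_banks * groups_per_chunk  # groups per full bank sweep

    group, within = divmod(buf_offset, row_buf)
    region, g = divmod(group, groups_per_region)
    chunk = region * num_banks + (g % num_banks)  # a new bank for every group
    slot = g // num_banks                         # then fill chunks group by group
    return chunk * interleave + slot * row_buf + within

def bank_of(phys_addr: int, interleave: int = 16 * 1024, num_banks: int = 8) -> int:
    """Bank index under the Fig. 6-style mapping: 16KB chunks interleaved across banks."""
    return (phys_addr // interleave) % num_banks
```

With these parameters, eight consecutive buffer groups map to banks 0 through 7 instead of sharing banks pairwise, so a write queue drain can proceed in all eight banks concurrently.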

4.4. Persistence-Aware Memory Scheduling

The goal of this component is to minimize write queue drains and bus turnarounds by intelligently partitioning memory service between reads and persistent writes, while maximizing system performance and fairness. To achieve this multi-objective goal, FIRM operates at the batch granularity and forms a schedule of read and write batches of different source types: non-intensive, streaming, random, and persistent. To maximize system performance, FIRM prioritizes non-intensive read batches over all other batches. For the remaining batches of requests, FIRM employs a new policy that determines 1) how to group read batches and write batches and 2) when to switch between servicing read batches and write batches. FIRM does this in a manner that balances the amount of time spent in write drain mode (servicing write batches) and read mode (servicing read batches) in a way that is proportional to the read and write demands, while ensuring that the time spent in each mode is long enough that the wasted cycles due to bus turnaround delays are minimized. When the memory scheduler is servicing read or persistent write batches, in read mode or write drain mode, the scheduling policy employed can be any of the previously-proposed memory request scheduling policies (e.g., [26, 41, 42, 49, 64, 65, 76, 77, 95]), and the ordering of persistent write batches is fixed by the ordering control of persistent applications. The key novelty of our proposal is not the particular prioritization policy between requests, but the mechanism that determines how to group batches and when to switch between write drain mode and read mode, which we describe in detail next.9 To minimize write queue drains, FIRM schedules reads and persistent writes within an interval in a round-robin manner, with the memory bandwidth (i.e., the time interval) partitioned between them based on their demands. To prevent frequent bus turnarounds, FIRM schedules a group of batches in one bus transfer direction before scheduling another group of batches in the other direction.
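The mode-partitioning policy above, together with the batch-group selection that this section formalizes in Equations 2 and 3, can be sketched as follows. Candidate batch-group service times are assumed to be precomputed per bank and ordered from smallest group to largest; the function names and argument conventions are illustrative, not FIRM's hardware interface.

```python
def batch_group_time(per_bank_hits_misses, t_hit, t_miss):
    """Equation 2: a batch group's service time is the maximum over banks of
    (row-buffer hits * t_hit + row-buffer misses * t_miss)."""
    return max(h * t_hit + m * t_miss for h, m in per_bank_hits_misses)

def pick_batch_groups(t_r, t_w, t_RT, t_TR, mu_turnaround):
    """Equation 3 / Algorithm 1 sketch: choose the smallest read and write
    batch groups whose service times keep the bus-turnaround overhead below
    mu_turnaround, splitting the interval in proportion to read/write demand.
    t_r and t_w hold candidate group service times in increasing order."""
    t_r_max, t_w_max = max(t_r), max(t_w)
    floor_r = (t_RT + t_TR) / mu_turnaround / (1 + t_w_max / t_r_max)
    floor_w = (t_RT + t_TR) / mu_turnaround / (1 + t_r_max / t_w_max)
    # smallest group long enough to amortize the turnaround; fall back to all batches
    t_r_next = min((t for t in t_r if t >= floor_r), default=t_r_max)
    t_w_next = min((t for t in t_w if t >= floor_w), default=t_w_max)
    return t_r_next, t_w_next
```

Per interval, the scheduler then serves the chosen read group in read mode, turns the bus around once, serves the chosen write group in write drain mode, and repeats round-robin, so each pair of turnarounds is amortized over at least (t_RT + t_TR) / mu_turnaround cycles of useful service.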
Figure 5(c) illustrates an example of this persistence-aware memory scheduling policy. Assume, without loss of generality, that we have the following batches ready to be scheduled at the beginning of a time interval: a random read batch R1, two streaming read batches R2 and R3, and two (already-strided) persistent write batches W1 and W2. We define a batch group as a group of batches that will be scheduled together. As illustrated in Figure 5(c), the memory controller has various options to compose the read and write batch groups. The figure shows three possible batch groups for reads ([R1], [R1, R2], and [R1, R2, R3]) and two possible batch groups for writes ([W1] and [W1, W2]). These possibilities assume that the underlying memory request scheduling policy dictates the order of batches within a batch group. Our proposed technique thus boils down to determining how many read or write batches should be grouped together to be scheduled in the next read mode or write drain mode. We design a new technique that aims to satisfy the following two goals: 1) servicing the two batch groups (read and write) consumes durations proportional to their demand; 2) the total time spent servicing the two batch groups is much longer than the bus turnaround time. The first goal is to prevent the starvation of either reads or persistent writes, by fairly partitioning the memory bandwidth between them. The second goal is to maximize performance by ensuring that a minimal amount of time is wasted on bus turnarounds.

Footnote 9: Note that most previous memory scheduling schemes focus on read requests and do not discuss how to handle switching between read and write modes in the memory controller, implicitly assuming that reads are prioritized over writes until the write queue becomes full or close to full [49].

Mathematically, we formulate these two goals as the following two inequalities:

    t_r / t_w = t_r^max / t_w^max,    (t_RT + t_TR) / (t_r + t_w) <= μ_turnaround    (1)

where t_r and t_w are the times to service a read and a persistent write batch group, respectively (Figure 5(c)).
They are the maximum service times for the batch group across all banks i:

    t_r_j = max_i { H_r_i * t_rhit + M_r_i * t_rmiss },
    t_w_j = max_i { H_w_i * t_whit + M_w_i * t_wmiss }    (2)

where t_rhit, t_whit, t_rmiss, and t_wmiss are the times to service a row-buffer hit/miss read/write request; H_r_i and H_w_i are the numbers of row-buffer read/write hits at bank i; and M_r_i and M_w_i are the numbers of row-buffer read/write misses at bank i. t_r^max and t_w^max are the maximum times to service all the in-flight read and write requests (illustrated in Figure 5(c)). μ_turnaround is a user-defined parameter representing the maximum tolerable fraction of bus turnaround time out of the total service time of memory requests. The goal of our mechanism is to group read and write batches (i.e., form read and write batch groups) to be scheduled in the next read mode and write drain mode in a manner that satisfies Equation 1. The technique thus boils down to selecting from the set of possible read/write batch groups such that they satisfy t_r^next (the duration of the next read mode) and t_w^next (the duration of the next write drain mode) as indicated by the constraints in Equation 3 (which is obtained by solving the inequality in Equation 1). Our technique, Algorithm 1, forms a batch group that has the minimum service duration satisfying the constraint on the right-hand side of Equation 3.10

    t_r^next = min_j { t_r_j : t_r_j >= (t_RT + t_TR) / μ_turnaround / (1 + t_w^max / t_r^max) },
    t_w^next = min_j { t_w_j : t_w_j >= (t_RT + t_TR) / μ_turnaround / (1 + t_r^max / t_w^max) }    (3)

Algorithm 1: Formation of read and write batch groups.
Input: t_RT, t_TR, and μ_turnaround.
Output: the read batch group to be scheduled, indicated by t_r^next; the persistent write batch group to be scheduled, indicated by t_w^next.
Initialization: k_r <- number of read batch groups; k_w <- number of persistent write batch groups.
for j <- 0 to k_r - 1 do: calculate t_r_j with Equation 2; end for
for j <- 0 to k_w - 1 do: calculate t_w_j with Equation 2; end for
t_r^max <- max over j = 0..k_r - 1 of t_r_j; t_w^max <- max over j = 0..k_w - 1 of t_w_j
Calculate t_r^next and t_w^next with Equation 3.

5. IMPLEMENTATION
5.1. Software-Hardware Interface

The FIRM software interface provides memory controllers with the required information to identify persistent sources during the source categorization stage. This includes 1) the identification of the persistent application, 2) the communication of the execution of memory

Footnote 10: This algorithm can be invoked only once at the beginning of each interval to determine the duration of consecutive read and write drain modes for the interval.


More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Real-time Scheduling

Real-time Scheduling Real-tme Schedulng COE718: Embedded System Desgn http://www.ee.ryerson.ca/~courses/coe718/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrcal and Computer Engneerng Ryerson Unversty Overvew RTX

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Advanced Computer Networks

Advanced Computer Networks Char of Network Archtectures and Servces Department of Informatcs Techncal Unversty of Munch Note: Durng the attendance check a stcker contanng a unque QR code wll be put on ths exam. Ths QR code contans

More information

ADRIAN PERRIG & TORSTEN HOEFLER ( -6- ) Networks and Operatng Systems Chapter 6: Demand Pagng Page Table Structures Page table structures Page table structures Problem: smple lnear table s too bg Problem:

More information

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems Relablty and Energy-aware Cache Reconfguraton for Embedded Systems Yuanwen Huang and Prabhat Mshra Department of Computer and Informaton Scence and Engneerng Unversty of Florda, Ganesvlle FL 326-62, USA

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

CS 268: Lecture 8 Router Support for Congestion Control

CS 268: Lecture 8 Router Support for Congestion Control CS 268: Lecture 8 Router Support for Congeston Control Ion Stoca Computer Scence Dvson Department of Electrcal Engneerng and Computer Scences Unversty of Calforna, Berkeley Berkeley, CA 9472-1776 Router

More information

3. CR parameters and Multi-Objective Fitness Function

3. CR parameters and Multi-Objective Fitness Function 3 CR parameters and Mult-objectve Ftness Functon 41 3. CR parameters and Mult-Objectve Ftness Functon 3.1. Introducton Cogntve rados dynamcally confgure the wreless communcaton system, whch takes beneft

More information

Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs

Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs Utlty-Based Acceleraton of Multthreaded Applcatons on Asymmetrc CMPs José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt ECE Department The Unversty of Texas at Austn Austn, TX, USA {joao, patt}@ece.utexas.edu

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Space-Optimal, Wait-Free Real-Time Synchronization

Space-Optimal, Wait-Free Real-Time Synchronization 1 Space-Optmal, Wat-Free Real-Tme Synchronzaton Hyeonjoong Cho, Bnoy Ravndran ECE Dept., Vrgna Tech Blacksburg, VA 24061, USA {hjcho,bnoy}@vt.edu E. Douglas Jensen The MITRE Corporaton Bedford, MA 01730,

More information

A Frame Packing Mechanism Using PDO Communication Service within CANopen

A Frame Packing Mechanism Using PDO Communication Service within CANopen 28 A Frame Packng Mechansm Usng PDO Communcaton Servce wthn CANopen Mnkoo Kang and Kejn Park Dvson of Industral & Informaton Systems Engneerng, Ajou Unversty, Suwon, Gyeongg-do, South Korea Summary The

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Goals and Approach Type of Resources Allocation Models Shared Non-shared Not in this Lecture In this Lecture

Goals and Approach Type of Resources Allocation Models Shared Non-shared Not in this Lecture In this Lecture Goals and Approach CS 194: Dstrbuted Systems Resource Allocaton Goal: acheve predcable performances Three steps: 1) Estmate applcaton s resource needs (not n ths lecture) 2) Admsson control 3) Resource

More information

Maintaining temporal validity of real-time data on non-continuously executing resources

Maintaining temporal validity of real-time data on non-continuously executing resources Mantanng temporal valdty of real-tme data on non-contnuously executng resources Tan Ba, Hong Lu and Juan Yang Hunan Insttute of Scence and Technology, College of Computer Scence, 44, Yueyang, Chna Wuhan

More information

A Hybrid Genetic Algorithm for Routing Optimization in IP Networks Utilizing Bandwidth and Delay Metrics

A Hybrid Genetic Algorithm for Routing Optimization in IP Networks Utilizing Bandwidth and Delay Metrics A Hybrd Genetc Algorthm for Routng Optmzaton n IP Networks Utlzng Bandwdth and Delay Metrcs Anton Redl Insttute of Communcaton Networks, Munch Unversty of Technology, Arcsstr. 21, 80290 Munch, Germany

More information

AP PHYSICS B 2008 SCORING GUIDELINES

AP PHYSICS B 2008 SCORING GUIDELINES AP PHYSICS B 2008 SCORING GUIDELINES General Notes About 2008 AP Physcs Scorng Gudelnes 1. The solutons contan the most common method of solvng the free-response questons and the allocaton of ponts for

More information

Sample Solution. Advanced Computer Networks P 1 P 2 P 3 P 4 P 5. Module: IN2097 Date: Examiner: Prof. Dr.-Ing. Georg Carle Exam: Final exam

Sample Solution. Advanced Computer Networks P 1 P 2 P 3 P 4 P 5. Module: IN2097 Date: Examiner: Prof. Dr.-Ing. Georg Carle Exam: Final exam Char of Network Archtectures and Servces Department of Informatcs Techncal Unversty of Munch Note: Durng the attendance check a stcker contanng a unque QR code wll be put on ths exam. Ths QR code contans

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Routing in Degree-constrained FSO Mesh Networks

Routing in Degree-constrained FSO Mesh Networks Internatonal Journal of Hybrd Informaton Technology Vol., No., Aprl, 009 Routng n Degree-constraned FSO Mesh Networks Zpng Hu, Pramode Verma, and James Sluss Jr. School of Electrcal & Computer Engneerng

More information

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems:

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems: Speed/RAP/CODA Presented by Octav Chpara Real-tme Systems Many wreless sensor network applcatons requre real-tme support Survellance and trackng Border patrol Fre fghtng Real-tme systems: Hard real-tme:

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems An Investgaton nto Server Parameter Selecton for Herarchcal Fxed Prorty Pre-emptve Systems R.I. Davs and A. Burns Real-Tme Systems Research Group, Department of omputer Scence, Unversty of York, YO10 5DD,

More information

A QoS-aware Scheduling Scheme for Software-Defined Storage Oriented iscsi Target

A QoS-aware Scheduling Scheme for Software-Defined Storage Oriented iscsi Target A QoS-aware Schedulng Scheme for Software-Defned Storage Orented SCSI Target Xanghu Meng 1,2, Xuewen Zeng 1, Xao Chen 1, Xaozhou Ye 1,* 1 Natonal Network New Meda Engneerng Research Center, Insttute of

More information

Design and Analysis of Algorithms

Design and Analysis of Algorithms Desgn and Analyss of Algorthms Heaps and Heapsort Reference: CLRS Chapter 6 Topcs: Heaps Heapsort Prorty queue Huo Hongwe Recap and overvew The story so far... Inserton sort runnng tme of Θ(n 2 ); sorts

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

an assocated logc allows the proof of safety and lveness propertes. The Unty model nvolves on the one hand a programmng language and, on the other han

an assocated logc allows the proof of safety and lveness propertes. The Unty model nvolves on the one hand a programmng language and, on the other han UNITY as a Tool for Desgn and Valdaton of a Data Replcaton System Phlppe Quennec Gerard Padou CENA IRIT-ENSEEIHT y Nnth Internatonal Conference on Systems Engneerng Unversty of Nevada, Las Vegas { 14-16

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments Comparson of Heurstcs for Schedulng Independent Tasks on Heterogeneous Dstrbuted Envronments Hesam Izakan¹, Ath Abraham², Senor Member, IEEE, Václav Snášel³ ¹ Islamc Azad Unversty, Ramsar Branch, Ramsar,

More information

#4 Inverted page table. The need for more bookkeeping. Inverted page table architecture. Today. Our Small Quiz

#4 Inverted page table. The need for more bookkeeping. Inverted page table architecture. Today. Our Small Quiz ADRIAN PERRIG & TORSTEN HOEFLER Networks and Operatng Systems (-6-) Chapter 6: Demand Pagng http://redmne.replcant.us/projects/replcant/wk/samsunggalaxybackdoor () # Inverted table One system-wde table

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

WIRELESS communication technology has gained widespread

WIRELESS communication technology has gained widespread 616 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 4, NO. 6, NOVEMBER/DECEMBER 2005 Dstrbuted Far Schedulng n a Wreless LAN Ntn Vadya, Senor Member, IEEE, Anurag Dugar, Seema Gupta, and Paramvr Bahl, Senor

More information

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks www.hurray.sep.pp.pt Techncal Report -GAME: An Implct GTS Allocaton Mechansm n IEEE 802.15.4 for Tme- Senstve Wreless Sensor etworks Ans Koubaa Máro Alves Eduardo Tovar TR-060706 Verson: 1.0 Date: Jul

More information

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) ,

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) , VRT012 User s gude V0.1 Thank you for purchasng our product. We hope ths user-frendly devce wll be helpful n realsng your deas and brngng comfort to your lfe. Please take few mnutes to read ths manual

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

On the Fairness-Efficiency Tradeoff for Packet Processing with Multiple Resources

On the Fairness-Efficiency Tradeoff for Packet Processing with Multiple Resources On the Farness-Effcency Tradeoff for Packet Processng wth Multple Resources We Wang, Chen Feng, Baochun L, and Ben Lang Department of Electrcal and Computer Engneerng, Unversty of Toronto {wewang, cfeng,

More information

Fibre-Optic AWG-based Real-Time Networks

Fibre-Optic AWG-based Real-Time Networks Fbre-Optc AWG-based Real-Tme Networks Krstna Kunert, Annette Böhm, Magnus Jonsson, School of Informaton Scence, Computer and Electrcal Engneerng, Halmstad Unversty {Magnus.Jonsson, Krstna.Kunert}@de.hh.se

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Intro. Iterators. 1. Access

Intro. Iterators. 1. Access Intro Ths mornng I d lke to talk a lttle bt about s and s. We wll start out wth smlartes and dfferences, then we wll see how to draw them n envronment dagrams, and we wll fnsh wth some examples. Happy

More information

The Impact of Delayed Acknowledgement on E-TCP Performance In Wireless networks

The Impact of Delayed Acknowledgement on E-TCP Performance In Wireless networks The mpact of Delayed Acknoledgement on E-TCP Performance n Wreless netorks Deddy Chandra and Rchard J. Harrs School of Electrcal and Computer System Engneerng Royal Melbourne nsttute of Technology Melbourne,

More information

Analysis of Collaborative Distributed Admission Control in x Networks

Analysis of Collaborative Distributed Admission Control in x Networks 1 Analyss of Collaboratve Dstrbuted Admsson Control n 82.11x Networks Thnh Nguyen, Member, IEEE, Ken Nguyen, Member, IEEE, Lnha He, Member, IEEE, Abstract Wth the recent surge of wreless home networks,

More information

MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices

MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices MQSm: A Framework for Enablng Realstc Studes of Modern Mult-Queue SSD Devces Arash Tavakkol, Juan Gómez-Luna, and Mohammad Sadrosadat, ETH Zürch; Saugata Ghose, Carnege Mellon Unversty; Onur Mutlu, ETH

More information

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams Self-Tunng, Bandwdth-Aware Montorng for Dynamc Data Streams Navendu Jan #, Praveen Yalagandula, Mke Dahln #, Yn Zhang # # Unversty of Texas at Austn HP Labs Abstract We present, a self-tunng, bandwdth-aware

More information

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011 9/8/2 2 Outlne Appendx C: The Bascs of Logc Desgn TDT4255 Computer Desgn Case Study: TDT4255 Communcaton Module Lecture 2 Magnus Jahre 3 4 Dgtal Systems C.2: Gates, Truth Tables and Logc Equatons All sgnals

More information

IP Camera Configuration Software Instruction Manual

IP Camera Configuration Software Instruction Manual IP Camera 9483 - Confguraton Software Instructon Manual VBD 612-4 (10.14) Dear Customer, Wth your purchase of ths IP Camera, you have chosen a qualty product manufactured by RADEMACHER. Thank you for the

More information

Lecture 7 Real Time Task Scheduling. Forrest Brewer

Lecture 7 Real Time Task Scheduling. Forrest Brewer Lecture 7 Real Tme Task Schedulng Forrest Brewer Real Tme ANSI defnes real tme as A Real tme process s a process whch delvers the results of processng n a gven tme span A data may requre processng at a

More information

SRB: Shared Running Buffers in Proxy to Exploit Memory Locality of Multiple Streaming Media Sessions

SRB: Shared Running Buffers in Proxy to Exploit Memory Locality of Multiple Streaming Media Sessions SRB: Shared Runnng Buffers n Proxy to Explot Memory Localty of Multple Streamng Meda Sessons Songqng Chen,BoShen, Yong Yan, Sujoy Basu, and Xaodong Zhang Department of Computer Scence Moble and Meda System

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL)

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL) Crcut Analyss I (ENG 405) Chapter Method of Analyss Nodal(KCL) and Mesh(KVL) Nodal Analyss If nstead of focusng on the oltages of the crcut elements, one looks at the oltages at the nodes of the crcut,

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

CE 221 Data Structures and Algorithms

CE 221 Data Structures and Algorithms CE 1 ata Structures and Algorthms Chapter 4: Trees BST Text: Read Wess, 4.3 Izmr Unversty of Economcs 1 The Search Tree AT Bnary Search Trees An mportant applcaton of bnary trees s n searchng. Let us assume

More information

A fair buffer allocation scheme

A fair buffer allocation scheme A far buffer allocaton scheme Juha Henanen and Kalev Klkk Telecom Fnland P.O. Box 228, SF-330 Tampere, Fnland E-mal: juha.henanen@tele.f Abstract An approprate servce for data traffc n ATM networks requres

More information