Cache and I/O Efficient Functional Algorithms


Guy E. Blelloch, Robert Harper
Carnegie Mellon University

Abstract

The widely studied I/O and ideal-cache models were developed to account for the large difference in costs to access memory at different levels of the memory hierarchy. Both models are based on a two level memory hierarchy with a fixed size primary memory (cache) of size $M$, and an unbounded secondary memory organized in blocks of size $B$. The cost measure is based purely on the number of block transfers between the primary and secondary memory. All other operations are free. Many algorithms have been analyzed in these models and indeed these models predict the relative performance of algorithms much more accurately than the standard RAM model. The models, however, require specifying algorithms at a very low level, requiring the user to carefully lay out their data in arrays in memory and manage their own memory allocation. In this paper we present a cost model for analyzing the memory efficiency of algorithms expressed in a simple functional language. We show how some algorithms written in standard forms using just lists and trees (no arrays) and requiring no explicit memory layout or memory management are efficient in the model. We then describe an implementation of the language and show provable bounds for mapping the cost in our model to the cost in the ideal-cache model. These bounds imply that purely functional programs based on lists and trees with no special attention to any details of memory layout can be asymptotically as efficient as the carefully designed imperative I/O efficient algorithms. For example we describe an $O(\frac{n}{B}\log_{M/B}\frac{n}{B})$ cost sorting algorithm, which is optimal in the ideal cache and I/O models.

Categories and Subject Descriptors: D.3.1 [Programming Languages]: Formal Definitions and Theory; F.2.2 [Analysis of Algorithms and Problem Complexity]: Tradeoffs and Complexity Measures; F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages

General Terms: Algorithms, Design, Languages, Performance, Theory.

Keywords: cost semantics, I/O algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. POPL '13, January 23-25, 2013, Rome, Italy. Copyright © 2013 ACM /13/01... $10.00

1. Introduction

On today's computers there is a vast difference in cost for accessing different levels of the memory hierarchy, whether it be registers, one of many levels of cache, the main memory, or a disk. On current processors, for example, there is over a factor of a hundred between the time to access a register and main memory, and another factor of a hundred or so between main memory and disk, even a solid state disk (SSD). This variance in costs is contrary to the standard Random Access Machine (RAM) model, which assumes that the cost of accessing memory is uniform. To account for this non-uniformity several cost models have been developed that assign different costs to different levels of the memory hierarchy. The widely used I/O [2] and ideal-cache [9] models both assume a two level memory hierarchy with a fixed size primary memory (cache) of size $M$, and an unbounded secondary memory partitioned into blocks of size $B$. Cost is measured in terms of the number of block transfers between primary and secondary memory; all other operations are considered free. The parameters $M$ and $B$ are considered variables for the sake of analysis and therefore show up in asymptotic bounds. Algorithms that do well in these models are often referred to as I/O efficient or cache efficient; in this paper we will generically use the term cache efficient. The theory of cache efficient algorithms is now well developed (see e.g.
the surveys [4, 6, 10, 15, 17, 22]) and the models indeed capture the relative cost of algorithms on real machines much more accurately than the RAM model does. This is true both in the context of algorithms that must run off disk when there is not enough main memory, and also in the context of algorithms that can fit in main memory, but not in various levels of the cache. For example, the models properly indicate that a blocked or hierarchical matrix-matrix multiply is much more efficient than the naïve triply nested loop ($\Theta(n^3/(B\sqrt{M}))$ vs. $\Theta(n^3)$). In the RAM model they have equal costs. The models also indicate that properly implemented versions of mergesort and quicksort are reasonably cache efficient but that samplesort and multiway mergesort are more efficient, and in fact optimal. Correspondingly all the fastest disk sorts indeed use some variant of samplesort or multiway mergesort, as the theory predicts [19].

Although the study of cache efficient algorithms has been very successful in identifying algorithms that are fast in practice, not surprisingly designing and programming algorithms for these models requires a careful layout of memory and careful management of space. Both temporal and spatial locality are critical in achieving good bounds. Spatial locality is important since memory is moved in blocks of size $B$, corresponding to either cache lines or memory pages. For example, although merging two arrays of integers is reasonably efficient, the cost of merging two linked lists will depend on how the links are laid out in memory and needs to be considered with care. Care is also needed when allocating and freeing memory since touching unused memory incurs a cache miss. It is therefore important to reuse freed space immediately rather than returning it to a pool which might be evicted by the time it is reused; a generic memory allocator or garbage collection scheme will likely not do the right thing. To properly manage this problem, memory
is typically preallocated and fully managed by the user/algorithm designer.

Needless to say, this form of programming is inconsistent with functional programming, especially when using recursive data types such as lists or trees. However, it is known experimentally that by using certain standard memory allocation schemes purely functional programs (no side effects) can be reasonably cache efficient with regard to both spatial and temporal locality [7, 8, 12, 23]. We give two examples. Firstly, consider applying map with some simple function (e.g. increment) over a list of $n$ integers, and then applying the same map to the output. If the allocator keeps a pointer that gets incremented on each allocation, then after the first map all the cells of the list will be allocated adjacently. On the second map, since the allocations are adjacent, reading the whole list will only incur $O(n/B)$ cache misses, where $B$ is the block size, and evicting the newly generated blocks will also incur only $O(n/B)$ misses. This gives $O(n/B)$ cost, which asymptotically matches the cost of an optimal array version in an imperative setting. If the list were in an arbitrary order, the cost would be $O(n)$. All we have done is noted that the temporal locality of the allocations will lead to spatial locality of how they are laid out in memory.

Secondly, consider a block recursive matrix multiply on two $n \times n$ matrices. Such an algorithm will never require more than $O(n^2)$ live space but if recursion stops at problems of a constant size it will allocate a total of $O(n^3)$ space. Assuming that the maximum live space fits within the cache we should be able to run our matrix multiply with only $O(n^2/B)$ cache misses, needed for loading the two matrices and storing the result, but this would require being careful about reusing freed space that is already in cache. Fortunately generational garbage collectors have approximately this effect [8].
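The first example (reading a list whose cells were allocated adjacently versus one laid out arbitrarily) can be illustrated with a small simulation. This is only a sketch under stated assumptions: each list cell occupies one word, the cache is modeled as LRU rather than ideal, and the names `count_misses` and `list_traversal_misses` are hypothetical helpers introduced here, not part of the paper.

```python
from collections import OrderedDict
import random


def count_misses(addresses, block_size, cache_blocks):
    """Count misses of an LRU cache that holds `cache_blocks` blocks of `block_size` words."""
    cache = OrderedDict()  # block id -> None, ordered from least to most recently used
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the block
            cache[block] = None
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses


def list_traversal_misses(n, block_size, cache_blocks, adjacent, seed=0):
    """Misses incurred by reading n one-word list cells in list order.

    adjacent=True models bump allocation (cell i lives at address i);
    adjacent=False models an arbitrary layout (assumption: a random permutation).
    """
    addresses = list(range(n))
    if not adjacent:
        random.Random(seed).shuffle(addresses)
    return count_misses(addresses, block_size, cache_blocks)


if __name__ == "__main__":
    print(list_traversal_misses(4096, 64, 4, adjacent=True))   # one miss per block
    print(list_traversal_misses(4096, 64, 4, adjacent=False))  # close to one miss per cell
```

With adjacent cells the traversal touches each block exactly once, so the miss count is $n/B$; with a shuffled layout almost every cell access loads a new block.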
In particular if we make the first generation smaller than the size of the cache ($M$) then we will reclaim the memory whenever the allocation area fills, and reuse memory that is already in cache. This does not quite work in general since what is live at the time of the minor collection might get bumped from cache, but it gives some indication that it is not hopeless to make the natural recursive matrix multiply algorithm, as well as similar recursive algorithms, cache efficient.

We show that one can indeed implement cache-efficient algorithms in a call-by-value functional setting using recursive data types, and get provably efficient bounds on cache complexity. In particular we show that one can express algorithms at a high level using standard techniques and achieve optimal asymptotic performance when implemented on the ideal cache. Of course we do not expect the algorithm designer to understand the intricacies of how the garbage collector works in order to analyze their algorithm. Instead our approach consists of providing a reasonably high-level cost semantics that abstracts away from implementation details such as the garbage collection method, but still admits precise analysis of the cost of an algorithm on a two-level memory architecture. We then describe a provably-efficient implementation of the language on the ideal-cache model. We show that by using this implementation the costs analyzed in the high-level cost model asymptotically match the number of cache misses in the underlying ideal-cache model. The general idea of using high-level cost models based on a cost semantics along with a provably efficient implementation that maps the cost onto a lower level machine model has previously been used in the context of parallel cost models [5, 11, 13, 21].

Our high-level cost model consists of an operational semantics for a call-by-value variant of PCF in which we make explicit the allocation of and access to data objects. The store consists of three parts: a main memory, an allocation cache and a read cache.
Both caches have size $M$ and the memory is organized in blocks of size $B$ (both measured in terms of abstract data objects). Data can migrate from the allocation cache to memory and from memory to the read cache, always in blocks of size $B$. Allocations are made in the allocation cache, and if the number of live objects in the cache exceeds $M$, then the $B$ oldest locations are evicted to memory as a block, having unit cost. The read cache contains a subset of the memory blocks. A read has no cost if its location is in the read or allocation cache, otherwise it requires loading a block from memory into the read cache, having unit cost, and possibly ejecting an existing block. Hence the only costs are for evicting a block from the allocation cache or loading a block into the read cache. Since we are only concerned with measuring the traffic between main memory and cache memory, garbage collection for main memory is not modeled, but we do account for the detection of live objects, and their migration to main memory, when the cache limit is exceeded.

The provable implementation uses a generational collector to maintain the allocation cache. It uses a nursery of size $2M$ and allocates until the space runs out. It then traces the nursery for the live data. If there is $L > M$ live data, then $L - M$ locations are written to memory in blocks of $B$, leaving the nursery with at most $M$ locations. The implementation allocates the stack in the heap and must amortize the cost of loading old stack frames against other operations since they are not modeled in the high-level cost semantics. We emphasize that the algorithm designer need not know anything about the garbage collector or how the stack is managed to analyze their algorithm; these concepts are only part of the provable implementation. The cost model is described in Section 3 and the provable implementation is described in Sections 4 and 5.
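The eviction step of the collector described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the tracing phase has already produced the live locations in allocation order, that the oldest $L - M$ of them are the ones written out, and that a final partial block is allowed. The function name `minor_collection` is hypothetical.

```python
def minor_collection(live_oldest_first, M, B):
    """Return (blocks_written, kept) for one collection of the nursery.

    live_oldest_first: live nursery locations after tracing, oldest first.
    M: maximum number of live locations left in the nursery.
    B: block size used for writes to main memory.
    """
    live = list(live_oldest_first)
    excess = max(0, len(live) - M)        # L - M when L > M, otherwise nothing to evict
    evicted, kept = live[:excess], live[excess:]
    # Group the evicted locations into blocks of B (assumption: last block may be partial).
    blocks = [evicted[i:i + B] for i in range(0, excess, B)]
    return blocks, kept


if __name__ == "__main__":
    blocks, kept = minor_collection(range(10), M=4, B=4)
    print(blocks)  # the oldest six locations, grouped into blocks
    print(kept)    # the four newest locations stay in the nursery
```

The number of blocks returned is the memory traffic charged to this collection; the nursery is left with at most $M$ live locations, as the text requires.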
To demonstrate the utility of our approach, in Section 6 we describe some general techniques for analyzing the cost of algorithms in our model and show three examples of how to analyze the cost of algorithms in the model: mergesort, k-way mergesort and matrix multiply. Importantly our results on sorting and matrix multiply match the bounds for algorithms implemented directly in the ideal-cache model ($O(\frac{n}{B}\log_{M/B}\frac{n}{B})$ and $O(\frac{n^3}{B\sqrt{M}})$ respectively). The bounds for sorting are optimal. Because of our provable implementation bounds these results imply that on the ideal-cache model our algorithms written in a functional style using lists and trees are asymptotically as efficient as the low-level imperative programs. To analyze the algorithms we introduce the notion of a data structure being compact with respect to a traversal order. This is the way we capture the spatial locality of data structures in a language that has no explicit way to express memory layout.

Related Work

Although there has been a large amount of experimental work on showing how good garbage collection can lead to efficient use of caches and disks ([7, 8, 12, 23] and many references in [14]), we know of none that try to prove bounds for algorithms for functional programs when manipulating recursive data types such as lists or trees. Abello et al. [1] show how a functional style can be used to design cache efficient graph algorithms. They however assume that data structures are in arrays (called lists), and that primitives for operations such as sorting, map, filter and reductions are supplied and implemented with optimal asymptotic cost (presumably at a lower level using imperative code). Their goal is therefore to design graph algorithms by composing these high-level operations on collections. They do not explain how to deal with garbage collection or memory management.

2. Background

I/O and Caching Models

The two-level I/O model of Aggarwal and Vitter [2] assumes a memory hierarchy consisting of main memory of size $M$ and an unbounded secondary memory (see footnote 1). Both memories are partitioned into blocks of size $B$ of consecutive memory locations. All computation must be performed from main memory, which is treated like a standard RAM, but there is an additional instruction for moving a block of memory from secondary memory to main memory and one for moving the other way. The cost of an algorithm is analyzed in terms of the number of block transfers; the cost of operations within the main memory is ignored.

Many algorithms can be analyzed in this model and it is perhaps surprising how accurately it is able to capture the relative performance of algorithms. In their original work, for example, Aggarwal and Vitter showed tight upper and lower bounds for sorting $n$ keys, with I/O cost $\Theta(\frac{n}{B}\log_{M/B}\frac{n}{B})$. The two algorithms that match this bound are a multiway mergesort and a distribution sort, which are the standard algorithms used for disk based sorting, and they both perform significantly better than quicksort or standard mergesort. These algorithms are more efficient since they do not need to pass over the data as many times.

The I/O model can capture either the distinction between cache and main memory or between main memory and disk. In the first case the memory size corresponds to the cache size and the block size to the cache-line size, and in the second case the memory size corresponds to the main memory size and the block size to the page size (or whatever the transfer size between the disk and main memory is). One might note, however, that while the I/O model assumes two address spaces and the user explicitly moves data between them, a machine with caches assumes a single address space and makes its own decisions about what gets evicted from cache, e.g., using a least recently used (LRU) policy.

The ideal-cache model [9] can be used to better model a cache. It is similar to the I/O model but assumes the primary memory is treated as a cache with an ideal eviction policy.
In particular the programmer only accesses one address space and a block is brought into the cache when a memory location is accessed whose block is not already in cache. Bringing in a block might require evicting another block from the cache. The model assumes that the best decision is always made, which is to evict the line used furthest in the future (the optimal off-line replacement policy). Since in practice we don't know the future, this is not possible on-line, but Sleator and Tarjan's seminal work on competitive paging [20] proves that an LRU policy is always competitive with the optimal strategy (within constant factors in time and cache size). Therefore from a theoretical point of view the models are asymptotically the same. In this paper we will be using the ideal-cache model for simplicity although the results also apply to the I/O model.

The ideal-cache model is often used in the context of cache-oblivious algorithms. These are simply algorithms that do not make any decisions based on the cache parameters $M$ and $B$, although of course the analysis of cache complexity will depend on $M$ and $B$. The advantage of cache-oblivious algorithms is that since they are oblivious to the cache parameters they work across multiple levels of a cache hierarchy simultaneously. Most of the algorithms in this paper are cache oblivious, but our k-way mergesort is not. We leave it as an open question whether it is possible to develop an I/O-efficient cache-oblivious sorting algorithm in our model.

Footnote 1: Aggarwal and Vitter also considered a version of the model with parallel disk access, but most interesting results are explained with the single disk version.

Cache Efficient Algorithms

We now review some basic well known results on cache efficient algorithms in the imperative setting. The functional algorithms we present in Section 6 are based on the algorithms described here, but do not require arrays or explicit memory management. We first consider mergesort.
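Before turning to the algorithms, the difference between LRU and the ideal (furthest-in-the-future) eviction policy discussed above can be made concrete with a small simulation. This is a sketch for illustration only; the trace of block identifiers and the function names `lru_misses` and `opt_misses` are assumptions made here and do not come from the paper.

```python
def lru_misses(trace, capacity):
    """Misses of a least-recently-used cache holding `capacity` blocks."""
    cache = []  # least recently used block first
    misses = 0
    for block in trace:
        if block in cache:
            cache.remove(block)
        else:
            misses += 1
            if len(cache) == capacity:
                cache.pop(0)          # evict the least recently used block
        cache.append(block)           # block is now the most recently used
    return misses


def opt_misses(trace, capacity):
    """Misses of the off-line optimal policy: evict the block used furthest in the future."""
    cache = set()
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(b):
                # Index of the next access to b, or infinity if it is never used again.
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses


if __name__ == "__main__":
    trace = [0, 1, 2, 3] * 5          # cyclic scan over four blocks
    print(lru_misses(trace, 3), opt_misses(trace, 3))
    print(lru_misses(trace, 4))       # one extra block removes the gap on this trace
```

A cyclic scan over one block more than fits in cache is a classic bad case for LRU: every access misses, while the optimal policy misses far less often. Giving LRU a somewhat larger cache closes the gap on this trace, which is in the spirit of the competitiveness result cited above.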
Throughout our discussion we assume that the elements being sorted each fit in a single machine word. All cache efficient algorithms we know of for sorting store the input and output elements directly in arrays. First consider merging two arrays of keys A and B into an output array C of length $n$ (as usual we assume the inputs are sorted in A and B in increasing order). The standard sequential algorithm for merging starts at the beginning of each array keeping a finger on each, finding the lesser of the two keys at the fingers, copying this key to C, and incrementing the appropriate finger. This algorithm has a cache complexity $O(n/B)$ as long as $M \ge 3B$. This can be seen by noting that at any given time we only need one block from each of A, B and C resident in cache, and that we fully process the block before needing the next block. Therefore every block is only needed once.

For mergesort we assume the standard divide-and-conquer version, which recursively sorts each half of the array and then merges the result. Since merging as described cannot be done in place we have to be specific about how to manage memory. In particular allocating a new array for the result and then freeing the two old arrays using a general purpose memory allocator will likely not lead to the desired bounds (unless one can ensure special properties of the memory allocator). Instead the algorithm needs to pre-allocate a temporary array of length $n$ and pass parts of this array to all subcalls. In particular mergesort could take as arguments both the input array and an equal length temporary array. The result is returned in the input array and the temporary array is used to merge into. Although these optimizations are relatively obvious and standard to programmers of imperative code, we bring them up to emphasize the care that needs to be taken to ensure the cache bounds; it is not simply an issue of reducing the number of calls to the memory allocator, it can actually asymptotically affect the cache bounds.

The cache complexity of this mergesort can be analyzed by considering two cases.
The first is when the full computation fits in cache. In this case the two arrays need only be loaded into cache once and all the work can be done in cache. The problem fits in cache as long as $2n + \log n \le M$, where the $\log n$ accounts for the stack size. The second case is when the problem does not fit in cache. In this case we have to pay for the cache misses on the two recursive calls plus the cache misses of the merge. This gives the following recurrence for the cache complexity $Q(n)$:

$$Q(n) = \begin{cases} 2Q(n/2) + O(n/B) & 2n + \log n > M \\ O(n/B) & \text{otherwise} \end{cases} \qquad (1)$$

The solution to this recurrence can be derived by noting that the top $\log_2(2n/M)$ levels of the recursion do not fit in memory while the lower levels do. The total cache complexity across each of the upper levels is $O(n/B)$ so the total overall cache complexity is $O((n/B)\log_2(n/M))$. We note that this does not match the optimal cache complexity for sorting but is significantly better than simply assuming every access is a cache miss; specifically, a factor of $B \log n/(\log n - \log M)$ better. For sorting words in a memory with $10^9$ words and a block size of $10^3$ words, it is a factor of about 4000 better.

Quicksort has basically the same complexity as mergesort, although in the expected case. This is because scanning the input array to split it into the lesser and larger elements can be done using two fingers like in merging, so again each block only needs to be loaded once.

We now describe a sort that is optimal for the I/O model. The idea is, instead of partitioning the input array into two and recursively calling sort on each, to partition the input array into $k$ parts, sort each part, and then merge all the parts. Since instead of having just two arrays to merge we have $k$ arrays, we require a $k$-way merge. Without going into too much detail, such a merge
can be implemented using one block of memory for each of the inputs needing to be merged as well as one block for the output. We keep a finger on each input and on each step select the minimum key at the fingers, move it to the output buffer and increment that finger. As long as all input blocks, the output block and any data for maintaining the fingers fit in cache, the $k$-way merge will run with cache complexity $O(n/B)$, which is the same as the binary merge. Since there will be $k$ input blocks and 1 output block, the space needed for the blocks is $(k + 1)B$. Therefore accounting for overheads everything will fit in cache as long as $ckB \le M$, or equivalently $k \le M/(cB)$ for some constant $c$. We therefore pick $k$ to be as large as possible, giving $k = M/(cB)$. As in the two way merge we need to be careful about allocation and preallocate temporary arrays to copy the output. We again can analyze the algorithm by considering the case when the problem fits in memory and when it does not. This gives the recurrence:

$$Q(n) = \begin{cases} \frac{M}{cB}\, Q\!\left(\frac{n}{M/(cB)}\right) + O(n/B) & c'n > M \\ O(n/B) & \text{otherwise} \end{cases} \qquad (2)$$

where $c$ and $c'$ are constants, but $n$, $B$ and $M$ are variables. This solves to $O((n/B)\log_{M/B}(n/B))$. This bound matches the lower bound for sorting in the I/O model [2] and hence also the ideal-cache model. The k-way mergesort is therefore asymptotically optimal.

3. Cost Semantics

In this section we define an evaluation dynamics that assigns a cost to a complete execution of a program. Following the I/O model, the cost measures the cache complexity, which is defined to be the traffic caused by the transfer of objects between the main memory and the memory cache. Accesses to objects in cache are considered to be cost-free, whereas migration of objects from cache into memory and from memory into the cache are charged unit cost. The dynamics is based on a two-level model of storage that includes a fixed-size allocation cache and a fixed-size read cache together with a main memory of unbounded size. The evaluation dynamics provides the basis for assessing both the correctness and the cache complexity of programs.
It is formulated at a sufficiently abstract level to free the programmer from having to reason directly about the compiler and run-time system, but is sufficiently concrete as to admit an implementation with a provable bound on its cache complexity. Thus, we may achieve the same overall results as are obtained using only low-level machine models in previous work on I/O algorithms, while working at the much more practical level of abstraction offered by functional programming languages.

We give the dynamics of a call-by-value variant of Plotkin's PCF language [18]. The syntax of expressions is summarized by the following grammar:

$$e ::= x \mid \mathtt{z} \mid \mathtt{s}(e) \mid \mathtt{ifz}(e; e_0; x.e_1) \mid \mathtt{fun}(x, y.e) \mid \mathtt{app}(e_1; e_2)$$

The conditional tests whether a number is zero or not, and passes the predecessor to the non-zero case. Functions are equipped with a name for themselves to allow for recursion. The typing rules are standard, and are omitted here for the sake of concision. (See, for example, Chapter 10 of [13].)

For illustrative purposes natural numbers are treated as heap-allocated data structures of unbounded size (as will become evident shortly). It is straightforward to extend the language to account for a richer variety of data structures, including sum, product, finite sequence, and recursive types, and to account for typical hardware-oriented concepts such as machine words and floating point numbers.

Storage Model

Following Morrisett, et al. [16], the dynamics distinguishes large from small values, with large values being allocated in memory and represented by a location, and small values being those that are manipulated directly. In the present case the only small values are locations, but it is also possible to consider, for example, fixed-size numbers as forms of small value. Correspondingly, all other forms of value (numbers and functions) are large. We also allocate stack frames, which reify the control state of evaluation, in memory. A memory object is either a large value or a stack frame.
The two-level memory model is parameterized by two constants, the block size $B$, and the cache size $M = c \times B$ determined by some constant $c$ representing the number of blocks in the cache. A memory $\mu$ is a finite mapping assigning a memory object to each of a finite set $\mathrm{dom}(\mu)$ of abstract locations. The memory may grow without bound. (We do not consider here the separate problem of garbage collection for main memory, for which see Morrisett, et al. [16].) As a technical convenience, we assume that locations are divided into two classes, value locations, $l$, and stack locations, $s$, and require that a memory map value locations to large values and stack locations to stack frames. When the distinction is immaterial, we speak simply of locations and objects in memory.

A memory $\mu$ comes equipped with an equivalence relation $l \sim_\mu l'$ over $\mathrm{dom}(\mu)$ specifying that $l$ and $l'$ are neighbors in $\mu$. Additionally, we require that each equivalence class in the domain of a memory is of size $B$. A memory whose domain consists of a single equivalence class of size $B$ is called a block. The neighborhood $\mathrm{nbhd}(\mu, l)$ of a location $l \in \mathrm{dom}(\mu)$ is the restriction of $\mu$ to the neighbors of $l$ in $\mu$, a single block. The expansion $\mu \otimes \beta$ of a memory $\mu$ by a block $\beta$ such that $\mathrm{dom}(\beta) \cap \mathrm{dom}(\mu) = \emptyset$ is the memory $\mu'$ that agrees with $\mu$ and $\beta$ on their respective domains and for which $l \sim_{\mu'} l'$ iff $l \sim_\mu l'$ or $l \sim_\beta l'$.

There are two forms of cache mediating access to memory. A read cache $\rho$ for a memory $\mu$ is the restriction of $\mu$ to a finite set of locations of size at most $M$. The contraction $\rho \ominus \beta$ of a read cache $\rho$ by a block $\beta \subseteq \rho$ is the read cache $\rho'$ such that $\rho = \rho' \otimes \beta$. A nursery $\nu$ is a finite mapping that associates an object to each of a finite set $\mathrm{dom}(\nu)$ of locations. A nursery comes equipped with a linear ordering $l \prec_\nu l'$ of $\mathrm{dom}(\nu)$, called the allocation ordering. If $l \prec_\nu l'$ we say that $l$ is older than $l'$ and that $l'$ is newer than $l$ in $\nu$. The extension $\nu[l \mapsto o]$ of a nursery $\nu$ binding a location $l \notin \mathrm{dom}(\nu)$ to an object $o$ is the nursery $\nu'$ such that (1) $\nu'(l) = o$ and $\nu'(l') = \nu(l')$ for each $l' \in \mathrm{dom}(\nu)$, and (2) $l' \prec_{\nu'} l$ for every $l' \in \mathrm{dom}(\nu)$.
The contraction $\nu \ominus \beta$ of a nursery $\nu$ by a block $\beta \subseteq \nu$ is the restriction of $\nu$ to $\mathrm{dom}(\nu) \setminus \mathrm{dom}(\beta)$. The live locations $\mathrm{live}(R, \nu)$ in a nursery $\nu$ relative to a subset $R \subseteq \mathrm{dom}(\nu)$ consist of those locations in $\mathrm{dom}(\nu)$ that are (transitively) reachable from locations in $R$. The scan $\mathrm{scan}(R, \nu)$ of a nursery $\nu$ with respect to a subset $R \subseteq \mathrm{dom}(\nu)$ is the block $\beta$ consisting of the oldest $B$ live locations in $\mathrm{live}(R, \nu)$. (See Morrisett, et al. [16] for formal definitions of these standard concepts.) It will be an invariant of the dynamics that the nursery contains at most $M$ live objects relative to the roots of the computation.

A store $\sigma$ is a triple $(\mu, \rho, \nu)$ consisting of a memory $\mu$, a read cache $\rho$ for $\mu$, and a nursery $\nu$ such that $\mathrm{dom}(\nu) \cap \mathrm{dom}(\mu) = \emptyset$. The domain of a store $\sigma$ is defined by $\mathrm{dom}(\sigma) = \mathrm{dom}(\mu) \cup \mathrm{dom}(\nu)$. An initial store is a store in which the main memory contains only large values and in which the read cache and allocation area are empty.

Evaluation Dynamics

The overall goal of the evaluation dynamics is to define the evaluation of a closed expression by an inductive definition of a relation between an expression and its value, which is always small, and its cost, a non-negative integer. The cost is computed by tracking the
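movement of objects among the components of the store, charging one unit of cost whenever a block of objects must be moved to or copied from main memory, and charging zero cost otherwise. (So, for example, a computation that runs entirely in cache will be assigned zero cost, consistently with the I/O model.) To account for the memory traffic involving values, the dynamics makes explicit the allocation of objects in the nursery, their eviction to main memory when the capacity of the nursery is exceeded, and their movement into the read cache as they are required by the computation. To account for the memory traffic attributable to the implicit control stack, the dynamics also allocates (but does not otherwise use) stack frames, and ensures that any data that would appear in the stack is kept live by the dynamics.

$$\frac{\sigma @ \mathtt{z} \uparrow^{n}_{R} \sigma' @ l}{\sigma @ \mathtt{z} \Downarrow^{n}_{R} \sigma' @ l} \quad (3a)$$

$$\frac{\sigma @ \mathtt{s}(-) \uparrow^{n_1}_{R \cup \mathrm{locs}(e)} \sigma_1 @ s \qquad \sigma_1 @ e \Downarrow^{n_2}_{R \cup \{s\}} \sigma_2 @ l' \qquad \sigma_2 @ \mathtt{s}(l') \uparrow^{n_3}_{R} \sigma' @ l}{\sigma @ \mathtt{s}(e) \Downarrow^{n_1 + n_2 + n_3}_{R} \sigma' @ l} \quad (3b)$$

$$\frac{\sigma @ \mathtt{ifz}(-; e_2; x.e_3) \uparrow^{n_1}_{R \cup \mathrm{locs}(e_1)} \sigma_1 @ s_1 \qquad \sigma_1 @ e_1 \Downarrow^{n_2}_{R \cup \{s_1\}} \sigma_2 @ l_1 \qquad \sigma_2 @ l_1 \downarrow^{n_3} \sigma_3 @ \mathtt{z} \qquad \sigma_3 @ e_2 \Downarrow^{n_4}_{R} \sigma' @ l}{\sigma @ \mathtt{ifz}(e_1; e_2; x.e_3) \Downarrow^{n_1 + n_2 + n_3 + n_4}_{R} \sigma' @ l} \quad (3c)$$

$$\frac{\sigma @ \mathtt{ifz}(-; e_2; x.e_3) \uparrow^{n_1}_{R \cup \mathrm{locs}(e_1)} \sigma_1 @ s_1 \qquad \sigma_1 @ e_1 \Downarrow^{n_2}_{R \cup \{s_1\}} \sigma_2 @ l_1 \qquad \sigma_2 @ l_1 \downarrow^{n_3} \sigma_3 @ \mathtt{s}(l_1') \qquad \sigma_3 @ [l_1'/x]e_3 \Downarrow^{n_4}_{R} \sigma' @ l}{\sigma @ \mathtt{ifz}(e_1; e_2; x.e_3) \Downarrow^{n_1 + n_2 + n_3 + n_4}_{R} \sigma' @ l} \quad (3d)$$

$$\frac{\sigma @ \mathtt{fun}(x, y.e) \uparrow^{n}_{R} \sigma' @ l}{\sigma @ \mathtt{fun}(x, y.e) \Downarrow^{n}_{R} \sigma' @ l} \quad (3e)$$

$$\frac{\begin{array}{c} \sigma @ \mathtt{app}(-; e_2) \uparrow^{n_1}_{R \cup \mathrm{locs}(e_1)} \sigma_1 @ s_1 \qquad \sigma_1 @ e_1 \Downarrow^{n_2}_{R \cup \{s_1\}} \sigma_2 @ l_1 \qquad \sigma_2 @ l_1 \downarrow^{n_3} \sigma_3 @ \mathtt{fun}(x, y.e) \\ \sigma_3 @ \mathtt{app}(l_1; -) \uparrow^{n_4}_{R} \sigma_4 @ s_2 \qquad \sigma_4 @ e_2 \Downarrow^{n_5}_{R \cup \{s_2\}} \sigma_5 @ l_2 \qquad \sigma_5 @ [l_1, l_2/x, y]e \Downarrow^{n_6}_{R} \sigma' @ l \end{array}}{\sigma @ \mathtt{app}(e_1; e_2) \Downarrow^{n_1 + n_2 + n_3 + n_4 + n_5 + n_6}_{R} \sigma' @ l} \quad (3f)$$

Figure 1. Cost Dynamics

$$\frac{l \in \mathrm{dom}(\rho)}{(\mu, \rho, \nu) @ l \downarrow^{0} (\mu, \rho, \nu) @ \rho(l)} \quad (4a)$$

$$\frac{l \in \mathrm{dom}(\nu)}{(\mu, \rho, \nu) @ l \downarrow^{0} (\mu, \rho, \nu) @ \nu(l)} \quad (4b)$$

$$\frac{l \notin \mathrm{dom}(\rho) \cup \mathrm{dom}(\nu) \qquad |\mathrm{dom}(\rho)| \le M - B}{(\mu, \rho, \nu) @ l \downarrow^{1} (\mu, \rho \otimes \mathrm{nbhd}(\mu, l), \nu) @ \mu(l)} \quad (4c)$$

$$\frac{l \notin \mathrm{dom}(\rho) \cup \mathrm{dom}(\nu) \qquad |\mathrm{dom}(\rho)| = M \qquad \beta \subseteq \rho}{(\mu, \rho, \nu) @ l \downarrow^{1} (\mu, \rho \ominus \beta \otimes \mathrm{nbhd}(\mu, l), \nu) @ \mu(l)} \quad (4d)$$

$$\frac{|\mathrm{live}(R \cup \mathrm{locs}(o), \nu)| < M \qquad l \notin \mathrm{dom}(\nu)}{(\mu, \rho, \nu) @ o \uparrow^{0}_{R} (\mu, \rho, \nu[l \mapsto o]) @ l} \quad (5a)$$

$$\frac{|\mathrm{live}(R \cup \mathrm{locs}(o), \nu)| = M \qquad \beta = \mathrm{scan}(R \cup \mathrm{locs}(o), \nu) \qquad l \notin \mathrm{dom}(\nu)}{(\mu, \rho, \nu) @ o \uparrow^{1}_{R} (\mu \otimes \beta, \rho, (\nu \ominus \beta)[l \mapsto o]) @ l} \quad (5b)$$

Figure 2. Reading and Allocation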
These considerations lead to the evaluation judgment $\sigma @ e \Downarrow^{n}_{R} \sigma' @ l$ stating that the expression $e$, when evaluated with respect to a store $\sigma$ such that $\mathrm{dom}(\sigma) \supseteq \mathrm{locs}(e)$ and to roots $R \subseteq \mathrm{dom}(\sigma)$, results in a modified store $\sigma'$, a location $l$ representing the (large) value of the expression, and a cost $n$ representing the cache complexity of the execution. The modifications to the store consist of allocations in the nursery, migrations of objects from the nursery to the main memory, and copying of objects from the main memory to the read cache. All memory traffic occurs in blocks of $B$ objects, corresponding to loading a cache line or reading a block from disk. The roots $R$ represent locations that are to be kept live by virtue of their being present in the implicit control stack or expression under evaluation.

The evaluation judgment is defined by the rules in Figure 1, making use of two auxiliary judgments for reading and allocating objects defined in Figure 2. It may be helpful to read through the rules once while ignoring all but the evaluation judgments to see that the rules define a conventional eager dynamics for a functional language. On such a reading the root set plays no role, and can be ignored. Moreover, the cost assignment has no significance under such a simplification.

Next, let us consider the roles of the read judgments $\sigma @ l \downarrow^{n} \sigma' @ o$ and the allocate judgments $\sigma @ v \uparrow^{n}_{R} \sigma' @ l$, where $v$ is a value, in the dynamics. The read judgment states that reading location $l$ in store $\sigma$ results in the object $o$ and the modified store $\sigma'$, and has cost $n = 0$ or $n = 1$. The cost is non-zero only if the read causes a block to be loaded into the read cache. The modified store represents the possible effect of loading a block into the read cache. The allocate judgment states that allocating the large value $v$ in store $\sigma$ results in a modified store $\sigma'$ and location $l \in \mathrm{dom}(\sigma')$, and has cost $n = 0$ or $n = 1$. The cost is non-zero only if the allocation causes the eviction of a block from the nursery in order to maintain the live-size invariant.
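The simplified reading suggested above, in which stores, roots and costs are ignored, corresponds to an ordinary eager evaluator. The sketch below is one way to write it. It is an illustration under stated assumptions: expressions are encoded as Python tuples (an encoding chosen here), and environments with closures are used in place of the substitution that appears in the rules. The helper names `evaluate`, `num` and `to_int` are hypothetical.

```python
def evaluate(e, env=None):
    """Eager evaluation of the PCF variant, ignoring stores, roots and costs."""
    env = env or {}
    tag = e[0]
    if tag == "var":
        return env[e[1]]
    if tag == "z":
        return ("z",)
    if tag == "s":
        return ("s", evaluate(e[1], env))
    if tag == "ifz":                       # ifz(e1; e2; x.e3)
        _, e1, e2, x, e3 = e
        v = evaluate(e1, env)
        if v[0] == "z":
            return evaluate(e2, env)
        return evaluate(e3, {**env, x: v[1]})   # bind x to the predecessor
    if tag == "fun":                       # fun(x, y.e): x names the function itself
        _, x, y, body = e
        return ("closure", x, y, body, env)
    if tag == "app":
        f = evaluate(e[1], env)            # function first, then argument (call by value)
        a = evaluate(e[2], env)
        _, x, y, body, fenv = f
        return evaluate(body, {**fenv, x: f, y: a})
    raise ValueError(f"unknown expression: {e!r}")


def num(k):
    """Build the numeral s(s(...z)) for a non-negative Python int."""
    e = ("z",)
    for _ in range(k):
        e = ("s", e)
    return e


def to_int(v):
    """Convert a numeral value back to a Python int."""
    count = 0
    while v[0] == "s":
        count, v = count + 1, v[1]
    return count


# Curried addition: add m n = ifz(m; n; p. s(add p n))
ADD = ("fun", "add", "m", ("fun", "_", "n",
       ("ifz", ("var", "m"), ("var", "n"), "p",
        ("s", ("app", ("app", ("var", "add"), ("var", "p")), ("var", "n"))))))

if __name__ == "__main__":
    print(to_int(evaluate(("app", ("app", ADD, num(2)), num(3)))))
```

The cost dynamics refines exactly this evaluation order by threading a store through each step and summing the costs of the reads and allocations it performs.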
The read and allocate operations in the dynamics record the memory traffic engendered by the creation and examination of values during computation. It remains to consider the role of the allocation judgments of the form $\sigma @ f \uparrow^{n}_{R} \sigma' @ s$, which represent the allocation of a stack frame $f$ in the store at stack location $s$. The purpose of allocating these frames is purely to ensure that the cost assigned to a computation is accurate with respect to the underlying implementation. Although an evaluation semantics has no explicit control stack, it is nevertheless the case that an implementation must allocate space for the representation of the control state, and this space allocation does influence the cache behavior of the computation. It may not, therefore, be ignored. Our method for accounting for the memory effects of the control stack is to allocate explicitly frames that would appear in the control stack to ensure that space usage is properly accounted for, and that required liveness information (to be
6 detailed shortly) is properly maitaied. The frames are deoted as app( ; e 2) ad app(l 1; ) i the cost dyamics. With this i mid, let us examie i detail Rule 3f i Figure 1. We are to evaluate ad determie the cost of app(e 1; e 2) i store σ with give roots R. First, we allocate a stack frame s 1 represetig the pedig evaluatio of e 2 durig the evaluatio of e 1. This frame is ow cosidered live, eve though it does ot appear i ay expressio uder cosideratio. Accordigly, we evaluate e 1 relative to the store cotaiig this frame, treatig the just-allocated stack poiter to be live (as idicated by R {s1 }). This results i a locatio l 1, which we the read from the store to obtai a fuctio abstractio (as would be guarateed by the static type disciplie omitted here). We the create aother frame s 2 correspodig to the suspeded applicatio of the fuctio at locatio l 1, ad evaluate e 2 with this stack poiter cosidered live (as idicated by R {s2 }) to obtai locatio l 2. Fially, we evaluate the fuctio body, replacig the self variable by l 1 ad the argumet by l 2. The overall cost of the computatio is the sum of the costs of each of these steps, which are give either iductively or by the uses of the read ad allocate judgmets. Observe that this rule properly accouts for tail recursio i that o extra space is held durig tail recursive calls (as idicated by R ). It remais to explai the read ad allocate judgmets defied i Figure 2. The read judgmet assigs cost zero to ay read from a locatio i either the ursery or the read cache (Rules 4b ad 4a). Such reads have o effects, ad hece iduce o cache traffic. A read of a locatio that is oly i mai memory iduces a load of the eighborhood of that locatio (a block of memory) ito the read cache. If there is sufficiet room for it i the read cache, the block is added to the cache ad the cotets is retured, at a cost of oe uit (Rule 4c). 
If there is insufficient room in the read cache, a block is selected non-deterministically to be replaced by the required block, and once again a unit cost is charged to the read (Rule 4d). At the end of the section we discuss the use of non-deterministic eviction.

The allocate judgment defines the procedure for creating new objects in the store. Of course, new objects are considered newer in the allocation ordering than the objects already present in the nursery. If the new object fits within the nursery, it is allocated there at zero cost (Rule 5a). If the new object will not fit within the nursery, then the block consisting of the oldest B live objects in the nursery is evicted to main memory, making room for the newly allocated object; such an allocation is charged unit cost (Rule 5b). It is important to our method that the oldest objects be evicted from the cache as a block forming the neighborhood of each of its locations. Whether an object fits within the nursery is determined as follows. The nursery is full if the number of live objects in it is exactly M. (It is for the sake of assessing liveness that the allocation judgment is parameterized by a root set.) Eviction of a block reduces this to at most M − B objects, so that the next B − 1 allocations will not cause an eviction. Thus we are, in effect, charging at most 1/B units of cost to each allocation (less if objects die before needing to be evicted).

It is essential to our results that the liveness of objects in the nursery may be assessed without accessing main memory. Given roots R we need only trace objects in the nursery itself, and need never consider locations lying outside of it. This is ensured by two properties of the dynamics. First, since the model is purely functional, the dependency graph of objects in the nursery is acyclic; an object may only refer to objects allocated earlier in the computation as defined by the allocation ordering.
Second, implicit stack frames are explicitly allocated in the nursery to ensure that liveness may be assessed solely by examining the nursery itself, starting from the root set. Put another way, an object in the nursery cannot be live solely because of a pointer from main memory back to the nursery. This property is a consequence of immutability and the explicit allocation of stack frames in the semantics.

In Section 6 we will make use of a deep copy operation on values of certain types. In the illustrative language considered here this operation is definable on natural numbers as follows: fun(copy, x.ifz(x; z; x′.s(app(copy; x′)))). Calling this function on a number n has the effect of creating a fresh copy of n in the heap. No such operation is definable, or required, for function types. Deep copying is easily extended to product, sum, and inductive types, but would need to be provided as a primitive for base types such as fixed precision integers or floating point numbers.

Discussion

We briefly discuss some of the motivation for the decisions we made in formulating the dynamic semantics. The overall goal is to allow a simple analysis for the algorithm developer while capturing all the costs needed to prove asymptotic implementation bounds. The separate allocation cache is important both for convenience of analysis and for properly accounting for costs. It ensures that short-lived allocations never need to be written to memory. For a subcomputation in which the maximum footprint of live data allocated fits in the allocation cache, the user need not worry about any costs for any temporary memory. In a block matrix multiply on n × n matrices, for example, once $k n^2 \le M$ for some small constant k, the only cost that needs to be considered is the cost of reading the input and evicting the output. This is the case even though the multiply will allocate a total of $\Theta(n^3)$ space. It is also important that the partitioning of locations into blocks is not decided until locations are evicted from the allocation cache, which ensures that only live data is ever migrated to memory.
If blocking were to be decided on allocation, for example, then by the time the objects are evicted most of the objects in a block may no longer be live. This would break the bounds we give in Section 6.

The cost semantics accounts for the allocation of stack frames in order to account for the space required to manage the control state of evaluation. This is particularly important in the case that no allocation is associated with the creation of a frame, for then there is no possibility to amortize the space required for the frame against the allocated object. Note that the semantics only models the space taken by the frames in the allocation cache and the cost of evicting them. It does not model any costs associated with reloading them into the read cache. As described in the next section, in a lower level model this can be amortized against the cost of evicting the frames in the first place.

It is important that the stack frames be heap allocated. A crucial invariant we require is that all live data in the allocation cache can be determined solely through the caches. If we had a separate stack cache it could allow for the eviction of a stack frame that references data in the allocation cache, breaking the required invariant. There are other techniques to handle this problem but we found that allocating the stack frames in the heap is the easiest.

Our model is non-deterministic in the choice of what block is evicted from the read cache in the case of a read miss. In our provable implementation bounds we show that if there is a (non-deterministic) execution that gives a certain cache complexity then we can guarantee those bounds on the ideal cache model (within constant factors). When analyzing an algorithm this allows one to consider any policy for eviction. This is possible because the ideal cache makes the optimal decisions and will therefore be at least as good as the policy the user assumes. The justification for the ideal cache model is given in Section 2.
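
The read and allocate judgments described in this section can be illustrated with a small simulation. The following Python sketch is not the paper's formal semantics: the class `Store` and its method names are hypothetical, liveness is modeled by an explicit `kill` call rather than by tracing from a root set, and least-recently-used replacement stands in for the non-deterministic choice of a read-cache victim (any policy is admissible in the model).

```python
from collections import OrderedDict


class Store:
    """Toy store: a nursery of at most M live objects plus a read cache of M/B blocks."""

    def __init__(self, M, B):
        assert M % B == 0 and M >= B
        self.M, self.B = M, B
        self.nursery = []                # live nursery locations, oldest first
        self.block_of = {}               # evicted location -> id of its main-memory block
        self.read_cache = OrderedDict()  # ids of cached blocks, least recently used first
        self.next_loc = 0
        self.next_block = 0
        self.cost = 0                    # number of block transfers so far

    def allocate(self):
        """Allocate a fresh object; costs 1 only if a block must be evicted."""
        if len(self.nursery) == self.M:            # nursery is full of live objects
            for loc in self.nursery[:self.B]:      # the oldest B live objects form one block
                self.block_of[loc] = self.next_block
            del self.nursery[:self.B]
            self.next_block += 1
            self.cost += 1
        loc = self.next_loc
        self.next_loc += 1
        self.nursery.append(loc)
        return loc

    def kill(self, loc):
        """Assumed stand-in for `loc` becoming unreachable from the roots."""
        if loc in self.nursery:
            self.nursery.remove(loc)

    def read(self, loc):
        """Read a location; costs 1 only if its block must be loaded into the read cache."""
        if loc in self.nursery:
            return                                  # nursery hit, cost 0
        block = self.block_of[loc]
        if block in self.read_cache:
            self.read_cache.move_to_end(block)      # read-cache hit, cost 0
            return
        if len(self.read_cache) == self.M // self.B:
            self.read_cache.popitem(last=False)     # LRU: one admissible choice of victim
        self.read_cache[block] = True
        self.cost += 1
```

With M = 4 and B = 2, ten allocations of objects that all stay live cause three evictions, while a loop that allocates and immediately kills an object never touches main memory, which is the behavior the discussion above attributes to short-lived allocations.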

4. Abstract Machine

The abstract cost of a computation assigned by the evaluation dynamics given in the preceding section is validated in two stages. First, in this section we define an abstract machine with an explicit control stack, and show that the evaluation dynamics accurately predicts the behavior of the abstract machine with respect to both the outcome and the cost of the computation. Second, in Section 5 we show how to implement the basic operations of the abstract machine with only a small overhead. Taken together these two arguments demonstrate that the evaluation semantics provides an accurate model of the cache complexity of a program when implemented as described in these two steps.

The abstract machine takes the form of a labeled transition system between states of two different forms:

1. Evaluation state: $\sigma \mathbin{@} k \triangleright e$, where $k \in \mathrm{dom}(\sigma)$ is a stack pointer, and $\mathrm{locs}(e) \subseteq \mathrm{dom}(\sigma)$, stating that e is to be evaluated on stack k relative to store σ.
2. Return state: $\sigma \mathbin{@} k \triangleleft l$, where $k, l \in \mathrm{dom}(\sigma)$, stating that small value l is to be returned to stack k relative to store σ.

The control stack is represented by a stack location, k, that refers to a linked list of frames, either the empty stack, written $\bullet$, or a frame together with another stack location, written $f;k$. The label on a transition is either 0, 1, or 2, and specifies the amount of work to be charged for that transition.

The rules given in Figure 3 define the abstract machine. Their overall form is standard (see, for example, Chapter 27 of [13]), with the main differences being that allocation and reading of values is made explicit, just as in the evaluation dynamics, and that the stack is explicitly represented as a linked data structure in the store. The multistep transition judgment $s \longmapsto^{n}_{*} s'$ means that there is a finite, possibly empty, sequence of transitions from $s$ to $s'$ whose labels sum to n.

THEOREM 4.1 (Correctness of Evaluation Dynamics). Let $\sigma_0$ be an initial store, let $e_0$ be a closed expression such that $\mathrm{locs}(e_0) \subseteq \mathrm{dom}(\sigma_0)$.
Let the abstract machine be equipped with one additional block in the read cache, and let $k_0$ be a reserved stack location not used in the evaluation dynamics. If $\sigma_0 \mathbin{@} e_0 \Downarrow^{n}_{\emptyset} \sigma \mathbin{@} l$, then there is an evaluation $\sigma_0[k_0 \mapsto \bullet] \mathbin{@} k_0 \triangleright e_0 \longmapsto^{m}_{*} \sigma'[k_0 \mapsto \bullet] \mathbin{@} k_0 \triangleleft l'$ such that

1. the results are isomorphic, $l \cong l'$, and
2. the cost m is at most 3n.

The relation $l \cong l'$ states that the reachable graph from $l$ in σ is isomorphic to the reachable graph from $l'$ in $\sigma'$.

Theorem 4.1 states that the outcome of a computation on the abstract machine is the same, up to choice of locations, as the outcome of the same computation according to the evaluation dynamics. Moreover, the total cost of the machine execution (measured in accordance with the I/O model described earlier) is at most a small constant factor larger than the cost assigned by the evaluation dynamics. The content of the theorem amounts to a proof that the space required by the control stack in a computation may be managed so as not to interfere with space usage of the computation itself.

[Figure 3. Abstract Machine: transition rules (6a)-(6k).]

The correctness proof may be decomposed into three major components. The first obligation is to relate the outcome of the evaluation dynamics to that of the abstract machine, disregarding, for the moment, the cost. The required correspondence is proved by induction on the derivation of the evaluation judgment. Specifically, we prove that if $\sigma \mathbin{@} e \Downarrow_{R} \sigma' \mathbin{@} l$, then for any stack pointer k, $\sigma \mathbin{@} k \triangleright e \longmapsto_{*} \sigma' \mathbin{@} k \triangleleft l$.
The proof proceeds along standard lines, as described, for example, in Chapter 27 of the second author's textbook [13]. The same choice of locations may be made in the machine derivation as were made in the evaluation derivation, because the sequence of value allocations is precisely the same in both forms of dynamics.

The next step of the proof is to show that the abstract machine performs the same sequence of value reads in the same order as specified by the evaluation dynamics. This may be proved along with the correspondence described in the preceding paragraph. The argument relies on two important properties of the evaluation dynamics:

1. Any read of a value location is either a read of a location in the initial store, or a location that was allocated earlier in the evaluation.
2. Stack frames are allocated, but never read, in order to ensure that eviction of blocks from the nursery occurs in exactly the order imposed by the stack-based abstract machine.
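
The shape of the machine, with evaluation states, return states, and a control stack kept as a linked list of frames, can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of Figure 3: it uses environments and closures where the paper substitutes locations, and it omits the store and the transition labels, so it models control flow only. The tuple encodings and the names `run`, `num`, `to_int`, and `COPY` are hypothetical.

```python
# Expressions (assumed encoding): ("z",), ("s", e), ("var", x),
# ("ifz", e1, e2, x, e3), ("fun", f, x, body), ("app", e1, e2).
# The stack is None (empty) or a pair (frame, rest), i.e. a linked list.

def run(e):
    """Run a closed expression on the stack machine and return its value."""
    state = ("eval", e, {}, None)          # (mode, expression, environment, stack)
    while True:
        if state[0] == "eval":
            _, e, env, k = state
            tag = e[0]
            if tag == "z":
                state = ("ret", ("z",), k)
            elif tag == "var":
                state = ("ret", env[e[1]], k)
            elif tag == "s":
                state = ("eval", e[1], env, (("s",), k))             # frame s(-)
            elif tag == "ifz":
                _, e1, e2, x, e3 = e
                state = ("eval", e1, env, (("ifz", e2, x, e3, env), k))
            elif tag == "fun":
                _, f, x, body = e
                state = ("ret", ("clo", f, x, body, env), k)
            elif tag == "app":
                _, e1, e2 = e
                state = ("eval", e1, env, (("app1", e2, env), k))    # frame app(-; e2)
            else:
                raise ValueError(f"unknown expression {tag!r}")
        else:
            _, v, k = state
            if k is None:                  # empty stack: the computation is complete
                return v
            frame, rest = k
            tag = frame[0]
            if tag == "s":
                state = ("ret", ("s", v), rest)
            elif tag == "ifz":
                _, e2, x, e3, env = frame
                if v == ("z",):
                    state = ("eval", e2, env, rest)
                else:                      # v is ("s", predecessor)
                    state = ("eval", e3, {**env, x: v[1]}, rest)
            elif tag == "app1":
                _, e2, env = frame
                state = ("eval", e2, env, (("app2", v), rest))       # frame app(l1; -)
            else:                          # "app2": the frame is popped before the body runs
                clo = frame[1]
                _, f, x, body, cenv = clo
                state = ("eval", body, {**cenv, f: clo, x: v}, rest)


def num(n):
    """Build the numeral expression s(s(...z))."""
    e = ("z",)
    for _ in range(n):
        e = ("s", e)
    return e


def to_int(v):
    """Decode a numeral value back to a Python int."""
    n = 0
    while v[0] == "s":
        n, v = n + 1, v[1]
    return n


# The deep copy function from Section 3: fun(copy, x.ifz(x; z; x'.s(app(copy; x')))).
COPY = ("fun", "copy", "x",
        ("ifz", ("var", "x"), ("z",), "y",
         ("s", ("app", ("var", "copy"), ("var", "y")))))
```

Because the application frame is popped before the function body is evaluated, a call in tail position adds nothing to the stack, which mirrors the treatment of tail recursion in Rule 3f.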

A deterministic nursery eviction policy is required to ensure that the memory reads correspond exactly between the evaluation dynamics and the abstract machine. We can assume that whatever policy is used by the dynamic semantics is also used by the abstract machine.

It remains to show that the stack reads employed by the abstract machine do not impose an asymptotically significant cost beyond what is predicted by the evaluation dynamics. Without special provision, access to the control stack would interfere with the allocation of data in the read cache, invalidating the cost given to the computation by the evaluation dynamics. To avoid this we make use of a dedicated read cache block in the abstract machine, which we will call the stack cache block, and explicitly manage this cache block as follows. Whenever a stack location is read from main memory, its neighborhood is loaded into the stack cache block, evicting the block that currently occupies it. We will argue that the cost of loading the stack cache can be amortized across the execution sequence, even if the same block is loaded into the stack cache more than once, a possibility that will be detailed shortly. The validity of the argument depends on two special properties of the run-time stack, namely that each allocated frame is read exactly once in a complete computation, and that the preceding stack frame is always older than the current one. Given such an amortization, it is then clear that the overall cost of execution on the abstract machine is bounded by a small constant factor of the cost ascribed to it by the evaluation dynamics, establishing the theorem.

To complete the proof, we describe the amortization of the cost of stack management in more detail. As an invariant we put a dollar on every memory block that contains a stack frame, except if it is the youngest such block and resides in the stack cache block, in which case it has no money associated with it.
When the abstract machine evicts a block containing a stack frame from the allocation cache we spend three dollars: one for the eviction itself, one to put a dollar on the evicted block, and one to put a dollar on the block that is in the stack cache block. This third dollar might be needed to maintain our invariant since that block, if there is one, will no longer be the youngest memory block containing a stack frame. Now when the abstract machine loads a block into the stack cache block from memory we spend its dollar for the load. All blocks with older frames have a dollar on them already by the invariant, so the invariant is maintained. In summary we spend 3 block transfers (worst case) per block that is evicted from the allocation cache.

We finally note that there is no need to explicitly maintain the stack cache block in the abstract machine semantics since we are assuming an ideal cache. Therefore as long as the cache has an extra block available, the cache policy will do at least as well as the one we described.

5. Provable Implementation

In this section we describe an implementation of the abstract machine given in Section 4 in the ideal cache model with the same asymptotic cost. The efficiency proof for the implementation takes account of two issues that are treated abstractly in the evaluation semantics and in the abstract machine. The main issue is how to implement the allocation judgment defined in Figure 2. Rules 5a and 5b make reference to liveness of the data, and evict a block consisting of the oldest B live objects in the nursery. To ensure that the predicted costs are realized in practice we must argue that these conditions can be met by an implementation. The second issue is that we must account for the size of the stored objects (values and frames) that may appear in a computation, and account for the cost of handling these objects in an implementation.

Define the size of a machine state $\sigma_0 \mathbin{@} k_0 \triangleright e_0$ to be the sum of the size of $e_0$ and the size of any function in $\sigma_0$.
This may be thought of as the size of the program, including any λ-abstractions that may be present in the initial store.

THEOREM 5.1. Fix an initial state $\sigma_0 \mathbin{@} k_0 \triangleright e_0$ of size $s_0$, and consider a complete computation $\sigma_0 \mathbin{@} k_0 \triangleright e_0 \longmapsto^{m}_{*} \sigma \mathbin{@} k_0 \triangleleft l$ with $s_1$ objects in the final store, σ. This computation can be simulated in the ideal cache model with cache complexity $c \cdot m$ for some constant c, provided that words are of size at least $d \log(\max(s_0, s_1))$ for some d > 0 and that the cache has at least $(4M + B)\, s_0$ words. (The constants c and d are independent of $\sigma_0$, $k_0$ and $e_0$.)

Theorem 5.1 states that the implementation asymptotically realizes the work attributed to the computation by the evaluation semantics, and hence validates the algorithm analysis performed using that semantics. The requirement on the word size in Theorem 5.1 ensures that all objects are addressable by a word-sized pointer, and accounts for the sizes of the objects themselves in storage. (A closure can be as large as the initial program.) The requirement on the cache size in Theorem 5.1 ensures that we may implement the abstract memory hierarchy with no more than a small constant factor of overhead in a manner that we now describe. (The $B\, s_0$ additional words account for the stack cache described in Section 4; it remains to discuss the implementation of allocation.)

The allocation judgment defined in Figure 2 relies on an assessment of the live size of the nursery, and on the eviction of blocks from the nursery to ensure that the nursery contains no more than M live objects. As we noted earlier, the liveness of data in the nursery may be assessed without reference to the main memory; the liveness computation takes place entirely within the cache. Rather than assess liveness, possibly evicting a block, on each allocation, we instead amortize these costs across multiple allocations according to the following strategy. We reserve $2M\, s_0$ words of cache memory for the allocation area to accommodate at least 2M objects.
Objects are allocated by maintaining a pointer into the nursery area, incrementing it on each allocation until 2M objects have been allocated, at which point the nursery space is exhausted. When this occurs, we perform a compacting garbage collection that preserves the allocation order of objects, simultaneously evicting as many blocks as necessary to obtain a live size of M objects in the nursery. After compaction, allocation continues as before until the nursery is again exhausted. As long as there is sufficient space, allocation takes constant time. When a garbage collection is required, the cost may be attributed to the allocations of the live data in the nursery, so that in an amortized sense garbage collection is cost-free [3].

It is easy to see that no object is evicted to main memory using this implementation that would not have been evicted in the abstract sense. However, the evictions will, in general, happen later than predicted by the semantics. As a result, fewer objects may be live at the time of eviction, and so fewer blocks overall may be moved to main memory. As a result of this compression effect, two locations that were neighbors in the evaluation semantics may be in two different blocks in the implementation. To account for this, two blocks must be loaded to ensure that neighboring objects in the semantics are loaded into the read cache together. Thus we require $2M\, s_0$ words of cache in the ideal cache model to account for the M objects in the read cache.

With regard to the eviction policy from the read cache, we note that an ideal cache will always choose an optimal policy (furthest in the future). It will therefore do at least as well as any policy assumed by the abstract machine. This completes the proof of the implementation bound stated in Theorem
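
The amortized allocation strategy described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class `BumpNursery` is hypothetical, liveness is supplied by a caller-provided predicate `is_live` instead of being traced from roots, objects are treated as unit-sized, and a Python list stands in for the 2M-object allocation area.

```python
class BumpNursery:
    """Allocation area of 2M slots, compacted to at most M live objects when it fills."""

    def __init__(self, M, B, is_live):
        self.M, self.B = M, B
        self.is_live = is_live   # assumed predicate: is this location still reachable?
        self.area = []           # objects in allocation order (at most 2M of them)
        self.evicted = []        # blocks written out to main memory, oldest first
        self.next_loc = 0

    def allocate(self):
        """Constant-time bump allocation; collect only when the area is exhausted."""
        if len(self.area) == 2 * self.M:
            self.collect()
        loc = self.next_loc
        self.next_loc += 1
        self.area.append(loc)
        return loc

    def collect(self):
        """Order-preserving compaction, evicting the oldest live blocks down to M objects."""
        live = [loc for loc in self.area if self.is_live(loc)]
        while len(live) > self.M:
            self.evicted.append(live[:self.B])   # oldest B live objects leave as one block
            live = live[self.B:]
        self.area = live
```

After a collection the area holds at most M objects, so at least M further allocations proceed without any liveness work. With M = 4 and B = 2 and every object live, twenty allocations evict six blocks here, compared with eight evictions when a block is evicted as soon as the nursery holds M live objects, which illustrates the remark that evictions happen later than the semantics predicts and never more often.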
