Storing Matrices on Disk: Theory and Practice Revisited


Yi Zhang, Duke University, yzhang@cs.duke.edu
Kamesh Munagala, Duke University, kamesh@cs.duke.edu
Jun Yang, Duke University, junyang@cs.duke.edu

ABSTRACT

We consider the problem of storing arrays on disk to support scalable data analysis involving linear algebra. We propose Linearized Array B-tree, or LAB-tree, which supports flexible array layouts and automatically adapts to varying sparsity across parts of an array and over time. We reexamine the B-tree splitting strategy for handling insertions and the flushing policy for batching updates, and show that common practices may in fact be suboptimal. Through theoretical and empirical studies, we propose alternatives with good theoretical guarantees and/or practical performance.

1 Introduction

Arrays are one of the fundamental data types. Vectors and matrices, in particular, are the most natural representation of data for many statistical analysis and machine learning tasks. As we apply increasingly sophisticated analysis to bigger and bigger datasets, efficient handling of large arrays is rapidly gaining importance. In the RIOT project [9], we are building a system to support scalable statistical analysis of massive data in a transparent fashion, which allows users to enjoy the convenience of languages like R and MATLAB with built-in support for vectors/matrices and linear algebra, without rewriting code to use systems like databases that scale better over massive data.

Scalability requires efficient handling of disk-resident arrays. Our target applications make prevalent use of high-level, whole-array operators such as matrix multiply, inverse, and factorization, but low-level, element-wise reads and writes are also possible. We have identified the following requirements for an array storage engine:

1. We must support different array access patterns (including those that appear random).
Our storage engine should allow a user or optimizer to select from a variety of storage layouts, because many whole-array operators have access patterns that prefer specific storage layouts: e.g., I/O-efficient matrix multiply prefers row, column, or blocked layouts, while FFT prefers the bit-reversal order. Moreover, a single array may be used in operators with different access patterns; instead of converting the storage layout for every use, sometimes it is cheaper to allow access patterns that do not match the storage layout (see the appendix for a concrete example), even though it makes accesses more random. Finally, some operators inherently contain some degree of randomness in their access patterns that cannot be removed by storage layouts, e.g., LU factorization with partial pivoting.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol., No. Copyright VLDB Endowment 5-97//... $..

2. We must handle updates. One common update pattern is populating an array one element at a time in some order, which may or may not be the same as the storage layout order. Handling updates goes beyond bulk loading: some operators, such as LU factorization, iteratively update an array and read previously updated values, which means that we cannot simply log all updates without efficiently supporting interleaving (and sometimes random) reads to updated values.

3. We want the storage format to automatically adapt to array sparsity.
For a sparse array, we want to avoid wasting space for elements that are zero (or some other default value), which can be done by storing array indices and values only for non-zero elements. On the other hand, for a dense array, we want to avoid the overhead of storing array indices by densely packing the values and inferring their indices from storage positions. In practice, there is no obvious delineation between "sparse" and "dense"; sparsity often varies across parts of an array and over time, and is difficult to predict in advance. For example, consider an application program that updates an initially empty (all-zero) matrix one element at a time in random order according to some ongoing computation. The matrix may turn out dense, sparse, or partly dense (e.g., mostly upper-triangular); regardless of its final content, our storage engine should store the matrix in a way that provides good performance throughout the update sequence, without user intervention.

There has been a myriad of approaches to storing arrays on disk, but many fail to meet all requirements above. Targeting in-memory computation, popular platforms for statistical computing such as R and MATLAB offer separate dense and sparse storage formats, but these formats do not adapt to varying sparsity across parts of an array and over time, and users must choose one format in advance. Compressed sparse column, used by MATLAB and representative of popular sparse formats, does not support updates or random accesses for disk-resident arrays. (For memory-resident arrays, this format is easier to search but still inefficient to update.) Alternatively, a database system can store an array as a table with columns representing array index and value, but the overhead is high for dense arrays. It is generally believed that special support for arrays is needed in database systems, either through user-defined extensions or by completely new designs [,,, 7]. Section 2 surveys additional related work.

A promising approach is to leverage B-tree []. To handle multidimensional arrays, we use linearization, which maps a multi-dimensional coordinate to a 1-d array index according to a linearization function that offers control over data layout. To adapt to varying sparsity, we apply the idea of compression, allowing each B-tree leaf to switch dynamically between sparse and dense formats according to the array density within the leaf. Simply outfitting B-tree with these features, however, falls short of offering optimal performance for arrays, as illustrated below.

Example 1. Consider sequentially inserting elements of an array into an empty B-tree, which is a very common update pattern. Suppose the array has size and a B-tree leaf can hold at most records. When a leaf overflows, the standard strategy is split-in-middle, which divides the leaf into two with equal number of records (or as closely as possible). The leaf level of the B-tree after the insertion sequence looks as follows (only record keys are shown).

[Figure omitted: leaf level of the B-tree after the insertion sequence.]

About half of the space is empty, which is particularly wasteful as no future insertions can possibly fill it. The suboptimal space utilization also hurts access performance; e.g., array scans become twice as costly. While one can handle sequential insertions as a special case, other patterns that lead to waste are difficult to detect. Are there alternative splitting strategies that are provably resilient against such waste, without knowing the insertion sequence in advance?

Example 2. A popular trick to speed B-tree updates is to batch them by keeping individual record updates in a memory buffer. When the buffer fills up, we flush the buffered updates by applying them in key order. This approach reduces I/Os by applying multiple updates to the same B-tree leaf with a single leaf access. A large buffer also helps make the leaf accesses more sequential. However, for the following update sequence, the conventional policy of flushing all buffered updates when the buffer is full is not optimal. Here, K denotes the number of updates that the buffer can hold, and each $P_i$ represents an update of some record on leaf $P_i$.

$$\underbrace{P_1, \ldots, P_1}_{K/2},\ \underbrace{P_2, \ldots, P_2}_{K},\ \underbrace{P_3, \ldots, P_3}_{K},\ \underbrace{P_4, \ldots, P_4}_{K},\ \ldots$$

The flush-all policy incurs two leaf writes (of $P_i$ and $P_{i+1}$) for every K updates. However, the optimal policy would flush all $P_1$ updates after the (K/2)-th update; subsequently, only one leaf write would be incurred for every K updates. For this simple insertion sequence, flush-all is only a factor of 2 worse than the optimal. As we will see later, however, there exist sequences for which flush-all is a factor of Ω( K) worse. Are there flushing policies that offer better competitive ratio in theory or perform better in practice?

In this paper, we present LAB-tree (Linearized Array B-tree), the backbone of the RIOT array storage engine, which meets all requirements identified earlier. LAB-tree offers flexible layouts via linearization; it inherits from B-tree efficient support for accesses and updates; and it adapts to varying sparsity by switching between dense and sparse storage formats automatically on a per-leaf basis.

LAB-tree reexamines the leaf splitting strategies and batched update flushing policies, for which common practices have been rarely questioned. We present theoretical and empirical results that contribute to the fundamental understanding of these problems. These results challenge the common practices.

For leaf splitting, exploiting the fact that the domain of array indices is bounded and discrete, we devise a strategy that naturally produces trees with no dead space, often twice as efficient as those produced by split-in-middle. This advantage does incur a fundamental trade-off: in the worst case, split-in-middle has competitive ratio , while this strategy has , which is the best possible for any no-dead-space strategy. Nonetheless, on common workloads, this strategy consistently and significantly outperforms split-in-middle.

For update batching, we give a flushing policy with competitive ratio O(log K) in the worst case, beating flush-all's Ω( K). For common workloads, however, flush-all actually performs better in practice.
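The arithmetic behind Example 2 can be checked with a small simulation. This is only a sketch under a simplified cost model that is our assumption, not the paper's implementation: flushing a buffer costs one leaf write per distinct leaf it touches. The function names are hypothetical.

```python
def leaf_writes_flush_all(updates, K):
    """Count leaf writes when the whole buffer is flushed each time it fills.

    Assumed cost model: one write per distinct leaf present in the buffer.
    """
    buffer, writes = [], 0
    for leaf in updates:
        buffer.append(leaf)
        if len(buffer) == K:
            writes += len(set(buffer))
            buffer = []
    return writes + len(set(buffer))  # final flush of any leftovers


def leaf_writes_early_flush(updates, K):
    """Flush once after the first K/2 updates, then behave like flush-all."""
    head, tail = updates[:K // 2], updates[K // 2:]
    return len(set(head)) + leaf_writes_flush_all(tail, K)


def example_sequence(K, rounds):
    """Leaf 1 updated K/2 times, then leaves 2, 3, ... updated K times each."""
    seq = [1] * (K // 2)
    for leaf in range(2, rounds + 2):
        seq += [leaf] * K
    return seq


updates = example_sequence(K=8, rounds=10)
print(leaf_writes_flush_all(updates, 8), leaf_writes_early_flush(updates, 8))
```

With K = 8 and ten further leaves, every full buffer under flush-all straddles two leaves, whereas after the early flush each full buffer holds updates to a single leaf, which is where the factor of 2 comes from.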
On the other hand, starting from a simple policy with a poor competitive ratio of Ω(K), we devise a randomized variant that incurs fewer I/Os than flush-all for some workloads (and comparable numbers for others). Our approach can be seen as bringing to the update batching problem the same level of rigor as in the study of caching (though results do not carry over because of fundamental differences in their problem definitions).

Finally, we note that our techniques are easy to implement as they do not require intrusive modifications to the conventional B-tree. Also, many of our results generalize to other settings: the idea of no-dead-space splitting makes sense for other discrete, ordered key domains; theoretical analysis of update batching generalizes to other block-oriented or distributed data structures.

2 Related Work

Database systems have been extended with support for arrays, and more specifically, linear algebra. Besides storing arrays as tables whose rows correspond to individual array elements, UDTs and UDFs are popular implementation options (e.g., [, ]). In general, these approaches can be seen as dividing an array into chunks and storing each chunk in a database row as a unit of access. SQL can express many linear algebra operations by calling UDFs that operate on chunks or pairs of chunks. Database indexing is used for accessing chunks. While this paper does not store arrays in databases, many ideas, such as linearization, dynamic storage format, and update batching, are readily applicable by regarding a table of chunks as a block-oriented storage structure.

There has also been work building database systems specializing in arrays (e.g., RasDaMan [], ArrayDB [], and SciDB [7]). These approaches divide arrays into rectangular chunks, and often rely on spatial indexing to retrieve chunks in high-dimensional arrays. Our approach of linearization supports more layouts (e.g., bit-reversed) and avoids the difficulty of high-dimensional indexing.
One reason for this different approach is that we focus less on ad hoc region-based retrieval, but more on whole-matrix operations with more predictable but specific access patterns. Nonetheless, it would be interesting to see how our ideas can be applied in their settings (e.g., linearization, alternative index reorganization and update buffering methods) and vice versa (e.g., allowing replication of boundary elements between neighboring chunks as in SciDB).

Linearization is frequently used for multi-dimensional indexing. UB-tree [] is the most related to our work in this regard. While UB-tree linearizes arrays using Z-order, LAB-tree provides more linearization options to match different application needs (with a similar goal as RodentStore [7], but at a different level). More importantly, we reexamine index reorganization and update buffering practices, which UB-tree does not address.

There is no shortage of B-tree tricks [, ] aimed at improving its efficiency. Prefix B-tree compression, for example, is a more general form of compression than our dynamic leaf format, though its generality also carries some overhead. There is also work on alternative splitting strategies, such as avoiding splitting by scanning adjacent nodes for free space [9]. Most of these techniques are orthogonal to ours and may further improve LAB-tree in some cases. We are not aware of any previous work on alternative splitting strategies for bounded, discrete key domains and how they interact with compression.

Work on update batching dates back to Lohman et al. []. Like us, instead of a complete reorganization, Lang et al. [] propose accumulating insertions in a batch, sorting them by key, and applying them to B-tree by traversing from left to right and backtracking along root-to-leaf paths when necessary. Our contribution to the update batching problem lies in analyzing and questioning the standard practice of flushing all buffered updates.

3 Overview of LAB-Tree

Based on B-tree, LAB-tree introduces modifications and extensions designed for arrays: linearization (this section), new leaf splitting strategies (Section 4.1), dynamic leaf storage format (Section 4.2), and alternative flushing policies for update batching (Section 5).

Each LAB-tree has a linearization function that specifies the storage layout of the array. For an array of dimension d and size $N_1 \times \cdots \times N_d$, a linearization function $f : [0, N_1) \times \cdots \times [0, N_d) \to [0, N_1 \cdots N_d)$, where all intervals are over $\mathbb{N}$, is a bijection that maps each d-d array index to a 1-d array index. When d = 1, f is a permutation. Conceptually, LAB-tree indexes the values of array elements by their linearized array indices; i.e., the element of array A with index $\vec{\imath} = (i_1, \ldots, i_d)$ is indexed as the key-value pair $(f(\vec{\imath}), A[\vec{\imath}])$. Popular layouts, such as row-major, column-major, blocked, Z-order, and bit-reversal, can be easily and succinctly defined as linearization functions (the appendix gives concrete examples). LAB-tree supports arbitrary user-defined linearization functions; for convenience and efficiency, however, the frequently used ones have support built into LAB-tree.

Each LAB-tree also has a default value (often 0) for array elements. Conceptually, LAB-tree only indexes elements whose values differ from the default. A new, empty array is filled with the default value. Setting a default-valued element to a non-default value amounts to an insertion; the inverse operation amounts to a deletion. For convenience and without loss of generality, we will assume the default value to be 0 for the remainder of the paper.
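As an illustration of the linearization functions described above, the following sketch defines a few common layouts for 2-d arrays (bit-reversal is shown for a 1-d index) and a toy in-memory store that keeps only non-default elements. All names here are hypothetical, and a plain dictionary stands in for the on-disk tree.

```python
def row_major(i, j, n_rows, n_cols):
    """Key of element (i, j) when rows are stored one after another."""
    return i * n_cols + j


def column_major(i, j, n_rows, n_cols):
    """Key of element (i, j) when columns are stored one after another."""
    return j * n_rows + i


def z_order(i, j, bits):
    """Interleave the low `bits` bits of i and j (Z-order / Morton order)."""
    key = 0
    for b in range(bits):
        key |= ((j >> b) & 1) << (2 * b)
        key |= ((i >> b) & 1) << (2 * b + 1)
    return key


def bit_reversal(i, bits):
    """Reverse the low `bits` bits of a 1-d index (the order FFT prefers)."""
    out = 0
    for b in range(bits):
        out = (out << 1) | ((i >> b) & 1)
    return out


class SparseStore:
    """Toy stand-in for the index: only non-default values are kept."""

    def __init__(self, linearize, default=0):
        self.linearize, self.default, self.records = linearize, default, {}

    def set(self, index, value):
        key = self.linearize(*index)
        if value == self.default:
            self.records.pop(key, None)   # writing the default is a deletion
        else:
            self.records[key] = value     # otherwise an insertion or update

    def get(self, index):
        return self.records.get(self.linearize(*index), self.default)


store = SparseStore(lambda i, j: row_major(i, j, 3, 4))
store.set((1, 2), 7.5)
print(store.records)
```

Each function is a bijection onto a contiguous key range when its arguments cover the full index space, which is the property the definition of a linearization function requires.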
With LAB-tree, we support three types of array accesses:

Accessing an element by its array index $\vec{\imath}$, which amounts to accessing the LAB-tree with key $f(\vec{\imath})$.

Accessing elements of an array via an iterator with linearization function g, which specifies the access order and may differ from the linearization function f used for controlling the storage order. The i-th element in the access order has LAB-tree key $f(g^{-1}(i))$. We implement various optimizations to speed up key calculation, including incremental computation of $f \circ g^{-1}$ and detecting the special (but common) case of f = g. Further details can be found in the appendix. We also support an option to iterate over only non-zero elements.

Reading/writing elements in a specified hyper-rectangle in the array index space. This type of access is common in I/O-efficient matrix algorithms (such as multiply) that process matrices a chunk at a time, whose size depends on the amount of available memory. Supporting such accesses as batch operations allows us to avoid the overhead of iterator calls and provide more efficient implementation for built-in linearization functions.

4 Efficiency Through Better Space Utilization

This section tackles B-tree's efficiency problem from two angles: splitting strategy (Section 4.1) and leaf storage format (Section 4.2). Both aim at improving space utilization, which, as pointed out in [5] and validated by our empirical study (Section 4.3), is largely in line with the goal of improving time efficiency as well. We show that by exploiting the special characteristics of arrays, LAB-trees can achieve much better performance than conventional B-trees.

4.1 Splitting Strategy Revisited

As motivated in Section 1, the standard B-tree splitting strategy can lead to lots of wasted space within leaves that will never get used. In the following, we formalize the desirable properties of a splitting strategy, propose several alternatives, and discuss their properties.

We begin with some terminology. Let κ denote the leaf capacity, or the maximum number of records that can be stored in a leaf of the index. Each leaf has a (key) range, which contains all keys of records stored in this leaf.
The set of all leaf ranges forms a disjoint partitioning of the key domain. Since our index stores a 1-d array, a leaf range is an interval [l, u), where l and u are the lower bound (inclusive) and upper bound (exclusive) of the 0-based array indices stored in the leaf. We define the density of a leaf ℓ, denoted ρ(ℓ), as the number of records in ℓ divided by its capacity. Density can be similarly defined for a set of leaves or the entire index.

When a record needs to be inserted into a leaf that has range [l, u) and already holds κ records (thereby causing it to overflow), a splitting strategy chooses a splitting point x, such that the original leaf is split into two leaves with ranges [l, x) and [x, u). A splitting strategy operates in an online fashion; i.e., it processes the current insertion without knowledge of future insertions. To ensure low runtime overhead, we consider only local splitting strategies, i.e., ones that do not read or modify leaves other than the one being inserted into. Also, we focus on leaf splitting strategies; splitting at upper levels of the index has little impact on the overall space and efficiency, and we simply follow the standard B-tree strategy.

The standard B-tree leaf splitting strategy is as follows:

Split-in-Middle. Given an overflowing leaf with κ + 1 records with keys $k_0, k_1, \ldots, k_\kappa$, this strategy chooses the splitting point to be $x = k_j$, where j is (κ + 1)/2 rounded to an integer.

There are two desirable properties that a good splitting strategy should have: bounded space consumption and no dead space. The space consumption of a splitting strategy can be measured by its competitive ratio with respect to an optimal offline algorithm. Formally, a splitting strategy Σ is α-competitive if, for any insertion sequence S, the number of leaves produced by Σ at the end of S is less than α times that produced by an optimal offline algorithm, within an additive constant. Knowing the entire S, the optimal offline algorithm basically stores all non-zero array elements compactly, so an array with range [0, N) and n ≤ N non-zero elements can be stored in ⌈n/κ⌉ leaves. Split-in-middle is clearly 2-competitive, because it always generates leaves that are half full.
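To make the half-full behavior concrete, here is a small simulation of split-in-middle on the leaf level only. It is a sketch, not the paper's code: the function name is hypothetical, upper tree levels are ignored, and rounding j down is an assumption.

```python
import bisect
import math


def simulate_split_in_middle(insertions, kappa):
    """Insert distinct keys one at a time; return the resulting leaves.

    `lows[n]` is the lower bound of leaf n's range and `leaves[n]` holds
    its keys in sorted order. A leaf splits when it exceeds `kappa` records.
    """
    lows, leaves = [0], [[]]
    for key in insertions:
        n = bisect.bisect_right(lows, key) - 1    # leaf whose range holds key
        leaf = leaves[n]
        bisect.insort(leaf, key)
        if len(leaf) > kappa:
            j = (kappa + 1) // 2                  # assumed rounding: floor
            left, right = leaf[:j], leaf[j:]
            leaves[n:n + 1] = [left, right]
            lows.insert(n + 1, right[0])          # splitting point x = k_j
    return leaves


leaves = simulate_split_in_middle(range(16), kappa=4)
print(len(leaves), "leaves; compact storage needs", math.ceil(16 / 4))
```

For sequential insertion of 16 keys with κ = 4, every leaf except the last ends up with two records and two slots that no later insertion can fill, so the tree uses 7 leaves where 4 would do.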
It turns out that this competitive ratio is the best we can hope for: we show that no deterministic local splitting strategy can have a competitive ratio of less than 2 (see the appendix).

A second desirable property of splitting strategies is no-dead-space. By dead space we mean empty slots in leaves that can never be filled by future insertions. (We assume standard B-tree leaf format for now; optimizations for dense array regions are discussed later in Section 4.2.) For example, every leaf except the last one in Example 1 has two slots of dead space. Note that the notion of dead space is special to unique indexes with discrete key domains such as our setting. General B-tree leaves do not have dead space; it is always possible to insert a record with a duplicate key, or a record between two adjacent existing keys (up to some limit: precision of floating-point keys or maximum length of string keys).

Formally, we define the no-dead-space property as follows. Without loss of generality, assume that the array size is a multiple of κ. (Otherwise, for an array with range [0, N), the last leaf can have κ⌈N/κ⌉ − N slots of dead space.)

Definition 1 (No-Dead-Space). A splitting strategy Σ is no-dead-space if for any index state Σ may result in, there exists a future insertion sequence that causes all leaves to be full under Σ.

As we have seen in Section 1, split-in-middle does not have this property. But how important is no-dead-space, given that split-in-middle already has the best possible competitive ratio? Consider any array (or a region within an array) with density ϱ. A strategy that is no-dead-space would be guaranteed to have a competitive ratio of no more than 1/ϱ for storing the array (or the dense region). In contrast, regardless of density, split-in-middle may well take twice the minimum space required, as illustrated in Example 1. Thus, split-in-middle is less attractive than a no-dead-space strategy when ϱ > 1/2, which is a rather common case in our setting. For example, all dense matrices fall into this case, unless they are at the early stage of being populated in non-sequential order. Hence, no-dead-space is an important property that focuses less on the worst case and more on the common case of dense matrices or dense regions in matrices.

We propose a novel strategy that is naturally no-dead-space:

Split-Aligned. Given an overflowing leaf ℓ with range [l, u), this strategy chooses the splitting point x to be a multiple of κ that minimizes the difference between the number of records in [l, x) and that in [x, u). If multiple values of x satisfy the condition, the one that minimizes $|x - (l+u)/2|$ is chosen.

In other words, split-aligned favors a split that is most balanced, like split-in-middle, but under the condition that the splitting point aligns with 0, κ, 2κ, ..., i.e., endpoints of the leaf ranges had we laid out all array elements (zero or non-zero) compactly. For example, with κ = 5, split-aligned will choose the following split:

[Figure omitted: the split chosen by split-aligned.]

It is easy to see that, starting with a single leaf with range [0, N), split-aligned is no-dead-space. An obvious question is how split-aligned does on competitive ratio.
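The split-aligned rule can be sketched in a few lines. The code below assumes the tie-break is the distance of x from the midpoint of the leaf range, and models only the leaf level; the function names are hypothetical and this is not the paper's implementation.

```python
import bisect


def split_aligned(keys, l, u, kappa):
    """Pick a splitting point for an overflowing leaf with range [l, u).

    `keys` are the leaf's kappa + 1 keys in sorted order. Candidates are the
    multiples of kappa strictly inside (l, u); the most balanced one wins,
    with ties broken by closeness to the midpoint of the range (assumed).
    """
    candidates = range((l // kappa + 1) * kappa, u, kappa)

    def cost(x):
        left = bisect.bisect_left(keys, x)        # records falling in [l, x)
        right = len(keys) - left                  # records falling in [x, u)
        return (abs(left - right), abs(x - (l + u) / 2))

    return min(candidates, key=cost)


def simulate_split_aligned(insertions, n, kappa):
    """Leaf-level simulation; leaf i has range [bounds[i], bounds[i + 1])."""
    bounds, leaves = [0, n], [[]]
    for key in insertions:
        i = bisect.bisect_right(bounds, key) - 1
        bisect.insort(leaves[i], key)
        if len(leaves[i]) > kappa:
            x = split_aligned(leaves[i], bounds[i], bounds[i + 1], kappa)
            cut = bisect.bisect_left(leaves[i], x)
            leaves[i:i + 1] = [leaves[i][:cut], leaves[i][cut:]]
            bounds.insert(i + 1, x)
    return bounds, leaves


bounds, leaves = simulate_split_aligned(range(16), n=16, kappa=4)
print(bounds, [len(leaf) for leaf in leaves])
```

On the sequential pattern with 16 keys and κ = 4, every split lands on a multiple of κ, so each leaf covers exactly κ keys of the domain and all four leaves end up full.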
Unfortunately, there is a fundamental trade-off between no-dead-space and bounded space consumption: we show that any no-dead-space splitting strategy must have a competitive ratio of at least (see the appendix), which is worse than split-in-middle in the worst case. We also show that split-aligned indeed has a competitive ratio of ; i.e., it is the best no-dead-space strategy possible (see the appendix). This bound is non-trivial, considering that split-aligned may generate near-empty leaves.

Besides split-in-middle and split-aligned, we also consider:

Split-off-Dense. Given a leaf to split with range [l, u), this strategy first considers two candidate splitting points l + κ and u − κ, which would result in a leaf with range [l, l + κ) or one with range [u − κ, u), respectively. Note these leaves will never be split further. If either leaf has density greater than 0.5, we choose the splitting point that would result in the leaf with the higher density. Otherwise, we fall back to split-in-middle. Intuitively, this strategy can be seen as a tweak to split-in-middle that first tries to split off a dense leaf that will not split again in the future. It is not hard to see that split-off-dense is no worse than split-in-middle in terms of competitive ratio, but split-off-dense may sometimes do better, e.g., the sequential insertion sequence in Example 1.

Split-Defer-Next. This strategy tries to choose a splitting point that delays the split of either result leaf as much as possible. Suppose we split a leaf ℓ with range [l, u) and keys $k_0, \ldots, k_\kappa$ into leaves ℓ₁ and ℓ₂ with splitting point x. Assuming that each future insertion hits each missing key with equal probability, we can calculate τ(x), the expected number of future insertions into [l, u) that will cause the first split of either ℓ₁ or ℓ₂, using a formula involving l, u, and $k_0, \ldots, k_\kappa$ (see the appendix for the formula and its derivation). Split-defer-next chooses the splitting point to be $\arg\max_x \tau(x)$. Unfortunately, the formula for τ(x) is quite involved, and we have no closed-form solution for this maximization problem; therefore, we resort to trying every $x \in \{k_1, \ldots, k_\kappa\}$ in a brute-force fashion.

Split-Balanced-Ratio. This strategy shares the same goal as split-defer-next, but uses a simpler optimization objective that is computationally easier. Given a leaf ℓ, consider the ratio χ(ℓ) between the number of free storage slots in ℓ and the number of keys missing from (and hence can be later inserted into) ℓ's range. Intuitively, a bigger χ(ℓ) means ℓ is less likely to split in the future. Split-balanced-ratio picks the splitting point that maximizes the minimum of the two resulting leaves' ratios. Specifically, given an overflowing leaf with range [l, u) and keys $k_0, k_1, \ldots, k_\kappa$, this strategy sets $x = k_{j^\ast}$, where

$$j^\ast = \arg\max_j \min\left\{ \frac{\kappa - j}{(k_j - l) - j},\ \frac{\kappa - (\kappa + 1 - j)}{(u - k_j) - (\kappa + 1 - j)} \right\}.$$

Section 4.3 compares these strategies with split-in-middle and split-aligned using various metrics, and evaluates their performance in practice with common workloads for matrices.

We have only discussed insertions so far. Deletions can be handled using standard B-tree techniques; see the appendix. They are not the focus of this paper because we find deletions to be rare in our workloads and hence less important to overall performance.

4.2 Dynamic Leaf Storage Format

As discussed in Section 1, plain B-trees are not efficient for dense arrays. We want LAB-tree to be efficient for dense arrays as well as arrays whose sparsity varies over time and across different regions inside them. To this end, LAB-tree supports two leaf storage formats, sparse and dense. Different leaves can have different storage formats, and each leaf can switch between the two formats dynamically.

A sparse-format leaf stores each non-zero array element in its range as a key-value pair; zeros are not stored. Let κ_s denote the sparse leaf capacity, i.e., the maximum number of records that can be stored by a sparse-format leaf. A dense-format leaf, on the other hand, stores all values (zero or non-zero) of array elements from a continuous subrange of its key range. The key that starts the subrange is also stored, but the other keys in the subrange are not, because they can be simply inferred from the starting key and the entry positions.
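Returning to the split-balanced-ratio objective defined earlier in Section 4.1, the following sketch evaluates it by brute force over the candidate keys. Treating a leaf whose range has no missing keys as having an infinite ratio is our assumption (such a leaf can never overflow), and the function name is hypothetical.

```python
import math


def split_balanced_ratio(keys, l, u, kappa):
    """Return the splitting point x = k_j that maximizes min(chi_left, chi_right).

    `keys` are the kappa + 1 sorted keys k_0..k_kappa of an overflowing leaf
    with range [l, u). chi is (free slots) / (keys missing from the range).
    """
    def chi(free, missing):
        # Assumption: a fully populated range can never split again.
        return math.inf if missing == 0 else free / missing

    best_x, best_score = None, -1.0
    for j in range(1, kappa + 1):                 # left leaf gets k_0..k_{j-1}
        x = keys[j]
        right_count = kappa + 1 - j
        left = chi(kappa - j, (x - l) - j)
        right = chi(kappa - right_count, (u - x) - right_count)
        score = min(left, right)
        if score > best_score:
            best_x, best_score = x, score
    return best_x


print(split_balanced_ratio([0, 4, 8, 12, 15], l=0, u=16, kappa=4))
```

For evenly spread keys the rule behaves like split-in-middle, while for a sequential prefix such as keys 0 to 4 in range [0, 16) it splits off the fully populated left part, since that maximizes the free-slot ratio of the remaining leaf.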
Let κ_d denote the dense leaf capacity, i.e., the maximum length of the subrange, or the maximum number of records that can be stored by a dense-format leaf. Clearly, κ_d > κ_s. For example, if the keys are 64-bit integers and values are 64-bit doubles, then κ_d ≈ 2κ_s. This two-format approach can be regarded as a simple compression method, which we feel provides a good trade-off between storage space and access time. More sophisticated compression methods are certainly possible, but they will likely add non-trivial decompression overhead to data accesses.

LAB-tree automatically switches between the two formats when a leaf is written. We call the effective range of a leaf ℓ the tightest interval containing all keys stored in ℓ. The effective range of ℓ is always contained in the range of ℓ. If an insertion overflows a sparse-format leaf ℓ, and the length of ℓ's effective range (containing all κ_s + 1 keys) is no greater than κ_d, then we switch ℓ to the dense format without splitting ℓ. Conversely, if an insertion into a dense-format leaf ℓ expands the length of its effective range to greater than κ_d but the total number of records is still below κ_s, then we switch ℓ to the sparse format without splitting ℓ.

The splitting strategies in Section 4.1 need to be modified to work with the dynamic leaf format. For split-aligned, we require the splitting point to be a multiple of κ_d. Other necessary modifications are not difficult to devise, but care is needed to cover all possible cases. Because of limited space, we will illustrate just one intricacy with an example. With κ_s = and κ_d = , consider the following overflowing dense leaf upon the insertion of key 97:

[Figure omitted: the overflowing dense leaf.]

Without modification, split-aligned would choose (a multiple of κ_d) as the splitting point. However, the resulting right leaf cannot store all of , 9, , , , and 97, with either dense or sparse format. Hence, it is necessary to further modify split-aligned to rule out infeasible splitting points. In this case, will be ruled out, and 9 will be chosen instead.

4.3 Experimental Evaluation

Splitting Strategies on Common Insertion Patterns. We first compare the performance of various splitting strategies, for now assuming sparse formats across all leaves. We consider the following patterns for populating an initially empty matrix with row-major layout: seq(uential) inserts elements in row-major order; str(ided) inserts elements in column-major order; int(erleaved) inserts elements in row- and column-major orders in an interleaving fashion (as in LU factorization); and ran(dom) inserts elements in random order.

Figure : Splitting strategies, with all leaves using the sparse format. In the first three graphs (for seq, str, and int), horizontal axes show the percentage of elements inserted so far and vertical axes show the number of pages allocated; each plot contains one data point every insertions, and shows one tick every insertions. In the last graph, the vertical axis shows the break-down of running time into I/O and CPU, with CPU on top.

Figure summarizes the results for a matrix and a M buffer pool; see the appendix for detailed experimental setup. Results on other scales are similar. For this experiment, ran is too expensive to run to completion; it takes an hour just to process % of the insertions.
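The four insertion patterns can be generated as sequences of keys under a row-major layout, as sketched below for an n × n matrix. The text does not spell out the exact interleaving used for int; the version here (finish row k, then column k, for k = 0, 1, ...) is only our assumed reading of "as in LU factorization", and all function names are hypothetical.

```python
import random


def seq(n):
    """Row-major insertion order."""
    return [(i, j) for i in range(n) for j in range(n)]


def strided(n):
    """Column-major insertion order (the str pattern)."""
    return [(i, j) for j in range(n) for i in range(n)]


def interleaved(n):
    """Assumed int pattern: the rest of row k, then the rest of column k."""
    order = []
    for k in range(n):
        order += [(k, j) for j in range(k, n)]
        order += [(i, k) for i in range(k + 1, n)]
    return order


def ran(n, seed=0):
    """Random insertion order, reproducible through the seed."""
    order = seq(n)
    random.Random(seed).shuffle(order)
    return order


def to_keys(order, n):
    """Row-major linearization of each (row, column) index."""
    return [i * n + j for i, j in order]


print(to_keys(strided(3), 3))
```

Feeding these key sequences to a leaf-level simulation of a splitting strategy is one way to reproduce the qualitative space behavior discussed in this section on a small scale.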
As its performance is clearly unacceptable regardless of the choice of splitting strategy, we do not discuss ran further here. We will, however, revisit ran in Section 5 because update batching helps improve its performance.

From the first three graphs in Figure , we see that standard split-in-middle uses about twice as much space as others throughout the course of each workload. From the last graph, we see that split-in-middle's simpler splitting logic is not enough to make up for its loss in I/O efficiency. On the other hand, split-aligned maintains a noticeable lead ahead of split-in-middle in running time, and is the best strategy overall in both space and time efficiency. As for other strategies, split-off-dense has curiously high running time for despite its low number of I/Os (whose plots are not shown here but are consistent with the first three graphs); a closer examination of the traces reveals that split-off-dense's tendency to generate far more unbalanced leaves than others leads to very scattered I/Os. Split-balanced-ratio has no better space utilization than split- but carries higher CPU overhead. We omit split-defer-next here and subsequently, because it has prohibitive CPU overhead but offers no significant space savings.

(Note that our CPU time accounting includes time spent outside system calls on behalf of I/Os. In particular, time spent on I/Os served from our buffer pool without hitting the disk is counted towards the CPU time instead of the I/O time. In this figure, the CPU time's significant proportion is in part explained by the effectiveness of our buffer pool for these workloads.)

Figure : Splitting strategies, with dynamic leaf storage format.

Next, we repeat the experiments with dynamic leaf storage format, to study how this feature further affects performance. Figure summarizes the results. All strategies benefit from this feature, but split-aligned benefits more, thanks to its ability to produce leaves that are better aligned (and hence better "prepared") for the dense format.
For the more interesting patterns of str and int, its advantage over split-in-middle widens to a factor of more than .5 in terms of space, and more than .7 in terms of time; its advantage over other strategies is also more pronounced than in Figure . Moreover, the relative performance differences stay the same over the course of the workloads (plots are omitted here, but exhibit the same linear trends as the first three graphs in Figure ). In conclusion, split-aligned is a clear winner.

Finally, note that these experiments only report the running time of populating the matrix. Split-aligned, with its highest space efficiency, becomes even more appealing if we consider the cost of accessing the matrix subsequently. For other strategies, one could bulk load (and compact) the array at the end of the insertion sequence to make subsequent scans more efficient, but doing so would further add to the running time and, for a dense matrix, result in a final tree no better than split-aligned.

Scalability Test. The experiments above are all performed on a matrix (with million elements). We also vary the matrix size and plot the normalized total running time (obtained by dividing the total running time by that of split-in-middle) in Figure . The results show a consistent relative gap between split-in-middle and split-aligned, with or without the dynamic leaf storage format. In terms of absolute running time (not plotted here), both strategies scale linearly with the matrix size. It is clear that split-aligned's space efficiency advantage extends to different data scales.

Buffer Pool Settings. We next replicate the experiments in Figure with different buffer pool sizes: a smaller M and a bigger M. The I/O and CPU time breakdown for the four splitting strategies with dynamic leaf page format is shown in Figure . Split-aligned and split-off-dense are generally able to better exploit a larger buffer pool to reduce their I/O time, although a larger-than-enough buffer pool does not bring further benefit, and in some cases may even cause extra CPU overhead (namely split- with M pool under ). Split-in-middle and split-balanced-ratio are relatively insensitive to the size of buffer pool. In this sense, their performance is more predictable. However, even if the memory resource is scarce, split- still has considerable advantage over them.

LAB-Tree, B-Tree, and Directly Addressable File. We now step up a level and compare the performance of LAB-tree (with split-aligned and dynamic leaf storage format), standard B-tree (with split-in-middle and sparse leaf format), and directly addressable file. The directly addressable file stores all array values compactly in a file, enabling direct lookups and eliminating the need to store array indices or to use extra indirections for indexing. File system optimizations allow us to allocate disk pages for the file lazily: if a page has never been written (because it contains all zeros), it is never allocated.

Figure : LAB-tree, B-tree, and directly addressable file; dense matrix.

Figure 5: LAB-tree, B-tree, and directly addressable file; sparse matrix.

First, we repeat the same experiments for a matrix in Figure , and summarize the results in Figure . In terms of space utilization, LAB-tree is on par with the directly addressable file, the best possible in this case; B-tree is four times worse, because it lacks the dense format and its leaves are mostly half-full. As for running time, the break-down into CPU and I/O offers interesting insights. In terms of CPU time, the directly addressable file is the fastest, and B-tree is the slowest; the reasons are that the file's direct address calculation is simpler than tree lookups, and that searching with the sparse leaf format (which B-tree uses exclusively) is more expensive than the dense format. In terms of I/O time, B-tree suffers from a larger number of I/Os. Surprisingly, the directly addressable file has the worst I/O time for str and int, even though it incurs a similar number of I/Os (not plotted here) as LAB-tree.
A closer look shows that the directly addressable file generates very scattered I/Os because column-major insertions hit faraway portions of the file. In this regard, LA- and B-trees are better at placing and moving array elements during the course of these workloads. This observation offers the insight that it can be suboptimal to simply place each element where it should be at the end of the insertion sequence, as the intermediate states of the data structure also affect performance.

Figure : Impact of buffer pool size. (Panels plot I/O + CPU = total time for each splitting strategy and buffer pool size.)

In the second set of experiments, we populate a sparse matrix with % randomly distributed non-zero elements. Figure 5 summarizes the results. As expected, the directly addressable file really suffers while B-tree shines, as there are not even locally dense regions in this matrix. Despite being unable to exploit any density, LA-tree maintains comparable performance to B-tree, except that LA-tree has slightly higher I/O time due to slightly more random I/Os.

From the above two sets of experiments, which straddle the opposite ends of the dense-sparse spectrum, we see that LA-tree is able to automatically achieve optimal (or close to optimal) performance without manual tuning.

Scalability Test
We also scaled the experiments above with different matrix sizes (Figure 7). While LA-tree and B-tree scale linearly under all tests, the directly addressable file's scalability is not linear. For dense matrices under a non-sequential insertion pattern, its performance degrades quickly and becomes inferior to LA-tree as the matrix size increases. For sparse matrices, it is always substantially slower than LA-tree and B-tree. Also note that across all scales LA-tree is able to maintain a factor of performance advantage over B-tree for dense matrices, while having comparable performance for sparse matrices.

More Interesting Insertion Patterns
We have only considered three common yet fundamental insertion patterns so far, namely, and int.
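The scattered-I/O effect described above for the directly addressable file can be illustrated with a small sketch. It assumes a row-major file and hypothetical page and element sizes (`page_size`, `elem_size`); the function names are ours and are not taken from the paper's system. It only counts how often consecutive insertions land on different pages, which is a rough proxy for scattered I/O and not a model of the actual experiments.

```python
def daf_offset(row, col, n_cols, elem_size=8):
    """Byte offset of element (row, col) in a row-major directly addressable file."""
    return (row * n_cols + col) * elem_size


def page_trace(order, n_rows, n_cols, page_size=4096, elem_size=8):
    """Page numbers touched, in order, when every element is inserted once."""
    if order == "row-major":
        cells = ((r, c) for r in range(n_rows) for c in range(n_cols))
    elif order == "column-major":
        cells = ((r, c) for c in range(n_cols) for r in range(n_rows))
    else:
        raise ValueError("unknown insertion order: " + order)
    return [daf_offset(r, c, n_cols, elem_size) // page_size for r, c in cells]


def page_switches(trace):
    """Number of consecutive insertions that land on different pages."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a != b)


if __name__ == "__main__":
    # Assumed toy setting: 64 x 64 matrix, 512-byte pages, 8-byte elements,
    # so each page holds exactly one row of the row-major file.
    for order in ("row-major", "column-major"):
        trace = page_trace(order, 64, 64, page_size=512, elem_size=8)
        print(order, page_switches(trace))
```

In this toy setting, matching the insertion order to the layout changes pages 63 times, while column-major insertion changes pages on every one of the 4095 steps, even though both touch the same 64 pages overall.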
Note that these patterns are independent from the storage layout or access pattern; instead, an insertion pattern is generated by a combination of two linearizations: access and storage. For instance, can happen if a row-major layout matrix is populated in column-major order, or vice versa. Now, we are ready to test two other patterns obtained by inserting into matrices with a block-based layout.

Given a matrix, we choose a block-based linearization as its layout. We set the block size to be (the biggest size that can still fit in a disk page). Within every block, elements are laid out in row-major order, and so are the blocks themselves. On top of this fixed block storage layout, we consider two ways of populating a matrix: row-major order and row-wise bit-reversal order. We call the two resulting patterns row/block and bit-reversal/block, respectively. Note that the second access pattern is an essential part of the D FFT algorithm. Combining the block storage layout with these two access patterns, the resulting patterns hitting the linear storage medium become more complicated and interesting.

We test the two patterns on a dense matrix. Figure plots the results. Again, in terms of space utilization, LA-tree is the same as the directly addressable file, the best possible in both cases. B-tree is more than three times worse due to its lack of dense format. In terms of both I/O time and total time, B-tree is also the worst, not surprisingly. For row/block, LA-tree's I/O time is on par with the directly addressable file's, but it has more CPU overhead; so the result is similar to in Figure . For bit-reversal/block, LA-tree's I/O time is only % of the directly addressable file's, which is enough to compensate for its higher CPU time. Overall, the results from these two new insertion patterns agree with previous results in Figure and do not change our conclusion.

Figure : Splitting strategies: scalability test with sparse (top row) and dynamic (bottom row) leaf formats. X-axes show the scale of matrix ( elements), while y-axes show the normalized total running time (split-in-middle as baseline).

Figure 7: LA-tree, B-tree, and directly addressable file: scalability on dense and sparse matrices. X-axes show the scale of matrix ( elements, including zeros in case of sparse matrix), while y-axes show the normalized running time (B-tree as baseline).

Figure : LA-tree, B-tree, and directly addressable file: more insertion patterns on blocked dense matrix. (Panels plot # pages allocated and I/O + CPU = total time for row/block and bit-reversal/block.)

BLAS on UFSparse
Stepping up yet another level, we examine how LA-tree compares with B-tree and the directly addressable file for linear algebra operations involving real-world matrices. For the operation, we test matrix multiply, an essential and often performance-critical building block of more sophisticated analysis. We use an I/O-efficient version of the block matrix multiply algorithm, which computes the result matrix one block (submatrix) at a time by reading and multiplying pairs of blocks from the input matrices and accumulating the multiplication results in memory. For multiplying submatrices in memory, we use the BLAS routine dgemm if both submatrices have density greater than .5, or the CHOLMOD [5] routines cholmod_ssmult or cholmod_sdmult otherwise.

For input, we use matrices from UFSparse, the University of Florida Sparse Matrix Collection []. To test each storage method, we prepare the input matrices with this method using a blocked layout that matches the pattern of blocks accessed by the I/O-efficient matrix multiply. We multiply each input matrix with itself, and save the result using the same storage method as the input.

Table : LA-tree, B-tree, and directly addressable file: total running time of dgemm on UFSparse and dense matrices. Columns: Name (ID), size, #nonzeros, LA-tree (s), B-tree (s), directly addressable file (s). Rows: opt (7), ramage (7), shp (77), std Jac (), GaAsH (5), net75 (9), human gene (), TSOPF RS b (9), Dense.

Figure 9: LA-tree, B-tree, and directly addressable file: UFSparse matrices. (Panels plot # pages allocated and I/O + CPU = total time for human_gene and TSOPF_RS_b.)

Here, we discuss results for two matrices, human_gene and TSOPF_RS_b (Figure 9). We report the total running time, which excludes input preparation but includes writing the result. For human_gene ( and density .79%), we use 5 5 blocks, and the total running time is sec for LA-tree, sec for B-tree, and 7sec for the directly addressable file. The directly addressable file suffers from a bloated input file. LA- and B-trees both perform well, with LA-tree leading by about %. Their input trees are comparable in size, because human_gene looks uniformly sparse. The result matrix turns out fairly dense, so the LA-tree result is more compact. For TSOPF_RS_b ( and density .%), we use blocks, and the total running time is sec for LA-tree, 5sec for B-tree, and sec for the directly addressable file.
Unlike human_gene, this matrix has a dense region despite its overall sparsity. LA-tree is able to exploit this local density to widen its lead over B-tree to a factor of .. Its lead over the directly addressable file narrows slightly, but is still more than a factor of .. Results on more matrices are presented in Table . The conclusion is consistent: for sparse matrices, LA-tree performs much better than the directly addressable file, and as well as or better than B-tree (depending on the uniformity of sparsity); for the full matrix, LA-tree has comparable performance to the directly addressable file, which is the best, while B-tree really suffers from its space inefficiency.

5 Update Batching

We now turn to the problem of batching index updates in a memory buffer (footnote 5) to consolidate writes to disk. To support index access while updates are being buffered, we organize this buffer as an index over the buffered updates; a record lookup is first checked against this in-memory index. Whenever the buffer is full, we need to flush updates, i.e., apply them in a batch to the underlying disk-resident indexes. As discussed in Section , we question the common practice of flushing all buffered updates whenever the buffer is full. Section 5. presents alternative policies and a theoretical analysis of their performance. Section 5. discusses implementation issues and Section 5. presents an empirical evaluation.

5. Flushing Policies and Analysis

To simplify theoretical analysis, we make some assumptions. First, we view each update to a record r as a request for the disk page (leaf) that contains r or will contain r, and we assume that we know the identities of all requested pages before each flushing action (see Section 5. for implementation details). Second, we assume that each flush incurs a fixed cost per update plus a fixed cost per page; multiple updates requesting the same page incur the per-page cost only once for the flush, reflecting the benefit of batching. Because the sum of per-update costs in the end remains the same no matter how we flush, we focus on minimizing the sum of per-page costs over time. Note that this analytical model is an imperfect simplification of reality.
For example, it ignores the cost of obtaining page identities (Section 5.) and that of splitting (which depends on factors such as the splitting strategy). Nonetheless, it provides a reasonable estimate of the true cost, and makes our analysis more generalizable to other batch processing settings. With these assumptions, we now formally define the problem.

Definition . There is a set of pages P on disk, and a buffer of capacity K in memory for buffering requests. Every request refers to a page and takes unit space in the buffer. A flushing policy selects subsets of requests to flush as needed to keep the buffer size capped at K at all times. Flushing requests for the same page incurs unit cost. We are interested in an online flushing policy that minimizes the total cost over a request sequence.

Footnote 5: The buffer in this context should not be confused with the system buffer pool. This buffer batches updates while the buffer pool caches disk pages.

Footnote: Updates to currently buffered records are simply applied to the buffer, and are not counted as new requests. Therefore, n requests, even if they are for the same page, would take n units of space.

For brevity, by buffered requests we mean all requests eligible for flushing, which include the incoming request. Without loss of generality, we assume a policy only flushes when the buffer is full (any policy can be modified to do so without affecting the cost). We can also assume that if a policy flushes any request for a page P, it flushes all buffered requests for P; in this case, we simply say it flushes P.

As it may have occurred to the reader, this problem looks similar to cache replacement []. Unfortunately, known results on caching do not carry over. Although caching has been generalized to cases where pages can have varying sizes and eviction cost can be a function of the page size, an underlying assumption remains that the cache space devoted to a page P does not change as the number of requests to P increases. On the contrary, with our problem, n requests to the same page take n units of buffer space. This difference turns out to be fundamental. While we can develop flushing policies analogous to well-studied cache replacement policies, we will see that their performance differs both analytically and experimentally; new policies specialized for flushing are needed.

We now present our flushing policies. Here we summarize our theoretical results; see Appendix A for formal statements and proofs. We measure the performance of a flushing policy by its competitive ratio against OPT, the optimal offline policy, which knows the entire request sequence in advance. OPT can be implemented by an exponential-time search; the algorithmic details are irrelevant here. (As a side note, the optimal offline cache replacement policy, furthest-in-future [], is not optimal for flushing; see Remark .5.) We show that any policy is O(K)-competitive (Lemma ). (Had we been dealing with caching instead, this competitive ratio would have been the best that any deterministic policy can offer.) The most commonly used flushing policy actually does better:

Flush-All (ALL). This policy simply flushes the entire buffer whenever the buffer is full. We show that ALL is Ω( K)- and O( K log K)-competitive (Theorems and ).
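To make the cost model concrete, the following is a minimal sketch of Flush-All under the simplified model above, where only per-page costs are counted. The function name `flush_all_cost` is ours, and two details are assumptions made for illustration: the buffer is treated as full once it holds exactly K requests, and whatever remains buffered at the end of the sequence is flushed once so that every request is eventually paid for.

```python
from collections import Counter


def flush_all_cost(requests, K):
    """Total per-page cost of Flush-All (ALL) on a sequence of page requests.

    requests: iterable of page identifiers, one per update request.
    K: buffer capacity, in requests (each request takes unit space).
    """
    buffered = Counter()   # page -> number of buffered requests for that page
    size = 0               # total number of buffered requests
    cost = 0
    for page in requests:
        buffered[page] += 1
        size += 1
        if size == K:                 # assumption: "full" means K requests held
            cost += len(buffered)     # unit cost per distinct page flushed
            buffered.clear()
            size = 0
    cost += len(buffered)             # assumption: final flush of leftovers
    return cost


if __name__ == "__main__":
    alternating = [1, 2] * 4          # eight requests over two pages
    print(flush_all_cost(alternating, K=2))   # tiny buffer, little batching
    print(flush_all_cost(alternating, K=8))   # larger buffer, one batched flush
```

With the alternating sequence, a buffer of two requests pays for two pages on each of four flushes, while a buffer of eight requests pays for the two pages only once, which is the batching benefit that the per-page cost is meant to capture.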
We can generalize the lower bound above to what we call c-recent flushing policies (Definition in appendix), which do not buffer a request for a page if there has been no request for that page during the past ck requests. Clearly, ALL is -recent. We show that any c-recent policy is Ω( K/c)-competitive (Theorem 5).

The next few flushing policies have analogues in caching:

Least-Recently-Used (LRU). This policy always flushes the page whose most recent request is the oldest (among all pages' most recent requests). It is analogous to the classic cache replacement policy of the same name. We show that LRU is Ω( K)-competitive (Corollary ) by noting that LRU is -recent. (Note that for caching, LRU is optimally competitive, with a competitive ratio of K.)

Smallest-Page (SP). This policy always flushes the smallest page, i.e., one with the smallest number of currently buffered requests. It is analogous to the LFU (least-frequently-used) cache replacement policy. While LFU is widely used for caching, SP does not make much sense for flushing. Intuitively, SP flushes small pages, but flushing larger ones is more profitable as more requests can be processed with one page write. While SP attempts to preserve large pages, pages have little chance to grow large because they may get flushed when still small. We show that SP is Θ(K)-competitive (Lemma and Theorem ). The example constructed in the proof of Theorem makes the above intuition concrete. (Note that for caching, LFU's competitive ratio is unbounded.)

Largest-Page (LP). This policy always flushes the largest page, i.e., one with the largest number of currently buffered requests. It is analogous to the MFU (most-frequently-used) cache replacement policy. LP avoids SP's problem of flushing small pages. On the other hand, LP may flush a page prematurely just because it is currently the largest; however, that page may grow even larger if it is not immediately flushed. We show that, just like SP, LP is Θ(K)-competitive (Lemma and Theorem 7). The proof of Theorem 7 gives a concrete example of the premature flushing problem.
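The three caching-style policies above differ only in how they pick the page to flush, which the following sketch makes explicit. It is an illustration under the simplified per-page cost model and not the paper's implementation: the names `pick_victim` and `simulate` are ours, ties are broken by dictionary order, the buffer is assumed full once it holds K requests, and leftover pages are flushed once at the end.

```python
def pick_victim(policy, counts, last_request):
    """Choose the page to flush.

    counts: page -> number of currently buffered requests.
    last_request: page -> position of the page's most recent request.
    """
    if policy == "LRU":   # page whose most recent request is the oldest
        return min(counts, key=lambda p: last_request[p])
    if policy == "SP":    # page with the fewest buffered requests
        return min(counts, key=lambda p: counts[p])
    if policy == "LP":    # page with the most buffered requests
        return max(counts, key=lambda p: counts[p])
    raise ValueError("unknown policy: " + policy)


def simulate(policy, requests, K):
    """Per-page flushing cost of a one-page-at-a-time policy."""
    counts, last_request = {}, {}
    size = cost = 0
    for t, page in enumerate(requests):
        counts[page] = counts.get(page, 0) + 1
        last_request[page] = t
        size += 1
        if size >= K:                       # assumption: full at K requests
            victim = pick_victim(policy, counts, last_request)
            size -= counts.pop(victim)      # flush all requests for the victim
            del last_request[victim]
            cost += 1                       # unit cost per flushed page
    return cost + len(counts)               # assumption: final flush


if __name__ == "__main__":
    requests = ["a", "a", "a", "b", "c", "a", "a", "b"]
    for policy in ("LRU", "SP", "LP"):
        print(policy, simulate(policy, requests, K=4))
```

On this short sequence SP ends up writing more pages than LP because it keeps evicting single-request pages, in line with the intuition given for SP; the sequence is a toy example and says nothing about the competitive ratios.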
Next, we present two new policies: the first is a randomized variant of LP, while the second is a novel policy aimed at achieving a fundamentally better competitive ratio than the policies above.

Largest-Page-Probabilistically (LPP). This policy randomly flushes a page with probability proportional to the number of requests currently buffered for this page. It can be seen as a randomization of LP. Intuitively, LPP is designed to avoid the problems of LP and SP: larger pages have a higher chance of being flushed, but all pages have a chance to survive and grow larger. Another attractive feature of LPP is its efficiency of implementation, as we shall see in Section 5..

Largest-Group (LG). This policy partitions buffered requests into groups: Group $i$, where $0 \le i \le \log K$, contains a page P if the number of buffered requests for P is in the range $[2^i, 2^{i+1})$. We define the size of a group to be the total number of buffered requests for its constituent pages. When the buffer is full, LG flushes the group with the largest size. LG is a novel policy designed specifically for the update batching problem. Intuitively, LG's practice of flushing a group at a time offers better protection against an adversary than flushing a page at a time. With $\log K + 1$ groups, the largest group has at least $K/(\log K + 1)$ requests, so LG always flushes a sizable number of requests. Even if LG had chosen a wrong subset of requests to flush, this mistake cannot be repeated until the buffer is full again, which only happens after at least $K/(\log K + 1)$ more requests. In contrast, an adversary can more easily penalize policies that may flush only a few requests. We show that LG has a competitive ratio of O(log K) (Theorem 9), making it the theoretically best among our policies.

5. Implementation

Obtaining Page Identities and Ranges
All policies above except ALL require obtaining the page identity and key range for a buffered request. Such information is readily available by executing a partial lookup for the requested key in the LA-tree, without visiting the leaf page containing the key.
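The selection rules of the LG and LPP policies defined above can be sketched in a few lines. This is an illustration only: the function names are ours, the group index is taken as the floor of the base-2 logarithm of a page's buffered request count, and LPP is realized by sampling one buffered request uniformly at random, which selects each page with probability proportional to its count.

```python
import random


def largest_group(counts):
    """Pages in the LG group with the largest total number of buffered requests.

    counts: page -> number of buffered requests (each at least 1).
    Group i holds the pages whose count lies in [2**i, 2**(i+1)).
    """
    groups = {}
    for page, n in counts.items():
        i = n.bit_length() - 1            # floor(log2(n)) for n >= 1
        groups.setdefault(i, []).append(page)
    best = max(groups, key=lambda i: sum(counts[p] for p in groups[i]))
    return groups[best]


def lpp_victim(buffered_requests, rng=random):
    """Pick a page with probability proportional to its buffered request count.

    buffered_requests: list with one page identifier per buffered request.
    """
    return rng.choice(buffered_requests)


if __name__ == "__main__":
    counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 3, "f": 4}
    print(sorted(largest_group(counts)))   # the pages LG would flush together
    requests = [p for p, n in counts.items() for _ in range(n)]
    print(lpp_victim(requests, random.Random(0)))
```

In the example, the single-request pages form a group of size 3, pages d and e form a group of size 5, and page f alone forms a group of size 4, so LG flushes d and e together even though f is the largest individual page.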
Only one partial lookup is needed for requests to the same page, because once we obtain page P's range, we can check whether a request refers to P by comparing the requested key with P's range. Since only non-leaf levels are visited, a generic system buffer pool (not to be confused with the update buffer) is effective in reducing I/Os.

LP, SP, and LRU
At the time of flush, these policies make one pass over the buffered requests in key order. In the process, we find the identity and range of each requested page P, using one partial lookup (as opposed to one per request to P, as explained above). Remaining details are policy-specific and are given in Remark ..

To further reduce page identification cost, we maintain a cache that remembers the identity and range for up to a configurable number of pages. At the next flush, we avoid the cost of identifying such pages. Of course, this page information cache consumes space that could otherwise be devoted to buffering requests, which we account for in our empirical evaluation in Section 5..

LPP
At first glance, LPP seems to require knowing the counts of buffered requests for all pages. A far more efficient implementation is possible, however. We simply need to pick one buffered request uniformly at random, find the identity and range of its page,
