The Persistent Cache: Improving OID Indexing in Temporal Object-Oriented Database Systems

Size: px

Start display at page:

Download "The Persistent Cache: Improving OID Indexing in Temporal Object-Oriented Database Systems"

Millicent Wood
5 years ago
Views:

1 The Persistent Cache: Imroving ID Indexing in Temoral bject-riented Database Systems Kjetil Nørvåg Deartment of Comuter and Information Science Norwegian University of Science and Technology, Norway Abstract In a temoral DB, an ID index (IDX) is needed to ma from ID to the hysical location of the object. In a transaction time temoral DB, the IDX should also index the object versions. In this case, the index entries, which we call object descritors (D), also include the commit timestam of the transaction that created the object version. The IDX in a non-temoral DB only needs to be udated when an object is created, but in a temoral DB, the IDX has to be udated every time an object is udated. This has reviously been shown to be a otential bottleneck, and in this aer, we resent the Persistent Cache (PCache), a novel aroach which reduces the index udate and looku costs in temoral DBs. We develo a cost model for the PCache, and use this to show that the use of a PCache can reduce the average access cost to only a fraction of the cost when not using the PCache. Even though the rimary context of this aer is ID indexing in a temoral DB, the PCache can also be alied to general secondary indexing, and can be esecially beneficial for alications where udates are non-clustered. 1 Introduction In a transaction time temoral object-oriented database system (TDB), udating an object creates a new version of the object, but the old version is still accessible. A system Permission to coy without fee all or art of this material is granted rovided that the coies are not made or distributed for direct commercial advantage, the VLDB coyright notice and the title of the ublication and its date aear, and notice is given that coying is by ermission of the Very Large Data Base Endowment. To coy otherwise, or to reublish, requires a fee and/or secial ermission from the Endowment. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, maintained timestam is associated with every object version, usually the commit time of the transaction that created this version of the object. An imortant feature of DBs, is that an object is uniquely identified by an object identifier (ID), and that the object can be accessed via its ID. The ID can be hysical, which means that the disk age of the object is given directly from the ID, or logical, which means that an ID index (IDX) is needed to ma from logical ID to the hysical location of the object. In a TDB, logical IDs is the most reasonable alternative, because an IDX is necessary anyway to index the object versions. The entries in the IDX, which we call object descritors (D), contain administrative information, including information to do the maing from logical ID to hysical address, and the commit timestam. The IDX can be quite large. In non-temoral DBs, a tyical size is in the order of 2% of the size of the database itself [2]. This means that in general, only a small art of the IDX fits in main memory, and that IDX retrieval can become a bottleneck if efficient access and buffering strategies are not alied. An imortant difference between IDX management in non-temoral and TDBs, is that with only one version of an object (non-temoral), the IDX needs only to be udated when a new object is created. This can be done in an efficient aend-only oeration [2], and we can focus on otimizing IDX lookus. In a TDB however, the IDX must be udated every time an object is udated. An object udate creates a new object version, without deleting the revious version, hence, a new D for the new version has to be inserted into the IDX. The index ages will usually have low locality (the unique art of an ID is usually an integer that will always be assigned monotonic increasing values), and as a result index udates might become a serious bottleneck in a TDB. To reduce disk I/ in index oerations, the most recently used index ages are ket in an index age buffer. IDX ages will in general have low locality, and to increase the robability of finding a certain D needed for a maing from ID to hysical address, it is also ossible to kee the most recently used index entries (the Ds) in a searate D 66

2 cache, as is done in the Shore DB [6]. With low locality on index ages, a searate D cache utilizes memory better, sace is not wasted on large ages where only small arts of them will be used. An D cache reduces the index looku costs considerably, and can be extended to reduce index udate costs as well [1]. However, even when using a writable D cache, IDX udates are still very costly. In this aer, we resent an aroach to further reduce the IDX udate costs. Noting that the main reason for the bottleneck against the IDX is the low locality of entries in the IDX tree nodes, we use an intermediate disk resident buffer between the main memory buffer, and the IDX itself. We call this the Persistent Cache (PCache). The PCache is tyically much larger than the available main memory buffer, but smaller than the IDX itself. The entries in the PCache are managed in an LRU like way, just like a main memory cache. In addition to reducing udate costs, the PCache also reduces the looku costs. The reason for the reduced costs, is the higher locality on PCache ages, comared to the IDX tree nodes. Higher locality means that less disk oerations are necessary to read and write Ds. The PCache-to-IDX tree writeback can be done very efficiently later. This will be described later in this aer. In this aer, we describe the PCache in detail, and analyze its erformance by the use of cost functions. We will study otimal size of the PCache, and see how buffering of nodes in main memory should be done to otimize the PCache erformance. It should also be noted that even though our rimary context for this aer is ID indexing, the results are also relevant to entry access cost and index entry caching for general secondary indexes. The organization of the rest of the aer is as follows. In Section 2 we give an overview of related work. In Section 3 we describe object and index management in TDBs. In Section 4 we describe the PCache. In Section 5 we develo an the ID access cost model, and in Section 6 we use this cost model to study how different PCache sizes, memory sizes, index sizes, and access atterns affect the erformance. Finally, in Section 7, we conclude the aer and outline issues for further research. 2 Related Work Temoral database systems are in general still an immature technology, and in the case of transaction time TDBs, we are only aware of one rototye that has temoral ID indexing, PST/C++ [12]. The erformance results resented for PST/C++ are only for relatively small databases, where the index fits in main memory, and we exect that with a larger number of objects, the IDX would be a bottleneck. The PCache has similarities to LHAM [7], where a hierarchy of indexes is used. ne imortant differences between the PCache and LHAM, is that in LHAM, all entries in one level is regularly moved to the next, there are no Suort for versioning exists in most DBs, but not temoral management, indexing, and oerations. LRU management, and as such, LHAM only hels in imroving write efficiency, not read efficiency. The cost models in this aer are based on revious work on modeling non-temoral DBs [9] and temoral DBs [1], but the models have been extended to include the asects of the PCache. The buffer and tree models have been comared with simulation results. Detailed results from the simulations with different index sizes, buffer sizes, index age fanout, and access atterns, can be found in [11]. 3 TDB bject and Index Management We start with a descrition of how ID indexing and version management can be done in a TDB. This brief outline is not based on any existing system, but the design is close enough to make it ossible to integrate into current DBs if desired, and it will also be used as a basis for the ID indexing in the Vagabond TDB. 3.1 Temoral ID Indexing In a traditional DB, the IDX is usually realized as a hash file or a B -tree, with Ds as entries, and using the ID as the key. In a TDB, we have more than one version of some of the objects, and we need to be able to access current as well as old versions efficiently. If access is mostly reading current objects, it is efficient to have two indexes, one with Ds reresenting the current version of the objects, and one with Ds reresenting historical objects (i.e., revious versions). The roblem with this aroach, is that every time a new version is created, we have to udate two indexes. A second aroach, is to use a linked list of versions for each object. If accesses are mostly of the tye get all versions of an object with ID, this is an efficient alternative. However, access to a articular version, valid at time, is very costly with this aroach, because we have to traverse the object chain. ur aroach to indexing is to have one index structure, containing all Ds, current as well as revious versions. While several efficient multiversion access methods exist, e.g., TSB-tree [4] and LHAM [7], they are not suitable for our urose. We will never have search for a (consecutive) range of IDs, ID search will always be for erfect match, and most of them are assumed to be to the current version. TSB-trees rovides more flexibility than needed, e.g., combined key range and time range search, which imlies an extra cost, while LHAM can have a high looku cost when the current version of an object is searched for. In this aer, we assume one D for each object version, stored in a B -tree. We include the commit time TIME in the D, and use the concatenation of ID and time, ID TIME, as the index key. By doing this, Ds for a articular ID will be clustered together in the leaf nodes, sorted on commit time. As a result, search for the current version of a articular ID as well as retrieval of a articular time interval for an ID can be done efficiently. When a new object is created, i.e., a new ID allocated, its D is aended to the index tree as is done in the case of 67

3 ? the Monotonic B -tree [3]. This oeration is very efficient. Ho wever, when an object is udated, the D for the new version has to be inserted into the tree. It should be noted that this IDX is inefficient for many tyical temoral queries. As a result, additional secondary indexes can be needed, of which both TSB-tree and LHAM are good candidates. However, the IDX is still needed, to suort navigational queries, one of the main features of DBs comared to relational database systems. Some otimizations are ossible to this IDX, e.g., using variants of nested tree index, but as the PCache is the focus of this aer, we will not elaborate more on this subject here. 3.2 Temoral bject Management In a non-temoral (one-version) DB, sace is allocated for an object when it is created, and further udates to the object are done in-lace. This imlies that after an object udate, the revious version of the object is not available. The hysical location of the new version is the same as the revious version, hence, the IDX needs only to be udated when objects are created and when they are deleted. In a TDB, it is usually assumed that most accesses will be to the current versions of the objects in the database. To kee these accesses as efficient as ossible, and benefit from object clustering, the database is artitioned, with current objects in one artition, and the revious versions in the other artition, in the historical database. When an object is udated in a TDB, the revious version is first moved to the historical database, before the new version is stored in-lace in the current database. The IDX needs to be udated every time an object is udated. As long as the modified Ds are written to the log before commit, we do not need to udate the IDX itself immediately. This is done in the background, and can be ostoned until the second checkoint after the D have been written to the log. Index ages will be written to disk either because of checkointing, or because of buffer relacement. Not all the data in a TDB is temoral, for some of the objects, we are only interested in the current version. To imrove efficiency, the system can be made aware of this. In this way, some of the objects can be defined as non-temoral. ld versions of these are not ket, and the objects can be udated in-lace as in an one-version DB, and the costly IDX udate is not needed when an object is modified. This is an imortant oint: using an DB which efficiently suorts temoral data management, should not reduce the erformance of alications that do not utilize these features. 4 The Persistent Index Entry Cache The Ds accessed will be almost uniformly distributed over the index leaf nodes. The D cache makes read ac- It is also ossible that in a TDB alication, a good object clustering includes historical objects as well as current objects. This should be studied further, but does not have any imlications to the results studied here, all udates to objects will necessarily necessitate allocations of sace for the new object, and an IDX udate. %& ' %&)(+*-,/.! "$# 6!? M ctr? M M i iages 21 67*-8:9;<!3245$ < ), = 6 >#? M ocache B 1!3 6 > 6 >A 6! 6!' Figure 1: verview of index, PCache, and index related main memory buffers. PCache nodes PC1, PC4 and PC6, and 3 IDX nodes (denoted NDC ) are in the buffer. cesses efficient, but in a database with many objects, most of the Ds that are udated during one checkoint interval will reside in different leaf nodes. This low locality means that many leaf nodes have to be udated. When an index node is to be udated, an installation read of the node has to be done first. With a large index, the access to the nodes will be random disk accesses, and as a result, the installation read is very costly. To reduce the average access cost, the ersistent cache (PCache) can be used. The PCache contains a subset of the entries in the IDX, the goal is to have the most frequently used Ds in the PCache. In contrast to the main memory cache (the D cache), the PCache is ersistent, so that we do not have to write its entries back to the IDX itself during each checkoint interval. This is actually the main urose of the PCache: to rovide an intermediate storage area for ersistent data, in this case, Ds. The size of the PCache is in general larger than the size of the main memory, but smaller than the size of the IDX. The contents of the PCache is maintained according to an LRU like mechanism. The result should be high locality on accesses to the PCache nodes, reducing the total number of installation reads, and making checkointing less costly. Average IDX looku cost should also be less than without a PCache. To avoid confusion, we will hereafter denote the index tree itself as the TIDX, and use IDX to mean the combined index system, i.e., PCache and TIDX. Thus, when we say an entry is in the IDX, it can be in the PCache, in!1 >3 68

4 the TIDX, or in both. This is illustrated on Figure PCache rganization The index related main memory buffers, the PCache, and the TIDX, are illustrated on Figure 1. The number of nodes in the PCache should be small enough to make it ossible to store ointers to all the PCache nodes in main memory. To be able to do the coying of the Ds from the PCache to the TIDX efficiently, the nodes in the PCache should be accessed in the same order as the leaf nodes in the IDX. Therefore, the nodes in the PCache are range artitioned, each node stores a certain interval of IDs. Range-artitioning is vulnerable to skew. To avoid this, the artitioning can be dynamically changed (for each node, we have the ID range boundary in main memory). This is done based on the udate access rates on each node. A high udate access rate to a node results in a smaller interval being allocated to that node. Because the PCache nodes are frequently accessed, the reartitioning does not reresent any extra cost (the reartitioning can be done when neighbor nodes are resident in the buffer). It is imortant to note that even though reads of PCache nodes will be random disk accesses, the PCache nodes will be clustered together, so that the random read of PCache nodes will have a small seek time. There will also be several PCache node read requests at any time, so by using an elevator algorithm, the cost of reading from the PCache will be low. 4.2 PCache LRU Management The PCache nodes are oerated similar to an ordinary main memory cache, and when an entry is to be stored in a node in the PCache, one of the existing entries have to be discarded. To rovide access statistics, an access table is maintained for each node, with one bit for each entry in the node, and we use the clock algorithm as an LRU aroximation. An access bit is set each time an entry is accessed. As for the storage of the access tables, we have several otions: 1. The access table of a node could be stored in the node itself. A roblem with this otion, is that when a node is to be discarded from the main memory buffer, and entries have been accessed, the node needs to be written back to disk, even if none of the entries have been changed. This is not desirable. 2. Access tables are maintained in main memory, for each main memory resident node. When a node is discarded, i.e., due to buffer relacement, the table is discarded as well. A roblem with the this otion, is that until enough accesses have been done to the entries, the bit ma is unreliable as a way to aroximate LRU, and the wrong entries might get discarded. 3. Access tables are maintained in main memory for all PCache nodes. ne roblem with the this aroach, is that one table for each node in the PCache is needed, but because the size of each table is small (one bit for each entry in the node), this will not reresent a roblem as long as the PCache is not too large. If the system crashes, the contents of the access tables will be lost, and wrong caching decisions might be done when the system is restarted. This does only affect erformance at startu time, the Ds of committed oerations are always safe on disk. To reduce the amount of wrong caching decisions at startu time, we store the access table in the node as well when the node is written to disk. Note that this differs from otion 1, where the node is always written back when it is discarded, even when there are no udates to the Ds stored in the node. Based on the observations above, we conclude that maintaining access tables in main memory for all PCache nodes is the best aroach. In addition to the access tables, each node also contains a table to kee track of the status of the entries with resect to the TIDX. ne dirty bit is needed for each entry. The dirty bit is set each time an entry is modified or inserted into the node. nly the dirty entries need to be written back to the TIDX, entries not marked as modified can be safely discarded when needed. 4.3 Udate erations When emloying the PCache, inserting Ds resulting from object udates are always done to the entry in the PCache, never directly to the TIDX. Inserting an D into a node, imlies discarding another D from the node, based on the LRU strategy. It is referable to discard a non-dirty entry, so that a synchronous writeback of the dirty entry is avoided. To have a high robability of non-dirty slots in the PCache node, dirty entries in the PCache are regularly coied over to the TIDX itself, asynchronously in the background. This is done efficiently by mostly sequential reading of the PCache nodes, and mostly sequential installation read and subsequent writing of the TIDX nodes. 4.4 bject Creations bject creations are still alied directly to the TIDX. An D resulting from an object creation is an efficient aend oeration into the TIDX. In the case where a new Ds is to be art of the hot set, it will usually be retrieved from the TIDX nodes before they are discarded from the buffer. These Ds will on access be inserted into the PCache. 4.5 Read erations Read oerations belong to one of two classes: navigational (single object) read, and scan oerations. Single bject Read Read oerations are done by first checking if the entry to be accessed is in the D cache or the PCache, if not found there, the TIDX itself is searched. The search in the PCache 69

5 might result in one disk access if the actual node is not in buf D fer. When using range artitioning, there is only one candidate node, so that at most one disk access will be needed. Accessing the TIDX can result in one or more disk accesses if the TIDX nodes are not in buffer. When found, either in the PCache or the TIDX, the D is inserted into the D cache. If the D was not already in the PCache, we now have several otions, for examle: 1. Insert the D into the PCache immediately. Note that at this oint, we are guaranteed to have the candidate PCache node resident in buffer, because we have robed it during the search for the D. If we manage to get a high hit rate on the PCache, the otimal D cache size might be quite small in this case. Note that the D contains, in addition to the entries in the PCache, dirty entries resulted from udate oerations not yet installed into the PCache. 2. Insert the D into the PCache only when it is to be discarded from the D cache. In this way, we get good memory utilization, we do not have to use sace for the D both in the D cache and in the PCache. However, in this case, we are not guaranteed to have the candidate PCache node resident in buffer, it might have been discarded since it was robed. 3. Never insert the D into the PCache, only insert entries into the PCache when doing udate oerations. In this case, we rely on the D cache to kee the most frequently accessed Ds, and use the PCache to be able to do efficient udate of the TIDX. This strategy delays the udate of the TIDX, and means that more entries can be collected before batch udating the TIDX. The best otion to choose, deends on access attern. The ossible installation read of otion 2 can make it costly, and because Ds retrieved from read oerations are not inserted into the PCache in the case of otion 3, we only consider otion 1 in this aer. Scan erations Scan oerations must be treated different from single object read oerations, as one single scan oeration can make the current contents of the whole PCache to be discarded. The Ds retrieved during a scan oeration will in general have less chance of being used again, it is not likely that the whole collection or container to be scanned, reresents a hot set. Even if this is the case, it is ossible that the number of Ds retrieved during the scan, is larger than the number of Ds that fits in the PCache. In this case, if we do a new scan over the collection/container, we will have a PCache hit robability of. This is similar to general buffer management in the case of scan oerations. As a result, scan oerations should not udate the PCache, but the PCache must be consulted during read, because recently udated Ds from the actual container/collection might reside in the PCache. However, this will not be very costly, because the contents of a hysical container cached in the PCache will be clustered in the PCache s ages as well, so that the extra cost of reading the relevant PCache ages is only marginal. 4.6 PCache-to-TIDX Writeback The udate of the TIDX, i.e., writing dirty entries in the PCache to the TIDX, will be done in the background. This is done by reading the PCache, and install the dirty entries of these nodes into the TIDX. This is done in segments, i.e., a number of nodes, and will be mostly sequential reading and writing. The PCache-to-TIDX writeback is a scan oeration, and to avoid buffer ollution, nodes accessed during this oeration should not affect the rest of the buffer contents, i.e., they should not make other nodes to be removed from the buffer. The rate of the writeback is a tuning question. By giving it higher riority, i.e., doing more frequent writeback of PCache nodes, the robability of a PCache node being full of dirty entries is less likely. This is imortant, because it reduces the robability of synchronous writebacks. n the other hand, higher riority to the writeback also means that more of the disk bandwidth will be used for this urose, because each node contains a smaller number of dirty entries. All Ds udated since the enultimate checkoint, and still dirty in the D cache, needs to be installed into the PCache or TIDX during one checkoint eriod. This is not the case with the PCache-to-TIDX writeback. The eriod between each time the contents of a articular PCache node is written back can be very long, but still short enough to avoid overflow of dirty Ds in the PCache. 4.7 Buffer Considerations We can have a buffer shared between TIDX nodes and PCache. However, this does not necessarily give otimal erformance. In some cases, it might be that TIDX accesses ollutes the buffer, resulting in a low hit rate on PCache nodes. To avoid this, we can use searate buffers, one TIDX buffer, and one PCache buffer. We can also in a certain number of the uer TIDX levels in memory, this can be advantageous because strict use of LRU is not otimal when buffering nodes of an index tree. 5 Analytical Model Analytical modeling in database research has mostly focused on I/ costs. This is the most significant cost factor, and in reasonable imlementations, the CPU rocessing should go in arallel with I/ transfer, making the CPU cost invisible. With increasing amounts of main memory available, this is not necessarily correct, but CPU costs can easily be incororated into analytic models, and hence we consider it as an orthogonal issue to the one discussed in this aer (though it should be noted, that CPU cost should not affect the qualitative results in this aer). A more imortant asect of the increasing amount of main memory, 7

6 E ` i ` H n however, is that buffer characteristics become more imortant, hence, the increased buffer sace available must be reflected in the models. We use a traditional disk model, where the cost of reading a block from disk is the sum of the start u cost F start and the transfer cost F transfer. In our model, the average start u cost is fixed, and is set equivalent to G, the time it takes to do one disk revolution. The transfer cost is directly roortional to the block size, and is equivalent to reading disk tracks contiguously, e.g., transfer cost is equal to I>J H G, where K is the block size to be transferred, and LNM is the amount of data on one track. Thus, the total time it takes to transfer one block is F H KPRQSF start T F transfer QU G T I J G. Index costs can be reduced by artitioning the index over several disks. Declustering PCache nodes and TIDX nodes over several disks is straightforward. The time to read or write a random index age is FWVXQ Z F H VRP, where V is the index age size. In this aer, we do not consider the cost of reading and writing the objects themselves, or log oerations. Those costs are indeendent of the indexing costs, usually done on searate disks, and are issues orthogonal to the ones studied in this aer. In this aer, we focus on reducing access times. bviously, the reduced access times comes at the exense of more disk sace for the PCache. As disk caacity increases raidly, with a corresonding decrease in rice, we exect that in most cases, using the extra sace for the PCache will be worthwhile. As illustrated on Figure 1, a certain amount of memory, [ iages, is reserved for the index age buffer, i.e., for buffering PCache and TIDX ages, and [ ocache, is reserved for the D cache. If we assume the size of each D is od, and an overhead of oh bytes is needed for each entry in the D cache, the number of entries that fits in the D cache is aroximately ocache ] _5` ocache ^ oha. If we use an hash table and a clock od algorithm as an LRU aroximation, oh ]Ub B (B=bytes). The number of index ages that fits in the buffer is aroximately ibuf ] _5` iages ^ P oha. For each PCache age on disk, we need to kee in memory the disk address of the node (4 B), the ID range boundary (4 B), and the LRU access table as described reviously. This occuies a total of [ ctr Q Vdc of PCache nodes. `fe&gh` 5.1 Index Entry Access Model od T b P B, where Vdc is the number We assume accesses to objects in the database system to be random, but skewed (some objects are more often accessed than others). We further assume it is ossible to (logically) artition the range of IDs into artitions, where each artitions has a certain size and access robability. This is illustrated at the bottom of Figure 2 (note that this is not how it is stored on disk, this is just a model of accesses). We consider a database in a stable condition, with a total of objver objects versions (and hence, objver index entries). Note that with the TIDX described in Section 3.1, Leaf nodes jmkl jmkl... jkl jkl jkl... jkl... jmklfjmkl... jmkl... jkl jkl... jkl jmkl β... n β1 Root node jkl jkl... jkl jmkl jmklfjmkl jmkl... jkl jkl jkl n β2 Tree based ID index, hysical view Maing Logical view Figure 2: ID index. The lower art shows the index from a logical view, the uer art is as an index tree, which is how it is realized hysically. We have indicated with arrows how the entries are distributed over the leaf nodes. erformance is not deendent of the number of existing versions of an object, only the total number of versions in the database. In many analysis and simulations, the 8/2 model is alied, where 8% of the accesses go to 2% of the database. While this is satisfactory for analysis of some roblems, it has a major shortcoming when used to estimate the number of distinct objects to be accessed. When alied, it gives a much higher number of distinct accessed objects than in a real system. The reason is that for most alications, inside the hot sot area (2% in this case), there is an even hotter and smaller area, with a much higher access robability. This is even more imortant for a temoral database. Most of the accesses will be to a small number of the current versions. With a large number of revious versions, this hot sot area will be much smaller and hotter than the one in a tyical traditional database. This has to be reflected in the model. 5.2 Index Tree Size If we assume an index tree with sace utilization o (tyically less q than.69 for a B -tree), the number of leaf objver nodes is tree ] sut r`fevgh` odw. The fanout x of internal nodes is xyq{zzo V} ie~, where ie is the size of an entry in an internal node. The number of levels in the tree is Q q Tƒ Š treeˆ. The Ž number of nodes at each level in tree the tree is GhŒŒ Q r ˆ and the total number of nodes in the tree is tree Q U 5 W tree. 5.3 Buffer Performance Model ur buffer model is an extension of the Bhide, Dan and Dias LRU buffer model (BDD) [1]. Due to sace constraints, we only resent the most imortant asects of our model in this aer, but a detailed descrition can be found in [9]. A database in the BDD model has a size of data granules (e.g., ages), artitioned into artitions. Each ar- 71

7 ² ` _ ` tition contains a fraction of the data granules, and š of the accesses are done to each artition. The distributions within each of the artitions are assumed to be uniform, and all accesses are assumed to be indeendent. We denote a articular artitioning set œ Q š ž:ÿ:ÿ Ÿž šd ž 7ž:Ÿ Ÿ:Ÿž P. For examle, for the 8/2 model, œ i g Q Ÿ b ž Ÿ ž Ÿ/ ž Ÿ b P. We will in the following use œ as short for the actual D access artitioning set. In the BDD model, the steady state average buffer hit - ž ž œªp, where is the num- robability is denoted buf ber of data granules that fits in the buffer. The buffer hit robability for data granules belonging to a articular artition is denoted buf ž ž œªp. The BDD model can also be used to calculate the total number of distinct data granules accessed after C accesses to the database, distinct C ž ž œªp. D Cache Hit Rate Accesses to the D cache can be assumed to follow the assumtions behind the BDD model, they are indeendent and random requests. By alying this model with object entries as data granules, the robability of an D cache hit is ocache QS buf ocachež General Index Buffer Model objverž œ P. The BDD LRU buffer model only models indeendent, non-hierarchical, access. Modeling buffer for hierarchical access is more comlicated. Even though searches to the leaf ages can be considered to be random and indeendent, nodes accessed during traversal of the tree are not indeendent. We have extended the original model to be able to analyze buffer erformance in the case of hierarchical index accesses as well [11]. This is based on the observation that each level in the tree is accessed with the same robability (assuming traversal from root to leaf on every search). Thus, with a tree with levels, we initially Š have artitions. Each of these artitions are of size where tree, tree is the number of index ages on level in the tree. The access robability is for each artition. To account for hot sots, we further divide the leaf age artition into &«artitions, each with a fraction of N of the leaf nodes, and access robability š relative to the other leaf age artitions. Thus, in a global view, each of these artitions have size N q tree and access robability. In total, we have Q Ñ«{ T!P artitions. The hot sots at the leaf age level make access to nodes on uer levels non-uniform, but as long as the fanout is sufficiently large, and the hot sot areas are not too narrow, we can treat accesses to nodes on uer levels as uniformly distributed within each level. An examle of this artitioning is illustrated to the right on Figure 3, where a tree with Q³² levels and &«WQ leaf age artitions is artitioned into Q T!PRQSµ artitions. ¹ƒ»º 5 ) ÂWÃÅÄ7Æ:ÇNÈ ÎNÏ Ð È!Ñ ¼ Index Page Access Model ½Á¾½ÁÀ ½ ¾h½ À ) 5 ½Á¾½ÁÀ ½ ¾h½ÁÀ 5 ) ½Á¾½ÁÀ ) 5 ½Á¾½ÁÀ ½ ¾h½ À ) 5 ÉdÊ-ËÍÌ ÎNÏ Ð È!Ñ Figure 3: Access artitions. ½ ¾h½ À ½ ¾h½ÁÀ ½ ¾h½ À As noted, we can assume low locality in index ages. Because of the way IDs are generated, entries from a certain artition are not clustered in the index. This is illustrated in Figure 2, where a leaf node containing index entries contains unrelated entries from different artitions. This means that the access attern for the leaf nodes is different from the access attern to the database from a logical view. As described in [11], we can use the initial D artitioning (the D access attern) œ as basis for deriving the leaf node access artitioning œ. PCache Hit Rate We denote the robability that a certain D is in the PCache as ÒVÓc, the actual PCache age might be in memory or on disk. `fe The number of Ds in each PCache node is z ~, and od with a total of Vdc nodes, the` enumber of Ds that fits in the PCache is VdcQ VdcÔz od~. The robability that a certain D is in one of the PCache nodes can be aroximated to: VÓc QU buf Vdc ž PCache and TIDX Buffer Model objverž œ P When the PCache and the TIDX share the same main memory age buffer, the model has to reflect this. The access robabilities for TIDX nodes and PCache nodes are different, and as a result, in an LRU managed buffer, the hit rate will be different. In the buffer model, we use a artitioning as illustrated on Figure 3. n the figure, the PCache is one artition, and each level in the tree is one artition, with the leaf node artition further divided into two artitions, reflecting the existence of hot sot nodes (nodes belonging to one of the two leaf node artitions need not actually be hysically adjacent as on the figure). Considering the age accesses, all IDX lookus will access one PCache age, and Vdc P of the lookus will also access the TIDX. Each IDX looku results on average in T VÓc P age accesses. Thus, the total access robability of the PCache is š Vdc Q» V e+õ a, which is the fraction of accessed ages that is art of the PCache artition. The TIDX access robability is V e+õ. š Ö& hørù³q» V e+õa a 72

8 J T ã r The total number of index ages is Vdc `feûõ T G Œ»Œ. The PCache Ú contains Vdc Q ` e Õ of these ages, the TIDX contains NÖ& hørù Q ` e+õr Ü5ÝÞZÞ ròü5ýþ-þ of the ages. The TIDX artitions is further ròü5ýþ-þ artitioned into artitions as described above, and we denote the resulting artitioning (Figure 3) as œ Shared. We denote the PCache and TIDX node buffer hit robabilities as: buf PC Q Vdc buf buf TIDX Q ÖN Ø}Ù buf ibufž ibufž Vdc T Vdc T treež œ sharedp treež œ sharedp As noted in Section 4.7, it can be advantageous to use searate buffers for the PCache and TIDX. In that case, ibuf PC buffer ages are reserved for the PCache, and ibuf TIDXbuffer ages are reserved for the TIDX, so that ibuf PC T ibuf TIDX Q ibuf. Denoting the tree artitioning as œ Tree, the corresonding buffer hit robabilities using searate buffers are: ibuf PC buf PC Q r ` e Õ buf TIDX Q buf ibuf TIDXž GhŒŒ ž œ TreeP IDX Looku Cost Assuming we have the address of all PCache ages in memory, and use a range artitioned PCache, at most one disk access is needed for each PCache looku. Before a age can be read in, another have to be relaced. A age may contain dirty entries, because all read Ds from the TIDX are inserted immediately. In this case, the PCache age has to be written back. To be able to do this efficiently, we use the following strategy: n disk, we allocate sace for more nodes than the number of nodes in the PCache. When we read a age, we at the same time schedule dirty age(s) for writing, to an emty slot near the node(s) to be read. In that way, the extra write cost is only marginal comared to the read. Because we at all times kee the ointers to all PCache nodes in the memory, they can be written back to different laces every time.ß We aroximate the looku cost to: F looku PC Q buf PCPF V To access an entry in the TIDX, it is necessary to traverse the levels from the root to a leaf age. To do this, we need buf TIDXP disk accesses. The average cost of an TIDX looku is: F looku TIDX Q buf TIDXP With a robability of ocache, the ID entry requested ocachep of the re- is already in the D cache, but for quests, we have to access the PCache. With a robability of PC, the entry is in the PCache. If not, the TIDX itself has to be accessed. The average cost to retrieve an entry is: à It is interesting to note that by using this aroach, we do things according to the log-structured file system hilosohy, which our Vagabond TDB will be based on [8]. FWV F looku Q ocachep IDX Udate Cost F looku PC Vdc PF looku TIDXP We do not need to udate the PCache or the index ages in the TIDX immediately after an udate has been done. This is done in the background, and can be ostoned, increasing the robability that several udates can be done to each index age before they are written back. We calculate the average index udate cost as the total index udate costs during an interval, divided on the number of udates. In this context, we define the checkoint interval to be the number of objects that can be written between two checkoints. The number of written objects, cóv, includes created as well as udated objects. new objects are creations of new objects, and còv of the written newp of the written objects are udates of existing objects. We assume that the D cache is large enough to kee all dirty Ds through one checkoint interval, and that deleting and comacting ages can be done in background. This means that cóv=á ocache. Using a strategy that write a larger amount of Ds to the log before installing them into the TIDX, is difficult: If we did not kee all Ds not yet installed into the IDX in the memory, we would have to search the log on each access, to check if the log contained a more recent version than the one in the IDX. Creation of New Ds New object descritors are created when new objects are created. The number of created objects is còâ Q new cóv. When new objects are created, their Ds are aended to the index (we do not distribute the entries over the old node and the new one when the rightmost nodes are slit), and we have clustered udates. As described reviously, object creations are done directly q to the TIDX, Õ+ä and not to the PCache. This contributes to ã Q t `fengh` odw created leaf ages. This is a subtree in the index tree, of height ã ages. The total cost of creat- M, with ã Q W ing these object descritors is the cost of writing ã index ages to the disk. No installation read is needed for these ages. Assuming that the disk is not too fragmented, these ages can be written in one oeration, mostly sequentially: Z F writenew QSF H Modification of Existing Ds in the PCache When an object is udated, a new object version is created, and a new D has to be inserted into the IDX. The number of udated objects during one checkoint interval is s Q còv å còâ. In general, at least one D will be inserted into each PCache age (the number of udates in one checkoint interval is much larger than the number of PCache ages). In this case, the most efficient way to udate the PCache is to read sequentially a number of PCache ages, udate them, V P cóv 73

9 D Q Q Q T T T g T write them back, and continue with the next segment. Assuming that PCache-to-TIDX writeback has high enough riority, we can assume that when inserting a new entry into a PCache node, there is always a non-dirty entry that can be removed. The cost of this is: F write to PC Q - F H Vdc where Vdc is the number of PCache nodes. PCache-to-TIDX Writeback Cost The urose of the PCache-to-TIDX writeback is to always have non-dirty slots in the PCache nodes, where new entries can be inserted. The PCache-to-TIDX writeback runs continuously in the background. The eriod for each round is ideally so long that each PCache node is almost full of dirty entries when it is rocessed. The cost is equal to reading a number of PCache nodes (sequential reading), and writing the dirty entries back to the TIDX. If we assume each PCache node is almost full when we rocess it, we have æ VRP Vdc entries to write back in each round, where æ is the PCache node dirty fill factor, i.e., the amount of dirty entries in the node. This value should ideally be close to 1, but to avoid delays in normal rocessing due to overflow of dirty entries in nodes, æ should be sufficiently small, we will in the calculations in this aer use a value of.9. The number of udate objects during each round of PCache-to-TIDX writeback is ` còv Qçæ VÓc, which we call the suer checkoint eriod. Udating the index involves a age installation read, where the age where the last (current) version resides is read from disk, if the age is not already in the buffer. The cost of this is F looku TIDX for each distinct object modified. The number of distinct udated objects is: Ø s Q distinct ` cóv ž objverž œ P However, as noted in Section 3.2, not all objects in a TDB are temoral. We denote the fraction of the data accesses going to temoral objects as temoral. nly udates of these objects alter the IDX, udates of nontemoral objects will be done in-lace. The number of distinct udated temoral objects is: I Ø s QS temoral The number of leaf ages to be accessed as a art of the installation read: Ôè Ø s I distinct Ø s ž GhŒŒ ž œª dp If there is sace for the new D in the leaf node of the TIDX, it can be inserted there, and the node can be written back. If there is no sace in the node, the node is slit, a rocess done recursively, ossibly to the root. If a node is slit, the arent node has to be udated as well. Because of the ossibility of age slits, determining the udate cost is difficult. With sufficiently many entries on each index node, the robability of age slit is small enough to be neglected [13]. However, for some ages, there are more than one insertion to that age (ossibly generated by several udates to one object during one checkoint interval, remember that each udate creates a new entry to be inserted into the IDX). Thus, we include the age slit in our cost functions. According to Loomis [5], the robabil- ê è g, ity of a slit in a B -tree of order é is less than Áë so we aroximate slit ] ê_5` e gh` oda»ë. For each slit, the new age needs to be written back, as well as the udated arent node. However, note that there may be several slits affecting one arent node, in this case, it needs only be written back once. The resulting total write back cost is: F «writeback è è è slitpfwv T slitpfvv The equation above assumes that there will on average be less than one entry to be inserted in each leaf node. That is the case as long as we use the following otimization: If the checkoint interval is sufficiently large, it is more efficient to read the comlete index, udate the index nodes, and write it back (if memory is not large enough, this is done in segments). This will be very efficient, as the reading and writing will be sequential. The cost of this is: giving: F writeback QUìîí5ï F ««writeback Q F H Ôè Average Index Udate Cost tree V P F looku Ö& hørù T F «writeback ž F ««writebackp The average index udate cost er object is the total cost of udating the PCache and the PCache-to-TIDX writeback, divided on the number of udated objects: F udate Q F writenew còv F write to PC còv F writeback ` còv Note that the total PCache-to-TIDX writeback is for one suer checkoint eriod, while the PCache udate and object creating cost is er ordinary checkoint eriod. 5.4 IDX Access Cost Without PCache In a system with no PCache, the IDX looku cost is: F looku Q ocachepf looku TIDX Without a PCache, we need to write back all Ds to the TIDX each checkoint interval. This is similar to PCacheto-TIDX writeback, excet that we now write back only còv Í cóâ entries instead of ` cóv, which makes it more difficult to do it efficiently. The average udate cost is: F udate Q F writenew cóv F writeback TIDX cóv 74

10 š š ò š V Set 3P P P P Table 1: Partition sizes and artition access robabilities for the artitioning sets used in this study. Parameter Value [ ocache.1 [ L&M G od oh ie 5 KB 8.33 ms 8 KB 32 B 8 B 16 B Parameter Value objver 1 mill. o.67 cóv Ÿ ó new.2 write.2 temoral.8 ocache where F writeback TIDX is calculated according to the equations used for F writeback, excet that cóv cóâ is used instead of ` cóv in calculating Ø o. When calculating the buffer hit robabilities, the index age buffer is only used for the TIDX, with the absence of a PCache, no memory is used for PCache ointers and tables. 6 Performance Study We have now derived the cost functions necessary to calculate the average cost of IDX access under different system arameters and access atterns, with and without the use of a PCache. We will in this section study how different values for these arameters affects the access cost, under which conditions using a PCache is beneficial, and otimal sizes for the PCache. The mix of udates and lookus to the IDX affects the otimal arameter values, and they should be studied together. If we denote the robability that an oeration is a write as write, the average index access cost is the average of the cost of all index looku and index udate oerations: F access Q U writepf looku T writef udate ur goal here is to minimize F access. We measure the gain from using a PCache, with otimal arameter values, as: Gain Qð 7 uñ F access nopcache F access F access where F access nopcache is the access cost when not using a PCache. In the rest of this aer, we give PCache size as a fraction of the TIDX size. It is difficult to know what kind of access attern that will be exerienced in TDBs. It is ossible to do redictions based on current access atterns, but we believe that it is quite ossible that when suort for temoral features become common, alication develoers can utilize these in new ways. The access atterns used in this aer do not necessarily reresent any of these, but we will use them to show that the gain from using the PCache is considerable, under most conditions and access atterns. We have used four access atterns. The artition sizes and access robabilities are summarized in Table 1 (note that this is the ID access attern œª, and not the index age access attern œ ). In the first artitioning set, we have three artitions, extensions of the 8/2 model, but Table 2: Default arameters. with the 2% hot sot artition further divided, into a 1% hot sot area, a 19% less hot area, and a 8% relatively cold area. The second artitioning set resembles the access attern close to what we exect it to be in future TDBs, with a large cold set, consisting of old versions. The two other sets in this analysis have each two artitions, with hot sot areas of 2% and 5%. Unless otherwise noted, results and numbers in the next sections are based on calculations using default arameters as summarized in Table 2.ô Note that in this aer, when we talk about available main memory, we only consider the memory available for index related buffering, [. Main memory for object age buffering is orthogonal to this issue. With the values in Table 2, the Ds would occuy ]öõ Ÿ GB if stored comactly. Tyically, the objects themselves occuies at least four times as much sace as the IDX, if this is reflected in available main memory buffer, [ Q µ MB should imly a total buffer memory of 2-3 MB. In this study, we mainly investigate the index memory interval from [ Qø MB to [ Qùµ MB, as this is the most dynamic area of IDX access cost, but we will also show how the availability of larger amounts of memory affects erformance. 6.1 The Effect of Using a PCache Figure 4 illustrates the tyical cost involved in IDX access, using the default arameters, but with different index memory sizes [. The gain is from 2% to several 1%. We see that the PCache is esecially beneficial with relative small index memory sizes comared to the total index size. As the index size increases, the gain decreases (but as we will show in Section 6.4, the gain actually increases again with larger main memory sizes). 6.2 timal PCache Size The otimal PCache sizes are illustrated to the left on Figure 5. With access attern 2P82 and 3P1, most of the ú Note that even though some of the arameter combinations in the following sections are unlikely to reresent the average over time, they can occur in eriods, e.g., more write than read oerations. It is in situations like this that adative self tuning systems would be interesting, when arameter sets differs from the average, which systems traditionally have been tuned against. 75

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University Query Processing Shuigeng Zhou May 18, 2016 School of Comuter Science Fudan University Overview Outline Measures of Query Cost Selection Oeration Sorting Join Oeration Other Oerations Evaluation of Exressions