Report on Cache-Oblivious Priority Queue and Graph Algorithm Applications[1]

Marc André Tanner

May 30, 2014

Abstract

This report contains two main sections: in Section 1 the cache-oblivious computational model is motivated and introduced. Section 2 describes the main contribution of the original paper, an optimal cache-oblivious priority queue.

1 Background and Models of Computation

Memory systems of modern computers consist of complex multilevel memory hierarchies with several layers of cache, main memory and disk. Since access times between different layers of the hierarchy can vary by several orders of magnitude, it is becoming increasingly important to obtain high data locality in memory access patterns. These developments led to new theoretical models of computation which make it possible to analyze algorithms with regard to their memory behaviour.

1.1 Random Access Machine (RAM) Model

In the Random Access Machine (RAM) model memory is assumed to be infinite, i.e. the relevant data always fits into main memory. Furthermore the memory is considered to have uniform access time, which is why it is also referred to as flat memory. Clearly these assumptions are not suitable for use cases in which the memory system becomes the bottleneck, especially in a Big Data context with graphs consisting of millions of vertices and billions of edges. However, developing a model which is both simple and realistic is a challenging task. The external memory model, which has gained widespread use due to its simplicity, is described next.

1.2 External Memory Model

This model is also known as the I/O model or the cache-aware model (to contrast it with cache obliviousness) and was introduced in 1988 by Aggarwal and Vitter[2]. In order to avoid the complications of multilevel memory models, the memory hierarchy consists of only two levels: an internal memory (often called cache) of size M which is fast to access but limited in space, and an arbitrarily large external memory (referred to as disk), partitioned into blocks of size B, with significantly slower access times. The efficiency of an algorithm is measured in terms of the number of block or memory transfers it performs between these two levels. What follows are a few important bounds which characterize the I/O model and will be used later on in the analysis of the priority queue.

- Linear or scanning bound: scan(N) = Θ(N/B) is the number of memory transfers needed to read N contiguous items from disk.
- Sorting bound: sort(N) = Θ((N/B) log_{M/B}(N/B)) refers to the fact that sort(N) memory transfers are both necessary and sufficient to sort N elements.
- Finding the median of N elements is possible in O(N/B) memory transfers.
- Searching bound: the number of memory transfers needed to search for an element among a set of N elements is Ω(log_B N).

The searching bound is matched by the B-tree, which also supports updates in O(log_B N). Notice however that, unlike in the RAM model, one cannot sort optimally with a search tree: inserting N elements into a B-tree takes O(N log_B N) memory transfers, which is a factor of (B log_B N)/log_{M/B}(N/B) from optimal. While the external memory model is reasonably simple, its algorithms still crucially depend on the parametrization of M and B. Furthermore the algorithms have (at least in principle) to explicitly issue read/write requests to the disk as well as explicitly manage the cache.
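To get a feeling for these magnitudes, the two bounds above can be evaluated for concrete parameter values. The following Python sketch does just that; the parameter values are chosen purely for illustration:

```python
import math

def scan_bound(N: int, B: int) -> float:
    """Memory transfers to read N contiguous items: Theta(N/B)."""
    return N / B

def sort_bound(N: int, M: int, B: int) -> float:
    """I/O-model sorting bound: Theta((N/B) * log_{M/B}(N/B))."""
    return (N / B) * math.log(N / B, M / B)

# Illustrative values: 2^30 items, internal memory of 2^20 items,
# blocks of 2^10 items.
N, M, B = 2**30, 2**20, 2**10
print(scan_bound(N, B))     # 2^20 transfers for one scan
print(sort_bound(N, M, B))  # only twice the scan cost: log_{M/B}(N/B) = 2
```

With a realistic block size, sorting is thus only a small constant factor more expensive than scanning, which is far better than the O(N log_B N) cost of B-tree insertion.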
1.3 Cache-Oblivious Model

The main idea of the cache-oblivious model, introduced in 1999 by Frigo et al.[3], is to design and analyze algorithms in the external memory model, but without using the parameters M and B in the algorithm description, thus combining the simplicity of a two-level model with the realism of more complicated hierarchical models. The cache-oblivious model is based on a few assumptions which might seem unrealistic at first, but Frigo et al. showed with a couple of reductions, based on the least recently used (LRU) replacement strategy and 2-universal hash functions, that such a model can be simulated by essentially any memory system with only a small constant-factor overhead.

These assumptions are:

- There exist exactly two levels of memory.
- Optimal paging strategy: if main memory is full, the ideal block, i.e. the block which will be accessed the farthest in the future, is evicted.
- Automatic replacement: when an algorithm accesses an element that is not currently stored in main memory, the relevant block is automatically fetched with a memory transfer.
- Full associativity: any block can be stored anywhere in memory.
- Tall cache assumption: the number of blocks M/B is larger than the size of each block B, or equivalently M ≥ B².

It is important to realize that if a cache-oblivious algorithm performs well between two levels of the hierarchy, then it must automatically work well between any two adjacent levels of the memory hierarchy. This is implied by the fact that a cache-oblivious algorithm depends neither on the memory size nor on the block size. Therefore if the algorithm is optimal in the two-level model, it is optimal on all levels of the hierarchy. This is the reason why the cache-oblivious model is useful: it allows convenient algorithm analysis in a simple two-level model while still deriving reasonable conclusions about the much more complex, multilayer memory systems found in contemporary machines.

2 Optimal Cache-Oblivious Priority Queue

2.1 Background and Motivation

A priority queue maintains a set of elements, each with a priority (or key), under the operations insert, delete and deletemin, where a deletemin operation finds and deletes the minimum-key element in the queue. The goal is now to design such a data structure which is both optimal (that is, the number of memory transfers matches the sorting bound) as well as cache-oblivious (i.e. it should work without any knowledge of the memory and block size). These criteria are not satisfied by an implementation based on a heap or a balanced search tree as known from the RAM model.
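For reference, the interface in question is the classical one. A binary heap from Python's standard library provides exactly these operations with good RAM-model bounds, but, as just noted, without the desired I/O behaviour:

```python
import heapq

pq = []                   # a binary heap: O(log N) RAM-model operations,
heapq.heappush(pq, 42)    # but poor locality, hence not I/O-optimal
heapq.heappush(pq, 7)
heapq.heappush(pq, 19)

print(heapq.heappop(pq))  # deletemin returns the minimum key: 7
print(heapq.heappop(pq))  # then 19
```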
The authors point out that even though several efficient priority queues for the I/O model are known, none of them can readily be made cache-oblivious. Since there exist cache-oblivious B-tree implementations supporting all operations in O(log_B N) memory transfers, this immediately implies the existence of an O(log_B N) cache-oblivious priority queue. However, as discussed above, in order to sort optimally a data structure performing all the operations in O((1/B) log_{M/B}(N/B)) amortized memory transfers and O(log₂ N) amortized computation time is needed. This is exactly what the presented cache-oblivious priority queue achieves.

2.2 Structure

In order to minimize the number of memory transfers, a technique which could be described as lazy evaluation using buffers is employed. The idea is to keep keys with similar priorities grouped into buffers in such a way that random I/O is avoided as much as possible. Intuitively a buffer holds a certain interval of elements and is used to move elements between levels. As a consequence all elements of a buffer can be processed sequentially, in one operation, thus amortizing the cache misses among all involved elements. To efficiently support the deletemin operation an order among the elements - or at least among the buffers - needs to be maintained. Therefore the priority queue is structured in terms of levels which contain various buffers. The general idea is that smaller elements are stored in lower levels, and as the levels grow the contained elements do likewise. In particular all insert and deletemin operations are performed on the lowest level, and over time the larger elements rise up, whereas the smaller ones trickle down. During this process a level might reach its maximum capacity and elements of a buffer need to be pushed one level up. Similarly, if a level becomes too empty, elements are pulled from the next higher level. As will be shown, the structure is carefully designed in such a way that these operations can be performed efficiently. The whole data structure is statically pre-allocated and is completely rebuilt after a certain number of operations. The following section formally introduces this multilevel structure, the contained buffers as well as the maintained invariants.

2.2.1 Levels

The priority queue is built of Θ(log log N) levels. The largest level has size N and each subsequent level decreases in size as the 2/3 power of the previous one, until a constant size c is reached. The levels are referred to by their respective sizes, and the i-th level from above has size N^((2/3)^(i-1)).
Thus the levels from largest to smallest are level N, level N^(2/3), ..., level X^(3/2), level X, level X^(2/3), ..., level c^(3/2), level c.

2.2.2 Buffers

In order to efficiently move elements between different levels there exist two types of buffers. Up buffers store elements which are on their way up (i.e. they have not yet found the buffer they belong to) and will be stored in one of the down buffers higher up in the hierarchy. Similarly, down buffers store elements which are on their way down. Their size is chosen in such a way that the up buffer one level down can quickly be filled with the smallest elements among the down buffers of this level. Level X consists of one up buffer u^X that can store up to X elements, and at most X^(1/3) down buffers d^X_1, ..., d^X_{X^(1/3)}, each containing between (1/2)X^(2/3) and 2X^(2/3) elements. Notice that this means that each down buffer is at all times at least a quarter full. The element of each down buffer with the largest key is called the pivot element. In total the maximum capacity of level X is X + X^(1/3) · 2X^(2/3) = 3X. The size of the down buffers is twice as large as the size of the up buffer one level down. As an example consider the down buffers on level X^(3/2), which have size 2(X^(3/2))^(2/3) = 2X, whereas the up buffer u^X one level below has size X.

[Figure 1: Levels X^(3/2), X and X^(2/3) of the priority queue data structure, each showing its up buffer and down buffers, with some example elements illustrating Invariants 1-3. Pivot elements are highlighted.]

Furthermore, the following three invariants about the relationship between the elements in buffers of various levels are maintained.

Invariant 1. At level X, elements are sorted among the down buffers, that is, elements in d^X_i have smaller keys than elements in d^X_{i+1}, but the elements within each d^X_i are unordered.

Invariant 2. At level X, the elements in the down buffers have smaller keys than the elements in the up buffer.

Invariant 3. The elements in the down buffers at level X have smaller keys than the elements in the down buffers at the next higher level X^(3/2).

These invariants define intervals for the various buffers and ensure that the elements get larger as the levels grow.

2.2.3 Layout

The priority queue is stored in a contiguous array, where levels are placed consecutively from smallest to largest. Each level reserves space for its maximal capacity of 3X. The up buffer is stored first, followed by all down buffers in an arbitrary order, but linked together to form an ordered linked list.
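To make the geometry of the structure concrete, the following sketch (with N and the constant cutoff c chosen purely for illustration) tabulates the level sizes N^((2/3)^(i-1)) and the buffer capacities of one level:

```python
def level_sizes(N: int, c: int = 4) -> list[int]:
    """Level sizes from largest (N) down: each subsequent level is the
    2/3 power of the previous one, until a constant size c is reached."""
    sizes = [N]
    while sizes[-1] > c:
        sizes.append(max(c, int(round(sizes[-1] ** (2 / 3)))))
    return sizes

sizes = level_sizes(2**30)
print(sizes)                  # sizes shrink doubly exponentially
print(len(sizes))             # hence only Theta(log log N) levels

# Buffer capacities at a level of size X (here X = N^(2/3)):
X = sizes[1]
print(X)                      # the up buffer holds up to X elements
print(int(X ** (1 / 3)))      # at most X^(1/3) down buffers
print(2 * int(X ** (2 / 3)))  # each holding at most 2*X^(2/3) elements
```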

[Figure 2: Physical storage layout of level X, which has size 3X = X + X^(1/3) · 2X^(2/3): the up buffer u^X followed by the down buffers d^X_1, ..., d^X_{X^(1/3)}. Notice the arbitrary, but linked-together, order of the down buffers.]

Summing up over all levels, Σ_{i=0}^{log_{3/2} log_c N} 3N^((2/3)^i) = O(N), which leads to the following space requirement.

Lemma 1. The cache-oblivious priority queue uses O(N) space.

2.3 Operations

The priority queue works with two main operations: push, which inserts X elements into the next higher level X^(3/2), and pull, which moves the X elements with smallest keys from level X^(3/2) to the next lower one. Thus inserting an element into the priority queue corresponds to a push into the lowest level c. Similarly, a deletemin is implemented by a pull from the lowest level.

2.3.1 Push

A push is used when level X is full, in which case the largest X elements are moved from level X into the level above, X^(3/2). As a first step the X elements which are to be inserted into level X^(3/2) are sorted cache-obliviously using O(1 + (X/B) log_{M/B}(X/B)) memory transfers. Now that these X elements are sorted, they are distributed into the X^(1/2) down buffers of the next higher level X^(3/2). Remember that the elements are sorted among the down buffers (Invariant 1), and each down buffer stores its largest element as a pivot element. Therefore distributing the elements works by visiting the down buffers in linked order and appending elements as long as they are smaller than the current down buffer's pivot element. Elements with keys larger than the pivot element of the last down buffer d^{X^(3/2)}_{X^(1/2)} are inserted into the up buffer u^{X^(3/2)} of the same level. While this process is fairly straightforward, a few corner cases need to be handled carefully:

Down buffer overflows: Remember that a down buffer on level X^(3/2) is twice as large as the up buffer on level X and thus has a maximal capacity of 2X.
If during the distribution of elements this capacity is reached, the down buffer is split into two new down buffers and the elements are evenly distributed between them, such that both contain X elements.
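This split can be sketched in Python as follows; note that the function name is mine, and sorting is used here only as a stand-in for the linear-time, O(X/B)-transfer median selection the structure actually relies on:

```python
def split_down_buffer(buf: list) -> tuple[list, list]:
    """Split a full down buffer around its median so that both halves
    hold the same number of elements (keys assumed distinct)."""
    m = sorted(buf)[len(buf) // 2 - 1]  # median; the paper uses linear-time selection
    lower = [e for e in buf if e <= m]  # stays in the old buffer
    upper = [e for e in buf if e > m]   # goes into the newly allocated buffer
    return lower, upper

lower, upper = split_down_buffer([17, 3, 29, 11, 23, 5, 41, 31])
print(lower)  # [17, 3, 11, 5]: every key smaller than every key of `upper`
print(upper)  # [29, 23, 41, 31]
```

Both halves are produced by a simple scan, so the dominant cost is the median selection plus the scan, matching the median(X) + scan(X) + O(1) bound stated below.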

Algorithm 1 push X elements into level X^(3/2)
Input: an array A of size X

  sort(A)
  B := d^{X^(3/2)}_1
  for all e ∈ A do
    {find the correct down buffer to insert the current element}
    while B ≠ nil and B ≠ u^{X^(3/2)} and e > pivot(B) do
      B := B.next
    end while
    if B = nil then                {the element is too large for the down buffers}
      B := u^{X^(3/2)}             {hence prepare insertion into up buffer}
    end if
    if B = u^{X^(3/2)} then
      insert-into-up-buffer({e})   {see Algorithm 2}
    else
      if |B| = 2X then             {down buffer full}
        {check whether there is space left for a new down buffer}
        if number-of-down-buffers-on-level(X^(3/2)) = X^(1/2) then
          {if not, move content of last down buffer to up buffer}
          insert-into-up-buffer(d^{X^(3/2)}_{X^(1/2)})   {see Algorithm 2}
          d^{X^(3/2)}_{X^(1/2)} := ∅
        end if
        B_new := ∅                 {allocate a new down buffer}
        m := median(B)
        for all e' ∈ B do          {split the current buffer based on its median}
          if e' > m then
            B := B \ {e'}
            B_new := B_new ∪ {e'}
          end if
        end for
        {chain the new buffer into the linked list}
        B_new.next := B.next
        B.next := B_new
        {make sure the current element will be placed into the correct buffer}
        if e > pivot(B) then
          B := B_new
        end if
      end if
      B := B ∪ {e}                 {add the element to the buffer}
    end if
  end for

Algorithm 2 insert a set of elements into the up buffer of level X^(3/2), used by push
Input: an array A

  for all e ∈ A do
    if |u^{X^(3/2)}| = X^(3/2) then   {check whether up buffer is full}
      push(u^{X^(3/2)})               {if so, push all elements into the next higher level X^(9/4)}
      u^{X^(3/2)} := ∅
    end if
    u^{X^(3/2)} := u^{X^(3/2)} ∪ {e}
  end for

This splitting step is a two-phase process: first the median of the elements is calculated, based on which the elements are partitioned into their respective buffers in a simple scan. When calculating the median it is assumed that the priority queue contains no duplicate keys, that is, no elements with the same priority. Since the down buffers are linked together to form an ordered list, the newly allocated buffer can be placed at the end (after all already existing down buffers), where space is reserved. All in all this case can be implemented in median(X) + scan(X) + O(1) memory transfers.

Level X^(3/2) already contains the maximum number X^(1/2) of down buffers: This case is problematic if the above splitting procedure happens when there is no space left to allocate a new down buffer. In this case the fewer than 2X elements of the last down buffer d^{X^(3/2)}_{X^(1/2)}, which by Invariant 1 are larger than all elements of the other down buffers, are moved into the up buffer u^{X^(3/2)}. Since the number of elements involved is bounded by 2X, this case can be handled in scan(X) + O(1) memory transfers.

Up buffer u^{X^(3/2)} overflows: If the up buffer reaches its maximum capacity of X^(3/2), all of its elements are recursively pushed into the next higher level. Notice that after such a recursive push the up buffer is empty and X^(3/2) elements have to be inserted before another recursive push is needed. The cost of this recursive push is for now ignored; it will be taken into account when doing an amortized analysis over all levels.

Let us now analyze the number of memory transfers needed to perform such a push operation.
First the X elements are sorted; during the distribution step X elements are scanned, and in the worst case each of the X^(1/2) down buffers is visited. The above listed special cases can all be dealt with in scan(X) memory transfers, which means a push can be performed in

O(1) + sort(X) + scan(X) + X^(1/2) = O(1 + (X/B) log_{M/B}(X/B) + X^(1/2))

memory transfers. However, upon closer inspection the X^(1/2) term, which stems from the fact that during the distribution step non-full buffers might have to be written back, can be eliminated. To see this, a case distinction on X, the number of elements involved, is performed.

B² < X: or equivalently X^(1/2) < X/B, which immediately leads to O(1 + (X/B) log_{M/B}(X/B)).

B ≤ X ≤ B²: in this case the X^(1/2) term could possibly dominate. The problem is that during the distribution step a down buffer could have to be written back even though its data does not amount to a full block. However, since X^(1/2) ≤ B ≤ M/B, where the second inequality is justified by the tall-cache assumption (M ≥ B²), a block for each of the X^(1/2) down buffers can fit into memory. Notice that the operations take place on level X^(3/2) and, since B^(3/2) ≤ X^(3/2) ≤ B³, there exists only one such level. Therefore a fraction of the main memory can be reserved to hold such partially filled blocks until they become full and are written back to disk. Since the assumed optimal paging strategy will perform at least as well as the strategy outlined above, the X^(1/2) term can be eliminated.

X < B: this case induces no costs, since all levels of size less than B^(3/2) can be kept in main memory at all times.

Ignoring the cost of the recursion for now, it can be concluded that:

Lemma 2. A push of X elements from level X up to level X^(3/2) can be performed in O(1 + (X/B) log_{M/B}(X/B)) amortized memory transfers while maintaining Invariants 1-3.

2.3.2 Pull

A pull operation removes the X elements with smallest keys from level X^(3/2) and returns them in sorted order. It is used when there are not enough elements in the down buffers of level X. Recall that each down buffer at all times needs to be at least 1/4 full, which amounts to X/2 elements. During a pull X elements will be removed, but this invariant still has to be fulfilled. Therefore a case distinction on the number of elements contained within all down buffers is performed.
In the first case it is assumed that the down buffers contain at least (3/2)X elements. Since the maximal capacity of a down buffer at level X^(3/2) is 2X, the first three down buffers contain the smallest between (3/2)X and 6X elements. These elements are sorted using O(1 + (X/B) log_{M/B}(X/B)) memory transfers. The smallest X elements are removed, while the remaining between X/2 and 5X elements are left in one, two, or three down buffers of size between X/2 and 2X. These buffers can be constructed in O(1 + X/B) memory transfers, which means the sorting dominates. This procedure maintains Invariants 1-3.
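With the three down buffers given as plain Python lists, this first case can be sketched as follows (the function name and the simplified repacking are mine; the paper guarantees the resulting buffers have sizes between X/2 and 2X):

```python
def pull_case1(d1: list, d2: list, d3: list, X: int):
    """Down buffers hold at least 3/2*X elements in total: sort the first
    three, return the X smallest keys, repack the rest into down buffers."""
    t = sorted(d1 + d2 + d3)  # one cache-oblivious sort in the paper
    smallest, rest = t[:X], t[X:]
    # naive repacking into runs of at most 2X elements (simplified)
    buffers = [rest[i:i + 2 * X] for i in range(0, len(rest), 2 * X)]
    return smallest, buffers

smallest, buffers = pull_case1([9, 1, 5], [12, 10, 15], [20, 17, 23], X=4)
print(smallest)  # the 4 smallest keys: [1, 5, 9, 10]
print(buffers)   # remaining keys, repacked in sorted order
```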

Algorithm 3 pull from level X^(3/2), remove the X smallest elements
Output: the X smallest elements in sorted order

  {check whether the first three down buffers contain enough elements}
  if |d^{X^(3/2)}_1 ∪ d^{X^(3/2)}_2 ∪ d^{X^(3/2)}_3| < (3/2)X then
    P := pull from X^(9/4)   {if not, pull X^(3/2) elements from the level above}
    U := |u^{X^(3/2)}|
    M := merge(P, sort(u^{X^(3/2)}))
    u^{X^(3/2)} := U largest elements of M
    {distribute the remaining smaller elements into the down buffers}
    d^{X^(3/2)}_i := M \ u^{X^(3/2)}
  end if
  {sort the first 3 down buffers, return the X smallest elements and distribute the remaining ones}
  T := sort(d^{X^(3/2)}_1 ∪ d^{X^(3/2)}_2 ∪ d^{X^(3/2)}_3)
  S := X smallest elements of T
  d^{X^(3/2)}_1, d^{X^(3/2)}_2, d^{X^(3/2)}_3 := T \ S
  return S

In the second case, where the down buffers contain fewer than (3/2)X elements, a recursive pull of X^(3/2) elements is performed on the next higher level. Recall that the keys of the elements in an up buffer are unordered relative to the keys of the elements in the down buffers one level up. Assume the up buffer u^{X^(3/2)} contains U elements; these are sorted and then merged with the already sorted elements pulled from one level above. Now that all elements are sorted, the U elements with largest keys are inserted into the up buffer; thus the number of elements in the up buffer is the same as before the pull operation. The remaining between X^(3/2) and X^(3/2) + (3/2)X elements are distributed into the X^(1/2) down buffers, each receiving between X and X + (3/2)X^(1/2) elements. This procedure maintains the three invariants, and the down buffers now contain at least X^(3/2) elements, which means the previously discussed first case applies. As for the cost, it requires one sort and one scan of X^(3/2) elements, which is negligible compared to the cost of the recursive pull operation on the next level up. Ignoring these costs for now, it can be concluded that:

Lemma 3. A pull of X elements from level X^(3/2) down to level X can be performed in O(1 + (X/B) log_{M/B}(X/B)) amortized memory transfers while maintaining Invariants 1-3.
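The refill step of this second case (merge the recursively pulled, already sorted batch with the sorted up buffer, return the U largest elements to the up buffer, and distribute the rest among the down buffers) can be sketched as follows; the function name and the even distribution by ceiling division are illustrative choices of mine:

```python
import heapq

def refill(pulled_sorted: list, up_buffer: list, num_down: int):
    """Merge the batch pulled from above with the up buffer; the largest
    len(up_buffer) keys go back up, the rest fill the down buffers."""
    u = len(up_buffer)
    merged = list(heapq.merge(pulled_sorted, sorted(up_buffer)))
    new_up = merged[len(merged) - u:]  # the U largest elements
    down = merged[:len(merged) - u]    # smaller elements, already sorted
    size = -(-len(down) // num_down)   # ceiling division
    down_buffers = [down[i:i + size] for i in range(0, len(down), size)]
    return new_up, down_buffers

new_up, downs = refill([2, 4, 6, 8], [7, 1, 9], num_down=2)
print(new_up)  # the 3 largest keys return to the up buffer: [7, 8, 9]
print(downs)   # the rest, distributed in sorted order: [[1, 2], [4, 6]]
```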
2.3.3 Total cost

In order to analyze the amortized cost of an insert or deletemin, a sequence of N/2 operations is considered with regard to the memory transfers performed in their respective push and pull invocations. After N/2 operations the structure is completely rebuilt such that all up buffers are empty and level X has X^(1/3) down buffers, each containing X^(2/3) elements. Notice that this ensures that the largest level N is always of size Θ(N). The rebuilding can be performed in a sorting step using sort(N) memory transfers, or O((1/B) log_{M/B}(N/B)) transfers per operation.

A push of X elements from level X up to level X^(3/2) is charged to level X, because after such a push the up buffer u^X is completely empty and X elements will have to be inserted before another recursive push is needed. Similarly, a pull of X elements from level X^(3/2) down to level X is charged to level X, because X elements will have to be deleted from level X before another recursive pull is needed. During the N/2 operations at most O(N/X) pushes and pulls are charged to level X. According to Lemmas 2 and 3, a push or pull charged to level X uses O(1 + (X/B) log_{M/B}(X/B)) memory transfers. Altogether, the amortized memory transfers during the N/2 operations charged to level X are bounded by O(1 + (1/B) log_{M/B}(X/B)). Thus, summing up over all levels, it can be concluded that the total amortized transfer cost of an insert or deletemin operation in the sequence of N/2 such operations is:

Σ_i O((1/B) log_{M/B}(N^((2/3)^i)/B)) = O((1/B) log_{M/B}(N/B))

The paper briefly mentions that a delete operation can be implemented in the same bounds and concludes with:

Theorem 1. A set of N elements can be maintained in a linear-space cache-oblivious priority queue data structure supporting each insert, deletemin and delete operation in O((1/B) log_{M/B}(N/B)) amortized memory transfers and O(log₂ N) amortized computing time.

3 Conclusion

On a personal note, I find the simplicity of the cache-oblivious model quite appealing. It is remarkable that such a universal, hardware-independent model can be used to predict certain algorithm characteristics of real-world systems. More concretely, the main insight I got from the paper is the idea of lazy evaluation using buffers.
That is, the technique of carefully crafting a data structure in such a way that just about enough data can be kept in memory, and of organizing the data such that the required operations can always be performed in sequential fashion, thus yielding excellent I/O behaviour regardless of the underlying memory system. As for further information, the interested reader can find a more detailed description and analysis of the cache-oblivious priority queue in a follow-up paper by the same authors[4].

References

[1] L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro. Cache-Oblivious Priority Queue and Graph Algorithm Applications. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 268-276, 2002.

[2] A. Aggarwal and J. S. Vitter. The Input/Output Complexity of Sorting and Related Problems. Communications of the ACM, 31(9):1116-1127, 1988.

[3] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-Oblivious Algorithms. In Proceedings of the IEEE Symposium on Foundations of Computer Science, pages 285-298, 1999.

[4] L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro. An Optimal Cache-Oblivious Priority Queue and Its Application to Graph Algorithms. SIAM Journal on Computing, 36(6):1672-1695, 2007.