Inverted Index Compression via Online Document Routing

Size: px

Start display at page:

Download "Inverted Index Compression via Online Document Routing"

Moris Marshall
6 years ago
Views:

1 Inverted Index Copression via Online Docuent Routing Gal Lavee, Ronny Lepel, Edo Liberty, and Oren Soekh Yahoo! Labs., Haifa, Israel Eail: {gallavee, rlepel, edo, ABSTRACT Modern search engines are expected to ake docuents searchable shortly after they appear on the ever changing Web. To satisfy this requireent, the Web is frequently crawled. Due to the sheer size of their indexes, search engines distribute the crawled docuents aong thousands of servers in a schee called local index-partitioning, such that each server indexes only several illion pages. To ensure docuents fro the sae host (e.g., are distributed uniforly over the servers, for load balancing purposes, rando routing of docuents to servers is coon. To expedite the tie docuents becoe searchable after being crawled, docuents ay be siply appended to the existing index partitions. However, indexing by erely appending docuents, results in larger index sizes since docuent reordering for index copactness is no longer perfored. This, in turn, degrades search query processing perforance which depends heavily on index sizes. A possible way to balance quick docuent indexing with efficient query processing, is to deploy online docuent routing strategies that are designed to reduce index sizes. This work considers the effects of several online docuent routing strategies on the aggregated partitioned index size. We show that there exists a tradeoff between the copression of a partitioned index and the distribution of docuents fro the sae host across the index partitions (i.e., host distribution). We suggest and evaluate several online routing strategies with regard to their copression, host distribution, and coplexity. In particular, we present a ter based routing algorith which is shown analytically to provide better copression results than the industry standard rando routing schee. In addition, our algorith deonstrates coparable copression perforance and host distribution while having uch better running tie coplexity than other docuent routing heuristics. Our findings are validated by experiental evaluation perfored on a large benchark collection of Web pages. Perission to ake digital or hard copies of all or part of this work for personal or classroo use is granted without fee provided that copies are not ade or distributed for profit or coercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific perission and/or a fee. WWW2011 Mar Apr. 1, Hyderabad, India Copyright 2011 ACM X-XXXXX-XX-X/XX/XX...$ Categories and Subject Descriptors H.3. [Inforation Systes]: Inforation Storage and Retrieval General Ters Algoriths Keywords Inverted Index, Index Copression, Index Partitioning, Docuent Routing, Online Algorith 1. INTRODUCTION Modern search engines serve enorous query loads over tens of billions of Web pages, and are expected to deliver fresh and relevant results to users in sub-second tie. Due to the sheer size of the Web, search engines partition their index over thousands of servers, where each server typically processes only several illions docuents. At query tie, the top results retrieved fro all servers are erged to produce the final results, which are returned to the user. The ain data structure used by search engines to efficiently ap queries to Web pages is the inverted index [2, 23]. The inverted index contains a postings list for each unique ter appearing in the corpus. The postings list of a ter consists of the list of docuent identifiers 1 (docids) containing it, which are typically sorted by increasing docid values. The list is represented by encoding the gaps (called dgaps) between successive docids. Another data structure in an inverted index is the lexicon, or dictionary, which is a lookup table that for each ter in the corpus, points to the postings list corresponding to it. It is well known that index size has a ajor effect on query processing throughput. In addition to the direct reduction in eory and disk space, ore copact indexes lead to savings in data transfers and increase the hit rate of eory caches [22, 21]. Therefore, a large body of work has focused on index copaction and copression ethods. The structure described above leaves two ain degrees of freedo for copression optiization. The first is the assignent of docids to docuents (also referred to as docuent reordering). The second is the actual encoding of the dgaps into bits (also referred to as dgap copression). The basic idea behind an effective docid assignent is to 1 Although ters frequencies and offsets within the docuent occupy a ajor portion of odern inverted indexes, we focus here on the docuents identifiers only.

2 assign siilar docuents with close docids, hence, potentially reducing the dgaps since siilar docuents contain any coon ters. Alas, the proble of finding the optial docid assignent is NP-hard and so various heuristics have been proposed in the literature. However, previous work has focused on copacting the inverted index of a single server, whereas large corpora are indexed over thousands of servers. A standard approach distributes the corpus in a rando fashion across the servers, which in particular routes (with high probability) siilar docuents to different servers. While this adversely affects docuent reordering schees, it creates a ore balanced query processing workload across the servers and eases counication bottlenecks as pages of popular Web sites are scattered about evenly across the servers. This work considers a real-tie partitioned indexing scenario where newly arriving content ust be indexed on soe server and ade available for search iediately upon arrival. This prohibits the application of tie consuing stateof-the-art docuent reordering schees. However, the partitioned setting allows a degree of freedo in the for of the policy that routes incoing docuents to the various partitions. We thus apply online docuent routing schees instead of the industry standard rando routing schee to reduce the aggregated size of the partitioned index. We further onitor and constrain the distribution of docuents belonging to the sae hosts across the partitions (referred to as host distribution hereafter). We tested our algoriths on the 25 illion Web page TREC.gov2 collection. Our ain contributions are the following: We introduce the idea of using online docuent routing for aggregated index size copression in locally partitioned indexes. As far as we are aware, this novel approach has not been studied previously. We propose and evaluate several online routing strategies with regard to their copression, host distribution, and coplexity. We show that there exists a tradeoff between the copression of an partitioned index and host distribution across the partitions. In particular, we present a ter-based routing algorith. Adhering to a siple docuent generating odel presented in [11], we show analytically that the algorith provides better copression results than the industry standard rando routing schee. In addition, our algorith deonstrates coparable copression perforance and host distribution while having uch better running tie coplexity than other docuent routing heuristics. The rest of this work is organized as follows. Section 2 surveys related work. Section 3 forally defines our odel, proble and etrics. Algoriths for docuent routing are presented in Section 4, with the ter-based algorith analyzed in Section 5. Our experiental results are reported in Section 6. We conclude in Section RELATED WORK 2.1 Index Partitioning The sheer size of the Web, the enorous nuber of search queries, and the required low latency, enforce a distributed inverted index architecture [2, 4, 5, 23]. To support these requireents, both distribution and replication principles are applied. Replication (also known as irroring) eans aking enough identical copies of the syste so that the required query load can be served, and is beyond the scope of this work. Distribution eans the way the inverted index is partitioned across a collection of nodes. The two ain strategies of partitioning an inverted index are local index-partitioning and global index-partitioning [4]. According to the local index-partitioning strategy (or docuent based partition), each node is responsible for a disjoint subset of docuents in the collection. In the global indexpartitioning strategy (or ter based partition), ters are divided into subsets, such that each node stores postings lists only for a subset of the ters in the collection. Due to various theoretical and practical considerations, odern large-scale search engines follow the local inverted index-partitioning strategy and distribute the docuents across the nodes [4, 5]. Docuents can be assigned to nodes using different policies. For exaple, the hash distribution policy allocates docuents to nodes in a rando fashion by hashing the docuents URLs to yield a node identifier [2]. Various other distribution policies such as round-robin distribution are also possible [14]. 2.2 Inverted Index Copression Herewefocusonasingle nodeofalocal index-partitioning architecture. We consider a siplified odel of an inverted index in which the postings list of ter t holds the list of docids containing t, sorted by increasing value. Denote the list by d t 1,d t 2,...,d t n t, where d t i denotes the docid of the i th docuent containing t out of n t such docuents. The list is actually represented by encoding the first docid and the sequence of gaps (dgaps) between successive identifiers thereafter, i.e. d t 1,d t 2 d t 1,...,d t n t d t n t 1. The two degrees of freedo available for copressing the size of the lists are (a) docid assignent; and (b) dgap encoding. As we focus on the forer, we start by briefly reviewing the latter. dgap encoding techniques ai to copress a sequence of integers. The literature contains schees that encode each gap individually, e.g. Gaa, Delta, Golob-Rice [20] and Zeta [9] encodings, as well as schees that encode certain blocks of gaps, e.g. PForDelta [22, 13] and Siple9 [1]. Additionally, the Interpolative Encoding schee [15] is applied directly on the docids rather than their dgaps, and works well for clustered ter occurrences. In general, the docid assignent proble seeks a perutation of the the docuents that iniizes the invertedindex size under a specific dgap encoding schee. As shown in [6], this proble is NP-hard and various heuristics are used to provide approxiations. The size of an inverted-index is a function of the dgaps. All effective dgap encoding techniques represent saller nubers with fewer bits (about logarithic in the nuber value). Hence, assigning docids in a way which results in saller dgaps is the key for better copression. This principle drives ost works dealing with docid assignent, and they strive to assign close docids to siilar docuents, i.e. docuents that share any ters. Technically, ost works define a graph G = (D,E), where D is the set of docuents, and E is a set of edges representing the siilarity between two docuents d i,d j D. One line of work started by [16] traverses the graph G to find

3 the axial weight path connecting all the nodes, assigning docids accordingly. This is equivalent to the NP-Hard traveling salesan proble (TSP). Several TSP approxiations were applied for docid assignent in [16, 7, 12]. In [16], a siple greedy nearest neighbors (GNN) approach is used to add one edge at a tie. To reduce the coputational load, [7] uses singular value decoposition (SVD) to reduce the diensionality of the ter-docuent atrix. To scale up TSP-based schees [12] proposes a new fraework based on coputing TSP on a reduced sparse graph obtained through locality sensitive hashing. In yet another line of work, the nodes of G are clustered according to their siilarity and close docids are assigned to the nodes (docuents) within each cluster. A top-down approach is used in [8], where the whole collection is recursively split into sub-collections, inserting siilar nodes into the sae sub-collections. Then, the sub-collections are erged into an ordered group of nodes. A botto-up approach called k-scan was proposed in [18]. A hybrid ethod which cobines k-scan clustering and TSP for intra-cluster docid assignent is proposed by [6]. A different approach, which is both highly scalable and highly effective, was proposed for Web collections in [17]. It assigns docids according to the lexicographically sorted order of the docuents URLs, utilizing the fact that URL siilarity is a strong indicator of docuent siilarity. The schee was found to perfor rearkably well on various Web collections indexed as a whole. In all the aforeentioned works, a heuristic of docid assignent or an encoding of dgaps were epirically tested against several collections and copared to the results of other works. In contrast, [11] analyzes the copressibility of a collection whose docuents are generated by a siple probabilistic odel in which ters are chosen independently fro a given distribution. 2.3 Online Probles Online algoriths propose solutions to probles where the input is not known in advance and ust be processed as it arrives in a sequential online fashion [10]. At first glance, the docuent routing proble tackled in this paper resebles two classical online probles - the Metrical Task Syste (MTS) and the Load Balancing (a.k.a. Job Scheduling) probles. MTS is a general fraework for online probles that generalizes several known online probles such as the paging proble and the k-servers proble [10]. An MTS is coposed of two coponents. The first is a etric space M =< S,d >, where S is a set of points in the space and d : S S R is a etric over S. The second coponent of an MTS is a set of tasks R. Each task r R ay be seen as a vector < r 1,r 2,...,r S > of costs where each entry, r i, is the cost of processing the request in state i S. Since the requests ust be processed in the order which they arrive, the cost of an MTS algorith is then the cost of oving fro one state to another plus the cost of processing each request in the current state, given by the forula: n n ALG(σ) = d(alg(i 1),ALG(i))+ r i(alg(i)), i=1 Here σ denotes a particular request sequence of size n and ALG(i) denotes the state of the algoriths iediately before processing request i. i=1 The Load Balancing proble seeks to assign arriving jobs requests to possible achines(one achine per job). Each of the jobs is associated with a cost, and a load balancing algorith s objective is to balance the cuulative cost of jobs assigned to each achine. We ay think of a job j as a vector r j =< r j,1,r j,2,...,r j, >, where r j,i is the cost of job j on achine i [3]. The ajor difference between the docuent routing proble and either MTS or Load Balancing stes fro the fact that the cost associated with appending docuent d to index partition j depends on the sequence of docuents previously routed to j. This doesn t fit at all within the Load Balancing fraework, and causes a factorial explosion of states in MTS that renders its coputation infeasible and its copetitiveness non-existent. Therefore, we cannot transfer the analytical results of MTS or Load Balancing to our setting. Nevertheless, load balancing approaches and intuitions ay still be applied and assessed epirically (see Section 4). 3. FORMAL PROBLEM DEFINITION The online proble setting we tackle is the following. The syste consists of a docuent dispatcher and (initially epty) index partitions. Docuents, odeled as bags of words, arrive in soe arbitrary sequence to the dispatcher (e.g. as the result of a large scale distributed crawl policy). Each docuent belongs to soe Web-host. The dispatcher ust iediately route each docuent to one of the partitions, whereby that partition siply appends the docuent to its existing inverted index. Each index consists of (1) a dictionary, and (2) inverted lists per ter. The inverted lists encode docuent IDs only, i.e. no intradocuent offsets or any other payload. For exaple, consider a docuent d containing ters t 1,...,t k that is appended as the j th docuent of partition l. d will be assigned docid j on l. Any of t 1,...,t k that has not appeared in any of the previous j 1 docuents routed to l will be added to l s dictionary, and docid j will then be appended - by dgap encoding - to the k postings lists corresponding to the ters. The dispatcher attepts to iniize the overall index size, i.e. the su of index sizes on all partitions, while aintaining soe balance between the nuber of pages fro each host residing on each partition. Specifically, our algoriths will typically attept to iniize the overall index size subject to soe constraint on the skewness allowed in the distribution of sae-host pages across the partitions. Since docuents are routed and indexed one at a tie (no buffering is allowed, either at the dispatcher or on the partitions), only single dgap encoding schees apply. Specifically, our experients use Delta encoding. Block-based dgap encoding schees such as PForDelta and Siple9 are not applicable. The following subsections forally define our etrics for index size and host distribution across partitions. 3.1 The Bits per Posting Metric Let a corpus with N overall postings be indexed across partitions, and let T i denote the set of distinct ters on the i th partition. Let t be a ter appearing in n t docuents in soe partition, and denote those docids by d t 1 < d t 2 <... < d t n t. Assue all postings lists are encoded using Delta encoding. Then, the overall size of all postings lists on the

4 i th partition, P i, is given by P i = ( ) n t δ(d t 1)+ δ(d t j d t j 1), t T i j=2 where δ(k) is the length (in bits) of the Delta encoding of the integer k: δ(k) = 1+ log 2 k +2 log 2 (1+ log 2 k ). (1) The overall size of the postings across the partitions, P, is defined as P = P i. i=1 We further define the overhead OH of a partitioned index as the space taken by the dictionaries of the individual partitions. Each entry of the i th dictionary is a pointer into the sequence of postings lists on the i th server, and hence requires log 2 P i bits. Overall, OH = T i log 2 P i. i=1 Finally, the bits per posting etric coes in two flavors, with and without overhead. Those are siply P+OH and N P, respectively. N 3.2 Host-Distribution Measure We evaluate the distribution of pages belonging to N h hosts across partitions using the Chi squared (χ 2 ) distribution [19]. Let N h,i denote the nuber of docuents fro host h on partition i, and denote by N i the total nuber of docuents on partition i. Also, let p h denote the proportion of the docuents belonging to host h of the total nuber of docuents, N. The following host-balancing value B is distributed χ 2 with ( 1)(N h 1) degrees of freedo: B = N h (N h,i N i p h ) 2 i=1 h=1 N i p h Essentially, this expression checks the null hypothesis that the host distribution resulted fro a rando routing of docuents to the partitions. Intuitively, the value corresponding to a docuent routing schee that approxiately preserves the overall host distribution on each of the servers will be low. When the nuber of degrees of freedo is large, the exact χ 2 distribution is difficult to copute. Hence we use a standard Noral approxiation, where the ean is the nuber of degrees of freedo and the variance is twice this nuber. We will thus report the value B ( 1)(N h 1) 2( 1)(Nh 1), which is approxiately distributed N(0, 1). 4. ALGORITHMS This section described several Algoriths for online docuent routing in the fraework defined in Section 3. Rando docuent routing. The rando algorith, coonly used in search engines, siply routes each incoing docuent to a partition chosen uniforly and independently at rando. Greedy docuent routing. The greedy algorith routes each docuent d to the partition whose inverted lists will grow by the least aount when appending d to its existing index. Let d contain the set of ters T d and denote by φ(j,t) the dgap created in the postings list of ter t T d on partition j,j {1,2,...} upon appending d to the current state of partition j. Note that φ(j,t) ranges fro 1 to the nuber of docuents already routed to partition j (in case ter t has not yet appeared there). The change in the size of the inverted lists of partition j upon appending docuent d, (j,d) is thus given by: (j,d) = t T d δ(φ(j,t)), (2) where δ( ) denotes the Delta dgap encoding function (1). Load-balancing docuent routing. The load balancing algorith is inspired by the relationship of our proble to the online load balancing proble of peranent jobs on unrelated achines. In this algorith, which is based on [3], we start with an initial estiate L of the axiu load (i.e., the axiu aggregated index size). We then calculate the noralized load l j,d = l j,d /L and noralized cost of adding the next docuent d, C(j,d) = (j,d)/l, for each partition j = 1... Here (j,d) denotes the cost of adding docuent d to partitionj (seeequation(2)), andl j,d denotesthesizeofpartition j assuing docuent d will be routed to it. Next, we copute the following cost function for all partitions: a l j,d +C(j,d) a l j,d, where a = , and route the docuent to the partition of lowest cost. As ore docuents are added to the index, it is possible that the initial guess of the axiu load, L, will be exceeded. In this case we siply update our axiu load estiate to 2L. More precisely, we check if l j,d + (j,d) > βl, where β is a paraeter (see [3]). For the unrelated achines load balancing proble, this algorith proises an O(log ) copetitive solution. Of course, as our docuent routing algorith is different, this analytical guarantee does not hold. Ter based docuent routing. The ter-based algorith associates a disjoint set of representing ters with every partition. Each new docuent is routed to the partition with which it has the ost overlapping ters. Section 5 proves that this algorith achieves better copression than rando routing, the current de facto standard. The algorith requires a strategy for associating the ters in such a way that the probability of a particular ter to represent each partition is about equal, as is the total probability ass of ters assigned to each partition. Due to the Power-law nature of ter popularities, rando association of ters to partitions creates partitions with ibalanced

5 probability ass of their assigned ters. We therefore used a heuristic approach which first sorts the ters in the corpus according to frequency 2, and then associates the in a zig zag pattern, i.e. the highest frequency ters are associated with partitions 1.. in ascending order; the next highest frequency ters are associated with partitions...1 in descending order; and so on. After this initial phase, we use iterative swapping of the highest frequency ter fro the partition with the axiu cuulative frequency (over all ters associated with it) with the lowest frequency ter fro the partition with the iniu total frequency. This process is allowed to iterate until convergence of the difference between the axial cuulative frequency and the inial cuulative frequency. Note that in practice there s no need to associate all ters in the corpus aong the servers. Our experients will leave ost ters unassigned, effectively ignoring the when aking routing decisions. 4.1 Adding Host-Balancing Constraints We extend both the Greedy and Ter-based routing algoriths by liiting the nuber of docuents fro each host that are assigned to any single partition 3. The resulting routing schees are called Constrained Greedy and Constrained Ter-based, respectively. For each host h of (estiated) size n h, we bound the nuber of its docuents which ay be routed to any single partition. Then, upon arrival of a docuent of host h, the constrained algorith (either Greedy or Ter-based) siply considers only partitions on which the nuber of alreadyrouted docuents belonging to host h is below the bound. We apply two flavors of bounds, denoted b 1(h) and b 2(h), to each host: { b 1(h) = ax α n h { nh b 2(h) = ax +α },3 nh },3 The first bound flavor requires a slack value of α > 1, to allow each partition to hold soe ultiplicative factor of its fair share of docuents fro host h. Our experients use α {1.05,1.1,1.2}. The second bound defines the slack in ters of the (approxiate) standard deviation of the rando routing of h s docuents across the partitions. Our experients use α {1,2,3}. With either flavor, we always allow partitions to hold 3 docuents of each host. This relaxes our constraints over hosts with very few pages relative to the nuber of partitions. 4.2 Ipleentation Notes The Rando routing algorith is obviously trivial to ipleent, with each routing decision being ade by the dispatcher in O(1). All other algoriths copute soe value for routing docuent d to each of the partitions, where the value depends on the set of ters T d contained in d. The Greedy and Load Balancing algoriths each copute a cost, which requires exaining all T d ters in the docuent to be appended, for each of the partitions. These coputations can be done in a centralized fashion in the 2 We assue that approxiate ter statistics are known fro previous crawls. 3 We assue here that the sizes of the hosts are approxiately known due to previous crawls. (3) (4) dispatcher, or in a distributed fashion with an extra round of counication between the dispatcher and each of the partitions. For the centralized coputation, the dispatcher ust keep track of the nuber of docuents routed to each partition, as well as the serial nuber of the last docuent routed to each partition for each ter in the corpus. Effectively, this eans keeping a global dictionary with entries per ter. Let T denote the nuber of distinct ters in the corpus. The centralized coputation requires O(T ) eory at the dispatcher and O( T d ) tie upon dispatching d. In the distributed coputation, the dispatcher sends d to each partition, which coputes its local cost and sends it back to the dispatcher, who then sends an indexing request to the cheapest partition. Each local cost coputation requires O( T d ) tie and requires O(T ) extra eory per partition, as the last docuent ID per ter needs to be kept in the local dictionary 4. Turning to the Ter-based algorith, the centralized coputation requires the dispatcher to keep track of the ters associated with each partition. This requires O(T ) space. The routing decision can then be achieved in O( T d + ) tie while aintaining additional counters, for overall space coplexity of O(T + ). Therefore, coparing centralized ipleentations, this algorith requires about ties less tie and space than the Greedy and Load Balancing algoriths. This becoes significant as grows. In the distributed coputation, each partition can track its own set of associated ters at the cost of O(T ) space across all partitions, or roughly O(T /)per partition. In addition to the above resources, the Constrained versions of the Greedy and Ter-based algoriths ust also store the N h host size estiates (or bounds), as well as the nuber of pages per host that have been routed to each partition. This entails soe extra O(N h ) eory. This eory can be kept at the dispatcher, or (in case of a distributed ipleentation) each partition ay aintain its own counters for the nuber of pages per host it holds. 5. TERM BASED DOCUMENT ROUTING ALGORITHM - ANALYSIS This section shows that the Ter-based routing algorith results in a saller aggregated index size than that of the industry standard rando routing schee under the siplified docuent generation odel of [11]. According to the odel of [11], the N docuents of the corpus ay include up to L unique ters. Docuents are independently generated as follows: each of the L ters is chosen independently with replaceent, according to soe frequency rank distribution {p i} (e.g., power-law or double- Pareto distributions) defined over the vocabulary T. The resulting corpus can be described by a T N boolean terdoc atrix that has a 1 in entry (i,j) if ter i is included in docuent j. Let S i denote the r.v. corresponding to the size of the postings list that encodes the ith row of the ter-doc atrix (i.e. ter t i), when applying soe single dgap encoding (e.g., Delta encoding). In addition, let P i Pr(t i d) = 1 (1 p i) L denote the probability that ter t i is included in docuent d. Moreover, for 1 g N, let X g,i be the 4 While this ID is already encoded in the inverted index, decoding it requires decoding a long sequence of dgaps.

6 r.v. denoting the nuber of dgaps of length g in row i. According to [11, Th. 2] where S(N,P i) = E[S i] = N δ(g)e[x g,i], (5) g=1 E[X g,i] = P i(1 P i) g 1 +P 2 i (1 P i) g 1 (n g), (6) and δ(g) is the nuber of bits needed to encode a dgap of length g. In particular, [11, Sec. 7] shows that S(N,P i) achieves the entropy bound for every ter t i that satisfies 5 w ( ) logn N = Pi = o(1). Hence, for these ters the expected size can be approxiated by 6 S(N,P i) N H(P i). Let ˆT denote the subset of ters that satisfy the conditions of [11, Sec. 7] and randoly divide it into disjoint sets of size ˆT / (a set for each partition). We represent the sets by vectors {V l } l=1 with V l(i) = 1 if t i is included in the lth set and V l (i) = 0 otherwise. A docuent d is assigned to the kth partition if it axiizes the dot product between the docuent ter vector and the corresponding partition-defining ter vector, i.e. k = argax{v l d}. (7) l We note that since the ters of ˆT are randoly and evenly dividedbetweenthepartitions, thedotproducts{v l d} l=1 are identically distributed rando variables. Proposition 1 For the above docuent generating odel, the Ter-based docuent routing algorith achieves saller expected aggregated index size than that of the Rando docuent routing schee. Proof. We start by exaining how the probability of a ter being included in a docuent changes given that the docuent was assigned to a certain partition according to the Ter-based routing algorith. Focusing on the lth partition, P i,l Pr(t i d d l) (a) = Pr(d l ti d) Pr(t i d) Pr(d l) = β i,l P i, (8) where (a) is achieved by applying Bayes theore; P i Pr(t i d), and β i,l Pr(d l t i d)/pr(d l). We note that since the partition representing ters were divided randoly and equally aong the partitions (hence, the decision dot products are identically distributed r.v. s), then the probability that a docuent d is routed to the lth partition P r(d l) 1/. Exaining the definition of β i,l we observe that three cases should be considered: (1) the ter t i is a eber of the set which represents the lth partition, i.e., V l (i) = 1 - in this case it is easily verified that β i,l > 1; (2) the ter t i is a eber of the set which represents soe other partition l l - in this case it is easily verified that β i,l < 1 ; and (3) the ter t i is not representing any partition t i / ˆT - in this case β i,l = 1. Hence, the probability that a ter t i appears in a docuent which was 5 Practically it eans ters with not too high and not too low probability. 6 Here H(q) denotes the binary entropy function H(q) = qlog 2 q (1 q)log 2 (1 q). routed to partition l increases in case t i represents the lth partition, decreases in case t i represents soe other partition, and reains unchanged otherwise. Moreover, since every docuent is always routed to one of the servers, we have that Pr(d l)p i,l = P i 1 P i,l. (9) l=1 l=1. Intuitively, this states that the overall probability of a ter in the corpus reains unchanged regardless of the partitioning. We coplete the proof by showing that the expected aggregated size difference between the Rando routing and the Ter-based routing setups is positive (a) (b) = (S(N/,P i) S(N/,P i,l )) l=1 t i T (S(N/,P i) S(N/,P i,l )) l=1 t i ˆT (c) t i ˆT l=1 N (H(Pi) H(P i,l)) = N ( ) H(P 1 i) H(P i,l ) t i ˆT l=1 (d) > N ( ( )) H(P i) H P i,l = 0, t i ˆT 1 l=1 (10) where (a) holds assuing each partition for both setups includes approxiately N/ docuents due to the strong law of large nubers and since partitions are hoogeneous; (b) is achieved since we show that P i,l = P i for all ters t i / ˆT; (c) is achieved since S(N,P i) N H(P i) for t i ˆT; (d) holds since H( ) is a strictly concave function; and the last equality is due to (9). To corroborate the results of the last Proposition, we generated a corpus using the docuent generation odel of [11], selected the partition representing ter sets, applied ter based docuent routing, and appended the routed docuents to their partitions using Delta encoding. The resulting aggregated bits per posting were copared to those calculated for an identical syste using the Rando docuent routing schee. The coparison shows a consistent although inor reduction in the aggregated index size when Ter-based routing is used. This sall iproveent, while obeying Proposition 1, is far fro the 20% iproveent achieved by the Ter-based routing when applied to the.gov2 corpus (see Section 6). We conjecture that the perforance gap stes fro the siplified nature of the docuent generation odel used to produce the synthetic corpus, as assuing independent docuents of independent ters does not capture the inherent siilarity between real docuents belonging to the sae host (the properties which ake URL docuent ordering so successful [17]). Hence, the probability for.gov2 docuents of the sae host ending up in the sae partition is uch higher than that of two arbitrary docuents generated by the odel of [11]. Accordingly, as indicated by our experients, the difference (10) is expected to be uch larger for real life docuents.

7 6. EXPERIMENTAL RESULTS 6.1 Experiental Setup We use the TREC.gov2 Web corpus, a collection of about 25.2 illion pages crawled fro the gov doain, for the experients 7. After parsing, tokenizing, and reoving all epty docuents, we are left with 24.9 illion docuents across 17,000 hosts, 74.5 illion distinct ters, and 5,705.2 illion postings (distinct ter appearances in docuents). Our experients apply a rando order of docuents arrival and easure the resulting aggregated bits per posting etric for every algorith (see section 3.1). As entioned earlier, we use the Delta dgap encoding schee throughout our experients. 6.2 Index Copression Results Figure 1 plots the Bits per Posting easure achieved by the different online algoriths as functions of the nuber of servers. The left plot focuses on the inverted lists alone (without the overhead incurred by the dictionaries), and shows a downward trend in the nuber of bits per posting needed to encode the partitioned index as the nuber of servers increases. This trend was observed in all algoriths and is ainly an artifact of the additional degrees of freedo present when there are ore partitions to choose fro when routing. However, rando routing also enjoys the increasing nuber of achines - this can be understood intuitively by considering the extree case where there are as any servers as docuents and each server uses only 1 bit to encode each posting. Turning to the right side of the figure which takes into account the dictionary-induced overhead, there is an upward trend in the aount of bits per posting needed to encode the partitioned index. This is caused by overlapping ter sets across the various partitions - overlapping ters require ultiple dictionary entries, up to entries in the worst (and realistic for any words) case. The relative perforance of the various algoriths is the sae in both plots, and so the following discussion is applicable to both. At the two extrees, Rando routing upper bounds the copression results, whereas Greedy routing trups all algoriths and saves about a thirdof the space as copared with Rando. It anages to decrease the overall index size (i.e. including overhead) for ost of the growth in the nuber of partitions. The reason for this is that the greedy algorith anages to assign docuents with siilar ter sets to the sae server, keeping the overlap in ter sets across the partitions relatively sall. As we further discuss in Section 6.3, Greedy and Rando routing also perfor at the two extrees in ters of host distribution, but the Rando perfors best (quite intuitively) while Greedy perfors worst, as sae-host docuents tend to share any ters and will thus tend to be routed by Greedy to the sae sall set of partitions, rather than uniforly across all servers. The Load Balancing algorith shows result very siilar to those of the rando algorith, for a low nuber of partitions ( < 200). Around 200 servers, its curve starts to iprove on that of Rando s, achieving fewer bits per posting. For = 1000, Load Balancing beats out the constrained 7 We note that since it was crawled alost to exhaustion,.gov2 is very dense and includes alost all relevant pages [12]. versions of the Greedy and Ter-based routing schees, but still loses to the unconstrained versions. The Ter-based algorith whose curve is plotted in Figure 1 is based on the association of about 8 illion ters to the various partitions. The ters are those whose frequencies in the corpus range fro 5 to It achieves about 20% iproveent in index size over the Rando algorith at = The constrained versions of both the Greedy and Terbased schees attept to decrease index size while not deviating too uch fro a balanced allocation of each host s pages across the partitions. Qualitatively, they achieve about the average copression of Rando and their non-constrained version. Note that the curves shown in Figure 1 use found flavor b 1(h) with α = 1.2, as defined in Section 4.1. Section 6.3 will further analyze the constrained versions of the algoriths. Other results (not presented here) reveals that the Terbased routing is not so sensitive with regards to the nuber of ters associated with the various partitions. Experients with 6 and 4 illions ters of the unconstrained and constrained Ter-based routing yields siilar index sizes as those presented here for the 8 illion ters version. 6.3 Host Distribution Results This section further studies the trade-off between the quality of copression and the balanced routing of sae-host docuents across the different partitions. The conclusion fro the data points given below is one of no free lunch. Most of the redundancy in the docuent strea that is exploited by the better-copressing routing schees stes fro the ability to route sae-host docuents to a sall set of partitions, rather than scattering the evenly across the partitions. This is yet another indication of why reordering Web collections by URL sorting[17] is so successful. Table 1 shows the noralized Chi-squared host distribution values, as defined in Section 3.2, for the routing schees presented in Figure 1, for several values of. The schees are listed in the order of their copression effectiveness, fro worst (Rando, left) to best (Greedy, right). Observe the following: Rando routing, as expected, achieves values that are typical of a Noral(0, 1) rando variable. The tradeoff between copression (Figure 1) and hostbalancing (Table 1) is near-perfect. The relative order of the schees in ters of the two easures is alost entirely reversed. The sole exception is that the Constrained Greedy schee outperfors the Constrained Ter-based schee in both easures 8. Whenever the Load Balancing schee achieves randolevel host-balancing ( = 10, 40, 100), it is basically siilar to Rando routing in ters of copression. Other than those three points, all other schees distribute sae-host pages with an ibalance that is statistically ipossible to achieve with rando routing. The constrained versions of the Greedy and Terbased schees achieve a host-balancing value that is 8 The difference between Rando and Load Balancing for = 10,40 is just a rando fluctuation and does not contradict the above stateent.

8 13 (a) Agregatted index size without overhead 13 (b) Agregatted index size with overhead Bits per posting Rando Load balancing Constrained ter based (8M ters) Constrained greedy Ter based (8M ters) Greedy Nuber of partitions Bits per posting Rando Load balancing Constrained ter based (8M ters) Constrained greedy Ter based (8M ters) Greedy Nuber of partitions Figure 1: Bits per posting as function of the nuber of partitions for different docuent routing strategies and Delta encoding applied to.gov2 corpus, (a) without and (b) with overhead. Schee Rando Load Constrained Constrained Ter-based Greedy Balancing Ter-based (8M) Greedy (8M) Table 1: Noralized host-distribution values per schee, as functions of the nuber of partitions orders of agnitude lower (i.e. ore balanced) than the corresponding non-constrained schees, while sacrificing about half of the copression iproveent over Rando routing. While they still distribute saehost pages in a anner that is statistically very far fro rando, they are also very far fro the skewed distribution of the non-constrained versions. Next, Table 2 checks the sensitivity of both easures, Bits per Posting and host-balancing, to the bound flavor and value of α as defined in Section 4.1, with the Constrained Ter-based schee over = 400 partitions. The sensitivity is quite inor 9. Still, it preserves the trend of the noralized host-distribution value being larger when the Bits per Posting value is lower. 7. CONCLUSIONS This work introduced the proble of reducing a partitioned index size in a low-latency indexing setting via docuent routing. In such settings, incoing docuents ust be ade searchable quickly, and tie-consuing docuent 9 This was true also for the Constrained Greedy schee, whose results are not shown Constraint type Bits per Noralized Posting host-distribution ax { n h +3 } n h, ax { n h +2 n h }, ax { } 1.2 n h, ax { n h + n h }, ax { } 1.1 n h, ax { } 1.05 n h, Table 2: Noralized host-distribution and bits per postings (without overhead) values, with the Constrained Ter-based schee (8 illion ters) over = 400 partitions, as functions of the constraint paraeter α. reordering algoriths cannot be applied. The industry standard routes docuents randoly to the partitions, evenly distributing sae-host pages across the partitions, which is advantageous for query-tie perforance. However, this results in indexes which do not copress well. We frae docuent routing as an online proble, and present docuent routing schees that result in indexes that are up to

9 30% ore copact. These algoriths can be run in both centralized and distributed fashions. We prove that one such schee Ter-based routing yields better copression than Rando routing over a corpus odel appearing in the literature. Its advantages are accentuated on real-life corpora, where sae-host docuents exhibit high siilarity between the. Ter-based routing is also lightweight in ters of tie and space coplexity as copared with the other non-rando routing schees. For each routing schee, we explored how sae-host pages are spread across the partitions, and found a clear tradeoff between copression and a balanced host distribution. This holds also when the routing algoriths are constrained to not exceed certain ibalance criteria. Essentially, this shows that uch of the redundancy that routing algoriths exploit in Web collections stes fro routing each host s pages unevenly, to a sall set of partitions. We believe that low-latency partitioned index systes offer additional trade-offs between issues that are well understood in batch indexing settings, and plan to pursue such trade-offs in future work. In addition, we plan to investigate weighted versions of the Ter-based routing schee. These weights, which will be functions of the ters frequencies, will refine the current intersection-based score of a docuent with respect to each partition. 8. REFERENCES [1] V. Anh and A. Moffat. Inverted index copression using word-aligned binary codes. In in Proc. Inforation Retrieval, pages 8(1): , Jan [2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the web. ACM Trans. Interet Technol., 1(1):2 43, [3] J. Aspnes, Y. Azar, A. Fiat, S. Plotkin, and O. Waarts. On-line routing of virtual circuits with applications to load balancing and achine scheduling. J. ACM, 44(3): , [4] C. Badue, R. Baeza-yates, B. Ribeiro-neto, and N. Ziviani. Distributed query processing using partitioned inverted files. In In Proc. of the 9th String Processing and Inforation Retrieval Syposiu (SPIRE 01), pages IEEE CS Press, [5] L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: The google cluster architecture. IEEE Micro, 23(2):22 28, April [6] R. Blanco. Index Copression for Inforation Retrieval Systes. PhD thesis, University of A Coruña, [7] R. Blanco and R. Barreiro. Tsp and cluster-based solutions to the reassignent of docuent identifiers. Journal of Inforation Retrieval (IR), 9(4), Sep [8] D. Blandford and G. Blelloch. Index copression through docuent reordering. Data Copression Conference, 0:0342, [9] P. Boldi and S. Vigna. Codes for the world wide web. Internet Matheatics, 2(4): , [10] A. Borodin and R. El-Yaniv. Online coputation and copetitive analysis. Cabridge University Press, New York, NY, USA, [11] F. Chierichetti, R. Kuar, and P. Raghavan. Copressed web indexes. In in Proc. WWW 2009, pages , North-Carolina, USA, Apr [12] S. Ding, J. Attenberg, and T. Suel. Scalable techniques for docuent identifier assignent. In in Proc. WWW 2010, pages , North-Carolina, USA, Apr [13] S. Háan. Super-scalar database copression between ra and cpu-cache. Master s thesis, Centru voor Wiskunde en Inforatica (CWI), [14] J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. Webbase: A repository of web pages. In in Proc. of the Ninth International World-Wide Web Conference (WWW 00), pages , May [15] A. Mofat and L. Stuiver. Binary interpolative coding for effective index copression. Inf. Retr., 3(1):25 47, [16] W. Shieh, T. Chen, J. Shann, and C. Chung. Inverted file copression through docuent identifier reassignent. Inf. Processing and Manageent, 39(1):117ï 131, [17] F. Silvestri. Sorting out the docuent identifier assignent proble. In In Proc. of 29th European Conference on IR Research (ECIR 07), pages , [18] F. Silvestri, R. Perego, and S. Orlando. Assigning docuent identifiers to enhance copressibility of web search engines indexes. In in Proc. of the 2004 ACM Syposiu on Applied Coputing (SAC 04), pages ACM Press, [19] G. Snedecor and W. Cochran. Statistical Methods. Iowa State University Press, eighth edition edition, [20] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes. Morgan Kaufann, [21] H. Yan, S. Ding, and T. Suel. Inverted index copression and query processing with optiized docuent ordering. In in Proc. WWW 2009, Madrid, Spain, Apr [22] J. Zhang, X. Long, and T. Suel. Perforance of copressed inverted list caching in search engines. In in Proc. WWW 2008, pages , Beijing, China, [23] M. Zobel and A. Moffat. Inverted files for text search engines. ACM Coputing Surveys, 38(2), Jul.

Clustering. Cluster Analysis of Microarray Data. Microarray Data for Clustering. Data for Clustering

Clustering. Cluster Analysis of Microarray Data. Microarray Data for Clustering. Data for Clustering Clustering Cluster Analysis of Microarray Data 4/3/009 Copyright 009 Dan Nettleton Group obects that are siilar to one another together in a cluster. Separate obects that are dissiilar fro each other into