A Tree-based Inverted File for Fast Ranked-Document Retrieval
Wann-Yun Shieh
Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.

Tien-Fu Chen
Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan 621, R.O.C.

Chung-Ping Chung
Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.

Abstract

Inverted files are widely used to index documents in large-scale information retrieval systems. An inverted file consists of posting lists, which can be stored in either ascending order of document identifier or descending order of document weight. For an identifier-ascending-order posting list, retrieving ranked documents necessitates traversal of all postings, whereas for a weight-descending-order posting list, performing Boolean queries involves very complex processing. In this paper, we transform a posting list into a tree-based structure, called the n-key-heap posting tree, to speed up ranked-document retrieval for Boolean queries. In this structure, the orders of document identifiers and document weights are preserved simultaneously. To preserve the identifier order, the edge pointers are designed to maintain numerical order in the posting tree. To preserve the weight order, greater-weight postings are stored in higher tree nodes by the heap property. We model these criteria as a tree-construction problem and propose an efficient algorithm to construct an optimal posting tree with minimal access time.

Keywords: information retrieval, inverted file, Boolean query, ranked document, posting tree

1. Introduction

An indexing structure used by many information retrieval (IR) systems is the inverted file [1].
In an inverted file, for each distinct word (also known as a "term") t in the text collection, there is a corresponding list (called the posting list) of the form $\langle t; f_t; (P_1, W_{t,1}), \ldots, (P_{f_t}, W_{t,f_t}) \rangle$, where the frequency $f_t$ indicates the total number of documents in which t appears, the identifier $P_i$ (also known as a "posting") indicates a document that contains t, and the weight $W_{t,i}$ indicates the weight of $P_i$ associated with t. When a user sends a request containing some query terms to an IR system, the system searches for these query terms in the inverted file to see which documents satisfy the request, and returns ranked document identifiers to the user. Zobel et al. [2] showed that in terms of query time, space usage, and functionality, inverted files perform better than other indexing structures.

1.1 Current methods and problems

Postings can be arranged in a posting list in either identifier-sorted order or weight-sorted order. Both orderings, however, require complex processing to retrieve ranked documents for Boolean queries. For an identifier-sorted posting list, retrieving ranked documents requires reading all related posting lists from storage in full, regardless of how many terms the query contains or how many ranked documents the user requests. For a weight-sorted posting list, the drawback is the extra processing cost of comparing two posting lists that are not in numerical identifier order [3]. These problems become more serious as the amount of information on the Internet grows explosively. If an IR system expands its collection, the lengths of most posting lists in the inverted file will increase. A user may then wait longer to retrieve ranked documents with either the identifier-sorted or the weight-sorted posting list. To the best of our knowledge, few studies have proposed posting structures that reduce this processing cost when retrieving ranked documents for Boolean queries.
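The trade-off between the two orderings can be made concrete with a minimal sketch. The posting data, the names `term_a` and `term_b`, and the helper functions below are hypothetical illustrations, not taken from the paper: ranked retrieval on an identifier-sorted list has to inspect every posting, while a Boolean AND by linear merge only works when both lists are in ascending identifier order.

```python
from heapq import nlargest

# Hypothetical postings for two terms: (document identifier, weight).
term_a = [(3, 0.10), (8, 0.40), (12, 0.05), (20, 0.30), (31, 0.15)]
term_b = [(8, 0.22), (9, 0.35), (31, 0.50)]

def top_r_identifier_sorted(postings, r):
    """Ranked retrieval on an identifier-sorted list inspects every posting."""
    return nlargest(r, postings, key=lambda p: p[1])

def top_r_weight_sorted(postings, r):
    """On a weight-sorted list the first r postings are the answer."""
    return postings[:r]

def and_identifier_sorted(a, b):
    """Boolean AND by a linear merge; relies on ascending identifier order."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            common.append(a[i][0])
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return common

weight_sorted_a = sorted(term_a, key=lambda p: p[1], reverse=True)
print(top_r_identifier_sorted(term_a, 2))        # scans all five postings
print(top_r_weight_sorted(weight_sorted_a, 2))   # reads only two postings
print(and_identifier_sorted(term_a, term_b))     # merge needs identifier order
```

Neither ordering serves both operations cheaply, which is the gap the rest of the paper addresses.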
1.2 Research goal

We propose a tree-based structure, called the n-key-heap posting tree, that preserves the orders of document identifiers and document weights simultaneously for fast ranked-document retrieval. In an n-key-heap posting tree, the root node contains the n most important (that is, highest within-document weight) postings, and the n+1 children of the root node recursively contain the n+1 segments of the posting list created by splitting at these n postings. To preserve the identifier order, the postings in each node are arranged in identifier-ascending order, and the identifier order among tree nodes is maintained by edge pointers. To preserve the weight order, greater-weight postings are stored in higher tree nodes by the heap property. These criteria can be modeled as a tree-construction problem, in which the objective is to minimize the average access time for retrieving ranked documents. According to this model, we propose an efficient algorithm to construct such an optimal n-key-heap posting tree. Simulation results show that the disk access time and posting-list processing time for retrieving ranked documents can be effectively reduced by the proposed structure.

This paper is organized as follows. In Section 2, we define the structure of the n-key-heap posting tree and develop the posting-tree construction algorithm. We also present the scheme for retrieving ranked documents from a posting tree. In Section 3, we show simulation results in terms of disk transfer time and posting processing time. Finally, we give conclusions in Section 4.

2. N-key-heap posting tree

The challenge in designing a tree structure for a posting list is to preserve the orders of document identifiers and document weights simultaneously. We address this problem with the following definitions.
2.1 Definition of a posting tree

Definition 1: posting tree
Given a posting list L, its posting tree T is a rooted tree having the following properties:

Property 1: Every node x contains the following elements:
a. n (identifier, weight) pairs: $(x.identifier[i], x.weight[i]) \in L$, $1 \le i \le n$, which are stored in identifier-ascending order; i.e., x.identifier[1] < ... < x.identifier[n].
b. n+1 pointers: x.c[0], ..., x.c[n], which point to x's children.

Property 2: Identifiers in the subtree rooted at x.c[i] must be greater than x.identifier[i] (for i ≥ 1) and less than x.identifier[i+1] (for i < n): if $k_i$ is any identifier stored in the subtree rooted at x.c[i], then $k_0 < x.identifier[1] < k_1 < x.identifier[2] < \ldots < x.identifier[n] < k_n$.

By Property 1, when n identifiers are selected from L and inserted into a root node x, the remaining identifiers in L are split into at most n+1 segments. By Property 2, each segment recursively forms a posting tree and is pointed to by the corresponding x.c[i]. Take a posting list $L_1$ as an example:

$L_1$: <t; 10; (6, ), (15, 0.19), (55, 0.18), (169, 0.07), (191, 0.14), (238, 0.08), (240, 0.04), (242, ), (251, 0.05), (310, 0.13)>.

Figure 1(a) shows an example posting tree for $L_1$, with n = 4. With the pointers x.c[i], all identifiers can be accessed in ascending order by performing a DFS (depth-first search) along x.c[i] in the posting tree.

2.2 Definition of the n-key heap property

Definition 2: n-key heap property
A tree T, in which every node x contains n keys x.key[1], ..., x.key[n], satisfies the n-key heap property if the n keys of every node are all less than any key of its parent.

Here the term "key" can represent any specific characteristic. If document weights are used as the keys in Definition 2, then the nodes of a posting tree satisfying the n-key heap property form a weight-descending hierarchy. That is, higher-weight identifiers are stored in higher tree nodes.
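Definitions 1 and 2 can be illustrated with a small sketch. The `Node` class, the function names, and the seven-posting tree with n = 2 are hypothetical choices made for this example; they are not the tree of Figure 1. The traversal follows the child pointers depth-first to recover ascending identifier order, and the check tests that every key in a node is smaller than every key of its parent.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

@dataclass
class Node:
    # (identifier, weight) pairs in ascending identifier order (Property 1a).
    postings: List[Tuple[int, float]]
    # len(postings) + 1 child pointers x.c[0..n]; None marks an empty subtree.
    children: List[Optional["Node"]] = field(default_factory=list)

    def __post_init__(self):
        if not self.children:
            self.children = [None] * (len(self.postings) + 1)

def in_order(node: Optional[Node]) -> Iterator[int]:
    """Depth-first walk along x.c[i]; yields identifiers in ascending order."""
    if node is None:
        return
    for i, (identifier, _) in enumerate(node.postings):
        yield from in_order(node.children[i])
        yield identifier
    yield from in_order(node.children[len(node.postings)])

def satisfies_n_key_heap(node: Optional[Node], parent_min=float("inf")) -> bool:
    """True if all weights in each node are below every weight of its parent."""
    if node is None:
        return True
    weights = [w for _, w in node.postings]
    if max(weights) >= parent_min:
        return False
    return all(satisfies_n_key_heap(c, min(weights)) for c in node.children)

# Hypothetical posting tree with n = 2.
root = Node(
    [(20, 0.9), (50, 0.8)],
    [
        Node([(5, 0.3), (10, 0.5)]),
        Node([(30, 0.4)]),
        Node([(60, 0.1), (70, 0.7)]),
    ],
)
print(list(in_order(root)))        # [5, 10, 20, 30, 50, 60, 70]
print(satisfies_n_key_heap(root))  # True
```

Leaf nodes in this sketch may hold fewer than n postings, which is an assumption made to handle short segments.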
Figure 1. For posting list $L_1$: (a) an example posting tree, (b) the n-key-heap posting tree. (Legend: each node entry shows an identifier and its weight; x.c[i] and x.c[i+1] are child pointers.)

The weight-descending hierarchy helps the system visit more important identifiers early if the nodes are retrieved in a top-down manner. Figure 1(b) shows an example posting tree for $L_1$ that satisfies the n-key heap property. The relation between tree nodes at different levels is formulated in Lemma 1.

Lemma 1: Assume T is a posting tree satisfying the n-key heap property. If x and y are two document identifiers in T, and the tree node containing x is an ancestor of the tree node containing y, then weight(x) > weight(y).

Proof: The claim follows from the n-key heap property.

2.3 Constructing a minimal-access-time posting tree

The time to access a document identifier in a posting tree is proportional to the depth of the node containing it. Reducing the average node depth in a posting tree therefore shortens identifier access time. A posting tree satisfying the n-key heap property can hence be constructed in accordance with document weights in such a way that the average node depth for retrieving ranked identifiers is minimized. We formulate this construction problem as an optimization problem [4]. Without loss of generality, we assume that any two identifiers in a posting tree have different weights, and all weights are normalized to 1. For convenience, we call a posting tree with the n-key heap property an n-key-heap posting tree in the following.

Definition 3: N-key-heap posting tree construction problem
Let a posting list L contain m postings $(p_1, w_1), (p_2, w_2), \ldots, (p_m, w_m)$, where $p_i$ is the document identifier and $w_i$ is the weight of $p_i$. The weighted node-depth of a posting tree T is defined as $\sum_{i=1}^{m} w_i \cdot D_T(p_i)$, where $D_T(p_i)$ denotes the node depth of $p_i$ in T, and $(p_i, w_i) \in L$. The problem is to find an optimal n-key-heap posting tree T whose weighted node-depth is minimal.
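As a small worked instance of Definition 3, consider a hypothetical three-posting list with n = 1, and assume the root is at depth 1. The heap property forces the heaviest posting into the root, and the remaining two postings become its left and right children at depth 2:

```latex
\begin{align*}
L &= \{(1,\ 0.2),\ (2,\ 0.5),\ (3,\ 0.3)\}, \qquad n = 1 \\
D_T(2) &= 1, \qquad D_T(1) = D_T(3) = 2 \\
\sum_{i=1}^{3} w_i \cdot D_T(p_i) &= 0.5 \cdot 1 + 0.2 \cdot 2 + 0.3 \cdot 2 = 1.5
\end{align*}
```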
We derive an algorithm to construct such a posting tree, shown in Figure 2. The algorithm has two phases. In the first phase (lines 1-2), we begin with a greedy selection that puts the n highest-weight postings in the root node. In the second phase (lines 3-6), the children of the root node are constructed recursively in the same manner. The time complexity of the algorithm is $O(m \log m)$, where m is the total number of document identifiers in the given posting list. Lemma 2 shows that the problem of constructing an optimal n-key-heap posting tree has the optimal-substructure property: an optimal solution to the problem contains within it optimal solutions to subproblems [7].
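The two phases described above can be sketched in Python as follows. This is an illustrative reading of the construction, not the authors' implementation: the dictionary-based node layout, the function names, and the seven-posting input are assumptions, and the root is taken to be at depth 1 when computing the weighted node-depth.

```python
from heapq import nlargest

def build_posting_tree(postings, n):
    """postings: list of (identifier, weight) in ascending identifier order."""
    if not postings:
        return None
    # Phase 1: pick the n highest-weight postings, then restore identifier order.
    top = sorted(nlargest(n, range(len(postings)), key=lambda i: postings[i][1]))
    node = {"postings": [postings[i] for i in top], "children": []}
    # Phase 2: recurse on the segments between the selected postings.
    start = 0
    for i in top:
        node["children"].append(build_posting_tree(postings[start:i], n))
        start = i + 1
    node["children"].append(build_posting_tree(postings[start:], n))
    return node

def weighted_node_depth(node, depth=1):
    """Sum of weight * depth over all postings, with the root at depth 1."""
    if node is None:
        return 0.0
    here = sum(w * depth for _, w in node["postings"])
    return here + sum(weighted_node_depth(c, depth + 1) for c in node["children"])

# Hypothetical posting list whose weights sum to 1.
postings = [(1, 0.05), (4, 0.30), (7, 0.10), (9, 0.25),
            (12, 0.02), (15, 0.20), (18, 0.08)]
root = build_posting_tree(postings, 2)
print(root["postings"])                      # [(4, 0.3), (9, 0.25)]
print(round(weighted_node_depth(root), 2))   # 1.47
```

Slicing copies each segment, which keeps the sketch short; an index-range version would avoid the copies.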
Building_posting_tree(L, n)
Input: a posting list L = {$(p_1, w_1), (p_2, w_2), \ldots, (p_m, w_m)$}, and an integer n.
Output: an n-key-heap posting tree T whose weighted node-depth is minimal.
Begin
1 Retrieve the n highest-weight postings, $(p_{i_1}, w_{i_1}), \ldots, (p_{i_n}, w_{i_n})$, from L;
2 Let x be the root node of T. Put {$(p_{i_1}, w_{i_1}), \ldots, (p_{i_n}, w_{i_n})$} into x;
3 x.c[0] := Building_posting_tree({$(p_1, w_1), \ldots, (p_{i_1 - 1}, w_{i_1 - 1})$}, n);
4 for k := 1 to n-1 do
5     x.c[k] := Building_posting_tree({$(p_{i_k + 1}, w_{i_k + 1}), \ldots, (p_{i_{k+1} - 1}, w_{i_{k+1} - 1})$}, n);
6 x.c[n] := Building_posting_tree({$(p_{i_n + 1}, w_{i_n + 1}), \ldots, (p_m, w_m)$}, n);
7 return T;
End

Figure 2. Constructing an n-key-heap posting tree with the minimal weighted node-depth.

Lemma 2: Let T be an n-key-heap posting tree whose average weighted node-depth is minimal. Then, for any subtree Z in T, the average weighted node-depth of the posting tree T' = T - Z is also minimal.

Proof: Let WD(T) be the average weighted node-depth of the posting tree T. Since T' = T - Z, we have $WD(T') + \sum_{i \in Z} w_i D_T(i) = WD(T)$. If the average weighted node-depth of T' is not minimal, then there exists another posting tree T'' such that WD(T'') < WD(T'). This implies $WD(T'') + \sum_{i \in Z} w_i D_T(i) < WD(T)$, contradicting the optimality of T. Thus, the average weighted node-depth of T' is minimal.

By Lemmas 1 and 2, Theorem 1 follows.

Theorem 1: The algorithm in Figure 2 produces an n-key-heap posting tree whose average weighted node-depth is minimal.

Proof: Immediate from Lemmas 1 and 2.

2.4 Retrieving ranked document identifiers from a posting tree

For a query that contains one term and requests R ranked documents, we search the related posting tree starting from the root node at depth 1, then the nodes at depth 2, and so on. The search stops when the R highest-weight identifiers have been returned. This top-down search avoids traversing all postings, and is well suited to retrieving ranked documents from long posting lists.
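The top-down search for a one-term query can be sketched as below. Note the substitution: instead of a strict depth-by-depth scan, this sketch uses a best-first traversal with a priority queue, an assumption made here so that the returned postings are exactly the R highest-weight ones even when a deep node in one branch outweighs a shallow node in another. The nested-dictionary tree layout and the function name `top_r` are hypothetical.

```python
import heapq
from itertools import count

def top_r(root, r):
    """Return the r highest-weight (identifier, weight) pairs, heaviest first."""
    tie = count()  # unique tiebreaker so payloads are never compared
    # Entries: (-bound, kind, tiebreak, payload); kind 0 = posting, 1 = node.
    heap = [(-float("inf"), 1, next(tie), root)] if root else []
    result = []
    while heap and len(result) < r:
        _, kind, _, payload = heapq.heappop(heap)
        if kind == 0:
            result.append(payload)
            continue
        node = payload
        for posting in node["postings"]:
            heapq.heappush(heap, (-posting[1], 0, next(tie), posting))
        # By the n-key heap property, every weight below this node is smaller
        # than the node's minimum weight, so that minimum bounds each child.
        bound = min(w for _, w in node["postings"])
        for child in node["children"]:
            if child is not None:
                heapq.heappush(heap, (-bound, 1, next(tie), child))
    return result

# Hypothetical 2-key-heap posting tree.
tree = {
    "postings": [(4, 0.30), (9, 0.25)],
    "children": [
        {"postings": [(1, 0.05)], "children": [None, None]},
        {"postings": [(7, 0.10)], "children": [None, None]},
        {"postings": [(15, 0.20), (18, 0.08)],
         "children": [{"postings": [(12, 0.02)], "children": [None, None]},
                      None, None]},
    ],
}
print(top_r(tree, 3))  # [(4, 0.3), (9, 0.25), (15, 0.2)]
```

A node is only expanded when no pending posting outweighs its bound, so for small R the search touches only the upper part of the tree.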
For a query that contains two terms joined by a Boolean operator and requests R ranked documents, we propose a range-checking approach that performs the Boolean operation on as few nodes of the related posting trees as possible. Take the posting tree in Figure 1(b) as an example. When we fetch the root node first, we obtain a set of identifier ranges split from the original posting list $L_1$, as shown in Figure 3(a). If we perform an AND operation on these ranges against those of another posting tree, shown in Figure 3(b), a set of intersection ranges is generated, as shown in Figure 3(c). By recursively performing the same operation on these intersection ranges against the ranges of other nodes of depth k (k > 1) in the two posting trees, the ranges can be narrowed down to the identifiers satisfying the Boolean operation, or discarded if they clearly cannot satisfy it. With this range-checking process, the R highest-weight identifiers can be returned in top-down sequence without any further sorting. For a query containing k terms with m Boolean operators, the range-checking approach extends readily to k-way retrieval.
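The range-intersection step of this approach can be sketched as follows. The sketch covers only that one step: it derives the identifier ranges left open by a node's identifiers and intersects two such range sets, while the identifiers stored in the nodes themselves would have to be checked separately. The function names, the closed-integer-range representation, and the example values are assumptions, not the contents of Figure 3.

```python
def gap_ranges(identifiers, lo, hi):
    """Closed integer ranges within [lo, hi] that lie between a node's
    identifiers; each range corresponds to one child subtree."""
    ranges, start = [], lo
    for identifier in identifiers:
        if start <= identifier - 1:
            ranges.append((start, identifier - 1))
        start = identifier + 1
    if start <= hi:
        ranges.append((start, hi))
    return ranges

def intersect_ranges(a, b):
    """Intersect two ascending lists of disjoint closed ranges."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical root nodes: tree A spans identifiers 1..100, tree B 60..100.
ranges_a = gap_ranges([20, 50], 1, 100)   # [(1, 19), (21, 49), (51, 100)]
ranges_b = gap_ranges([80], 60, 100)      # [(60, 79), (81, 100)]
print(intersect_ranges(ranges_a, ranges_b))  # [(60, 79), (81, 100)]
```

In this example the first two ranges of tree A have no overlap with tree B, so under an AND operation the corresponding child subtrees would not need to be fetched.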
Figure 3. Performing an AND operation on two sets of ranges: (a) the node of depth 1 and its identifier ranges in the posting tree of $L_1$, (b) a node and its identifier ranges in another posting tree, (c) the intersection ranges.

3. Simulation and performance evaluation

Simulation is used to generate performance data. The factors examined in the performance evaluation are disk access time and posting processing time in retrieving ranked documents.

3.1 Simulation environment

We use part of WT10g, about 460,000 documents, as our test collection. (WT10g is a widely distributed collection and has been included in the TREC Web Test Collections [5].) To simulate query behavior, we implement a query-term generator that selects terms for synthesizing a set of queries. The occurrence of query terms follows a Zipf-like distribution [6]. An IR system is implemented on a Linux platform to simulate the retrieval services for the proposed retrieval algorithms.

3.2 Simulation results

Table 1 compares the average disk access time (DT) and posting-list processing time (PT) of the 10-key-heap posting tree and the identifier-sorted posting list for 100,000 one-term queries. (We do not compare with the weight-sorted posting list because it is not suited to Boolean query processing [3].) In the second column of Table 1, the average disk access time of the identifier-sorted posting list is fixed, regardless of the number of identifiers requested. This is because the entire linear posting list has to be retrieved from disk for sorting. In contrast, the disk access time of the posting tree is proportional only to the number of requested identifiers, because these identifiers can be retrieved selectively in top-down sequence. In addition, the average posting processing time of the posting tree is smaller than that of the identifier-sorted posting list because less sorting is required.
Table 2 compares the same metrics for the two structures, but for 100,000 two-term queries. For the posting tree, we apply the range-checking approach to retrieve ranked identifiers for each two-term query. Simulation results show that the posting tree with the range-checking approach outperforms the identifier-sorted posting list in terms of both average disk access time and posting-list processing time. This is because, with range checking, non-intersecting ranges of the two posting trees' identifiers can be discarded as early as possible, and unnecessary nodes do not need to be retrieved from storage. We also perform similar experiments on three-term and four-term queries to evaluate the advantages of the posting tree. The results all show that the posting tree outperforms the identifier-sorted posting list for fast ranked-document retrieval.
Table 1. Comparison of retrieval performance for 100,000 one-term queries
Amount of ranked identifiers requested | Identifier-sorted posting list: DT (ms) | Identifier-sorted posting list: PT (ms) | Posting tree: DT (ms) | Posting tree: PT (ms)

Table 2. Comparison of retrieval performance for 100,000 two-term queries
Amount of ranked identifiers requested | Identifier-sorted posting list: DT (ms) | Identifier-sorted posting list: PT (ms) | Posting tree: DT (ms) | Posting tree: PT (ms)

4. Conclusion

We propose an n-key-heap posting tree to speed up ranked-document retrieval for Boolean queries. This structure simultaneously preserves the orders of document identifiers and document weights, by edge pointers and by the heap property, respectively. A greedy algorithm is proposed to construct an optimal n-key-heap posting tree, whose weighted node-depth is minimal. The optimal posting tree guarantees minimum access time for retrieving ranked postings. We also propose a range-checking approach to speed up the retrieval process. Storage space is another issue for the posting tree, and it can be reduced through compression. Many studies have addressed identifier and weight compression [3]. However, a compression scheme that is fully suitable for the posting tree needs further investigation.

5. References

[1] E. Riloff, L. Hollaar, Text Databases and Information Retrieval, ACM Computing Surveys, Vol. 28, No. 1, 1996.
[2] J. Zobel, A. Moffat, K. Ramamohanarao, Inverted Files Versus Signature Files for Text Indexing, ACM Transactions on Database Systems, Vol. 23, No. 4, 1998.
[3] I. H. Witten, A. Moffat, T. C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd Ed., Morgan Kaufmann Publishers, Inc.
[4] C. H. Papadimitriou, K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Kenneth Steiglitz, Princeton University.
[5] TREC Web Test Collections.
[6] L. Breslau, P. Cao, L. Fan, G. Phillips, S. Shenker, Web Caching and Zipf-like Distributions: Evidence and Implications, IEEE INFOCOM, Vol. 1, 1999.
[7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, Introduction to Algorithms. Cambridge, MA: MIT Press, 1990.
Inverted file compression through document identifier reassignment
Information Processing and Management 39 (2003) 117 131 www.elsevier.com/locate/infoproman Inverted file compression through document identifier reassignment Wann-Yun Shieh a, Tien-Fu Chen b, Jean Jyh-Jiun
More informationA statistics-based approach to incrementally update inverted files
Information Processing and Management 41 (25) 275 288 www.elsevier.com/locate/infoproman A statistics-based approach to incrementally update inverted files Wann-Yun Shieh *, Chung-Ping Chung Department
More informationCluster based Mixed Coding Schemes for Inverted File Index Compression
Cluster based Mixed Coding Schemes for Inverted File Index Compression Jinlin Chen 1, Ping Zhong 2, Terry Cook 3 1 Computer Science Department Queen College, City University of New York USA jchen@cs.qc.edu
More informationLecture 3: B-Trees. October Lecture 3: B-Trees
October 2017 Remarks Search trees The dynamic set operations search, minimum, maximum, successor, predecessor, insert and del can be performed efficiently (in O(log n) time) if the search tree is balanced.
More informationAnalysis of Algorithms - Greedy algorithms -
Analysis of Algorithms - Greedy algorithms - Andreas Ermedahl MRTC (Mälardalens Real-Time Reseach Center) andreas.ermedahl@mdh.se Autumn 2003 Greedy Algorithms Another paradigm for designing algorithms
More informationLecture 5: Information Retrieval using the Vector Space Model
Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query
More informationWe assume uniform hashing (UH):
We assume uniform hashing (UH): the probe sequence of each key is equally likely to be any of the! permutations of 0,1,, 1 UH generalizes the notion of SUH that produces not just a single number, but a
More informationEfficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)
Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-
More informationGreedy Algorithms. CLRS Chapters Introduction to greedy algorithms. Design of data-compression (Huffman) codes
Greedy Algorithms CLRS Chapters 16.1 16.3 Introduction to greedy algorithms Activity-selection problem Design of data-compression (Huffman) codes (Minimum spanning tree problem) (Shortest-path problem)
More informationCS473-Algorithms I. Lecture 11. Greedy Algorithms. Cevdet Aykanat - Bilkent University Computer Engineering Department
CS473-Algorithms I Lecture 11 Greedy Algorithms 1 Activity Selection Problem Input: a set S {1, 2,, n} of n activities s i =Start time of activity i, f i = Finish time of activity i Activity i takes place
More informationTrees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures.
Trees Q: Why study trees? : Many advance DTs are implemented using tree-based data structures. Recursive Definition of (Rooted) Tree: Let T be a set with n 0 elements. (i) If n = 0, T is an empty tree,
More informationEnsures that no such path is more than twice as long as any other, so that the tree is approximately balanced
13 Red-Black Trees A red-black tree (RBT) is a BST with one extra bit of storage per node: color, either RED or BLACK Constraining the node colors on any path from the root to a leaf Ensures that no such
More informationComparative Analysis of Sparse Matrix Algorithms For Information Retrieval
Comparative Analysis of Sparse Matrix Algorithms For Information Retrieval Nazli Goharian, Ankit Jain, Qian Sun Information Retrieval Laboratory Illinois Institute of Technology Chicago, Illinois {goharian,ajain,qian@ir.iit.edu}
More informationarxiv: v3 [cs.ds] 18 Apr 2011
A tight bound on the worst-case number of comparisons for Floyd s heap construction algorithm Ioannis K. Paparrizos School of Computer and Communication Sciences Ècole Polytechnique Fèdèrale de Lausanne
More informationDistributed minimum spanning tree problem
Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with
More informationPerformance Improvement of Hardware-Based Packet Classification Algorithm
Performance Improvement of Hardware-Based Packet Classification Algorithm Yaw-Chung Chen 1, Pi-Chung Wang 2, Chun-Liang Lee 2, and Chia-Tai Chan 2 1 Department of Computer Science and Information Engineering,
More informationNotes on Binary Dumbbell Trees
Notes on Binary Dumbbell Trees Michiel Smid March 23, 2012 Abstract Dumbbell trees were introduced in [1]. A detailed description of non-binary dumbbell trees appears in Chapter 11 of [3]. These notes
More informationAlgorithms Dr. Haim Levkowitz
91.503 Algorithms Dr. Haim Levkowitz Fall 2007 Lecture 4 Tuesday, 25 Sep 2007 Design Patterns for Optimization Problems Greedy Algorithms 1 Greedy Algorithms 2 What is Greedy Algorithm? Similar to dynamic
More informationA Fast Algorithm for Optimal Alignment between Similar Ordered Trees
Fundamenta Informaticae 56 (2003) 105 120 105 IOS Press A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Jesper Jansson Department of Computer Science Lund University, Box 118 SE-221
More informationLECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS
Department of Computer Science University of Babylon LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS By Faculty of Science for Women( SCIW), University of Babylon, Iraq Samaher@uobabylon.edu.iq
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:
More informationPhysical Level of Databases: B+-Trees
Physical Level of Databases: B+-Trees Adnan YAZICI Computer Engineering Department METU (Fall 2005) 1 B + -Tree Index Files l Disadvantage of indexed-sequential files: performance degrades as file grows,
More informationBinary Heaps in Dynamic Arrays
Yufei Tao ITEE University of Queensland We have already learned that the binary heap serves as an efficient implementation of a priority queue. Our previous discussion was based on pointers (for getting
More informationNon-context-Free Languages. CS215, Lecture 5 c
Non-context-Free Languages CS215 Lecture 5 c 2007 1 The Pumping Lemma Theorem (Pumping Lemma) Let be context-free There exists a positive integer divided into five pieces Proof for for each and Let and
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:
More informationBinary Trees
Binary Trees 4-7-2005 Opening Discussion What did we talk about last class? Do you have any code to show? Do you have any questions about the assignment? What is a Tree? You are all familiar with what
More informationSearch Trees. Undirected graph Directed graph Tree Binary search tree
Search Trees Undirected graph Directed graph Tree Binary search tree 1 Binary Search Tree Binary search key property: Let x be a node in a binary search tree. If y is a node in the left subtree of x, then
More informationA Note on Scheduling Parallel Unit Jobs on Hypercubes
A Note on Scheduling Parallel Unit Jobs on Hypercubes Ondřej Zajíček Abstract We study the problem of scheduling independent unit-time parallel jobs on hypercubes. A parallel job has to be scheduled between
More informationText Compression through Huffman Coding. Terminology
Text Compression through Huffman Coding Huffman codes represent a very effective technique for compressing data; they usually produce savings between 20% 90% Preliminary example We are given a 100,000-character
More informationGeneralized indexing and keyword search using User Log
Generalized indexing and keyword search using User Log 1 Yogini Dingorkar, 2 S.Mohan Kumar, 3 Ankush Maind 1 M. Tech Scholar, 2 Coordinator, 3 Assistant Professor Department of Computer Science and Engineering,
More informationProblem Set 5 Solutions
Design and Analysis of Algorithms?? Massachusetts Institute of Technology 6.046J/18.410J Profs. Erik Demaine, Srini Devadas, and Nancy Lynch Problem Set 5 Solutions Problem Set 5 Solutions This problem
More informationTreaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19
CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types
More informationFull-Text Search on Data with Access Control
Full-Text Search on Data with Access Control Ahmad Zaky School of Electrical Engineering and Informatics Institut Teknologi Bandung Bandung, Indonesia 13512076@std.stei.itb.ac.id Rinaldi Munir, S.T., M.T.
More informationA Simplified Correctness Proof for a Well-Known Algorithm Computing Strongly Connected Components
A Simplified Correctness Proof for a Well-Known Algorithm Computing Strongly Connected Components Ingo Wegener FB Informatik, LS2, Univ. Dortmund, 44221 Dortmund, Germany wegener@ls2.cs.uni-dortmund.de
More information16 Greedy Algorithms
16 Greedy Algorithms Optimization algorithms typically go through a sequence of steps, with a set of choices at each For many optimization problems, using dynamic programming to determine the best choices
More informationScribe: Virginia Williams, Sam Kim (2016), Mary Wootters (2017) Date: May 22, 2017
CS6 Lecture 4 Greedy Algorithms Scribe: Virginia Williams, Sam Kim (26), Mary Wootters (27) Date: May 22, 27 Greedy Algorithms Suppose we want to solve a problem, and we re able to come up with some recursive
More informationCS521 \ Notes for the Final Exam
CS521 \ Notes for final exam 1 Ariel Stolerman Asymptotic Notations: CS521 \ Notes for the Final Exam Notation Definition Limit Big-O ( ) Small-o ( ) Big- ( ) Small- ( ) Big- ( ) Notes: ( ) ( ) ( ) ( )
More informationEvaluating XPath Queries
Chapter 8 Evaluating XPath Queries Peter Wood (BBK) XML Data Management 201 / 353 Introduction When XML documents are small and can fit in memory, evaluating XPath expressions can be done efficiently But
More informationChapter 11: Indexing and Hashing
Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree
More information1 o Semestre 2007/2008
Efficient Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 6 7 Outline 1 2 3 4 5 6 7 Text es An index is a mechanism to locate a given term in
More informationIndex Compression. David Kauchak cs160 Fall 2009 adapted from:
Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?
More informationB-Trees and External Memory
Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015 and External Memory 1 1 (2, 4) Trees: Generalization of BSTs Each internal node
More informationPriority Queues and Binary Heaps