An Effective and Efficient Approach for Keyword-Based XML Retrieval. Xiaoguang Li, Jian Gong, Daling Wang, and Ge Yu retold by Daryna Bronnykova

Similar documents
ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies

Improving Data Access Performance by Reverse Indexing

Symmetrically Exploiting XML

SpiderX: Fast XML Exploration System

Reasoning with Patterns to Effectively Answer XML Keyword Queries

Searching SNT in XML Documents Using Reduction Factor

Top-k Keyword Search Over Graphs Based On Backward Search

Outline. Depth-first Binary Tree Traversal. GerĂȘnciade Dados daweb -DCC922 - XML Query Processing. Motivation 24/03/2014

Efficient LCA based Keyword Search in XML Data

Path-based Keyword Search over XML Streams

Processing XML Keyword Search by Constructing Effective Structured Queries

Evaluating XPath Queries

MAXLCA: A NEW QUERY SEMANTIC MODEL FOR XML KEYWORD SEARCH

An Efficient XML Index Structure with Bottom-Up Query Processing

Huayu Wu, and Zhifeng Bao National University of Singapore

XML Storage and Indexing

CHAPTER 3 LITERATURE REVIEW

Schema-Free XQuery. Yunyao Li Cong Yu H. V. Jagadish. Abstract. 1 Introduction

Efficient Keyword Search for Smallest LCAs in XML Databases

Efficient Indexing and Querying in XML Databases

Structural Consistency: Enabling XML Keyword Search to Eliminate Spurious Results Consistently

A New Indexing Strategy for XML Keyword Search

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data

Full-Text and Structural XML Indexing on B + -Tree

Structural consistency: enabling XML keyword search to eliminate spurious results consistently

7.1 Introduction. A (free) tree T is A simple graph such that for every pair of vertices v and w there is a unique path from v to w

Effective, Design-Independent XML Keyword Search

Module 3. XML Processing. (XPath, XQuery, XUpdate) Part 5: XQuery + XPath Fulltext

Cohesiveness Relationships to Empower Keyword Search on Tree Data on the Web

Mining XML Functional Dependencies through Formal Concept Analysis

CoXML: A Cooperative XML Query Answering System

LAF: a new XML encoding and indexing strategy for keyword-based XML search

EE 368. Weeks 5 (Notes)

The Research on Coding Scheme of Binary-Tree for XML

An Extended Byte Carry Labeling Scheme for Dynamic XML Data

XML databases. Jan Chomicki. University at Buffalo. Jan Chomicki (University at Buffalo) XML databases 1 / 9

Open Access The Three-dimensional Coding Based on the Cone for XML Under Weaving Multi-documents

Enabling Real Time Analytics over Raw XML Data

A Survey on Keyword Diversification Over XML Data

CSE 530A. B+ Trees. Washington University Fall 2013

Evaluation of Keyword Search System with Ranking

Improving the Performance of Search Engine With Respect To Content Mining Kr.Jansi, L.Radha

Section 5.5. Left subtree The left subtree of a vertex V on a binary tree is the graph formed by the left child L of V, the descendents

March 20/2003 Jayakanth Srinivasan,

Efficient, Effective and Flexible XML Retrieval Using Summaries

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1

Kikori-KS: An Effective and Efficient Keyword Search System for Digital Libraries in XML

Trees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures.

Trees : Part 1. Section 4.1. Theory and Terminology. A Tree? A Tree? Theory and Terminology. Theory and Terminology

Chapter 13 XML: Extensible Markup Language

Binary Trees

Retrieving Meaningful Relaxed Tightest Fragments for XML Keyword Search

Compression of the Stream Array Data Structure

Supporting Temporal Queries on XML Keyword Search Engines

Web Data Management. XML query evaluation. Philippe Rigaux CNAM Paris & INRIA Saclay

Effective XML Keyword Search with Relevance Oriented Ranking

Indexing Keys in Hierarchical Data

Web Data Management. Tree Pattern Evaluation. Philippe Rigaux CNAM Paris & INRIA Saclay

Information Sciences

Binary Trees, Binary Search Trees

Keywords Fuzzy search, Keyword Search, LCA and MCT, Type-ahead search, XML.

TwigList: Make Twig Pattern Matching Fast

An Implementation of Tree Pattern Matching Algorithms for Enhancement of Query Processing Operations in Large XML Trees

SQLfX Live Online Demo

TDDD43. Theme 1.2: XML query languages. Fang Wei- Kleiner h?p:// TDDD43

Efficient Access to Non-Sequential Elements of a Search Tree

Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)

ICRA: Effective Semantics for Ranked XML Keyword Search

Tree. A path is a connected sequence of edges. A tree topology is acyclic there is no loop.

Schema-Based XML-to-SQL Query Translation Using Interval Encoding

On Label Stream Partition for Efficient Holistic Twig Join

Efficient Non-Sequential Access and More Ordering Choices in a Search Tree

Search Driven Analysis of Heterogeneous XML Data

Cohesive Keyword Search on Tree Data

Selection, Generation and Extraction of MCCTree using XMCCTree

Data Structures and Algorithms

Chapter 10: Trees. A tree is a connected simple undirected graph with no simple circuits.

Structural XML Querying

A General Framework to Resolve the MisMatch Problem in XML Keyword Search

Containment of Partially Specified Tree-Pattern Queries

Symmetrically Exploiting XML

LCA-based Selection for XML Document Collections

PAPER Full-Text and Structural Indexing of XML Documents on B + -Tree

Trees. CSE 373 Data Structures

XML Data Management. 5. Extracting Data from XML: XPath

Augmenting Data Structures

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

One of the main selling points of a database engine is the ability to make declarative queries---like SQL---that specify what should be done while

Uses for Trees About Trees Binary Trees. Trees. Seth Long. January 31, 2010

Distributed Database System. Project. Query Evaluation and Web Recognition in Document Databases

Keyword query interpretation over structured data

Labeling and Querying Dynamic XML Trees

Chapter 4 Trees. Theorem A graph G has a spanning tree if and only if G is connected.

DDE: From Dewey to a Fully Dynamic XML Labeling Scheme

Improving Suffix Tree Clustering Algorithm for Web Documents

Monotone Constraints in Frequent Tree Mining

analyzing the HTML source code of Web pages. However, HTML itself is still evolving (from version 2.0 to the current version 4.01, and version 5.

Trees. Eric McCreath

Trees. Tree Structure Binary Tree Tree Traversals

CSI33 Data Structures

Transcription:

An Effective and Efficient Approach for Keyword-Based XML Retrieval Xiaoguang Li, Jian Gong, Daling Wang, and Ge Yu retold by Daryna Bronnykova

Search on XML documents 2 Why not use google?

Why are traditional search engines not suitable for searching XML docs? 3 There is no way to specify semantics in a keyword query executed with a traditional search engine. Consider these queries (they share the same set of keywords): 1. Find title and year of publications, of which Mary is an author 2. Find year and author of publications with similar titles to a publication of which Mary is an author (Queries Adapted from Li [2])

Why are traditional search engines not suitable for searching XML docs? 4 Traditional search engines index and return whole documents, not the specific parts of the documents. For the query: bike + luxury this document -> would be considered a relevant result < item> <ID> ID0 </ID> <name> bike </name> <desc>lame</desc> </item> < item> <ID> ID1</ID> <name> car </name> <desc> luxury </desc> </item>

XML document model as a tree 5 tree (rooted, ordered, labeled) edges = parent-child relationships between elements in XML document inner nodes (elements) and leaf nodes (values) labels path label path - a sequence of label names of a path ancestor relational subtree

Lowest Common Ancestor (LCA) 6

Lowest Common Ancestor (LCA) 7

Lowest Common Ancestor (LCA) 8

Lowest Common Ancestor (LCA) 9

Lowest Common Ancestor (LCA) 10

Lowest Common Ancestor (LCA) 11

Yet even more definitions meaningfully related nodes 12 Two nodes a and b are related meaningfully if the following conditions are met: 1. There are no two nodes with the same label on the subtree 2. The only two nodes that can have the identical labels on this subtree are the nodes a and b themselves

Example of meaningfully related nodes 13

Example of two nodes, which are not meaningfully related and do not satisfy the rule 14

Example of two nodes that satisfy the rule, but are not meaningfully related 15

Types of search on XML documents 16 keyword-based search structured query XSEarch (2003) Schema-free XQuery (2004) XRank XQuery

structured query search 17 ( ) requires knowledge of the XML document structure (partial or full); requires user to possess basic knowledge of query language; ( + ) high precision of the results; queries can convey sophisticated semantic meaning; problems associated with retrieving data from multiple documents (dif. schemas dif queries or use of translation)

keyword-based search 18 ( ) ( + ) hard to express semantic meaning in a query (e.g. no way to refer in a query to the tags deliberately); no way to specify which part of the document should be returned, often returns large number of undesired results. no knowledge of structure is necessary; no special knowledge is needed, anyone can execute search.

Schema-Free XQuery 19 the base of the approach is the concept of Lowest Common Ancestor authors define on the top of it Meaningful Lowest Common Ancestor Structure (MLCAS); MLCAS serves for the purpose of identifying the nodes, which are meaningfully related to each other; extends XQuery with the mlcas function, which provides an implementation of this approach. Li [2]

XSEarch 20 comprehensible syntax for queries, suitable for all users but also with extra possibilities for more advanced users to specify how keywords are related; has two types of algorithms for off- and online computation; has a full-fledged system for ranking the answers; based on the tree representation of the XML document develop their own heuristic for defining relationships between nodes, key concept define meaningfully related sets of nodes. Cohen [1]

The Approach, which is described in my article 21 based on the concept of Lowest Common Ancestor (LCA) developed the LCA of Label Path structure (PLCA); adapted the LCA rule for checking whether the nodes are meaningfully related (PLCA rule); use Dewey indexing for nodes; introduced a new type of index structure - PN-inverted index. Li [3]

Some more definitions 22 A prefix path lp' of lp is a sub-sequence of lp, where lp' has the same beginning with lp and lp' lp For the label path regions.africa.item.name, which of the following will be the prefix paths (according to the definition): regions.africa, regions.africa.item, regions.africa.item.id?

Node encoding 23 Dewey encoding: captures the relationship of ancestor and descendant information; reflects the depth from root to the node.

LCA of two label paths 24 Label path k is Common path of label paths a and b, if it is the prefix path for both a and b and it is longer than any other label path k, which is also the prefix path of both a and b. Then the last label in k is the LCA of a and b, denoted as k = PLCA (a,b) Information Retrieval, WS08 Daryna Bronnykova

PLCA rule 25 nodes u1 and u2 lp1 and lp2 label paths from root to the nodes u1 and u2 common path lp u=lca(u1,u2) u1 and u2 are meaningfully related if: lp = Dewey id of u there are no two distinct labels with the same name (other than ends of lp1 and lp2) in the set (lp1-lp)u(lp2-lp)

PN-Inverted Index 26 store for a keyword ID s of all nodes, which contain this keyword + label paths from the root to these nodes; nodes that have the same label path are grouped together to form the NodeSet; store ID of label path (instead of the path itself), using Path index(b) structure for mapping between the paths and their ID s

Query Algorithm 27 if (!stack.isempty) stack[top].childcount ++; childcount = 0; label getcurrentnodelabel(); stack.push(label, childcount); Example: <A <B <C> key />/>/>

Query Algorithm 28 if (node belongs to NV) curid getidfromstack(); curlabelpath getlabelpathfromstack(); curpathid pathindex.getpathid(curlabelpath); for each word in the text invertedindex.addentry(word, curid, curpathid) getidfromstack() and getlabelpathfromstack() functions traverse the stack from bottom to top to compute the Dewey id and the label path of current node, respectively addentry function clusters nodes to the corresponding keyword

Query Algorithm 29 Example: <A <B <C> key />/>/>

Query evaluation 30 all-pairs approach for defining a set of meaningfully related nodes each pair in a set should be meaningfully related

Naïve implementation of query evaluation 31 the search engine looks up each of the k keywords; returns k nodes sets ( each corresponding to one keyword); calculates the Cortesian product of the these k sets; checks whether they are meaningfully related (for each pair) based on the PLCA rule.

Efficient Query Evaluation 32 if two nodes are not meaningfully related, they will result in irrelevant answer in any combination with other nodes, therefore such nodes should be pruned as early as possible

Similarities with Schema-free XQuery 33 both approaches are based on the LCA concept; build up very similar structures based on the LCA rule to check whether the nodes are meaningfully related to each other; use index encodings for nodes, which captures depth and path from the root.

Similarities with XSEarch 34 both have comprehensive syntax for queries, suitable and convenient for naïve users; relational subtree concept; check against the result elements satisfy each word in the query, as well as whether or not they are meaningfully related.

Performance 35

Performance 36

Performance 37

References 38 1. S. Cohen, J. Mamou, Y. Kanza, Y. Sagiv. XSearch: a semantic search engine for xml. Proc. of VLDB, 2003 2. Y. Li, C. Yu, H. V. Jagadish. Schema-free XQuery. Proc. of VLDB, 2004 3. X.Li, J.Gong,D.Wang, G. Yu An Effective and Efficient Approach for Keyword- Based XML Retrieval