iTrails: Pay-as-you-go Information Integration in Dataspaces


1 iTrails: Pay-as-you-go Information Integration in Dataspaces
Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
VLDB 2007

2 Outline
- Motivation
- iTrails
- Experiments
- Conclusions and Future Work

3 Problem: Querying Several Sources
Query: What is the impact of global warming in Zurich?
[Figure: the query has to be answered over four data sources (Laptop, Server, Web Server, DB Server); the systems in between are marked with question marks]

4 Solution 1: Use a Search Engine
Query: global warming zurich
System: Graph IR Search Engine over the text and links of each data source (Laptop, Server, Web Server, DB Server)
Drawback: Query semantics are not precise!
Examples: TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]

5 Solution 2: Use an Information Integration System
Query: //Temperatures/*[city = "zurich"]
System: Information Integration System (Temps, Cities, CO2, Sunspots, ...)
Data sources: Laptop, Server, Web Server, DB Server; each source needs a schema mapping, and some mappings are missing
Drawback: Too much effort to provide schema mappings!
Approaches: GAV (e.g. [ICDE95]), LAV (e.g. [VLDB96]), GLAV [AAAI99], P2P (e.g. [SIGMOD04])

6 Research Challenge: Is There an Integration Solution in-between These Two Extremes?
- Graph IR Search Engine: query global warming zurich, over text and links
- Dataspace System: query global warming zurich, Pay-as-you-go Information Integration (?)
- Information Integration System: query //Temperatures/*[city = "zurich"], over Temps, Cities, CO2, Sunspots with full-blown schema mappings
Data sources: Laptop, Server, Web Server, DB Server
Dataspace Vision by Franklin, Halevy, and Maier [SIGMOD Record 05]

7 Outline
- Motivation
- iTrails
- Experiments
- Conclusions and Future Work

8 iTrails Core Idea: Add Integration Hints Incrementally
Step 1: Provide a search service over all the data
- Use a general graph data model (see VLDB 2006)
- Works for unstructured documents, XML, and relations
Step 2: Add integration semantics via hints (trails) on top of the graph
- Works across data sources, not only between sources
Step 3: If more semantics needed, go back to step 2
Impact:
- Smooth transition between search and data integration
- Semantics added incrementally improve precision / recall

9 iTrails: Defining Trails
Basic Form of a Trail: Q_L[.C_L] → Q_R[.C_R]
- Q_L, Q_R: Queries (NEXI-like keyword and path expressions)
- C_L, C_R: Attribute projections
Intuition: When I query for Q_L[.C_L], you should also query for Q_R[.C_R]
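The trail form above can be captured in a small data structure. The following is only a minimal sketch: the class name `Trail` and its field names are hypothetical and are not taken from the iTrails implementation, and queries are kept as plain strings rather than parsed expressions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trail:
    """A trail Q_L[.C_L] -> Q_R[.C_R] (all names here are illustrative)."""
    left_query: str                    # Q_L: keyword or path expression
    right_query: str                   # Q_R: keyword or path expression
    left_attr: Optional[str] = None    # optional attribute projection C_L
    right_attr: Optional[str] = None   # optional attribute projection C_R

    def __str__(self) -> str:
        def side(query: str, attr: Optional[str]) -> str:
            # Append the projection only when one is given.
            return f"{query}.{attr}" if attr else query
        left = side(self.left_query, self.left_attr)
        right = side(self.right_query, self.right_attr)
        return f"{left} -> {right}"


# A keyword-to-path trail and a trail with attribute projections on both sides.
entity_trail = Trail("zurich", '//*[region = "ZH"]')
schema_trail = Trail("//Employee//*", "//Person//*", "tuple.empName", "tuple.name")
print(entity_trail)
print(schema_trail)
```

Keeping the projections optional mirrors the square brackets in the trail form: a trail may relate whole queries or only specific attributes.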

10 Trail Examples: Global Warming Zurich
Query: global warming zurich
Temperatures table on the DB Server (columns: date, city, region, celsius; the celsius values are missing in the source):
  date    city    region
  24-Sep  Bern    BE
  24-Sep  Uster   ZH
  25-Sep  Zurich  ZH
  26-Sep  Zurich  ZH
Trail for Implicit Meaning: When I query for global warming, you should also query for Temperature data above 10 degrees
  global warming → //Temperatures/*[celsius > 10]
Trail for an Entity: When I query for zurich, you should also query for references of zurich as a region
  zurich → //*[region = "ZH"]


11 Trail Example: Deep Web Bookmarks
Query: train home (Web Server)
Trail for a Bookmark: When I query for train home, you should also query for the TrainCompany's website with origin at ETH Uni and destination at Seilbahn Rigiblick
  train home → //traincompany.com//*[origin = "ETH Uni" and dest = "Seilbahn Rigiblick"]


12 Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
Data sources: Laptop, Server
Trail for Thesauri: When I query for car, you should also query for auto
  car → auto
Trails for Dictionary: When I query for car, you should also query for carro and vice-versa
  car → carro
  carro → car


13 Trail Examples: Schema Equivalences
Tables on the DB Server:
  Employee(empId, empName, salary)
  Person(SSN, name, age, income)
Trail for schema match on names: When I query for Employee.empName, you should also query for Person.name
  //Employee//*.tuple.empName → //Person//*.tuple.name
Trail for schema match on salaries: When I query for Employee.salary, you should also query for Person.income
  //Employee//*.tuple.salary → //Person//*.tuple.income

14 Outline
- Motivation
- iTrails
  - Core Idea
  - Trail Examples
  - How are Trails Created?
  - Uncertainty and Trails
  - Rewriting Queries with Trails
  - Recursive Matches
- Experiments
- Conclusion and Future Work

15 How are Trails Created?
Given by the user
- Explicitly
- Via Relevance Feedback
(Semi-)Automatically
- Information extraction techniques
- Automatic schema matching
- Ontologies and thesauri (e.g., WordNet)
- User communities (e.g., trails on gene data, bookmarks)

16 Uncertainty and Trails
Probabilistic Trails:
- model uncertain trails
- probabilities used to rank trails
Q_L[.C_L] → Q_R[.C_R] with probability p, 0 ≤ p ≤ 1
Example: car → auto, p =

17 Certainty and Trails
Scored Trails:
- give higher value to certain trails
- scoring factors used to boost scores of query results obtained by the trail
Q_L[.C_L] → Q_R[.C_R] with scoring factor sf, sf > 1
Examples:
- T1: weather → //Temperatures/*, p = 0.9, sf = 2
- T2: yesterday → //*[date = today() - 1], p = 1, sf = 3
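How the two annotations might be used can be sketched in a few lines. This is an illustration under stated assumptions, not the scoring model of iTrails: the class `ScoredTrail`, the helpers `rank_by_probability` and `boost`, and in particular the choice to apply the scoring factor by multiplication are all hypothetical. The values for T1 and T2 are the ones given in the examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTrail:
    name: str
    left: str
    right: str
    p: float    # probability that the trail is correct, 0 <= p <= 1
    sf: float   # scoring factor, > 1 for trails considered certain


def rank_by_probability(trails):
    """Order trails from most to least probable."""
    return sorted(trails, key=lambda t: t.p, reverse=True)


def boost(base_score, trail):
    """Boost the score of a result obtained through `trail`.

    Assumption: the boost is multiplicative; the slides only say that
    scoring factors boost the scores of results obtained by the trail.
    """
    return base_score * trail.sf


t1 = ScoredTrail("T1", "weather", "//Temperatures/*", p=0.9, sf=2)
t2 = ScoredTrail("T2", "yesterday", "//*[date = today() - 1]", p=1.0, sf=3)

ranking = [t.name for t in rank_by_probability([t1, t2])]
boosted = boost(0.5, t1)
```

Here `ranking` puts T2 ahead of T1 because its probability is higher, and a result with base score 0.5 found through T1 is boosted to 1.0.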

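18 Rewriting Queries with Trails
Query: weather yesterday
Trail T2: yesterday → //*[date = today() - 1]
(1) Matching: T2 matches the keyword yesterday
(2) Transformation: the right side of T2, //*[date = today() - 1], is prepared for the matched part
(3) Merging: weather (yesterday ∪ //*[date = today() - 1])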

19 Replacing Trails
Trails that use replace instead of union semantics
Query: weather yesterday
Trail T2: yesterday → //*[date = today() - 1]
(1) Matching: T2 matches the keyword yesterday
(2) Transformation
(3) Merging: weather //*[date = today() - 1]
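The three rewriting steps, with both union and replace semantics, can be sketched as follows. This is a simplified illustration and not the iTrails rewriting code: the function `rewrite` is hypothetical, the query is assumed to be a flat list of terms (how the terms are combined is left aside), and matching is reduced to string equality on a term.

```python
def rewrite(terms, trail_left, trail_right, replace=False):
    """Apply one trail to a query given as a list of terms.

    Returns a new list in which a matched term is either unioned with
    the trail's right side or replaced by it.
    """
    rewritten = []
    for term in terms:
        if term == trail_left:                 # (1) matching
            transformed = trail_right          # (2) transformation
            if replace:
                # (3) merging with replace semantics
                rewritten.append(transformed)
            else:
                # (3) merging with union semantics
                rewritten.append(("union", term, transformed))
        else:
            rewritten.append(term)             # unmatched terms stay as they are
    return rewritten


query = ["weather", "yesterday"]
t2_left, t2_right = "yesterday", "//*[date = today() - 1]"

with_union = rewrite(query, t2_left, t2_right)
with_replace = rewrite(query, t2_left, t2_right, replace=True)
```

With union semantics the original keyword is still evaluated next to the path expression; with replace semantics only the path expression remains.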

20 Problem: Recursive Matches (1/2)
Query: weather yesterday
T2: yesterday → //*[date = today() - 1]
T2 matches: weather (yesterday ∪ //*[date = today() - 1])
New query still matches T2, so T2 could be applied again
T2 matches: weather (yesterday ∪ //*[date = today() - 1] ∪ //*[date = today() - 1])
...
Infinite recursion

21 Problem: Recursive Matches (2/2)
Trails may be mutually recursive and enter an infinite loop
T3: //*.tuple.date → //*.tuple.modified
T10: //*.tuple.modified → //*.tuple.date
Query: weather (yesterday ∪ //*[date = today() - 1])
T3 matches: weather (yesterday ∪ //*[date = today() - 1] ∪ //*[modified = today() - 1])
T10 matches: weather (yesterday ∪ //*[date = today() - 1] ∪ //*[modified = today() - 1] ∪ //*[date = today() - 1])
We again match T3

22 Solution: Multiple Match Coloring Algorithm
Trails:
- T1: weather → //Temperatures/*
- T2: yesterday → //*[date = today() - 1]
- T3: //*.tuple.date → //*.tuple.modified
- T4: //*.tuple.date → //*.tuple.received
Query: weather yesterday
First Level: T1 matches weather, T2 matches yesterday
  (weather ∪ //Temperatures/*) (yesterday ∪ //*[date = today() - 1])
Second Level: T3, T4 match //*[date = today() - 1]
  (weather ∪ //Temperatures/*) (yesterday ∪ //*[date = today() - 1] ∪ //*[modified = today() - 1] ∪ //*[received = today() - 1])

23 Multiple Match Coloring Algorithm Analysis
Problem: MMCA is exponential in number of levels
Solution: Trail Pruning
- Prune by number of levels
- Prune by top-k trails matched in each level
- Prune by both top-k trails and number of levels
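The idea of coloring matched parts of the query, together with the two pruning options, can be sketched as follows. This is a loose illustration and not the MMCA as published: it assumes that every trail has its own color, that a trail is never applied to a part of the query already carrying its color, and that new alternatives inherit the colors of the part they were derived from. The function `rewrite_with_coloring`, the substring-based matching, the tuple layout `(name, left, right, p)`, and the probabilities given to T3 and T10 are hypothetical.

```python
def rewrite_with_coloring(terms, trails, max_levels=None, top_k=None):
    """Level-wise rewriting with colors.

    terms:  list of keyword or path strings.
    trails: list of (name, left, right, p) tuples.
    Returns one dict per term, mapping each alternative to its set of colors.
    """
    query = [{term: set()} for term in terms]
    level = 0
    while max_levels is None or level < max_levels:
        # Collect all matches of this level before changing the query.
        matches = [
            (p, name, i, alt, alt.replace(left, right))
            for i, alts in enumerate(query)
            for alt, colors in alts.items()
            for name, left, right, p in trails
            if left in alt and name not in colors
        ]
        if top_k is not None:
            # Prune: keep only the k most probable matches of this level.
            matches = sorted(matches, key=lambda m: m[0], reverse=True)[:top_k]
        if not matches:
            break
        for p, name, i, alt, new_alt in matches:
            alts = query[i]
            inherited = alts[alt] | {name}
            alts[alt].add(name)  # color the matched alternative
            # Add the new alternative, or merge colors if it already exists.
            alts[new_alt] = alts.get(new_alt, set()) | inherited
        level += 1
    return query


trails = [
    ("T1", "weather", "//Temperatures/*", 0.9),
    ("T2", "yesterday", "//*[date = today() - 1]", 1.0),
    ("T3", "date", "modified", 0.5),    # hypothetical probability
    ("T10", "modified", "date", 0.5),   # hypothetical probability
]
full = rewrite_with_coloring(["weather", "yesterday"], trails)
one_level = rewrite_with_coloring(["weather", "yesterday"], trails, max_levels=1)
pruned = rewrite_with_coloring(["weather", "yesterday"], trails, max_levels=1, top_k=1)
```

In this sketch the mutually recursive pair T3 and T10 stops after each has been applied once, because the alternatives they would match again already carry their colors. Limiting `max_levels` or `top_k` trades completeness of the rewrite for a bounded rewrite cost.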

24 Outline
- Motivation
- iTrails
- Experiments
- Conclusion and Future Work

25 iTrails Evaluation in iMeMex
iMeMex Dataspace System: Open-source prototype available at
Main Questions in Evaluation
- Quality: Top-K Precision and Recall
- Performance: Use of Materialization
- Scalability: Query-rewrite Time vs. Number of Trails

26 iTrails Evaluation in iMeMex
Scenario 1: Few High-quality Trails
- Closer to information integration use cases
- Obtained real datasets and indexed them
- 18 hand-crafted trails
- 14 hand-crafted queries
Scenario 2: Many Low-quality Trails
- Closer to search use cases
- Generated up to 10,000 trails

27 iTrails Evaluation in iMeMex: Scenario 1
Configured iMeMex to act in three modes
- Baseline: Graph / IR search engine
- iTrails: Rewrite search queries with trails
- Perfect Query: Semantics-aware query
Data: shipped to central index
[Table: sizes in MB for Laptop, Web Server, Server, DB Server]

28 Quality: Top-K Precision and Recall
K = 20
Scenario 1: few high-quality trails (18 trails)
[Figure: precision and recall per query]
- Search Engine misses relevant results
- Perfect Query always has precision and recall equal to 1
- Search Query is partially semantics-aware
Example queries: Q3: pdf yesterday; Q13: to = raimund.grube@enron.com
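For reference, top-K precision and recall can be computed as in the sketch below. The function `precision_recall_at_k` and the toy data are hypothetical, and the exact definition used in the evaluation may differ in detail, for example in whether precision is divided by K or by the number of results actually returned (the latter is assumed here).

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Precision and recall over the first k entries of a ranked result list.

    ranked:   result identifiers, best first.
    relevant: set of identifiers considered correct for the query.
    """
    top = ranked[:k]
    hits = sum(1 for result in top if result in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: two results inspected, one of them relevant, three relevant overall.
precision, recall = precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=2)
```

A result list that equals the relevant set, as for the Perfect Query mode, yields precision and recall of 1 under this definition whenever k is at least the number of relevant results.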

29 Performance: Use of Materialization
Scenario 1: few high-quality trails (18 trails)
[Figure: response times in sec.]
- Trail merging adds overhead to query execution
- Trail Materialization provides interactive times for all queries

30 Scalability: Query-rewrite Time vs. Number of Trails
Scenario 2: many low-quality trails
Query-rewrite time can be controlled with pruning

31 Conclusion: Pay-as-you-go Information Integration
Step 1: Provide a search service over all the data
Step 2: Add integration semantics via trails
Step 3: If more semantics needed, go back to step 2
Our Contributions
- iTrails: generic method to model semantic relationships (e.g. implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
- We propose a framework and algorithms for Pay-as-you-go Information Integration
- Smooth transition between search and data integration

32 Future Work
Trail Creation
- Use collections (ontologies, thesauri, Wikipedia)
- Work on automatic mining of trails from the dataspace
Other types of trails
- Associations
- Lineage

33 Questions? Thanks in advance for your feedback!

34 Backup Slides

35 Problem: Global Warming in Zurich
Query: What is the impact of global warming in Zurich?
Search for: global warming zurich
Meaning of keyword query
- global warming should lead to query on Temperatures
- zurich should lead to a query for a city

36 Problem: PDF Yesterday
Query: Retrieve all PDF documents added/modified yesterday
Search for: pdf yesterday
Meaning of keywords pdf and yesterday
Different sources, different schemas:
- Laptop: modified received
- DBMS: changed

37 Related Work: Search vs. Data Integration vs. Dataspaces
  Features            Search              Dataspaces          Data Integration
  Integration Effort  Low                 Pay-as-you-go       High
  Query Semantics     Precision / Recall  Precision / Recall  Precise
  Need for Schema     Schema-never        Schema-later        Schema-first

38 Personal Dataspaces Literature
- Dittrich, Salles, Kossmann, Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). VLDB, September
- Dittrich, Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB, September 2006
- Dittrich. iMeMex: A Platform for Personal Dataspace Management. SIGIR PIM, August
- Blunschi, Dittrich, Girard, Karakashian, Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System (Demo Paper). CIDR, January
- Dittrich, Blunschi, Färber, Girard, Karakashian, Salles. From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System. BTW 2007, March 2007
- Salles, Dittrich, Karakashian, Girard, Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. VLDB, September 2007

39 iDM: iMeMex Data Model
Our approach: get the data model closer to personal information, not the other way around
Supports:
- Unstructured, semi-structured and structured data, e.g., files & folders, XML, relations
- Clear separation of logical and physical representation of data
- Arbitrary directed graph structures, e.g., section references in LaTeX documents, links in filesystems, etc.
- Lazily computed data, e.g., ActiveXML (Abiteboul et al.)
- Infinite data, e.g., media and data streams
See VLDB

40 Data Model Options
[Table: support for personal data in different data models; the per-cell marks are missing in the source]
Data models compared: Bag of Words, Relational, XML, iDM
Features compared:
- Non-schematic data
- Serialization independent
- Support for graph data (noted in cells: specific schema; extension: XLink/XPointer)
- Support for lazy computation (noted in cells: view mechanism; extension: ActiveXML)
- Support for infinite data (noted in cells: extensions for document streams, relational streams, XML streams)

41 Data Models for Personal Information
[Figure: data models arranged by abstraction level, from lower to higher, between the Physical Level and Personal Information: Relational, XML, Document / Bag of Words, iDM]

42 Architectural Perspective of iMeMex
[Figure: layered architecture]
- Application Layer: Search & Browse, Office Tools, ...
- iMeMex PDSMS, with three query processors, each with Operators and a Catalog:
  - iQL Query Processor: complex operators (query algebra); Data Store, Physical Algebra, Result Cache
  - iDM Query Processor: Indexes & Replicas access (warehousing); Data Store, Data Cleaning, Indexes & Replicas
  - Data Source Query Processor: data source access (mediation); Data Source Plugins, Content Converters, ...
- Data Source Layer: File System, IMAP, DBMS, ...
