6LPSOLI\LQJ'DWD$FFHVV 7KH(QHUJ\'DWD&ROOHFWLRQ3URMHFW. José Luis Ambite University of Southern California Information Science Institute

Size: px

Start display at page:

Download "6LPSOLI\LQJ'DWD$FFHVV 7KH(QHUJ\'DWD&ROOHFWLRQ3URMHFW. José Luis Ambite University of Southern California Information Science Institute"

Ethelbert Cooper
5 years ago
Views:

$6LPSOLI\LQJ'DWD$FFHVV 7KH(QHUJ\'DWD&ROOHFWLRQ3URMHFW José Luis$

1 6LPSOLI\LQJ'DWD$FFHVV 7KH(QHUJ\'DWD&ROOHFWLRQ3URMHFW José Luis Ambite University of Southern California Information Science Institute

Query results How many people had breast cancer in the area over the past 30 years?

2 7KH9LVLRQ $VNWKH*RYHUQPHQW We re thinking of moving to Denver...What are the schools like there? Query results How many people had breast cancer in the area over the past 30 years? Is there an orchestra? An art gallery? How far are the nightclubs? How have property values in the area changed over the past decade? Census Labor Stats

3 7KHSUREOHPDQGWKHVROXWLRQ Q Problem: FedStats has thousands of databases in over 70 Government agencies: X data is overlapping, sometimes near-duplicated X even Government officials cannot find it! Q Solution: Create a system to provide easy standardized access: X need multi-database access engine, X need powerful user interface, X need terminology standardization mechanism.

4 5HVHDUFKFKDOOHQJHV Q How do you scale to incorporate many databases? try to build data models automatically Q How do you integrate their data models, to allow querying across sources and agencies? take a large ontology, link the models into it automatically, and provide query reformulator Q How do you incorporate additional information that is available from text sources? use language processing tools to extract it Q How do you handle footnotes in the databases? extract them from the tables automatically

5 6\VWHP$UFKLWHFWXUH Construction phase: Deploy DBs Extend ontol. Integrated ontology - global terminology - source descriptions - integration axioms User Interface - ontology browser - query constructor Domain modeling - DB analysis - text analysis Query processor - reformulation - cost optimization R S T σ User phase: Compose query Sources Access phase: Create DB query Retrieve data

6 Construction phase: Deploy DBs Extend ontol. Integrated ontology - global terminology - source descriptions - integration axioms User Interface - ontology browser - query constructor Domain modeling - DB analysis - text analysis Query processor - reformulation - cost optimization R S T σ User phase: Compose query Sources Access phase: Create DB query Retrieve data

7 ,QIRUPDWLRQ,QWHJUDWLRQLQ6,06 To enable query access, SIMS needs to: [Arens+96] [Ambite+00] Q address semantic heterogeneity: => describe sources in common domain model Q address syntactic (format) heterogeneity: => standardize access to sources: X Structured (DBMS): Oracle, MS Access X Semistructured: wrappers for html, text, pdf

8 'RPDLQ0RGHOIRU7LPH6HULHV 8QLW 3RLQWRI6DOH 3URGXFW <HDU 3HULRG 0RQWK $UHD 86$ &$ 1< :HHN 0HDVXUHPHQW 'DWH 6XEFODVV 3DUWRI *HQHUDO5HODWLRQ 6RXUFH0DSSLQJ 9DOXH 7LPH 6HULHV )RRWQRWH 7DJ *5HJXODU 7H[W *3UHPLXP *DVROLQH &3, *3UHPLXP 8QOHDGHG 4XDOLW\ 3ULFH */HDGHG *8QOHDGHG 33, 9ROXPH

9 <HDU 'RPDLQ0RGHO 3HULRG 0RQWK $UHD 86$ 1< &$ :HHN 8QLW &HQWVJDOORQ (,$7 0HDVXUHPHQW 'DWH 6XEFODVV 3DUWRI *HQHUDO5HODWLRQ 6RXUFH0DSSLQJ (,$73ULFHVRI5HJXODU*DVROLQHLQ&$ 6DOHVWRHQGXVHUVWKURXJKUHWDLORXWOHWV 0HDVXUHGLQFHQWVSHUJDOORQ 7LPH 6HULHV 9DOXH 3RLQWRI6DOH 5HWDLORXWOHWV )RRWQRWH 7DJ (,$0XOWLVWDWH :UDSSHU *5HJXODU *3UHPLXP 7H[W 3URGXFW *DVROLQH *3UHPLXP 8QOHDGHG 4XDOLW\ 3ULFH */HDGHG *8QOHDGHG 9ROXPH

10 :UDSSHUV provide uniform mechanism for extracting data from semi-structured sources (HTML, text, ) 5HVWDXUDQWVLQ 6DQWD0RQLFD" 1DPH $GGUHVV &KLQRLVRQ0DLQ 0DLQ6W &KDR'DUD 8QLRQ6T «Wrapper

11 :UDSSHU%XLOGLQJ7RROV Q Creating Wrappers (semi-)automatically: X Demonstration-oriented interface enables users to show system what to extract by example X System automatically induces extraction rules X Common extraction engine Q Benefits: X Rapid wrapper creation X Simplified wrapper maintenance Q Fetch.com X Start-up that comercializes the technology [Muslea+99]

12 :UDSSHU%XLOGLQJ7RROV

13 Construction phase: Deploy DBs Extend ontol. Integrated ontology - global terminology - source descriptions - integration axioms User Interface - ontology browser - query constructor Domain modeling - DB analysis - text analysis Query processor - reformulation - cost optimization R S T σ User phase: Compose query Sources Access phase: Create DB query Retrieve data

14 The Core Large ontology (SENSUS) Concepts from glossaries (LKB) Linguistic mapping Logical mapping Domain-specific ontologies (SIMS models) Data sources

15 2QWRORJ\PRGHOVGDWD Q SENSUS ontology (90000 concepts) Q LKB glossary extracts (6000 concepts) X Purpose: provide rich domain knowledge X Sources: Census, EIA, EPA texts Q SIMS domain model (500 dimension conce, series) X Purpose: describe data, query planning and optimization X Sources: database metadata, extracted by hand Q Database sources (25000 time series) X EIA OGIRS: CD with MS Access database; no footnotes Q Web data sources (25000 time series) X EIA Multi-State: time series; formatted text X BLS: 60 time series; web form; 40 have footnotes X EIA Petroleum Supply Monthly: 20 time series in PDF; converted to HTML and then wrapped; all with footnotes X CEC: 12 series; static HTML tables; no footnotes

$,6, V 6(1686RQWRORJ\ [Knight+95] [Hovy+00] Q Q Q Q Q Taxonomy, multiple$ 90,000 concepts Top level: Penman Upper Model (ISI) Body: WordNet

16 ,6, V 6(1686RQWRORJ\ [Knight+95] [Hovy+00] Q Q Q Q Q Taxonomy, multiple superclass links Approx. 90,000 concepts Top level: Penman Upper Model (ISI) Body: WordNet (Princeton), rearranged KWWSDULDGQHLVLHGXVHQVXVHGF Used at ISI for machine translation, text summarization, database access

17 ([WUDFWLQJ PHWDGDWD IURPWH[W [Klavans+01] Problems: X Proliferation of terms in domain X Agencies define terms differently X Many refer to the same or related entity X Lengthy term definitions often contain important information which is buried Example input: Motor Gasoline Blending Components: Naphthas (e.g., straight-run gasoline, alkylate, reformate, benzene, toluene, xylene) used for blending or compounding into finished motor gasoline. These components include reformulated gasoline blendstock for oxygenate blending (RBOB) but exclude oxygenates (alcohols, ethers), butane, and pentanes plus. Note: Oxygenates are reported as individual components and are included in the total for other hydrocarbons, hydrogens, and oxygenates.

18 /H[LFDO.QRZOHGJH%DVH/.%7RRO Combines statistical and linguistic methods: Q Q Q Q Q trainable Finite State Recognizer identifies topics with high accuracy provides complete coverage useful for any subject area produced over 6,000 gasoline concepts from EIA and BLS

19 /LQNLQJ'RPDLQ0RGHOVWR2QWRORJ\ Match DM concepts against SENSUS concepts 1. Name Match Variations: substrings; reward for contiguous endings; stemming; variant word forms; case sensitivity; etc. Implemented string match from molecular biology for DNA matching 2. Definition Match Variations: stemming; stop words; words or letter trigrams; word weights (itf, tf.idf ), etc. Used IR vector space; cosine measure 3. Dispersal Match Def: Given a cluster of concepts, match them all and then select one match for each concept to form tightest cluster in SENSUS Variations: greedy search; number of candidates; weightings Name match (6 hrs) k-means clustering (5 hrs) Def match (4 hrs) Dispersal match (incompl.) [Hovy+01]

20 Construction phase: Deploy DBs Extend ontol. Integrated ontology - global terminology - source descriptions - integration axioms User Interface - ontology browser - query constructor Domain modeling - DB analysis - text analysis Query processor - reformulation - cost optimization R S T σ User phase: Compose query Sources Access phase: Create DB query Retrieve data

29 &RQFOXVLRQ Large-scale information integration Q Major research challenges: X Semi-automated domain modeling, information extraction, linkage to ontology X Efficient query processing X Footnote extraction and display Q Major practical challenges: X Understanding users needs: interface, information... X Getting a lot of data into the system

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them