Limitations of XPath & XQuery in an Environment with Diverse Schemes

Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML-Data
Martin Theobald, Ralf Schenkel, and Gerhard Weikum
Saarland University, Saarbrücken, Germany
6/23/2003 Martin Theobald: Automatic Classification of XML Data

Limitations of XPath & XQuery in an Environment with Diverse Schemes
  Example record (DBLP):
    <inproceedings key="conf/icde/bargalw02">
      <author>Roger S. Barga</author>
      <author>David B. Lomet</author>
      <author>Gerhard Weikum</author>
      <title>Recovery Guarantees for General Multi-Tier Applications.</title>
      <year>2002</year>
      <booktitle>icde</booktitle>
    </inproceedings>
  //proceedings[contains(., "icde")]/title[contains(., "Recovery")]/parent::*
    0 Results.
  //title[contains(., "Recovery")]/parent::*
    Results.
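The mismatch between the two queries on the slide above can be reproduced with a small sketch. Python's standard xml.etree.ElementTree does not support contains() or the parent axis, so the predicates are emulated in plain Python here. The <dblp> wrapper element, the helper names, and the conventional capitalization of the record are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The DBLP record from the slide, wrapped in a hypothetical <dblp> root.
DBLP_SNIPPET = """
<dblp>
  <inproceedings key="conf/icde/bargalw02">
    <author>Roger S. Barga</author>
    <author>David B. Lomet</author>
    <author>Gerhard Weikum</author>
    <title>Recovery Guarantees for General Multi-Tier Applications.</title>
    <year>2002</year>
    <booktitle>icde</booktitle>
  </inproceedings>
</dblp>
"""


def text_of(elem):
    """String value of an element: all descendant text concatenated."""
    return "".join(elem.itertext())


def parents_of_matching_titles(root, parent_tag, parent_needle, title_needle):
    """Emulate //TAG[contains(., p)]/title[contains(., t)]/parent::*.

    parent_tag=None and parent_needle=None drop the corresponding
    restriction, which emulates //title[contains(., t)]/parent::*.
    """
    hits = []
    for parent in root.iter():
        if parent_tag is not None and parent.tag != parent_tag:
            continue
        if parent_needle is not None and parent_needle not in text_of(parent):
            continue
        if any(child.tag == "title" and title_needle in text_of(child)
               for child in parent):
            hits.append(parent)
    return hits


root = ET.fromstring(DBLP_SNIPPET)
# Query 1 names <proceedings>, but the record is an <inproceedings>.
strict = parents_of_matching_titles(root, "proceedings", "icde", "Recovery")
# Query 2 drops the element name and the venue restriction.
relaxed = parents_of_matching_titles(root, None, None, "Recovery")
print(len(strict), len(relaxed))
```

On this single record the first query finds nothing because the element name does not match the schema, while the second finds the record only by giving up the restriction to the venue.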

Automatic Classification helps.
  [Figure: topic directory with the labels Proceedings, SIGMOD, VLDB, ICDE, P2P, XML, Databases, Recovery]
  12 Results:
    <inproceedings key="conf/icde/bargalw02">
      <author>Roger S. Barga</author>
      <author>David B. Lomet</author>
      <author>Gerhard Weikum</author>
      <title>Recovery Guarantees for General Multi-Tier Applications.</title>
      <year>2002</year>
      <booktitle>icde</booktitle>
    </inproceedings>
    ..

Challenges in XML Classification
  Exploit annotation and structure
  Exploit ontological knowledge on sparse and/or heterogeneous training data
  Mapping of tags (and text terms) to semantic concepts
  In-document word sense disambiguation
  Quantification of concept similarities

Using Structure and Ontological Knowledge for Classification
  [Figure: system architecture with the following components]
  XML Training Documents, XML Test Document
  Structure-aware Document Analyzer
  Tokens with Context Nodes
  Structural Features: Tag-Term Pairs, Element Paths & Twigs
  Ontology Service: Disambiguation and Mapping onto Concepts
  Ontology Database with Dice Similarities (based on WordNet)
  Large Document Collection (Focused Crawling) as Basis for Concept Similarity Estimation w.r.t. Natural Term Correlations
  Feature Selection using MI
  Incremental Mapping
  Topic-Specific Feature Spaces
  Feature Vectors
  SVM Classifier

Feature-Selection & Term Weighting
  [Figure: topic tree with binary yes/no decisions; node labels: TOPICS, Database Core, Web IR, Semistr. Data, Data Mining, XML]
  Linear Support Vector Machines for binary classifications in the topic tree
  Topic-specific feature spaces to support binary classification steps
  Mutual Information (MI) yields a ranking of the most discriminating features per topic (a.k.a. Kullback-Leibler divergence):
    $MI(X_i, c_j) := P[X_i \wedge c_j] \log_2 \frac{P[X_i \wedge c_j]}{P[X_i] \, P[c_j]}$
  Term weights: classic TF*IDF, with IDF computed on element frequencies
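The MI-based feature ranking on the slide above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: probabilities are estimated as plain relative document frequencies, and the function names and the toy corpus are assumptions.

```python
import math


def mutual_information(n_joint, n_feature, n_topic, n_total):
    """MI(X_i, c_j) = P[X_i and c_j] * log2(P[X_i and c_j] / (P[X_i] * P[c_j])).

    Probabilities are estimated from document counts (an assumption).
    A feature that never co-occurs with the topic scores 0.
    """
    if n_joint == 0:
        return 0.0
    p_joint = n_joint / n_total
    p_feature = n_feature / n_total
    p_topic = n_topic / n_total
    return p_joint * math.log2(p_joint / (p_feature * p_topic))


def rank_features(docs, topic, top_k):
    """Rank features by MI with respect to one topic; docs is a list of
    (feature_set, topic_label) pairs."""
    n_total = len(docs)
    n_topic = sum(1 for _, label in docs if label == topic)
    vocabulary = set().union(*(features for features, _ in docs))
    scores = {}
    for feature in vocabulary:
        n_feature = sum(1 for features, _ in docs if feature in features)
        n_joint = sum(1 for features, label in docs
                      if feature in features and label == topic)
        scores[feature] = mutual_information(n_joint, n_feature, n_topic, n_total)
    # Highest MI first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical toy corpus: two documents per topic.
DOCS = [
    ({"make$audi", "car$make$year"}, "cars"),
    ({"make$audi"}, "cars"),
    ({"title$recovery"}, "databases"),
    ({"year$98", "title$recovery"}, "databases"),
]

ranking = rank_features(DOCS, "cars", top_k=2)
print(ranking)
```

Keeping only the top-ranked features per topic is what yields the topic-specific feature spaces that each binary SVM works on.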

Exploiting Annotation: Tag-Term Pairs
  Structure-aware features for more precise document representation
  Interpret (tag, term) pairs as concept-value pairs in the spirit of a database schema
    <car>
      <make>audi</make>
      <type>a4</type>
      <year>98</year>
      <price>10.000</price>
    </car>
  make$audi, type$a4, year$98, price$10.000
  car$make$audi, car$type$a4, car$year$98, car$price$10.000

Exploiting Structure: Element Paths and Twigs
  [Figure: element tree with root car and children make, year, price]
  Extension of the feature space by structural patterns: Paths & Twigs
  Preserve or disregard element ordering
  car$make$year, car$year$price, car$make$price
  Different feature types (tag-term pairs & twigs) are mapped to distinct dimensions in the vector space
  Scalability and noise reduction through feature selection (MI) under an integrated SVM model
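The feature types on the two slides above can be derived from the car example with a short sketch. It assumes that a twig is a parent element together with two of its children, which matches the three twigs listed on the slide but is not necessarily the authors' full definition. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

CAR_XML = ("<car><make>audi</make><type>a4</type>"
           "<year>98</year><price>10.000</price></car>")


def tag_term_pairs(root):
    """(tag, term) pairs such as make$audi, one per whitespace-separated term."""
    return [f"{el.tag}${term.lower()}"
            for el in root.iter()
            for term in (el.text or "").split()]


def parent_tag_term_pairs(root):
    """Pairs extended by the parent tag, such as car$make$audi."""
    return [f"{parent.tag}${child.tag}${term.lower()}"
            for parent in root.iter()
            for child in parent
            for term in (child.text or "").split()]


def sibling_twigs(root, keep_order=True):
    """Twigs of a parent and two of its children, such as car$make$year.

    keep_order=True preserves document order of the two children;
    keep_order=False sorts them, so that element ordering is disregarded.
    """
    twigs = []
    for parent in root.iter():
        child_tags = [child.tag for child in parent]
        for first, second in combinations(child_tags, 2):
            if not keep_order:
                first, second = sorted((first, second))
            twigs.append(f"{parent.tag}${first}${second}")
    return twigs


root = ET.fromstring(CAR_XML)
pairs = tag_term_pairs(root)
parent_pairs = parent_tag_term_pairs(root)
twigs = sibling_twigs(root)
print(pairs)
print(parent_pairs)
print(twigs)
```

Because each generated string is a distinct key, tag-term pairs and twigs naturally end up in separate dimensions of the feature vector.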

Exploiting Ontological Knowledge
  WordNet: directed and weighted ontology graph capturing Hypernyms, Hyponyms, Holonyms
  [Figure: ontology graph with the synsets s1 [wheeled vehicle], s2 [motor vehicle], s3 [car, automobile, wagon, motorcar], s4 [truck, motortruck], s5 [wheel]; one edge weight shown: 0.8]
  sim(s3, s4) = ½( )?
  Quantified relationships based on estimated concept similarities
  Dice coefficient:
    $dice(s_1, s_2) = \frac{2 \cdot df(senses(s_1) \wedge senses(s_2))}{df(senses(s_1)) + df(senses(s_2))}$

Word Sense Disambiguation
  Compare term context con(t_k) with synset context con(s_j) using the cosine measure
  Synset context includes hypernyms, hyponyms, and holonyms plus WordNet descriptions
  Infer semantics from current context rather than stipulate it
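Both measures on the slides above are easy to sketch. The version below makes two labeled assumptions: df of a synset counts the documents that contain at least one of its terms, and the joint df counts the documents in which terms of both synsets co-occur. The corpus, the contexts, and the sense labels are made up for illustration and are not taken from WordNet.

```python
import math
from collections import Counter


def df(corpus, synset):
    """Documents containing at least one term of the synset (assumption)."""
    return sum(1 for doc in corpus if doc & synset)


def df_joint(corpus, s1, s2):
    """Documents in which terms of both synsets co-occur (assumption)."""
    return sum(1 for doc in corpus if doc & s1 and doc & s2)


def dice(corpus, s1, s2):
    """dice(s1, s2) = 2 * df(s1 and s2) / (df(s1) + df(s2))."""
    denominator = df(corpus, s1) + df(corpus, s2)
    return 2 * df_joint(corpus, s1, s2) / denominator if denominator else 0.0


def cosine(context_a, context_b):
    """Cosine measure between two bags of words."""
    a, b = Counter(context_a), Counter(context_b)
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(term_context, synset_contexts):
    """Pick the sense whose context is most similar to the term's context."""
    return max(synset_contexts,
               key=lambda sense: cosine(term_context, synset_contexts[sense]))


# Hypothetical corpus: each document is a set of terms.
CORPUS = [{"car", "truck"}, {"car", "wheel"}, {"truck"}, {"automobile"}]
CAR = {"car", "automobile"}
TRUCK = {"truck", "motortruck"}
similarity = dice(CORPUS, CAR, TRUCK)

# Hypothetical contexts for two senses of the term "wagon".
SENSES = {
    "wagon (motor vehicle)": ["car", "automobile", "motor", "vehicle", "road"],
    "wagon (horse cart)": ["horse", "cart", "wheel", "farm"],
}
chosen = disambiguate(["car", "engine", "road"], SENSES)
print(similarity, chosen)
```

In the toy corpus the car synset occurs in three documents, the truck synset in two, and both in one, so the Dice coefficient is 2 * 1 / (3 + 2).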

Incremental Mapping for Classification
  For any unknown concept s in a test document d do:
    Replace s with its closest match s' from the training data
    Adjust term weight of s' in d by concept similarity sim(s, s')
  [Figure: concept s' [sport utility vehicle, S.U.V.] from the training data (feature selection using MI) linked with weight 0.21 to concept s [jeep, landrover] from the test document]
  Problem: Possible loss of feature correlations that the SVM has learned
    No feature independence for the SVM
    Reconsider dice(s, s') with a restrictive threshold
    Replace concept s only if s is strongly correlated to s', otherwise skip s

Experimental Evaluation: Internet Movie Database (IMDB)
  Training with very few features for Action vs. Western
  Homogeneous, but rich structure with varying amounts of content
  Tag-term pairs (95%) plus twigs (5%) using MI
  Ontology lookups on tags only
  [Figure: F measure plotted against # Features per topic, comparing "Tag-Term Pairs & Twigs using tf*idf for Elements" with "Text Features using tf*idf for Documents"]
    $F = \frac{2 \cdot precision \cdot recall}{precision + recall}$
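The incremental mapping step with a restrictive threshold can be sketched as below. Several details are assumptions rather than statements from the slides: the adjusted weight is the original weight multiplied by the similarity, weights that land on the same training concept are added up, and the threshold values are arbitrary. The 0.21 similarity is the one shown in the figure; the other names and numbers are made up.

```python
def map_unknown_concepts(test_weights, training_concepts, sim, threshold):
    """Map concepts of a test document onto the training feature space.

    Known concepts keep their weight. An unknown concept s is replaced by
    its closest training concept s' only if sim(s, s') >= threshold, and
    its weight is scaled by sim(s, s'); otherwise s is skipped.
    """
    mapped = {}
    for concept, weight in test_weights.items():
        if concept in training_concepts:
            mapped[concept] = mapped.get(concept, 0.0) + weight
            continue
        # Closest match in the training data (sorted for determinism).
        best = max(sorted(training_concepts),
                   key=lambda candidate: sim(concept, candidate),
                   default=None)
        if best is None:
            continue
        similarity = sim(concept, best)
        if similarity >= threshold:
            mapped[best] = mapped.get(best, 0.0) + weight * similarity
        # Weakly correlated concepts are dropped.
    return mapped


# Hypothetical similarity table, e.g. precomputed Dice coefficients.
SIMILARITIES = {("jeep", "suv"): 0.21, ("jeep", "western"): 0.01}


def lookup_sim(a, b):
    """Symmetric lookup; unknown pairs count as unrelated."""
    return SIMILARITIES.get((a, b), SIMILARITIES.get((b, a), 0.0))


TRAINING = {"suv", "western"}
TEST_DOC = {"jeep": 2.0, "suv": 1.0, "saloon": 1.5}

lenient = map_unknown_concepts(TEST_DOC, TRAINING, lookup_sim, threshold=0.2)
strict = map_unknown_concepts(TEST_DOC, TRAINING, lookup_sim, threshold=0.5)
print(lenient, strict)
```

With the lenient threshold, "jeep" contributes 2.0 * 0.21 to "suv"; with the stricter one it is skipped, which trades coverage for not disturbing the feature correlations the SVM has learned.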

Summary
  Concept-based classification boosts classification results
    Detection of synonyms
    Incremental mapping of unknown concepts
  Structure-aware features offer a more precise document representation for XML
  Application area:
    Training on small, user-specific topic directories, e.g., bookmarks
    Classification of heterogeneous data sources

Future Work
  More robust term-to-sense mapping
    Improved disambiguation of word senses
    Better awareness of feature correlations (in incremental term-to-concept mapping)
  Topic-specific ontologies
    Is-instance-of relationships
  Integration into large web applications, e.g., focused crawling

Questions?
