Using idocument for Document Categorization in Nepomuk Social Semantic Desktop

Similar documents
Incorporating ontological background knowledge into Information Extraction

The Personal Knowledge Workbench of the NEPOMUK Semantic Desktop

Mymory: Enhancing a Semantic Wiki with Context Annotations

ConTag: A Semantic Tag Recommendation System

Using Attention and Context Information for Annotations in a Semantic Wiki

From Handwriting Recognition to Ontologie-Based Information Extraction of Handwritten Notes

Domain-specific Knowledge Management in a Semantic Desktop

Personalization in the EPOS project

Semantic Desktop 2.0: The Gnowsis Experience

GoNTogle: A Tool for Semantic Annotation and Search

Semantic Searching. John Winder CMSC 676 Spring 2015

The NEPOMUK project. Dr. Ansgar Bernardi DFKI GmbH Kaiserslautern, Germany

The NEPOMUK project. Dr. Ansgar Bernardi DFKI GmbH Kaiserslautern, Germany

Towards the Semantic Desktop. Dr. Øyvind Hanssen University Library of Tromsø

Semantic Annotation using Horizontal and Vertical Contexts

Annotation for the Semantic Web During Website Development

The NEPOMUK Project - On the Way to the Social Semantic Desktop

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

Annotation Component in KiWi

SemSearch 2008, CEUR Workshop Proceedings, ISSN , online at CEUR-WS.org/Vol-334/ QuiKey a Demo. Heiko Haller

Metadata and the Semantic Web and CREAM 0

Combining Fact and Document Retrieval with Spreading Activation for Semantic Desktop Search

Survey of Semantic Annotation Platforms

Motivating Ontology-Driven Information Extraction

An Adaptive Framework for Named Entity Combination

EFFICIENT INTEGRATION OF SEMANTIC TECHNOLOGIES FOR PROFESSIONAL IMAGE ANNOTATION AND SEARCH

The NEPOMUK Project - On the way to the Social Semantic Desktop

GoNTogle: A Tool for Semantic Annotation and Search

Semantic Document Architecture for Desktop Data Integration and Management

XETA: extensible metadata System

97 Information Technology with Audiovisual and Multimedia and National Libraries (part 2) No

An Annotation Tool for Semantic Documents

Kripke style Dynamic model for Web Annotation with Similarity and Reliability

Ontology-Based Information Extraction

On a Java based implementation of ontology evolution processes based on Natural Language Processing

Development of an Ontology-Based Portal for Digital Archive Services

Ontology-based Communication Architecture Within a Distributed Case-Based Retrieval System for Architectural Designs

Ontology Extraction from Heterogeneous Documents

A Context Model for Personal Knowledge Management

An Approach for Accessing Linked Open Data for Data Mining Purposes

Generation of Semantic Clouds Based on Linked Data for Efficient Multimedia Semantic Annotation

GIR experiements with Forostar at GeoCLEF 2007

Linked Data Querying through FCA-based Schema Indexing

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Domain-specific Concept-based Information Retrieval System

On the Design and Exploitation of Presentation Ontologies for Information Extraction

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE

Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources

Creating Large-scale Training and Test Corpora for Extracting Structured Data from the Web

3 Publishing Technique

Linking Semantic Desktop Data to the Web of Data

Towards an Integrated Information Framework for Service Technicians

Semantic Web Technologies Trends and Research in Ontology-based Systems

Probabilistic Information Integration and Retrieval in the Semantic Web

Converging Web and Desktop Data with Konduit

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet

SWSE: Objects before documents!

Semanta Semantic Made Easy

Requirements for Information Extraction for Knowledge Management

ResPubliQA 2010

Semantic Multimedia Information Retrieval Based on Contextual Descriptions

Ontology Extraction from Tables on the Web

GroupMe! - Where Semantic Web meets Web 2.0

Modelling Higher-Level Thought Structures Method & Tool 1

A service based on Linked Data to classify Web resources using a Knowledge Organisation System

FOAM Framework for Ontology Alignment and Mapping Results of the Ontology Alignment Evaluation Initiative

Ontology-based Web Information Extraction in Practice

CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Text Mining for Software Engineering

Outline. 1 Introduction. 2 Semantic Assistants: NLP Web Services. 3 NLP for the Masses: Desktop Plug-Ins. 4 Conclusions. Why?

Visual Concept Detection and Linked Open Data at the TIB AV- Portal. Felix Saurbier, Matthias Springstein Hamburg, November 6 SWIB 2017

Heading-Based Sectional Hierarchy Identification for HTML Documents

Actionable User Intentions for Real-Time Mobile Assistant Applications

Combining Ontology Mapping Methods Using Bayesian Networks

Semantic Web-Based Document: Editing and Browsing in AktiveDoc

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Generating and Managing Metadata for Web-Based Information Systems

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

Generating and Using Gaze-Based Document Annotations

Semantic Cloud Generation based on Linked Data for Efficient Semantic Annotation

DBpedia Spotlight at the MSM2013 Challenge

Parmenides. Semi-automatic. Ontology. construction and maintenance. Ontology. Document convertor/basic processing. Linguistic. Background knowledge

Precise Medication Extraction using Agile Text Mining

Semantic Web and Natural Language Processing

Modelling Structures in Data Mining Techniques

FCA-based Search for Duplicate objects in Ontologies

Advanced Management of Research Publications based on the Lightroom Paradigm

LODatio: A Schema-Based Retrieval System forlinkedopendataatweb-scale

Mobile Query Interfaces

Text Mining. Representation of Text Documents

STS Infrastructural considerations. Christian Chiarcos

NERD workshop. Luca ALMAnaCH - Inria Paris. Berlin, 18/09/2017

Enhanced retrieval using semantic technologies:

USE OF ONTOLOGIES IN INFORMATION EXTRACTION

Things to consider when using Semantics in your Information Management strategy. Toby Conrad Smartlogic

Automatic Extraction of Knowledge from Web Documents

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

NUS-I2R: Learning a Combined System for Entity Linking

Transcription:

Using idocument for Document Categorization in Nepomuk Social Semantic Desktop Benjamin Adrian, Martin Klinkigt,2 Heiko Maus, Andreas Dengel,2 ( Knowledge-Based Systems Group, Department of Computer Science University of Kaiserslautern, P.O. Box 3049, 67653 Kaiserslautern, Germany 2 German Research Center for Artificial Intelligence DFKI GmbH Trippstadter Straße 22, 67663 Kaiserslautern, Germany firstname.lastname@dfki.de) Abstract On the Semantic Desktop users maintain their model of the world in a formal personal information model ontology. Concepts from this ontology are used to annotate documents from desktop, allowing efficient navigation and browsing of these. However, the mental overhead required for correctly classifying new incoming document is substantial. We present the integration of the ontology-based information extraction system idocument into the Nepomuk Semantic Desktop for classifying documents within the personal information model. A comparison is done between idocument and the original classification system Structure Recommender. It is based on real models and documents from five Nepomuk users. Results reveal evidences that idocument s categorization proposals are rated with higher recall and precision values and show that idocument s result ranking corresponds to user ratings. Key Words: semantic desktop, ontology-based information extraction, text classification, personal information model Category: H.3., I.2.7, M.0, M.5, M.8 Introduction On the Semantic Desktop users maintain their model of the world in a formal personal information model ontology (PIMO, cf. [SvED07]). Concepts from PIMO are used to tag documents from desktop, allowing efficient navigation and browsing of these. The Nepomuk project [GHM + 07] provided a Semantic Desktop implementation based on Semantic Web techniques (RDF/S). As the mental overhead required for correctly classifying new incoming document is substantial, Nepomuk offers a component called Drop-box. Users simply drop document from their desktop into the Drop-box and get recommended instances from PIMO as possible classification candidates. The current service generating these recommendations is Structure Recommender (StrucRec, cf. [TSJB08]). For each document, StrucRec generates a flat set of instance candidates. Unfortunately, these results are neither weighted nor ordered. Users cannot change the classification behavior to for example restrict recommendations on specific parts of their PIMO (e.g., just use instance about projects for classification purpose).

Therefore, we integrate the ontology-based information extraction (OBIE) system idocument [AMD09] into Nepomuk. idocument generates weighted document classification proposals. It also uses extraction templates that may be written in the RDF query language SPARQL to define relevant patterns of PIMO instances for categorization purpose. In order to get evidence about the quality of idocument s functionalities, we compare both systems with data from five mid-term to long-term Nepomuk users. Each user provided ten manually classified documents and rated the quality of recommended classification proposals from both systems. The rest of this paper is structured as follows. At first, an overview of related and previous work is given. Nepomuk and its existing recommendation facilities are illustrated in Section 3. idocument is explained in Section 4. Next section describes the comparison between idocument and StrucRec. Finally, we conclude comparison and evaluation results and provide an outlook of future work. 2 Related Work Ontology-based information extraction (OBIE) systems use ontologies and incorporated instance knowledge to extract information from unstructured text. Many OBIE approaches just extend standard IE systems as done in S-Cream [HSC02] or SOBA [BCR06]. Other approaches use ontologies for extraction purpose directly. Labsky et al. use specialized forms of extraction ontologies [LSN08]. GATE has been extended with ontology gazetteers for instance recognition tasks [BTMC04]. Compared to these, idocument bases on GATE but provides additional extraction templates. These templates define patterns that describe relevant types of instances that should be extracted from text. As idocument expects ontologies written in RDFS, templates are specified in SPARQL. Inside the Gnowsis Semantic Desktop [SGK + 06], the generation of tag recommendations was done with a system called ConTag [ASRB07]. ConTag used external Web Services for extracting named entities from documents. Privacy is an important issue, thus idocument does not query external web services. 3 Document classification in Nepomuk Nepomuk [GHM + 07] provides a document classification component called Drop Box. Tag recommendations are generated each time the user drops a document into the Drop Box. Users may accept recommended instances as tags by clicking on them. In Nepomuk, StrucRec suggests PIMO things as tags (or document categories). StrucRec s 2 approach uses labels from PIMO instances matching see http://www.w3.org/tr/rdf-sparql-query 2 please inspect http://www.alphaworks.ibm.com/tech/galaxy for more information

them as parts of text passages. Next, it performs a spreading activation inside the PIMO for instance disambiguation purpose. Disadvantages of StrucRec are that it does not weight its results with confidence ratios and thus does not provide any ranking facilities. It is not user adaptable as users may not select specific parts of their PIMO for document classification purpose. 4 The OBIE system idocument idocument takes an RDFS domain ontology, an extraction template in SPARQL (e.g., SELECT * WHERE {?p rdf:type foaf:person; foaf:member?o.?o rdf:type foaf:organization} ), and a document as input and finally returns an RDF model with multiple named graphs. Each graph is a possible result (called scenario) for the given template. idocument follows a pipeline of six extraction tasks (cf. [AMD09]). (i) Normalization extracts plain text and existing meta data from an underlying text document. (ii) Segmentation partitions incoming text to segmental units i.e., paragraphs, sentences, and tokens. (iii) Symbolization recognizes matches between phrases in text and literal values of data type properties of the domain ontology. Successful matches are called symbols. (iv) Instantiation resolves recognized symbols with candidates for possible instances. (v) Contextualization resolves recognized instances, recognized object properties, and existing fact knowledge for creating fact candidates that are valid for populating a template. (vi) Population populates extraction templates with multiple variants (called scenarios) of extracted facts. Each scenario, instance, and fact is weighted with a confidence value. 5 Evaluation The comparison between idocument and StrucRec was done on real PIMO data. Five Nepomuk users agreed to provide their PIMO model and ten already categorized documents. For each model, we logged statistics about the amount of overall PIMO instances. In addition, we marked those instances that were used as tags for each of the ten selected documents. These tag relations were deleted from the evaluation models. Based on these evaluation models, StrucRec and idocument generated classification proposals. The Nepomuk users received two spreadsheets with results from idocument and StrucRec about their documents. They did not know which of the recommendation system created which result. They rated each proposals by choosing one of the following three values: - The instance as such is invalid and should be deleted from PIMO. Some users had applied Nepomuk components that auto inserted noisy instances into their PIMO. Instances labeled with this value were ignored in this evaluation.

5 Nepomuk User 4 3 2 idocument StrucRec 0,00% 20,00% 40,00% 60,00% 80,00% 00,00% relative covering of all recommended instances 00,00% uncertain bucket Precision 80,00% 60,00% 40,00% 20,00% 0,00% 2 3 4 5 Nepomuk User (weight>=0.0) unrealistic bucket (weight>=0.25) realistic bucket (weight>=0.5) certain bucket (weight>=0.75) Figure : Upper chart: Relative amount of generated recommendations. Lower chart: Precision ratios about four ranges of idocument s ranked result list. 0 The instance is not a valid category for this document. + The instance is a valid category for this document. Users could also use this values for instances they previously did not use for categorizing documents. If users labeled identical categories different for the same document, these categories were ignored in the evaluation. The following table shows statistics about the amount of instances each of the five users maintained in his or her PIMO. Nepomuk User # Instances User Type 43 (long-term user) 2 89 (mid-term user) 3 926 (long-term user) 4 820 (long-term user) 5 649 (mid-term user) The upper chart in Figure shows the relative covering of recommended instances of both systems for each user. We observed that both systems generated 50% identical recommendations in general. In most cases, StrucRec generated more recommendations than idocument.

Precision 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0,2 0, 3 3 2 4 4 0 0, 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 Recall 2 5 5 StrucRec idocument Figure 2: Recall and precision ratios The calculation of recall was complicated as users did not know if they used all relevant PIMO instances for categorizing their documents. Thus, we could not assume users to rate over 000 PIMO instances for each document. Therefore, we took those instances that were manually taken for categorization purpose as base line for relevant categories. We used this base line for estimating recall values about results from idocument and StrucRec. Precision values were weighted by the amount of recommended instances each system made for each document. The distribution of precision ratios in Figure 2 shows that idocument beats StrucRec for at least 4% except for User 3. Here StrucRec s precision of results was 4% higher than idocument s. Analyzing recall values in Figure 2 shows that idocument beats StrucRec for at least 0% except for User 3. Here StrucRec s recall was 5% better than idocument s. In contrast to StrucRec, idocument generates confidence values for each recommendation (cf. [AD08]). We separated idocument s recommendations into four buckets with thresholds in steps of 0.25 from 0.0 to.0. Each bucket contained recommended instances if the confidence was greater than the buckets threshold. Then we analyses precision ratios for each bucket. The lower chart in Figure reveals that four users accepted more recommendations in buckets with higher thresholds. This confirms the quality of idocument s result ranking. 6 Conclusion and Outlook The comparison between StrucRec and idocument yields that idocument generates better recall and precision results than StrucRec for four of five users.

The user s PIMO model (User 3), where idocument produced worse results than StrucRec, contained a huge amount of auto generated instances from other Nepomuk components. It was the largest PIMO model (926 instances) and both systems result were rated with poor recall values below 30%. The ranked result list of idocument provided relevant results with high confidence weights for four of five users also. The PIMO model (User 4), where idocument s result ranking did not fit, was the smallest of all (89 instances). As both systems generated about 50% identical recommendations, it is recommended to use both systems in Nepomuk. idocument was executed with a standard IE template. In future work, user adaptable extraction templates are going to be evaluated. This work was financed by the BMBF project Perspecting (Grant 0IW08002). References [AD08] [AMD09] Adrian, B. and Dengel, A.: Believing finite-state cascades in knowledgebased information extraction; In Proc. of KI 2008: Advances in Artificial Intelligence, LNCS, pages 52 59. Springer, 2008. Adrian, B., Maus, H., and Dengel, A.: The Ontology-based Information Extractions System idocument; Poster at KM, 2009. [ASRB07] Adrian, B., Sauermann, L., and Roth-Berghofer, T.: ConTag: A semantic tag recommendation system; In Proc. of I-MEDIA 07 and I- SEMANTICS 07, pages 297 304. JUCS, 2007. [GHM + 07] [HSC02] [LSN08] [SGK + 06] [SvED07] [BCR06] Buitelaar, P., Cimiano, P., and Racioppa, S.: Ontology-based Information Extraction with SOBA; In Proc. of LREC, pages 232 2324. ELRA, 2006. [BTMC04] Bontcheva, K., Tablan, V., Maynard, D., and Cunningham, H.: Evolving GATE to Meet New Challenges in Language Engineering; Natural Language Engineering, 0(3/4):349 373, 2004. Groza, T., Handschuh, S., Moeller, K., Grimnes, G., Sauermann, L., Minack, E., Mesnage, C., Jazayeri, M., Reif, G., and Gudjonsdottir, R.: The NEPOMUK Project - On the way to the Social Semantic Desktop; In Proc. of I-MEDIA 07 and I-SEMANTICS 07, pages 20 2. JUCS, 2007. Handschuh, S., Staab, S., and Ciravegna, F.: S-CREAM Semi-automatic CREAtion of Metadata; In Proc. of SAAKM 2002 -Semantic Authoring, Annotation and Knowledge Markup, ECAI Workshop, pages 27 34, 2002. Labsky, M., Svatek, V., and Nekvasil, M.: Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation; In Proc. of OBIES 2008, volume CEUR-WS 400, 2008. Sauermann, L., Grimnes, G., Kiesel, M., Fluit, C., Maus, H., Heim, D., Nadeem, D., Horak, B., and Dengel, A.: Semantic Desktop 2.0: The Gnowsis Experience; In Proc. of ISWC, LNCS, pages 887 900. Springer, 2006. Sauermann, L., van Elst, L., and Dengel, A.: PIMO - a Framework for Representing Personal Information Models; In Proc. of I-MEDIA 07 and I-SEMANTICS 07, pages 270 277. JUCS, 2007. [TSJB08] Troussov, A., Sogrin, M., Judge, J., and Botvich, D.: Mining sociosemantic networks using spreading activation technique; In Proc. of I- MEDIA 08 and I-KNOW 08. JUCS, 2008.