Instance-based matching in COMA++

Similar documents
Innovation Lab at Univ. of Leipzig on semantic Web Data Integration Funded by BMBF (German ministry for research and education)

SEBD 2011, Maratea, Italy June 28, 2011 Database Group 2011

Instance-based matching of hierarchical ontologies

COMA A system for flexible combination of schema matching approaches. Hong-Hai Do, Erhard Rahm University of Leipzig, Germany dbs.uni-leipzig.

Mapping Composition for Matching Large Life Science Ontologies

On Matching Large Life Science Ontologies in Parallel

Effective Mapping Composition for Biomedical Ontologies

Outline A Survey of Approaches to Automatic Schema Matching. Outline. What is Schema Matching? An Example. Another Example

A method for recommending ontology alignment strategies

Matching Anatomy Ontologies

A Session-based Ontology Alignment Approach for Aligning Large Ontologies

XML Schema Matching Using Structural Information

Data Integration in Bioinformatics and Life Sciences

What you have learned so far. Interoperability. Ontology heterogeneity. Being serious about the semantic web

Matching Large XML Schemas

A Comparative Analysis of Ontology and Schema Matching Systems

Semantic Interoperability. Being serious about the Semantic Web

RiMOM Results for OAEI 2008

PRIOR System: Results for OAEI 2006

A Tool for Semi-Automated Semantic Schema Mapping: Design and Implementation

Alignment Results of SOBOM for OAEI 2009

Ontology alignment. Evaluation of ontology alignment strategies. Recommending ontology alignment. Using PRA in ontology alignment

A Generic Algorithm for Heterogeneous Schema Matching

Lily: Ontology Alignment Results for OAEI 2009

Cluster-based Similarity Aggregation for Ontology Matching

Chapter 27 Introduction to Information Retrieval and Web Search

XML Grammar Similarity: Breakthroughs and Limitations

RiMOM Results for OAEI 2010

Natasha Noy Stanford University USA

From Dynamic to Unbalanced Ontology Matching

University of Rome Tor Vergata GENOMA. GENeric Ontology Matching Architecture

Keywords Data alignment, Data annotation, Web database, Search Result Record

YAM++ Results for OAEI 2013

Target-driven merging of Taxonomies

Restricting the Overlap of Top-N Sets in Schema Matching

Semantic Web. Ontology Alignment. Morteza Amini. Sharif University of Technology Fall 95-96

Matching Schemas for Geographical Information Systems Using Semantic Information

Ontology matching using vector space

IBE101: Introduction to Information Architecture. Hans Fredrik Nordhaug 2008

ONTOLOGY MATCHING: A STATE-OF-THE-ART SURVEY

E-Fuice. Integration of E-Commerce Data

A Framework for Generic Integration of XML Sources. Wolfgang May Institut für Informatik Universität Freiburg Germany

Automatically Constructing a Directory of Molecular Biology Databases

SAMBO - A System for Aligning and Merging Biomedical Ontologies

The Results of Falcon-AO in the OAEI 2006 Campaign

A system for aligning taxonomies and debugging taxonomies and their alignments

Leveraging Data and Structure in Ontology Integration

3 Classifications of ontology matching techniques

InsMT / InsMTL Results for OAEI 2014 Instance Matching

A LEXICAL APPROACH FOR TAXONOMY MAPPING

Ontology Matching Techniques: a 3-Tier Classification Framework

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany

Semantic Web. Ontology Alignment. Morteza Amini. Sharif University of Technology Fall 94-95

A GML SCHEMA MAPPING APPROACH TO OVERCOME SEMANTIC HETEROGENEITY IN GIS

Evaluation of ontology matching

Evolution of the COMA Match System

A Unified Approach for Aligning Taxonomies and Debugging Taxonomies and Their Alignments

Building an effective and efficient background knowledge resource to enhance ontology matching

Searching the Deep Web

FOAM Framework for Ontology Alignment and Mapping Results of the Ontology Alignment Evaluation Initiative

DATA MINING - 1DL105, 1DL111

POMap results for OAEI 2017

The HMatch 2.0 Suite for Ontology Matchmaking

DATA MINING II - 1DL460. Spring 2014"

Searching the Deep Web

SciMiner User s Manual

CS570: Introduction to Data Mining

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

An Improving for Ranking Ontologies Based on the Structure and Semantics

Community-based ontology development, alignment, and evaluation. Natasha Noy Stanford Center for Biomedical Informatics Research Stanford University

Generic Model Management

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008

Computational Cost of Querying for Related Entities in Different Ontologies

BLOOMS on AgreementMaker: results for OAEI 2010

Optimizing Query results using Middle Layers Based on Concept Hierarchies

Supplementary Materials for. A gene ontology inferred from molecular networks

Outline. Data Integration. Entity Matching/Identification. Duplicate Detection. More Resources. Duplicates Detection in Database Integration

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

A Grid Middleware for Ontology Access

XBenchMatch: a Benchmark for XML Schema Matching Tools

Falcon-AO: Aligning Ontologies with Falcon

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Information Retrieval

Introduction to Web Clustering

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Flexible Integration of Molecular-Biological Annotation Data: The GenMapper Approach

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

DATA MINING II - 1DL460. Spring 2017

A Tagging Approach to Ontology Mapping

BioNav: An Ontology-Based Framework to Discover Semantic Links in the Cloud of Linked Data

Putting ontology alignment in context: Usage scenarios, deployment and evaluation in a library case

Database module of Vijjana, a Pragmatic Model for Collaborative, Self-organizing, Domain Centric Knowledge Networks

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Part I: Data Mining Foundations

Open Data Integration. Renée J. Miller

SmartMatcher How Examples and a Dedicated Mapping Language can Improve the Quality of Automatic Matching Approaches

Information Discovery, Extraction and Integration for the Hidden Web

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Ontology matching techniques: a 3-tier classification framework

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Transcription:

Erhard Rahm http://dbs.uni-leipzig.de June 3, 2009 Ontologies Ontology Matching Problem Match techniques and prototypes (e.g., GLUE) Instance-based matching in COMA++ Constraint- / Content-based Matching Matching web directories Matching by Instance overlap Similarity measures Evaluation: Product catalogs, biomedical ontologies Stability of ontology mappings Conclusions 2

Support a shared understanding of terms/concepts in a domain Annotation of data instances by terms/concepts of an ontology Semantically organize information of a domain Find data instances based on concepts (queries, navigation) Support data integration e.g. by mapping data sources to shared ontology Sample ontologies Product catalogs of companies, e.g. online shops Web directories Biomedical ontologies 3 Hierarchical categorization of products Instances: product descriptions Often very large: ten thousands categories, millions of products 4

Categorization of websites Instances: website descriptions (URL, name, content description) Manual vs. automated category assignment of instances General lists or specialized (per region, topic, etc.), e.g. Yahoo! Directory Dmoz Open Directory Project (ODP) Google Directory based on Dmoz Business.com Vfunk: Global Dance Music Directory 5 Many ontologies for different disciplines, e.g. Molecular Biology, Anatomy, Health etc. Largest ontologies (> 10,000 concepts), e.g., Gene Ontology (GO), NCI Thesaurus Ontologies used to annotate genes and proteins Support for functional data analysis Instances: annotated objects; separate from ontology Protein SwissProt instance associations Molecular Function GO 6 Gene Entrez Genetic Disorders OMIM Biological Process GO

AgBase 7 Focus on practically used ontologies Ontology O consists of a set of concepts/categories interconnected by relationships (e.g. of type is-a or part-of part-of ) ). O is represented by a DAG and has a designated root concept. Concepts have attributes,, e.g. Id, Name, Description Concepts may have associated instances Ontologies may be versioned Instances May be managed together with ontology or independently d May be associated to several concepts May have heterogeneous schemas, even per concept 8

O1 O2 Matching Mapping: O1.e11, O2.e23, 0.87 O1.e13,O2.e27, 0.93 O1, O2 instances further input, e.g. dictionaries Process of identifying ing semantic correspondences between 2 ontologies Result: ontology mapping Mostly equivalence mappings: correspondences specify equivalent ontology concepts Variation of schema matching problem 9 Shopping.Yahoo.com Electronics DVD Recorder Beamer Digital Cameras Digital Photography Digital Cameras Amazon.com Electronics & Photo TV & Video DVD Recorder Projectors Camera & Photo Digital Cameras Ontology mappings useful for Improving query results, e.g. to find specific products Advanced (cross-site) product recommendations Automatic categorization of products in different catalogs Merging catalogs 10

Protein SwissProt Gene Entrez Molecular Function instance associations GO Genetic Disorders OMIM?? Biological Process GO Ontology mappings useful for Improved analysis Answering questions such as Which Molecular Functions are involved in which Biological Processes? Validation (curation) and recommendation of instance associations Ontology merge or curation, e.g. to reduce overlap between ontologies 11 High degree of semantic heterogeneity in independently developed ontologies Syntactic differences Different models and languages g Structural differences Different is-a and part-of hierarchies Overlapping categories Semantic differences Naming ambiguities and conflicts Modeling errors / inconsistencies Instance / content differences Different scope Heterogeneous instance representations Fully automatic, generic solutions? 12

Metadata-based Instance-based Reuse-oriented Linguistic Names Descriptions Element Structure Element Element Structure Types Keys Constraint- based Constraint- based Parents Children Leaves Linguistic IR (word frequencies, key terms) Constraint- based Value pattern and ranges Dictionaries Thesauri Previous match results Matcher combinations Hybrid matchers Composite matchers 13 * Rahm, E., P.A. Bernstein: A Survey of Approaches to Automatic Schema Matching. VLDB Journal 10(4), 2001 semantics of a category may be better expressed by the instances associated to category than by metadata (e.g. concept name, description) Categories with most similar instances should match Main problem: Availability of (shared/similar) instances for most/all concepts Common cases: O1? O2 O1? O2 ontology associations Common instances O1 instances? O2 instances 14 a) Common instances (separate from ontologies) Example: Documents/Objects annotated by O1, O2 terms / concepts b) Ontology-specific instances b1) with shared instances b2) without shared (but similar) instances

Many prototypes for schema or ontology matching * Instance-based schema matching (XML, relational) SEMINT LSD Clio imap Dumas Instance-based ontology matching (OWL) GLUE, U of Washington COMA++, U Leipzig (supports schema + ont. matching) FOAM / QOM, U Karlsruhe Sambo, Linköping U, Sweden Falcon-AO, South East U, China RiMOM, Tsinghua U, China 15 * Euzenat/Shvaiko: Ontology matching. Springer 2007 Mappings for O 1, Mappings for O 2 Relaxation Labeling Common Knowledge & Similarity Matrix Domain Constraints Similarity Estimator Similarity Function Joint Probability Distribution P(A,B), P(A, B) Meta Learner Base Learner Base Learner Taxonomy O1 (tree structure + data instances) Distribution Estimator Taxonomy O2 (tree structure + data instances) 16 Use of machine learning to find ontology mappings Base learners use concept names + data instances (description) Similarity measures computed from joint probability distribution of concepts Evaluation on comparatively small ontologies: 3 match tasks, per ontology: 34-331 concepts, 6-30 non-leaf concepts, 1500-14000 instances, 34-236 correspondences * Doan, AH; et al: Learning to Match Ontologies on the Semantic Web. VLDB Journal, 12(4):303-319, 2003

A,S Concept A Concept S A, A S A, S Hypothetical universe of all examples A, S Sim(Concept A, Concept S) = [Jaccard] P(A S) P(A S) = P(A,S) P(A, S) + P(A,S) + P( A,S) Joint Probability Distribution: P(A,S),P( A,S),P(A, S),P( A, S) different similarity measures usable based on JPD 17 Mutual use of trained classifiers to determine instance-concept concept associations (requires no shared but only similar instances) A, S A Taxonomy 1 A,S A, S Taxonomy 2 A, S S A S A,S A,S A, S A,S CL A A CL S A JPD estimated by counting the sizes of the partitions S S 18

System for aligning and merging biomedical ontologies Framework to find similar concepts in overlapping OWL ontologies for alignment and merge tasks Combined use of different matchers and auxiliary information Linguistic, structure-based, constraint-based Instance-based matching Based on texts t (e.g., papers) Two concepts are similar if a document describes both concepts description i logic reasoner checks results for ontology consistency and cycles 19 * Lambrix, P; Tan, H.: SAMBO A system for aligning and merging biomedical ontologies. Journal of Web Semantics, 4(3):196-206, 2006 Dataset extracted dfrom Google, Yahoo and dlooksmart web directory More than 4,500 simple node matching tasks, no instances Comparison of matching quality results (top-3 systems of each year ) In 2008 the systems together did not manage to discover 48% of the total number of positive correspondences 20 OAEI (Ontology Alignment Evaluation Initiative) Alignment Contest, http://oaei.ontologymatching.org

Ontologies Ontology Matching Problem Match techniques and prototypes (e.g., GLUE) Instance-based matching in COMA++ Constraint- / Content-based Matching Matching web directories Matching by Instance overlap Similarity measures Evaluation: Product catalogs, biomedical ontologies Stability of ontology mappings Conclusions 21 Extends previous COMA prototype (VLDB2002) Matching of XML & rel. Schemas and OWL ontologies Several match strategies: Parallel (composite) and sequential matching; Fragment-based matching for large schemas; Reuse of previous match results Model Pool Directed graphs S1 S2 Component identification {s 11, s 12,...} {s 21, s 22,...} Matcher execution Matcher 1 Matcher 2 Matcher 3 Match Strategy Similarity cube Similarity Combination s 11 s 21 s 12 s 22 s 13 s 23 Mapping Mapping Pool Mapping s 22 Import, Load, Save Model Manipulation Nodes,... Paths,... Component Types Name, Children, Leaves, NamePath, Matcher Library Aggregation, Direction, Selection, CombinedSim Combination Library Diff, Intersect, Union, MatchCompose, Eval,... Mapping Manipulation *Schema and Ontology Matching with COMA++. Proc. SIGMOD 2005

Repository (persistent) & Workspace (in memory) Current Mapping Domains Schemas/ Ontologies Mappings Schema/ mapping info Source Schema Target Schema 23 Configuration of matcher Configuration of match strategies Metadata-based Reuse-based Instance-based User-programmed 24

Source Schema Sh Intermediate t Schema Sh Target tsh Schema Mapping Excel < > Noris Mapping Noris < > Noris_Ver2 25 Instance matchers introduced in 2006 Constraint-based matching Content-based matching: 2 variations Coma++ maintains instance value set per element XML schema instances ( Ontology instances 26

Instance constraints are assigned to schema elements General constraints: always applicable Example: average length and used characters (letters, numeral, special char.) Numerical constraints: for numerical instance values Example: positive or negative, integer or float Pattern constraints: My@email.com vs. Example: Email and URL Your@email.org Use of constraint similarity matrix to determine element similarity il it (like data type matching) Simple and efficient approach Effectiveness depends on availability of constrained value ranges / pattern Approach does not require shared instances Determine Constraints Compare Constraints General constraints c 11 c 21 element 1 Numerical constraints c element 1h c 2k 2 Pattern constraints Ui Using Constraint it Similarity Matrix Similarity value 27 2 variations Value Matching: pairwise similarity comparison of instance values Document (value set) matching: combine all instances into a virtual document and compare documents Both approaches do not require shared instances Value matching Use any similarity measure for pairwise value comparison Aggregate individual similarity values (similarity matrix) into a combined concept similarity (e.g., based on Dice) Compare pair-wise Instance Values Aggregation element 1 element 2 instance 11 instance 21 instance 12 instance 21 instance 1n instancei 2m Similarity Matrix Similarity value 28

Document matching 1i instance document per category or selected string category attribute (e.g. description) Document comparison based on TF-IDF to focus on most significant terms Two options to deal with multiple string attributes All values for these attributes are handled d as one virtual document Independent matching per attribute and aggregation g of the similarity values element 1 element 2 Compare virtual documents document 1 document 2 instance 11 instance 21 instance 12 Similarity function instance 22 based on instance 1n TF-IDF instance 2m Similarity value 29 39 of 51 test cases based on instances 2966 correspondences in reference alignment 2400 2200 All Corresp. 2000 Correct Corresp.- 1800 1600 1400 1200 1000 800 600 400 200 0 Constraint- Content- NameType Content + Constraint+ Matching Matching NameType Content+NameType t+n F-Measure: 0.15 0.61 0.64 0.82 0.82 30

Instance-based matching between 4 web directories, limited to online shops Dmoz Google Web Yahoo #Categories 746 728 418 3,234 #Direct instances 15,304 15,082 13,673 34,949 Clothing Swimwear Dmoz Sports Water Sports Yahoo Sports Swimming and Diving Swimming and Diving URL =http://www.beachwear.net Name =The Beachwear Network URL Description =http://www.skinzwear.com/ =Selection of beachwear. Name =Skinz Deep URL Description =http://www.ritchieswimwear.com/ =Swimwear, bikinis and Name streetwear. = Ritchie Swimwear Description =Designer brand for women, men and little girls. Gear and Equipment Apparel URL =www.skinzwear.com Name =Skinz Deep, Inc. Description URL =www.ritchieswimwear.com =Bikinis, swimwear, beachwear, Name =Ritchie and streetwear Swimwear for men and women. Description =Offers bathing suits, beachware, and cover-ups for men, women, and children. Stores located throughout South Florida. 31 * Massmann, S., Rahm, E.: Evaluating. Evaluating Instance-based Matching of Web Directories. Proc. WebDB 2008 Instances are shop websites Instance-based matching on 3 attributes: shop URL, name, description Use of directly and indirectly associated instances URL matcher based an value matching After URL preprocessing, equal URLs are needed (same shops in different directories) to find matching categories goodbooks.net/index.asp Preprocessing http://www.goodbooks.net Preprocessing goodbooks.net GoodBooks.net Preprocessing Name matcher based on value matching Description matcher based on document matching Name / description matching do not need shared instances 32

Six match tasks six reference mappings (manually created) Dmoz Dmoz Dmoz Google Google Web Google Web Yahoo Web Yahoo Yahoo # Corresp 729 218 436 211 416 235 2245 33 Combination of three instance-based matchers (URL, name, description) and six metadata-based matchers minimum and maximum values for the six match tasks s best single metadata-based matche best single instance-based matcher Combination: all 3 instance-based and 3 metadata-based matchers (Path, Name, Parent), average Fmeasure: 0.79 34

Ontologies Ontology Matching Problem Match techniques and prototypes (e.g., GLUE) Instance-based matching in COMA++ Constraint- / Content-based Matching Matching web directories Matching by Instance overlap Similarity measures Evaluation: Product catalogs, biomedical ontologies Stability of ontology mappings Conclusions 35 Use of instance overlap for ontology matching: two concepts are related / similar il if they share a significant number of associated objects Different measures to determine the instance-based similarity Base-K; Dice, Min, Jaccard Extensions: Consideration of indirect instance associations Combination with other match approaches Consideration of similar il (but non-identical) objects * Thor, A; Kirsten, T; Rahm, E.: Instance-based matching of hierarchical ontologies. Proc. BTW, 2007 Kirsten, T, Thor, A; Rahm, E.: Instance-based matching of large life science ontologies. Proc. DILS, 2007 36

A m a z o n S o f t u n i t y by Brands by Category Software Microsoft Novell Books DVD Software Languages Utilities & Tools Traveling Kids & Home Business & Productivity it Operating System Burning Software Operating System Handheld Software Windows Linux Id =158298302X EAN = "662644467122" Title = "SuSE Linux 10.1 (DVD)" Price = 49.99 Id = B0002423YK Ranking = 180 EAN = 0805529832282 Title = "Windows XP Home Edition incl. SP2" Price = 191.91 Ranking = 47 Id = ECD435127K EAN = 0662644467122 ProductName = "SuSE Linux 10.1" DateOfIssue = 02.06.2006 Price = Id 59.95 = 95 ECD851350K EAN = 0805529832282 ProductName = "WindowsXP Home" DateOfIssue = 15.10.2004 Price = 238.90 37 Baseline similarity Sim BaseK Example: c 1 O 1 c 2 O 2 1, if Nc c >= K 1 2 SimBaseK ( c1, c2) = 0, if Nc c < K 1 2 Dice similarity Sim Dice 2 Nc 1c SimDice ( c1, c2) = N + N Minimum similarity Sim Min Nc 1c2 Sim Min( c1, c2) = min( N, N c 1 c 1 2 c 2 c 2 ) 4 =2 3 I Sim Base1 = Sim Base2 = 1, Sim Base3 = 0 Sim Dice = 2*2/(4+3) = 0.57 Sim Min = 2/3 = 0.67 0 Sim Dice Sim Min Sim Base1 1 38

Computation of precision & recall needs a perfect mapping (reference alignment) Laborious for large ontologies Might not be well-defined Syntactic measures to approximate recall / precision Match coverage: fraction of matched categories MatchCoverage O CO 1 = C Match 1 O [0...1] CO InstMatchCoverage = C Combined O + C 1 Match O2 Inst + C 1 1 2 Match ratio: #correspondences per matched concept Goal: high Match Coverage with low Match Ratio Match O Inst CorrO 1 O 2 2 CorrO 1 O 2 MatchRatioO 1 = 1 CombinedMatchRatio = 1 C C + C O1 Match O1 Match O2 Match 39 Amazon (AM) vs. Softunity (SU) Baseline1: max. Match Coverage, high Match Ratios Sim Min : good Match Coverage, moderate Match Ratios Sim Dice : low Match Coverage, low Match Ratios SU AM # concepts (product categories) 470 1,856 # concepts having instances 170 1,723 # instances (products) 2,576 18,024 # direct associations 2,576 25,448 # associations / # instances 1 1.4 # Instances / #concepts 15 15 Corr O O1O2 1 O 2 Ratio O1 Ratio O2 SU 711 AM 5.4 2.1 SU 335 AM 2.7 1.1 SU 71 AM 1.2 1.0 Coverage O1O2 100% 80% 27% I 1 I 2 InstO1O2 SU 1872 Baseline1 AM SU AM 849 Min (100%) SU AM 535 Dice (50%) 40

Ontologies 3 subontologies of GeneOntology Genetic disorders of OMIM Instances: Ensembl proteins of 3 species, i.e. homo sapiens, mouse, rat Homo Sapiens Mus Musculus Only subset of concepts has associated instances 288 110 3,018 2,810 288 201 110 Molecular Function Biological Process Cellular Component Genetic Disorder 2,709 77 2,452 133 47 41 Ensembl Proteins of different species Rattus Norvegicus Number of associated Biological Processes (total # processes: 12,555) Sim Base : high Match Coverage (99%) w.r.t. concepts having instances, very high Match Ratios Sim Dice : low Coverage (< 20%) and low Match Ratios Sim Min : good Coverage (60%-80%) with moderate Match Ratios Match Coverage Human Mouse Rat 1,0 Match Ratios per ontology 0,8 MF - BP 0,6 MF BP 0,4 0,2 Base 20.4 17.0 0,0 Min 4.4 4.0 Dice 1.3 1.2 SimB Base Sim mmin Sim Dice SimKa appa SimB Base Sim mmin Dice Sim SimKa appa SimB Base Sim mmin Dice Sim MF - BP MF - CC BP - CC SimKa appa (Match Ratios for Homo Sapiens, MF-BP task) 42

Simple matcher on concept names Relatively low Match Coverage (however wrt w.r.t. all concepts including instance-free concepts) No correspondences for similarity 0.9 Low similarity thresholds (e.g.< 0.6) too imprecise 43 er ontology Coverage p Match 0,50 0,45 0,40 0,35 0,30 0,25 0,20 0,15 0,10 0,05 0,00 0,5 0,6 0,7 0,8 MF BP MF CC BP CC MF - BP MF - CC BP - CC Match Coverage per ontology MF - BP MF BP 0.5 4.4 6.9 06 0.6 24 2.4 29 2.9 0.7 1.4 1.4 0.8 1.1 1.1 Match Ratios per ontology Combinations between instance- (Sim Min ) and metadata-based match approach Union: Increased Match Coverage an Match Ratios Intersection: Low Match Coverage (<1%) Low overlap between instance- and metadata- based mappings 1,00 0,5 0,6 0,7 0,8 Matc ch Coverage pe er Ontology 0,80 0,60 0,40 0,20 0,00 Match Coverage per ontology for combined mappings MF BP MF CC BP CC MF - BP MF - CC BP - CC Match Ratios per ontology (Name threshold 0.7) MF MF - BP BP 4.1 3.7 1.0 1.0 (Sim Min = 1.0, Homo Sapiens) 44

Automatically vs. manually assigned annotations Example: Annotations in Ensembl (July 2008) 46,704 proteins MF BP Automatically assigned 82466 82% 57824 72% Manually assigned 17729 18% 22951 28% Sum 100195 80775 Ontology mappings for Base3,Min Restriction to manual annotations returns small mappings of likely improved quality all man Corr BP_MF C BP C MF Base3 21386 1939 1393 Min Base3 3275 1107 1107 Base3 3835 899 533 Min Base3 758 435 285 MC BP MC MF MR BP MR MF 45 all man Base3 0,13 0,17 11,0 15,4 Min Base3 008 0,08 013 0,13 30 3,0 30 3,0 Base3 0,06 0,06 4,3 7,2 Min Base3 0,03 0,03 1,7 2,7 Ontologies Ontology Matching Problem Match techniques and prototypes (e.g., GLUE) Instance-based matching in COMA++ Constraint- / Content-based Matching Matching web directories Matching by Instance overlap Similarity measures Evaluation: Product catalogs, biomedical ontologies Stability of ontology mappings Conclusions 46

Continous evolution of ontologies (many versions) Evolution analysis of 16 life science ontologies: Average of 60% growth in last four years Deletes and changes also common Ontology size C start C last grow C, start, last NCI Thesaurus 35,814 63,924 1.78 GeneOntology 17,368 25,995 1.50 -- Biological Process 8,625 15,001 1.74 large -- Molecular Function 7,336 8,818 1.20 -- Cellular Components 1,407 2,176 1.55 ChemicalEntities 10,236 18,007 1.76 www.izbi.de/onex 47 Full period (May. 04 - Feb. 08) Last year (Feb. 07 - Feb. 08) Ontology Add Del Obs adr add-frac del-frac obs-frac Add Del Obs #monthly NCI Thesaurus 627 2 12 42.4 1.3% 0.0% 0.0% 416 0 5 changes: GeneOntology 200 12 4 12.2 0.9% 0.1% 0.0% 222 20 5 -- Biological Process 146 7 2 16.2 1.2% 0.1% 0.0% 133 10 2 -- Molecular Function 36 3 2 6.8 0.4% 0.0% 0.0% 69 7 3 -- Cellular Components 18 2 0 8.9 1.0% 0.1% 0.0% 19 3 0 ChemicalEntities 256 62 0 4.1 1.8% 0.5% 0.0% 384 67 0 Hartung, M; Kirsten, T; Rahm, E.: Analyzing the Evolution of Life Science Ontologies and Mappings. Proc. 5 th Intl. Workshop on Data Integration in the Life Sciences (DILS), 2008 High change rates in Ontologies Instances Annotations (instance-concept associations) Ontology mappings (between versions of two ontologies) also change frequently, especially for instance-based match approaches correspondences may disappear in newer mapping versions Consideration of instance overlap or metadatabases similarity may not be sufficient for determining good ontology mappings 48

Standard match approaches only consider information about current ontology versions and ignore evolution history Is the black correspondence as good as the red one? Possible instabilities of match correspondences due to evolution of ontologies and/or related data source Idea: Consider the evolution of a match correspondence to assess its stability/quality in the current version 49 * Thor, A; Hartung, M; Gross, A; Kirsten, T; Rahm, E.:An Evolution-based Approach for Assessing Ontology Mappings - A Case Study in the Life Sciences. Proc. BTW, 2009 Average Stability stabavg n 1 1 = k i= n ( a, b, m) 1 sim ( a, b, m) sim ( a, b, m) [0,1] n, k i+ 1 i k Weighted Maximum Stability Proximity of similarities in the last versions compared to the current version stabwm n ( a, b, m) sim ( a, b, m) simn n i, k ( a, b, m) = 1 max [0,1] i = 1... k i 1 ence ty 0,8 stabavg 6,5 stabwm 6,5 50 Correspond similarit 0,6 0,4 0,2 0 1 2 3 4 5 6 Version (a 1,b 1 ) 0.9 0.95 (a 2,b 2 ) 0.7 0.9 (a 3,b 3 ) 09 0.9 06 0.6

Setting Mapping GO Biological Processes to Molecular Functions Instance based matching (using Ensembl source) Result: 2,497 correspondences (Base3 Min 0.8) which existed in the last 5 versions Selection of correspondences based on similarity and stability accepted 55% 15% candidates 15% questionable 30% 51 52 Instance-based match approaches Important since instances reflect well semantics of categories Availability of usable instances may be restricted to subset of concepts (consideration of indirectly associated instances helpful) Need to be combined with metadata-based techniques Correct ontology mappings NOT limited to 1:1 correspondences High change rates for ontologies/instances may result in unstable ontology mappings Matching based on shared instances Different similarity measures to consider instance overlap Especially applicable in bioinformatics (frequent annotations) Instance-based matching in COMA++ 3 basic instance matchers (constraint-based, content-based) not requiring shared instances Flexible combination with many metadata-based d approaches and different match strategies

Evaluation and validation of large ontology mappings Combined study of ontology matching and instance (entity) matching Correspondences based on instance similarity not equality Entity matching utilizing category similarity Automatic instance categorization Scalable instance match approaches based on machine learning Ontology Evolution Ontology Merging 53 http://se-pubs.dbs.uni-leipzig.de 54

Aumüller, D., Do, H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. Proc. ACM SIGMOD, 2005 Hartung M, Kirsten T., Rahm E.: Analyzing the evolution of life science ontologies and Mappings. Proc. DILS 2008. Springer LNCS 5109 Kirsten T., Thor A., Rahm E.: Instance-based matching of large life science ontologies. Proc. DILS 2007. Springer LNCS 4544 Massmann, S.; Rahm, E.: Evaluating Instance-based matching of web directories. Proc. 11th Int. Workshop on the Web and Databases (WebDB), 2008 Rahm, E., Bernstein, P.: A survey of approaches to automatic schema matching. The VLDB Journal, 10(4): 334-350, 350 2001. Thor A., Hartung M., Groß A., Kirsten T., Rahm E.: An evolutionbased approach for assessing ontology mappings - A case study in the life sciences. Proc. 13 th German Database Conf. (BTW), 2009 Thor, A., Kirsten, T., Rahm, E.: Instance-based matching of hierarchical ontologies. Proc. 12 th German Database Conf. (BTW), 2007 55