Data Schema Integration

Similar documents
Introduction to Data Integration

Data Integration and Fusion using RDF

ORM Modeling Tips and Common Mistakes

Subset, Equality, and Exclusion Rules In ORM

Mandatory Roles. Dr. Mustafa Jarrar. Knowledge Engineering (SCOM7348) (Chapter 5) University of Birzeit

Quick Mathematical Background for Conceptual Modeling

Uniqueness and Identity Rules in ORM

Subset, Equality, and Exclusion Rules In ORM

Schema Equivalence and Optimization

Uniqueness and Identity Rules in ORM

Dr. Mustafa Jarrar. Knowledge Engineering (SCOM7348) (Chapter 4) University of Birzeit

Schema Equivalence and Optimization

Mustafa Jarrar: Lecture Notes on RDF Schema Birzeit University, Version 3. RDFS RDF Schema. Mustafa Jarrar. Birzeit University

Introduction to Web 2.0 Data Mashups

The Data Web and Linked Data.

Database Heterogeneity

Practical Database Design Methodology and Use of UML Diagrams Design & Analysis of Database Systems

Conceptual Design. The Entity-Relationship (ER) Model

Database Design with Entity Relationship Model

The Entity-Relationship Model. Steps in Database Design

Introduction to Conceptual Data Modeling

Conceptual Modeling in ER and UML

Course on Database Design Carlo Batini University of Milano Bicocca, Italy

2004 John Mylopoulos. The Entity-Relationship Model John Mylopoulos. The Entity-Relationship Model John Mylopoulos

RDF Graph Data Model

Data Warehouse and Data Mining

CMSC 424 Database design Lecture 3: Entity-Relationship Model. Book: Chap. 1 and 6. Mihai Pop

MERGING BUSINESS VOCABULARIES AND RULES

COMP Instructor: Dimitris Papadias WWW page:

MIT Database Management Systems Lesson 03: Entity Relationship Diagrams

Introduction to modeling


Data Management Lecture Outline 2 Part 2. Instructor: Trevor Nadeau

MIT Database Management Systems Lesson 01: Introduction

Database Management Systems MIT Lesson 01 - Introduction By S. Sabraz Nawaz

Entity-Relationship Model &

Database Design. 6-1 Artificial, Composite, and Secondary UIDs. Copyright 2015, Oracle and/or its affiliates. All rights reserved.

Part 9: More Design Techniques

Conceptual Data Modeling Concepts & Principles

Introduction to Data Management. Lecture #2 (Big Picture, Cont.) Instructor: Chen Li

Release Date: September, 2015 Updates:

Chapter 6: Entity-Relationship Model. The Next Step: Designing DB Schema. Identifying Entities and their Attributes. The E-R Model.

CS2 Current Technologies Lecture 5: Data Modelling and Design

Introduction to Data Management. Lecture #4 (E-R Relational Translation)

The Relational Model. Why Study the Relational Model? Relational Database: Definitions

Chapter 3. The Multidimensional Model: Basic Concepts. Introduction. The multidimensional model. The multidimensional model

FROM A RELATIONAL TO A MULTI-DIMENSIONAL DATA BASE

DATABASE MANAGEMENT SYSTEM COURSE CONTENT

DATABASE SCHEMA DESIGN ENTITY-RELATIONSHIP MODEL. CS121: Relational Databases Fall 2017 Lecture 14

Modeling Databases Using UML

Dr. Mustafa Jarrar. Chapter 4 Informed Searching. Sina Institute, University of Birzeit

Schema Integration Methodologies for Multidatabases and the Relational Integration Model - Candidacy document

The Next Step: Designing DB Schema. Chapter 6: Entity-Relationship Model. The E-R Model. Identifying Entities and their Attributes.

Specific Objectives Contents Teaching Hours 4 the basic concepts 1.1 Concepts of Relational Databases

The Relational Model

Text Mining for Software Engineering

Introduction to Databases

II. Data Models. Importance of Data Models. Entity Set (and its attributes) Data Modeling and Data Models. Data Model Basic Building Blocks

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #5: Entity/Relational Models---Part 1

Dr. Mustafa Jarrar. Chapter 4 Informed Searching. Artificial Intelligence. Sina Institute, University of Birzeit

CS 338 The Enhanced Entity-Relationship (EER) Model

1/24/2012. Chapter 7 Outline. Chapter 7 Outline (cont d.) CS 440: Database Management Systems

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Lecture 10 - Chapter 7 Entity Relationship Model

Introduction Data Integration Summary. Data Integration. COCS 6421 Advanced Database Systems. Przemyslaw Pawluk. CSE, York University.

Database Applications (15-415)

EECS 647: Introduction to Database Systems

CS 405G: Introduction to Database Systems

Handout 6 Logical design: Translating ER diagrams into SQL CREATE statements

XV. The Entity-Relationship Model

Introduction to Data Management. Lecture #5 Relational Model (Cont.) & E-Rà Relational Mapping

LAB 2 Notes. Conceptual Design ER. Logical DB Design (relational) Schema Refinement. Physical DD

Course on Database Design Carlo Batini University of Milano Bicocca

Introduction to Data Management. Lecture #4 (E-R à Relational Design)

GUJARAT TECHNOLOGICAL UNIVERSITY

Weak Entity Sets. A weak entity is an entity that cannot exist in a database unless another type of entity also exists in that database.

Database Management Systems MIT Introduction By S. Sabraz Nawaz

SOME TYPES AND USES OF DATA MODELS

Copyright 2004 Pearson Education, Inc.

Ontology-Based Schema Integration

CT13 DATABASE MANAGEMENT SYSTEMS DEC 2015

Credit where Credit is Due. Last Lecture. Goals for this Lecture

Bsc(Hons) Web Technologies. Examinations for / Semester 2

ORM and Description Logic. Dr. Mustafa Jarrar. STARLab, Vrije Universiteit Brussel, Introduction (Why this tutorial)

Database Management Systems

E-R Model. Hi! Here in this lecture we are going to discuss about the E-R Model.

CMPT 354 Database Systems. Simon Fraser University Fall Instructor: Oliver Schulte

Can you name one application that does not need any data? Can you name one application that does not need organized data?

Database Programming - Section 18. Instructor Guide

B.C.A DATA BASE MANAGEMENT SYSTEM MODULE SPECIFICATION SHEET. Course Outline

Introduction to Data Management. Lecture #2 (Big Picture, Cont.)

1. Inroduction to Data Mininig

Enhanced Entity- Relationship Models (EER)

Database Administration. Database Administration CSCU9Q5. The Data Dictionary. 31Q5/IT31 Database P&A November 7, Overview:

DISCUSSION 5min 2/24/2009. DTD to relational schema. Inlining. Basic inlining

EECS 647: Introduction to Database Systems

CISC 3140 (CIS 20.2) Design & Implementation of Software Application II

FINAL EXAM REVIEW. CS121: Introduction to Relational Database Systems Fall 2018 Lecture 27

The Relational Model. Outline. Why Study the Relational Model? Faloutsos SCS object-relational model

Data Modeling and the Entity-Relationship Model

Transcription:

Mustafa Jarrar Lecture Notes, Web Data Management (MCOM7348) University of Birzeit, Palestine 1 st Semester, 2013 Data Schema Integration Dr. Mustafa Jarrar University of Birzeit mjarrar@birzeit.edu www.jarrar.info 1

Watch this lecture and download the slides from http://jarrar-courses.blogspot.com/2013/11/web-data-management.html 2

Data Schema Integration: A simple example In ORM: Employee bornin/ City locatedin/ Region Integrated schema /WorksIn Organization locatedin/ Employee Municipality /WorksIn Organization Worker bornin/ City Schema 2 locatedin/ Region Organization locatedin / Schema 1 Schema 3 3

Data Schema Integration: A simple example In ER: Source: Carlo Batini Employee born City in Region works Integrated schema Organization in Employee works Empoloyee born City in Region Municipality Organization Schema 2 Organization in Schema 1 Schema 3 4

Challenges of Data Schema Integration Source: Carlo Batini Schema Integration has two major challenges: 1. Identification of all portions of schemas that pertain to the same concept, in such a way to unify such different representations in the global schema. 2. Identification, analysis and resolution of the different types of conflicts (heterogeneities) in different schemas. 5

Framework for Schema Integration Local Schemas Schemas Transformation Transformation Rules Schemas Matching Matching Rules Schemas Integration Integration Rules Integrated Schema and mappings Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.), The MIT Press, 2000 6

Framework for Schema Integration 0. Define the integration strategy If the number of local schemas to be integrated is large, the order of schema integration becomes important. Several strategies can be adopted. Input: n source schemas Output: n source schemas + integration strategies Method used: heuristics S1 S2 S3 S1 S2 S3 S4 S1 S2 S3 S4 IS One shot strategy - Efficient integration process - Many correspondences between concepts have to be considered together. IS 1 IS 2 IS Pair at a time strategy - Priority to most relevant and stable schemas. - The integration process is more efficient IS 1 IS 2 IS Balanced Strategy -e.g.: Production, Marketing, Sales. -To be preferred when the cohesion among schemas is high. 7

Framework for Schema Integration 1. Schema transformation (or Pre-integration) Input: n source schemas Output: n source schemas homogeneized Methods used: Model and Design Homogeneization Reduce model heterogeneities as much as possible to make the sources more suitable for integration. Goal: use a single, common data model and format. Source: Stefano Spaccapietra Transformation Integration Source DBs Homogeneized DBs DW 8

Schema Transformation Schema Transformation involves: Data model homogeneization Where all data sources are described using the same data model. Design homogeneization Enforce standard design rules to reduce the number of structural conflicts (e.g., Normalization: one fact in one place) Reverse Engineering Reverse engineer the schema from existing data (such as COBOL files, spreadsheets, legacy relational databases, legacy objectoriented databases). 9

Example of Design homogeneization (Normalization) ONE TABLE: R1 (#Student, Name, LastName, #Course, CourseName, Grade, Date) Dependencies: #Student à Name, LastName #Course à CourseName #Student #Course à Grade, Date) Normalized Into 3 Tables: One Fact In One Place: R11 (#Student, Name, LastName) R12 (#Course, CourseName) R13 (#Student, #Course, Grade, Date) 10

Example of Reverse Engineering Source: Stefano Spaccapietra 11

Schema Matching 2. Schema matching (Correspondences investigation) Input: n source schemas Output: n source schemas + correspondences Method used: techniques to discover correspondences Correspondences relate (schema) elements which describe the same phenomena of the real world. This step aims at finding and describing all semantic links between elements of the input schemas and the corresponding data. By doing so, one matches between the schemas to be integrated. This step fixes the conflicts found in the schema. 12

Semantics of Correspondences Source: Stefano Spaccapietra Correspondences relate (schema) elements which describe the same phenomena of the real world. 13

Asserting Correspondences Source: Stefano Spaccapietra Finding matching correspondences is done through the use of a rich language for expressing correspondences (matchings). Example: S1.Person S2.Person, With Corresponding Identifiers: Pin, With Corresponding Property: name 14

Automated Matching Fully automated matching is impossible, as a computer process can hardly make ultimate decisions about the semantics of data. But even partial assistance in discovering of correspondences (to be confirmed or guided by humans) is beneficial, due to the complexity of the task. All proposed methods rely on some similarity measures that try to evaluate the semantic distance between two descriptions. Some state of the art matching systems Cupid (Microsoft Research, USA) FOAM/QOM (University of Karlsruhe, Germany) OLA (INRIA Rhône-Alpes, France / University of Montreal, Canada) S-Match (University of Trento, Italy) many others 15

Examples of Correspondences Source: Stefano Spaccapietra 16

Examples of Correspondences Previous example Employee Municipality /WorksIn Organization Organization locatedin/ Schema 1 Schema 3 Worker bornin/ City locatedin/ Region Schema 2 17

Examples of Correspondences Source: Stefano Spaccapietra 18

Schema Integration & Mapping Generation 3. Schemas integration and mapping generation Input: n source schemas + correspondences Output: integrated schema + mapping rules btw the integrated schema and input source schemas Source: Carlo Batini Method used: New classification of conflicts + Conflict resolution transformations GOAL: Creating an Integrated Schema ( IS ) and the mappings to the local databases. 19

GAV and LAV Integration Research has identified two methods to set up mappings between the integrated schema and the input schemas: (1) GAV (Global As View): proposes to define the integrated schema as a view over input schemas. GAV is usually considered simpler and more efficient for processing queries on the integrated database, but is weaker in supporting evolution of the global system through addition of new sources. (2) LAV (Local As View): proposes to define the local schemas as views over the integrated schema. LAV generates issues of incomplete information, which adds complexity in handling global queries, but it better supports dynamic addition and removal of source. 20

Integration Process After we identified the correspondences (in the previous step), we now solve the conflicts: One can distinguish between four types of conflicts: Structural conflicts Classification conflicts Descriptive conflicts Fragmentation conflicts Examples of conflicts among related object types different classifications (sets of instances) different sets of properties different structures different coding schemes 21

Integration Rules Rules defining the strategy to solve conflicts Example rules: If an class corresponds to an attribute, keep the class If the population of a class is included in the population of another class, build an is-a hierarchy Integration rules depend on how you want the integrated schema to look like 22

Structural Conflicts Source: Stefano Spaccapietra Different schema element types, e.g.: class, attribute, relationship Library example: S1 : Book is a class S2 : books is an attribute of Author S1 Conflict resolution : Choose the less constraining structure Integrated Schema: Book is a class S2 23

Classification Conflicts Corresponding elements describe different sets of real world objects S1.Faculty CONTAINS S2.PhD-advisor Conflict Resolution: Generalization / Specialization hierarchy S1 Faculty Faculty S2 Phd-advisor Phd-advisor Merging Faculty 24

Descriptive Conflicts Corresponding types have different properties, or corresponding properties are described in different ways Object / Entity / Relationship type: Naming conflicts : synonyms Node, Extremity homonyms Highway (EU), Highway (USA) Composition conflicts : different attributes and methods Employee ( E#, name, address ) Employee ( E#, position, salary, department ) 25

Integration Methods: Manual Source: Stefano Spaccapietra First method: manual integration do it yourself a language mapping rules schemas integrated schema DBA Easy to implement, Flexible BUT time consuming for the DBA 26

Integration Methods: Semi-Automatic Second method : semi-automatic integration tell me about the problem, I will try to fix it correspondences schemas TOOL mapping rules integrated schema DBA Opens to visual CASE tools, integration servers BUT knowledge acquisition can be painful 27

References and Acknowledgement Carlo Batini: Course on Data Integration. BZU IT Summer School 2011. Stefano Spaccapietra: Information Integration. Presentation at the IFIP Academy. Porto Alegre. 2005. Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI International, Artificial Intelligence Center. Menlo Park, USA. 2009. Thanks to Anton Deik for helping me preparing this lecture 28