Simplifying Information Integration: Object-Based Flow-of-Mappings Framework for Integration

Similar documents
Model Management and Schema Mappings: Theory and Practice (Part II)

Bio/Ecosystem Informatics

From business need to implementation Design the right information solution

Topic 1, Volume A QUESTION NO: 1 In your ETL application design you have found several areas of common processing requirements in the mapping specific

Orchid: Integrating Schema Mapping and ETL

Jaql. Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata. IBM Almaden Research Center

A Examcollection.Premium.Exam.47q

Designing your BI Architecture

Introduction to Federation Server

C Exam Code: C Exam Name: IBM InfoSphere DataStage v9.1

Passit4sure.P questions

SERVICE-ORIENTED COMPUTING

Vendor: IBM. Exam Code: P Exam Name: IBM InfoSphere Information Server Technical Mastery Test v2. Version: Demo

Unifying Big Data Workloads in Apache Spark

2779 : Implementing a Microsoft SQL Server 2005 Database

A c t i v e w o r k s p a c e f o r e x t e r n a l d a t a a g g r e g a t i o n a n d S e a r c h. 1

Raising the Level of Development: Models, Architectures, Programs

Oracle Database: SQL and PL/SQL Fundamentals NEW

IBM WebSphere Studio Asset Analyzer, Version 5.1

Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP

Optimizer Challenges in a Multi-Tenant World

Tools to Develop New Linux Applications

P IBM. IBM InfoSphere Information Server Technical Mastery Test v2

Using MDM Big Data Relationship Management to Perform the Match Process for MDM Multidomain Edition

MetaMatrix Enterprise Data Services Platform

A Dream of Software Engineers -- Service Orientation and Cloud Computing

DB2 NoSQL Graph Store

Clio Grows Up: From Research Prototype to Industrial Tool

Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing

Guide to Managing Common Metadata

Design concepts for data-intensive applications

Luncheon Webinar Series June 3rd, Deep Dive MetaData Workbench Sponsored By:

Application Discovery and Enterprise Metadata Repository solution Questions PRIEVIEW COPY ONLY 1-1

RDP203 - Enhanced Support for SAP NetWeaver BW Powered by SAP HANA and Mixed Scenarios. October 2013

Pre-Discussion. XQuery: An XML Query Language. Outline. 1. The story, in brief is. Other query languages. XML vs. Relational Data

PASS4TEST. IT Certification Guaranteed, The Easy Way! We offer free update service for one year

Evaluation Guide for ASP.NET Web CMS and Experience Platforms

Introduction to Hive Cloudera, Inc.

SQL++ Query Language 9/9/2015. Semistructured Databases have Arrived. in the form of JSON. and its use in the FORWARD framework

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

Part II Black-Box Composition Systems 20. Finding UML Business Components in a Component-Based Development Process

Structural characterizations of schema mapping languages

COMP9321 Web Application Engineering

Oracle Enterprise Data Quality - Roadmap

Identity Synchronizer User administration in Domino controlled by Active Directory

CIS 550 Fall Final Examination. December 13, Name: Penn ID:

COMP9321 Web Application Engineering

Implementing the Army Net Centric Data Strategy in a Service Oriented Environment

STARCOUNTER. Technical Overview

Table of Contents Chapter 1 - Introduction Chapter 2 - Designing XML Data and Applications Chapter 3 - Designing and Managing XML Storage Objects

Chapter 1 GETTING STARTED. SYS-ED/ Computer Education Techniques, Inc.

Chapter 1: Introduction

Computation Independent Model (CIM): Platform Independent Model (PIM): Platform Specific Model (PSM): Implementation Specific Model (ISM):

Writing Queries Using Microsoft SQL Server 2008 Transact-SQL. Overview

Teiid Designer User Guide 7.5.0

Part V. Relational XQuery-Processing. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2007/08 297

Features and Requirements for an XML View Definition Language: Lessons from XML Information Mediation

Building JavaServer Faces Applications

DISCUSSION 5min 2/24/2009. DTD to relational schema. Inlining. Basic inlining

Using the Altova Tools with IBM DB2 purexml

BioFederator: A Data Federation System for Bioinformatics on the Web

2015 Ed-Fi Alliance Summit Austin Texas, October 12-14, It all adds up Ed-Fi Alliance

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop

Informatica Power Center 10.1 Developer Training

Hacking PostgreSQL Internals to Solve Data Access Problems

ASG WHITE PAPER DATA INTELLIGENCE. ASG s Enterprise Data Intelligence Solutions: Data Lineage Diving Deeper

Enterprise Data Catalog for Microsoft Azure Tutorial

Databases and Information Retrieval Integration TIETS42. Kostas Stefanidis Autumn 2016

Labelling & Classification using emerging protocols

challenges in domain-specific modeling raphaël mannadiar august 27, 2009

Build and Deploy Stored Procedures with IBM Data Studio

Call: Datastage 8.5 Course Content:35-40hours Course Outline

B. Assets are shared-by-copy by default; convert the library into *.jar and configure it as a shared library on the server runtime.

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Configuring the module for advanced queue integration

The interaction of theory and practice in database research

Part II Black-Box Composition Systems 10. Business Components in a Component-Based Development Process

Leveraging the Social Web for Situational Application Development and Business Mashups

20762B: DEVELOPING SQL DATABASES

Database Processing. Fundamentals, Design, and Implementation. Global Edition

BigInsights and Cognos Stefan Hubertus, Principal Solution Specialist Cognos Wilfried Hoge, IT Architect Big Data IBM Corporation

SOA Architect. Certification

Information empowerment for your evolving data ecosystem

Cisco Information Server 6.2

Bonus Content. Glossary

ISA Action 1.17: A Reusable INSPIRE Reference Platform (ARE3NA)

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

The Evolution of Big Data Platforms and Data Science

Microsoft. [MS20762]: Developing SQL Databases

Azure Data Lake Analytics Introduction for SQL Family. Julie

SOFTWARE ARCHITECTURE INTRODUCTION TO SOFTWARE ENGINEERING PHILIPPE LALANDA

Practical Model-Driven Development with the IBM Software Development Platform

<Insert Picture Here> Oracle SQL Developer Data Modeler 3.0: Technical Overview

IBM Rational Software Architect

Implementing a Microsoft SQL Server 2005 Database Course 2779: Three days; Instructor-Led

PROCE55 Mobile: Web API App. Web API.

Swimming in the Data Lake. Presented by Warner Chaves Moderated by Sander Stad

Announcements. Relational Model & Algebra. Example. Relational data model. Example. Schema versus instance. Lecture notes

Databricks, an Introduction

An UML-XML-RDB Model Mapping Solution for Facilitating Information Standardization and Sharing in Construction Industry

Transcription:

Simplifying Information Integration: Object-Based Flow-of-Mappings Framework for Integration Howard Ho IBM Almaden Research Center

Take Home Messages Clio (Schema Mapping for EII) has evolved to Clio 2010 (Flow of Mappings for ETL or Mashups) Raise the level of abstraction for data transformation (EII, ETL or mashups) to flow of mappings Compile flow of mappings for different runtime engines Unify Famous Objects in advance and incorporate into Clio 2010 to simplify integration tasks Need schema decomposition algorithms to do integration in Clio 2010 using UFOs 2

Outline Clio Evolution Clio 2010 Unified Famous Object (UFO) Schema Decomposition Joint work with Mauricio Hernandez, Lucian Popa, Ioana (Roxana) Stanoi Interns: Bogdan Alexe (UCSC), Michael Gubanov (UW), Jen- Wei Huang (NTU, Taiwan), Yansis Katsis (UCSD), Barna Saha (UMD) 3

Schema Mappings and Data Exchange User mapping Mapping Generation Source schema S conforms to Schema Mapping Target schema T conforms to data Data exchange process (or SQL/XQuery/XSLT/ ) Schema mappings are bridges for information integration Our research (Clio project) addressed two main problems: How to generate schema mappings and how to use them for data exchange? 4

Clio: From Mappings to Runtime Artifacts (Queries or Operators) sid name Doc 1 Student1 Student2, Doc 2 courseeval /Student1/* Key Course_2 = F_course(sid, name) Key value_key // Student_0 De-dup sid name course grade Course_2 = Pid_0 eid = Key Sk1( ) Key Pid_0 = F_course ( ) Key value_key value_key /Student2/* Pid_0 cname eid Nest-join to Eval_2 // // sid name Course_1 De-dup Group-by Key Course_2 Key value_key value_key Pid_0 concatenate /courseeval/* eval_key=eval_key Key value_key Student, Course, Eval eid = Key Sk2( ) Pid_0 cname eid Eval_2 De-dup XML relational sid name cname file to Eval_2 Key Pid_0 = F_course ( ) value_key relational XML... 5

Evolution of Clio Nine years of active research Created and established the research area of schema mappings and data exchange 14 patents and 22 database PIC papers, 7 papers invited to special issues of journals 5 of these papers have received 200+ citations each (254, 213, 198, 166, 143) 1 paper (composition) won 2004 Pat Goldberg Memorial Best Paper Award. Many invited talks, 4 keynotes, 2 tutorial The Clio system has evolved to be a design-time tool (based on mappings) that can represent multiple runtimes: Query execution platforms: SQL, SQL/XML, XSLT, XQuery Java Data Flow engines: ETL (DataStage and E2), mashups (DAMIA), now JAQL flows on Hadoop Transfer to IBM products DB2 8.1 SQL Assist DB2/II 8.2 XML Wrapper Generator (Chocolate) IBM Content Manager 8.3 (Cinnamon) Rational Data Architect v7 (Criollo) Information Server v8 (FastTrack and BackTrack) 6

Limitations of the Clio Mapping Paradigm (pre-2007) Based on XML/relational schemas Based on one, monolithic, source-to-target mapping Building and managing complex mappings becomes difficult. The idea of flow of transformations was missing. The mapping process lacked modularity and reusability Needed to be adaptive to more data models (Java objects, JSON) and runtimes (DAMIA, Hadoop -- Cloud Computing) Needed to raise the level of abstraction (business objects UFOs) 7

Clio 2010: Declarative Flows of Mappings Vision: Schema mappings are building blocks (steps) in a flow Clio 2010: extension of Clio where multiple mappings can be defined, loaded, reused and managed The graph of mappings is a declarative specification of the flow of data Additional Semantic Data Model: UFOs (semantic objects) Novel mapping technology in Clio 2010: automatic assembly of a graph of initial, uncorrelated mappings into larger, richer mappings Includes mapping composition, mapping merge (correlation) Correlation: Join and fuse data coming from the individual mappings Deployment: Richer flow of mappings can be compiled/optimized into a global execution plan Queries, transformation scripts, ETL flows, Map/Reduce (e.g., Hadoop), Mashups (e.g. DAMIA) 8

Clio 2010: Flows and UFOs Source Schemas Source to UFO mappings Metadata Repository UFO to Target mappings Target Schema Key points: modularity and simplicity higher-level of abstraction reusable pieces Legend: Schemas Mappings 9

UFOs IBM Research Clio 2010 Architecture GUI extensions XML Schema JSON Java Objects RDBs. Clio 2010 Core Data Model Adapters Flow Model: Mappings Operators Data models Metadata Repository Operators Operator Adapters Generation Core extensions DS ETL. DataStage/E2 JAQL (Hadoop) Mashups. 10

UFO Model Unified representation of famous objects Decompose complicated concepts into simple flat objects Predefined relationships between UFOs Make the schema mapping job easier for users UFO repository interface Allow users to browse, query and modify UFOs and mappings Repository Interface UFO Repository Domain mappings UFO Repository mappings Schemas Mapping Source Schema generated mappings Target Schema Schema Object * each data contains two extra fields: created user and URI 11

Schema Matching with UFOs Source Schema Advantages: Modularity and simplicity Reusable pieces Better match quality Easy validation and correction by user Target Schema UFO Repository 12

Schema Matching with UFOs A 2-step technique Step 1: decompose Source and Target Step 2: find paths between Source and Target through UFOs Currently we focus on Step 1 Schema Decomposition 13

Summary of Schema Decomposition Use a three-step approach for solving the schema decomposition problem using UFOs. Step 1: Filtering Filter out irrelevant famous objects based on high-level signatures of schema and UFOs Step 2: Similarity Score Computation Calculate actual similarities only for reduced search space Step 3: Obtain Maximum Coverage Compute coverage (which UFOs to use and where) 14

Goals: Clio 2010, UFOs and Schema Decomposition Establish Clio 2010 as useful design-time abstraction behind many runtime platforms that do data transformation ETL, Mashups, Cloud Computing Precisely define the UFO data model. Use Freebase, OAGIS, and industrial models (SAP) as sources of UFOs Use Schema decomposition to match schema and UFOs Goal in 2008 Demonstrate an integration scenario using flow of mappings based on UFOs Demonstrate Clio 2010 generating transformation code for different platforms E.g., JAQL queries for JSON data & Map/Reduce, DAMIA scripts, E2, DS ETL. Unified data integration environment Object based, common and open data model, mapping-based optimization, compiles to parallel engine via Map/Reduce 15