From Spreadsheets to Relational Databases and Back

Similar documents
Discovery-based Edit Assistance for Spreadsheets

Chapter 3. The Relational database design

Database Technologies. Madalina CROITORU IUT Montpellier

Normalization. { Ronak Panchal }

Database Normalization

Type-Safe Two-Level Data Transformation

MODEL-BASED PROGRAMMING ENVIRONMENTS FOR SPREADSHEETS

Model-based Programming Environments for Spreadsheets

Coupled Transformation of Schemas, Documents, Queries, and Constraints 1

CS317 File and Database Systems

Relational Design: Characteristics of Well-designed DB

Chapter 5. The Relational Data Model and Relational Database Constraints. Slide 5-١. Copyright 2007 Ramez Elmasri and Shamkant B.

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 5-1

Embedding, Evolution, and Validation of Model-Driven Spreadsheets

Chapter 4. The Relational Model

A Shortcut Fusion Rule for Circular Program Calculation

Model Inference for Spreadsheets

Introduction to Databases

The Relational Data Model and Relational Database Constraints

Transforming ER to Relational Schema

File Processing Approaches

CS403- Database Management Systems Solved Objective Midterm Papers For Preparation of Midterm Exam

Type-safe Evolution of Spreadsheets

Chapter 5. Relational Model Concepts 5/2/2008. Chapter Outline. Relational Database Constraints

Chapter 5. Relational Model Concepts 9/4/2012. Chapter Outline. The Relational Data Model and Relational Database Constraints

Lecture 03. Spring 2018 Borough of Manhattan Community College

CS403- Database Management Systems Solved MCQS From Midterm Papers. CS403- Database Management Systems MIDTERM EXAMINATION - Spring 2010

Unit I. Introduction to Database. Prepared By: Prof.Sushila Aghav. Ref.Database Concepts by Korth

DBMS Chapter Three IS304. Database Normalization-Comp.

Lecture 03. Fall 2017 Borough of Manhattan Community College

COSC344 Database Theory and Applications. σ a= c (P) Lecture 3 The Relational Data. Model. π A, COSC344 Lecture 3 1

Lecture5 Functional Dependencies and Normalization for Relational Databases

We shall represent a relation as a table with columns and rows. Each column of the table has a name, or attribute. Each row is called a tuple.

8) A top-to-bottom relationship among the items in a database is established by a

CSE 132A Database Systems Principles

Normalization. Murali Mani. What and Why Normalization? To remove potential redundancy in design

CS 377 Database Systems

Translation of ER-diagram into Relational Schema. Dr. Sunnie S. Chung CIS430/530

Copyright 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Slide 5-1

01/01/2017. Chapter 5: The Relational Data Model and Relational Database Constraints: Outline. Chapter 5: Relational Database Constraints

Relational Model and Relational Algebra. Rose-Hulman Institute of Technology Curt Clifton

Applied Databases. Sebastian Maneth. Lecture 5 ER Model, normal forms. University of Edinburgh - January 25 th, 2016

The Basic (Flat) Relational Model. Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

Unit 3 Fill Series, Functions, Sorting

Unit 3 Functions Review, Fill Series, Sorting, Merge & Center

Let s briefly review important EER inheritance concepts

Database Management

Automatically Inferring ClassSheet Models from Spreadsheets

A7-R3: INTRODUCTION TO DATABASE MANAGEMENT SYSTEMS

Lecture 5 Data Definition Language (DDL)

Functional Dependencies CS 1270

Chapter 6: Relational Database Design

CS317 File and Database Systems

FUNCTIONAL DEPENDENCIES

5 Normalization:Quality of relational designs

DATABASE MANAGEMENT SYSTEM SHORT QUESTIONS. QUESTION 1: What is database?

doc. RNDr. Tomáš Skopal, Ph.D. RNDr. Michal Kopecký, Ph.D.

3ISY402 DATABASE SYSTEMS

Functional Dependencies & Normalization for Relational DBs. Truong Tuan Anh CSE-HCMUT

Introduction to Relational Databases

Lecture 07. Spring 2018 Borough of Manhattan Community College

DB Creation with SQL DDL

Transformation of Structure-Shy Programs

Techno India Batanagar Computer Science and Engineering. Model Questions. Subject Name: Database Management System Subject Code: CS 601

Applied Databases. Sebastian Maneth. Lecture 5 ER Model, Normal Forms. University of Edinburgh - January 30 th, 2017

Brief History of SQL. Relational Database Management System. Popular Databases

Informal Design Guidelines for Relational Databases

Chapter 1: Introduction

Lecture 6 Structured Query Language (SQL)

Science of Computer Programming. Transformation of structure-shy programs with application to XPath queries and strategic functions

Functional Dependencies and Finding a Minimal Cover

Databases -Normalization I. (GF Royle, N Spadaccini ) Databases - Normalization I 1 / 24

A database consists of several tables (relations) AccountNum

Using Relational Databases for Digital Research

Standard Query Language. SQL: Data Definition Transparencies

CS317 - File and Database Systems

Normalization. VI. Normalization of Database Tables. Need for Normalization. Normalization Process. Review of Functional Dependence Concepts

Relational Database Systems Part 01. Karine Reis Ferreira

customer = (customer_id, _ customer_name, customer_street,

NORMAL FORMS. CS121: Relational Databases Fall 2017 Lecture 18

Database Management System 15

In This Lecture. Normalisation to BCNF. Lossless decomposition. Normalisation so Far. Relational algebra reminder: product

VU Mobile Powered by S NO Group All Rights Reserved S NO Group 2013

SQL DATA DEFINITION LANGUAGE

This lecture. Databases -Normalization I. Repeating Data. Redundancy. This lecture introduces normal forms, decomposition and normalization.

Fundamentals of Database Systems

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

Data about data is database Select correct option: True False Partially True None of the Above

CS275 Intro to Databases

The strategy for achieving a good design is to decompose a badly designed relation appropriately.

SQL DATA DEFINITION LANGUAGE

E(xtract) T(ransform) L(oad)

CS317 File and Database Systems

SQL DATA DEFINITION: KEY CONSTRAINTS. CS121: Relational Databases Fall 2017 Lecture 7

Supplier-Parts-DB SNUM SNAME STATUS CITY S1 Smith 20 London S2 Jones 10 Paris S3 Blake 30 Paris S4 Clark 20 London S5 Adams 30 Athens

Introduction to Data Management. Lecture #7 (Relational Design Theory)

From Murach Chap. 9, second half. Schema Refinement and Normal Forms

Normalisation Chapter2 Contents

3.1. Keys: Super Key, Candidate Key, Primary Key, Alternate Key, Foreign Key

Chapter 1 SQL and Data

Transcription:

From Spreadsheets to Relational Databases and Back Jácome Cunha João Saraiva Joost Visser Departamento de Informática Universidade do Minho Portugal SIG & CWI The Netherlands PEPM 09, January, 19-20

Motivation Property Renting System Each row represents a renting transaction It includes information about owners, clients, properties It also contains dates and prices (formulas) Row 3 says that john has rented a property owned by tony, with address 5 Novar Dr., price per day 70, between dates 1/7/00 and 8/31/01 and paid 25550 for it 1.1

Motivation Redundancy This unstructured model is valid and serves its purpose But, it contains data redundancy For example, the rent per day is repeated several times 1.2

Motivation Updating Problems As a result, updates can cause data inconsistence For example, updating the renting value of property pg4 must be performed in several places 1.3

Motivation Deleting Problems Deleting rows can be problematic, too For example, deleting the row 5 will remove all the information about property pg36 1.4

Motivation A Database Solution Client clientno VARCHAR(256) cname VARCHAR(256) P Renting clientno VARCHAR(256) propertyno VARCHAR(256) rentstart VARCHAR(256) rentfinish VARCHAR(256) P P F F P P Property propertyno VARCHAR(256) paddress VARCHAR(256) rentperday VARCHAR(256) oname VARCHAR(256) P F ownerno VARCHAR(256) P F total_rent VARCHAR(256) totaldays VARCHAR(256) Owner ownerno VARCHAR(256) P oname VARCHAR(256) U A normalized (3NF) data representation No redundancy No update/delete problems 1.5

Motivation A Database Solution A normalized (3NF) data representation No redundancy No update/delete problems: the price per day occurs only once Tables can be stored in different sheets 1.6

Motivation A Database Solution We can use information computed during normalization to produce a refactored spreadsheet that helps users to introduce correct data Spreadsheet warns the user when errors are introduced 1.6

Our Approach to fun s : S FDs from synthesize T i CFDs choose PKs S s CFDs T i to from spreadsheet schema (header) spreadsheet data (remaining rows) compound functional dependencies relational database schema referential integrity constraints data conversion from RDB to SS data conversion from SS to RDB Infer functional dependencies (Fun) Normalize (3NF) those functional dependencies (synthesize) Create a Relational Database (RDB) schema (3NF) Use data refinements to perform the data migration (to and from) HaExcel combines all these steps creating a complete system 2.0

Our Approach Calculate FDs A Functional Dependency (FD) denoted A B means that an element of A is uniquely associated with an element of B The Fun algorithm (Novelli et. al.) calculates the FDs from data For our example, the Haskell implementation returns: ownerno oname totaldays clientno, cname propertyno paddress, rentperday, ownerno, oname... Formulas also induce FDs: C 0 = f(x 0,..., X n ) induces X 0,..., X n C 0 2.1

Our Approach A Few Concepts Each tuple in a DB table is uniquely identified by a set of attributes called Primary Key (PK) There may be more then one set suitable for becoming PK. They are designated Candidate Keys (CK) First Normal Form (1NF) is respected if each element of all tuples are atomic values Second Normal Form (2NF) is respected if the 1NF is respected and its non-key attributes are not functionally dependent on part of the key Third Normal Form (3NF) is respected if the 2NF is respected and if the non-key attributes are only dependent on the key attributes 2.2

Our Approach Normalize FDs Synthesize algorithm (Maier) calculates 3NF set of FDs It returns a set of attributes with candidate keys It is necessary to choose one CK to be the primary key To produce a schema with the lossless decomposition property one must add a FD (Maier): {all atts } \ {atts rep formulas } {new att, atts rep formulas } For our example: clientno, propertyno, cname, paddress, rentstart, rentfinish, rentperday, ownerno, oname totaldays, total rent, new att 2.3

Our Approach Calculate a RDB Schema Fun produces too many, not necessary and redundant FDs Also, attributes representing columns with formulas can not be used in a PK We use Synthesize to produce a 3NF set of attributes/ck Its input is the result from Fun without the FDs with formulas in PKs and the FD that guarantees the losslees decomposition property We choose the smallest CK to be the PK 2.4

Our Approach For our example, owners ownerno oname clients clientno cname properties propertyno paddress, rentperday, oname renting clientno, propertyno, rentstart, rentfinish, ownerno total rent, totaldays 2.5

From a Spreadsheet to a Database Data Refinements We need to migrate data from spreadsheets to the relational model We use data refinements: A[rr] to B[ll] from to : A B is an injective function; from : B A a surjective function; from to = id A (identity function on A); B is a refinement of A witnessed by the to and from functions In the case where the refinement works in both directions we have an isomorphism A = B 3.1

From a Spreadsheet to a Database Data Refinements Composition if A[rr] to B[ll] from and B[rr] to C[ll] from then A[ Propagation if A[rr] to B[ll] from then FA[rr] Fto FB[ll] Ffrom 3.1

From a Spreadsheet to a Database Examples: Hierarchical-Relational Data Mapping A N A List elimination 2 A = A 1 Set elimination A? = 1 A Optional elimination A + B A? B? Sum elimination A (B + C) = (A B) + (A C) Distribute product over sum A (B + C) (A B) (A C) Distribute map over sum (range) (B + C) A = (B A) (C A) Distribute map over sum (domain) 3.2

From a Spreadsheet to a Database Data Refinements With Constraints To represent database constraints we use type constraints Associate a constraint φ to a type A: A φ where φ : A Bool Add a second constraint to a constrained datatype: (A φ ) ψ = Aφ ψ Functorial pull: F(A φ ) = (FA) (Fφ) For example, a constraint on the elements of a list can be pulled up to a constraint on the list: (A φ ) = (A ) listφ 3.3

From a Spreadsheet to a Database Data Refinements With Constraints Refining a constrained datatype (source) if A[rr] to B[ll] from then A φ [rr] to B φ from [ll] from A law that introduces a constraint to a constrained datatype if A[rr] to B ψ [ll] from then A φ [rr] to (B ψ ) φ from [ll] from 3.3

From a Spreadsheet to a Database Constraint-Aware Rewriting There is an Haskell implementation of data refinement called 2LT Two Level Transformation (http://code.google.com/p/2lt) We need a type-safe representation for types, so we use a GADT data Type t where Int :: Type Int [ ] :: Type a Type [a ] :: Type a Type b Type (a, b) Func :: Type a Type b Type (a b) ( ) :: Type a PF (Pred a) Type a (A φ )... type Pred a = a Bool 3.4

From a Spreadsheet to a Database Constraint-Aware Rewriting For the function representation we also use a GADT data PF a where π 1 :: PF ((a, b) a) δ :: PF ((a b) Set a) :: PF (a b) PF ([a ] [b ]) :: PF (a b) PF (c d) PF ((a, c) (b, d))... Function representations can be evaluated to the function that is represented: eval :: Type (a b) PF (a b) a b 3.5

From a Spreadsheet to a Database Constraint-Aware Rewriting Data refinement rules in Haskell type Rule = a. Type a Maybe (View (Type a)) data View a where View :: Rep a b Type b View (Type a) data Rep a b = Rep{to = PF (a b), from = PF (b a)} 3.5

From a Spreadsheet to a Database Constraint-Aware Rewriting Rewriting composition operators nop :: Rule ::Rule Rule Rule ::Rule Rule Rule many :: Rule Rule once :: Rule Rule -- identity -- sequential composition -- left-biased choice -- repetition -- arbitrary depth rule application 3.5

From a Spreadsheet to a Database Refining a Relational Table to a Spreadsheet Table T able2sstable A B (A B) fd Sstable2table data PF a where... Sstable2table :: PF ([(a, b)] (a b)) Table2sstable :: PF ((a b) [(a, b)]) table2sstable :: Rule table2sstable (a b) = return (View rep [a b] fd ) where rep = Rep{to = Table2sstable, from = Sstable2table } 3.6

From a Spreadsheet to a Database Refining Relational Tables where a PK Contains a FK ((A B) (C D)) πa δ π 1 π C δ π 2 [dd] T able2sstable T able2ssta ((A B) fd (C D) fd ) π A list2set π 1 π 1 π C list2set π 1 π 2 [uu] Sstable2table A Foreign Key (FK) is a set of attributes within one relation that matches the PK of some relation 3.7

From a Spreadsheet to a Database Refining Tables where Non-Key Attributes Contain FKs ((A B) (C D)) πb ρ π 1 π C δ π 2 [dd] T able2sstable T able2sst ((A B) fd (C D) fd ) π B list2set π2 π 1 π C list2set π1 π 2 [uu] Sstable2table 3.8

From a Spreadsheet to a Database The Strategy rdb2ss :: Rule rdb2ss = simplifyinv (many (aux tables2table)) (many ((aux tables2sstables) (aux tables2sstables ))) (many (aux table2sstable)) where aux r = ((once r) simplifyinv) ((many (once r)) simplifyinv) 3.9

The HaExcel Framework HaExcel is a framework to manipulate and transform spreadsheets Haskell library: generic and reusable library to transform spreadsheets into RDBs and back Front-ends: Excel/Gnumeric (XML) formats can be imported and exported Tools: an online tool, available at http://haskell.di.uminho.pt/jacome/index.html a batch tool, compilable from sources available at http://haskell.di.uminho.pt/websvn 4.0

Conclusions We extended the 2LT framework with new constraint-aware twolevel data refinements We shown how these rules can be combined into a strategic rewrite system We constructed HaExcel that transforms a relational schema into a tabular schema and back Our techniques also migrate the formulas between the models We connected importers and exporters for SQL and spreadsheet formats This allows refactoring of spreadsheets to reduce data redundancy and improve error detection and migration of spreadsheet applications An Excel plugin that implements HaExcel is under development 5.0

Questions? Questions? 6.0