CS Hot topics in database systems: Data Provenance

Similar documents
Who we are: Database Research - Provenance, Integration, and more hot stuff. Boris Glavic. Department of Computer Science

Perm Integrating Data Provenance Support in Database Systems

CS 525 Advanced Database Organization - Spring 2017 Mon + Wed 1:50-3:05 PM, Room: Stuart Building 111

CSE 544 Principles of Database Management Systems

CS Hot topics in database systems: Data Provenance

Homework 2: E/R Models and More SQL (due February 17 th, 2016, 4:00pm, in class hard-copy please)

San José State University Science/Computer Science Database Management System I

The Picture of Dorian Gray

CS 241 Data Organization. August 21, 2018

Additionally, if you are ing me please place the name of the course in the subject of the .

Austin Community College Google Apps Calendars Step-by-Step Guide

Department of Accounting & Law, School of Business. State University of New York at Albany. Acc 682 Analysis & Design of Accounting Databases

San José State University Computer Science Department CS157A: Introduction to Database Management Systems Sections 5 and 6, Fall 2015

Database Design and Management - BADM 352 Fall 2009 Syllabus and Schedule

What s a database anyway?

Advanced Relational Database Management MISM Course F A Fall 2017 Carnegie Mellon University

Informatics 1: Data & Analysis

CSC 407 Database System I COURSE PARTICULARS COURSE INSTRUCTORS COURSE DESCRIPTION

Salesforce Certified Administrator Study Guide

Extensibility and Modularity in Programming Languages

MIT Database Management Systems Lesson 01: Introduction

Introduction and Overview

Tim Cohn TimWCohn

Introduction to Data Management. Lecture #2 (Big Picture, Cont.) Instructor: Chen Li

PROFESSIONAL CERTIFICATES AND SHORT COURSES: MICROSOFT OFFICE. PCS.uah.edu/PDSolutions

The ebuilders Guide to selecting a Web Designer

Cleveland State University

Spam. Time: five years from now Place: England

Hours: See Canvas staff information for TA hours.

Isabella: Braveheart Of France By Colin Falconer

EECS 647: Introduction to Database Systems

CS425 Fall 2016 Boris Glavic Chapter 1: Introduction

Advanced Relational Database Management MISM Course S A3 Spring 2019 Carnegie Mellon University

Course Title: Computer Networking 2. Course Section: CNS (Winter 2018) FORMAT: Face to Face

Web Programming Fall 2011

DOWNLOAD PDF LEARN TO USE MICROSOFT ACCESS

CS157a Fall 2018 Sec3 Home Page/Syllabus

Google Analytics. powerful simplicity, practical insight

Database Management Systems MIT Lesson 01 - Introduction By S. Sabraz Nawaz

Thanks to Chris Bregler. COS 429: Computer Vision

IS Spring 2018 Database Design, Management and Applications

Dr. Jeff Ritchie Chair of Digital Communications Department at Lebanon Valley College 101 North College Ave. Annville, PA 17003

Fundamentals of Database Systems

HCI-4/631 Software Architectures for User Interfaces, Fall 2006

CSCU9Q5. Administrivia & Topics to be covered. Traditional File-Based Systems. Problems with Manual Filing Systems. CSCU9Q5- Database P&A

Databases TDA357/DIT620. Niklas Broberg

Announcements. Using Electronics in Class. Review. Staff Instructor: Alvin Cheung Office hour on Wednesdays, 1-2pm. Class Overview

INTRODUCTION TO GRAPHIC DESIGN FOR WEB AND PRINT (INTENSIVE) COURSE ID: GD0086

INF 315E Introduction to Databases School of Information Fall 2015

USC Viterbi School of Engineering

CPS352 Database Systems Syllabus Fall 2012

TDDD38 - Advanced programming in C++

CS535: Interactive Computer Graphics

CSCI 320 Group Project

Database Design Phases. History. Entity-relationship model. ER model basics 9/25/11. Entity-relationship (ER) model. ER model basics II

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #5: Entity/Relational Models---Part 1

Welcome to... CS113: Introduction to C

Advanced Topics in Database Systems Spring 2016

Bases de Dades: introduction and organization

DSP Research Project

CMPT 354 Database Systems. Simon Fraser University Fall Instructor: Oliver Schulte

CSC 111 Introduction to Computer Science (Section C)

CMPT 354 Database Systems I. Spring 2012 Instructor: Hassan Khosravi

Database Systems (INFR10070) Dr Paolo Guagliardo. University of Edinburgh. Fall 2016

Introduction to CS 4604

Database Management Systems MIT Introduction By S. Sabraz Nawaz

Set your office free.

PostgreSQL Training. Scheduled Courses On-site Courses. Learn from the PostgreSQL experts.

Introduction to Programming Nanodegree Syllabus

X-AFFILIATE module for X-Cart 4.0.x

Data Structures and Algorithm Analysis (CSC317) Introduction

SYLLABUS BIOLOGY 1107: Principles of Biology I Fall Semester Section 001: (9:05-9:55AM, Mon, Wed and Fri) Bldg.: Laurel Hall, Room 102

How To Make 3-50 Times The Profits From Your Traffic

Getting it done. New responses. Audio tour outcomes. Tailor content for audio tour. In situ interviews. Offer more than random access tour

IS 331-Fall 2017 Database Design, Management and Applications

Big Sandy Community and Technical College. Course Syllabus

San José State University Computer Science Department CS49J, Section 3, Programming in Java, Fall 2015

Apprenticeships CUSTOMER. Functional Skills Level 1 MATHEMATICS

Service Anywhere Practitioner Forum. Program Overview

Computer Science Technology Department

CS 564: DATABASE MANAGEMENT SYSTEMS. Spring 2018

MCSE Productivity. A Success Guide to Prepare- Core Solutions of Microsoft SharePoint Server edusum.com

San Jose State University - Department of Computer Science

SharePoint 2016 Power User

The University of Jordan

A good example of entities and relationships can be seen below.

Databases vs. Spreadsheets

BOSTON UNIVERSITY Metropolitan College MET CS342 Data Structures with Java Dr. V.Shtern (Fall 2011) Course Syllabus

Conceptual Design with ER Model

CISC 3140 (CIS 20.2) Design & Implementation of Software Application II

CS W Introduction to Databases Spring Computer Science Department Columbia University

arth 4573 history of graphic design spg18 The project is broken into three progressive parts:

UNIVERSITY OF TEXAS AT DALLAS MIS 6302: Information Technology Strategy & Management SPRING 2014 Tuesday 7 to 9.45 pm

CS240: Programming in C

CSCI 201L Syllabus Principles of Software Development Spring 2018

New York University Tisch School of the Arts Reel Delivery: Design for Media Distribution

ECONOMICS 5317: CONTEMPORARY GOVERNMENT AND BUSINESS RELATIONS

Textbook: Chapter 4. Chapter 5: Intermediate SQL. CS425 Fall 2016 Boris Glavic. Chapter 5: Intermediate SQL. View Definition.

Writing 125, Fall 2004, Section Founders x2575, or Office Hours: T/F: 9:45-10:45 W: 12:20-1:20 & by appt.

Introduction to Access 97/2000

Transcription:

CS 595 - Hot topics in database systems: Data Provenance Boris Glavic August 22, 2012

Outline 1 Instructor 2 Course Overview 3 Course Details and Administrative Information

Boris Glavic Assistant Professor for Database Systems (aka the new guy) Office: Stuart Building, room 226C Office hours: Thursday, 1:00 pm - 2:00 pm Webpage: www.cs.iit.edu/~glavic Phone: 312 567 5205 Slide 1 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Outline 1 Instructor 2 Course Overview Why should I care? 3 Course Details and Administrative Information

CS 595-06 - Data Provenance Administrative Info Hours: Mon + Wed 3:15-4:30 PM Room: Stuart Building in room 106 Course Webpage: Will be linked on www.cs.iit.edu/~glavic soon! Slide 2 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Data Provenance Data Provenance Information about the creation process and origin of data Slide 3 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Why do we call it Provenance? Origin of the Term From art dealing Alternative Terms Lineage Data Pedigree Slide 4 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Why do we call it Provenance? Origin of the Term From art dealing Alternative Terms Lineage for kings Data Pedigree Slide 4 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Why do we call it Provenance? Origin of the Term From art dealing Alternative Terms Lineage for kings Data Pedigree for dogs Slide 4 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Why do we call it Provenance? Origin of the Term From art dealing for pieces of art Alternative Terms Lineage for kings Data Pedigree for dogs Slide 4 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

me Course Overview Course Info Provenance in Art Given a piece of art How do we know... if it is authentic? who created it? if it has been altered? Example Jan Van Eyck - Arnolfini Portrait Slide 5 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

me Course Overview Course Info Provenance in Art Provenance French provenir, to come from Chronology of the ownership or location of an historical object Example 1434 - Painting dated by van Eyck; presumably owned by the sitters. before 1516 - In possession of Don Diego de Guevara (d. Brussels 1520), a Spanish career courtier of the Habsburgs (himself the subject of a fine portrait by Michael Sittow in the National Gallery of Art). He lived most of his life in the Netherlands, and may have known the Arnolfinis in their later years. By 1516 he had given the portrait to Margaret of Austria, Habsburg Regent of the Netherlands. 1516 - Painting is the first item in an inventory of Margaret s paintings, made in her presence at Mechelen. The item says (in French): a large picture which is called Hernoul le Fin with his wife in a chamber, which was given to Madame by Don Diego, whose arms are on the cover of the said picture; done by the painter Johannes. A note in the margin says It is necessary to put on a lock to close it: which Madame has ordered to be done. 1523-4 - In another Mechelen inventory, a similar description, this time the name of the subject is given as Arnoult Fin. 1558 - In 1530 the painting was inherited by Margaret s niece Mary of Hungary, who in 1556 went to live in Spain. It is clearly described in an inventory taken after her death in 1558, when it was inherited by Philip II of Spain. A painting of two of his young daughters commissioned by Philip clearly copies the pose of the figures (Prado).[1] 1599 - a German visitor saw it in the Alcazar Palace in Madrid. Now it had verses from Ovid painted on the frame: See that you promise: what harm is there in promises? In promises anyone can be rich. It is very likely that Vela zquez knew the painting, which may have influenced his Las Meninas, which shows a room in the same palace. 1700 - In an inventory after the death of Carlos II it was still in the palace, with shutters and the verses from Ovid. 1794 - Now in the Palacio Nuevo in Madrid. Jan Van Eyck - Arnolfini Portrait Slide 5 of 26 Boris Glavic 1816 - The painting is now in London, in the possession of Colonel James Hay, a Scottish soldier. He claimed that after being seriously wounded at the Battle of Waterloo the previous year, the painting hung in the room where he convalesced in Brussels. He fell in love with it, and persuaded the owner to sell. More relevant to the real facts is no doubt Hay s presence at the Battle of Vitoria (1813) in Spain, where a large coach loaded by King Joseph Bonaparte with easily portable artworks from the Spanish royal collections was first plundered by British troops, before what was left was recovered by their commanders and returned to the Spanish. Hay offered the painting to the Prince Regent, later George IV of England, via Sir Thomas Lawrence. The Prince had it on approval for two years at Carlton House before eventually returning it in 1818. 1523-4 1558 In another Mechelen inventory, a similar description, this time the name of the subject is given as Arnoult Fin. In 1530 the painting was inherited by Margaret s niece Mary of Hungary, who in 1556 went to live in Spain. It is clearly described in an inventory taken after her death in 1558, when it was inherited by Philip II of Spain. A painting of two of his young daughters commissioned by Philip clearly copies the pose of the figures (Prado).[1] CS 595 - Hot topics in database systems: Data Provenance

Provenance in Data Processing Given a piece of data How do we know... which data it is derived from? which transformations (SQL) where used to create it? who created it?... Example result shop rev t 1 Migros 125 t 2 Coop 25 Slide 6 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Provenance in Data Processing Given a piece of data How do we know... which data it is derived from? which transformations (SQL) where used to create it? who created it?... Example Compute the revenue for each shop as sum of prices of items sold Example result shop rev t 1 Migros 125 t 2 Coop 25 SELECT shop, sum ( price ) AS rev FROM sales, items WHERE itemid = id GROUP BY shop sales shop itemid s 1 Migros 1 s 2 Migros 3 s 3 Coop 3 items id price i 1 1 100 i 2 2 10 i 3 3 25 Slide 6 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Provenance in Data Processing Given a piece of data How do we know... which data it is derived from? which transformations (SQL) where used to create it? who created it?... Definition (Data Provenance) Information about the origin and creation process of data. Example result shop rev t 1 Migros 125 t 2 Coop 25 SELECT shop, sum ( price ) AS rev FROM sales, items WHERE itemid = id GROUP BY shop sales shop itemid s 1 Migros 1 s 2 Migros 3 s 3 Coop 3 items id price i 1 1 100 i 2 2 10 i 3 3 25 Slide 6 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

A more Complex Example Scenario You are an analyst for a garden supply shop You have to compute the first quater revenue for each shop location Datawarehouse with sales data Use SQL to compute the required information from the warehouse Slide 7 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

A more Complex Example Example (Input Data) Employee SSN Name WorksFor 123 Peter Peterson New York 342 Jane Janeson New York 555 Heinz Heinzmann Wuppertal Shop Location Budget New York 1.000.000 Wuppertal 4.000 Item Id Description Price 1 Lawnmower 199 2 Fertilizer 32 3 Rake 9 Sales Employee Item Amount Month 123 1 1 1 342 2 64 1 342 3 2 3 555 3 1 5 Slide 8 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

A more Complex Example Example (SalesTotal Query) CREATE VIEW SalesTotal AS SELECT Location AS Shop, Month, SSN AS Employee, Price * Amount AS Totalprice FROM Employee E, Shop H, Item I, Sales S WHERE E. WorksFor = H. Location AND E. SSN = S. Employee AND I. Id = S. Item Example (Results) SalesTotal Shop Month Employee Totalprice New York 1 123 199 New York 1 342 2048 New York 3 342 18 Wuppertal 5 555 9 Slide 8 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

A more Complex Example Example (MonthlyRevenue Query) CREATE VIEW MonthlyRevenue SELECT Shop, Month, sum ( Totalprice ) AS Revenue FROM SalesTotal GROUP BY Shop, Month Example (Results) MonthlyRevenue Shop Month Revenue New York 1 2247 New York 3 18 Wuppertal 5 9 Slide 8 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

A more Complex Example Example (RevenueFirstQ Query) CREATE VIEW RevenueFirstQ SELECT Shop, sum ( Revenue ) AS Revenue FROM MonthlyRevenue WHERE Month < 5 GROUP BY Shop Example (Results) RevenueFirstQ Shop Revenue New York 2265 Slide 8 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

A more Complex Example Compute First Quarter Revenue RevenueFirstQ MonthlyRevenue SalesTotal Employee Shop Item Sales Slide 8 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Example Data Example Compute First Quarter Revenue RevenueFirstQ MonthlyRevenue SalesTotal Slide 9 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Example Data Example Compute First Quarter Revenue RevenueFirstQ MonthlyRevenue SalesTotal Slide 9 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Example Data Example SalesTotal EmployeeShop Name WorksFor SSN Location Budget 123 Peter Peterson New York New York 1.000.000 342 Jane Janeson New York New York 1.000.000 555 Heinz Heinzmann Wuppertal Wuppertal 4.000 Employee SSN Name WorksFor 123 Peter Peterson New York 342 Jane Janeson New York 555 Heinz Heinzmann Wuppertal Slide 9 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Tracing an Error Problem One result tuple of your query looks suspicious You expect the input data to be the culprit How to know which input data affected which output data Slide 10 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Tracing an Error Problem One result tuple of your query looks suspicious You expect the input data to be the culprit How to know which input data affected which output data This is Data Provenance Slide 10 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Tracing an Error Problem One result tuple of your query looks suspicious You expect the input data to be the culprit How to know which input data affected which output data This is Data Provenance But how to get at the data provenance? Slide 10 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Tracing an Error Problem One result tuple of your query looks suspicious You expect the input data to be the culprit How to know which input data affected which output data This is Data Provenance But how to get at the data provenance? Manually? Not reasonable for big data or complex query! Slide 10 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Tracing an Error Problem One result tuple of your query looks suspicious You expect the input data to be the culprit How to know which input data affected which output data This is Data Provenance But how to get at the data provenance? Manually? Not reasonable for big data or complex query! Need system that tracks it automatically! Slide 10 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Why should I care? Use Cases Debugging (tracking the sources of errors) Propagating annotations Gain deeper understanding of data and transformations Estimate quality, trust Improvement of other data processing technologies Probabilistic databases Deletion propagation Testing Slide 11 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Why should I care? Application Domains Complex database queries, e.g., datawarehousing E-science and curated databases Data integration/exchange Workflow systems Slide 12 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Why should I care? Application Domains Complex database queries, e.g., datawarehousing E-science and curated databases Data integration/exchange Workflow systems Application domain with complex, multi-stage data processing Map-Reduce style processing and its frontends like Pig Simulations... Slide 12 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Why should I care? Debugging Example SalesTotal Slide 13 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Why should I care? Annotation Propagation Example EnzymeProduce Enzyme Gene EC 1.1.1.1 ALB?? EC 1.97.1.6 ALB?? CREATE VIEW EnzymeProduce AS SELECT Enzyme, Name AS Gene FROM Gene G, Enzyme E WHERE G.Id = E.ProducedBy Gene Id Name 4q11-q13 ALB {a 4} 18q21.3 BCL2 {} Enzyme Enzyme Weight ProducedBy EC 1.1.1.1 45 4q11-q13 {a 1,a 2} EC 1.97.1.6 12 4q11-q13 {a 2,a 3} a 1 Necessary for red blood cells a 2 Produced in liver a 3 Unhealthy a 4 Discovered by Edmond Hillary Slide 14 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Outline 1 Instructor 2 Course Overview 3 Course Details and Administrative Information

Course Topics We will study... Several models of database provenance Approaches for automatically tracking provenance Query languages and storage mechanism for provenance Real systems that generate provenance data Outlook into other research areas that use provenance Slide 15 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Prerequisites Some database background (mainly query languages) SQL Relational algebra Datalog Ideally you have taken one of the following courses: CS 425 CS 520 CS 525 Slide 16 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Course Material No text book is required Research papers that are available online Slide 17 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Course Organization Course will consist of... Lectures Research paper reviews (oral presentations done by students) A major project Implementation or extension of a real system Written report Oral presentation Slide 18 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Paper Reviews Papers The list of papers will be on the course website soon Topics cover the content of the course and extend it Process Students will have to pick papers to review By end of august, first week of september Some of the classes will be used for oral presentations on the topics Each presentation will be followed by a discussion Short written reviews (4-6 pages) will be due by Nov 1st Slide 19 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Course Project Implementation or extension of a provenance system Topics: Adding new provenance types to existing systems Add causality based provenance to Perm Add Why-provenance to Perm Extend the How-provenance implementation of Perm to derive new information Build a visualization tool for provenance in Perm... Your topic Slide 20 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Course Project cont. Organisation Choose a project by Sep 15th Meetings to discuss the projects with me Project implementation Progress reports during semester Written report: Due by Nov 15th Oral presentations: Classes at the end of the semester Slide 21 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Stuff to get familiar with Get practical experience with a database system PostgreSQL: open-source, quite standard compliant, Perm based on that Use Perm which is an provenance-enabled version of Postgres Query languages: SQL Relational Algebra Datalog Programming skills: C, C++, Java *nix operating system or Cygwin for Windows users Slide 22 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Detailed Course Outline Introduction to Data Provenance What is Data Provenance? Why do we need it? Understanding different types of data provenance Database Provenance Provenance Models and Systems Why-provenance Where-provenance and the DBNotes system Lineage and the WHIPS prototype Witness-list semantics and Perm Provenance semirings and Orchestra Causality and Responsibility models Storage mechanisms Query languages Slide 23 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Detailed Course Outline cont. Extensions of the Provenance concept Provenance for missing answers Provenance for past queries Provenance for updates Beyond Database Provenance Scientific workflows Provenance in the operating system context Connection with Dataflow analysis in programming languages Slide 24 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Grading Policies Course Project (Implementation, Report, Presentation): 60% Paper reviews: written review (15%) and oral presentation (15%) Participation in the paper discussions: (10%) Slide 25 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance

Questions? Further Information Office hours: Thursday, 1:00 pm - 2:00 pm, room 226C Webpage: www.cs.iit.edu/~glavic Course webpage will be linked there soon. Slide 26 of 26 Boris Glavic CS 595 - Hot topics in database systems: Data Provenance