Logical Optimization of ETL Workflows

Size: px
Start display at page:

Download "Logical Optimization of ETL Workflows"

Transcription

1 Logical Optimization of ETL Workflows Alkis Simitsis Nat. Tech. University of Athens Panos Vassiliadis University of Ioannina Timos Sellis Nat. Tech. University of Athens Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/

2 Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/ Extraction Transformation Loading (ETL) Workflows HDMS'05 - A. Simitsis - 25/08/

3 ETL workflows are not big queries S1 γ V1 U S2 γ V2 Sources TIME DW HDMS'05 - A. Simitsis - 25/08/ ETL workflows are not big queries DS.PS_NEW 1 DS.PS_NEW 1.PKEY, DS.PS_OLD 1.PKEY SUPPKEY=1 DS.PS 1.PKEY, LOOKUP_PS.SKEY, SUPPKEY COST DATE DS.PS_OLD 1 DIFF 1 DS.PS 1 Add_SPK 1 SK 1 rejected $2 rejected A2EDate rejected U Log Log Log DS.PS_NEW 2 DS.PS_NEW 2.PKEY, DS.PS_OLD 2.PKEY SUPPKEY=2 DS.PS 2.PKEY, LOOKUP_PS.SKEY, SUPPKEY COST DATE=SYSDATE QTY>0 DIFF 2 DS.PS 2 Add_SPK 2 SK 2 NotNULL AddDate CheckQTY DS.PS_OLD 2 Log rejected Log rejected DSA PKEY, DAY MIN(COST) S 1 _PARTSUPP FTP 1 DW.PARTSUPP Aggregate 1 V1 DW.PARTSUPP.DATE, DAY PKEY, MONTH AVG(COST) S 2 _PARTSUPP FTP 2 TIME Aggregate 2 V2 Sources DW HDMS'05 - A. Simitsis - 25/08/

4 Problems The well known query optimization techniques are not sufficient, mostly due to the existence of data manipulation functions when is it valid to push an ETL activity in front of a function? black-box activities unknown semantics difficult/impossible/meaningless to express in relational algebra we cannot interfere with their interior (e.g., their source code) naming conflicts e.g., an attribute named COST in source 1 contains values in Dollars, while in source 2 contains values in Euros HDMS'05 - A. Simitsis - 25/08/ Related Work ETL and optimization Data Cleaning (Rahm & Do 2000) Duplicate detection (Monge 2000) Cleaning of data from the web (AJAX 2000) CPU optimization for ETL (Potter s Wheel 2001) Extraction, Recovery and Lineage Tracing Detection of differentials (Labio et al. 1997) Recovery (Labio et al. 2000) Lineage tracing (Cui 2001) Streams and optimization AURORA 2003 HDMS'05 - A. Simitsis - 25/08/

5 Equivalent workflows Can we push selection early enough? Can we aggregate before $2 takes place? What about naming conflicts? HDMS'05 - A. Simitsis - 25/08/ Aims How can we minimize the execution time of an ETL workflow? How can we deal with naming problems? When is it valid to change the position of an activity in an ETL workflow? How can we produce equivalent ETL workflows? In summary, the two problems that have risen are to determine which modifications of the workflow are legal which configuration is the best in terms of performance gains HDMS'05 - A. Simitsis - 25/08/

6 Problem Formulation and Contribution We set up the theoretical framework for the problem of the optimization of an ETL workflow by modeling the problem as a state-space search problem each state represents a particular design of the workflow as a graph we define transitions from one state to another we construct the state-space we search for a state that minimizes the execution time of the workflow HDMS'05 - A. Simitsis - 25/08/ Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/

7 States Each activity is characterized by A unique ID Input schemata Output schemata Semantics, expressed in an Input 2 extended relational algebra with black-box functions Each ETL workflow is a state, i.e., a DAG: Nodes: activities and relations Edges: provider relations Input 1 Activity Output HDMS'05 - A. Simitsis - 25/08/ State Generation through Transitions State Generation: a transition is a modification to a state (i.e., a variant of the workflow) that generates a semantically equivalent workflow Three categories for five transitions Swap Distribute and Factorize Merge and Split HDMS'05 - A. Simitsis - 25/08/

8 Transitions Swap two activities Locally swap two unary activities HDMS'05 - A. Simitsis - 25/08/ Transitions Distribute and Factorize Factorize: replacement of two homologous activities placed in two converging data flows by a new one Distribute: distribution of an activity into two converging data flows a,a 1,a 2 : homologous activities HDMS'05 - A. Simitsis - 25/08/

9 Transitions Distribute and Factorize An example a simple cost model: #rows cost: c SK =nlog 2 n and c σ =n application: input: 8 records in each flow selectivity: 50% for σ, 100% for the rest C 1 =2nlog 2 n+n=56 C 2 =2(n+(n/2)log 2 (n/2))=32 C 3 =2n+(n/2)log 2 (n/2)=24 HDMS'05 - A. Simitsis - 25/08/ Transitions Merge and Split Merge / Split unary activities HDMS'05 - A. Simitsis - 25/08/

10 Transition Applicability We resolve naming problems by reducing all attribute names (for all activity and relation schemata) to a common terminology We annotate each activity with three extra schemata: Functionality schema: the set of input attributes that participate in the computation Generated / Projected-out schemata Easy to obtain these through a set of templates (Vassiliadis et CAiSE 03, Inf. Systems) E.g., the Surrogate Key Transformation requires an input production key, a generated surrogate key and a parameter specifying the source of the data HDMS'05 - A. Simitsis - 25/08/ Transition Applicability When is it valid to swap two activities? the functionality schema should be a subset of the input schema HDMS'05 - A. Simitsis - 25/08/

11 Transition Applicability When is it valid to swap two activities? the functionality schema should be a subset of the input schema input schemata should be subset of provider schemata HDMS'05 - A. Simitsis - 25/08/ Transition Correctness To prove correctness we employ cond: post-condition for activities true, when the activity is successfully executed involves activity s functionality schema e.g., $2 (#vrbl1) Cond: post-condition for ETL workflows $2 (COST) $2 (COST) conjunction of the post-conditions of all the workflow activities, arranged in the order of their execution ETL workflow equivalence: the schema of the data propagated to each target recordset, at the end of each flow, is identical the flows have the same post-conditions Theorem: All transitions produce equivalent workflows HDMS'05 - A. Simitsis - 25/08/

12 Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/ Algorithms Simple to produce an Exhaustive search (ES algorithm) We generate all the possible states that can be generated by applying all the applicable transitions to every state Heuristic search (HS algorithm) Pre-Processing: Merge before any other transition Phase 1: Swap only in linear paths Phase 2: Factorize only homologous activities placed in two converging paths Phase 3: Distribute only if transformation is applicable Phase 4: Swap again only in the linear paths of the new states produced in phases 2 and 3 Greedy variant of the Heuristic search (HS-G algorithm) Swap only if we gain in cost HDMS'05 - A. Simitsis - 25/08/

13 Experiments Setup Three categories for 40 different ETL workflows Small: 20 activities (avg) Medium: 40 activities (avg) Large: 70 activities (avg) Quality of solution HS: near optimal HS-G: good for small/medium and over 60% for large workflows Improvement of solution HS: over 70% HS-G: over 60% for small and medium, and over 47% for large workflows Quality Ratio Avg Improvement (%) Quality of Solution (% avg) 40 small medium large Type of Scenario 80,00% 70,00% 60,00% 50,00% Improvement of Solution 40,00% small medium large Type of Scenario HS HS-G HS HS-G HDMS'05 - A. Simitsis - 25/08/ Experiments Volume of state space ES: exponentially increases HS: grows for large scenarios HS-G: keeps state space small Evolution of time ES: in medium/large scenarios did not terminate within 40h HS: avg worst case is ~35min for large scenarios, while the gain in the execution time outreaches 70% 1,E+06 HS-G: for large 1,E+05 scenarios is quicker 1,E+04 1,E+03 than HS, but the gain 1,E+02 is only 47% (avg) 1,E+01 Time (logarithmic scale) 1,E+00 Evolution of Time small medium large Type of Scenario Number of Visited States Volume of State Space small medium large Type of Scenario Evolution of Time ES HS HS-G Time (sec) small medium large Type of Scenario ES HS HS-G HS HS-G HDMS'05 - A. Simitsis - 25/08/

14 Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/ Conclusions & Future Work We set up the problem of optimization of ETL workflows as a state-space search problem, by modeling an ETL workflow as a state We define transitions in the search space, along with their applicability and prove the correctness of applicability rules We provide three search algorithms and explore their performance Future work can be pursued in different directions: Optimization at the physical level ETL Optimization for non-traditional data (XML, biomedical, ) Parallel ETL processing HDMS'05 - A. Simitsis - 25/08/

15 Thank you HDMS'05 - A. Simitsis - 25/08/

On the Logical Modeling of ETL Processes

On the Logical Modeling of ETL Processes On the Logical Modeling of ETL Processes Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos National Technical University of Athens {pvassil,asimi,spiros}@dbnet.ece.ntua.gr What are Extract-Transform-Load

More information

Modeling ETL Activities as Graphs

Modeling ETL Activities as Graphs Modeling ETL Activities as Graphs Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos National Technical University of Athens, Dept. of Electrical and Computer Eng., Computer Science Division, Iroon

More information

Conceptual modeling for ETL

Conceptual modeling for ETL Conceptual modeling for ETL processes Author: Dhananjay Patil Organization: Evaltech, Inc. Evaltech Research Group, Data Warehousing Practice. Date: 08/26/04 Email: erg@evaltech.com Abstract: Due to the

More information

From conceptual design to performance optimization of ETL workflows: current state of research and open problems

From conceptual design to performance optimization of ETL workflows: current state of research and open problems The VLDB Journal (2017) 26:777 801 DOI 10.1007/s00778-017-0477-2 REGULAR PAPER From conceptual design to performance optimization of ETL workflows: current state of research and open problems Syed Muhammad

More information

Graph-Based Modeling of ETL Activities with Multi-level Transformations and Updates

Graph-Based Modeling of ETL Activities with Multi-level Transformations and Updates Graph-Based Modeling of ETL Activities with Multi-level Transformations and Updates Alkis Simitsis 1, Panos Vassiliadis 2, Manolis Terrovitis 1, and Spiros Skiadopoulos 1 1 National Technical University

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

Blueprints and Measures for ETL Workflows

Blueprints and Measures for ETL Workflows Blueprints and Measures for ETL Workflows Panos Vassiliadis 1, Alkis Simitsis 2, Manolis Terrovitis 2, Spiros Skiadopoulos 2 1 University of Ioannina, Dept. of Computer Science, Ioannina, Hellas pvassil@cs.uoi.gr

More information

Blueprints for ETL workflows

Blueprints for ETL workflows Blueprints for ETL workflows Panos Vassiliadis 1, Alkis Simitsis 2, Manolis Terrovitis 2, Spiros Skiadopoulos 2 1 University of Ioannina, Dept. of Computer Science, 45110, Ioannina, Greece pvassil@cs.uoi.gr

More information

A Taxonomy of ETL Activities

A Taxonomy of ETL Activities Panos Vassiliadis University of Ioannina Ioannina, Greece pvassil@cs.uoi.gr A Taxonomy of ETL Activities Alkis Simitsis HP Labs Palo Alto, CA, USA alkis@hp.com Eftychia Baikousi University of Ioannina

More information

Rule-based Management of Schema Changes at ETL sources

Rule-based Management of Schema Changes at ETL sources Rule-based Management of Schema Changes at ETL sources George Papastefanatos 1, Panos Vassiliadis 2, Alkis Simitsis 3, Timos Sellis 1, Yannis Vassiliou 1 1 National Technical University of Athens, Athens,

More information

Integrating ETL Processes from Information Requirements

Integrating ETL Processes from Information Requirements Integrating ETL Processes from Information Requirements Petar Jovanovic 1, Oscar Romero 1, Alkis Simitsis 2, and Alberto Abelló 1 1 Universitat Politècnica de Catalunya, BarcelonaTech Barcelona, Spain

More information

Benchmarking ETL Workflows

Benchmarking ETL Workflows Benchmarking ETL Workflows Alkis Simitsis 1, Panos Vassiliadis 2, Umeshwar Dayal 1, Anastasios Karagiannis 2, Vasiliki Tziovara 2 1 HP Labs, Palo Alto, CA, USA, {alkis, Umeshwar.Dayal}@hp.com 2 University

More information

What-if Analysis for Data Warehouse Evolution

What-if Analysis for Data Warehouse Evolution What-if Analysis for Data Warehouse Evolution George Papastefanatos 1, Panos Vassiliadis 2, Alkis Simitsis 3, and Yannis Vassiliou 1 1 National Technical University of Athens, Dept. of Electrical and Computer

More information

. The problem: ynamic ata Warehouse esign Ws are dynamic entities that evolve continuously over time. As time passes, new queries need to be answered

. The problem: ynamic ata Warehouse esign Ws are dynamic entities that evolve continuously over time. As time passes, new queries need to be answered ynamic ata Warehouse esign? imitri Theodoratos Timos Sellis epartment of Electrical and Computer Engineering Computer Science ivision National Technical University of Athens Zographou 57 73, Athens, Greece

More information

Target and source schemas may contain integrity constraints. source schema(s) assertions relating elements of the global schema to elements of the

Target and source schemas may contain integrity constraints. source schema(s) assertions relating elements of the global schema to elements of the Data integration Data Integration System: target (integrated) schema source schema (maybe more than one) assertions relating elements of the global schema to elements of the source schema(s) Target and

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Data Stream Processing Topics Model Issues System Issues Distributed Processing Web-Scale Streaming 3 System Issues Architecture

More information

Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing

Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing Nesime Tatbul Uğur Çetintemel Stan Zdonik Talk Outline Problem Introduction Approach Overview Advance Planning with an

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Advantages of UML for Multidimensional Modeling

Advantages of UML for Multidimensional Modeling Advantages of UML for Multidimensional Modeling Sergio Luján-Mora (slujan@dlsi.ua.es) Juan Trujillo (jtrujillo@dlsi.ua.es) Department of Software and Computing Systems University of Alicante (Spain) Panos

More information

Designing the Global Data Warehouse with SPJ Views

Designing the Global Data Warehouse with SPJ Views Designing the Global Data Warehouse with SPJ Views Dimitri Theodoratos, Spyros Ligoudistianos, and Timos Sellis Department of Electrical and Computer Engineering Computer Science Division National Technical

More information

Lesson 12 - Operator Overloading Customising Operators

Lesson 12 - Operator Overloading Customising Operators Lesson 12 - Operator Overloading Customising Operators Summary In this lesson we explore the subject of Operator Overloading. New Concepts Operators, overloading, assignment, friend functions. Operator

More information

Relational Model, Relational Algebra, and SQL

Relational Model, Relational Algebra, and SQL Relational Model, Relational Algebra, and SQL August 29, 2007 1 Relational Model Data model. constraints. Set of conceptual tools for describing of data, data semantics, data relationships, and data integrity

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

SYSTEMATIC ETL MANAGEMENT EXPERIENCES WITH HIGH-LEVEL OPERATORS (Practice-Oriented)

SYSTEMATIC ETL MANAGEMENT EXPERIENCES WITH HIGH-LEVEL OPERATORS (Practice-Oriented) SYSTEMATIC ETL MANAGEMENT EXPERIENCES WITH HIGH-LEVEL OPERATORS (Practice-Oriented) Alexander Albrecht Hasso Plattner Institute for Software Systems Engineering University of Potsdam, Germany Alexander.Albrecht@hpi.uni-potsdam.de

More information

Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE

Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE COMP 62421 Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE Querying Data on the Web Date: Wednesday 24th January 2018 Time: 14:00-16:00 Please answer all FIVE Questions provided. They amount

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Chapter 4. Capturing the Requirements. 4th Edition. Shari L. Pfleeger Joanne M. Atlee

Chapter 4. Capturing the Requirements. 4th Edition. Shari L. Pfleeger Joanne M. Atlee Chapter 4 Capturing the Requirements Shari L. Pfleeger Joanne M. Atlee 4th Edition It is important to have standard notations for modeling, documenting, and communicating decisions Modeling helps us to

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

arxiv: v1 [cs.db] 26 Jan 2017

arxiv: v1 [cs.db] 26 Jan 2017 The Many Faces of Data-centric Workflow Optimization: A Survey Georgia Kougka Aristotle University of Thessaloniki, Greece Anastasios Gounaris Aristotle University of Thessaloniki, Greece Alkis Simitsis

More information

Data Warehouses Chapter 12. Class 10: Data Warehouses 1

Data Warehouses Chapter 12. Class 10: Data Warehouses 1 Data Warehouses Chapter 12 Class 10: Data Warehouses 1 OLTP vs OLAP Operational Database: a database designed to support the day today transactions of an organization Data Warehouse: historical data is

More information

ETL Best Practices and Techniques. Marc Beacom, Managing Partner, Datalere

ETL Best Practices and Techniques. Marc Beacom, Managing Partner, Datalere ETL Best Practices and Techniques Marc Beacom, Managing Partner, Datalere Thank you Sponsors Experience 10 years DW/BI Consultant 20 Years overall experience Marc Beacom Managing Partner, Datalere Current

More information

The Inverse of a Schema Mapping

The Inverse of a Schema Mapping The Inverse of a Schema Mapping Jorge Pérez Department of Computer Science, Universidad de Chile Blanco Encalada 2120, Santiago, Chile jperez@dcc.uchile.cl Abstract The inversion of schema mappings has

More information

Introduction to Relational Databases. Introduction to Relational Databases cont: Introduction to Relational Databases cont: Relational Data structure

Introduction to Relational Databases. Introduction to Relational Databases cont: Introduction to Relational Databases cont: Relational Data structure Databases databases Terminology of relational model Properties of database relations. Relational Keys. Meaning of entity integrity and referential integrity. Purpose and advantages of views. The relational

More information

MISO: Souping Up Big Data Query Processing with a Multistore System

MISO: Souping Up Big Data Query Processing with a Multistore System MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan, NEC Labs Hakan Hacıgümüş. NEC Labs Junichi Tatemura, NEC Labs Neoklis Polyzotis,

More information

CPS510 Database System Design Primitive SYSTEM STRUCTURE

CPS510 Database System Design Primitive SYSTEM STRUCTURE CPS510 Database System Design Primitive SYSTEM STRUCTURE Naïve Users Application Programmers Sophisticated Users Database Administrator DBA Users Application Interfaces Application Programs Query Data

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

Chapter 3. The Multidimensional Model: Basic Concepts. Introduction. The multidimensional model. The multidimensional model

Chapter 3. The Multidimensional Model: Basic Concepts. Introduction. The multidimensional model. The multidimensional model Chapter 3 The Multidimensional Model: Basic Concepts Introduction Multidimensional Model Multidimensional concepts Star Schema Representation Conceptual modeling using ER, UML Conceptual modeling using

More information

SQL and Incomp?ete Data

SQL and Incomp?ete Data SQL and Incomp?ete Data A not so happy marriage Dr Paolo Guagliardo Applied Databases, Guest Lecture 31 March 2016 SQL is efficient, correct and reliable 1 / 25 SQL is efficient, correct and reliable...

More information

Note: Select one full question from each unit

Note: Select one full question from each unit P.E.S COLLEGE OF ENGINEERING, MANDYA-571401 (An Autonomous Institution Under VTU Belgaum) Department of Master of Computer Applications Model Question Paper Data Structures Using C (P18MCA21) Credits :

More information

CSE 421: Introduction to Algorithms Complexity

CSE 421: Introduction to Algorithms Complexity CSE 421: Introduction to Algorithms Complexity Yin-Tat Lee 1 Defining Efficiency Runs fast on typical real problem instances Pros: Sensible, Bottom-line oriented Cons: Moving target (diff computers, programming

More information

RELATIONAL DATA MODEL

RELATIONAL DATA MODEL RELATIONAL DATA MODEL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY RELATIONAL DATA STRUCTURE (1) Relation: A relation is a table with columns and rows.

More information

Chapter 4. The Relational Model

Chapter 4. The Relational Model Chapter 4 The Relational Model Chapter 4 - Objectives Terminology of relational model. How tables are used to represent data. Connection between mathematical relations and relations in the relational model.

More information

Data Warehouse Logical Design. Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato)

Data Warehouse Logical Design. Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato) Data Warehouse Logical Design Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato) Data Mart logical models MOLAP (Multidimensional On-Line Analytical Processing) stores data

More information

CS 188: Artificial Intelligence Fall 2008

CS 188: Artificial Intelligence Fall 2008 CS 188: Artificial Intelligence Fall 2008 Lecture 4: CSPs 9/9/2008 Dan Klein UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore 1 1 Announcements Grading questions:

More information

Announcements. CS 188: Artificial Intelligence Fall Large Scale: Problems with A* What is Search For? Example: N-Queens

Announcements. CS 188: Artificial Intelligence Fall Large Scale: Problems with A* What is Search For? Example: N-Queens CS 188: Artificial Intelligence Fall 2008 Announcements Grading questions: don t panic, talk to us Newsgroup: check it out Lecture 4: CSPs 9/9/2008 Dan Klein UC Berkeley Many slides over the course adapted

More information

TAG: A TINY AGGREGATION SERVICE FOR AD-HOC SENSOR NETWORKS

TAG: A TINY AGGREGATION SERVICE FOR AD-HOC SENSOR NETWORKS TAG: A TINY AGGREGATION SERVICE FOR AD-HOC SENSOR NETWORKS SAMUEL MADDEN, MICHAEL J. FRANKLIN, JOSEPH HELLERSTEIN, AND WEI HONG Proceedings of the Fifth Symposium on Operating Systems Design and implementation

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

OASIS: Self-tuning Storage for Applications

OASIS: Self-tuning Storage for Applications OASIS: Self-tuning Storage for Applications Kostas Magoutis, Prasenjit Sarkar, Gauri Shah 14 th NASA Goddard- 23 rd IEEE Mass Storage Systems Technologies, College Park, MD, May 17, 2006 Outline Motivation

More information

Load Shedding in a Data Stream Manager

Load Shedding in a Data Stream Manager Load Shedding in a Data Stream Manager Nesime Tatbul, Uur U Çetintemel, Stan Zdonik Brown University Mitch Cherniack Brandeis University Michael Stonebraker M.I.T. The Overload Problem Push-based data

More information

Efficient Data Structures for Tamper-Evident Logging

Efficient Data Structures for Tamper-Evident Logging Efficient Data Structures for Tamper-Evident Logging Scott A. Crosby Dan S. Wallach Rice University Everyone has logs Tamper evident solutions Current commercial solutions Write only hardware appliances

More information

ETL Work Flow for Extract Transform Loading

ETL Work Flow for Extract Transform Loading Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 6, June 2014, pg.610

More information

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13! Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13! q Overview! q Optimization! q Measures of Query Cost! Query Evaluation! q Sorting! q Join Operation! q Other

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2017/18 Unit 13 J. Gamper 1/42 Advanced Data Management Technologies Unit 13 DW Pre-aggregation and View Maintenance J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements:

More information

Tania Tudorache Stanford University. - Ontolog forum invited talk04. October 2007

Tania Tudorache Stanford University. - Ontolog forum invited talk04. October 2007 Collaborative Ontology Development in Protégé Tania Tudorache Stanford University - Ontolog forum invited talk04. October 2007 Outline Introduction and Background Tools for collaborative knowledge development

More information

Querying Data with Transact SQL

Querying Data with Transact SQL Course 20761A: Querying Data with Transact SQL Course details Course Outline Module 1: Introduction to Microsoft SQL Server 2016 This module introduces SQL Server, the versions of SQL Server, including

More information

Call: Datastage 8.5 Course Content:35-40hours Course Outline

Call: Datastage 8.5 Course Content:35-40hours Course Outline Datastage 8.5 Course Content:35-40hours Course Outline Unit -1 : Data Warehouse Fundamentals An introduction to Data Warehousing purpose of Data Warehouse Data Warehouse Architecture Operational Data Store

More information

Horizontal Aggregations for Mining Relational Databases

Horizontal Aggregations for Mining Relational Databases Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh,

More information

Ian Kenny. December 1, 2017

Ian Kenny. December 1, 2017 Ian Kenny December 1, 2017 Introductory Databases Query Optimisation Introduction Any given natural language query can be formulated as many possible SQL queries. There are also many possible relational

More information

CS 188: Artificial Intelligence Spring Today

CS 188: Artificial Intelligence Spring Today CS 188: Artificial Intelligence Spring 2006 Lecture 7: CSPs II 2/7/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Today More CSPs Applications Tree Algorithms Cutset

More information

ELASTIC PERFORMANCE FOR ETL+Q PROCESSING

ELASTIC PERFORMANCE FOR ETL+Q PROCESSING ELASTIC PERFORMANCE FOR ETL+Q PROCESSING ABSTRACT Pedro Martins, Maryam Abbasi, Pedro Furtado Department of Informatics, University of Coimbra, Portugal Most data warehouse deployments are not prepared

More information

Relational Databases

Relational Databases Relational Databases Jan Chomicki University at Buffalo Jan Chomicki () Relational databases 1 / 49 Plan of the course 1 Relational databases 2 Relational database design 3 Conceptual database design 4

More information

Relational Algebra 1

Relational Algebra 1 Relational Algebra 1 Relational Algebra Last time: started on Relational Algebra What your SQL queries are translated to for evaluation A formal query language based on operators Rel Rel Op Rel Op Rel

More information

CS224W: Analysis of Networks Jure Leskovec, Stanford University

CS224W: Analysis of Networks Jure Leskovec, Stanford University HW2 Q1.1 parts (b) and (c) cancelled. HW3 released. It is long. Start early. CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu 10/26/17 Jure Leskovec, Stanford

More information

Magda Balazinska - CSE 444, Spring

Magda Balazinska - CSE 444, Spring Query Optimization Algorithm CE 444: Database Internals Lectures 11-12 Query Optimization (part 2) Enumerate alternative plans (logical & physical) Compute estimated cost of each plan Compute number of

More information

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY CHARACTERISTICS Data warehouse is a central repository for summarized and integrated data

More information

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1) Chapter 19 Algorithms for Query Processing and Optimization 0. Introduction to Query Processing (1) Query optimization: The process of choosing a suitable execution strategy for processing a query. Two

More information

Relational Model: History

Relational Model: History Relational Model: History Objectives of Relational Model: 1. Promote high degree of data independence 2. Eliminate redundancy, consistency, etc. problems 3. Enable proliferation of non-procedural DML s

More information

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 From Single Purpose to Multi Purpose Data Lakes Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 Agenda Data Lakes Multiple Purpose Data Lakes Customer Example Demo Takeaways

More information

Queries with Order-by Clauses and Aggregates on Factorised Relational Data

Queries with Order-by Clauses and Aggregates on Factorised Relational Data Queries with Order-by Clauses and Aggregates on Factorised Relational Data Tomáš Kočiský Magdalen College University of Oxford A fourth year project report submitted for the degree of Masters of Mathematics

More information

Lyublena Antova, Christoph Koch, and Dan Olteanu Saarland University Database Group Saarbr ucken, Germany Presented By: Rana Daud

Lyublena Antova, Christoph Koch, and Dan Olteanu Saarland University Database Group Saarbr ucken, Germany Presented By: Rana Daud Lyublena Antova, Christoph Koch, and Dan Olteanu Saarland University Database Group Saarbr ucken, Germany 2007 1 Presented By: Rana Daud Introduction Application Scenarios I-SQL World-Set Algebra Algebraic

More information

Advanced Data Management Technologies Written Exam

Advanced Data Management Technologies Written Exam Advanced Data Management Technologies Written Exam 02.02.2016 First name Student number Last name Signature Instructions for Students Write your name, student number, and signature on the exam sheet. This

More information

Querying the Sensor Network. TinyDB/TAG

Querying the Sensor Network. TinyDB/TAG Querying the Sensor Network TinyDB/TAG 1 TAG: Tiny Aggregation Query Distribution: aggregate queries are pushed down the network to construct a spanning tree. Root broadcasts the query and specifies its

More information

DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY

DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY Reham I. Abdel Monem 1, Ali H. El-Bastawissy 2 and Mohamed M. Elwakil 3 1 Information Systems Department, Faculty of computers and information,

More information

Automatically Synthesizing SQL Queries from Input-Output Examples

Automatically Synthesizing SQL Queries from Input-Output Examples Automatically Synthesizing SQL Queries from Input-Output Examples Sai Zhang University of Washington Joint work with: Yuyin Sun Goal: making it easier for non-expert users to write correct SQL queries

More information

Bob s Notes for COS 226 Fall : Introduction, Union-Find, and Percolation. Robert E. Tarjan 9/15/13. Introduction

Bob s Notes for COS 226 Fall : Introduction, Union-Find, and Percolation. Robert E. Tarjan 9/15/13. Introduction Bob s Notes for COS 226 Fall 2013 1: Introduction, Union-Find, and Percolation Robert E. Tarjan 9/15/13 Introduction Welcome to COS 226! This is the first of a series of occasional notes that I plan to

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Data Stream Processing Topics Model Issues System Issues Distributed Processing Web-Scale Streaming 3 Data Streams Continuous

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

Orchid: Integrating Schema Mapping and ETL

Orchid: Integrating Schema Mapping and ETL Orchid: Integrating Schema Mapping and ETL (Exted Version) Stefan Dessloch, Mauricio A. Hernández, Ryan Wisnesky, Ahmed Radwan, Jindan Zhou Department of Computer Science, University of Kaiserslautern

More information

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data FedX: A Federation Layer for Distributed Query Processing on Linked Open Data Andreas Schwarte 1, Peter Haase 1,KatjaHose 2, Ralf Schenkel 2, and Michael Schmidt 1 1 fluid Operations AG, Walldorf, Germany

More information

Data Structure: Relational Model

Data Structure: Relational Model Applications View of a Relational Database Management (RDBMS) System Why applications use DBs? Persistent data structure Independent from processes using the data Also, large volume of data High-level

More information

Theorem proving. PVS theorem prover. Hoare style verification PVS. More on embeddings. What if. Abhik Roychoudhury CS 6214

Theorem proving. PVS theorem prover. Hoare style verification PVS. More on embeddings. What if. Abhik Roychoudhury CS 6214 Theorem proving PVS theorem prover Abhik Roychoudhury National University of Singapore Both specification and implementation can be formalized in a suitable logic. Proof rules for proving statements in

More information

Practice Problems for the Final

Practice Problems for the Final ECE-250 Algorithms and Data Structures (Winter 2012) Practice Problems for the Final Disclaimer: Please do keep in mind that this problem set does not reflect the exact topics or the fractions of each

More information

Using Multi-Level Graphs for Timetable Information. Frank Schulz, Dorothea Wagner, and Christos Zaroliagis

Using Multi-Level Graphs for Timetable Information. Frank Schulz, Dorothea Wagner, and Christos Zaroliagis Using Multi-Level Graphs for Timetable Information Frank Schulz, Dorothea Wagner, and Christos Zaroliagis Overview 1 1. Introduction Timetable Information - A Shortest Path Problem 2. Multi-Level Graphs

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Module 07: Efficiency

Module 07: Efficiency Module 07: Efficiency Topics: Basic introduction to run-time efficiency Analyzing recursive code Analyzing iterative code 1 Consider the following two ways to calculate the maximum of a nonempty list.

More information

Massively Parallel Seesaw Search for MAX-SAT

Massively Parallel Seesaw Search for MAX-SAT Massively Parallel Seesaw Search for MAX-SAT Harshad Paradkar Rochester Institute of Technology hp7212@rit.edu Prof. Alan Kaminsky (Advisor) Rochester Institute of Technology ark@cs.rit.edu Abstract The

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Design Metrics for Data Warehouse Evolution

Design Metrics for Data Warehouse Evolution Design Metrics for Data Warehouse Evolution George Papastefanatos 1, Panos Vassiliadis 2, Alkis Simitsis 3, and Yannis Vassiliou 1 1 National Technical University of Athens, Athens, Hellas {gpapas,yv}@dbnet.ece.ntua.gr

More information

Best Practice - Pentaho OLAP Design Guidelines

Best Practice - Pentaho OLAP Design Guidelines Best Practice - Pentaho OLAP Design Guidelines This page intentionally left blank. Contents Overview... 1 Schema Maintenance Tools... 2 Use Schema Workbench to create and maintain schemas... 2 Limit number

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

8) A top-to-bottom relationship among the items in a database is established by a

8) A top-to-bottom relationship among the items in a database is established by a MULTIPLE CHOICE QUESTIONS IN DBMS (unit-1 to unit-4) 1) ER model is used in phase a) conceptual database b) schema refinement c) physical refinement d) applications and security 2) The ER model is relevant

More information

Two-level Data Staging ETL for Transaction Data

Two-level Data Staging ETL for Transaction Data Two-level Data Staging ETL for Transaction Data Xiufeng Liu University of Waterloo, Canada xiufeng.liu@uwaterloo.ca ABSTRACT In data warehousing, Extract-Transform-Load (ETL) extracts the data from data

More information

CMP-3440 Database Systems

CMP-3440 Database Systems CMP-3440 Database Systems Relational DB Languages Relational Algebra, Calculus, SQL Lecture 05 zain 1 Introduction Relational algebra & relational calculus are formal languages associated with the relational

More information

Database System Concepts

Database System Concepts Chapter 14: Optimization Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2007/2008 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth and Sudarshan.

More information

Representation Learning for Clustering: A Statistical Framework

Representation Learning for Clustering: A Statistical Framework Representation Learning for Clustering: A Statistical Framework Hassan Ashtiani School of Computer Science University of Waterloo mhzokaei@uwaterloo.ca Shai Ben-David School of Computer Science University

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

Databases = Categories

Databases = Categories Databases = Categories David I. Spivak dspivak@math.mit.edu Mathematics Department Massachusetts Institute of Technology Presented on 2010/09/16 David I. Spivak (MIT) Databases = Categories Presented on

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing CS 4604: Introduction to Database Management Systems B. Aditya Prakash Lecture #10: Query Processing Outline introduction selection projection join set & aggregate operations Prakash 2018 VT CS 4604 2

More information

FDB: A Query Engine for Factorised Relational Databases

FDB: A Query Engine for Factorised Relational Databases FDB: A Query Engine for Factorised Relational Databases Nurzhan Bakibayev, Dan Olteanu, and Jakub Zavodny Oxford CS Christan Grant cgrant@cise.ufl.edu University of Florida November 1, 2013 cgrant (UF)

More information