Logical Optimization of ETL Workflows
|
|
- Bernard Brooks
- 6 years ago
- Views:
Transcription
1 Logical Optimization of ETL Workflows Alkis Simitsis Nat. Tech. University of Athens Panos Vassiliadis University of Ioannina Timos Sellis Nat. Tech. University of Athens Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/
2 Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/ Extraction Transformation Loading (ETL) Workflows HDMS'05 - A. Simitsis - 25/08/
3 ETL workflows are not big queries S1 γ V1 U S2 γ V2 Sources TIME DW HDMS'05 - A. Simitsis - 25/08/ ETL workflows are not big queries DS.PS_NEW 1 DS.PS_NEW 1.PKEY, DS.PS_OLD 1.PKEY SUPPKEY=1 DS.PS 1.PKEY, LOOKUP_PS.SKEY, SUPPKEY COST DATE DS.PS_OLD 1 DIFF 1 DS.PS 1 Add_SPK 1 SK 1 rejected $2 rejected A2EDate rejected U Log Log Log DS.PS_NEW 2 DS.PS_NEW 2.PKEY, DS.PS_OLD 2.PKEY SUPPKEY=2 DS.PS 2.PKEY, LOOKUP_PS.SKEY, SUPPKEY COST DATE=SYSDATE QTY>0 DIFF 2 DS.PS 2 Add_SPK 2 SK 2 NotNULL AddDate CheckQTY DS.PS_OLD 2 Log rejected Log rejected DSA PKEY, DAY MIN(COST) S 1 _PARTSUPP FTP 1 DW.PARTSUPP Aggregate 1 V1 DW.PARTSUPP.DATE, DAY PKEY, MONTH AVG(COST) S 2 _PARTSUPP FTP 2 TIME Aggregate 2 V2 Sources DW HDMS'05 - A. Simitsis - 25/08/
4 Problems The well known query optimization techniques are not sufficient, mostly due to the existence of data manipulation functions when is it valid to push an ETL activity in front of a function? black-box activities unknown semantics difficult/impossible/meaningless to express in relational algebra we cannot interfere with their interior (e.g., their source code) naming conflicts e.g., an attribute named COST in source 1 contains values in Dollars, while in source 2 contains values in Euros HDMS'05 - A. Simitsis - 25/08/ Related Work ETL and optimization Data Cleaning (Rahm & Do 2000) Duplicate detection (Monge 2000) Cleaning of data from the web (AJAX 2000) CPU optimization for ETL (Potter s Wheel 2001) Extraction, Recovery and Lineage Tracing Detection of differentials (Labio et al. 1997) Recovery (Labio et al. 2000) Lineage tracing (Cui 2001) Streams and optimization AURORA 2003 HDMS'05 - A. Simitsis - 25/08/
5 Equivalent workflows Can we push selection early enough? Can we aggregate before $2 takes place? What about naming conflicts? HDMS'05 - A. Simitsis - 25/08/ Aims How can we minimize the execution time of an ETL workflow? How can we deal with naming problems? When is it valid to change the position of an activity in an ETL workflow? How can we produce equivalent ETL workflows? In summary, the two problems that have risen are to determine which modifications of the workflow are legal which configuration is the best in terms of performance gains HDMS'05 - A. Simitsis - 25/08/
6 Problem Formulation and Contribution We set up the theoretical framework for the problem of the optimization of an ETL workflow by modeling the problem as a state-space search problem each state represents a particular design of the workflow as a graph we define transitions from one state to another we construct the state-space we search for a state that minimizes the execution time of the workflow HDMS'05 - A. Simitsis - 25/08/ Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/
7 States Each activity is characterized by A unique ID Input schemata Output schemata Semantics, expressed in an Input 2 extended relational algebra with black-box functions Each ETL workflow is a state, i.e., a DAG: Nodes: activities and relations Edges: provider relations Input 1 Activity Output HDMS'05 - A. Simitsis - 25/08/ State Generation through Transitions State Generation: a transition is a modification to a state (i.e., a variant of the workflow) that generates a semantically equivalent workflow Three categories for five transitions Swap Distribute and Factorize Merge and Split HDMS'05 - A. Simitsis - 25/08/
8 Transitions Swap two activities Locally swap two unary activities HDMS'05 - A. Simitsis - 25/08/ Transitions Distribute and Factorize Factorize: replacement of two homologous activities placed in two converging data flows by a new one Distribute: distribution of an activity into two converging data flows a,a 1,a 2 : homologous activities HDMS'05 - A. Simitsis - 25/08/
9 Transitions Distribute and Factorize An example a simple cost model: #rows cost: c SK =nlog 2 n and c σ =n application: input: 8 records in each flow selectivity: 50% for σ, 100% for the rest C 1 =2nlog 2 n+n=56 C 2 =2(n+(n/2)log 2 (n/2))=32 C 3 =2n+(n/2)log 2 (n/2)=24 HDMS'05 - A. Simitsis - 25/08/ Transitions Merge and Split Merge / Split unary activities HDMS'05 - A. Simitsis - 25/08/
10 Transition Applicability We resolve naming problems by reducing all attribute names (for all activity and relation schemata) to a common terminology We annotate each activity with three extra schemata: Functionality schema: the set of input attributes that participate in the computation Generated / Projected-out schemata Easy to obtain these through a set of templates (Vassiliadis et CAiSE 03, Inf. Systems) E.g., the Surrogate Key Transformation requires an input production key, a generated surrogate key and a parameter specifying the source of the data HDMS'05 - A. Simitsis - 25/08/ Transition Applicability When is it valid to swap two activities? the functionality schema should be a subset of the input schema HDMS'05 - A. Simitsis - 25/08/
11 Transition Applicability When is it valid to swap two activities? the functionality schema should be a subset of the input schema input schemata should be subset of provider schemata HDMS'05 - A. Simitsis - 25/08/ Transition Correctness To prove correctness we employ cond: post-condition for activities true, when the activity is successfully executed involves activity s functionality schema e.g., $2 (#vrbl1) Cond: post-condition for ETL workflows $2 (COST) $2 (COST) conjunction of the post-conditions of all the workflow activities, arranged in the order of their execution ETL workflow equivalence: the schema of the data propagated to each target recordset, at the end of each flow, is identical the flows have the same post-conditions Theorem: All transitions produce equivalent workflows HDMS'05 - A. Simitsis - 25/08/
12 Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/ Algorithms Simple to produce an Exhaustive search (ES algorithm) We generate all the possible states that can be generated by applying all the applicable transitions to every state Heuristic search (HS algorithm) Pre-Processing: Merge before any other transition Phase 1: Swap only in linear paths Phase 2: Factorize only homologous activities placed in two converging paths Phase 3: Distribute only if transformation is applicable Phase 4: Swap again only in the linear paths of the new states produced in phases 2 and 3 Greedy variant of the Heuristic search (HS-G algorithm) Swap only if we gain in cost HDMS'05 - A. Simitsis - 25/08/
13 Experiments Setup Three categories for 40 different ETL workflows Small: 20 activities (avg) Medium: 40 activities (avg) Large: 70 activities (avg) Quality of solution HS: near optimal HS-G: good for small/medium and over 60% for large workflows Improvement of solution HS: over 70% HS-G: over 60% for small and medium, and over 47% for large workflows Quality Ratio Avg Improvement (%) Quality of Solution (% avg) 40 small medium large Type of Scenario 80,00% 70,00% 60,00% 50,00% Improvement of Solution 40,00% small medium large Type of Scenario HS HS-G HS HS-G HDMS'05 - A. Simitsis - 25/08/ Experiments Volume of state space ES: exponentially increases HS: grows for large scenarios HS-G: keeps state space small Evolution of time ES: in medium/large scenarios did not terminate within 40h HS: avg worst case is ~35min for large scenarios, while the gain in the execution time outreaches 70% 1,E+06 HS-G: for large 1,E+05 scenarios is quicker 1,E+04 1,E+03 than HS, but the gain 1,E+02 is only 47% (avg) 1,E+01 Time (logarithmic scale) 1,E+00 Evolution of Time small medium large Type of Scenario Number of Visited States Volume of State Space small medium large Type of Scenario Evolution of Time ES HS HS-G Time (sec) small medium large Type of Scenario ES HS HS-G HS HS-G HDMS'05 - A. Simitsis - 25/08/
14 Outline Introduction Modeling ETL Optimization as a State-Space Problem Problem formulation State Generation Transition Applicability State Space Search & Experimental Evaluation Conclusions & Future Work HDMS'05 - A. Simitsis - 25/08/ Conclusions & Future Work We set up the problem of optimization of ETL workflows as a state-space search problem, by modeling an ETL workflow as a state We define transitions in the search space, along with their applicability and prove the correctness of applicability rules We provide three search algorithms and explore their performance Future work can be pursued in different directions: Optimization at the physical level ETL Optimization for non-traditional data (XML, biomedical, ) Parallel ETL processing HDMS'05 - A. Simitsis - 25/08/
15 Thank you HDMS'05 - A. Simitsis - 25/08/
On the Logical Modeling of ETL Processes
On the Logical Modeling of ETL Processes Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos National Technical University of Athens {pvassil,asimi,spiros}@dbnet.ece.ntua.gr What are Extract-Transform-Load
More informationModeling ETL Activities as Graphs
Modeling ETL Activities as Graphs Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos National Technical University of Athens, Dept. of Electrical and Computer Eng., Computer Science Division, Iroon
More informationConceptual modeling for ETL
Conceptual modeling for ETL processes Author: Dhananjay Patil Organization: Evaltech, Inc. Evaltech Research Group, Data Warehousing Practice. Date: 08/26/04 Email: erg@evaltech.com Abstract: Due to the
More informationFrom conceptual design to performance optimization of ETL workflows: current state of research and open problems
The VLDB Journal (2017) 26:777 801 DOI 10.1007/s00778-017-0477-2 REGULAR PAPER From conceptual design to performance optimization of ETL workflows: current state of research and open problems Syed Muhammad
More informationGraph-Based Modeling of ETL Activities with Multi-level Transformations and Updates
Graph-Based Modeling of ETL Activities with Multi-level Transformations and Updates Alkis Simitsis 1, Panos Vassiliadis 2, Manolis Terrovitis 1, and Spiros Skiadopoulos 1 1 National Technical University
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution
More informationBlueprints and Measures for ETL Workflows
Blueprints and Measures for ETL Workflows Panos Vassiliadis 1, Alkis Simitsis 2, Manolis Terrovitis 2, Spiros Skiadopoulos 2 1 University of Ioannina, Dept. of Computer Science, Ioannina, Hellas pvassil@cs.uoi.gr
More informationBlueprints for ETL workflows
Blueprints for ETL workflows Panos Vassiliadis 1, Alkis Simitsis 2, Manolis Terrovitis 2, Spiros Skiadopoulos 2 1 University of Ioannina, Dept. of Computer Science, 45110, Ioannina, Greece pvassil@cs.uoi.gr
More informationA Taxonomy of ETL Activities
Panos Vassiliadis University of Ioannina Ioannina, Greece pvassil@cs.uoi.gr A Taxonomy of ETL Activities Alkis Simitsis HP Labs Palo Alto, CA, USA alkis@hp.com Eftychia Baikousi University of Ioannina
More informationRule-based Management of Schema Changes at ETL sources
Rule-based Management of Schema Changes at ETL sources George Papastefanatos 1, Panos Vassiliadis 2, Alkis Simitsis 3, Timos Sellis 1, Yannis Vassiliou 1 1 National Technical University of Athens, Athens,
More informationIntegrating ETL Processes from Information Requirements
Integrating ETL Processes from Information Requirements Petar Jovanovic 1, Oscar Romero 1, Alkis Simitsis 2, and Alberto Abelló 1 1 Universitat Politècnica de Catalunya, BarcelonaTech Barcelona, Spain
More informationBenchmarking ETL Workflows
Benchmarking ETL Workflows Alkis Simitsis 1, Panos Vassiliadis 2, Umeshwar Dayal 1, Anastasios Karagiannis 2, Vasiliki Tziovara 2 1 HP Labs, Palo Alto, CA, USA, {alkis, Umeshwar.Dayal}@hp.com 2 University
More informationWhat-if Analysis for Data Warehouse Evolution
What-if Analysis for Data Warehouse Evolution George Papastefanatos 1, Panos Vassiliadis 2, Alkis Simitsis 3, and Yannis Vassiliou 1 1 National Technical University of Athens, Dept. of Electrical and Computer
More information. The problem: ynamic ata Warehouse esign Ws are dynamic entities that evolve continuously over time. As time passes, new queries need to be answered
ynamic ata Warehouse esign? imitri Theodoratos Timos Sellis epartment of Electrical and Computer Engineering Computer Science ivision National Technical University of Athens Zographou 57 73, Athens, Greece
More informationTarget and source schemas may contain integrity constraints. source schema(s) assertions relating elements of the global schema to elements of the
Data integration Data Integration System: target (integrated) schema source schema (maybe more than one) assertions relating elements of the global schema to elements of the source schema(s) Target and
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Data Stream Processing Topics Model Issues System Issues Distributed Processing Web-Scale Streaming 3 System Issues Architecture
More informationStaying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing
Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing Nesime Tatbul Uğur Çetintemel Stan Zdonik Talk Outline Problem Introduction Approach Overview Advance Planning with an
More informationQuery Processing & Optimization
Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction
More informationAdvantages of UML for Multidimensional Modeling
Advantages of UML for Multidimensional Modeling Sergio Luján-Mora (slujan@dlsi.ua.es) Juan Trujillo (jtrujillo@dlsi.ua.es) Department of Software and Computing Systems University of Alicante (Spain) Panos
More informationDesigning the Global Data Warehouse with SPJ Views
Designing the Global Data Warehouse with SPJ Views Dimitri Theodoratos, Spyros Ligoudistianos, and Timos Sellis Department of Electrical and Computer Engineering Computer Science Division National Technical
More informationLesson 12 - Operator Overloading Customising Operators
Lesson 12 - Operator Overloading Customising Operators Summary In this lesson we explore the subject of Operator Overloading. New Concepts Operators, overloading, assignment, friend functions. Operator
More informationRelational Model, Relational Algebra, and SQL
Relational Model, Relational Algebra, and SQL August 29, 2007 1 Relational Model Data model. constraints. Set of conceptual tools for describing of data, data semantics, data relationships, and data integrity
More informationDatabase System Concepts
Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth
More informationSYSTEMATIC ETL MANAGEMENT EXPERIENCES WITH HIGH-LEVEL OPERATORS (Practice-Oriented)
SYSTEMATIC ETL MANAGEMENT EXPERIENCES WITH HIGH-LEVEL OPERATORS (Practice-Oriented) Alexander Albrecht Hasso Plattner Institute for Software Systems Engineering University of Potsdam, Germany Alexander.Albrecht@hpi.uni-potsdam.de
More informationTwo hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE
COMP 62421 Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE Querying Data on the Web Date: Wednesday 24th January 2018 Time: 14:00-16:00 Please answer all FIVE Questions provided. They amount
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationChapter 4. Capturing the Requirements. 4th Edition. Shari L. Pfleeger Joanne M. Atlee
Chapter 4 Capturing the Requirements Shari L. Pfleeger Joanne M. Atlee 4th Edition It is important to have standard notations for modeling, documenting, and communicating decisions Modeling helps us to
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationarxiv: v1 [cs.db] 26 Jan 2017
The Many Faces of Data-centric Workflow Optimization: A Survey Georgia Kougka Aristotle University of Thessaloniki, Greece Anastasios Gounaris Aristotle University of Thessaloniki, Greece Alkis Simitsis
More informationData Warehouses Chapter 12. Class 10: Data Warehouses 1
Data Warehouses Chapter 12 Class 10: Data Warehouses 1 OLTP vs OLAP Operational Database: a database designed to support the day today transactions of an organization Data Warehouse: historical data is
More informationETL Best Practices and Techniques. Marc Beacom, Managing Partner, Datalere
ETL Best Practices and Techniques Marc Beacom, Managing Partner, Datalere Thank you Sponsors Experience 10 years DW/BI Consultant 20 Years overall experience Marc Beacom Managing Partner, Datalere Current
More informationThe Inverse of a Schema Mapping
The Inverse of a Schema Mapping Jorge Pérez Department of Computer Science, Universidad de Chile Blanco Encalada 2120, Santiago, Chile jperez@dcc.uchile.cl Abstract The inversion of schema mappings has
More informationIntroduction to Relational Databases. Introduction to Relational Databases cont: Introduction to Relational Databases cont: Relational Data structure
Databases databases Terminology of relational model Properties of database relations. Relational Keys. Meaning of entity integrity and referential integrity. Purpose and advantages of views. The relational
More informationMISO: Souping Up Big Data Query Processing with a Multistore System
MISO: Souping Up Big Data Query Processing with a Multistore System Jeff LeFevre, UC Santa Cruz* Jagan Sankaranarayanan, NEC Labs Hakan Hacıgümüş. NEC Labs Junichi Tatemura, NEC Labs Neoklis Polyzotis,
More informationCPS510 Database System Design Primitive SYSTEM STRUCTURE
CPS510 Database System Design Primitive SYSTEM STRUCTURE Naïve Users Application Programmers Sophisticated Users Database Administrator DBA Users Application Interfaces Application Programs Query Data
More informationIntroduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe
Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms
More informationChapter 3. The Multidimensional Model: Basic Concepts. Introduction. The multidimensional model. The multidimensional model
Chapter 3 The Multidimensional Model: Basic Concepts Introduction Multidimensional Model Multidimensional concepts Star Schema Representation Conceptual modeling using ER, UML Conceptual modeling using
More informationSQL and Incomp?ete Data
SQL and Incomp?ete Data A not so happy marriage Dr Paolo Guagliardo Applied Databases, Guest Lecture 31 March 2016 SQL is efficient, correct and reliable 1 / 25 SQL is efficient, correct and reliable...
More informationNote: Select one full question from each unit
P.E.S COLLEGE OF ENGINEERING, MANDYA-571401 (An Autonomous Institution Under VTU Belgaum) Department of Master of Computer Applications Model Question Paper Data Structures Using C (P18MCA21) Credits :
More informationCSE 421: Introduction to Algorithms Complexity
CSE 421: Introduction to Algorithms Complexity Yin-Tat Lee 1 Defining Efficiency Runs fast on typical real problem instances Pros: Sensible, Bottom-line oriented Cons: Moving target (diff computers, programming
More informationRELATIONAL DATA MODEL
RELATIONAL DATA MODEL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY RELATIONAL DATA STRUCTURE (1) Relation: A relation is a table with columns and rows.
More informationChapter 4. The Relational Model
Chapter 4 The Relational Model Chapter 4 - Objectives Terminology of relational model. How tables are used to represent data. Connection between mathematical relations and relations in the relational model.
More informationData Warehouse Logical Design. Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato)
Data Warehouse Logical Design Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato) Data Mart logical models MOLAP (Multidimensional On-Line Analytical Processing) stores data
More informationCS 188: Artificial Intelligence Fall 2008
CS 188: Artificial Intelligence Fall 2008 Lecture 4: CSPs 9/9/2008 Dan Klein UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore 1 1 Announcements Grading questions:
More informationAnnouncements. CS 188: Artificial Intelligence Fall Large Scale: Problems with A* What is Search For? Example: N-Queens
CS 188: Artificial Intelligence Fall 2008 Announcements Grading questions: don t panic, talk to us Newsgroup: check it out Lecture 4: CSPs 9/9/2008 Dan Klein UC Berkeley Many slides over the course adapted
More informationTAG: A TINY AGGREGATION SERVICE FOR AD-HOC SENSOR NETWORKS
TAG: A TINY AGGREGATION SERVICE FOR AD-HOC SENSOR NETWORKS SAMUEL MADDEN, MICHAEL J. FRANKLIN, JOSEPH HELLERSTEIN, AND WEI HONG Proceedings of the Fifth Symposium on Operating Systems Design and implementation
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationOASIS: Self-tuning Storage for Applications
OASIS: Self-tuning Storage for Applications Kostas Magoutis, Prasenjit Sarkar, Gauri Shah 14 th NASA Goddard- 23 rd IEEE Mass Storage Systems Technologies, College Park, MD, May 17, 2006 Outline Motivation
More informationLoad Shedding in a Data Stream Manager
Load Shedding in a Data Stream Manager Nesime Tatbul, Uur U Çetintemel, Stan Zdonik Brown University Mitch Cherniack Brandeis University Michael Stonebraker M.I.T. The Overload Problem Push-based data
More informationEfficient Data Structures for Tamper-Evident Logging
Efficient Data Structures for Tamper-Evident Logging Scott A. Crosby Dan S. Wallach Rice University Everyone has logs Tamper evident solutions Current commercial solutions Write only hardware appliances
More informationETL Work Flow for Extract Transform Loading
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 6, June 2014, pg.610
More informationQuery Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!
Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13! q Overview! q Optimization! q Measures of Query Cost! Query Evaluation! q Sorting! q Join Operation! q Other
More informationAdvanced Data Management Technologies
ADMT 2017/18 Unit 13 J. Gamper 1/42 Advanced Data Management Technologies Unit 13 DW Pre-aggregation and View Maintenance J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements:
More informationTania Tudorache Stanford University. - Ontolog forum invited talk04. October 2007
Collaborative Ontology Development in Protégé Tania Tudorache Stanford University - Ontolog forum invited talk04. October 2007 Outline Introduction and Background Tools for collaborative knowledge development
More informationQuerying Data with Transact SQL
Course 20761A: Querying Data with Transact SQL Course details Course Outline Module 1: Introduction to Microsoft SQL Server 2016 This module introduces SQL Server, the versions of SQL Server, including
More informationCall: Datastage 8.5 Course Content:35-40hours Course Outline
Datastage 8.5 Course Content:35-40hours Course Outline Unit -1 : Data Warehouse Fundamentals An introduction to Data Warehousing purpose of Data Warehouse Data Warehouse Architecture Operational Data Store
More informationHorizontal Aggregations for Mining Relational Databases
Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh,
More informationIan Kenny. December 1, 2017
Ian Kenny December 1, 2017 Introductory Databases Query Optimisation Introduction Any given natural language query can be formulated as many possible SQL queries. There are also many possible relational
More informationCS 188: Artificial Intelligence Spring Today
CS 188: Artificial Intelligence Spring 2006 Lecture 7: CSPs II 2/7/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Today More CSPs Applications Tree Algorithms Cutset
More informationELASTIC PERFORMANCE FOR ETL+Q PROCESSING
ELASTIC PERFORMANCE FOR ETL+Q PROCESSING ABSTRACT Pedro Martins, Maryam Abbasi, Pedro Furtado Department of Informatics, University of Coimbra, Portugal Most data warehouse deployments are not prepared
More informationRelational Databases
Relational Databases Jan Chomicki University at Buffalo Jan Chomicki () Relational databases 1 / 49 Plan of the course 1 Relational databases 2 Relational database design 3 Conceptual database design 4
More informationRelational Algebra 1
Relational Algebra 1 Relational Algebra Last time: started on Relational Algebra What your SQL queries are translated to for evaluation A formal query language based on operators Rel Rel Op Rel Op Rel
More informationCS224W: Analysis of Networks Jure Leskovec, Stanford University
HW2 Q1.1 parts (b) and (c) cancelled. HW3 released. It is long. Start early. CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu 10/26/17 Jure Leskovec, Stanford
More informationMagda Balazinska - CSE 444, Spring
Query Optimization Algorithm CE 444: Database Internals Lectures 11-12 Query Optimization (part 2) Enumerate alternative plans (logical & physical) Compute estimated cost of each plan Compute number of
More informationDATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY
DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY CHARACTERISTICS Data warehouse is a central repository for summarized and integrated data
More informationAlgorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)
Chapter 19 Algorithms for Query Processing and Optimization 0. Introduction to Query Processing (1) Query optimization: The process of choosing a suitable execution strategy for processing a query. Two
More informationRelational Model: History
Relational Model: History Objectives of Relational Model: 1. Promote high degree of data independence 2. Eliminate redundancy, consistency, etc. problems 3. Enable proliferation of non-procedural DML s
More informationFrom Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019
From Single Purpose to Multi Purpose Data Lakes Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 Agenda Data Lakes Multiple Purpose Data Lakes Customer Example Demo Takeaways
More informationQueries with Order-by Clauses and Aggregates on Factorised Relational Data
Queries with Order-by Clauses and Aggregates on Factorised Relational Data Tomáš Kočiský Magdalen College University of Oxford A fourth year project report submitted for the degree of Masters of Mathematics
More informationLyublena Antova, Christoph Koch, and Dan Olteanu Saarland University Database Group Saarbr ucken, Germany Presented By: Rana Daud
Lyublena Antova, Christoph Koch, and Dan Olteanu Saarland University Database Group Saarbr ucken, Germany 2007 1 Presented By: Rana Daud Introduction Application Scenarios I-SQL World-Set Algebra Algebraic
More informationAdvanced Data Management Technologies Written Exam
Advanced Data Management Technologies Written Exam 02.02.2016 First name Student number Last name Signature Instructions for Students Write your name, student number, and signature on the exam sheet. This
More informationQuerying the Sensor Network. TinyDB/TAG
Querying the Sensor Network TinyDB/TAG 1 TAG: Tiny Aggregation Query Distribution: aggregate queries are pushed down the network to construct a spanning tree. Root broadcasts the query and specifies its
More informationDIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY Reham I. Abdel Monem 1, Ali H. El-Bastawissy 2 and Mohamed M. Elwakil 3 1 Information Systems Department, Faculty of computers and information,
More informationAutomatically Synthesizing SQL Queries from Input-Output Examples
Automatically Synthesizing SQL Queries from Input-Output Examples Sai Zhang University of Washington Joint work with: Yuyin Sun Goal: making it easier for non-expert users to write correct SQL queries
More informationBob s Notes for COS 226 Fall : Introduction, Union-Find, and Percolation. Robert E. Tarjan 9/15/13. Introduction
Bob s Notes for COS 226 Fall 2013 1: Introduction, Union-Find, and Percolation Robert E. Tarjan 9/15/13 Introduction Welcome to COS 226! This is the first of a series of occasional notes that I plan to
More informationSystems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Data Stream Processing Topics Model Issues System Issues Distributed Processing Web-Scale Streaming 3 Data Streams Continuous
More informationCSE 190D Spring 2017 Final Exam Answers
CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join
More informationOrchid: Integrating Schema Mapping and ETL
Orchid: Integrating Schema Mapping and ETL (Exted Version) Stefan Dessloch, Mauricio A. Hernández, Ryan Wisnesky, Ahmed Radwan, Jindan Zhou Department of Computer Science, University of Kaiserslautern
More informationFedX: A Federation Layer for Distributed Query Processing on Linked Open Data
FedX: A Federation Layer for Distributed Query Processing on Linked Open Data Andreas Schwarte 1, Peter Haase 1,KatjaHose 2, Ralf Schenkel 2, and Michael Schmidt 1 1 fluid Operations AG, Walldorf, Germany
More informationData Structure: Relational Model
Applications View of a Relational Database Management (RDBMS) System Why applications use DBs? Persistent data structure Independent from processes using the data Also, large volume of data High-level
More informationTheorem proving. PVS theorem prover. Hoare style verification PVS. More on embeddings. What if. Abhik Roychoudhury CS 6214
Theorem proving PVS theorem prover Abhik Roychoudhury National University of Singapore Both specification and implementation can be formalized in a suitable logic. Proof rules for proving statements in
More informationPractice Problems for the Final
ECE-250 Algorithms and Data Structures (Winter 2012) Practice Problems for the Final Disclaimer: Please do keep in mind that this problem set does not reflect the exact topics or the fractions of each
More informationUsing Multi-Level Graphs for Timetable Information. Frank Schulz, Dorothea Wagner, and Christos Zaroliagis
Using Multi-Level Graphs for Timetable Information Frank Schulz, Dorothea Wagner, and Christos Zaroliagis Overview 1 1. Introduction Timetable Information - A Shortest Path Problem 2. Multi-Level Graphs
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join
More informationModule 07: Efficiency
Module 07: Efficiency Topics: Basic introduction to run-time efficiency Analyzing recursive code Analyzing iterative code 1 Consider the following two ways to calculate the maximum of a nonempty list.
More informationMassively Parallel Seesaw Search for MAX-SAT
Massively Parallel Seesaw Search for MAX-SAT Harshad Paradkar Rochester Institute of Technology hp7212@rit.edu Prof. Alan Kaminsky (Advisor) Rochester Institute of Technology ark@cs.rit.edu Abstract The
More informationQuery Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016
Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,
More informationDesign Metrics for Data Warehouse Evolution
Design Metrics for Data Warehouse Evolution George Papastefanatos 1, Panos Vassiliadis 2, Alkis Simitsis 3, and Yannis Vassiliou 1 1 National Technical University of Athens, Athens, Hellas {gpapas,yv}@dbnet.ece.ntua.gr
More informationBest Practice - Pentaho OLAP Design Guidelines
Best Practice - Pentaho OLAP Design Guidelines This page intentionally left blank. Contents Overview... 1 Schema Maintenance Tools... 2 Use Schema Workbench to create and maintain schemas... 2 Limit number
More informationChapter 12: Query Processing
Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation
More information8) A top-to-bottom relationship among the items in a database is established by a
MULTIPLE CHOICE QUESTIONS IN DBMS (unit-1 to unit-4) 1) ER model is used in phase a) conceptual database b) schema refinement c) physical refinement d) applications and security 2) The ER model is relevant
More informationTwo-level Data Staging ETL for Transaction Data
Two-level Data Staging ETL for Transaction Data Xiufeng Liu University of Waterloo, Canada xiufeng.liu@uwaterloo.ca ABSTRACT In data warehousing, Extract-Transform-Load (ETL) extracts the data from data
More informationCMP-3440 Database Systems
CMP-3440 Database Systems Relational DB Languages Relational Algebra, Calculus, SQL Lecture 05 zain 1 Introduction Relational algebra & relational calculus are formal languages associated with the relational
More informationDatabase System Concepts
Chapter 14: Optimization Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2007/2008 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth and Sudarshan.
More informationRepresentation Learning for Clustering: A Statistical Framework
Representation Learning for Clustering: A Statistical Framework Hassan Ashtiani School of Computer Science University of Waterloo mhzokaei@uwaterloo.ca Shai Ben-David School of Computer Science University
More informationChapter 12: Query Processing. Chapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join
More informationDatabases = Categories
Databases = Categories David I. Spivak dspivak@math.mit.edu Mathematics Department Massachusetts Institute of Technology Presented on 2010/09/16 David I. Spivak (MIT) Databases = Categories Presented on
More informationNotes for Lecture 24
U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined
More informationCS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing
CS 4604: Introduction to Database Management Systems B. Aditya Prakash Lecture #10: Query Processing Outline introduction selection projection join set & aggregate operations Prakash 2018 VT CS 4604 2
More informationFDB: A Query Engine for Factorised Relational Databases
FDB: A Query Engine for Factorised Relational Databases Nurzhan Bakibayev, Dan Olteanu, and Jakub Zavodny Oxford CS Christan Grant cgrant@cise.ufl.edu University of Florida November 1, 2013 cgrant (UF)
More information