DATA LINEAGE TRACING PROCESS IMPROVEMENT IN FINANCIAL INSTITUTION'S DATA WAREHOUSE


TALLINN UNIVERSITY OF TECHNOLOGY
Faculty of Information Technology
Department of Informatics
IDU70LT
Alla Tšornenkaja IAMP
DATA LINEAGE TRACING PROCESS IMPROVEMENT IN FINANCIAL INSTITUTION'S DATA WAREHOUSE
Master's thesis
Supervisor: Eduard Ševtšenko, Ph.D., Associate Professor
Co-supervisor: Igor Artemtšuk, M.A., Developer
Tallinn 2016

TALLINNA TEHNIKAÜLIKOOL
Infotehnoloogia teaduskond
Informaatikainstituut
IDU70LT
Alla Tšornenkaja IAPM
ANDMETE ELUTSÜKLI JÄLGIMISPROTSESSI PARENDAMINE FINANTSETTEVÕTTE ANDMEAIDAS
Magistritöö
Juhendaja: Eduard Ševtšenko, Ph.D., Dotsent
Kaasjuhendaja: Igor Artemtšuk, M.A., Arendaja
Tallinn 2016

Author's declaration of originality

I hereby certify that I am the sole author of this thesis. All materials used, references to the literature and the work of others have been cited. This thesis has not been presented for examination anywhere else.

Author: Alla Tšornenkaja

Abstract

This thesis presents one possible solution for financial data lineage tracing in the existing Enterprise Data Warehouse environment of the Financial Institution N. The research gives an overview of the different lineage types and existing prototype solutions presented in other scientific works, and formalizes the possibilities and constraints of applying the current approaches in the specific data warehousing environment. As a result of this thesis, a prototype solution has been developed, in which some ideas were adopted from the studied material and others were proposed by the author. The prototype solution has been created according to the provided requirements. Several conclusions were drawn during the prototype development stage, and the various development constraints are outlined separately. The prototype solution has been tested according to the requirements provided, and the outcome is described in detail. Additionally, a graphical visualization has been made in order to give a better overview of the contribution of the current work. Several directions for future improvements of the prototype solution are suggested.

This thesis is written in English and is 64 pages long, including 4 chapters, 16 figures and 26 tables.

Annotatsioon

Andmete elutsükli jälgimisprotsessi parendamine finantsettevõtte andmeaidas

Käesolevas töös on esitatud üks võimalik prototüüplahendus finantsnäitajate andmete elutsükli jälgimiseks olemasolevas andmeaida keskkonnas finantsettevõttes N. Väljatöötatud prototüüplahendus põhineb etteantud nõuetel. Antud uurimistöös käsitletakse erinevat tüüpi andmete elutsükleid ja olemasolevaid prototüüplahendusi, mis on esitatud teistes teaduslikes töödes. Lisaks tehakse ülevaade antud lahenduste kohaldamise võimalustest ja piirangutest konkreetse andmeaida keskkonnas. Uurimistöö metoodika on konstruktiivne uuring, mis eeldab, et enne praktilise probleemi lahendamist uuritakse põhjalikult olemasolevaid teaduslikke materjale. Käesoleva töö tulemusena määrati kindlaks metaandmete kasutamise piirangud finantsnäitajate elutsükli jälgimiseks, mis pikemas perspektiivis võiksid olla kõrvaldatud metaandmete kvaliteedi parandamise käigus. Tuuakse välja põhjalik prototüübi kirjeldus ning programmi kood on esitatud töö lisades. Prototüüplahendus on testitud vastavalt etteantud nõuetele ning testid on detailselt kirjeldatud. Lisaks on tehtud graafiline visualiseerimine, et anda parem ülevaade tehtud tööpanusest. Olemasolevas töös pakutakse ka prototüübi võimalikud arengusuunad tulevikuks. Käesolev töö on üldiselt jagatud kaheks suureks osaks, millest üks osa on teoreetiline ja teine praktiline. Teoreetilises osas kirjeldatakse erinevaid olemasolevaid praktikaid ja meetodeid andmete elutsükli jälgimiseks andmeaidas. Praktilises osas kirjeldatakse prototüüplahenduse arendust ning sellega kaasnevaid kitsendusi. Lõputöö on kirjutatud inglise keeles ning sisaldab teksti 64 leheküljel, 4 peatükki, 16 joonist, 26 tabelit.

List of abbreviations and terms

ASPJ query: A denotation for the Aggregate-Select-Project-Join query types.
BI: Business Intelligence is the use of computing technologies for the identification, discovery and analysis of business data (for example, sales revenue, products, costs and incomes) [1].
Conjunction: A compound proposition that is true if and only if all of its component propositions are true [2].
Database classifier: Any data type code defined in EDW (financial metrics, products, etc.).
Data Mart: A subject-oriented archive that stores data and uses the retrieved set of information to assist and support the requirements involved within a particular business function or department [1].
DBMS: Database Management System is a collection of programs that enables storing, modifying and extracting information from a database [3].
Disjunction: A compound proposition that is true if and only if at least one of a number of alternatives is true [2].
EDW: Enterprise Data Warehouse is a unified database that holds all the business information about the organization and makes it accessible all across the company [1].
ETL: Extract-Transform-Load is the process of extracting, transforming and loading data during database use [1].
SQL: Structured Query Language is a standardized query language for requesting information from a database [3].
Teradata FastLoad: A Teradata functionality to load huge amounts of data from a flat file into empty tables [4].
Teradata MultiLoad: A Teradata functionality to load multiple tables at one time from either a LAN or Channel environment [4].
Teradata TPump: A Teradata functionality to load data one row at a time, using row hash locks [4].

Table of contents

1 Introduction
1.1 The Background and the Problem
1.2 The Tasks of this Thesis
1.3 Methodology
1.4 Overview of the Work
2 Lineage or Provenance
2.1 Introduction
2.2 Provenance classification
2.3 Provenance calculation approaches
2.4 Provenance in Data Warehouses
2.4.1 Provenance for view structures
2.4.2 Provenance for ETL processes
2.5 Teradata possibilities for Data Lineage tracing
2.5.1 Teradata Meta Data Services (MDS)
2.5.2 Teradata Mapping Manager (TMM)
2.6 Overview of some existing Lineage tracing Solutions for Relational Databases
2.6.1 How provenance: TRIO
2.6.2 How provenance: SPIDER
2.6.3 Where Provenance: DBNOTES
3 Data Lineage tracing module development in Enterprise Data Warehouse
3.1 Introduction
3.2 Prototype Requirements and Analysis
3.3 Prototype Description
3.3.1 Data Scanner
3.3.2 Lineage solution for ETL processes
3.3.3 Lineage solution for View structures
3.4 Prototype Testing
3.5 Lineage Visualization
3.6 Future development directions

4 Summary
References

List of figures

Figure 1. Data provenance calculation approaches: (a) non-annotation approach, (b) annotation approach [6]
Figure 2. Overview of Data Warehouse Architecture [11, p 2]
Figure 3. Example of view definition for Metric_Description
Figure 4. Procedure ViewLineage algorithm for view lineage tracing
Figure 5. Query α sum(metric_amt) (Account_Metric) code representation
Figure 6. Query π Metrics.metric_name (σ Account_Metric_Sum.metric_nbr=Metrics.metric_nbr (Account_Metric_Sum ⋈ Metrics)) code representation
Figure 7. Transformation properties: (a) dispatcher, (b) aggregator, (c) black-box [10]
Figure 8. The sequence of ETL transformation steps to derive result table Metric_Sum from Account_Metric
Figure 9. Procedure TraceDS algorithm for tracing dispatcher's lineage
Figure 10. Procedure TraceAG algorithm for tracing aggregator's lineage
Figure 11. Query π account_nbr, metric_name, metric_amt (Metrics ⋈ Account_Metric) code representation
Figure 12. UML diagram for T3058_ACCOUNT_METRIC table
Figure 13. The object level and the classifier level example
Figure 14. UML diagram for Data Lineage module
Figure 15. Example visualization of Account_Metric_Type_Code 29 lineage
Figure 16. Example visualization of Account_Metric_Type_Code 171 lineage

List of tables

Table 1. Provenance classification
Table 2. Table holding the descriptions for some financial metrics
Table 3. Table holding the actually calculated metrics for some accounts
Table 4. Intermediate view Account_Metric_Sum
Table 5. Final view Metric_Description
Table 6. Description of transformation steps
Table 7. Result table produced by transformation step T1
Table 8. Result table produced by transformation step T2
Table 9. Result table produced by transformation step T3
Table 10. Categorization of systems according to provenance types
Table 11. Result table Metric_Description
Table 12. Metadata table used to store the lineage for a derived relation
Table 13. Target instance T
Table 14. Table holding the description of some financial metrics with annotations
Table 15. Table holding the actually calculated metrics for some accounts with annotations
Table 16. Result table Metric_Description showing annotation's propagation
Table 17. Functional requirements for the prototype solution
Table 18. Table holding relations between objects
Table 19. Table holding relations between classifiers
Table 20. Categorization of systems according to provenance types
Table 21. Prototype object usage summary
Table 22. Prototype transformation steps description
Table 23. DATA_GROUP attribute description
Table 24. Metadata scanners methods description
Table 25. Description of object categories
Table 26. Prototype Testing Report

1 Introduction

1.1 The Background and the Problem

This thesis presents one possible solution for financial data lineage tracing in the existing Enterprise Data Warehouse environment of the Financial Institution N. The research gives an overview of different lineage types and existing prototype solutions presented in other scientific works, and formalizes the possibilities and constraints of applying current approaches in the specific data warehouse environment. The aim of this work is to propose a possible lineage tracing solution according to the existing data warehouse structure and situation, and to develop a working prototype in the form of a metadata module.

Data warehousing systems gather information from different sources and integrate the retrieved data within the local structures. The data integration is usually performed with the help of numerous ETL processes which retrieve data from external or internal data sources, transform the extracted data to a predefined state and, finally, load the data to the destination location. While data transformations are usually done in the staging area layer, the final output is stored within the physical layer in tables. Next, from the physical layer the data is distributed to the logical layer via views. Some part of the data can be reloaded separately from the data warehouse to data mart structures for subject-oriented analysis. Finally, the data is reported to business analysts by means of reporting tools, OLAP, Ad-Hoc, or other possible solutions, which retrieve the data from the logical layer. Therefore, the data journey undergoes table-to-table or view-to-table transformations, carried out by means of ETL, as well as table-to-view transformations, carried out by means of view creation functionality.

Usually, the whole data movement model is much more complicated and the amount of stored attributes within the data warehouse is vast. The subject of this work is to concentrate only on retrieving the data flow for the financial figures, calculated within the Enterprise Data Warehouse, according to the available metadata and financial

metrics data. Nonetheless, future research can concentrate on extending the developed processes to other financial data lineage tracing.

The party interested in the current research is the Enterprise Data Warehouse management department of the Financial Institution N. The benefit of this work is to simplify the impact analysis for required financial metrics data modifications as well as to help answer the most frequently arising questions: What financial metrics are actually calculated from month to month? Which objects are using the financial metrics data? What will break if we change the calculation logic or close the calculations for some financial metric? The current research was carried out during winter and spring.

1.2 The Tasks of this Thesis

The following tasks are completed as the result of this thesis:
1. We research and explain different lineage tracing approaches presented in other scientific works. We show the algorithms of these solutions based on our own examples, and provide an analysis of the applicability of the current ideas to the formulated problem.
2. We use the knowledge obtained to find a solution for the financial metrics data lineage tracing problem in the current warehousing environment. We introduce the architecture and logic of a possible solution.
3. We create a prototype solution and illustrate the lineage tracing functionality as well as possible.

1.3 Methodology

The research methodology of this thesis is Constructive Research. The reason for selecting this methodology is that the existing scientific materials on the corresponding topic should be studied before the practical problem is solved. The construct is a data lineage tracing prototype, which should be developed as the result of this thesis.

The theoretical contribution of the current work is to provide the algorithm of the data lineage tracing solution, which can be implemented in the prototype.

1.4 Overview of the Work

The structure of the current thesis is as follows:
1. In the second chapter we provide an overview of the lineage classification and lineage calculation approaches. We also give a review of the existing lineage tracing solutions and algorithms based on relational algebra operations.
2. In the third chapter we describe the prototype solution development in detail. Additionally, we describe the performed test cases and present a graphical visualization of the financial metrics data lineage.

2 Lineage or Provenance

2.1 Introduction

In this chapter, we start our research by explaining the lineage concept in databases and categorizing the existing lineage types, in order to better formulate what type of lineage we wish to capture and whether there are any already proposed methods which can be used straight away.

The word lineage is used synonymously with the word provenance in the database community. Sometimes it is also referred to as pedigree, source attribution or source tagging. The idea which stands behind lineage (or provenance) is that it keeps the association of a data item with its derivation. In other words, the lineage provides the ability to track the data journey from its origin to the destination instance.

2.2 Provenance classification

Lineage-related studies have been performed since the early 1990s [5, p 2]. The provenance study efforts were carried out separately in two different areas of data management: provenance tracing for scientific workflow systems and for database management systems [6]. In the current work, we will concentrate on the lineage in data management systems, as we are mainly interested in the data residing in the Enterprise Data Warehouse (hereinafter referred to as EDW) database.

In order to get a better overview of the lineage types, we will systemize the possible differentiations found in scientific works in a table. Based on the categorization of the information found, we will analyze the lineage types we want to capture and for which we are searching a solution.

The most general differentiation of lineage is based on the granularity of the data captured. Currently there are two main trends for capturing lineage information: workflow (also known as coarse-grained) and data (also known as fine-grained)

provenance [7]. We provide an overview of both types; however, in the current work we will focus mainly on the fine-grained provenance.

Workflow (or coarse-grained) lineage describes the data processing tasks as a whole. It captures the whole history of the data derivation and represents it as a sequence of steps, where the understanding of what exactly happens during a transformation step can be missing. Nonetheless, the amount of steps and the depth of the information captured for workflow provenance can be stored on the level of platforms, software, versions, common descriptions of the transformation steps, etc.

Data (or fine-grained) lineage describes relatively detailed information about the derivation of a piece of data, which we will refer to as a data item. It captures the journey of particular data items like columns, tuples or attributes and represents it as a sequence of transformation steps.

Mainly, the lineage research questions are [6]:
1. Why was a data item derived? What are the pre-conditions for the data item to be derived?
2. How was a data item derived? How does the query transform the data?
3. What data was used to derive a data item? Where does the data come from?

With respect to these three main questions, three types of data provenance have been outlined: why-provenance, how-provenance and where-provenance [6].

The definition of why-provenance has been formalized by Buneman in [8]. The idea of why-provenance is to provide information about the witnesses of the output of a query. Therefore, why-provenance describes the set of all combinations of source data items, named a witness basis, that indicate the existence of an output data item in the result of the query. For example, let A and B be the attributes of a binary relation R1(A, B) with two tuples t1 = (1,2), t2 = (1,3), and let B and C be the attributes of another binary relation R2(B, C) with three tuples t3 = (2,3), t4 = (3,3), t5 = (3,4). The duplicate-eliminating result of the query π_A,C(R1 ⋈ R2) consists of two output tuples t6 = (1,3), t7 = (1,4). The why-provenance of tuple t6 = (1,3) will capture the witness basis as the set of witnesses {{t1, t3}, {t2, t4}}, and for t7 = (1,4) it will capture {{t2, t5}}.

In contrast to why-provenance, how-provenance provides additional information about how the source data items witness the output data item in the result of a query. In

addition to the witness basis, how-provenance supplements the information about logical operations like conjunctions (AND logical statement) and disjunctions (OR logical statement) used in the query. Consider the same example described above with two binary relations R1(A, B) and R2(B, C). In contrast to the why-provenance, the lineage for the output tuple t6 = (1,3) in how-provenance will be stored as ((t1 ∧ t3) ∨ (t2 ∧ t4)), and for t7 = (1,4) it will be stored as (t2 ∧ t5).

The where-provenance type is a little different from why- and how-provenance, and generally describes where the data item in the output of the query is copied from [4]. While why- and how-provenance capture the relationships between source and output data items, where-provenance describes the relationships between source and output locations of the data items. Consider the same example described above with two binary relations R1(A, B) and R2(B, C). The where-provenance for the output tuple t6 = (1,3) will keep the addresses of the locations where the values are copied from as {(R1, t1, A), (R1, t2, A)} for value 1 and {(R2, t3, C), (R2, t4, C)} for value 3. The same holds for the output tuple t7 = (1,4): the addresses kept will be {(R1, t2, A)} for value 1 and {(R2, t5, C)} for value 4.

The three data provenance types described above can be differentiated by the scope of the data covered; therefore, two more groups are outlined: external and internal data provenance [9]. The external lineage captures the data item journey through all possible nodes the data passes through, including external databases, programs, etc. In terms of the EDW, the external lineage of a data item would include the source database, where the value came from, and the destination database, where the value went to. In contrast to external lineage, the internal lineage captures the data item journey only within a particular database.

For a better overview of the provenance types studied above, we put together a simple categorization shown in Table 1.
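For illustration only, the running example used above can be reproduced in SQL roughly as follows. This is a sketch: the relation and column names simply follow the example, and DISTINCT reflects the duplicate-eliminating projection.

-- Base relations R1(A, B) and R2(B, C) from the running example
CREATE TABLE R1 (A INTEGER, B INTEGER);
CREATE TABLE R2 (B INTEGER, C INTEGER);
INSERT INTO R1 VALUES (1, 2);   -- t1
INSERT INTO R1 VALUES (1, 3);   -- t2
INSERT INTO R2 VALUES (2, 3);   -- t3
INSERT INTO R2 VALUES (3, 3);   -- t4
INSERT INTO R2 VALUES (3, 4);   -- t5

-- Duplicate-eliminating projection π_A,C(R1 ⋈ R2): returns t6 = (1,3) and t7 = (1,4)
SELECT DISTINCT R1.A, R2.C
FROM R1
INNER JOIN R2 ON R1.B = R2.B;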

Table 1. Provenance classification
Provenance classification
  Workflow (or coarse-grained)
  Data (or fine-grained)
    External: Why, How, Where
    Internal: Why, How, Where

2.3 Provenance calculation approaches

Formally, all of the existing research efforts on data lineage distinguish one of two approaches for calculating data provenance: the annotation approach (also known as the eager or bookkeeping approach) and the non-annotation approach (also known as the lazy approach) [7]. Both approaches are presented in Figure 1.

Figure 1. Data provenance calculation approaches: (a) non-annotation approach, (b) annotation approach [6].

In the annotation (or eager) approach, the query is re-engineered to carry additional information from the source instances to the output of the query. Generally, the original transformation query Q is modified to a query Q', which produces the same output as Q, but can additionally manage the annotations, as shown on the right side (b) of Figure 1. The advantage of this approach is that the lineage for a data item can be retrieved simply by examining its annotation, and there is no need to examine the source instances separately. The disadvantage comes from the fact that the additional information should be stored and processed along with the actual data, which can cause

storage overheads as well as performance overheads during the execution of queries in the database.

In the non-annotation (lazy) approach, the query Q is executed as it is, as shown on the left side (a) of Figure 1. In order to compute the provenance for an output data item, the source data, output data and the transformation should be analyzed separately. The advantage of this approach is that it can be deployed on an existing system as a separate module with no or only minor changes to the system, and it does not cause performance or storage overheads. The disadvantage of the non-annotation approach is that analyzing the provenance of the input, transformation and output data requires sophisticated techniques, and the approach cannot be used in case the source data becomes unavailable.

2.4 Provenance in Data Warehouses

A data warehouse (DW) is a collection of corporate information and data derived from external data sources. The main aim of the data warehouse is to support business decisions by allowing data consolidation, business intelligence analysis and financial reporting [10]. An overview of the common data warehouse architecture is shown in Figure 2.

Figure 2. Overview of Data Warehouse Architecture [11, p 2]

Data warehousing systems gather information from different sources and integrate the retrieved data within the local structures. The data integration is usually performed with the help of numerous ETL processes which retrieve data from external or internal data sources, transform the extracted data to a predefined state and, finally, load the data to the destination location. While data transformations are usually done in the staging area layer, the final output is stored within the physical layer in tables. Next, from the physical layer the data is distributed to the logical layer via views. Some part of the data can be reloaded separately from the data warehouse to data mart structures for subject-oriented analysis. Finally, the data is reported to business analysts by means of business intelligence (BI) reporting tools, OLAP, Ad-Hoc, or other possible solutions, which retrieve data from the logical layer.

There are five main features which distinguish an enterprise data warehouse from any other data warehouse [12]:
1. EDW should have a single version of truth and corresponding rules.

2. EDW should contain data for multiple subject areas like marketing, sales, finance, human resources, etc.
3. EDW should have a normalized design.
4. EDW should be implemented as a mission-critical environment.
5. EDW should be scalable across several dimensions and be able to handle the growth of data and the complexity of business processes.

An enterprise data warehouse environment should be able to handle huge amounts of data without becoming a data swamp, and therefore metadata is an integral part of keeping the understanding about the business data consolidated from different sources. The techniques for collecting metadata can sometimes be very sophisticated and are usually implemented at different levels of data processing, such as ETL or BI processing, etc. In terms of data provenance tracing, metadata should be considered the main source describing the data relations found in EDW.

Unfortunately, not many study efforts on data provenance describe the provenance tracing principles specifically in the data warehouse environment. There are even fewer works which provide any contribution to tracing data provenance for ETL processes. The problem of tracing lineage in data warehouses has been formally formulated by Cui and Widom in [13] and [14], who divided the lineage problem into two parts: lineage tracing for views and lineage tracing for ETL processes. We will review both types in the next chapters.

2.4.1 Provenance for view structures

Cui and Widom [13] proposed a common fine-grained lineage tracing algorithm for any aggregate-select-project-join (hereinafter referred to as ASPJ) view, which is automatically generated from the view definition and a small amount of auxiliary information maintained together with the warehouse views. It is assumed that the view content is calculated by evaluating an algebraic view definition query tree bottom-up. Each operator in the tree calculates its result tuple-by-tuple based on the results of previous nodes, and passes the result upwards [13]. An example of a possible view definition for the view Metric_Description is shown in Figure 3.

Figure 3. Example of view definition for Metric_Description.

Any ASPJ view v can be transformed into an equivalent form v' which contains ⋈ (join), σ (selection), π (projection), α (aggregation) operator sequences. The form v' is named an ASPJ canonical form, where each ⋈, σ, π, α operator sequence is named an ASPJ segment. A view defined by one ASPJ segment is named a one-level ASPJ view, whose tuple lineage can be calculated using a single relational query named the lineage tracing query [13].

The lineage for a relational operator and for a view tuple are defined separately [13]. Let Op be a relational operator (⋈ (join), σ (selection), π (projection), α (aggregation)) and let T = Op(T1, ..., Tm) be a table that results from applying Op to tables T1, ..., Tm. The lineage of a tuple t ∈ T is a subset of the source table data (T1, ..., Tm), denoted as (T1*, ..., Tm*), such that Op(T1*, ..., Tm*) = {t}.

Let D be a database with base tables R1, ..., Rm, and let V = v(D) be a view over D. Consider a tuple t ∈ V:
1. v = Ri: if tuple t ∈ Ri, then t contributes to itself in V.
2. v = Op(v1, ..., vk), where vj is a view defined over D, j = 1..k: a tuple t* ∈ Ri contributes to t according to v if t* contributes to some intermediate tuple in Vj = vj(D) which in turn contributes to t according to Op.

The lineage tracing query can be defined next. Let V = v(T1, ..., Tm) be a one-level ASPJ view, and let T ⊆ V be a tuple set of V. T's lineage in T1, ..., Tm according to v can be

computed with the following query, where Split is an operator which breaks a table into multiple tables: TQ_T,v = Split_T1,...,Tm(σ_C(T1 ⋈ ... ⋈ Tm)).

The procedure algorithm for tracing view lineage is presented in Figure 4 [9].

procedure ViewLineage (T, v, D)
begin
  if v = R ∈ D then return (T);
  // else v = v'(v1, ..., vk) where v' is a one-level ASPJ query,
  // Vj = vj(D) is an intermediate view or a base table, j = 1..k
  (V1*, ..., Vk*) ← TQ(T, v', {V1, ..., Vk});
  D* ← ∅;
  for j ← 1 to k do
    // concatenate the lineage of each subview onto the result D*
    D* ← D* ⋅ ViewLineage(Vj*, vj, D);
  return (D*);
end

Figure 4. Procedure ViewLineage algorithm for view lineage tracing.

We will review the ViewLineage (T, v, D) procedure logic on a highly simplified database instance holding some financial figures (hereinafter referred to as financial metrics) and consisting of two base tables. We will omit the metrics calculation periods, as this has no impact on the lineage tracing logic. Assume that table Metrics (id, metric_nbr, metric_name) holds the descriptions of financial metrics that can be calculated based on the customer account information, as shown in Table 2. Table Account_Metric (id, account, metric_nbr, metric_amt) holds the actually calculated metrics for some accounts, as shown in Table 3. Assume tables Metrics and Account_Metric are base tables R1 and R2 respectively, such that {R1, R2} ⊆ D. The tuples are denoted as ti, i ∈ [1,7].

Table 2. Table holding the descriptions for some financial metrics.
Metrics
      metric_nbr | metric_name
t1:   1          | Account balance
t2:   2          | Interest income
t3:   3          | Transaction fee income

Table 3. Table holding the actually calculated metrics for some accounts.
Account_Metric
      account | metric_nbr | metric_amt
t4:   A001    | 1          |
t5:   A002    | 1          |
t6:   A002    | 2          |
t7:   A001    | 2          |

Assume that Account_Metric_Sum (metric_nbr, metric_amt) is the intermediate view produced by the aggregation query α sum(metric_amt) (Account_Metric) and represented in Table 4. The tuples are denoted as uj, j ∈ [1,2]. The code representation is given in Figure 5.

CREATE VIEW Account_Metric_Sum AS
SELECT Account_Metric.metric_nbr,
       SUM(Account_Metric.metric_amt) AS metric_amt
FROM Account_Metric
GROUP BY Account_Metric.metric_nbr;

Figure 5. Query α sum(metric_amt) (Account_Metric) code representation.

Table 4. Intermediate view Account_Metric_Sum.
Account_Metric_Sum
      metric_nbr | metric_amt
u1:   1          |
u2:   2          |

Assume that Metric_Description (metric_name) is the final view produced by the query π Metrics.metric_name (σ Account_Metric_Sum.metric_nbr=Metrics.metric_nbr (Account_Metric_Sum ⋈ Metrics)) and presented in Table 5. The tuples are denoted as on, n ∈ [1,2]. The code representation is given in Figure 6.

CREATE VIEW Metric_Description AS
SELECT Metrics.metric_name
FROM Account_Metric_Sum, Metrics
WHERE Account_Metric_Sum.metric_nbr = Metrics.metric_nbr;

Figure 6. Query π Metrics.metric_name (σ Account_Metric_Sum.metric_nbr=Metrics.metric_nbr (Account_Metric_Sum ⋈ Metrics)) code representation.
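For the tracing step that follows, the lineage tracing query TQ for the top-level ASPJ segment of Metric_Description can also be sketched directly in SQL. This is a sketch only: the Split operator is emulated by projecting each source relation in a separate query, and because the traced tuple set below is the whole view content, no extra restriction on the traced tuples is added.

-- Lineage of Metric_Description tuples in Metrics (first branch of Split)
SELECT DISTINCT Metrics.*
FROM Account_Metric_Sum, Metrics
WHERE Account_Metric_Sum.metric_nbr = Metrics.metric_nbr;

-- Lineage of Metric_Description tuples in Account_Metric_Sum (second branch of Split)
SELECT DISTINCT Account_Metric_Sum.*
FROM Account_Metric_Sum, Metrics
WHERE Account_Metric_Sum.metric_nbr = Metrics.metric_nbr;

-- Recursing into the intermediate view: lineage of Account_Metric_Sum* in Account_Metric
SELECT Account_Metric.*
FROM Account_Metric
WHERE Account_Metric.metric_nbr IN
      (SELECT Account_Metric_Sum.metric_nbr
       FROM Account_Metric_Sum, Metrics
       WHERE Account_Metric_Sum.metric_nbr = Metrics.metric_nbr);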

Table 5. Final view Metric_Description.
Metric_Description
      metric_name
o1:   Account balance
o2:   Interest income

Now we would like to retrieve the lineage for the set of tuples T = {o1, o2} and execute the procedure ViewLineage ({o1, o2}, Metric_Description, {Metrics, Account_Metric}). Firstly, we check whether T = {o1, o2} is simply selected from one of the base tables R1 or R2, and find that it is not. Next, we execute the lineage query for a one-level ASPJ view, or TQ_{o1,o2},Metric_Description = Split_Account_Metric_Sum, Metrics(σ_C(Account_Metric_Sum ⋈ Metrics)), where the selection condition C is Account_Metric_Sum.metric_nbr = Metrics.metric_nbr. As the result of this tracing query, we receive the subsets (Account_Metric_Sum*, Metrics*) such that Op(Account_Metric_Sum*, Metrics*) = {o1, o2}, where Op is the join operator. The subset Account_Metric_Sum* consists of tuples u1 and u2, and the subset Metrics* consists of tuples t1 and t2. Since Account_Metric_Sum is an intermediate view, we trace the lineage further for Account_Metric_Sum*, and obtain a subset Account_Metric*, which consists of tuples t4, t5, t6 and t7. The final lineage result is obtained by concatenating (Account_Metric*, Metrics*) = ({t4, t5, t6, t7}, {t1, t2}).

2.4.2 Provenance for ETL processes

Cui and Widom [14] proposed common lineage tracing algorithms for ETL processes in a data warehousing environment. Each ETL transformation step is classified into a certain transformation class according to the way the step maps the input data to the output data. The transformation class is named a property, and there are three main properties distinguished: dispatchers, aggregators and black-boxes. Figure 7 illustrates the dispatcher, aggregator and black-box properties, where I is a set of input data and O is a set of output data [14].

Figure 7. Transformation properties: (a) dispatcher, (b) aggregator, (c) black-box [10].

A transformation T is a dispatcher if each input data item produces zero or more output data items independently. The lineage for the dispatcher is defined as T(o, I) = {i ∈ I | o ∈ T({i})} [10].

A transformation T is an aggregator if for all I and T(I) = O = {o1, ..., on}, there exists a unique disjoint partition I1, ..., In of I such that T(Ik) = {ok} for k = 1..n. The lineage for the aggregator is defined as T(ok, I) = Ik [10].

A transformation T is a black-box if it is neither a dispatcher nor an aggregator, and any subset of the input items may have been used to produce a given output. The lineage for the black-box is defined as: for o ∈ O, T(o, I) = I [14].

Usually, the ETL process contains one or more transformation steps, and therefore the lineage for the whole ETL process consists of the lineages for each ETL transformation step. Figure 8 represents a sequence of three simple ETL transformation steps with one possible property each. The transformation steps summary is given in Table 6.

Account_Metric → T1 → T2 → T3 → Metric_Sum

Figure 8. The sequence of ETL transformation steps to derive result table Metric_Sum from Account_Metric.

Table 6. Description of transformation steps.
Transformation summary
Name | Description
T1   | selects metric_nbr = 1 from table Account_Metric
T2   | sums metric_amt in table Metric_1
T3   | selects metric_amt from Metric_1_Total

Furthermore, we will review the proposed data provenance tracing algorithms for each transformation property separately. Let I be a set of input data, I* be a subset of the input data, and i be an input data item, so that i ∈ I* ⊆ I. Let T be a transformation applied to the input data, producing T(I). Let O be a set of output data, O* be a subset of the output data, and o be an output data item of O*, so that o ∈ O* ⊆ O. The lineage of a separate output data item o is defined as the subset I* = T(o, I) ⊆ I.

The procedure algorithm for tracing the lineage for dispatcher properties is presented in Figure 9 [14].

procedure TraceDS (T, O*, I)
  I* ← ∅;
  for each i ∈ I do
    if T({i}) ∩ O* ≠ ∅ then I* ← I* ∪ {i};
  return I*;

Figure 9. Procedure TraceDS algorithm for tracing dispatcher's lineage.

To illustrate the TraceDS (T, O*, I) procedure logic, we will review the transformation step T1. We will use the same example of a highly simplified database instance holding some financial metrics as we used in the previous chapter for representing data provenance tracing in view structures. We remind that the database consists of two base tables: Metrics (id, metric_nbr, metric_name) (see Table 2) and Account_Metric (id, account, metric_nbr, metric_amt) (see Table 3). The tuples are denoted as ti, i ∈ [1,7]. According to the transformation T1 we receive the output data set O* presented in table Metric_1 (account, metric_nbr, metric_amt), as shown in Table 7. The tuples are

denoted as on, n ∈ [1,2]. T1 is a dispatcher; therefore, we execute the procedure TraceDS (T1, {o1, o2}, Account_Metric) to obtain the lineage for the output tuples o1 and o2.

Table 7. Result table produced by transformation step T1.
Metric_1
      account | metric_nbr | metric_amt
o1:   A001    | 1          |
o2:   A002    | 1          |

Initially, the subset I* is empty. For each input tuple i ∈ {t4, t5, t6, t7} we check whether T1({i}) intersects with the output data set O*. If it does, we add the input tuple to the subset I*. The TraceDS (T1, {o1, o2}, Account_Metric) procedure will return I* = {t4, t5}.

The procedure algorithm for tracing the lineage for aggregator properties is presented in Figure 10 [10].

procedure TraceAG (T, O*, I)
  L ← all subsets of I sorted by size;
  for each I* ∈ L in increasing order do
    if T(I*) = O* then
      if T(I - I*) = O - O* then break;
      else L ← all supersets of I* sorted by size;
  return I*;

Figure 10. Procedure TraceAG algorithm for tracing aggregator's lineage.

To illustrate the TraceAG (T, O*, I) procedure logic, we will review the transformation step T2. According to the transformation T2, we receive the output data set O* presented in table Metric_1_Total (metric_nbr, metric_amt), as shown in Table 8.

Table 8. Result table produced by transformation step T2.
Metric_1_Total
      metric_nbr | metric_amt
o3:   1          |

We execute TraceAG (T2, {o3}, Metric_1) to obtain the lineage of the output tuple o3. Now I is the table Metric_1 (account, metric_nbr, metric_amt), and all subsets of I sorted by size are presented in L = {L1, L2, L3}, where L1 = {o1}, L2 = {o2} and L3 = {o1, o2}. We check whether any subset in L = {L1, L2, L3} produces o3 via T2 and find that T2(L3) = {o3}. Finally, the TraceAG (T2, {o3}, Metric_1) procedure returns L3 = I* = {o1, o2}.

To illustrate the black-box lineage tracing logic, we will review the transformation step T3. There is no separate algorithm given for black-boxes, and the lineage for the black-box property is defined as: for o ∈ O, T(o, I) = I. Therefore, the lineage for the black-box simply returns the entire input set I. Assume that the output data set O* is stored in the table Metric_1_Date (metric_nbr, metric_amt, calculation_time), as shown in Table 9. The input data set I is Metric_1_Total (metric_nbr, metric_amt). The lineage for the tuple o4 will return I* = {o3}.

Table 9. Result table produced by transformation step T3.
Metric_1_Date
      metric_nbr | metric_amt | calculation_time
o4:   1          |            |

Finally, the whole lineage T(o4, I) for tuple o4 can be presented as a set of lineages for each separate transformation step, sorted by its position, or T(o4, I) = {(1, {t4, t5}), (2, {o1, o2}), (3, {o3})}.

It is worth mentioning that both proposed algorithms, for tracing view and ETL lineage, assume that all required prerequisite data (such as source and target instance mappings, annotations, schema mappings or reverse query generation logic) is already stored in the data warehouse metadata. In case such information is absent, there is a need to collect the missing data to fill the gaps, which is always the trickiest part. Both study efforts do not focus on how the transformation properties are specified or discovered, or how the required metadata is collected, but rather on how to use the collected data for lineage tracing.
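As a side note, for simple transformations such as T1 and T2 the dispatcher and aggregator lineage can also be retrieved with plain SQL instead of enumerating tuples or subsets. The following is a minimal sketch under the assumption that the intermediate tables Metric_1 and Metric_1_Total exist as shown above; it only illustrates the idea and is not the algorithm from [14].

-- Dispatcher T1 (filter metric_nbr = 1): the lineage of the traced output rows is
-- the set of input rows that produce a row present in the output.
SELECT am.*
FROM Account_Metric am
WHERE am.metric_nbr = 1
  AND EXISTS (SELECT 1
              FROM Metric_1 o
              WHERE o.account    = am.account
                AND o.metric_nbr = am.metric_nbr);

-- Aggregator T2 (SUM grouped by metric_nbr): the lineage of a traced output group
-- is every input row sharing the group key of that output row.
SELECT m.*
FROM Metric_1 m
INNER JOIN Metric_1_Total o ON m.metric_nbr = o.metric_nbr;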

2.5 Teradata possibilities for Data Lineage tracing

The EDW database in the Financial Institution N is built on the Teradata platform. Teradata is a company which develops and sells a relational database management system (hereinafter referred to as RDBMS) for big data analytics, as well as publishes other analytical software products integrated with the Teradata RDBMS [15]. Therefore, it is always worth reviewing whether the database platform provider offers any lineage tracing solution integrated with the database platform. However, according to the information studied [15], there are no lineage-oriented, metadata-based products. Still, there are some possible options which can help to retrieve the lineage information at least on some basic level. In the following chapters, we will review the Teradata Meta Data Services (MDS) and the Teradata Mapping Manager (TMM) solutions.

2.5.1 Teradata Meta Data Services (MDS)

Teradata Meta Data Services (MDS) is an infrastructural module for managing data warehouse metadata and creating tools to exchange the metadata with external operational applications like ETL, BI, etc. [16]. Although there is no solution addressing lineage directly, MDS makes it possible to partly collect the required metadata as an input for a lineage tracking solution. MDS consists of several built-in components, among which there are three default built-in meta-models [16]:
1. The Teradata Database Information Meta-model (DIM) is defined to store the information from the databases and associated business data.
2. The Client Load Meta-model (CLM) is defined to store the information obtained from Teradata client FastLoad, MultiLoad and TPump utility scripts and output files.
3. The Common Warehouse Meta-model (CWM) is defined to store the metadata obtained from supported business intelligence tools.

The DIM meta-model holds relations between tables and views, between columns and tables/views, between columns and mapped business attributes, between macros and macro parameters, and between procedures and procedure parameters [16]. This means that some basic relations can be queried very straightforwardly. Additionally, it is worth noting that DIM makes it possible to see the relations between entities on a column

level. The CLM and CWM meta-models are more about linking the external data lineage part to the warehouse internal data lineage. However, MDS does not propose any solution for more complicated lineages like ETL lineages. Still, this part stays without any architectural solution.

2.5.2 Teradata Mapping Manager (TMM)

Teradata Mapping Manager (TMM) is a Java-based desktop solution for creating and maintaining the data mapping rules between the data and/or the requirements [13]. According to the tool logic described in the TMM user guide, this tool is very similar to the SPIDER solution described in the following chapters. The solution allows creating mappings between source and target data models and, based on the schema mapping, identifying (debugging) requirement gaps [17]. TMM is not meant for data lineage reporting; nevertheless, in case all the mappings between the source and target instances are described, for example for view and table relations or for ETL data source and destination entities, then TMM can be used to perform data lineage reporting. However, describing the mappings is very tedious work, especially for ETL processes. The main disadvantage of such an approach is that the source-to-target schema mapping can be described once but cannot be maintained automatically. Therefore, new mappings should be described manually for every new relation appearing in the data warehouse.

2.6 Overview of some existing Lineage tracing Solutions for Relational Databases

In this chapter, we will review three different systems implementing data lineage tracing functionality. We selected the systems which are most frequently discussed in other study efforts and referred to as examples of how- and where-provenance. These are TRIO, SPIDER and DBNotes. Note that there is no need to discuss why-provenance separately, because it is an integral part of how-provenance, and therefore the same system implementing how-provenance can be reviewed. To start with, we will categorize these systems in Table 10 according to the lineage types. Unfortunately, we were not able to find any publicly available example of a

system which would introduce the calculation of where-provenance according to the non-annotation approach.

Table 10. Categorization of systems according to provenance types.
Data (or fine-grained) Provenance
Internal Data Provenance
                  | Annotation (eager) approach | Non-annotation (lazy) approach
Why-provenance    | TRIO                        | SPIDER
How-provenance    | TRIO                        | SPIDER
Where-provenance  | DBNotes                     |

2.6.1 How provenance: TRIO

TRIO is introduced in [18]. It is an RDBMS system for probabilistic databases where the correctness of the data is uncertain. TRIO is built on top of the PostgreSQL DBMS, where the SQL language is supplemented with additional functionality to annotatively store the lineage for data operations [19]. We would like to point out that in TRIO the lineage can be queried with or without confidence values assigned to the data. Therefore, in the current review we will leave the concept of uncertainty aside and concentrate only on the lineage tracing functionality.

We will use the same example of a highly simplified database instance holding some financial metrics as we used in the previous chapters for representing data provenance tracing. We remind that the database consists of two base tables: Metrics (id, metric_nbr, metric_name) (see Table 2) and Account_Metric (id, account, metric_nbr, metric_amt) (see Table 3). The tuples are denoted as ti, i ∈ [1,7]. The lineage is denoted as a function λ associated with each tuple. Since Metrics and Account_Metric are base tables, the lineage of each of their tuples is the tuple itself, or λ(ti) = ti, i ∈ [1,7].

To illustrate how the lineage is stored for derived data, assume we execute the query π account_nbr, metric_name, metric_amt (Metrics ⋈ Account_Metric); the code representation is

given in Figure 11. Next, we store the result in a table called Metric_Description, presented in Table 11. The tuples are denoted as uj, j ∈ [1,4].

SELECT Account_Metric.account, Metrics.metric_name, Account_Metric.metric_amt
INTO Metric_Description
FROM Metrics, Account_Metric
WHERE Metrics.metric_nbr = Account_Metric.metric_nbr;

Figure 11. Query π account_nbr, metric_name, metric_amt (Metrics ⋈ Account_Metric) code representation.

Table 11. Result table Metric_Description.
Metric_Description
      account | metric_name     | metric_amt | lineage
u1:   A001    | Account balance |            | λ(u1) = t1 ∧ t4
u2:   A002    | Account balance |            | λ(u2) = t1 ∧ t5
u3:   A002    | Interest income |            | λ(u3) = t2 ∧ t6
u4:   A001    | Interest income |            | λ(u4) = t2 ∧ t7

The lineage of the first tuple u1 in the table Metric_Description is given by λ(u1) = t1 ∧ t4, indicating that the existence of u1 is possible due to the existence of both tuple t1 in relation Metrics and tuple t4 in relation Account_Metric. The same logic applies to the other tuples. This is the representation of the why-provenance concept. We know that join operations produce a conjunctive lineage, while operations like projection or union produce a disjunctive lineage, in case they are duplicate-eliminating [2]. According to the current example, TRIO will store the conjunction of the input tuples in the lineage of u1 and u2. This means that the understanding that the input tuples have been joined can be obtained simply from the lineage information stored, and there is no need to examine the base tables separately.

The lineage for a derived relation R_result is stored in a separate metadata table lin: R_result (parent_id, table, child_id), where relations are annotated with conjunctive/disjunctive lineage as well as with parent_id and child_id attributes, as shown in Table 12.

Table 12. Metadata table used to store the lineage for a derived relation.
lin: Metric_Description
parent_id | table          | child_id | relational_operator
u1        | Metrics        | t1       |
u1        | Account_Metric | t4       |
u2        | Metrics        | t1       |
u2        | Account_Metric | t5       |

In order to retrieve the lineage information, the Trio system executes recursive queries over the lineage tables by asking for the parents of the data set and then the parents of the parents, continuously, until all relations are found [19].

2.6.2 How provenance: SPIDER

One of the widely used non-annotation approaches for calculating provenance is based on schema mappings. A schema mapping is a specification that describes how the data in the source schema is transformed into the data residing in the target schema. The relationship between the source and target data with respect to the mapping is called a route. Routes are a form of how-provenance over schema mappings [20], [21]. SPIDER is a solution which makes it possible to trace the data lineage over schema mappings [21]. It is implemented on top of the data exchange system Clio from IBM, and provides a possibility to debug schema mappings together with source and target data, in order to find data quality issues based on provenance. A schema mapping designer allows specifying source and target instances, denoted as S and T respectively, and the way they are related to each other. The SPIDER system calculates routes for a specific data item selected in the target instance T through chains of possibly recursive mappings. The schema mapping can be denoted as a quadruple M = (S, T, Σst, Σt), where S is a source instance, T is a target instance, Σst is a set of source-to-target mappings and Σt is the union of a finite set of target dependencies [6].

To illustrate the schema mappings idea, we will use the same example of a highly simplified database instance holding some financial metrics as we used in the previous chapters. We remind that the database consists of two base tables: Metrics (id, metric_nbr, metric_name) (see Table 2) and Account_Metric (id, account, metric_nbr, metric_amt) (see Table 3).

Assume that Account_Metric (account, metric_nbr, metric_amt) is a source instance S. Let Account_Metric_Sum (metric_nbr, metric_amt) be a target instance T, as shown in Table 13. The tuples are denoted as uj, j ∈ [1,2].

Table 13. Target instance T.
Account_Metric_Sum
      metric_nbr | metric_amt
u1:   1          |
u2:   2          |

In SPIDER, the schema mapping should be defined separately by a user. Assume we set up the following mapping:

Source-to-target mapping (Σst):
m1: for ti in Account_Metric exists ui in Account_Metric_Sum where ti[metric_nbr] = ui[metric_nbr]

Target dependencies (Σt): none are defined in this example.

Assume we would like to know the lineage for the tuple u1. SPIDER will return the following route R:

Account_Metric(t4, t5) →(m1) Account_Metric_Sum(u1)

The route R consists of a satisfaction step (m1, (t4, t5)), which witnesses the existence of the target tuple u1 because of the existence of the source tuples t4 and t5. This part contributes to the why-provenance concept. The mapping m1 itself can be considered the how-provenance part. If some values do not satisfy the mapping conditions, then a data analyst can discover a data quality bug. The SPIDER system applies the non-annotation (or lazy) approach for calculating data provenance, because no additional information is stored during everyday database operations and provenance is calculated only when it is needed.

2.6.3 Where Provenance: DBNotes

DBNotes is a solution, implemented on top of the Oracle relational database system, which provides a possibility to add one or multiple annotations to every attribute value in the database [21]. Once annotations are assigned, they can be propagated together with further data flows through whatever transformations follow. The annotations technique

is implemented in Java, and there is no need to change the underlying database schema when adding additional information [7]. The annotation stores a physical address of a value or an array of addresses, indicating where the value is coming from.

Let V be a result of a query Q executed on a relational database D, where V could undergo different transformations. The annotation {}v, associated with a result data item v ∈ V, consists of the annotations which are associated with each data item in the source (s1, s2, ..., sn) ∈ S from which v is copied, so that {}v = ({}s1, {}s2, ..., {}sn). The location of the value v in the database is denoted as (R, t, A), where R is a table, t is a tuple and A is a column in the table R [6]. The annotation of the value v corresponds to its location, so that {}v = (R, t, A).

To illustrate the annotations technique, we will use the same example of a highly simplified database instance holding some financial metrics as we used in the previous chapters. We remind that the database consists of two base tables: Metrics (id, metric_nbr, metric_name) (see Table 14) and Account_Metric (id, account, metric_nbr, metric_amt) (see Table 15). The tuples are denoted as ti, i ∈ [1,7]. Each value is assigned an annotation {ai} = (R, t, A), i ∈ [1,18], corresponding to the value location.

Table 14. Table holding the description of some financial metrics with annotations.
Metrics
      metric_nbr | metric_name
t1:   1 {a1}     | Account balance {a4}
t2:   2 {a2}     | Interest income {a5}
t3:   3 {a3}     | Transaction fee income {a6}

{a1} = (Metrics, t1, metric_nbr), {a4} = (Metrics, t1, metric_name)
{a2} = (Metrics, t2, metric_nbr), {a5} = (Metrics, t2, metric_name)
{a3} = (Metrics, t3, metric_nbr), {a6} = (Metrics, t3, metric_name)

Table 15. Table holding the actually calculated metrics for some accounts with annotations.
Account_Metric
      account    | metric_nbr | metric_amt
t4:   A001 {a7}  | 1 {a11}    | {a15}
t5:   A002 {a8}  | 1 {a12}    | {a16}
t6:   A002 {a9}  | 2 {a13}    | {a17}
t7:   A001 {a10} | 2 {a14}    | {a18}

{a7} = (Account_Metric, t4, account), {a11} = (Account_Metric, t4, metric_nbr), {a15} = (Account_Metric, t4, metric_amt)
{a8} = (Account_Metric, t5, account), {a12} = (Account_Metric, t5, metric_nbr), {a16} = (Account_Metric, t5, metric_amt)
{a9} = (Account_Metric, t6, account), {a13} = (Account_Metric, t6, metric_nbr), {a17} = (Account_Metric, t6, metric_amt)
{a10} = (Account_Metric, t7, account), {a14} = (Account_Metric, t7, metric_nbr), {a18} = (Account_Metric, t7, metric_amt)

The propagation of the annotations is shown in Table 16, which stores the result of the query π account_nbr, metric_nbr, metric_name, metric_amt (Metrics ⋈ Account_Metric). The tuples are denoted as uj, j ∈ [1,4].

Table 16. Result table Metric_Description showing annotation's propagation.
Metric_Description
      account    | metric_nbr  | metric_name          | metric_amt
u1:   A001 {a7}  | 1 {a1, a11} | Account balance {a4} | {a15}
u2:   A002 {a8}  | 1 {a1, a12} | Account balance {a4} | {a16}
u3:   A002 {a9}  | 2 {a2, a13} | Interest income {a5} | {a17}
u4:   A001 {a10} | 2 {a2, a14} | Interest income {a5} | {a18}

The annotations for the values in column metric_nbr consist of the annotations of the locations the values are copied from. For example, the value of metric_nbr in u1 has the annotation {a1, a11}, since the value 1 is copied from both locations: {a1} = (Metrics, t1, metric_nbr) and {a11} = (Account_Metric, t4, metric_nbr). The same logic applies to the other values in column metric_nbr.
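DBNotes extends SQL with its own annotation-propagation constructs; that syntax is not reproduced here. Purely as an illustration of the default propagation behaviour shown in Table 16, the same effect could be emulated in plain SQL by carrying assumed annotation columns (account_ann, metric_nbr_ann, and so on) alongside the values; these column names are hypothetical and do not reflect how DBNotes actually stores annotations.

-- Emulating default annotation propagation for the query of Table 16:
-- every output value carries the annotations of the locations it was copied from.
SELECT am.account,
       am.account_ann,                                                   -- e.g. {a7}
       am.metric_nbr,
       m.metric_nbr_ann || ', ' || am.metric_nbr_ann AS metric_nbr_ann,  -- e.g. {a1, a11}
       m.metric_name,
       m.metric_name_ann,                                                -- e.g. {a4}
       am.metric_amt,
       am.metric_amt_ann                                                 -- e.g. {a15}
FROM Metrics m, Account_Metric am
WHERE m.metric_nbr = am.metric_nbr;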

In contrast to why- and how-provenance, which indicate the tuples in the input that witness the existence of an output tuple and record the way the values were used according to the query Q, where-provenance tells us precisely the location from which an attribute value in the output was copied [6].

3 Data Lineage tracing module development in Enterprise Data Warehouse

3.1 Introduction

In this chapter we describe the development tasks accomplished in order to implement one possible solution for financial metrics data lineage tracing in the EDW environment of the Financial Institution N. The prototype code is presented in Appendices 1-7, from which it can simply be copied and executed in a Teradata database client (for example, Teradata Administrator or Teradata SQL Assistant) in the production environment. The code is divided into steps which should be executed in chronological order. All database objects are created in the form of volatile tables, which means that they exist only during a particular user session.

3.2 Prototype Requirements and Analysis

Next, we provide the user story requirements gathered for the prototype solution. The functional requirements are described in Table 17. We would like to mention that at the current stage of development there are, generally, no non-functional requirements which would significantly affect the prototype development directions. The only non-functional requirement proposed at the moment is that none of the developed code should fail during query processing because of a lack of the user's spool space.

Table 17. Functional requirements for the prototype solution.
Use Case ID | Use Case Description
UC_01 | There should be a functionality to query data lineage for the financial metrics (attributes Account_Metric_Type_Code, Account_Nbr_Modifier) stored in table EDW.T3058_ACCOUNT_METRIC.
UC_02 | There should be a functionality to trace the metric lineage back to the sources and further to the targets to the specified number of levels. The starting point is EDW.T3058_ACCOUNT_METRIC.
UC_03 | There should be a functionality to query the lineage for actually calculated

financial metrics for a defined period (attribute Period_Ending_Date, table EDW.T3058_ACCOUNT_METRIC).
UC_04 | There should be a functionality to see the object-level relations like table-to-table or table-to-view relations.
UC_05 | There should be a possibility to see the source and target metrics if such exist.
UC_06 | The lineage tracing solution should implement a reusable pattern, which could later be used for classifiers other than the financial metrics.
UC_07 | The lineage tracing pattern should be implemented upon the existing metadata.

The EDW warehousing environment is logically divided into different layers. Next, we provide the description of the layers on which we have mainly concentrated our solution:
1. EDW_DW is a denotation for the main physical layer. The physical schema name is EDW.
2. EDW_CL is a denotation for the main logical layer. The physical schema name is HBG.
3. EDW_SA is a denotation for the staging area layer. The physical schema name is OSA.

We will not describe other layers here, as this would not give any additional value to the current work. We know that in the physical layer the data is generally stored by means of tables, while in the logical layer it actually appears in views. Therefore, data propagation from the physical to the logical layer is done by means of view structures. The data propagation from tables to tables or from views to tables is possible only by means of ETL processes. In the prototype solution, we have divided the relations into ETL relations and DB relations (for the table-to-view relations) and have separated the ways we search for them.

The table T3058_ACCOUNT_METRIC resides in the physical layer EDW_DW, in the EDW schema. The prototype development has been started from this table according to the functional requirements UC_01 and UC_02. The table holds the monthly summary financial metrics calculated for the Financial Institution's customer accounts. The UML diagram of T3058_ACCOUNT_METRIC together with the table attributes is presented in Figure 12. Table EDW.T7060_ACCOUNT_METRIC_TYPE holds

The table EDW.T7060_ACCOUNT_METRIC_TYPE holds the semantic descriptions of the financial metrics; we leave it without attention, as we are not interested in the business meanings of the financial metrics but only in the technical data processing.

Figure 12. UML diagram for the T3058_ACCOUNT_METRIC table.

According to the functional requirements UC_04 and UC_05, we have outlined two different data lineage levels:

The data lineage object level is the level which stores the relations between objects, such as table-to-table, table-to-view, view-to-view and view-to-table. Note that data propagation from table to table and from view to table is possible only by means of ETL processes.
The data lineage classifier level is the level which stores the relations between (classifier and object)-to-(classifier and object). In our case the classifier attribute is Account_Metric_Type_Code, which holds the values of the financial metrics.

The data lineage for the object and classifier levels is depicted in Figure 13. The object level lineage is shown with black arrows and the classifier lineage with red arrows. The tables are annotated as T (with a number) and the views as V (with a number). We assume that the tables reside in the physical layer and the views in the logical layer. The red denotation M1 denotes the value 1 in the attribute Account_Metric_Type_Code; the red flow therefore represents the lineage of financial metric 1, which is the classifier level. The data propagation way between the objects is shown as ETL or view.

Figure 13. The object level and the classifier level example.

The relations for the data lineage object level can be stored in a separate table with level_no, source and target attributes, as presented in Table 18 (based on the relations given in Figure 13).

Table 18. Table holding relations between objects (Object_Relations).

level_no   source   target
1          T1       T2
2          T2       T3
3          T3       V1
4          T3       V2
5          T3       V4
6          V2       T4
7          T4       T5

The lineage for a specific object can then be queried recursively upon this table, as sketched below.
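A minimal sketch of such a recursive query is given below. It assumes the Object_Relations table from Table 18 and the starting object T3; it is only an illustration, not the exact query used in the prototype.

-- Follow the object level relations downstream from a starting object (here T3).
WITH RECURSIVE lineage (source_obj, target_obj, lvl) AS
(
    SELECT source, target, 1
    FROM   Object_Relations
    WHERE  source = 'T3'                 -- starting object, illustrative value

    UNION ALL

    SELECT r.source, r.target, l.lvl + 1
    FROM   Object_Relations r
    JOIN   lineage l
      ON   r.source = l.target_obj
    WHERE  l.lvl < 10                    -- level limit, cf. UC_02
)
SELECT * FROM lineage
ORDER BY lvl;

Tracing back to the sources works in the same way, with the anchor part filtering on the target column and the join condition reversed.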

Generally, we have adopted the idea of storing the data lineage classifier level relations from the TRIO project, where the concept of the lineage table lin:R_result(parent_id, table, child_id) has been presented. An example of storing the classifier level relations is presented in Table 19.

Table 19. Table holding relations between classifiers (Classifier_Relations).

level_no   metric_nbr   source   target
1          1            T1       T2
2          1            T2       T3
3          1            T3       V1
4          1            T3       V2
5          1            V2       T4

According to the functional requirements, the type of lineage we need to capture can be identified as internal where-provenance: we are not interested in exactly what transformations the tuples from the source objects undergo, but we are interested in the locations the values come from. We use the lineage type classification table from the previous chapters in order to depict the required lineage type, marked with an X in Table 20. The annotation (eager) lineage calculation approach is possible only when the facilities for such a solution are foreseen from the very beginning of the database implementation. There is no possibility for such an approach in the current warehousing environment, and therefore we have no choice but to implement the lazy calculation approach.

Table 20. Categorization of systems according to provenance types (data, or fine-grained, internal provenance).

                     Annotation (eager) approach   Non-annotation (lazy) approach
Why-provenance       TRIO                          SPIDER
How-provenance       TRIO                          SPIDER
Where-provenance     DBNotes                       X

According to requirement UC_03, the lineage should be calculated only for the financial metrics existing in the specified period (in our case, the selected month). While the table T3058_ACCOUNT_METRIC holds data aggregated on a monthly basis (each period is denoted by the last day of the month), the source objects can hold the customers' account facts data on a daily basis.
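As a minimal illustration of this period restriction, the financial metrics actually calculated for one month could be listed as follows; the period value is only an example.

-- Metrics actually calculated for one reporting month (cf. UC_03).
SELECT DISTINCT
       Account_Metric_Type_Code,
       Account_Nbr_Modifier
FROM   EDW.T3058_ACCOUNT_METRIC
WHERE  Period_Ending_Date = DATE '2015-12-31';   -- last day of the selected month, example value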

This difference in granularity (monthly metrics versus daily source facts) was kept in mind while implementing the Data Group Scanner described later in this work.

We would like to mention that requirement UC_06 was not fulfilled. In order to make the prototype solution pattern applicable to other possible warehouse classifiers (not only Account_Metric_Type_Code but also other numerical attributes), additional research would have to be performed for each classifier separately. The prototype solution could be redeveloped to be more universal by taking the peculiarities of the other classifiers into account, but this is very time-consuming and tedious work, which is out of the scope of the current thesis and is left for future prototype improvements.

According to requirement UC_07, the prototype solution has been built upon the metadata information available in the current warehousing environment as far as possible. Several metadata usage constraints have been identified within this work, and workarounds have been implemented for them.

3.3 Prototype Description

In this chapter, we describe the prototype solution in more detail. The system consists of seven main transformation steps. Additional recursive queries are built upon the lineage result table to query the metric lineage back to the sources and forward to the next target objects, and an additional query was built in order to implement the data lineage visualization. The prototype object usage summary is presented in Table 21. The metadata objects used are denoted with the schema name MEPL; the volatile tables created specifically for the prototype have no schema denotation. Each object is supplemented with an overall description. Additionally, a description of what happens during each transformation step is given in Table 22.

Table 21. Prototype object usage summary.

Step 1.1
Source objects:
- MEPL.MD_LOGICAL_OBJECT_USAGE: Metadata - logical object usage in ETL processes and in DB views, macros and procedures.
Target objects:
- MD_ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages by package and package step. Each object is assigned an object category.

Step 1.2
Source objects:
- MD_ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages by package and package step. Each object is assigned an object category.
- MEPL.MD_LOGICAL_OBJECT: Metadata - logical object.
- MEPL.MD_LOGICAL_OBJECT_COLUMN: Metadata - columns of a logical object.
Target objects:
- AUX_MD_ETL_SOURCE_TARGET: Volatile (auxiliary) table holding the filtered source and target objects with financial facts.

Step 2.1
Source objects:
- MEPL.MDE_LOG_OBJECT_USAGE_SUMMARY: Metadata - shows logical object usage both in ETL and in DB, with owner services.
Target objects:
- ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages. The information is retrieved from a metadata source parallel to the one used in step 1.1.

Step 2.2
Source objects:
- AUX_MD_ETL_SOURCE_TARGET: Volatile (auxiliary) table holding the filtered source and target objects with financial facts.
Target objects:
- ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages (see step 2.1).

Step 3.1
Source objects:
- MEPL.MDE_LOG_OBJECT_USAGE_SUMMARY: Metadata - logical object usage in ETL processes and in DB views, macros and procedures.
Target objects:
- DB_SOURCE_TARGET: Volatile table holding the table-to-view and view-to-view relations. The information is gathered from two different sources.

Step 3.2
Source objects:
- MEPL.MD_LOGICAL_OBJECT: Metadata - logical object.
- MEPL.MD_LOGICAL_OBJECT_COLUMN: Metadata - columns of a logical object.
- MEPL.MD_LOGICAL_OBJECT_USAGE: Metadata - logical object usage in ETL processes and in DB views, macros and procedures.
Target objects:
- DB_SOURCE_TARGET: Volatile table holding the table-to-view and view-to-view relations. The information is gathered from two different sources.

Step 4
Source objects:
- EDW.T0009_PROCESS_RUN: The process run instance, where the execution of ODI processes is logged.
- MEPL.MD_PROCESS: Metadata - process (package) with execution parameters, scheduling and dependency information.
- EDW.T3058_ACCOUNT_METRIC: EDW - summary account information at the end of a time period for a particular balance category.
- EDW.T03056_ACCOUNT_SUMMARY_DLY_DD: EDW - summary account information at the end of day.
- EDW.T03064_ACCT_INSURANCE_METRIC: EDW - insurance metric values for insurance accounts.
- EDW.T03057_ACCOUNT_METRIC_HISTORY: EDW - account metric history for validity periods with a specific start date and end date.
- EDW.T4150_ASSET_METRIC_HISTORY: EDW - asset metric value history for a particular asset metric type code.
- EDW.T3059_ACCOUNT_METRIC_DLY: EDW - account metric information at the end of day for a particular balance category.
Target objects:
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.

Step 5
Source objects:
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.
- OSA_DATA_GROUP_NO_PROCESS: Volatile (temporary) table holding the data groups with no process_run_id attribute assigned.
- MEPL.MDE_CLASSIFIER_USAGE_SUMMARY: Metadata - classifier usage summary scanned from ETL, DB objects and foreign keys.
Target objects:
- OSA_DATA_GROUP_NO_PROCESS: Volatile (temporary) table holding the data groups with no process_run_id attribute assigned.
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.

Step 6
Source objects:
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.
- ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages (see step 2.1).
- MEPL.MD_LOGICAL_OBJECT: Metadata - logical object.
Target objects:
- OSA_ETL_SOURCE: Volatile (temporary) table storing the source objects for the data groups.

Step 7
Source objects:
- EDW.T0079_PROCESS_OBJECT_REL_DTL
- OSA_T0079: Volatile (temporary) table holding a copy of the mapping from EDW.T0079_PROCESS_OBJECT_REL_DTL.
- OSA_T0079_2: Volatile (temporary) table holding the mapping replaced with real data values.
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.
Target objects:
- OSA_T0079: Volatile (temporary) table holding a copy of the mapping from EDW.T0079_PROCESS_OBJECT_REL_DTL.
- OSA_T0079_2: Volatile (temporary) table holding the mapping replaced with real data values.
- DATA_GROUP_RELATION: Volatile table holding the data group relations; this is the result table holding the lineage information.
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.
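To give an impression of what the transformation steps look like in practice, a simplified sketch of step 1.1 is shown below. The column names assumed here for MEPL.MD_LOGICAL_OBJECT_USAGE (package_code, package_step_nbr, object_name, usage_type) are illustrative only and do not necessarily exist in the real metadata model; the actual queries are given in the appendixes.

-- Simplified sketch of step 1.1 (assumed metadata columns, not the actual prototype code):
-- pair every ETL package step's source object with its target object.
CREATE VOLATILE TABLE MD_ETL_SOURCE_TARGET AS
(
    SELECT  src.package_code,
            src.package_step_nbr,
            src.object_name AS source_object,
            trg.object_name AS target_object
    FROM      MEPL.MD_LOGICAL_OBJECT_USAGE src
    LEFT JOIN MEPL.MD_LOGICAL_OBJECT_USAGE trg
           ON  trg.package_code     = src.package_code
           AND trg.package_step_nbr = src.package_step_nbr
           AND trg.usage_type       = 'TARGET'       -- assumed flag value
    WHERE   src.usage_type = 'SOURCE'                -- assumed flag value
) WITH DATA
ON COMMIT PRESERVE ROWS;

The object category assignment and the filtering of the non-financial objects (steps 1.2 and 2.x) would then be applied on top of this volatile table.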

Table 22. Prototype transformation steps description.

Step 1.1 - Find ETL relations: retrieve the ETL relations from MEPL.MD_LOGICAL_OBJECT_USAGE. Left join the source and target objects together by package and package step. Assign an object category to each object.
Step 1.2 - Find ETL relations: based on the object obtained in step 1.1, filter out the objects which do not hold any financial facts (objects holding mappings, rules, configurations, etc.) by object category. Additionally, filter the objects by checking their columns, assuming that a table holding financial facts should have specific columns (such as _metric_amt, _metric_code, _metric_rate, etc.).
Step 2.1 - Find ETL relations: retrieve the ETL relations from a parallel source of information, MEPL.MDE_LOG_OBJECT_USAGE_SUMMARY. Left join the source and target objects together by package. Assign an object category to each object. Filter out the objects which do not hold any financial facts (objects holding mappings, rules, configurations, etc.) by category.
Step 2.2 - Find ETL relations: add the relations from the auxiliary object obtained in step 1.2 which are true and which are missing in the object obtained in step 2.1.
Step 3.1 - Find table-view relations: retrieve the relations from MEPL.MDE_LOG_OBJECT_USAGE_SUMMARY. Left join the source and target objects together. Assign an object category to each object. Filter out the objects which do not hold any financial facts by category, similarly to step 2.1.
Step 3.2 - Find table-view relations: retrieve the relations from MEPL.MD_LOGICAL_OBJECT_USAGE. Left join the source and target objects together. Assign an object category to each object. Filter out the objects which do not hold any financial facts by category and by columns, similarly to step 1.2.
Step 4 - Data Group Scanner: according to the developed pattern, scan the database objects holding the metric facts.
Step 5 - Data Group Scanner: find the processes and packages for the data groups which are missing the process_run_id, using the metadata object MEPL.MDE_CLASSIFIER_USAGE_SUMMARY.
Step 6 - Data Group source objects: find the source objects for the data groups by the processes and packages related to the data groups. The process and package for a data group are found in steps 4 and 5, and the ETL relations are found in steps 1.1-2.2.
Step 7 - Data Group Relations: find the source-target relations between the data groups.

The code of each transformation step is presented in Appendixes 1-8. Despite the number of objects processed and the volume of the transformation steps, the final architectural solution is very simple and consists of two tables. All of the prototype work and transformations are done in order to retrieve the lineage information, which is finally stored in the form of data groups and the relations between them. All other objects created during the prototype run can be considered temporary and are required only for the data processing. The architecture of the final lineage module is presented in Figure 14.
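To make the two-table architecture more concrete, a minimal sketch of the two result tables is given below. The column lists are assumptions based on the grouping attributes described in chapter 3.4 below and on the relation descriptions in Table 21; they are not the actual prototype DDL, which is given in the appendixes.

-- Assumed structures, shown for illustration only.
CREATE VOLATILE TABLE DATA_GROUP
(
    data_group_id            INTEGER,      -- surrogate key of the data group (assumed)
    account_metric_type_code INTEGER,      -- the financial metric classifier
    account_nbr_modifier     VARCHAR(10),  -- source system of the account
    country_context_code     VARCHAR(3),   -- country the account belongs to
    process_name             VARCHAR(128), -- ETL process that derived the metric
    process_run_id           INTEGER       -- resolved in step 5 when missing
)
PRIMARY INDEX (data_group_id)
ON COMMIT PRESERVE ROWS;

CREATE VOLATILE TABLE DATA_GROUP_RELATION
(
    source_data_group_id INTEGER,           -- upstream data group
    target_data_group_id INTEGER            -- downstream data group
)
PRIMARY INDEX (source_data_group_id)
ON COMMIT PRESERVE ROWS;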

Figure 14. UML diagram for the Data Lineage module.

The explanation of how the data groups are obtained is provided in the next chapter.

3.4 Data Scanner

In terms of financial metrics lineage, we are not interested in retrieving the lineage for every particular account or tuple in the table EDW.T3058_ACCOUNT_METRIC. Therefore, we implemented the Data Scanner, which divides the data set of EDW.T3058_ACCOUNT_METRIC into subsets and stores these subsets in an aggregated form. Later, the subsets are checked for whether they intersect with the relations found. The idea is similar to the schema mapping solution described in the SPIDER application example. The data is mainly grouped by:

Account_Metric_Type_Code - the type code of the financial metric;
Account_Nbr_Modifier - identifies from which source database the account has been loaded to the data warehouse (credit, leasing, etc.);
Country_Context_Code - identifies which country the account belongs to;
Process_Name - the name of the ETL process by which the financial metrics have been derived into the table T3058_ACCOUNT_METRIC;
