DATA LINEAGE TRACING PROCESS IMPROVEMENT IN FINANCIAL INSTITUTION'S DATA WAREHOUSE


TALLINN UNIVERSITY OF TECHNOLOGY
Faculty of Information Technology
Department of Informatics
IDU70LT
Alla Tšornenkaja IAMP
DATA LINEAGE TRACING PROCESS IMPROVEMENT IN FINANCIAL INSTITUTION'S DATA WAREHOUSE
Master's thesis
Supervisor: Eduard Ševtšenko, Ph.D., Associate Professor
Co-supervisor: Igor Artemtšuk, M.A., Developer
Tallinn 2016

TALLINNA TEHNIKAÜLIKOOL
Infotehnoloogia teaduskond
Informaatikainstituut
IDU70LT
Alla Tšornenkaja IAPM
ANDMETE ELUTSÜKLI JÄLGIMISPROTSESSI PARENDAMINE FINANTSETTEVÕTTE ANDMEAIDAS
Magistritöö
Juhendaja: Eduard Ševtšenko, Ph.D., Dotsent
Kaasjuhendaja: Igor Artemtšuk, M.A., Arendaja
Tallinn 2016

Author's declaration of originality

I hereby certify that I am the sole author of this thesis. All materials used, references to the literature and the work of others have been cited. This thesis has not been presented for examination anywhere else.

Author: Alla Tšornenkaja

Abstract

This thesis presents one possible solution for financial data lineage tracing in the existing Enterprise Data Warehouse environment of the Financial Institution N. The research gives an overview of the different lineage types and existing prototype solutions presented in other scientific works, and formalizes the possibilities and constraints of applying the current approaches in the specific data warehousing environment. As a result of this thesis, a prototype solution has been developed, in which some ideas were adopted from the studied material and others were proposed by the author. The prototype solution has been created according to the provided requirements. Several conclusions were drawn during the prototype development stage, and the various development constraints are outlined separately. The prototype solution has been tested according to the requirements provided, and the outcome is described in detail. Additionally, a graphical visualization has been made in order to give a better overview of the contribution of the current work. Several directions for future improvements of the prototype solution are suggested.

This thesis is written in English and is 64 pages long, including 4 chapters, 16 figures and 26 tables.

Annotatsioon

Andmete elutsükli jälgimisprotsessi parendamine finantsettevõtte andmeaidas

Käesolevas töös on esitatud üks võimalik prototüüplahendus finantsnäitajate andmete elutsükli jälgimiseks olemasolevas andmeaida keskkonnas finantsettevõttes N. Väljatöötatud prototüüplahendus põhineb etteantud nõuetel. Antud uurimistöös käsitletakse erinevat tüüpi andmete elutsükleid ja olemasolevaid prototüüplahendusi, mis on esitatud teistes teaduslikes töödes. Lisaks tehakse ülevaade antud lahenduste kohaldamise võimalustest ja piirangutest konkreetse andmeaida keskkonnas. Uurimistöö metoodika on konstruktiivne uuring, mis eeldab, et enne praktilise probleemi lahendamist uuritakse põhjalikult olemasolevaid teaduslikke materjale. Käesoleva töö tulemusena määrati kindlaks metaandmete kasutamise piirangud finantsnäitajate elutsükli jälgimiseks, mis pikemas perspektiivis võiksid olla kõrvaldatud metaandmete kvaliteedi parandamise käigus. Tuuakse välja põhjalik prototüübi kirjeldus ning programmi kood on esitatud töö lisades. Prototüüplahendus on testitud vastavalt etteantud nõuetele ning testid on detailselt kirjeldatud. Lisaks on tehtud graafiline visualiseerimine, et anda parem ülevaade tehtud tööpanusest. Olemasolevas töös pakutakse ka prototüübi võimalikud arengusuunad tulevikuks. Käesolev töö on üldiselt jagatud kaheks suureks osaks, millest üks osa on teoreetiline ja teine praktiline. Teoreetilises osas kirjeldatakse erinevaid olemasolevaid praktikaid ja meetodeid andmete elutsükli jälgimiseks andmeaidas. Praktilises osas kirjeldatakse prototüüplahenduse arendust ning sellega kaasnevaid kitsendusi. Lõputöö on kirjutatud inglise keeles ning sisaldab teksti 64 leheküljel, 4 peatükki, 16 joonist, 26 tabelit.

List of abbreviations and terms

ASPJ query: A denotation for the Aggregate-Select-Project-Join query types.
BI: Business Intelligence is the use of computing technologies for the identification, discovery and analysis of business data (for example, sales revenue, products, costs and incomes) [1].
Conjunction: A compound proposition that is true if and only if all of its component propositions are true [2].
Database classifier: Any data type code defined in EDW (financial metrics, products, etc.).
Data Mart: A subject-oriented archive that stores data and uses the retrieved set of information to assist and support the requirements involved within a particular business function or department [1].
DBMS: Database Management System is a collection of programs that enables storing, modifying and extracting information from a database [3].
Disjunction: A compound proposition that is true if and only if at least one of a number of alternatives is true [2].
EDW: Enterprise Data Warehouse is a unified database that holds all the business information about the organization and makes it accessible all across the company [1].
ETL: Extract-Transform-Load is the process of extracting, transforming and loading data during database use [1].
SQL: Structured Query Language is a standardized query language for requesting information from a database [3].
Teradata FastLoad: A Teradata functionality to load huge amounts of data from a flat file into empty tables [4].
Teradata MultiLoad: A Teradata functionality to load multiple tables at one time from either a LAN or Channel environment [4].
Teradata TPump: A Teradata functionality to load data one row at a time, using row hash locks [4].

Table of contents

1 Introduction
1.1 The Background and the Problem
1.2 The Tasks of this Thesis
1.3 Methodology
1.4 Overview of the Work
2 Lineage or Provenance
2.1 Introduction
2.2 Provenance classification
2.3 Provenance calculation approaches
2.4 Provenance in Data Warehouses
2.4.1 Provenance for view structures
2.4.2 Provenance for ETL processes
2.5 Teradata possibilities for Data Lineage tracing
2.5.1 Teradata Meta Data Services (MDS)
2.5.2 Teradata Mapping Manager (TMM)
2.6 Overview of some existing Lineage tracing Solutions for Relational Databases
2.6.1 How provenance: TRIO
2.6.2 How provenance: SPIDER
2.6.3 Where Provenance: DBNOTES
3 Data Lineage tracing module development in Enterprise Data Warehouse
3.1 Introduction
3.2 Prototype Requirements and Analysis
3.3 Prototype Description
3.3.1 Data Scanner
3.3.2 Lineage solution for ETL processes
3.3.3 Lineage solution for View structures
3.4 Prototype Testing
3.5 Lineage Visualization
3.6 Future development directions

4 Summary
References

List of figures

Figure 1. Data provenance calculation approaches: (a) non-annotation approach, (b) annotation approach [6]
Figure 2. Overview of Data Warehouse Architecture [11, p 2]
Figure 3. Example of view definition for Metric_Description
Figure 4. Procedure ViewLineage algorithm for view lineage tracing
Figure 5. Query α sum(metric_amt) (Account_Metric) code representation
Figure 6. Query π Metrics.metric_name (σ Account_Metric_Sum.metric_nbr=Metrics.metric_nbr (Account_Metric_Sum ⋈ Metrics)) code representation
Figure 7. Transformation properties: (a) dispatcher, (b) aggregator, (c) black-box [10]
Figure 8. The sequence of ETL transformation steps to derive result table Metric_Sum from Account_Metric
Figure 9. Procedure TraceDS algorithm for tracing dispatcher's lineage
Figure 10. Procedure TraceAG algorithm for tracing aggregator's lineage
Figure 11. Query π account_nbr, metric_name, metric_amt (Metrics ⋈ Account_Metric) code representation
Figure 12. UML diagram for T3058_ACCOUNT_METRIC table
Figure 13. The object level and the classifier level example
Figure 14. UML diagram for Data Lineage module
Figure 15. Example visualization of Account_Metric_Type_Code 29 lineage
Figure 16. Example visualization of Account_Metric_Type_Code 171 lineage

List of tables

Table 1. Provenance classification
Table 2. Table holding the descriptions for some financial metrics
Table 3. Table holding the actually calculated metrics for some accounts
Table 4. Intermediate view Account_Metric_Sum
Table 5. Final view Metric_Description
Table 6. Description of transformation steps
Table 7. Result table produced by transformation step T1
Table 8. Result table produced by transformation step T2
Table 9. Result table produced by transformation step T3
Table 10. Categorization of systems according to provenance types
Table 11. Result table Metric_Description
Table 12. Metadata table used to store the lineage for a derived relation
Table 13. Target instance T
Table 14. Table holding the description of some financial metrics with annotations
Table 15. Table holding the actually calculated metrics for some accounts with annotations
Table 16. Result table Metric_Description showing annotation's propagation
Table 17. Functional requirements for the prototype solution
Table 18. Table holding relations between objects
Table 19. Table holding relations between classifiers
Table 20. Categorization of systems according to provenance types
Table 21. Prototype object usage summary
Table 22. Prototype transformation steps description
Table 23. DATA_GROUP attribute description
Table 24. Metadata scanners methods description
Table 25. Description of object categories
Table 26. Prototype Testing Report

1 Introduction

1.1 The Background and the Problem

This thesis presents one possible solution for financial data lineage tracing in the existing Enterprise Data Warehouse environment of the Financial Institution N. The research gives an overview of different lineage types and existing prototype solutions presented in other scientific works, and formalizes the possibilities and constraints of applying current approaches in the specific data warehouse environment. The aim of this work is to propose a possible lineage tracing solution according to the existing data warehouse structure and situation, and to develop a working prototype in the form of a metadata module.

Data warehousing systems gather information from different sources and integrate the retrieved data within the local structures. The data integration is usually performed with the help of numerous ETL processes which retrieve data from external or internal data sources, transform the extracted data to a predefined state and, finally, load the data to the destination location. While data transformations are usually done in the staging area layer, the final output is stored within the physical layer in tables. Next, from the physical layer the data is distributed to the logical layer via views. Some part of the data can be reloaded separately from the data warehouse to data mart structures for subject-oriented analysis. Finally, the data is reported to business analysts by means of reporting tools, OLAP, Ad-Hoc, or other possible solutions, which retrieve the data from the logical layer. Therefore, the data journey undergoes table-to-table or view-to-table transformations, carried out by means of ETL, as well as table-to-view transformations, carried out by means of view creation functionality.

Usually, the whole data movement model is much more complicated and the amount of stored attributes within the data warehouse is vast. The subject of this work is to concentrate only on retrieving the data flow for the financial figures, calculated within the Enterprise Data Warehouse, according to the available metadata and financial

metrics data. Nonetheless, future research can concentrate on extending the developed processes to other financial data lineage tracing.

The party interested in the current research is the Enterprise Data Warehouse management department of the Financial Institution N. The benefit of this work is to simplify the impact analysis for required financial metrics data modifications as well as to help answer the most frequently arising questions: What financial metrics are actually calculated from month to month? Which objects are using the financial metrics data? What will break if we change the calculation logic or close the calculations for some financial metric? The current research was carried out during winter and spring.

1.2 The Tasks of this Thesis

The following tasks are completed as the result of this thesis:
1. We research and explain different lineage tracing approaches presented in other scientific works. We show the algorithms of these solutions based on our own examples, and provide an analysis of the applicability of the current ideas to the formulated problem.
2. We use the knowledge obtained to find a solution for the financial metrics data lineage tracing problem in the current warehousing environment. We introduce the architecture and logic of a possible solution.
3. We create a prototype solution and illustrate the lineage tracing functionality as well as possible.

1.3 Methodology

The research methodology of this thesis is Constructive Research. The reason for selecting this methodology is that the existing scientific materials on the corresponding topic should be studied before the practical problem is solved. The construct is a data lineage tracing prototype, which should be developed as the result of this thesis.

The theoretical contribution of the current work is to provide the algorithm of the data lineage tracing solution, which can be implemented in the prototype.

1.4 Overview of the Work

The structure of the current thesis is as follows:
1. In the second chapter we provide an overview of the lineage classification and lineage calculation approaches. We also give a review of the existing lineage tracing solutions and algorithms based on relational algebra operations.
2. In the third chapter we describe the prototype solution development in detail. Additionally, we describe the performed test cases and present a graphical visualization of the financial metrics data lineage.

2 Lineage or Provenance

2.1 Introduction

In this chapter, we start our research by explaining the lineage concept in databases and categorizing the existing lineage types, in order to better formulate what type of lineage we wish to capture and whether there are any already proposed methods which can be used straight away.

The word lineage is used synonymously with the word provenance in the database community. Sometimes it is also referred to as pedigree, source attribution or source tagging. The idea which stands behind lineage (or provenance) is that it keeps the association of a data item with its derivation. In other words, the lineage provides the ability to track the data journey from its origin to the destination instance.

2.2 Provenance classification

Lineage-related studies have been performed since the early 1990s [5, p 2]. The provenance study efforts were carried out separately in two different areas of data management: provenance tracing for scientific workflow systems and for database management systems [6]. In the current work, we will concentrate on the lineage in data management systems, as we are mainly interested in the data residing in the Enterprise Data Warehouse (hereinafter referred to as EDW) database.

In order to get a better overview of the lineage types, we will systemize the possible differentiations found in scientific works in a table. Based on the categorization of the information found, we will analyze the lineage types we want to capture and for which we are searching a solution.

The most general differentiation of lineage is based on the granularity of the data captured. Currently there are two main trends for capturing lineage information: workflow (also known as coarse-grained) and data (also known as fine-grained)

provenance [7]. We provide an overview of both types; however, in the current work we will focus mainly on the fine-grained provenance.

Workflow (or coarse-grained) lineage describes the data processing tasks as a whole. It captures the whole history of the data derivation and represents it as a sequence of steps, where the understanding of what exactly happens during a transformation step can be missing. Nonetheless, the amount of steps and the depth of the information captured for workflow provenance can be stored on the level of platforms, software, versions, common descriptions of the transformation steps, etc.

Data (or fine-grained) lineage describes relatively detailed information about the derivation of a piece of data, which we will refer to as a data item. It captures the journey of particular data items like columns, tuples or attributes and represents it as a sequence of transformation steps.

Mainly, the lineage research questions are [6]:
1. Why was a data item derived? What are the pre-conditions for the data item to be derived?
2. How was a data item derived? How does the query transform the data?
3. What data was used to derive a data item? Where does the data come from?

With respect to these three main questions, three types of data provenance have been outlined: why-provenance, how-provenance and where-provenance [6].

The definition of why-provenance has been formalized by Buneman in [8]. The idea of why-provenance is to provide information about the witnesses of the output of a query. Therefore, why-provenance describes the set of all combinations of source data items, named a witness basis, that indicate the existence of an output data item in the result of the query. For example, let A and B be the attributes of a binary relation R1(A, B) with two tuples t1 = (1,2), t2 = (1,3), and let B and C be the attributes of another binary relation R2(B, C) with three tuples t3 = (2,3), t4 = (3,3), t5 = (3,4). The duplicate-eliminating result of the query π_A,C(R1 ⋈ R2) consists of two output tuples t6 = (1,3), t7 = (1,4). The why-provenance of tuple t6 = (1,3) will capture the witness basis as the set of witnesses {{t1, t3}, {t2, t4}}, and for t7 = (1,4) it will capture {{t2, t5}}.

In contrast to why-provenance, how-provenance provides additional information about how the source data items witness the output data item in the result of a query. In

addition to the witness basis, how-provenance supplements the information about logical operations like conjunctions (AND logical statement) and disjunctions (OR logical statement) used in the query. Consider the same example described above with two binary relations R1(A, B) and R2(B, C). In contrast to the why-provenance, the lineage for the output tuple t6 = (1,3) in how-provenance will be stored as ((t1 ∧ t3) ∨ (t2 ∧ t4)), and for t7 = (1,4) it will be stored as (t2 ∧ t5).

The where-provenance type is a little different from why- and how-provenance, and generally describes where the data item in the output of the query is copied from [4]. While why- and how-provenance capture the relationships between source and output data items, where-provenance describes the relationships between source and output locations of the data items. Consider the same example described above with two binary relations R1(A, B) and R2(B, C). The where-provenance for the output tuple t6 = (1,3) will keep the addresses of the locations where the values are copied from as {(R1, t1, A), (R1, t2, A)} for value 1 and {(R2, t3, C), (R2, t4, C)} for value 3. The same holds for the output tuple t7 = (1,4): the addresses kept will be {(R1, t2, A)} for value 1 and {(R2, t5, C)} for value 4.

The three data provenance types described above can be differentiated by the scope of the data covered; therefore, two more groups are outlined: external and internal data provenance [9]. The external lineage captures the data item journey through all possible nodes the data passes through, including external databases, programs, etc. In terms of the EDW, the external lineage of a data item would include the source database, where the value came from, and the destination database, where the value went to. In contrast to external lineage, the internal lineage captures the data item journey only within a particular database.

For a better overview of the provenance types studied above, we put together a simple categorization shown in Table 1.
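For illustration only, the running example used above can be reproduced in SQL roughly as follows. This is a sketch: the relation and column names simply follow the example, and DISTINCT reflects the duplicate-eliminating projection.

-- Base relations R1(A, B) and R2(B, C) from the running example
CREATE TABLE R1 (A INTEGER, B INTEGER);
CREATE TABLE R2 (B INTEGER, C INTEGER);
INSERT INTO R1 VALUES (1, 2);   -- t1
INSERT INTO R1 VALUES (1, 3);   -- t2
INSERT INTO R2 VALUES (2, 3);   -- t3
INSERT INTO R2 VALUES (3, 3);   -- t4
INSERT INTO R2 VALUES (3, 4);   -- t5

-- Duplicate-eliminating projection π_A,C(R1 ⋈ R2): returns t6 = (1,3) and t7 = (1,4)
SELECT DISTINCT R1.A, R2.C
FROM R1
INNER JOIN R2 ON R1.B = R2.B;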

Table 1. Provenance classification
Provenance classification
  Workflow (or coarse-grained)
  Data (or fine-grained)
    External: Why, How, Where
    Internal: Why, How, Where

2.3 Provenance calculation approaches

Formally, all of the existing research efforts on data lineage distinguish one of two approaches for calculating data provenance: the annotation approach (also known as the eager or bookkeeping approach) and the non-annotation approach (also known as the lazy approach) [7]. Both approaches are presented in Figure 1.

Figure 1. Data provenance calculation approaches: (a) non-annotation approach, (b) annotation approach [6].

In the annotation (or eager) approach, the query is re-engineered to carry additional information from the source instances to the output of the query. Generally, the original transformation query Q is modified to a query Q', which produces the same output as Q, but can additionally manage the annotations, as shown on the right side (b) of Figure 1. The advantage of this approach is that the lineage for a data item can be retrieved simply by examining its annotation, and there is no need to examine the source instances separately. The disadvantage comes from the fact that the additional information should be stored and processed along with the actual data, which can cause

storage overheads as well as performance overheads during the execution of queries in the database.

In the non-annotation (lazy) approach, the query Q is executed as it is, as shown on the left side (a) of Figure 1. In order to compute the provenance for an output data item, the source data, output data and the transformation should be analyzed separately. The advantage of this approach is that it can be deployed on an existing system as a separate module with no or only minor changes to the system, and it does not cause performance or storage overheads. The disadvantage of the non-annotation approach is that analyzing the provenance of the input, transformation and output data requires sophisticated techniques, and the approach cannot be used in case the source data becomes unavailable.

2.4 Provenance in Data Warehouses

A data warehouse (DW) is a collection of corporate information and data derived from external data sources. The main aim of the data warehouse is to support business decisions by allowing data consolidation, business intelligence analysis and financial reporting [10]. An overview of the common data warehouse architecture is shown in Figure 2.

Figure 2. Overview of Data Warehouse Architecture [11, p 2]

Data warehousing systems gather information from different sources and integrate the retrieved data within the local structures. The data integration is usually performed with the help of numerous ETL processes which retrieve data from external or internal data sources, transform the extracted data to a predefined state and, finally, load the data to the destination location. While data transformations are usually done in the staging area layer, the final output is stored within the physical layer in tables. Next, from the physical layer the data is distributed to the logical layer via views. Some part of the data can be reloaded separately from the data warehouse to data mart structures for subject-oriented analysis. Finally, the data is reported to business analysts by means of business intelligence (BI) reporting tools, OLAP, Ad-Hoc, or other possible solutions, which retrieve data from the logical layer.

There are five main features which distinguish an enterprise data warehouse from any other data warehouse [12]:
1. EDW should have a single version of truth and corresponding rules.

2. EDW should contain data for multiple subject areas like marketing, sales, finance, human resources, etc.
3. EDW should have a normalized design.
4. EDW should be implemented as a mission-critical environment.
5. EDW should be scalable across several dimensions and be able to handle the growth of data and the complexity of business processes.

An enterprise data warehouse environment should be able to handle huge amounts of data without becoming a data swamp, and therefore metadata is an integral part of keeping the understanding about the business data consolidated from different sources. The techniques for collecting metadata can sometimes be very sophisticated and are usually implemented at different levels of data processing, such as ETL or BI processing, etc. In terms of data provenance tracing, metadata should be considered the main source describing the data relations found in EDW.

Unfortunately, not many study efforts on data provenance describe the provenance tracing principles specifically in the data warehouse environment. There are even fewer works which provide any contribution to tracing data provenance for ETL processes. The problem of tracing lineage in data warehouses has been formally formulated by Cui and Widom in [13] and [14], who divided the lineage problem into two parts: lineage tracing for views and lineage tracing for ETL processes. We will review both types in the next chapters.

2.4.1 Provenance for view structures

Cui and Widom [13] proposed a common fine-grained lineage tracing algorithm for any aggregate-select-project-join (hereinafter referred to as ASPJ) view, which is automatically generated from the view definition and a small amount of auxiliary information maintained together with the warehouse views. It is assumed that the view content is calculated by evaluating an algebraic view definition query tree bottom-up. Each operator in the tree calculates its result tuple-by-tuple based on the results of previous nodes, and passes the result upwards [13]. An example of a possible view definition for the view Metric_Description is shown in Figure 3.

Figure 3. Example of view definition for Metric_Description.

Any ASPJ view v can be transformed into an equivalent form v' which contains ⋈ (join), σ (selection), π (projection), α (aggregation) operator sequences. The form v' is named an ASPJ canonical form, where each ⋈, σ, π, α operator sequence is named an ASPJ segment. A view defined by one ASPJ segment is named a one-level ASPJ view, whose tuple lineage can be calculated using a single relational query named the lineage tracing query [13].

The lineage for a relational operator and for a view tuple are defined separately [13]. Let Op be a relational operator (⋈ (join), σ (selection), π (projection), α (aggregation)) and let T = Op(T1, ..., Tm) be a table that results from applying Op to tables T1, ..., Tm. The lineage of a tuple t ∈ T is a subset of the source table data (T1, ..., Tm), denoted as (T1*, ..., Tm*), such that Op(T1*, ..., Tm*) = {t}.

Let D be a database with base tables R1, ..., Rm, and let V = v(D) be a view over D. Consider a tuple t ∈ V:
1. v = Ri: if tuple t ∈ Ri, then t contributes to itself in V.
2. v = Op(v1, ..., vk), where vj is a view defined over D, j = 1..k: a tuple t* ∈ Ri contributes to t according to v if t* contributes to some intermediate tuple in Vj = vj(D) which in turn contributes to t according to Op.

The lineage tracing query can be defined next. Let V = v(T1, ..., Tm) be a one-level ASPJ view, and let T ⊆ V be a tuple set of V. T's lineage in T1, ..., Tm according to v can be

computed with the following query, where Split is an operator which breaks a table into multiple tables: TQ_T,v = Split_T1,...,Tm(σ_C(T1 ⋈ ... ⋈ Tm)).

The procedure algorithm for tracing view lineage is presented in Figure 4 [9].

procedure ViewLineage (T, v, D)
begin
  if v = R ∈ D then return (T);
  // else v = v'(v1, ..., vk) where v' is a one-level ASPJ query,
  // Vj = vj(D) is an intermediate view or a base table, j = 1..k
  (V1*, ..., Vk*) ← TQ(T, v', {V1, ..., Vk});
  D* ← ∅;
  for j ← 1 to k do
    // concatenate the lineage of each subview onto the result D*
    D* ← D* ⋅ ViewLineage(Vj*, vj, D);
  return (D*);
end

Figure 4. Procedure ViewLineage algorithm for view lineage tracing.

We will review the ViewLineage (T, v, D) procedure logic on a highly simplified database instance holding some financial figures (hereinafter referred to as financial metrics) and consisting of two base tables. We will omit the metrics calculation periods, as this has no impact on the lineage tracing logic. Assume that table Metrics (id, metric_nbr, metric_name) holds the descriptions of financial metrics that can be calculated based on the customer account information, as shown in Table 2. Table Account_Metric (id, account, metric_nbr, metric_amt) holds the actually calculated metrics for some accounts, as shown in Table 3. Assume tables Metrics and Account_Metric are base tables R1 and R2 respectively, such that {R1, R2} ⊆ D. The tuples are denoted as ti, i ∈ [1,7].

Table 2. Table holding the descriptions for some financial metrics.
Metrics
      metric_nbr | metric_name
t1:   1          | Account balance
t2:   2          | Interest income
t3:   3          | Transaction fee income

Table 3. Table holding the actually calculated metrics for some accounts.
Account_Metric
      account | metric_nbr | metric_amt
t4:   A001    | 1          |
t5:   A002    | 1          |
t6:   A002    | 2          |
t7:   A001    | 2          |

Assume that Account_Metric_Sum (metric_nbr, metric_amt) is the intermediate view produced by the aggregation query α sum(metric_amt) (Account_Metric) and represented in Table 4. The tuples are denoted as uj, j ∈ [1,2]. The code representation is given in Figure 5.

CREATE VIEW Account_Metric_Sum AS
SELECT Account_Metric.metric_nbr,
       SUM(Account_Metric.metric_amt) AS metric_amt
FROM Account_Metric
GROUP BY Account_Metric.metric_nbr;

Figure 5. Query α sum(metric_amt) (Account_Metric) code representation.

Table 4. Intermediate view Account_Metric_Sum.
Account_Metric_Sum
      metric_nbr | metric_amt
u1:   1          |
u2:   2          |

Assume that Metric_Description (metric_name) is the final view produced by the query π Metrics.metric_name (σ Account_Metric_Sum.metric_nbr=Metrics.metric_nbr (Account_Metric_Sum ⋈ Metrics)) and presented in Table 5. The tuples are denoted as on, n ∈ [1,2]. The code representation is given in Figure 6.

CREATE VIEW Metric_Description AS
SELECT Metrics.metric_name
FROM Account_Metric_Sum, Metrics
WHERE Account_Metric_Sum.metric_nbr = Metrics.metric_nbr;

Figure 6. Query π Metrics.metric_name (σ Account_Metric_Sum.metric_nbr=Metrics.metric_nbr (Account_Metric_Sum ⋈ Metrics)) code representation.
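For the tracing step that follows, the lineage tracing query TQ for the top-level ASPJ segment of Metric_Description can also be sketched directly in SQL. This is a sketch only: the Split operator is emulated by projecting each source relation in a separate query, and because the traced tuple set below is the whole view content, no extra restriction on the traced tuples is added.

-- Lineage of Metric_Description tuples in Metrics (first branch of Split)
SELECT DISTINCT Metrics.*
FROM Account_Metric_Sum, Metrics
WHERE Account_Metric_Sum.metric_nbr = Metrics.metric_nbr;

-- Lineage of Metric_Description tuples in Account_Metric_Sum (second branch of Split)
SELECT DISTINCT Account_Metric_Sum.*
FROM Account_Metric_Sum, Metrics
WHERE Account_Metric_Sum.metric_nbr = Metrics.metric_nbr;

-- Recursing into the intermediate view: lineage of Account_Metric_Sum* in Account_Metric
SELECT Account_Metric.*
FROM Account_Metric
WHERE Account_Metric.metric_nbr IN
      (SELECT Account_Metric_Sum.metric_nbr
       FROM Account_Metric_Sum, Metrics
       WHERE Account_Metric_Sum.metric_nbr = Metrics.metric_nbr);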

Table 5. Final view Metric_Description.
Metric_Description
      metric_name
o1:   Account balance
o2:   Interest income

Now we would like to retrieve the lineage for the set of tuples T = {o1, o2} and execute the procedure ViewLineage ({o1, o2}, Metric_Description, {Metrics, Account_Metric}). Firstly, we check whether T = {o1, o2} is simply selected from one of the base tables R1 or R2, and find that it is not. Next, we execute the lineage query for a one-level ASPJ view, or TQ_{o1,o2},Metric_Description = Split_Account_Metric_Sum, Metrics(σ_C(Account_Metric_Sum ⋈ Metrics)), where the selection condition C is Account_Metric_Sum.metric_nbr = Metrics.metric_nbr. As the result of this tracing query, we receive the subsets (Account_Metric_Sum*, Metrics*) such that Op(Account_Metric_Sum*, Metrics*) = {o1, o2}, where Op is the join operator. The subset Account_Metric_Sum* consists of tuples u1 and u2, and the subset Metrics* consists of tuples t1 and t2. Since Account_Metric_Sum is an intermediate view, we trace the lineage further for Account_Metric_Sum*, and obtain a subset Account_Metric*, which consists of tuples t4, t5, t6 and t7. The final lineage result is obtained by concatenating (Account_Metric*, Metrics*) = ({t4, t5, t6, t7}, {t1, t2}).

2.4.2 Provenance for ETL processes

Cui and Widom [14] proposed common lineage tracing algorithms for ETL processes in a data warehousing environment. Each ETL transformation step is classified into a certain transformation class according to the way the step maps the input data to the output data. The transformation class is named a property, and there are three main properties distinguished: dispatchers, aggregators and black-boxes. Figure 7 illustrates the dispatcher, aggregator and black-box properties, where I is a set of input data and O is a set of output data [14].

Figure 7. Transformation properties: (a) dispatcher, (b) aggregator, (c) black-box [10].

A transformation T is a dispatcher if each input data item produces zero or more output data items independently. The lineage for the dispatcher is defined as T(o, I) = {i ∈ I | o ∈ T({i})} [10].

A transformation T is an aggregator if for all I and T(I) = O = {o1, ..., on}, there exists a unique disjoint partition I1, ..., In of I such that T(Ik) = {ok} for k = 1..n. The lineage for the aggregator is defined as T(ok, I) = Ik [10].

A transformation T is a black-box if it is neither a dispatcher nor an aggregator, and any subset of the input items may have been used to produce a given output. The lineage for the black-box is defined as: for o ∈ O, T(o, I) = I [14].

Usually, the ETL process contains one or more transformation steps, and therefore the lineage for the whole ETL process consists of the lineages for each ETL transformation step. Figure 8 represents a sequence of three simple ETL transformation steps with one possible property each. The transformation steps summary is given in Table 6.

Account_Metric → T1 → T2 → T3 → Metric_Sum

Figure 8. The sequence of ETL transformation steps to derive result table Metric_Sum from Account_Metric.

Table 6. Description of transformation steps.
Transformation summary
Name | Description
T1   | selects metric_nbr = 1 from table Account_Metric
T2   | sums metric_amt in table Metric_1
T3   | selects metric_amt from Metric_1_Total

Furthermore, we will review the proposed data provenance tracing algorithms for each transformation property separately. Let I be a set of input data, I* be a subset of the input data, and i be an input data item, so that i ∈ I* ⊆ I. Let T be a transformation applied to the input data, producing T(I). Let O be a set of output data, O* be a subset of the output data, and o be an output data item of O*, so that o ∈ O* ⊆ O. The lineage of a separate output data item o is defined as the subset I* = T(o, I) ⊆ I.

The procedure algorithm for tracing the lineage for dispatcher properties is presented in Figure 9 [14].

procedure TraceDS (T, O*, I)
  I* ← ∅;
  for each i ∈ I do
    if T({i}) ∩ O* ≠ ∅ then I* ← I* ∪ {i};
  return I*;

Figure 9. Procedure TraceDS algorithm for tracing dispatcher's lineage.

To illustrate the TraceDS (T, O*, I) procedure logic, we will review the transformation step T1. We will use the same example of a highly simplified database instance holding some financial metrics as we used in the previous chapter for representing data provenance tracing in view structures. We remind that the database consists of two base tables: Metrics (id, metric_nbr, metric_name) (see Table 2) and Account_Metric (id, account, metric_nbr, metric_amt) (see Table 3). The tuples are denoted as ti, i ∈ [1,7]. According to the transformation T1 we receive the output data set O* presented in table Metric_1 (account, metric_nbr, metric_amt), as shown in Table 7. The tuples are

denoted as on, n ∈ [1,2]. T1 is a dispatcher; therefore, we execute the procedure TraceDS (T1, {o1, o2}, Account_Metric) to obtain the lineage for the output tuples o1 and o2.

Table 7. Result table produced by transformation step T1.
Metric_1
      account | metric_nbr | metric_amt
o1:   A001    | 1          |
o2:   A002    | 1          |

Initially, the subset I* is empty. For each input tuple i ∈ {t4, t5, t6, t7} we check whether T1({i}) intersects with the output data set O*. If it does, we add the input tuple to the subset I*. The TraceDS (T1, {o1, o2}, Account_Metric) procedure will return I* = {t4, t5}.

The procedure algorithm for tracing the lineage for aggregator properties is presented in Figure 10 [10].

procedure TraceAG (T, O*, I)
  L ← all subsets of I sorted by size;
  for each I* ∈ L in increasing order do
    if T(I*) = O* then
      if T(I - I*) = O - O* then break;
      else L ← all supersets of I* sorted by size;
  return I*;

Figure 10. Procedure TraceAG algorithm for tracing aggregator's lineage.

To illustrate the TraceAG (T, O*, I) procedure logic, we will review the transformation step T2. According to the transformation T2, we receive the output data set O* presented in table Metric_1_Total (metric_nbr, metric_amt), as shown in Table 8.

Table 8. Result table produced by transformation step T2.
Metric_1_Total
      metric_nbr | metric_amt
o3:   1          |

We execute TraceAG (T2, {o3}, Metric_1) to obtain the lineage of the output tuple o3. Now I is the table Metric_1 (account, metric_nbr, metric_amt), and all subsets of I sorted by size are presented in L = {L1, L2, L3}, where L1 = {o1}, L2 = {o2} and L3 = {o1, o2}. We check whether any subset in L = {L1, L2, L3} produces o3 via T2 and find that T2(L3) = {o3}. Finally, the TraceAG (T2, {o3}, Metric_1) procedure returns L3 = I* = {o1, o2}.

To illustrate the black-box lineage tracing logic, we will review the transformation step T3. There is no separate algorithm given for black-boxes, and the lineage for the black-box property is defined as: for o ∈ O, T(o, I) = I. Therefore, the lineage for the black-box simply returns the entire input set I. Assume that the output data set O* is stored in the table Metric_1_Date (metric_nbr, metric_amt, calculation_time), as shown in Table 9. The input data set I is Metric_1_Total (metric_nbr, metric_amt). The lineage for the tuple o4 will return I* = {o3}.

Table 9. Result table produced by transformation step T3.
Metric_1_Date
      metric_nbr | metric_amt | calculation_time
o4:   1          |            |

Finally, the whole lineage T(o4, I) for tuple o4 can be presented as a set of lineages for each separate transformation step, sorted by its position, or T(o4, I) = {(1, {t4, t5}), (2, {o1, o2}), (3, {o3})}.

It is worth mentioning that both proposed algorithms, for tracing view and ETL lineage, assume that all required prerequisite data (such as source and target instance mappings, annotations, schema mappings or reverse query generation logic) is already stored in the data warehouse metadata. In case such information is absent, there is a need to collect the missing data to fill the gaps, which is always the trickiest part. Both study efforts do not focus on how the transformation properties are specified or discovered, or how the required metadata is collected, but rather on how to use the collected data for lineage tracing.
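As a side note, for simple transformations such as T1 and T2 the dispatcher and aggregator lineage can also be retrieved with plain SQL instead of enumerating tuples or subsets. The following is a minimal sketch under the assumption that the intermediate tables Metric_1 and Metric_1_Total exist as shown above; it only illustrates the idea and is not the algorithm from [14].

-- Dispatcher T1 (filter metric_nbr = 1): the lineage of the traced output rows is
-- the set of input rows that produce a row present in the output.
SELECT am.*
FROM Account_Metric am
WHERE am.metric_nbr = 1
  AND EXISTS (SELECT 1
              FROM Metric_1 o
              WHERE o.account    = am.account
                AND o.metric_nbr = am.metric_nbr);

-- Aggregator T2 (SUM grouped by metric_nbr): the lineage of a traced output group
-- is every input row sharing the group key of that output row.
SELECT m.*
FROM Metric_1 m
INNER JOIN Metric_1_Total o ON m.metric_nbr = o.metric_nbr;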

2.5 Teradata possibilities for Data Lineage tracing

The EDW database in the Financial Institution N is built on the Teradata platform. Teradata is a company which develops and sells a relational database management system (hereinafter referred to as RDBMS) for big data analytics, as well as publishes other analytical software products integrated with the Teradata RDBMS [15]. Therefore, it is always worth reviewing whether the database platform provider offers any lineage tracing solution integrated with the database platform. However, according to the information studied [15], there are no lineage-oriented, metadata-based products. Still, there are some possible options which can help to retrieve the lineage information at least on some basic level. In the following chapters, we will review the Teradata Meta Data Services (MDS) and the Teradata Mapping Manager (TMM) solutions.

2.5.1 Teradata Meta Data Services (MDS)

Teradata Meta Data Services (MDS) is an infrastructural module for managing data warehouse metadata and creating tools to exchange the metadata with external operational applications like ETL, BI, etc. [16]. Although there is no solution addressing lineage directly, MDS makes it possible to partly collect the required metadata as an input for a lineage tracking solution. MDS consists of several built-in components, among which there are three default built-in meta-models [16]:
1. The Teradata Database Information Meta-model (DIM) is defined to store the information from the databases and associated business data.
2. The Client Load Meta-model (CLM) is defined to store the information obtained from Teradata client FastLoad, MultiLoad and TPump utility scripts and output files.
3. The Common Warehouse Meta-model (CWM) is defined to store the metadata obtained from supported business intelligence tools.

The DIM meta-model holds relations between tables and views, between columns and tables/views, between columns and mapped business attributes, between macros and macro parameters, and between procedures and procedure parameters [16]. This means that some basic relations can be queried very straightforwardly. Additionally, it is worth noting that DIM makes it possible to see the relations between entities on a column

level. The CLM and CWM meta-models are more about linking the external data lineage part to the warehouse internal data lineage. However, MDS does not propose any solution for more complicated lineages like ETL lineages. Still, this part stays without any architectural solution.

2.5.2 Teradata Mapping Manager (TMM)

Teradata Mapping Manager (TMM) is a Java-based desktop solution for creating and maintaining the data mapping rules between the data and/or the requirements [13]. According to the tool logic described in the TMM user guide, this tool is very similar to the SPIDER solution described in the following chapters. The solution allows creating mappings between source and target data models and, based on the schema mapping, identifying (debugging) requirement gaps [17]. TMM is not meant for data lineage reporting; nevertheless, in case all the mappings between the source and target instances are described, for example for view and table relations or for ETL data source and destination entities, then TMM can be used to perform data lineage reporting. However, describing the mappings is very tedious work, especially for ETL processes. The main disadvantage of such an approach is that the source-to-target schema mapping can be described once but cannot be maintained automatically. Therefore, new mappings should be described manually for every new relation appearing in the data warehouse.

2.6 Overview of some existing Lineage tracing Solutions for Relational Databases

In this chapter, we will review three different systems implementing data lineage tracing functionality. We selected the systems which are most frequently discussed in other study efforts and referred to as examples of how- and where-provenance. These are TRIO, SPIDER and DBNotes. Note that there is no need to discuss why-provenance separately, because it is an integral part of how-provenance, and therefore the same system implementing how-provenance can be reviewed. To start with, we will categorize these systems in Table 10 according to the lineage types. Unfortunately, we were not able to find any publicly available example of a

system which would introduce the calculation of where-provenance according to the non-annotation approach.

Table 10. Categorization of systems according to provenance types.
Data (or fine-grained) Provenance
Internal Data Provenance
                  | Annotation (eager) approach | Non-annotation (lazy) approach
Why-provenance    | TRIO                        | SPIDER
How-provenance    | TRIO                        | SPIDER
Where-provenance  | DBNotes                     |

2.6.1 How provenance: TRIO

TRIO is introduced in [18]. It is an RDBMS system for probabilistic databases where the correctness of the data is uncertain. TRIO is built on top of the PostgreSQL DBMS, where the SQL language is supplemented with additional functionality to annotatively store the lineage for data operations [19]. We would like to point out that in TRIO the lineage can be queried with or without confidence values assigned to the data. Therefore, in the current review we will leave the concept of uncertainty aside and concentrate only on the lineage tracing functionality.

We will use the same example of a highly simplified database instance holding some financial metrics as we used in the previous chapters for representing data provenance tracing. We remind that the database consists of two base tables: Metrics (id, metric_nbr, metric_name) (see Table 2) and Account_Metric (id, account, metric_nbr, metric_amt) (see Table 3). The tuples are denoted as ti, i ∈ [1,7]. The lineage is denoted as a function λ associated with each tuple. Since Metrics and Account_Metric are base tables, the lineage of each of their tuples is the tuple itself, or λ(ti) = ti, i ∈ [1,7].

To illustrate how the lineage is stored for derived data, assume we execute the query π account_nbr, metric_name, metric_amt (Metrics ⋈ Account_Metric); the code representation is

given in Figure 11. Next, we store the result in a table called Metric_Description, presented in Table 11. The tuples are denoted as uj, j ∈ [1,4].

SELECT Account_Metric.account, Metrics.metric_name, Account_Metric.metric_amt
INTO Metric_Description
FROM Metrics, Account_Metric
WHERE Metrics.metric_nbr = Account_Metric.metric_nbr;

Figure 11. Query π account_nbr, metric_name, metric_amt (Metrics ⋈ Account_Metric) code representation.

Table 11. Result table Metric_Description.
Metric_Description
      account | metric_name     | metric_amt | lineage
u1:   A001    | Account balance |            | λ(u1) = t1 ∧ t4
u2:   A002    | Account balance |            | λ(u2) = t1 ∧ t5
u3:   A002    | Interest income |            | λ(u3) = t2 ∧ t6
u4:   A001    | Interest income |            | λ(u4) = t2 ∧ t7

The lineage of the first tuple u1 in the table Metric_Description is given by λ(u1) = t1 ∧ t4, indicating that the existence of u1 is possible due to the existence of both tuple t1 in relation Metrics and tuple t4 in relation Account_Metric. The same logic applies to the other tuples. This is the representation of the why-provenance concept. We know that join operations produce a conjunctive lineage, while operations like projection or union produce a disjunctive lineage, in case they are duplicate-eliminating [2]. According to the current example, TRIO will store the conjunction of the input tuples in the lineage of u1 and u2. This means that the understanding that the input tuples have been joined can be obtained simply from the lineage information stored, and there is no need to examine the base tables separately.

The lineage for a derived relation R_result is stored in a separate metadata table lin: R_result (parent_id, table, child_id), where relations are annotated with conjunctive/disjunctive lineage as well as with parent_id and child_id attributes, as shown in Table 12.

Table 12. Metadata table used to store the lineage for a derived relation.
lin: Metric_Description
parent_id | table          | child_id | relational_operator
u1        | Metrics        | t1       |
u1        | Account_Metric | t4       |
u2        | Metrics        | t1       |
u2        | Account_Metric | t5       |

In order to retrieve the lineage information, the Trio system executes recursive queries over the lineage tables by asking for the parents of the data set and then the parents of the parents, continuously, until all relations are found [19].

2.6.2 How provenance: SPIDER

One of the widely used non-annotation approaches for calculating provenance is based on schema mappings. A schema mapping is a specification that describes how the data in the source schema is transformed into the data residing in the target schema. The relationship between the source and target data with respect to the mapping is called a route. Routes are a form of how-provenance over schema mappings [20], [21]. SPIDER is a solution which makes it possible to trace the data lineage over schema mappings [21]. It is implemented on top of the data exchange system Clio from IBM, and provides a possibility to debug schema mappings together with source and target data, in order to find data quality issues based on provenance. A schema mapping designer allows specifying source and target instances, denoted as S and T respectively, and the way they are related to each other. The SPIDER system calculates routes for a specific data item selected in the target instance T through chains of possibly recursive mappings. The schema mapping can be denoted as a quadruple M = (S, T, Σst, Σt), where S is a source instance, T is a target instance, Σst is a set of source-to-target mappings and Σt is the union of a finite set of target dependencies [6].

To illustrate the schema mappings idea, we will use the same example of a highly simplified database instance holding some financial metrics as we used in the previous chapters. We remind that the database consists of two base tables: Metrics (id, metric_nbr, metric_name) (see Table 2) and Account_Metric (id, account, metric_nbr, metric_amt) (see Table 3).

Assume that Account_Metric (account, metric_nbr, metric_amt) is a source instance S. Let Account_Metric_Sum (metric_nbr, metric_amt) be a target instance T, as shown in Table 13. The tuples are denoted as uj, j ∈ [1,2].

Table 13. Target instance T.
Account_Metric_Sum
      metric_nbr | metric_amt
u1:   1          |
u2:   2          |

In SPIDER, the schema mapping should be defined separately by a user. Assume we set up the following mapping:

Source-to-target mapping (Σst):
m1: for ti in Account_Metric exists ui in Account_Metric_Sum where ti[metric_nbr] = ui[metric_nbr]

Target dependencies (Σt): none are defined in this example.

Assume we would like to know the lineage for the tuple u1. SPIDER will return the following route R:

Account_Metric(t4, t5) →(m1) Account_Metric_Sum(u1)

The route R consists of a satisfaction step (m1, (t4, t5)), which witnesses the existence of the target tuple u1 because of the existence of the source tuples t4 and t5. This part contributes to the why-provenance concept. The mapping m1 itself can be considered the how-provenance part. If some values do not satisfy the mapping conditions, then a data analyst can discover a data quality bug. The SPIDER system applies the non-annotation (or lazy) approach for calculating data provenance, because no additional information is stored during everyday database operations and provenance is calculated only when it is needed.

2.6.3 Where Provenance: DBNotes

DBNotes is a solution, implemented on top of the Oracle relational database system, which provides a possibility to add one or multiple annotations to every attribute value in the database [21]. Once annotations are assigned, they can be propagated together with further data flows through whatever transformations follow. The annotations technique

is implemented in Java, and there is no need to change the underlying database schema when adding additional information [7]. The annotation stores a physical address of a value or an array of addresses, indicating where the value is coming from.

Let V be a result of a query Q executed on a relational database D, where V could undergo different transformations. The annotation {}v, associated with a result data item v ∈ V, consists of the annotations which are associated with each data item in the source (s1, s2, ..., sn) ∈ S from which v is copied, so that {}v = ({}s1, {}s2, ..., {}sn). The location of the value v in the database is denoted as (R, t, A), where R is a table, t is a tuple and A is a column in the table R [6]. The annotation of the value v corresponds to its location, so that {}v = (R, t, A).

To illustrate the annotations technique, we will use the same example of a highly simplified database instance holding some financial metrics as we used in the previous chapters. We remind that the database consists of two base tables: Metrics (id, metric_nbr, metric_name) (see Table 14) and Account_Metric (id, account, metric_nbr, metric_amt) (see Table 15). The tuples are denoted as ti, i ∈ [1,7]. Each value is assigned an annotation {ai} = (R, t, A), i ∈ [1,18], corresponding to the value location.

Table 14. Table holding the description of some financial metrics with annotations.
Metrics
      metric_nbr | metric_name
t1:   1 {a1}     | Account balance {a4}
t2:   2 {a2}     | Interest income {a5}
t3:   3 {a3}     | Transaction fee income {a6}

{a1} = (Metrics, t1, metric_nbr), {a4} = (Metrics, t1, metric_name)
{a2} = (Metrics, t2, metric_nbr), {a5} = (Metrics, t2, metric_name)
{a3} = (Metrics, t3, metric_nbr), {a6} = (Metrics, t3, metric_name)

Table 15. Table holding the actually calculated metrics for some accounts with annotations.
Account_Metric
      account    | metric_nbr | metric_amt
t4:   A001 {a7}  | 1 {a11}    | {a15}
t5:   A002 {a8}  | 1 {a12}    | {a16}
t6:   A002 {a9}  | 2 {a13}    | {a17}
t7:   A001 {a10} | 2 {a14}    | {a18}

{a7} = (Account_Metric, t4, account), {a11} = (Account_Metric, t4, metric_nbr), {a15} = (Account_Metric, t4, metric_amt)
{a8} = (Account_Metric, t5, account), {a12} = (Account_Metric, t5, metric_nbr), {a16} = (Account_Metric, t5, metric_amt)
{a9} = (Account_Metric, t6, account), {a13} = (Account_Metric, t6, metric_nbr), {a17} = (Account_Metric, t6, metric_amt)
{a10} = (Account_Metric, t7, account), {a14} = (Account_Metric, t7, metric_nbr), {a18} = (Account_Metric, t7, metric_amt)

The propagation of the annotations is shown in Table 16, which stores the result of the query π account_nbr, metric_nbr, metric_name, metric_amt (Metrics ⋈ Account_Metric). The tuples are denoted as uj, j ∈ [1,4].

Table 16. Result table Metric_Description showing annotation's propagation.
Metric_Description
      account    | metric_nbr  | metric_name          | metric_amt
u1:   A001 {a7}  | 1 {a1, a11} | Account balance {a4} | {a15}
u2:   A002 {a8}  | 1 {a1, a12} | Account balance {a4} | {a16}
u3:   A002 {a9}  | 2 {a2, a13} | Interest income {a5} | {a17}
u4:   A001 {a10} | 2 {a2, a14} | Interest income {a5} | {a18}

The annotations for the values in column metric_nbr consist of the annotations of the locations the values are copied from. For example, the value of metric_nbr in u1 has the annotation {a1, a11}, since the value 1 is copied from both locations: {a1} = (Metrics, t1, metric_nbr) and {a11} = (Account_Metric, t4, metric_nbr). The same logic applies to the other values in column metric_nbr.
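DBNotes extends SQL with its own annotation-propagation constructs; that syntax is not reproduced here. Purely as an illustration of the default propagation behaviour shown in Table 16, the same effect could be emulated in plain SQL by carrying assumed annotation columns (account_ann, metric_nbr_ann, and so on) alongside the values; these column names are hypothetical and do not reflect how DBNotes actually stores annotations.

-- Emulating default annotation propagation for the query of Table 16:
-- every output value carries the annotations of the locations it was copied from.
SELECT am.account,
       am.account_ann,                                                   -- e.g. {a7}
       am.metric_nbr,
       m.metric_nbr_ann || ', ' || am.metric_nbr_ann AS metric_nbr_ann,  -- e.g. {a1, a11}
       m.metric_name,
       m.metric_name_ann,                                                -- e.g. {a4}
       am.metric_amt,
       am.metric_amt_ann                                                 -- e.g. {a15}
FROM Metrics m, Account_Metric am
WHERE m.metric_nbr = am.metric_nbr;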

In contrast to why- and how-provenance, which indicate the tuples in the input that witness the existence of an output tuple and record the way the values were used according to the query Q, where-provenance tells us precisely the location from which an attribute value in the output was copied [6].

3 Data Lineage tracing module development in Enterprise Data Warehouse

3.1 Introduction

In this chapter we describe the development tasks accomplished in order to implement one possible solution for financial metrics data lineage tracing in the EDW environment of the Financial Institution N. The prototype code is presented in Appendices 1-7, from which it can simply be copied and executed in a Teradata database client (for example, Teradata Administrator or Teradata SQL Assistant) in the production environment. The code is divided into steps which should be executed in chronological order. All database objects are created in the form of volatile tables, which means that they exist only during a particular user session.

3.2 Prototype Requirements and Analysis

Next, we provide the user story requirements gathered for the prototype solution. The functional requirements are described in Table 17. We would like to mention that at the current stage of development there are, generally, no non-functional requirements which would significantly affect the prototype development directions. The only non-functional requirement proposed at the moment is that none of the developed code should fail during query processing because of a lack of the user's spool space.

Table 17. Functional requirements for the prototype solution.
Use Case ID | Use Case Description
UC_01 | There should be a functionality to query data lineage for the financial metrics (attributes Account_Metric_Type_Code, Account_Nbr_Modifier) stored in table EDW.T3058_ACCOUNT_METRIC.
UC_02 | There should be a functionality to trace the metric lineage back to the sources and further to the targets to the specified number of levels. The starting point is EDW.T3058_ACCOUNT_METRIC.
UC_03 | There should be a functionality to query the lineage for actually calculated

financial metrics for a defined period (attribute Period_Ending_Date, table EDW.T3058_ACCOUNT_METRIC).
UC_04 | There should be a functionality to see the object-level relations like table-to-table or table-to-view relations.
UC_05 | There should be a possibility to see the source and target metrics if such exist.
UC_06 | The lineage tracing solution should implement a reusable pattern, which could later be used for classifiers other than the financial metrics.
UC_07 | The lineage tracing pattern should be implemented upon the existing metadata.

The EDW warehousing environment is logically divided into different layers. Next, we provide the description of the layers on which we have mainly concentrated our solution:
1. EDW_DW is a denotation for the main physical layer. The physical schema name is EDW.
2. EDW_CL is a denotation for the main logical layer. The physical schema name is HBG.
3. EDW_SA is a denotation for the staging area layer. The physical schema name is OSA.

We will not describe other layers here, as this would not give any additional value to the current work. We know that in the physical layer the data is generally stored by means of tables, while in the logical layer it actually appears in views. Therefore, data propagation from the physical to the logical layer is done by means of view structures. The data propagation from tables to tables or from views to tables is possible only by means of ETL processes. In the prototype solution, we have divided the relations into ETL relations and DB relations (for the table-to-view relations) and have separated the ways we search for them.

The table T3058_ACCOUNT_METRIC resides in the physical layer EDW_DW, in the EDW schema. The prototype development has been started from this table according to the functional requirements UC_01 and UC_02. The table holds the monthly summary financial metrics calculated for the Financial Institution's customer accounts. The UML diagram of T3058_ACCOUNT_METRIC together with the table attributes is presented in Figure 12. Table EDW.T7060_ACCOUNT_METRIC_TYPE holds

The table EDW.T7060_ACCOUNT_METRIC_TYPE holds the semantic descriptions of the financial metrics; we leave it without attention, as we are not interested in the business meanings of the financial metrics but only in the technical data processing.

Figure 12. UML diagram for the T3058_ACCOUNT_METRIC table.

According to the functional requirements UC_04 and UC_05, we have outlined two different data lineage levels:

The data lineage object level is the level which stores the relations between objects, such as table-to-table, table-to-view, view-to-view and view-to-table. Note that data propagation from table to table and from view to table is possible only by means of ETL processes.
The data lineage classifier level is the level which stores the relations between (classifier and object)-to-(classifier and object). In our case the classifier attribute is Account_Metric_Type_Code, which holds the values of the financial metrics.

The data lineage for the object and classifier levels is depicted in Figure 13. The object level lineage is shown with black arrows and the classifier lineage with red arrows. The tables are annotated as T (with a number) and the views as V (with a number). We assume that the tables reside in the physical layer and the views in the logical layer. The red denotation M1 denotes the value 1 in the attribute Account_Metric_Type_Code; the red flow therefore represents the lineage of financial metric 1, which is the classifier level. The data propagation way between the objects is shown as ETL or view.

Figure 13. The object level and the classifier level example.

The relations for the data lineage object level can be stored in a separate table with level_no, source and target attributes, as presented in Table 18 (based on the relations given in Figure 13).

Table 18. Table holding relations between objects (Object_Relations).

level_no   source   target
1          T1       T2
2          T2       T3
3          T3       V1
4          T3       V2
5          T3       V4
6          V2       T4
7          T4       T5

The lineage for a specific object can then be queried recursively upon this table, as sketched below.
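A minimal sketch of such a recursive query is given below. It assumes the Object_Relations table from Table 18 and the starting object T3; it is only an illustration, not the exact query used in the prototype.

-- Follow the object level relations downstream from a starting object (here T3).
WITH RECURSIVE lineage (source_obj, target_obj, lvl) AS
(
    SELECT source, target, 1
    FROM   Object_Relations
    WHERE  source = 'T3'                 -- starting object, illustrative value

    UNION ALL

    SELECT r.source, r.target, l.lvl + 1
    FROM   Object_Relations r
    JOIN   lineage l
      ON   r.source = l.target_obj
    WHERE  l.lvl < 10                    -- level limit, cf. UC_02
)
SELECT * FROM lineage
ORDER BY lvl;

Tracing back to the sources works in the same way, with the anchor part filtering on the target column and the join condition reversed.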

Generally, we have adopted the idea of storing the data lineage classifier level relations from the TRIO project, where the concept of the lineage table lin:R_result(parent_id, table, child_id) has been presented. An example of storing the classifier level relations is presented in Table 19.

Table 19. Table holding relations between classifiers (Classifier_Relations).

level_no   metric_nbr   source   target
1          1            T1       T2
2          1            T2       T3
3          1            T3       V1
4          1            T3       V2
5          1            V2       T4

According to the functional requirements, the type of lineage we need to capture can be identified as internal where-provenance: we are not interested in exactly what transformations the tuples from the source objects undergo, but we are interested in the locations the values come from. We use the lineage type classification table from the previous chapters in order to depict the required lineage type, marked with an X in Table 20. The annotation (eager) lineage calculation approach is possible only when the facilities for such a solution are foreseen from the very beginning of the database implementation. There is no possibility for such an approach in the current warehousing environment, and therefore we have no choice but to implement the lazy calculation approach.

Table 20. Categorization of systems according to provenance types (data, or fine-grained, internal provenance).

                     Annotation (eager) approach   Non-annotation (lazy) approach
Why-provenance       TRIO                          SPIDER
How-provenance       TRIO                          SPIDER
Where-provenance     DBNotes                       X

According to requirement UC_03, the lineage should be calculated only for the financial metrics existing in the specified period (in our case, the selected month). While the table T3058_ACCOUNT_METRIC holds data aggregated on a monthly basis (each period is denoted by the last day of the month), the source objects can hold the customers' account facts data on a daily basis.
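As a minimal illustration of this period restriction, the financial metrics actually calculated for one month could be listed as follows; the period value is only an example.

-- Metrics actually calculated for one reporting month (cf. UC_03).
SELECT DISTINCT
       Account_Metric_Type_Code,
       Account_Nbr_Modifier
FROM   EDW.T3058_ACCOUNT_METRIC
WHERE  Period_Ending_Date = DATE '2015-12-31';   -- last day of the selected month, example value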

This difference in granularity (monthly metrics versus daily source facts) was kept in mind while implementing the Data Group Scanner described later in this work.

We would like to mention that requirement UC_06 was not fulfilled. In order to make the prototype solution pattern applicable to other possible warehouse classifiers (not only Account_Metric_Type_Code but also other numerical attributes), additional research would have to be performed for each classifier separately. The prototype solution could be redeveloped to be more universal by taking the peculiarities of the other classifiers into account, but this is very time-consuming and tedious work, which is out of the scope of the current thesis and is left for future prototype improvements.

According to requirement UC_07, the prototype solution has been built upon the metadata information available in the current warehousing environment as far as possible. Several metadata usage constraints have been identified within this work, and workarounds have been implemented for them.

3.3 Prototype Description

In this chapter, we describe the prototype solution in more detail. The system consists of seven main transformation steps. Additional recursive queries are built upon the lineage result table to query the metric lineage back to the sources and forward to the next target objects, and an additional query was built in order to implement the data lineage visualization. The prototype object usage summary is presented in Table 21. The metadata objects used are denoted with the schema name MEPL; the volatile tables created specifically for the prototype have no schema denotation. Each object is supplemented with an overall description. Additionally, a description of what happens during each transformation step is given in Table 22.

Table 21. Prototype object usage summary.

Step 1.1
Source objects:
- MEPL.MD_LOGICAL_OBJECT_USAGE: Metadata - logical object usage in ETL processes and in DB views, macros and procedures.
Target objects:
- MD_ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages by package and package step. Each object is assigned an object category.

Step 1.2
Source objects:
- MD_ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages by package and package step. Each object is assigned an object category.
- MEPL.MD_LOGICAL_OBJECT: Metadata - logical object.
- MEPL.MD_LOGICAL_OBJECT_COLUMN: Metadata - columns of a logical object.
Target objects:
- AUX_MD_ETL_SOURCE_TARGET: Volatile (auxiliary) table holding the filtered source and target objects with financial facts.

Step 2.1
Source objects:
- MEPL.MDE_LOG_OBJECT_USAGE_SUMMARY: Metadata - shows logical object usage both in ETL and in DB, with owner services.
Target objects:
- ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages. The information is retrieved from a metadata source parallel to the one used in step 1.1.

Step 2.2
Source objects:
- AUX_MD_ETL_SOURCE_TARGET: Volatile (auxiliary) table holding the filtered source and target objects with financial facts.
Target objects:
- ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages (see step 2.1).

Step 3.1
Source objects:
- MEPL.MDE_LOG_OBJECT_USAGE_SUMMARY: Metadata - logical object usage in ETL processes and in DB views, macros and procedures.
Target objects:
- DB_SOURCE_TARGET: Volatile table holding the table-to-view and view-to-view relations. The information is gathered from two different sources.

Step 3.2
Source objects:
- MEPL.MD_LOGICAL_OBJECT: Metadata - logical object.
- MEPL.MD_LOGICAL_OBJECT_COLUMN: Metadata - columns of a logical object.
- MEPL.MD_LOGICAL_OBJECT_USAGE: Metadata - logical object usage in ETL processes and in DB views, macros and procedures.
Target objects:
- DB_SOURCE_TARGET: Volatile table holding the table-to-view and view-to-view relations. The information is gathered from two different sources.

Step 4
Source objects:
- EDW.T0009_PROCESS_RUN: The process run instance, where the execution of ODI processes is logged.
- MEPL.MD_PROCESS: Metadata - process (package) with execution parameters, scheduling and dependency information.
- EDW.T3058_ACCOUNT_METRIC: EDW - summary account information at the end of a time period for a particular balance category.
- EDW.T03056_ACCOUNT_SUMMARY_DLY_DD: EDW - summary account information at the end of day.
- EDW.T03064_ACCT_INSURANCE_METRIC: EDW - insurance metric values for insurance accounts.
- EDW.T03057_ACCOUNT_METRIC_HISTORY: EDW - account metric history for validity periods with a specific start date and end date.
- EDW.T4150_ASSET_METRIC_HISTORY: EDW - asset metric value history for a particular asset metric type code.
- EDW.T3059_ACCOUNT_METRIC_DLY: EDW - account metric information at the end of day for a particular balance category.
Target objects:
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.

Step 5
Source objects:
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.
- OSA_DATA_GROUP_NO_PROCESS: Volatile (temporary) table holding the data groups with no process_run_id attribute assigned.
- MEPL.MDE_CLASSIFIER_USAGE_SUMMARY: Metadata - classifier usage summary scanned from ETL, DB objects and foreign keys.
Target objects:
- OSA_DATA_GROUP_NO_PROCESS: Volatile (temporary) table holding the data groups with no process_run_id attribute assigned.
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.

Step 6
Source objects:
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.
- ETL_SOURCE_TARGET: Volatile table holding the source and target objects for ETL packages (see step 2.1).
- MEPL.MD_LOGICAL_OBJECT: Metadata - logical object.
Target objects:
- OSA_ETL_SOURCE: Volatile (temporary) table storing the source objects for the data groups.

Step 7
Source objects:
- EDW.T0079_PROCESS_OBJECT_REL_DTL
- OSA_T0079: Volatile (temporary) table holding a copy of the mapping from EDW.T0079_PROCESS_OBJECT_REL_DTL.
- OSA_T0079_2: Volatile (temporary) table holding the mapping replaced with real data values.
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.
Target objects:
- OSA_T0079: Volatile (temporary) table holding a copy of the mapping from EDW.T0079_PROCESS_OBJECT_REL_DTL.
- OSA_T0079_2: Volatile (temporary) table holding the mapping replaced with real data values.
- DATA_GROUP_RELATION: Volatile table holding the data group relations; this is the result table holding the lineage information.
- DATA_GROUP: Volatile table holding the scanned information in the form of data groups.
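To give an impression of what the transformation steps look like in practice, a simplified sketch of step 1.1 is shown below. The column names assumed here for MEPL.MD_LOGICAL_OBJECT_USAGE (package_code, package_step_nbr, object_name, usage_type) are illustrative only and do not necessarily exist in the real metadata model; the actual queries are given in the appendixes.

-- Simplified sketch of step 1.1 (assumed metadata columns, not the actual prototype code):
-- pair every ETL package step's source object with its target object.
CREATE VOLATILE TABLE MD_ETL_SOURCE_TARGET AS
(
    SELECT  src.package_code,
            src.package_step_nbr,
            src.object_name AS source_object,
            trg.object_name AS target_object
    FROM      MEPL.MD_LOGICAL_OBJECT_USAGE src
    LEFT JOIN MEPL.MD_LOGICAL_OBJECT_USAGE trg
           ON  trg.package_code     = src.package_code
           AND trg.package_step_nbr = src.package_step_nbr
           AND trg.usage_type       = 'TARGET'       -- assumed flag value
    WHERE   src.usage_type = 'SOURCE'                -- assumed flag value
) WITH DATA
ON COMMIT PRESERVE ROWS;

The object category assignment and the filtering of the non-financial objects (steps 1.2 and 2.x) would then be applied on top of this volatile table.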

Table 22. Prototype transformation steps description.

Step 1.1 - Find ETL relations: retrieve the ETL relations from MEPL.MD_LOGICAL_OBJECT_USAGE. Left join the source and target objects together by package and package step. Assign an object category to each object.
Step 1.2 - Find ETL relations: based on the object obtained in step 1.1, filter out the objects which do not hold any financial facts (objects holding mappings, rules, configurations, etc.) by object category. Additionally, filter the objects by checking their columns, assuming that a table holding financial facts should have specific columns (such as _metric_amt, _metric_code, _metric_rate, etc.).
Step 2.1 - Find ETL relations: retrieve the ETL relations from a parallel source of information, MEPL.MDE_LOG_OBJECT_USAGE_SUMMARY. Left join the source and target objects together by package. Assign an object category to each object. Filter out the objects which do not hold any financial facts (objects holding mappings, rules, configurations, etc.) by category.
Step 2.2 - Find ETL relations: add the relations from the auxiliary object obtained in step 1.2 which are true and which are missing in the object obtained in step 2.1.
Step 3.1 - Find table-view relations: retrieve the relations from MEPL.MDE_LOG_OBJECT_USAGE_SUMMARY. Left join the source and target objects together. Assign an object category to each object. Filter out the objects which do not hold any financial facts by category, similarly to step 2.1.
Step 3.2 - Find table-view relations: retrieve the relations from MEPL.MD_LOGICAL_OBJECT_USAGE. Left join the source and target objects together. Assign an object category to each object. Filter out the objects which do not hold any financial facts by category and by columns, similarly to step 1.2.
Step 4 - Data Group Scanner: according to the developed pattern, scan the database objects holding the metric facts.
Step 5 - Data Group Scanner: find the processes and packages for the data groups which are missing the process_run_id, using the metadata object MEPL.MDE_CLASSIFIER_USAGE_SUMMARY.
Step 6 - Data Group source objects: find the source objects for the data groups by the processes and packages related to the data groups. The process and package for a data group are found in steps 4 and 5, and the ETL relations are found in steps 1.1-2.2.
Step 7 - Data Group Relations: find the source-target relations between the data groups.

The code of each transformation step is presented in Appendixes 1-8. Despite the number of objects processed and the volume of the transformation steps, the final architectural solution is very simple and consists of two tables. All of the prototype work and transformations are done in order to retrieve the lineage information, which is finally stored in the form of data groups and the relations between them. All other objects created during the prototype run can be considered temporary and are required only for the data processing. The architecture of the final lineage module is presented in Figure 14.
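To make the two-table architecture more concrete, a minimal sketch of the two result tables is given below. The column lists are assumptions based on the grouping attributes described in chapter 3.4 below and on the relation descriptions in Table 21; they are not the actual prototype DDL, which is given in the appendixes.

-- Assumed structures, shown for illustration only.
CREATE VOLATILE TABLE DATA_GROUP
(
    data_group_id            INTEGER,      -- surrogate key of the data group (assumed)
    account_metric_type_code INTEGER,      -- the financial metric classifier
    account_nbr_modifier     VARCHAR(10),  -- source system of the account
    country_context_code     VARCHAR(3),   -- country the account belongs to
    process_name             VARCHAR(128), -- ETL process that derived the metric
    process_run_id           INTEGER       -- resolved in step 5 when missing
)
PRIMARY INDEX (data_group_id)
ON COMMIT PRESERVE ROWS;

CREATE VOLATILE TABLE DATA_GROUP_RELATION
(
    source_data_group_id INTEGER,           -- upstream data group
    target_data_group_id INTEGER            -- downstream data group
)
PRIMARY INDEX (source_data_group_id)
ON COMMIT PRESERVE ROWS;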

Figure 14. UML diagram for the Data Lineage module.

The explanation of how the data groups are obtained is provided in the next chapter.

3.4 Data Scanner

In terms of financial metrics lineage, we are not interested in retrieving the lineage for every particular account or tuple in the table EDW.T3058_ACCOUNT_METRIC. Therefore, we implemented the Data Scanner, which divides the data set of EDW.T3058_ACCOUNT_METRIC into subsets and stores these subsets in an aggregated form. Later, the subsets are checked for whether they intersect with the relations found. The idea is similar to the schema mapping solution described in the SPIDER application example. The data is mainly grouped by:

Account_Metric_Type_Code - the type code of the financial metric;
Account_Nbr_Modifier - identifies from which source database the account has been loaded to the data warehouse (credit, leasing, etc.);
Country_Context_Code - identifies which country the account belongs to;
Process_Name - the name of the ETL process by which the financial metrics have been derived into the table T3058_ACCOUNT_METRIC;
