UNIVERSITY OF CINCINNATI

UNIVERSITY OF CINCINNATI

_____________________, 20____

I, _________________________________________________, hereby submit this as part of the requirements for the degree of:

_________________________________________________

in: _________________________________________________

It is entitled: _________________________________________________

Approved by: _________________________________________________

Migrating an Operational Database Schema to Data Warehouse Schemas

A thesis submitted to the Division of Graduate Studies and Research of the University of Cincinnati in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in the Department of Electrical & Computer Engineering and Computer Science of the College of Engineering

March 8, 2001

by

Cassandra Phipps
B.S., Wilmington College, 1995

Thesis Advisor and Committee Chair: Dr. Karen C. Davis

3 Abstract The popularity of data warehouses for analysis of data has grown tremendously, but much of the creation of data warehouses is currently done manually. Although the initial design process is labor-intensive and expensive, research towards automating data warehouse creation has been limited. We propose and illustrate algorithms for automatic schema development. Our first algorithm uses a conceptual enterprise schema of an operational database as a starting point for source-driven data warehouse schema design. Candidate conceptual data warehouse schemas are created in ME/R model form. We extend the ME/R modeling notation to note where additional user input can be used to further refine a schema. Our second algorithm follows a user-driven requirements approach that utilizes queries to guide selection of candidate schemas most likely to meet user needs. We propose a guideline of manual steps to refine the conceptual schemas to suit additional user needs, for example, the level of detail needed for date fields. The selected and possibly refined schemas are now ready to be transformed into logical schemas. The third algorithm creates logical schemas in Star model notation from the conceptual schemas in ME/R notation. The logical model provides a basis for physical modeling and the data warehouse implementation. Our algorithms provide a foundation for an automated software tool to create and evaluate data warehouse schemas. The algorithms are illustrated using the TPC-H Benchmark schema and queries.

4 Copyright 2002 by Cassandra Phipps

Table of Contents

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
    GENERAL RESEARCH OBJECTIVE
    SPECIFIC RESEARCH OBJECTIVES
    RESEARCH METHODOLOGY
    CONTRIBUTIONS OF THE RESEARCH
    OVERVIEW
CHAPTER 2: RELATED RESEARCH
    PRE-DEVELOPMENT
    ARCHITECTURE SELECTION
    SCHEMA CREATION
        Decision Support Functionality and Terminology
        ER-based Models
        Dimensional Models
        Models for Schema Creation
        Automated Schema Design
    DATA WAREHOUSE POPULATION
    DATA WAREHOUSE MAINTENANCE
    SUMMARY
CHAPTER 3: DEVELOPING A CONCEPTUAL DATA WAREHOUSE SCHEMA
    PRE-DEVELOPMENT: USER REQUIREMENT GATHERING
        Source-driven Requirements
        User-driven Requirements
    CONCEPTUAL SCHEMA CREATION
    CANDIDATE SCHEMA SELECTION
    MANUAL REFINEMENT OF CONCEPTUAL SCHEMAS
    SUMMARY
CHAPTER 4: DEVELOPING A LOGICAL DATA WAREHOUSE SCHEMA
    LOGICAL SCHEMA GENERATION
    MANUAL REFINEMENT OF LOGICAL SCHEMAS
    SUMMARY
CHAPTER 5: CONCLUSIONS AND FUTURE WORK
    BASIS OF ALGORITHMS
        Conceptual Schema Creation
        Conceptual Schema Evaluation
        Logical Schema Creation
    FUTURE WORK
BIBLIOGRAPHY
APPENDIX A: TPC-H SCHEMA AND QUERIES
    A.1 TPC-H SCHEMA
    A.2 TPC-H QUERIES
        A.2.1 Pricing Summary Report Query (Q1)

        A.2.2 Minimum Cost Supplier Query (Q2)
        A.2.3 Shipping Priority Query (Q3)
        A.2.4 Order Priority Checking Query (Q4)
        A.2.5 Local Supplier Volume Query (Q5)
        A.2.6 Forecasting Revenue Change Query (Q6)
        A.2.8 National Market Share Query (Q8)
        A.2.9 Product Type Profit Measure Query (Q9)
        A.2.10 Returned Item Reporting Query (Q10)
        A.2.11 Important Stock Identification Query (Q11)
        A.2.12 Shipping Modes and Order Priority Query (Q12)
        A.2.13 Customer Distribution Query (Q13)
        A.2.14 Promotion Effect Query (Q14)
        A.2.15 Top Supplier Query (Q15)
        A.2.16 Parts/Supplier Relationship Query (Q16)
        A.2.17 Small-Quantity-Order Revenue Query (Q17)
        A.2.18 Large Volume Customer Query (Q18)
        A.2.19 Discounted Revenue Query (Q19)
        A.2.20 Potential Part Promotion Query (Q20)
        A.2.21 Suppliers Who Kept Orders Waiting Query (Q21)
        A.2.22 Global Sales Opportunity Query (Q22)
APPENDIX B: CONCEPTUAL SCHEMA TABLES
APPENDIX C: DOCUMENTATION FOR ALGORITHM SUBROUTINES
    C.1 SUBROUTINES FOR CONCEPTUAL SCHEMA CREATION
    C.2 SUBROUTINES FOR CANDIDATE SCHEMA EVALUATION
    C.3 SUBROUTINES FOR LOGICAL SCHEMA CREATION

List of Figures

FIGURE 2.1: CUBE DEPICTED GRAPHICALLY
FIGURE 2.2: SAMPLE ER SCHEMA
FIGURE 2.3: AN EVER SCHEMA
FIGURE 2.4: A STARER SCHEMA
FIGURE 2.5: ME/R SCHEMA
FIGURE 2.6: ME/R LEGEND
FIGURE 2.7: A DIMENSIONAL STAR SCHEMA
FIGURE 2.8: SNOWFLAKE SCHEMA
FIGURE 2.9: DFM SCHEMA
FIGURE 3.1: TPC-H SCHEMA
FIGURE 3.2: ALGORITHM FOR CONCEPTUAL SCHEMA CREATION
FIGURE 3.3: WALK RELATIONS SUB-PROCEDURE FOR CONCEPTUAL SCHEMA GENERATION
FIGURE 3.4: LINEITEM EVENT OF ME/R
FIGURE 3.5: ATTRIBUTES OF LINEITEM EVENT
FIGURE 3.6: DATE LEVELS OF LINEITEM EVENT
FIGURE 3.7: LINEITEM NODE WITH ATTRIBUTES ADDED TO LINEITEM EVENT
FIGURE 3.8: ORDER LEVEL AND ATTRIBUTES ADDED TO THE LINEITEM EVENT
FIGURE 3.9: COMPLETE ORDERS DIMENSION OF LINEITEM EVENT
FIGURE 3.10: CANDIDATE SCHEMA 1: LINEITEM EVENT
FIGURE 3.11: CANDIDATE SCHEMA 2: PARTSUPP EVENT
FIGURE 3.12: CANDIDATE SCHEMA 3: PART EVENT
FIGURE 3.13: CANDIDATE SCHEMA 4: ORDERS EVENT
FIGURE 3.14: CANDIDATE SCHEMA 5: CUSTOMER EVENT
FIGURE 3.15: CANDIDATE SCHEMA 6: SUPPLIER EVENT
FIGURE 3.16: CANDIDATE SCHEMA EVALUATION ALGORITHM
FIGURE 3.17: CANDIDATE SCHEMA 1: ME/R LINENUMBER MEASURE CHANGE
FIGURE 3.18: CANDIDATE SCHEMA 2: ME/R WITH DATE DIMENSION DEFINED
FIGURE 3.19: PARTIAL LINEITEM ME/R WITH ADDED REVENUE MEASURE
FIGURE 3.20: TWO FACTS WITH A SHARED DIMENSION
FIGURE 4.1: AUTOMATED LOGICAL SCHEMA CREATION ALGORITHM
FIGURE 4.2: ADD_LEVEL_TO_DIMENSION PROCEDURE FOR AUTOMATED LOGICAL SCHEMA CREATION
FIGURE 4.3: LINEITEM EVENT FACT TABLE
FIGURE 4.4: ADDITION OF LINEITEM DIMENSION TO STAR MODEL OF CANDIDATE SCHEMA
FIGURE 4.5: ADDITION OF COLUMNS FOR LINEITEM DIMENSION
FIGURE 4.6: ORDERS DIMENSION FOR CANDIDATE SCHEMA
FIGURE 4.7: STAR SCHEMA WITH COMPLETED ORDERS DIMENSION
FIGURE 4.8: PARTSUPP DIMENSION ADDED TO STAR SCHEMA OF CANDIDATE SCHEMA
FIGURE 4.9: CANDIDATE SCHEMA 1 AS A STAR SCHEMA
FIGURE 4.10: CANDIDATE SCHEMA 2 AS A STAR SCHEMA
FIGURE 4.11: CANDIDATE SCHEMA 5 AS A STAR SCHEMA
FIGURE 4.12: CANDIDATE SCHEMA 1: LINEITEM EVENT WITH MERGED DATE DIMENSIONS

List of Tables

TABLE 2.1: MODELS FOR CONCEPTUAL AND LOGICAL SCHEMAS
TABLE 3.1: NUMERIC COLUMNS PER OLTP TABLE
TABLE 3.2: FACT_NODE_TABLE FOR LINEITEM EVENT CANDIDATE SCHEMA
TABLE 3.3: FACT_ATTRIBUTE_TABLE FOR LINEITEM EVENT
TABLE 3.4: LEVEL_TABLE FOR LINEITEM EVENT WITH DATE/TIME LEVELS
TABLE 3.5: LEVEL_TABLE WITH LINEITEM LEVEL ADDED
TABLE 3.6: LEVEL_ATTRIBUTE_TABLE WITH LINEITEM ATTRIBUTES
TABLE 3.7: LEVEL_TABLE FOR LINEITEM WITH ORDERS AND THEIR SUB-LEVELS ADDED
TABLE 3.8: LEVEL_ATTRIBUTE_TABLE WITH ADDITIONAL LINEITEM ATTRIBUTES
TABLE 3.9: CANDIDATE SCHEMA EVALUATION
TABLE 3.10: FACT_NODE_TABLE, EVALUATED FOR SCHEMAS TO USE FOR DATA WAREHOUSE
TABLE A.1: TABLES
TABLE A.2: TABLE_COLUMNS
TABLE A.3: TABLE_CONSTRAINTS
TABLE A.4: TABLE_RELATIONS
TABLE B.1: CANDIDATE SCHEMA 1: FACT_NODE_TABLE
TABLE B.2: CANDIDATE SCHEMA 1: FACT_ATTRIBUTE_TABLE
TABLE B.3: CANDIDATE SCHEMA 1: LEVEL_TABLE
TABLE B.4: CANDIDATE SCHEMA 1: LEVEL_ATTRIBUTE_TABLE
TABLE B.5: FACT_NODE_TABLE FOR CANDIDATE SCHEMAS
TABLE B.6: FACT_ATTRIBUTE_TABLE FOR CANDIDATE SCHEMAS
TABLE B.7: LEVEL_TABLE FOR CANDIDATE SCHEMAS
TABLE B.8: LEVEL_ATTRIBUTE_TABLE FOR CANDIDATE SCHEMAS

9 Chapter 1: Introduction As conventional transaction processing systems have matured, becoming faster and more stable, the focus of user needs has changed. Conventional systems store daily activities and display or report these events on a regular basis. Now companies want to increase the value of their transaction processing systems; value to an organization means turning data into actionable information [R96a]. The demands of analytical processing for decision support may exceed the capabilities of systems already processing daily transactions. Although traditional OnLine Transaction Processing (OLTP) systems have some, or all, of the necessary data, it is not easily accessed by the user for analytical processing. The need for OnLine Analytical Processing (OLAP) gives rise to the data warehouse concept. Inmon gives a succinct definition of a data warehouse: a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making [I92]. In other words, a data warehouse is a non-changing collection of data about some logical part of an organization s business. The data is generally a quantitative result at some time interval such as day or month. This data is used to make analytical decisions for business planning and evaluation. The data in a data warehouse includes direct copies of data from a transactional system, historical data, summarized data, and/or consolidated data. The data warehouse is often an aggregation of data from various sources that provides a repository to support OLAP without interruption to operational OLTP system(s) [R96b]. To facilitate decision support, the data warehouse has some distinct variations from a traditional OLTP system. Widom summarizes these differences as: 1) the views in a data warehouse are more complicated, 2) data warehouses contain highly aggregated and summarized data instead of transactional data, 3) view maintenance is not easily automated in a data warehouse because data can come from multiple heterogeneous sources, 4) the refresh requirements on data in a warehouse are different, and 5) data warehouse data may need to be 5

10 cleaned, scrubbed, or otherwise conformed [W95]. These differences necessitate that data warehouse design and creation are approached differently than OLTP database design/creation. The data warehouse fulfills different needs and thus requires a different approach to design. The data warehouse concept is critical when all the data is not stored in the same location and/or manner. For decision support, data is integrated from multiple sources that may be stored in different database platforms under different data models or have different data types. Building a data warehouse includes mapping data from different sources, cleaning and translating the data into a common model, populating the database and maintaining the data warehouse. A data warehouse creation process consists of five steps: pre-development activities, architecture selection, schema creation, warehouse population, and data warehouse maintenance [M97, SC99]. Each of these steps is affected by the previous step. The focus of this thesis is the schema creation phase that is impacted by the output of the pre-development and architecture selection steps. The pre-development activities phase defines the business processes model, including the detail necessary for this business information. This phase is primarily end-user requirement analysis; information is collected from discussions with users and from analysis of the organizational goals. During the pre-development phase, capacity planning is also performed. The architecture selection step determines how data warehouse synchronization (keeping data warehouse information up-to-date with business transactional data) occurs, and how it is physically stored. Information can be stored in one physical location/database or split over multiple database instances. The schema creation step includes the creation of an enterprise data model, the integration of existing schemas, and identification of constraint (database business rules) mismatches. The warehouse population step deals with semantic issues, such as resolving how to manage duplicate data. Scalability issues as the data warehouse grows, along with accomplishing incremental updates, are concerns of the maintenance step. The data warehouse maintenance phase encompasses plans for maintaining the database after it is created. This 6

11 includes traditional database maintenance issues such as tuning, performance, and view maintenance. The focus of this thesis is the schema creation phase and its automation; while this phase includes conceptual, logical, and physical schema design, this thesis only addresses conceptual and logical design of a data warehouse. The conceptual model allows a high level design of entities and their relationships, represented in a user-friendly manner independent of implementation issues. A conceptual schema is a description of the data to be in the data warehouse that is understood by end users to verify requirements and possible gaps. The conceptual schema gives the users an understandable picture of how the data can be conceptualized for analysis of business goals. The logical model describes more specifically how the data is organized, including actual table and field names to accommodate user needs in a form more like the eventual implementation database for the data warehouse. This schema is a proposed basic layout for organization of the data into usable entities for query processing. The physical schema design is the implementation and storage of the data warehouse schema. At this point, database sizing, index creation, and view materialization are considered. Most approaches to the conceptual and logical schema design problem are manual in nature. The task of developing a data warehouse schema is generally given to those who know the OLTP systems best. This manual approach is time-consuming and labor-intensive. The traditional online databases of an organization are built upon differing technologies, represent overlapping information, and are modeled using different techniques. OLTP systems are designed to support user applications. User requirements of a data warehouse are more analytical in nature, and less predictable. Their design and data models must accommodate these different user needs. The differences in user needs and system designs between OLTP and OLAP make converting from one to the other challenging. This thesis focuses on how to reduce the manual labor involved in modeling data warehouses. Schemas of existing OLTP system(s) and user requirements are used to create a data 7

warehouse schema that is as complete as possible with decreased manual manipulation. This thesis examines OLTP systems and user requirements and creates a conceptual schema for the data warehouse. The candidate conceptual schemas generated by the algorithm are evaluated based on a list of user queries (requirements). An algorithm is presented to convert a conceptual schema to an initial logical schema. The conceptual schema is presented using the ME/R model and the Star model is used for logical schema representation. Both models are explained in Chapter 2. 1.1 General Research Objective The research objective of this thesis is to create semi-automated techniques to develop candidate data warehouse schemas, both conceptual and logical, with corresponding migration paths from operational databases. 1.2 Specific Research Objectives In order to develop semi-automated techniques for data warehouse schema creation, several research objectives must be met along the way. These objectives guide the design of the algorithms. The following is a list of objectives to be met: A. Select a data warehouse architecture(s) as an environment for schema generation. An architecture is more appropriately represented by the physical design phase, but the conceptual and logical schemas are impacted by the target architecture. B. Determine data warehouse modeling techniques to be used for conceptual and logical schema representation. C. Determine how a traditional operational database schema can be used to design data warehouse conceptual schemas. D. Evaluate the automatically generated conceptual schemas based on fulfillment of the user requirements.

13 E. Devise a method to create a logical data warehouse schema from the conceptual schema. 1.3 Research Methodology In order to achieve the above objectives, literature in the data warehouse field is surveyed. The following activities are conducted: A. Evaluate existing architectures for data warehouses. B. Survey the literature on data warehouse schema styles for conceptual and logical modeling. Consider ER, SERM, ME/R, Star, Snowflake, and StarER models and evaluate the relative merits of each. C. Survey data warehouse schema design techniques in order to evaluate/adopt current techniques to create candidate data warehouse schemas from an existing operational database schema; propose new algorithms where appropriate. D. Analyze relative merits of the candidate schemas with respect to a set of user queries. Evaluation is a measurement of how well a candidate schema can fulfill the requirements of the user queries. E. Survey techniques to create logical data warehouse schemas. Evaluate techniques and propose new algorithms where appropriate. 1.4 Contributions of the Research As a result of the activities in the research methodology, we expect to make the following contributions: A. A feature analysis of candidate architectures, with emphasis on schema design for data warehouses, and evaluation of a selected architecture with respect to the proposed schema design process. B. Evaluation and adoption of the most suitable graphical techniques for representing conceptual and logical data warehouse schemas. 9

14 C. Introduction and illustration of a schema design technique to derive candidate data warehouse conceptual designs from existing schemas of operational databases. D. Proposal of a semi-automated analysis technique for the candidate schemas in relation to user requirements. E. Development of an algorithm to convert from candidate conceptual schemas to logical schemas. In this thesis, algorithms for generating conceptual and logical schemas from existing OLTP schemas are presented. An approach to evaluate candidate schemas based on user queries is also contributed. 1.5 Overview Chapter 2 gives an overview of data warehouse design concepts that are part of automating schema creation. It includes a discussion of user requirements definition, architecture selection, data warehouse schema creation concepts, and various data models used to represent schemas. Chapter 3 presents an automated approach to constructing the candidate conceptual models for a data warehouse from an OLTP schema and known user requirements. This includes an algorithm for the candidate schema creation and another for evaluation of its completeness. Chapter 4 gives an automated approach for creation of a logical schema from a conceptual schema. As the conclusion, Chapter 5 states the advantages and known limitations of the thesis compared to currently proposed schema design techniques as well as a discussion of future work. Included in this chapter is a discussion of OLTP modeling techniques that are covered by this approach to schema generation. 10

15 Chapter 2: Related Research The data warehouse creation process used here adopts the three data warehouse design steps of Srivastava and Chen: architecture selection, schema creation, and warehouse population [SC99] and adds two additional steps, pre-development and warehouse maintenance. As part of the pre-development phase, user requirements gathering is the first logical step and is discussed in Section 2.1. The next phase, architecture selection, is presented in Section 2.2. A discussion of schema creation is described in Section 2.3. This section includes information about three schema levels: conceptual, logical and physical. Also discussed are various models used to represent these schema levels. The final two sections, 2.4 and 2.5, briefly discuss data warehouse population and maintenance, respectively. 2.1 Pre-Development The driving force behind data warehouse systems creation is determining what to analyze and why. Helping users answer these questions is user requirements gathering or predevelopment. Various industry and academic resources propose questions whose answers define what is needed in a data warehouse [BH98, GR99a, K96a, K96b, K00, M97, M98a, R96a, R96b]. In general, the information obtained includes descriptions of what they did in a typical day, how they measured the success of what they did, and how they thought they could understand their business better [K96a]. This information is the goal of user requirements gathering. In our research, we use two forms of user requirements, source-driven and user-driven. Source-driven requirements gathering is a method based on defining the requirements of the data warehouse using the source data or schema from the OLTP system. Automation of data warehouse modeling generally uses a source-driven method, Starting with the conceptual OLTP schema for generating requirements [BE99, GM98b]. We also utilize a source-driven approach for automated candidate schema generation. The benefits of using the source-driven requirements gathering approach is that minimal user time is required to start the project, and complete data is 11

16 supplied since the existing data in the OLTP system provides the framework for the data warehouse. The disadvantages of this approach are incomplete knowledge of user needs and the possibility of the limited data in the OLTP database falling short of those user needs. Reduced user involvement may result in the production of an incorrect set of requirements. To alleviate this disadvantage we include an opportunity for user refinement on our design process. User-driven requirements gathering is a method based on investigating functions users perform. This form of requirement gathering necessitates large blocks of time, from key users, to determine present and future business needs. The goal, meeting user needs, is achieved through questions defining what users need to know about their business. Determining what users want a system to be is part of the job of the design task or team. A good design is able to answer known business needs. This differs from OLTP user requirements gathering in that the transactional world requires knowing the processes by which users do their jobs. In a data warehouse we want to provide answers to questions. Various authors pose questions to be answered at the end of the user requirements gathering process [BH98, K00, R96a, R96b]. The resulting information from user-driven requirements gathering should answer the following [BH98]: 1. Who (people, groups, organizations) is of interest to the user? 2. What (functions) is the user trying to analyze? 3. Why does the user need the data? 4. When (for what point in time) does the data need to be recorded? 5. Where (geographically, organizationally) do relevant processes occur? 6. How do we measure the performance or state of the functions being analyzed? An advantage to this requirements approach is that the scope is more focused toward user need and thus may be smaller. On the negative side, it is challenging to keep user expectations at an attainable level and it may be impossible to answer all questions. There may not be enough appropriate data available. In this thesis, we focus on user needs as represented in a query workload to be applied against the data warehouse. The approach presented in this thesis relies primarily on source-driven requirements with secondary attention to user requirements for input and evaluation. This allows us to start the 12

17 design process while gathering user requirements. Having user-driven requirements prevents critical business needs from being overlooked. Our approach overcomes the disadvantage of missing key data by using enterprise-wide schemas as inputs. As long as the data exists somewhere in the organization, it can be found with the source-driven approach. What is not found in the organization is noted during user requirements gathering and added in where appropriate and possible. In this thesis, to illustrate the proposed algorithms, both source-driven and user-driven requirements are gathered from the TPC-H TM benchmark (TPC-H is a trademark of the Transaction Processing Performance Council). This benchmark provides an OLTP schema for source derivation and user queries for user requirements [TPC-H]. It is designed as a standard set of queries over a defined schema; they are presented in Appendix A. Capacity planning is another part of the pre-development phase. This deals with defining the amount of stored data including the number of years the data is kept for analysis. Also defined is the expected workload, and number of concurrent users. While this is an important step in the pre-development phase, the answers are more important to implementation and physical design. This may impact hardware and software technology decisions but is not a concern in the data warehouse schema creation algorithm proposed here. 2.2 Architecture Selection As the next phase of data warehouse design, the selection of an architecture is important to determine the scope of the design process and data warehouse project functionality. Architecture selection is based on such factors as current infrastructure, business environment, desired management and control structure, commitment to and scope of the implementation effort, capability of the technical environment the organization employs, and resources available [BH98]. The architecture of a data warehouse includes source databases, some method for translation to a data warehouse environment, and the data warehouse database. 13

18 The two most common architectures are top-down and bottom-up, or some derivation of the two. In a top-down architecture, a central data warehouse is designed. Data marts may be created as a subset of the data warehouse, where a data mart is generally a miniature, selfcontained data warehouse encompassing only a small part, or function, of the organization. A data mart is simply a smaller data warehouse that functions independently, or is interconnected with others to form an enterprise-wide data warehouse. In a bottom-up architecture the data marts are created initially and the data warehouse is created incrementally from these data marts [BH98, F98, I92]. An in-depth discussion of data warehouse architectures is given by Firestone [F98]. The determining factors for architecture selection are necessary resources, and time to completion. In the top-down architecture approach the cost of initial planning and design can be significant. It is a time consuming process and can delay actual implementation, benefits, and return on investment [BH98]. With the bottom-up architecture approach, the time it takes to have an initial offering is shorter. Data marts are generally quicker to build, less expensive, and require less hardware resources because their scope is smaller than an entire data warehouse. These advantages make the bottom-up architecture popular in industry. There is, however, an advantage to top-down versus bottom-up. Because all the data marts are created from an existing data warehouse, the data definitions in a top-down architecture are consistent, and enforcement of an organization s business rules are easier to implement in a single data location. Conversely, a disadvantage of the bottom-up architecture is the lack of data consistency. It is difficult to maintain data consistency and reduce data redundancy. Creating a data warehouse from many individual data marts requires the same data validation effort necessary only once with the topdown architecture. The difference is that for a top-down architecture the data consistency is established from the data sources. Since the data sources of the data warehouse are the data marts in a bottom-up architecture, data consistency is a concern later in the implementation. The proposed automation of data warehouse schema design is independent of the architecture. The automation proposed in Chapters 3 and 4 can be performed to create an entire 14

19 data warehouse, or data marts can be created on some subset of the organization s OLTP schema. An entire enterprise OLTP schema is needed for a complete top-down schema creation. With the TPC-H benchmark as an example, we illustrate the top-down approach by considering the entire function of a business is as a wholesale supplier. If this is only part of the business, as is the case for many industries, then our candidate schemas result in a bottom-up data mart implementation. Even though we create an enterprise data warehouse schema here, it is possible to implement the data warehouse in sections similar to the data mart approach using our algorithm. 2.3 Schema Creation Once an architecture is selected, schema creation, or design, begins. Data warehouse design has several goals, most of which are met or determined at this phase of data warehouse creation. The goals of data warehouse design are to provide the following [K96b]: 1. access to organizational data, 2. data consistency, 3. ability to examine data from multiple perspectives for analysis (slice and dice), 4. a data representation that allows querying, analyzing, and presenting data about business practices, 5. a means to publish user data, and 6. quality data to drive business reengineering. Schema design addresses these goals by creating a visualization of the business world for both end users and a guideline for implementation of the data warehouse for developers. In the OLTP arena, ER-based models are most widely used, and in the OLAP arena dimensional modeling is most popular. An ER model is aimed at producing a graphical representation of entities and their relationships to each other. The dimensional notation organizes data based on business rules of an organization. These two basic models are discussed further in Sections and 2.3.3, respectively. In the schema creation phase, as many as three schemas are produced to help meet these goals: a conceptual, a logical and a physical schema. The conceptual schema is a high level design of entities and their relationships, aimed at providing the users with an understandable 15

model of the data warehouse. Users can provide feedback enabling refinement of the conceptual design prior to its implementation. The conceptual schema provides a roadmap for the creation of a logical schema. The logical schema describes how the data is organized in the data warehouse in terms of table and attribute names, for example. The physical model is the actual data warehouse schema implementation, including file and index organization. The three schemas are expressed in various notations including ME/R [HS00, SB98], Star [BH98], StarER [TB99], and dimensional fact model [GR98]. This chapter is a discussion of data warehouse design and techniques used. Section 2.3.1 discusses additional functionality and terminology introduced for data warehousing. Sections 2.3.2 and 2.3.3 cover the two modeling approaches and the models used for each approach. Section 2.3.2 is Entity Relationship-based (ER) and Section 2.3.3 is dimensional-based. Section 2.3.4 discusses various models for conceptual, logical, and physical schemas. First, a discussion on the advantages/disadvantages of ER versus dimensional models is given. A survey of schema models is given next, concluding with decisions on what models are used throughout this thesis and why they are chosen. The final section, 2.3.5, discusses schema design processes proposed in the literature. 2.3.1 Decision Support Functionality and Terminology With the advent of OLAP systems, new functionality and terminology to support analysis that is not typically supported by OLTP systems is introduced. The two main differences are additional types of query processing and an accompanying data organization to support this new functionality. The term cube is used to represent the visualization of a three dimensional model of data. A cube is a visualization tool for how data in an OLAP system is related. Generally, a cube is considered to be three-sided, one side for each of the features analyzed. If a cube has more than three sides it is called a hypercube. The sides, or features, of a cube are generally high level data

21 such as customer, part, and ship dates for an order/shipping OLAP database. Each side has, as its rows, attributes about a customer such as city, region, and country. The intersect point of the desired rows of data from each of the three sides is the point of interest. In Figure 2.1, the intersect point of a customer city, a part number, and year gives the sales of a particular part for a specific city in the given year. The arrows around the cube denote the hierarchy of the attributes. For example, a country is made up of many regions, and each region of many cities, so the arrow is drawn from city and points toward the end of the cube towards country. Figure 2.1: Cube Depicted Graphically Four functions, roll-up, drill-down, slice, and dice, are operations to represent the data of the cube to the user. The functions roll-up and drill-down are opposites of each other. Roll-up is a term describing movement from one level of aggregation to a wider, more general, one; intuitively, zooming out to a summary data level. It takes the current data object and summarizes an aggregated feature on one of the sides of the cube. For example, a roll-up from city to region changes the result to regional sales for a part in a specific year. In contrast, the drill-down 17

22 function zooms in to a detailed data level. Drill-down generates the opposite analytical result from roll-up; country sales could be drilled down to yield sales results by region and further drilled down to show results by city. Slice and dice are functions related to visual data browsing of the cube to restrict what is viewed or to view only a specific entity. Slice and dice relates to selection and projection (using relational algebra terminology) along a side of the cube. Slicing cuts through the cube, focusing on a specific perspective. For example, if the user only wants to look at sales for the year 2000, the perspective of the cube changes to show only that year (the cube is sliced). Dicing rotates the cube to another perspective. If the user wants to look at part sales across regions then he or she may change perspective to look at yearly sales by region. The focus is changed from one side of the cube to another. The data content of a cube can be represented using either an ER-based model or a dimensional model. Each technique has its own set of modeling concepts and associated notational conventions. While the visual look-and-feel of both models are different, both are used in database modeling. The next two sections discuss specific modeling techniques that are ERbased and dimensional, respectively ER-based Models ER modeling has traditionally been used for OLTP systems because it depicts ambiguous relationships in the business world in an unambiguous manner. The ER model has been the traditional method used to provide a communication medium to end users. The primary focus of the ER model is data consistency and reducing data redundancy. The ER model has 2 basic concepts: entities and relationships. A detailed ER model may also contain attributes that are properties of the entities or relationships. Figure 2.2 provides an ER-based schema, created from the cube example in Section 2.3.1, for explaining ER modeling concepts. An entity, shown as a rectangle, represents a real world 18

23 object. In this example, Product and Customer are the entities. A relationship is represented by a diamond connected with lines to two or more entities that participate in the relationship. In Figure 2.2, Orders is the relationship between the Product and Customer entities. In a ER-based model the cardinality of the relationships is noted. In the figure, (1,1) and (1, n) denote the cardinality between the entities. The (1,1) cardinality denotes that a product is ordered by one customer and the (1, n) cardinality denotes that one customer orders n (many) products. Attributes are represented by lines terminating in a small circle and describe characteristics of an entity or relationship. In the example, Product has three attributes: ProductNo, PartType, and PartNumber. The ProductNo attribute is terminated with a solid circle because it is the key, or unique identifier of the Product entity. A key is a special type of attribute. With the creation of entities, data is easily grouped and separated. More advanced ER modeling information can be found elsewhere [BH98]. Figure 2.2: Sample ER Schema Because the ER model is time-tested, and works well for conceptual modeling of OLTP systems, it is logical to extend it to meet the needs of data warehousing. The ER model is well known to OLTP data modelers, who tend to be the ones now creating data warehouse models. 19
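
As a concrete reading of Figure 2.2, the sketch below renders its entities, relationship, and cardinalities as relational tables. It is an illustration only: the Product columns follow the figure, the Customer attributes are assumed from the running cube example, the data types are arbitrary, and the CUSTOMER_ORDERS table name and the UNIQUE constraint used to capture the (1,1) cardinality are choices made here rather than anything prescribed by the ER model.

    -- Entities from Figure 2.2 (data types are illustrative).
    CREATE TABLE CUSTOMER (
        CustomerNo INTEGER PRIMARY KEY,     -- key attribute of the Customer entity
        City       VARCHAR(30),             -- descriptive attributes assumed from the
        Region     VARCHAR(30),             -- running cube example of Section 2.3.1
        Country    VARCHAR(30)
    );

    CREATE TABLE PRODUCT (
        ProductNo  INTEGER PRIMARY KEY,     -- key attribute (solid circle in the figure)
        PartType   VARCHAR(20),
        PartNumber VARCHAR(20)
    );

    -- The Orders relationship with its cardinalities: the UNIQUE constraint on
    -- ProductNo enforces that a product is ordered by exactly one customer (1,1),
    -- while nothing limits how many products one customer orders (1,n).
    CREATE TABLE CUSTOMER_ORDERS (
        ProductNo  INTEGER NOT NULL UNIQUE REFERENCES PRODUCT (ProductNo),
        CustomerNo INTEGER NOT NULL REFERENCES CUSTOMER (CustomerNo)
    );

Translating the diagram this directly is, of course, a logical-design step; the point is only that every ER construct (entity, relationship, cardinality, attribute, key) has a natural relational counterpart, which is also what the data warehouse extensions of the ER model build on.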

24 Three such extended models are discussed here: EVER, starer, and ME/R. They are ER-based models that have been extended to include the dimensional functionality necessary in data warehousing. The emphasis is shifted to monitoring facts or events of a business rather than describing data storage for an end user application. EVER is the abbreviation for the EVent-Entity-Relationship model [B99], an entitybased conceptual modeling language. The concept of EVER is that measurement data is associated with an event such as sale, order, ship, or pay. Using the cube example from Section we can examine an EVER schema, Figure 2.3. In the EVER model, events are unique occurrences of an activity and are represented as circles. In our example, the event is Orders which can also be thought of as sale of a product to a customer. An entity is representative of the event and is denoted with a rectangle. Product and Customer are the two entities that comprise the Orders event. Also in this model are dotted rectangle symbols representing a finite set of values from a specific domain. A domain can be a string, integer, or date. These values are to attributes of the entity. Joining these three constructs are relationships. Events can be related to entities that can in turn be related to other entities or values. This modeling technique is useful in that facts tend to be considered events, and are shown in the schema itself. The disadvantage of this model is when business measures are not associated with an event such as an evaluation of customers and their geographic locations. 20

25 Figure 2.3: An EVER Schema The starer [TB99] model modifies the ER model semantics with dimensional capabilities similar to the Star model. In the Star model, the fact is central with the entities that affect that fact surrounding it. More detailed information about dimensional models and Star schemas are discussed in Section The starer model has entities, relationships, and attributes. The biggest difference is that these are all centered around facts. Figure 2.4 shows a starer schema based on the cube from Section As with the ER model there are entities represented by rectangles, relationships represented by diamonds connecting two objects, and attributes represented by ovals. Where this model is different is in the central fact, shown as a circle, that the entities are related to. All objects in this model are related to the fact. Objects connected to the fact form dimensions showing a hierarchical structure for aggregation. In Figure 21

26 2.4, the central fact is an order. This order is tied to customers and products. There is another construct added in the starer model that differentiates it from traditional ER-based model, numerical attribute types. Numerical fact and entity attributes can be noted to be one of three types: stock, flow, or value-per-unit. The stock type is used for fields that denote a value at a specific point in time and are represented by an S in the attribute oval. One example of a stock type field is a customer account balance. This is kept current at the point in time. A flow type, represented with an F, shows a cumulative effect over a period of time. An example of a flow type field would be order amount that is comprised of the price of products minus any discounts. The flow type is generally most useful in a data warehouse environment because its instances are easily summarized. An attribute of type value-per-unit, represented with V, is similar to stock except that is associated with a unit of measurement. An example of this type is the price of a product for an order. Product prices can change and volume of the order may affect the price. The price field cannot be summed for an entire order to yield an order total because the price is associated with the quantity sold field. The starer model addresses one additional modeling construct for data warehouses, a temporal dimension. This time-based dimension is important because facts are of interest in time periods. In the example, we add the order date/time entities. This model stresses the need for a time dimension in the created schemas. The starer model includes rules for membership and aggregation among entities. One such membership is the noncomplete membership shown between the entities of the date dimension with the arrow notation. More specifics on these can be found elsewhere [TB99]. The starer model displays the dimensions and attributes that affect the facts. The disadvantage to this model is the lack of ease of use by users. A schema using this notation gets cluttered quickly with all of the various objects (circles, rectangles, ovals, and diamonds) shown. 22
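
Returning to the numeric attribute types introduced above, their practical consequence lies in how they aggregate. The fragment below is a hypothetical illustration, not part of the starER proposal: it assumes an order line table named ORDER_LINE with a flow attribute OrderAmt, a value-per-unit attribute Price, and a Quantity column supplying the unit.

    -- A flow attribute accumulates over the period, so it can be summed directly:
    SELECT SUM(OrderAmt) AS TotalRevenue
    FROM   ORDER_LINE;

    -- A value-per-unit attribute cannot be summed on its own; it must first be
    -- weighted by the unit it is tied to (here, the quantity sold):
    SELECT SUM(Price * Quantity) AS TotalRevenue
    FROM   ORDER_LINE;

A stock attribute such as an account balance behaves differently again: summing it across time periods double-counts, so it is usually aggregated with averages or end-of-period snapshots instead.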

27 Figure 2.4: A starer schema The ME/R (Multidimensional Entity-Relationship) model is similar to, but simpler than, the EVER and starer models. Figure 2.5 is an ME/R example of the customer orders application. The central construct is a fact node which is represented by a diamond. This fact node is similar to the event in the EVER model and the fact of the starer model. The level node, represented by a rectangle, is similar to the object of the starer model. Each level or fact node can also possess attributes (descriptors for this entity). These are the three main constructs that make up the ME/R model, and are connected by various relations, shown in Figure 2.6. In the example, Order is the fact node. Product, Customer, Day, Month, Week, and Year are level nodes. The attributes are similar to the attributes of other models. A fact node is related to a level by a dimension edge which is an undirected line. The has edge connects facts or levels to their attributes and is also an undirected line. The classification edge is the relationship between levels. This is represented by a pitchfork-like symbol at one end of a directed line. All the levels of the fact node, related by a classification edges, represent a dimension of the model. The dimension that 23

28 represents the order date in our example is made up of a Day level, a classification of the Month level which in turn is a classification level of Year. The directed edge between the levels is important in that it shows hierarchical aggregation. This is especially useful when a level node can be a classification node for more than one other level as seen in the Day to Week relationship. In our example, we can roll-up from the Day level to the Month level, changing the level of aggregation shown for the fact attributes. Week is an alternate roll-up path from Day. With only three main constructs, the ME/R model is simpler for users to read. In the ME/R model, aggregation is shown through the hierarchy of levels representing a dimension. This model looks most like the Star model, leading to ease of understanding to both users and logical modelers of the data warehouse. Because of its simplicity, ME/R is chosen to represent conceptual schemas in this thesis [HS00, SB98]. Figure 2.5: ME/R Schema 24
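
Because the ME/R constructs are so few, a conceptual schema such as Figure 2.5 can be captured in a handful of metadata tables, which is part of what makes the notation convenient for the algorithms of Chapters 3 and 4. The sketch below is a simplified illustration only; the working tables actually used by the algorithms (FACT_NODE_TABLE, LEVEL_TABLE, and so on) are defined in Chapter 3 and Appendix B, and the column layouts shown here are assumptions.

    -- One fact node, its dimension edges, and its classification edges,
    -- recorded for the Order example of Figure 2.5 (illustrative layout).
    CREATE TABLE FACT_NODE           (FACT_NAME    VARCHAR(30) PRIMARY KEY);
    CREATE TABLE DIMENSION_EDGE      (FACT_NAME    VARCHAR(30),
                                      LEVEL_NAME   VARCHAR(30));
    CREATE TABLE CLASSIFICATION_EDGE (CHILD_LEVEL  VARCHAR(30),
                                      PARENT_LEVEL VARCHAR(30));

    INSERT INTO FACT_NODE VALUES ('Order');

    -- Levels attached directly to the fact by dimension edges:
    INSERT INTO DIMENSION_EDGE VALUES ('Order', 'Product');
    INSERT INTO DIMENSION_EDGE VALUES ('Order', 'Customer');
    INSERT INTO DIMENSION_EDGE VALUES ('Order', 'Day');

    -- Classification edges giving the roll-up hierarchy, including the
    -- alternate Day-to-Week path discussed above:
    INSERT INTO CLASSIFICATION_EDGE VALUES ('Day',   'Month');
    INSERT INTO CLASSIFICATION_EDGE VALUES ('Day',   'Week');
    INSERT INTO CLASSIFICATION_EDGE VALUES ('Month', 'Year');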

29 Figure 2.6: ME/R Legend Dimensional Models Dimensional modeling is prevalent in data warehouse systems because of its effectiveness in providing support to OLAP technologies and decision support. Dimensional modeling, also known as multidimensional modeling, is useful for sifting, summarizing, and arranging data to facilitate analysis of ordinary facets of a business. It is based on the idea of the cube, where attention is focused on the intersection of properties of interest. The dimensional model strives to present data in a user-friendly manner while facilitating high performance OLAP queries. The dimensional model has four basis concepts: measures, facts, attributes, and dimensions. The cube from Figure 2.1 can be transformed into the dimensional model given in Figure 2.7. In the dimensional model, the measure is a business performance indicator. Generally, it is a numeric value that is added for summary purposes. Measures are items of interest at the intersection points on the side of the cube, such as Price and OrderAmt. These measures can be used to determine the amount of various products ordered by customer. Measures are stored in 25

30 facts, but facts also contain the relationship directives to the various dimensions. A fact is an event in a business process, considered to be quantitative measurements. In the example, Order is the fact with two measures and the foreign keys to link to the dimensions. The dimensions in the dimensional model are the associated characteristics that define the measure or sides of the cube. In our example, the dimensions are Customer, Product, and OrderDate. These dimensions define the amount of aggregation for the measures of the business. The dimensions become the nouns and adjectives associated with that event. Dimensions are qualitative objects, seen as qualifying the event or measure. Additionally, a dimension is described by a set of attributes. These attributes may have hierarchical relationships. In Figure 2.7, the customer location dimension has the attributes of city, region, and country. In a dimensional model facts become central key elements while dimensions are the supporting information. The relationships in a dimensional model are implied rather than explicit. The existence of a fact at the intersection of dimensions implies that a relationship exists between the dimensions [BH98, BE99, R96a]. Figure 2.7: A Dimensional Star Schema 26
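
To connect the dimensional structure back to the roll-up and slice operations of Section 2.3.1, the queries below run against the star schema of Figure 2.7. The names follow the running example (an Order fact with the OrderAmt measure and Customer, Product, and OrderDate dimensions); the fact table name ORDER_FACT, the OrderYear column, and the literal part number are assumptions introduced only to keep the SQL sketch unambiguous.

    -- Slice to one part and one year, grouped at the City level:
    SELECT   c.City, SUM(f.OrderAmt) AS Sales
    FROM     ORDER_FACT f
             JOIN CUSTOMER  c ON f.CustomerNo   = c.CustomerNo
             JOIN ORDERDATE d ON f.OrderDateKey = d.OrderDateKey
    WHERE    f.ProductNo = 1234            -- slice: a single part
      AND    d.OrderYear = 2000            -- slice: a single year
    GROUP BY c.City;

    -- Rolling up from City to Region changes only the grouping attribute;
    -- drilling down is simply the reverse substitution.
    SELECT   c.Region, SUM(f.OrderAmt) AS Sales
    FROM     ORDER_FACT f
             JOIN CUSTOMER  c ON f.CustomerNo   = c.CustomerNo
             JOIN ORDERDATE d ON f.OrderDateKey = d.OrderDateKey
    WHERE    f.ProductNo = 1234
      AND    d.OrderYear = 2000
    GROUP BY c.Region;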

31 The dimensional model can take many forms. All of these forms are based on the idea that an organization s facts are centered and the data unfolds around it. The most well-known dimensional model is the Star model [BH98]. The Snowflake model [BH98] is a derivation of the Star model. These two models tend to be used to represent logical schemas because keys and other database constraints are evident. A new dimensional modeling notation, Dimensional Fact Model (DFM), is specifically designed for conceptual schema creation using a dimensional approach [GM98a]. This model includes all of the concepts of dimensional modeling including showing hierarchy among attributes without detailing the keys. This thesis discusses these models and their advantages and disadvantages. A Star schema, as seen in Figure 2.7, is the direct result of creating a schema in the dimensional model. In this schema each fact is represented by a fact table holding the measures and the keys to the dimensions that comprise the measure. Each dimension is represented by its own table, the dimension table containing columns representing the attributes of the dimension. In the Star schema all of the levels of a particular dimension are in one dimension table. The dimension tables surround the fact table and the resulting diagram looks similar to a star. In this model the fact tables are normalized (without redundancy) and the dimensions are denormalized. Denormalization occurs when all of the various hierarchical levels of the dimension are arranged into the one dimension table. The notation of the Star model has four constructs: primary keys, foreign keys, entities, and relationships. The boxes in the diagram represent the entities. The name of the entity is in the top portion of the box. If an entity has a primary key the fields comprising this key are in the middle box and denoted with PK. The Customer entity has a primary key of CustomerNo. The bottom portion is for the rest of the columns of the entity. And columns denoted with FK and a number are the foreign keys. The foreign keys are numbered because an entity may have more than one set of foreign keys. In the example, the Order entity has three foreign keys: ProductNo, OrderDateKey, and CustomerNo. These foreign keys correspond to primary keys in other entities and express how the entities are related. The arrows 27

32 between the entities represent the type of relationship. The arrow points away from the table that is on the many side of the relationship. For example, an Order exists for one Customer, but one Customer can have many Orders. An Order belongs to a Customer so the arrow points towards Customer. The Snowflake model is similar to the Star model. The difference in this model is that the dimensions are also normalized, which creates new joins from the dimensions to retrieve supporting data. In the Snowflake model, the hierarchy in a dimension is represented by normalized dimension tables with keys to their subordinate dimensions. In Figure 2.8, the Customer dimension has been split into four dimensions, Customer, City, Region, and Country. This is done to reduce redundant data in the dimension rows. If there are only two countries then there are only two rows in the Country dimension. This model is useful when dimensions are large or the degree of aggregation is great [BH98]. 28
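
The structural difference between the two layouts is easiest to see as table definitions. The sketch below contrasts the denormalized Customer dimension of the Star form with its snowflaked equivalent from Figure 2.8; the surrogate key columns (CityKey, RegionKey, CountryKey) and the data types are assumptions introduced for the illustration.

    -- Star form: the whole Customer hierarchy lives in one dimension table.
    CREATE TABLE CUSTOMER_DIM (
        CustomerNo INTEGER PRIMARY KEY,
        City       VARCHAR(30),
        Region     VARCHAR(30),
        Country    VARCHAR(30)
    );

    -- Snowflake form: each hierarchy level becomes its own normalized table,
    -- linked by a key to its parent level.
    CREATE TABLE COUNTRY_DIM (
        CountryKey INTEGER PRIMARY KEY,
        Country    VARCHAR(30)
    );
    CREATE TABLE REGION_DIM (
        RegionKey  INTEGER PRIMARY KEY,
        Region     VARCHAR(30),
        CountryKey INTEGER REFERENCES COUNTRY_DIM (CountryKey)
    );
    CREATE TABLE CITY_DIM (
        CityKey    INTEGER PRIMARY KEY,
        City       VARCHAR(30),
        RegionKey  INTEGER REFERENCES REGION_DIM (RegionKey)
    );
    CREATE TABLE CUSTOMER_SNOWFLAKE_DIM (
        CustomerNo INTEGER PRIMARY KEY,
        CityKey    INTEGER REFERENCES CITY_DIM (CityKey)
    );

The snowflaked form stores each city, region, and country name only once, but a query that constrains on Country now needs additional joins to reach the fact table, which is the trade-off noted above.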

33 Figure 2.8: Snowflake Schema The dimensional fact model (DFM) is a conceptual model. In DFM facts, measures, dimensions, dimension attributes, non-dimension attributes, hierarchies, and aggregation are shown. Using the example in Figure 2.9, we explain the DFM graphical representation. A fact is represented by a box inside which is the fact or event name and the measures that make up that fact. Dimensions and dimension attributes are only subtly different and are both represented by circles. Dimension attributes are the discrete attributes about the facts. Dimensions are also 29

34 discrete attributes but are dimension attributes directly connected to the fact. In our example, CustomerNo, ProductNo, and Day are dimensions and PartType, City, and Month are examples of dimension attributes. All of the dimension attributes related to the dimension comprise the hierarchy. The Customer hierarchy is made up of the CustomerNo, City, Region, and Country. Non-dimension attributes are additional information about a dimension attribute that are nonhierarchical in nature. The non-dimension attributes are represented by lines. In Figure 2.9 a nondimension attribute is OrderAmt; it is an important part of the customer dimension but does not have other levels of detail associated with it. Aggregation of measures is generally sums of the measures across the appropriate dimensions. Because there are some measures which are nonadditive in nature, the DFM model has the notation of a dotted line to show when a measure is non-additive with a given dimension. The DFM model presents to the user a model of data that inherently shows levels of detail, as in what fields can be rolled-up or drilled-down to analyze a business [GM98a]. Figure 2.9: DFM Schema The DFM model is similar to the Star model because they are both dimensional in form; the facts are central and descriptive information surrounds the facts. What is different is the visualization. The DFM model is designed to be used for conceptual schemas. Hierarchies or aggregation paths are shown. The DFM model provides the user with an understanding of how 30

35 the data is organized. The Star model is more appropriate for designers as a precursor to physical design. The Star model is inherently logical with details such as primary and foreign keys which are not meant for users. Now that we have presented information on ER-based and dimensional-based modeling and various notations used we give a discussion on their use for data warehouse schema design purposes. Section gives advantages/disadvantages of the modeling techniques presented and how these modeling techniques are used in the schema design phase of data warehouse creation Models for Schema Creation While ER models and dimensional models look different, they represent the same information. Ballard et al. suggest that the dimensional model is a special form of an ER model [BH98]. While the two modeling techniques are similar and represent the same data, there are distinct differences and advantages in one approach or the other. One such difference is the creation of a schema. The ER model emphasizes first identifying the entities and their relationships. In dimensional modeling the object or events for analysis are defined first and then their supporting information or dimensions. The dimensional model lends itself more to modeling actual business rules rather than just data rules. This is because the fact tables tend to center around business objects users are most interested in, such as sales and revenue, and are presented as a slice in time. For example, if a user asks the question, How many widgets are sold in April to Japan, the query contains the summation of sales of widgets in April to Japan. The dimension attributes of part, country and month are the limiting factor as to what rows in the fact table are desired by the user. ER models work well for OLTP systems because they model transactions that need to occur to operate the business, not how the business is measured [K97, R96a, WB97]. The framework of a dimensional model consists of one fact table surrounded by dimensions, and thus query processors can take advantage of constraining the dimensions 31

36 (generally the where clause of an SQL statement) and then gathering the associated fact table data. This is helpful because the fact table has the largest number of rows and constraining the dimensions first necessitates only a subset of the fact table rows being processed. This allows the database engine to make strong assumptions about constraining the dimensions when gathering fact table entries. For ER models, cost-based query optimizers are used. These need to base decisions on rows of data and sizes of tables to determine what to constrain first. The dimensional model was designed specifically for query-based analysis of data. The structure leads to fast performance for queries although it would not be ideal in a transactional system where data is constantly inserted or updated. The structure or framework associated with the dimensional model is ideal for a data warehouse environment where most of the work is user queries. The insertion or updating of data is only to keep the data warehouse up-to-date. Three further advantages specific to the physical design and maintenance phase are 1) ease of design with respect to query patterns, 2) easy of extensibility of the model, and 3) a body of standard approaches to modeling business world situations [K97]. In a dimensional environment there is no need to design the structure based on the expected pattern of use (queries). The design of the data warehouse model does not rely on an expected query pattern because the dimensions are all considered to be of equal weight in relation to the fact table. This is an added advantage to data warehousing because user queries can never be fully anticipated, by definition they are of an ad-hoc nature. The even weighting of dimension tables allows for new queries to have similar performance to those already being performed. The dimensional model is easily extended for new data rows and/or new data warehouse structure additions. The existence of data in a dimension shows the occurrence of that data as it impacts the business event. Data is added for any new occurrence of the event. The data warehouse structure is extendable because fields can be added to fact tables or dimensions without requiring changes in query tools against the data. The third advantage is that standard approaches to situations such as slowly changing dimensions are available. Many authors offer advice on how to handle various situations is a 32

37 dimensional schema such as what to do as customer account balances change [BH98, K96b, K96c]. The dimensional model s popularity has stimulated research on the topic. Further discussion about ER versus dimensional modeling is available [K95a, R96b]. The models for schema creation in this thesis are dimensional in nature. Both are discussed because OLTP systems, which provide the data for data warehousing, are generally in ER form. Information for converting from an ER model to a dimensional model is given by both Kimball [K97] and McGuff [M98a]. Dimensional modeling is used for data warehouse models by Ballard et al. [BH98], Golfarelli and Rizzi [GR98], Trujillo and Palomar [TP98], and Boehnlein and Ende [BE99]. McGuff [M98a] uses ER modeling for its conceptual model and dimensional for the logical and physical designs, while Wu and Buchmann [WB97] do the opposite for schema creation. Whether using ER modeling or dimensional modeling, the goal of model selection and schema design is the same, to create a usable, clear description of the data contents. No matter which model is used for schema creation, it provides a way to visualize the business world represented in a data repository. The individual schemas, conceptual, logical, and physical, are useful representations of the specific data in the repository. The conceptual schema is for use by users of the data warehouse and for guidance in planning (logical schema) or implementing (physical schema) the data warehouse. In the schema creation process used here, a conceptual schema is created first. It is used by the data warehouse users and logical modelers. The goal of conceptual modeling is: to translate user requirements into an abstract representation understandable to the user, that is independent of implementation issues, but is formal and complete, so that it can be transformed into the next logical schema without ambiguities [TB99]. A well designed conceptual schema acts as a communication device between the designer and the user. The schemas, understandable to the user, permit verification of requirements. The conceptual schema provides a road map for logical design of the entities, how they are related, and how aggregation occurs. Requirements of a conceptual schema for a data warehouse include 33

a) providing a modeling construct to represent business facts and their properties,
b) connecting temporal dimensions to facts,
c) relating objects with their properties and associations,
d) defining relationships between objects and business facts, and
e) outlining dimensions and their respective hierarchies [TB99].
Since the goal of the conceptual model is to portray data relationships to users, it should ultimately be modeled in a user-friendly manner whether ER-based or dimensional-based. Attention must also be paid to make sure that enough information is available to the logical modelers using the conceptual schema as a guide for user needs. Golfarelli and Rizzi [GR98, GR99a], Tryfona et al. [TB99], Baekgaard [B99], and McGuff [M98a] propose that the conceptual model is too important to users, and reduces the time necessary to create the logical and physical designs too much, to be ignored.
During the schema creation phase the logical schema is created next. The logical schema provides a blueprint for how the data is stored. Defined in the schema are the tables and the columns of those tables for the data warehouse. The logical schema does not include storage or other implementation details directly, but is concerned with performance of queries to some extent. While this schema does not deal with indexing and views to help performance, emphasis is placed on how the data can be structured to improve query performance. The logical schema tends to be reflective of the underlying data storage structure planned for use, and thus ER-based or dimensional-based schemas can be created. Because no conceptual model has been adopted as a standard, some [GM98b] choose to start with a logical model. Golfarelli et al. [GM98b] make the decision not to create a conceptual schema because users do not demand a conceptual schema for evaluation and the logical schema can be used to present defined fields to users.
The last schema created as part of the schema creation phase is the physical schema. This is the actual implementation in the chosen database of the logical schema tables with storage and index requirements. The physical schema is dependent on the underlying structure of the database on which the data warehouse resides, the amount of data, and how the data is queried. An

automated approach to generating a physical schema from an OLAP conceptual schema is given by Hahn, Sapia, and Blaschka [HS00]. Table sizing, index creation, data partitioning, and view materialization are created in this phase. Physical design, while expressed as a schema, is really part of the data warehouse population and maintenance phases conducted after schema creation.
The conceptual and logical schemas for data warehouses can be built with ER-based or dimensional-based models. It is possible that the conceptual schema may be in one form and the logical another. The models proposed for schema specification include ER, EVER, StarER, Star, Snowflake, DFM, and ME/R. Table 2.1 gives a summary of models used by various authors for conceptual and logical schema creation. The Star model is popular for both conceptual and logical modeling. Using the Star model for both schema phases would make the design task easier, but several authors feel that the Star model is not necessarily the best choice for use by users. The Star model does not show the drill-down and roll-up paths available to the users [BH98].

Model       Conceptual Design                                Logical Design
ER          [M98a]                                           [BH98]
EVER        [B99]                                            -
StarER      [TB99]                                           -
ME/R        [HS00], [SB98]                                   -
Star        [WB97], [R96b], [K96c], [BH98], [M94], [M98a]    [K96c], [BE99], [CD97], [K97], [BH98], [M98a]
Snowflake   -                                                [BE99], [CD97], [BH98]
DFM         [GR98], [GR99a]                                  -
Table 2.1: Models for Conceptual and Logical Schemas

For this thesis, the ME/R model is chosen for conceptual schema creation and the Star model for logical schema creation. The ME/R model is based on the proven techniques of the ER model with enough expressiveness to create a dimensional-like model. The Star schema is popular in both research and industry for logical schema creation, and has an appropriate level of detail to support implementation planning.

Automated Schema Design

Typically, data warehouse schema design is a manual task with little to no automation. There have been some proposals for semi-automated or automated approaches to schema creation

40 from OLTP systems. These approaches vary greatly in assumed requirements and output schema notations. An overview of these approaches is provided below so that a comparison can be made to the approach of this thesis in later chapters. Only those approaches that result in a conceptual schema (or a logical schema if a conceptual schema is not built) are discussed. Kimball gives three steps to convert from an ER-based model to a dimensional model: 1) separate into discrete business units, 2) create fact tables from ER many-to-many relationships containing numeric non-key facts, and 3) denormalize remaining tables with single keys to relate to fact tables [K97]. While these steps are simple they are important because they are similar to approaches used to create data warehouse schemas. What makes this approach difficult is the initial definition of the business units in terms of the ER model. This approach makes the assumption that interesting data or facts are represented in many-to-many relationships in an ER schema. Kimball s approach to translate from one schema to another is presented as a manual approach. Step 1 is a manual step because no method is given to differentiate the business units. Steps 2 and 3 are simplified principles that are used in our data warehouse schema creation algorithms and as such could be automated. We do not rely on many-to-many relationships pointing to fact type information, but numeric data is what is identified first. A semi-automated approach to generate a DFM conceptual schema from an OLTP ER schema is given by Golfarelli, Maio and Rizzi [GM98b]. Their methodology is comprised of the following steps: 1. Defining facts 2. For each fact: (a) Building the attribute tree. (b) Pruning and grafting the attribute tree. (c) Defining dimensions. (d) Defining fact attributes. (e) Defining hierarchies. In this methodology the facts are determined by finding entities that are frequently updated. Step 2(a) is automated in this methodology. To build the attribute tree, an attribute tree comprised of dimension attributes is created from the ER schema using a recursive procedure. 36

This step examines the relationships of each fact to yield dimension attributes. The rest of the steps are performed manually based on user needs and knowledge of the OLTP system. The initial determination of facts is a manual process of recording system events. While the dimensions are automatically generated, there is still extensive manual work to finish defining the schema. The authors give a secondary method to find the facts and provide an automated algorithm to determine part of a conceptual schema from an OLTP schema. This method assumes that the most interesting data is updated frequently. We suggest that measures of interest to the users may not fall under this assumption. Step 2(a) used in this method has no rules to follow in how many entities of the OLTP schema can be traversed to produce the attribute tree, thus every entity may be in the resulting attribute tree. As an automated step this may be computationally intensive and require significant manual resources to prune the tree to a useful size. In our approach, we define stopping points based on the relationships between the entities in the OLTP schema.
Boehnlein and Ende propose a semi-automated approach to create a Star logical schema from a Structured Entity Relationship Model (SERM) OLTP schema [BE99]. The methodology is broken into three steps:
1. Identification of business measures.
2. Identification of dimensions and dimension hierarchies.
3. Identification of integrity constraints along the dimension hierarchy.
As with Kimball's approach, the business measures are determined from user requirements and business objectives gathering. Step 2 is semi-automated in this methodology by examining objects connected to the identified measures. While this step can be automated, the authors stress that they are candidate dimensions and manual evaluation is still needed. Step 3 can also be automated in this approach because the data along the dimension hierarchy can be consolidated (summarized) into one dimension table in the Star schema. Depending on the form of the OLTP system, these steps may all be manual. One drawback to this methodology is the loss

42 of aggregation levels for user understanding. If the Star schema were to be used for a conceptual schema, graphical details for users would be lacking about hierarchy or aggregation paths inherent in the schema. An OLTP conceptual schema in third normal form would easily convert to a model with aggregation paths. In this approach the representation of the hierarchy would be non-existent because the fields comprising an aggregation path would all be shown in the same entity of the logical schema. Since the authors have chosen their target schema to be of logical nature the lack of easily defined hierarchies or aggregation paths is not as important. This work is similar to our proposed algorithm to convert from a conceptual schema to a logical one. The same principle of examining the relationships is used and because our logical model is also the Star model, the single dimension laden with attributes is also pertinent. Because our approach first creates a conceptual schema, we have aggregation paths for user understanding of the data. The determination of measures, facts, or events can prove to be the most difficult part of the design process, and is usually done manually. Four different manual approaches are suggested by different authors. One derives the fact table of a Star schema by selecting the many-to-many relationships in the ER model containing numeric and additive nonkey facts [K97]. A second approach suggests finding candidate measures by analyzing the business queries for data items indicating the performance of the business [BH98]. A third approach finds fact properties are usually numerical data, and can be summarized (or aggregated) [TB99]. The fourth approach proposes that facts are found by finding entities most frequently updated [GM98b]. None of the semi-automated approaches include a mechanism for finding candidate measures. This thesis is the first research effort that addresses automation of creating an entire candidate schema from an OLTP schema, including the initial determination of facts or measures. 2.4 Data Warehouse Population The physical model and its population, while important to data warehousing research, is beyond the scope of this thesis. Other research has addressed these issues. Many of the issues are 38

43 carried over from traditional OLTP database research such as table sizing, index creation, and data partitioning. Information on physical design is found in articles by McGuff [M98a] and Srivastava and Chen [SC99]. Once workload and query patterns can be viewed against the data warehouse, workload refinement and index reorganization can be performed [GM98a, GR99b, OQ97]. Issues with indexes and partitions as related to Oracle (a registered trademark of Oracle Corporation) are given in a white paper by Oracle Corporation [O98]. The initial and successive data population of the data warehouse involves data cleaning and integration. Data coming from different OLTP systems needs to be merged, duplicates removed, and data types determined [W95, WB97]. Once the warehouse is populated, data refreshing and purging are necessary tasks [WB97]. Keeping a history of data elements that change can be very important to the data warehouse environment. It can be important to know what the account balance of a customer was one year ago versus today. This history of data changes is an issue strictly related to an OLAP system where historical data is stored. Various methods have been proposed to store and maintain this changing data [BH98, GL98, K96a]. A large volume of current research addresses creation, modification, and maintenance of materialized views [LH99, LQ97, TS99a]. 2.5 Data Warehouse Maintenance The data warehouse maintenance phase includes those tasks necessary once the data warehouse is operational with data and regularly scheduled data updates. Some performance tasks are in this stage such as parallel processing to speed up query response time [DV99]. Currently other data warehouse maintenance issues such as detecting runaway queries, database failure, checkpointing, and resource scheduling have borrowed from OLTP system research, but there is a need for modifications because of the differences between OLTP and OLAP requirements [CD97]. 39

2.6 Summary

As a result of our literature survey, we choose the following techniques and models for the first three phases of data warehouse development. The main focus of the pre-development phase is requirements gathering. We choose to use both source-driven and user-driven requirements for creation of our data warehouse. We use an OLTP schema for source-driven requirements. The OLTP schema shows the requirement of what is in the current OLTP system and is available for the data warehouse. We use a query workload as our user-driven requirements. These queries provide us with an idea of what questions the users expect the data warehouse to answer.
For the architecture selection phase we choose a top down enterprise schema approach. This assumes that the OLTP schema(s) used for source-driven requirements encompass the entire enterprise and not just a subset of the business. With this approach a data warehouse is created and data marts for individual business units can be created later if necessary. If the schema for source-driven requirements is only a portion of the business, then the bottom up architecture is applicable. We use a top down architecture but are flexible enough for either depending on the scope of the inputs.
The main focus of this thesis is on data warehouse schema design and as such there are more techniques and models chosen for the schema design phase. The conceptual schema is a high-level model to communicate to the users. For this, we have chosen the ME/R model. For the logical schema to be used by designers for physical warehouse creation we chose the Star model. Both models are widely used and thus familiar to users and designers. ME/R is chosen for conceptual modeling because it resembles the ER model and is familiar to users, and it shows the hierarchical structure of the data inherent in dimensional modeling. The Star model is chosen for our logical representation because it supports faster query performance due to the denormalization of dimensional data.

The algorithms developed in the next chapter use an ER-based OLTP schema and SQL queries as input, and produce candidate conceptual schemas in ME/R form. In Chapter 4, an algorithm for translating an ME/R conceptual schema to a Star logical schema is developed and illustrated.

46 Chapter 3: Developing a Conceptual Data Warehouse Schema This chapter provides an algorithm to migrate from an OLTP schema to a set of candidate conceptual schemas for a data warehouse. The algorithm is presented along with an example to show its application to an order entry OLTP system. In Section 3.1, the design methodology begins with the pre-development step of user requirements gathering, followed by the schema creation process. Section 3.2 contains the algorithm for candidate schema creation from an existing OLTP schema. The ME/R model is used in the conceptual schema generation example in this section. Section 3.3 presents an algorithm for evaluation of the candidate schemas based on user needs as defined in queries. Section 3.4 provides a methodology for additional manual refinement of the candidate schemas to accommodate user requirements that cannot be automatically derived or are not available in the OLTP schema. This includes the merging of dimensions and creation of additional dimensions to meet users needs. 3.1 Pre-Development: User Requirement Gathering For data warehouse creation, both source-driven and user-driven requirements are needed to derive and evaluate candidate conceptual schemas. The TPC-H benchmark is used for both source and user requirements in the example below. The TPC-H benchmark is created specifically to benchmark query performance in a data warehouse environment, but here we treat the TPC-H schema as an OLTP schema since it resembles one in a normalized ER format. The schema and queries of the benchmark are designed as if for a wholesale supplier system that manages, sells, or distributes a product. This is representative of any distribution business, a rental business, or even a manufacturer. The basic premise is that orders are taken from the customer. These orders are filled by a supplier or from the warehouse. Most businesses that sell something require some means of order fulfillment, making this example applicable to many organizations. 42

There are benefits from the decision to use the TPC-H benchmark. First, it is an industry benchmark example and thus not biased toward the schema creation algorithm. Second, order entry and/or part distribution is a common function of a wide range of businesses. Third, and most important for our purposes, a schema for source-driven requirements and queries for user-driven analysis are given.

3.1.1 Source-driven Requirements

The TPC-H benchmark schema, Figure 3.1, is used as the source-driven user requirements of a wholesale supplier system. It serves as an OLTP schema and provides a graphical depiction of the database that is useful for communicating database structure, but it is not easily analyzed by an automatic mechanism. For automatic analysis, tabular forms of this information, and additional information such as data types for fields, are useful. A tabular representation of the TPC-H schema is given in Appendix A.
Figure 3.1 is a graphical representation of the TPC-H schema using the ER notation from Microsoft Visio (a registered trademark of Microsoft Corporation). In this convention each rectangular box represents a table, with the table name shown in the shaded top compartment. The middle compartment contains the primary keys of the table. Primary keys, unique identifiers to a row of data in a table, are noted with PK and are bold and underlined. Primary keys are always in the top section of the entity and are thus set apart from the rest of the fields. The primary key of the LineItem table is made up of the L_OrderKey and L_LineNumber fields. The third compartment contains attributes, some of which may be foreign keys. Foreign keys, column(s) of data that are in the primary key of another table, are noted with the FK# notation where # is an actual number or digit. The number is used to differentiate between foreign keys since more than one can exist. The foreign key of the Orders table is comprised of the O_CustKey field, which must be contained in the primary key of the Customer table. The arrows represent relationships between entities. The arrow side shows a cardinality of 1 and the non-arrow side of

the line denotes a cardinality of many. For example, a customer has many orders, but an order belongs to only one customer.

Figure 3.1: TPC-H Schema
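As an illustration of the tabular form mentioned above, the sketch below renders a few hypothetical Table_Columns and Table_Relations rows for the LineItem/Orders portion of the schema in Python; the type labels, the Customer_Orders relation name, and the tuple layout are assumptions made for the example, not part of the thesis algorithms.

# Hypothetical in-memory rendering of the tabular OLTP description that the
# algorithms take as input; column types follow the TPC-H definitions.
# Table_Columns rows: (table_name, column_name, column_type)
TABLE_COLUMNS = [
    ("LineItem", "L_OrderKey",      "identifier"),
    ("LineItem", "L_LineNumber",    "integer"),
    ("LineItem", "L_Quantity",      "decimal"),
    ("LineItem", "L_ExtendedPrice", "decimal"),
    ("LineItem", "L_ShipDate",      "date"),
    ("LineItem", "L_ShipMode",      "text"),
    ("Orders",   "O_OrderKey",      "identifier"),
    ("Orders",   "O_CustKey",       "identifier"),
    ("Orders",   "O_TotalPrice",    "decimal"),
    ("Orders",   "O_OrderDate",     "date"),
]

# Table_Relations rows: (relation_name, table_name, cardinality);
# '*' marks the many side of the relationship and '1' the one side.
TABLE_RELATIONS = [
    ("Order_LineItem",    "LineItem", "*"),
    ("Order_LineItem",    "Orders",   "1"),
    ("Customer_Orders",   "Orders",   "*"),
    ("Customer_Orders",   "Customer", "1"),
    ("PartSupp_LineItem", "LineItem", "*"),
    ("PartSupp_LineItem", "PartSupp", "1"),
]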

3.1.2 User-driven Requirements

We use the TPC-H benchmark query sets to represent user-driven requirements. These queries are reproduced in Section A.2 of Appendix A. The benchmark queries provide users answers to the following classes of business analysis: pricing and promotions, supply and demand management, profit and revenue management, customer satisfaction study, market share study, and shipping management [TPC-H]. Answering the 22 queries is the minimum requirement of the data warehouse; thus, a schema that allows answering more of these queries is better than one that allows fewer to be answered. Designed to represent most of the data available in the data warehouse, these queries may be asked in an ad hoc manner in a data warehouse environment. The queries are used for analysis of the candidate schemas and their ability to meet user needs in Section 3.3.

3.2 Conceptual Schema Creation

This section focuses on the semi-automated creation of candidate conceptual schemas. Our algorithm for creating candidate schemas has five steps. These five steps result in candidate schemas centered around likely business events. The candidate schemas can have multiple levels of data comprising the dimensions describing these events. Although there are instances where the events chosen have little or no meaning to users and additional commonly used calculations may need to be added, most of the initial schema creation is automated.
The relational database schema(s) used as input meets some standard guidelines. There are tables (entities), columns of the tables (attributes), and relationships among the tables. These

basics are all that are needed to create candidate schemas in five steps. These steps can be generalized as follows:
1. Find entities with numeric fields and create a fact node for each entity identified.
2. Create numeric attributes of each fact node.
3. Create date and/or time levels (dimensions) with any date/time type fields per fact node.
4. Create a level (dimension) of the remaining fact node entity fields (non-numeric, non-key, and non-date fields).
5. Recursively examine the relationships of the entities to add additional levels in a hierarchical manner (creating a dimension).
There are two underlying heuristics for our approach to semi-automated schema generation. One is that numeric fields represent measures of potential interest to a business, and the more numeric fields in an entity the more likely that entity is an event or fact. The second premise is that the cardinality of a relationship determines how useful a related entity is to the schema being created. Any entity related with a cardinality of many (*) is of likely importance and any of its related entities may be as well. The premises are discussed further as the TPC-H schema example is used to illustrate the algorithm.
The five steps for semi-automated conceptual schema creation are represented in the form of an algorithm in Figure 3.2. In the algorithm, a sub-procedure, Walk_Relations, is called. The sub-procedure is presented in Figure 3.3. Our illustration of the algorithm references the tabular representation of the TPC-H schema (Appendix A, Section A.1) to create the candidate conceptual ME/R schemas. The parenthetical notation on the right margin indicates a correspondence to the five steps outlined above. Further explanation on the subroutines used in conceptual schema creation and the Walk_Relations sub-procedure can be found in Appendix C.

51 Input Parameters: Table_Columns Table_Relations In/Out Parameters: Fact_Node_Table Algorithm for Conceptual Schema Creation // Table containing table name, column name, and column type // for every OLTP table. // Table containing the OLTP schema relationships // Table defining the fact nodes of ME/R schemas and the // OLTP table name that is used to create the fact node. Fact_Attribute_Table // Table defining the attributes of the fact nodes for the ME/R // schema. Level_Table // Table defining the levels of the ME/R schema. Level_Attribute_Table // Defines the attributes of the levels. Variables: num_tables[] // Array of table names from OLTP schema(s) with numeric // fields. Array is ordered in descending order of numeric // fields. fact_node num_field[] date_field[] other_field[] // Fact node name. // Array of numeric OLTP attribute field names. // Array of date OLTP attribute field names. // Array of OLTP non-key, non-numeric, non-date/time fields. Method: num_tables[] := select_order_tables_numeric_fields (Table_Columns) (1) for each num_tables[j] fact_node := create_fact_node(num_tables[j], Fact_Node_Table) num_field[] := select_num_field (Table_Columns, num_tables[j]) (2) for each num_field[m] create_fact_node_attribute (fact_node, num_field[m], Fact_Attribute_Table) end for loop date_field[v] := select_date_field (Table_Columns, num_tables[j]) (3) if isempty(date_field[]) then create_review_levels(fact_node, Level_Table) else for each date_field[v] create_date_time_level (fact_node, date_field[v], Level_Table) end for loop end if if exists other_field_in_oltp_table (Table_Columns, num_tables[j]) (4) create_level (fact_node, num_tables[j], Level_Table) other_field[] := select_other_fields (Table_Columns, num_tables[j]) for each other_field[a]) add_fields_to_level (fact_node, other_field [a], num_tables[j], Level_Attribute_Table) end for loop end if Walk_Relationships (num_tables[j], fact_node, Table_Columns, Table_Relations, (5) Level_Table, Level_Attribute_Table) end for loop end algorithm Figure 3.2: Algorithm for Conceptual Schema Creation 47

52 Walk Relationships Procedure for Candidate Schema Algorithm Procedure Walk_Relationships (tablename, fact_node, Table_Columns, Table_Relations, Level_Table, Level_Attribute_Table) Input Parameters: Table_Columns Table_Relations tablename fact_node // Table containing table name, column name, and column type // for every table and all of its columns in a relational // schema. // Table indicating what OLTP tables are related and the // cardinality between them. // OLTP table name representing the table to have // relationships evaluated. // Name of fact node created for the levels and attributes being // created/modified in this procedure. In/Out Parameters: Level_Table Variables: relation[] // Table defining the levels of the ME/R schema. The level // information is defined by fact node and includes the name // of the level (OLTP table name for non-date fields) the // parent of the level if not the fact node, and a date/time // level indicator. Level_Attribute_Table // Defines the attributes of the levels. Definition is by fact // node name and level name. The attribute name is the // OLTP column names. related_table cardinality level_column[] // Array of relation names to current OLTP table being // evaluated. // Table related to tablename table. // Cardinality between tablename and related_table. // Array of OLTP columns of level node table. Method: relation[] := select_relations (Table_Relations, tablename) for each relation[f] cardinality := get_cardinality (Table_Relations, relation[f], tablename) related_table := get_related_table (Table_Relations, relation[f], tablename) create_level_node (fact_node, tablename, related_table, Level_Table) level_column[] := select_columns (Table_Columns, related_table) for each level_column[r] insert_level_attributes (fact_node, related_table, level_column[r], Level_Attribute_Table) end for if cardinality = * then Walk_Relationships (related_table, fact_node, Table_Columns, Table_Relations, Level_Table, Level_Attribute_Table) end if end for end algorithm procedure Figure 3.3: Walk Relations Sub-Procedure for Conceptual Schema Generation 48
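To make Figures 3.2 and 3.3 easier to follow, the sketch below is one possible executable Python rendering of the same five steps; it is illustrative only, not the thesis pseudocode. The tuple layouts mirror the four conceptual-schema tables, the type labels match the hypothetical Table_Columns rows shown earlier, and three details are simplifying assumptions drawn from the worked example rather than from the figures: blind keys are not copied into level attributes, the relation a call arrived through is not walked back, and duplicate attribute rows for a level reached along two paths are left to a later clean-up.

NUMERIC = {"integer", "decimal"}
DATETIME = {"date", "time", "datetime"}

def columns_of(table_columns, table):
    return [(col, typ) for (tab, col, typ) in table_columns if tab == table]

def rank_tables_by_numeric_fields(table_columns):
    # Step 1: tables ordered by descending count of numeric columns.
    counts = {}
    for tab, _, typ in table_columns:
        if typ in NUMERIC:
            counts[tab] = counts.get(tab, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

def create_conceptual_schemas(table_columns, table_relations):
    fact_nodes, fact_attrs, levels, level_attrs = [], [], [], []
    for table in rank_tables_by_numeric_fields(table_columns):
        fact = table + " Event"                                   # Step 1
        fact_nodes.append((fact, table, None))                    # Keep flag left empty
        cols = columns_of(table_columns, table)
        for col, typ in cols:                                     # Step 2
            if typ in NUMERIC:
                fact_attrs.append((fact, col))
        date_cols = [col for col, typ in cols if typ in DATETIME]
        if not date_cols:                                         # Step 3: "cloud" placeholder
            levels.append((fact, "Date/Time (review)", None, "R"))
        for col in date_cols:                                     # Step 3: split date and time
            levels.append((fact, col + " Date", None, "Y"))
            levels.append((fact, col + " Time", None, "Y"))
        other = [col for col, typ in cols
                 if typ not in NUMERIC | DATETIME and typ != "identifier"]
        if other:                                                 # Step 4: descriptive level
            levels.append((fact, table, None, "N"))
            level_attrs += [(fact, table, col) for col in other]
        walk_relationships(table, fact, table_columns, table_relations,
                           levels, level_attrs, set())            # Step 5
    return fact_nodes, fact_attrs, levels, level_attrs

def walk_relationships(table, fact, table_columns, table_relations,
                       levels, level_attrs, visited, parent=None):
    # Step 5 / Figure 3.3: every related entity becomes a level; recursion
    # continues only while the current table sits on the many ('*') side.
    for rel, tab, card in table_relations:
        if tab != table or rel in visited:
            continue
        related = next(t for (r, t, _) in table_relations if r == rel and t != table)
        levels.append((fact, related, parent, "N"))
        level_attrs += [(fact, related, col)
                        for col, typ in columns_of(table_columns, related)
                        if typ != "identifier"]    # blind keys are not carried over
        if card == "*":
            walk_relationships(related, fact, table_columns, table_relations,
                               levels, level_attrs, visited | {rel}, parent=related)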

The output of the algorithm is candidate schemas represented graphically, using ME/R notation, and in tabular form. In Section 2.3.4, Figures 2.5 and 2.6, we describe the ME/R modeling notation used for the graphical representation. For the tabular form of conceptual schemas, four additional tables are created: Fact_Node_Table, Fact_Attribute_Table, Level_Table, and Level_Attribute_Table. These store the graphical ME/R schema in tabular form for use in the logical schema creation. The Fact_Node_Table stores the fact node name for the various candidate schemas created. The Level_Table stores the names of each level off of the fact nodes and the levels which are sub-levels of the level nodes. The two attribute tables, Fact_Attribute_Table and Level_Attribute_Table, store the attributes or columns of the facts and levels, respectively. These tables are described further as they are created and used in schema automation.
Step 1 orders the entities with numeric fields in descending order of number of numeric fields. The descending order is not necessary, but the entities with the greatest number of numeric fields create candidate schemas that are generally better for answering the user queries. By processing the entities in this order the candidates that are more likely to be useful are created first. In our example, the only numeric fields are of type decimal and integer. The result of this step is a list of tables in a ranked order of number of numeric fields per table as shown in Table 3.1. The array num_tables holds only the table names.

Table Name    Number of Numeric Columns
LineItem      5
PartSupp      2
Part          2
Orders        2
Supplier      1
Customer      1
Table 3.1: Numeric Columns per OLTP Table
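As a small usage illustration of this ordering step (the counts are those of Table 3.1; the tie behaviour noted in the comment is a property of Python's stable sort, not something the thesis specifies):

counts = {"LineItem": 5, "PartSupp": 2, "Part": 2, "Orders": 2,
          "Supplier": 1, "Customer": 1}
num_tables = sorted(counts, key=counts.get, reverse=True)
# ['LineItem', 'PartSupp', 'Part', 'Orders', 'Supplier', 'Customer']
# Ties (PartSupp, Part, Orders) simply keep their insertion order here.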

The creation of the num_tables in the algorithm corresponds to this step. The array (implemented as an array, a linked list, a procedural cursor, or other similar construct) has the value {LineItem, PartSupp, Part, Orders, Supplier, Customer}. This array is traversed in the first FOR loop. Each iteration of the loop creates a new candidate schema. There are 6 resulting candidate schemas, one for each item in num_tables. There are exceptions to this and they are discussed in Section 3.4.
Starting with the first table of the num_tables (LineItem) we continue with Step 1 by creating an ME/R diagram. This table becomes the fact node of an ME/R diagram. In the algorithm this is the step Create Fact Node. The fact node is represented by a diamond shape. This becomes LineItem Event, shown in Figure 3.4.

Figure 3.4: LineItem Event of ME/R

This create fact node step adds the first entry into the Fact_Node_Table with the fact_node_name being the name of the table with the word event appended, or LineItem Event. This table is shown in Table 3.2. The keep column, used to denote the conceptual schemas to convert to logical schemas, is not used until the schema evaluation of Section 3.3. This fact node becomes the focus of the candidate ME/R schema created on this iteration of the loop. All processing for the current loop iteration revolves around the LineItem Event fact.

Fact_Node_Name    Table_Name    Keep
LineItem Event    LineItem
Table 3.2: Fact_Node_Table for LineItem Event Candidate Schema

Step 2 finds the numeric fields for this table from the Table_Columns table. These become the attributes (measures) of the fact node. In LineItem Event, the numeric fields are L_LineNumber, L_Quantity, L_ExtendedPrice, L_Discount, and L_Tax. These are added to the diamond as circles, indicating attributes, as shown in Figure 3.5.

Figure 3.5: Attributes of LineItem Event

In the algorithm, this step is processed as a loop to identify all of the numeric columns of the table and store them in the Fact_Attribute_Table as shown in Table 3.3.

Fact_Node_Name    Attribute_Name
LineItem Event    L_LineNumber
LineItem Event    L_Quantity
LineItem Event    L_ExtendedPrice
LineItem Event    L_Discount
LineItem Event    L_Tax
Table 3.3: Fact_Attribute_Table for LineItem Event

Next, as part of Step 3, we identify any date fields of this table, again using the Table_Columns table. The LineItem table has three date and/or time fields: L_ShipDate, L_CommitDate, and L_ReceiptDate. These fields become the date/time levels or dimensions of our ME/R diagram. At this point in the automated process, we do not know how the user wants the date or time dimensions defined. For example, some users may want to see information by day, month, quarter, and year. The information about the quarter is not inherently stored in the date field (it would have to be a calculation). In some instances, day as a level may not be needed or appropriate. This is one example where user refinement is needed to determine how to represent the levels that make up the date dimension. In ME/R diagrams, dimensions are represented by a collection of levels in a hierarchical manner. Instead of representing the date and time dimensions as rectangles normally used to represent levels in an ME/R diagram, we

introduce a new notation, a hexagon. Because we do not know the user's desired levels to complete the dimension, a new construct is temporarily necessary. We use a hexagon to indicate a portion of the schema where user refinement may be needed. The ME/R schema created so far is given in Figure 3.6.

Figure 3.6: Date Levels of LineItem Event

In addition to modifying the schema diagram, we add these levels to the Level_Table. For each of the three date/time fields of the LineItem table, both a date and a time entry are added. If the date/time fields are kept together there is the possibility of creating a large number of entries, one for each individual date and time combination. By splitting the fields into two separate fields, a query can be created for various ship times. This method of storing hours would only produce 24 records in the time level. If times are not necessary, then when user requirements are examined the time levels can be removed. The fourth column in the Level_Table is specifically designed for these date/time fields; it is a Y(Yes)/N(No)/R(Review) value that denotes if the level is a date or time level. The R value is used to denote that the date/time level added is not derived from an OLTP

column. This is important because we need to know that this is not a fully formed dimension. The Level_Table with these entries can be seen in Table 3.4.

Fact_Node_Name    Level_Name           Parent_Level_Name    Date_Level
LineItem Event    L_ShipDate Date                           Y
LineItem Event    L_ShipDate Time                           Y
LineItem Event    L_CommitDate Date                         Y
LineItem Event    L_CommitDate Time                         Y
LineItem Event    L_ReceiptDate Date                        Y
LineItem Event    L_ReceiptDate Time                        Y
Table 3.4: Level_Table for LineItem Event with Date/Time Levels

Step 4 is only processed if there are columns remaining in the table that are not yet processed (columns that are not key, numeric, or date fields). The numeric and date/time fields have already been processed. The keys are not needed in the data warehouse as they are blind keys used to relate the OLTP tables. If the keys are indeed meaningful to the user then they will likely be defined as numeric or character values, not as identifiers. The remaining fields are generally text data type fields. If there are text columns as part of this entity then a level (dimension) node is created with this entity's name. The level nodes are symbolized by rectangles. Each remaining field becomes an attribute of the level node. For our example, the LineItem table has five such columns. They are L_ReturnFlag, L_LineStatus, L_ShipInstruct, L_ShipMode, and L_Comment. The new level and its attributes can be seen in Figure 3.7. In this step, we add a new level to the fact node. This requires a new entry to the Level_Table shown in Table 3.5. Also added as part of this step are attributes of the LineItem level, shown in Table 3.6.

Figure 3.7: LineItem Node with Attributes Added to LineItem Event

Fact_Node_Name    Level_Name           Parent_Level_Name    Date_Level
LineItem Event    L_ShipDate Date                           Y
LineItem Event    L_ShipDate Time                           Y
LineItem Event    L_CommitDate Date                         Y
LineItem Event    L_CommitDate Time                         Y
LineItem Event    L_ReceiptDate Date                        Y
LineItem Event    L_ReceiptDate Time                        Y
LineItem Event    LineItem
Table 3.5: Level_Table with LineItem Level Added

Fact_Node_Name    Level_Name    Attribute_Name
LineItem Event    LineItem      L_ReturnFlag
LineItem Event    LineItem      L_LineStatus
LineItem Event    LineItem      L_ShipInstruct
LineItem Event    LineItem      L_ShipMode
LineItem Event    LineItem      L_Comment
Table 3.6: Level_Attribute_Table with LineItem Attributes
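Step 4 can be pictured with the same tuple conventions used in the earlier sketch; the two helpers below loosely follow the create_level and add_fields_to_level subroutines of Figure 3.2, but their bodies and the row layout are assumptions for illustration.

def create_level(fact_node, table, level_table):
    # One descriptive level named after the OLTP table; no parent, not a date level.
    level_table.append((fact_node, table, None, "N"))

def add_fields_to_level(fact_node, field, table, level_attribute_table):
    level_attribute_table.append((fact_node, table, field))

level_table, level_attribute_table = [], []
create_level("LineItem Event", "LineItem", level_table)
for field in ["L_ReturnFlag", "L_LineStatus", "L_ShipInstruct", "L_ShipMode", "L_Comment"]:
    add_fields_to_level("LineItem Event", field, "LineItem", level_attribute_table)
# level_attribute_table now holds the rows of Table 3.6; the new level_table row
# is the LineItem entry of Table 3.5.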

Step 5 is the most complicated step. In the algorithm, this step is the sub-procedure detailed in Figure 3.3. The rest of the TPC-H schema entities (Customer, Orders, Nation, Region, Supplier, Part, and PartSupp) may be used here. In this step, the relationships of the fact node/event entity are evaluated. Every connected entity is made into a level node, with its columns becoming attributes of that level. If the entity being examined is on the many (*) side of the relationship, the related entity's relationships are also evaluated and processed by calling the sub-procedure again.
For this step, we use the Table_Relations table. First, every row of the table where the table_name is equal to LineItem is found. This identifies the relation names of interest: Order_LineItem and PartSupp_LineItem. These relationships are processed one at a time within the sub-procedure loop. The relation names found (Order_LineItem and PartSupp_LineItem) are used to find the entities that make up dimensions for LineItem Event. Starting with the Order_LineItem relationship name, the first related table is found. The Table_Relations table is queried where relation_name is Order_LineItem and table_name is not equal to LineItem. The result of the query is the entity Orders. A level node is created for the Orders table. The columns of the Orders table, as found in Table_Columns, become the attributes of the Orders level node. Levels and attributes for Step 5 are illustrated with the same graphical representation as Step 4. Adding the Orders level with attributes to our diagram results in Figure 3.8.

Figure 3.8: Order Level and Attributes Added to the LineItem Event

At this point we have added our first level not derived from the LineItem table. We now need to determine if any of the OLTP tables related to the Orders table should be included in this schema. In the Table_Relations table there is a cardinality field. This field is used to indicate how many entities are related to other entities by the relationship. We query the Table_Relations table for all rows where table_name is equal to LineItem. For the Order_LineItem relation for the LineItem table the cardinality is *, indicating that the LineItem table is on the many side of the relationship with Orders. In other words, one Order can have many LineItem records. Because of the cardinality on this side of the relationship, the procedure Walk_Relations is recursively called with the Orders table now being the parameter passed.

Using the same sub-procedure we determine that the Orders table is on the many side of a relationship with the Customer table. The Customer table is added as a level node and its columns are added as attributes of that level. Because the Orders table relates to Customer with a many-to-one relationship, the procedure is called again with Customer as the input parameter. The Customer table has a many-to-one relationship with the Nation table, which in turn has a many-to-one relationship to Region, thus the sub-procedure is called two more times. The Customer, Nation, and Region levels are added with attributes to produce Figure 3.9. The Orders dimension in the figure now has four levels in the hierarchy: Orders, Customer, Nation, and Region.
In the example, all of the relationships are many-to-one, thus we continue to recursively call the Walk_Relations procedure. If we encounter a relationship that is not many-to-one or many-to-many, we do not include it in the diagram. If the cardinality of Orders to Customer had not been a many-to-one relationship, the Customer, Nation, and Region tables would not have been visited and added to the schema. As is, the sub-leveling ends with the Region table because it has no relationships not already evaluated.
This evaluation of related entities is similar to Golfarelli et al. [GM98b] in that the entities relating to the fact node are evaluated and added to the candidate schema. Golfarelli et al. evaluate and add all relating entities to the schema, while we contend that only those entities on the many side of a relationship are necessary. This is because the relating level is of higher detail and thus contains important descriptor type information about the measures. Boehnlein and Ende also use the relationships between OLTP entities to derive schema dimensions. They use 0-to-many relationships and 1-to-many relationships to determine usefulness of entities to the business measures.

Figure 3.9: Complete Orders Dimension of LineItem Event
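To make the recursion concrete, the following usage example runs the walk_relationships sketch from above over a few relation rows for this path (the Customer_Orders, Customer_Nation, and Nation_Region names are illustrative; only Order_LineItem is named in the text):

relations = [
    ("Order_LineItem",  "LineItem", "*"), ("Order_LineItem",  "Orders",   "1"),
    ("Customer_Orders", "Orders",   "*"), ("Customer_Orders", "Customer", "1"),
    ("Customer_Nation", "Customer", "*"), ("Customer_Nation", "Nation",   "1"),
    ("Nation_Region",   "Nation",   "*"), ("Nation_Region",   "Region",   "1"),
]
levels, level_attrs = [], []
walk_relationships("LineItem", "LineItem Event", [], relations, levels, level_attrs, set())
for _, level, parent, _ in levels:
    print(level, "<-", parent)
# Orders <- None, Customer <- Orders, Nation <- Customer, Region <- Nation:
# the recursion stops at Region because it has no unvisited relationships.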

As noted in Step 4, when levels and attributes are created, entries are made to the tables Level_Table and Level_Attribute_Table. These new tables with Orders, Customer, Nation, and Region entities are shown in Tables 3.7 and 3.8. Here the parent_level_name field is used to show the hierarchy of the sub-levels of the Orders dimension.

Fact_Node_Name    Level_Name           Parent_Level_Name    Date_Level
LineItem Event    L_ShipDate Date                           Y
LineItem Event    L_ShipDate Time                           Y
LineItem Event    L_CommitDate Date                         Y
LineItem Event    L_CommitDate Time                         Y
LineItem Event    L_ReceiptDate Date                        Y
LineItem Event    L_ReceiptDate Time                        Y
LineItem Event    LineItem
LineItem Event    Orders
LineItem Event    Customer             Orders
LineItem Event    Nation               Customer
LineItem Event    Region               Nation
Table 3.7: Level_Table for LineItem with Orders and Their Sub-levels Added

Fact_Node_Name    Level_Name    Attribute_Name
LineItem Event    LineItem      L_ReturnFlag
LineItem Event    LineItem      L_LineStatus
LineItem Event    LineItem      L_ShipInstruct
LineItem Event    LineItem      L_ShipMode
LineItem Event    LineItem      L_Comment
LineItem Event    Orders        O_Comment
LineItem Event    Orders        O_OrderStatus
LineItem Event    Orders        O_TotalPrice
LineItem Event    Orders        O_OrderDate
LineItem Event    Orders        O_OrderPriority
LineItem Event    Orders        O_Clerk
LineItem Event    Orders        O_ShipPriority
LineItem Event    Customer      C_Comment

LineItem Event    Customer      C_Name
LineItem Event    Customer      C_Address
LineItem Event    Customer      C_AcctBal
LineItem Event    Customer      C_MktPriority
LineItem Event    Nation        N_Name
LineItem Event    Nation        N_Comment
LineItem Event    Region        R_Name
LineItem Event    Region        R_Comment
Table 3.8: Level_Attribute_Table with Additional LineItem Attributes

The last relationship of the LineItem table is to the PartSupp table. This is a many-to-one relationship, so the relationships of the PartSupp table are recursively processed as well. PartSupp is a little different from other tables in this example because it has multiple sub-levels. The other difference is that the Nation and Region levels already exist from the relationship to the Customer level, so that we do not have to duplicate part of the diagram. The Nation and Region levels are added again to the Level_Table because the parent tables are different. They do not need to be added to the Level_Attribute_Table because the attributes are the same as before. The first complete candidate schema (for one iteration of the outermost loop in Figure 3.2) is given in Figure 3.10. The corresponding completed tables are given in Appendix B (Tables B.1 through B.4).

Figure 3.10: Candidate Schema 1: LineItem Event

Now that we have completed the first iteration of the algorithm, we can create another candidate schema in the second iteration. Steps 1 (starting with node creation) through 5 are reapplied with the PartSupp table as the event entity. This second candidate schema has something that the first one did not. The PartSupp table is on the one side of the relationship with

the LineItem table. This means that we do not delve any lower on that dimension path. We do not bring in any Customer table information at this time as a sub-level of LineItem. The completed candidate schema is shown in Figure 3.11. This candidate schema has fewer levels on most paths. Only the relationship with the Supplier table leads to sub-levels.
Another new symbol is introduced in this diagram. It is the cloud shape. This shape denotes that a level may be needed and is not automatically derived from the OLTP schema. In data warehouses, events tend to be measured by date or by time periods. Although the PartSupp table does not have any date/time fields, the final design in the data warehouse probably will. This dimension may be created by the grain requirements of the user for analysis or by the refresh requirements used to capture data from the OLTP system to store into the data warehouse. This level can be defined by user refinements for the later physical design. For example, user requirements may call for knowing the PartSupp availability every Monday or at the end of every month; thus day or month dimensions could be added. Dimensions represented by the cloud are added to any fact node with no date or time fields.
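A minimal sketch of how the create_review_levels subroutine named in Figure 3.2 could record this cloud placeholder; the level name and the tuple layout are illustrative assumptions, while the R flag is the Review value described in Section 3.2.

def create_review_levels(fact_node, level_table):
    # Placeholder date/time dimension for a fact table with no date/time columns;
    # 'R' marks it as needing review against user grain or refresh requirements.
    level_table.append((fact_node, "Date/Time (review)", None, "R"))

level_table = []
create_review_levels("PartSupp Event", level_table)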

Figure 3.11: Candidate Schema 2: PartSupp Event

Following the sequence of the algorithm, the other tables with numeric fields are examined and candidate schemas are produced. These are given in Figures 3.12 through 3.15. These same schemas, represented in tabular form, are given in Appendix B as Tables B.5 through B.8. When adding to the Level_Attribute_Table as part of the LineItem schema (Candidate Schema 1) we did not duplicate the level attributes if a sub-level was used by more than one level. This is not the case for levels across candidate schemas. Even though the same level name may be in more than one candidate schema, they may not have the same attributes. This can be seen in the PartSupp level in Candidate Schemas 1 and 2. In Candidate Schema 1, the PartSupp level has all of the fields of that table as part of the level. Candidate Schema 2 only has the fields that are not numeric, date/time, or key fields. This possible difference necessitates that

level_attributes are defined per fact_node_name (LineItem Event, PartSupp Event, etc.), i.e., per candidate schema.

Figure 3.12: Candidate Schema 3: Part Event

Figure 3.13: Candidate Schema 4: Orders Event

Figure 3.14: Candidate Schema 5: Customer Event

Figure 3.15: Candidate Schema 6: Supplier Event

We have completed the automatic generation of six possible conceptual schemas from an OLTP schema. Not all of these schemas may prove to be useful to the user. The six candidate schemas generated by our algorithm are evaluated against user requirements in the next section.

3.3 Candidate Schema Selection

The TPC-H schema provides a standard set of queries for a data warehouse environment. These queries are used here to evaluate which candidate schemas best meet users' needs. There are two aspects of a query that are used to determine if a candidate schema can answer a query: the tables in the FROM clause and the numeric fields in the SELECT clause. If a candidate schema does not contain the table(s) in the FROM clause, it cannot answer the query because the fields of that table are not in the schema either. It is unnecessary to check for every field in the query SELECT statement because the candidate schema generation algorithm dictates that every field in a table of the OLTP system is in the schema. The numeric fields from the SELECT clause are essentially the measures that need to be attributes of the fact node to answer the query.
In order to compare the candidate schemas for satisfying the queries, we create a table. This table can be in the form of an array, list, or other data structure. The data structure (illustrated in Table 3.9) is sized to represent the 22 queries and the 6 candidate schemas that might answer them. In the table, an X shows that the candidate schema completely meets the query requirement. A P means that the schema partially answers the query. This occurs when a numeric value of interest is not in a fact node but is in one of the dimensions. A blank in the row-column combination means that the query is not answerable by this schema. The table is generated by the algorithm in Figure 3.16. The subroutines in the candidate schema evaluation algorithm are further explained in Appendix C.
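For illustration, the two query features used here can be extracted with simple string handling, as in the rough Python sketch below (the thesis itself obtains them through the subroutines of Appendix C; the regular expressions and the simplified query string are assumptions and ignore subqueries and aliases):

import re

def from_tables(query):
    # Text between FROM and the first WHERE/GROUP/ORDER keyword, split on commas.
    clause = re.search(r"\bfrom\b(.*?)(\bwhere\b|\bgroup\b|\border\b|$)",
                       query, re.I | re.S).group(1)
    return [part.strip().split()[0] for part in clause.split(",") if part.strip()]

def numeric_select_fields(query, numeric_columns):
    # Numeric OLTP columns that appear anywhere in the SELECT list.
    select = re.search(r"\bselect\b(.*?)\bfrom\b", query, re.I | re.S).group(1)
    return [col for col in numeric_columns
            if re.search(r"\b" + col + r"\b", select, re.I)]

q1 = "select l_returnflag, sum(l_quantity), sum(l_extendedprice) from lineitem where ..."
print(from_tables(q1))                                          # ['lineitem']
print(numeric_select_fields(q1, ["L_Quantity", "L_ExtendedPrice", "L_Discount"]))
# ['L_Quantity', 'L_ExtendedPrice']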

73 Input Parameters: Fact_Node_Table Candidate Schema Evaluation Algorithm // Table defining the fact nodes of ME/R schemas and the OLTP // table name that is used to create the fact node. Fact_Attribute_Table // Table defining the attributes of the fact nodes. Level_Table // Table defining the levels of the ME/R schema. Level_Attribute_Table // Defines the attributes of the Levels. In/Out Parameters: eval_array[] Variables: query[] a fact_node[] from_field[] num_sel_field[] // Two dimensional array, table, linked list, or other data type to // represent the queries and the candidate conceptual // schemas that could fulfill their needs. // Array of user queries. Includes the whole text of queries. // Represents the number of queries to be evaluated. Used as an // index to the query[] array. // Array of conceptual schema fact nodes. // Array of tables listed in FROM statement of query. // Array of numeric fields from SELECT statement of query. Method: query[] := select_queries_to_evaluate for each query[a] fact_node[] := select_from_fact_node_table (Fact_Node_Table) for each fact_node[g] from_field[] := select_from_fields (query[a]) for each from_field[n] if find_fact_level_tables (Fact_Node_Table, Level_Table, from_field[n], fact_node[g]) then eval_array[a, fact_node[g]] := X // Set the intersection of a and fact_node[g] in the // eval_array to X to denote that this candidate // schema works for this query so far. else eval_array[a, fact_node[g]] := null // Set the intersection of a and fact_node[g] in the // eval_array to null to denote that this candidate // schema does not work for this query. end if end for continue // Since candidate schema identified by fact_node[g] // failed this test no reason for further evaluation of // from fields. Continue with next iteration of the // fact_node[g] loop, the next fact node. 69

74 num_sel_field[v] := select_num_fields (query[a]) for each num_sel_field[v] if not find_fact_attribute (Fact_Attribute_Table, num_sel_field[v], fact_node[g]) then if find_level_attribute(level_attribute_table, num_sel_field, fact_node[g]) then eval_array[a, fact_node[g]] := P // Set the intersection of a and fact_node[g] in the // eval_array to P to denote that this candidate // schema might work for this query. else eval_array[a, fact_node[g]] := null // Set the intersection of a and fact_node[g] in the // eval_array to null to denote that this candidate // schema does not work for this query. end if end if end for end for end for end algorithm continue // Since candidate schema identified by fact_node[g] // failed this test no reason for further evaluation of // numeric select fields. Continue with next iteration // of the fact_node[g] loop, the next fact node. Figure 3.16: Candidate Schema Evaluation Algorithm As an example, this algorithm can be performed against Q1. Starting with Candidate Schema 1, LineItem Event, if the table LineItem is in the Fact_Node_Table or Level_Table the first part of the algorithm is completed with the answer that so far this candidate conceptual schema answers the query. The entry for LineItem Event is in the Fact_Node_Table showing that the LineItem table is the basis of that fact node. That ends the IF statement and since LineItem is the only table in the FROM statement of the query the FOR loop is complete as well. Continuing on to the next section all numeric fields in the SELECT statement are identified: L_Quantity, L_ExtendedPrice, L_Discount, and L_tax. L_Quantity is in the Fact_Attribute_Table for the LineItem Event so the X remains. The rest of the numeric fields are processed with the same results. Thus, Candidate Schema 1 satisfies the query. Doing the same steps for Candidate Schema 2, PartSupp Event, we see that the LineItem table in again included in the schema. The numeric fields are again checked. The L_Quantity field is not in Fact_Attribute_Table but is in 70

75 the Level_Attribute_Table so the value is set to P for probable match. Review of the other numeric fields leaves the value set as P. Candidate Schemas 3 through 6 are processed in this manner with only Candidate Schema 4, Orders Event, being found to be even probable to answer the query. Evaluating the remaining schemas in the same manner yields Table 3.9. Candidate Candidate Candidate Candidate Candidate Candidate Schema 1 Schema 2 Schema 3 Schema 4 Schema 5 Schema 6 (LineItem (PartSupp (Part Event) (Orders (Customer (Supplier Event) Event) Event) Event) Event) Q1 X P P Q2 P P Q3 P P Q4 X X Q5 X Q6 X Q7 X Q8 X Q9 P Q10 P P Q11 P X P Q12 X X Q13 X X X Q14 X P Q15 X P Q16 P X Q17 X P Q18 X P Q19 X P Q20 X P Q21 X Q22 P P X Table 3.9: Candidate Schema Evaluation 71
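The scoring that fills Table 3.9 can be pictured as the hedged sketch below; it is not Figure 3.16 itself, but it follows the same rules, reusing the two query helpers above and the tuple-based conceptual-schema tables from the earlier sketches (the X, P, and None markings correspond to X, P, and blank cells).

def evaluate(queries, fact_nodes, fact_attrs, levels, level_attrs, table_columns):
    numeric_cols = [col for (_, col, typ) in table_columns
                    if typ in ("integer", "decimal")]
    result = {}
    for q_name, q_text in queries:
        for fact, base_table, _ in fact_nodes:
            # A schema "contains" a query table if it is the fact's base table
            # or one of the schema's (non-date) level tables.
            schema_tables = {base_table.lower()} | {lvl.lower()
                            for (f, lvl, _, flag) in levels if f == fact and flag == "N"}
            if not all(t.lower() in schema_tables for t in from_tables(q_text)):
                result[q_name, fact] = None          # missing table: blank cell
                continue
            mark = "X"
            for col in numeric_select_fields(q_text, numeric_cols):
                if (fact, col) in set(fact_attrs):
                    continue                         # measure is already on the fact node
                if any(f == fact and a == col for (f, _, a) in level_attrs):
                    mark = "P"                       # measure only found in a dimension level
                else:
                    mark = None                      # measure missing entirely
                    break
            result[q_name, fact] = mark
    return result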

76 This algorithm identifies how various candidate schemas meet the requirements of the user queries. Table 3.9 is then evaluated manually to determine which candidate schemas to keep and which can be discarded. The rest of this section covers this manual process. There are two types of queries that yield results that may need manual refinement before manual evaluation of the table: a query with numeric SELECT fields from multiple OLTP entities, and queries with no numeric fields. Any query that has numeric SELECT fields from more than one OLTP schema results in only a partial solution to the query. Q10 is an example of this. Query Q10 has three numeric fields, L_ExtendedPrice, L_Discount, and C_AcctBal. Because these fields span different OLTP entities, no candidate schema can be a definite answer to the query. With our proposed conceptual schema algorithm, a fact node has measures from one OLTP entity. Any candidate schema that meets the FROM criteria evaluates to a P for possible. Manual evaluation is needed to determine if the schema can indeed answer the query or minor changes to the candidate schema may allow that schema to completely answer the queries. One such change may be the movement of a numeric field from a level to a fact node. For queries with no numeric type data fields in the SELECT statement, the schemas that include all the entities in the FROM statement of the query evaluate as definite answers to the query requirements. With no numeric SELECT fields more schemas are eligible. This evaluation of the candidate schemas is correct because the queries only require textual or date type information. With no requirement for numerical data all data can come from any candidate schema with the appropriate table data in the levels. Query Q12 is an example of this type of query. The from section of Query Q12 has two tables, orders and lineitem. Candidate Schemas 1 and 4 have data from these tables and as such are equally likely to meet the user requirements of the query. The algorithm generates a table depicting the ability of candidate schemas to meet the needs of the user queries, but the data in this table needs manual evaluation to determine which candidate schemas are most needed by users and should be converted to logical schemas. Table 72

77 3.9 gives us several possibilities for schemas that meet our data warehouse requirements as defined by the user queries. Candidate Schemas 3 and 6 can be eliminated since they do not answer any queries that cannot be answered by other candidate schemas. Candidate Schema 4, while satisfying many queries, does not satisfy any that are not satisfied by Candidate Schema 1. Candidate Schema 1 is either a stronger match for each of the queries or just as able to answer the query. This means we can use Candidate Schema 1 to answer the same questions as can be answered with Candidate Schema 4. Candidate Schema 4 can be discarded. Candidate Schema 1 answers most of the user questions, but Candidate Schema 2 is also promising. It is stronger than Candidate Schema 1 for a few of the queries ( X instead of P ). Candidate Schema 5 may not be needed. Candidate Schema 5 answers Q22 which can only be partially answered by Candidate Schema 1, and Q13 which is also answered by Candidate Schema 1. Further analysis is needed to decide if Candidate Schema 5 can be dropped or is needed as part of the data warehouse for specific purposes. The analysis consists of possible modifications to the query or modifications to Candidate Schema 1. Candidate Schema 2 was noted as promising. This schema answers queries that none of the others are even promising on such as Q16. Candidate Schema 2 is a stronger answer for query Q11 than Candidate Schema 1. The ability to answer these two queries makes us decide to keep Candidate Schema 2 for the time being. The data warehouse designer can examine the information in Candidate Schema 5 to see what is offered that cannot be gleaned elsewhere. Candidate Schema 5 is centered around customer events. For a company in shipping, customers and what can be learned about them is important. What Candidate Schema 5 gives us is the possibility to look at customers without necessarily needing orders to tie to them. This is why Candidate Schema 1 is not necessarily a good alternative to Candidate Schema 5. Candidate Schema 1 is centered around the existence of line items which by definition mean an order exists for those items and only customer data for customers with orders is necessary in Candidate Schema 1. Query Q22 specifically looks at customers who have made no orders for the past 7 years; since the TPC-H criteria specifies data 73

Candidate Schema 1 could handle this query, although not in the cleanest manner, because it does contain customer information. The customer dimension alone could answer this query if all customers are added to the Customer dimension even when no order exists. In the hope that some dimensions can be used in conjunction with other schemas, or for levels that may be shared by dimensions in a schema, it is likely that all data is populated into the warehouse. Whether or not Candidate Schema 5 can be removed is determined by further user input or the implementation of the data warehouse. At this point in conceptual design, we could decide that all customers should be populated in the Customer dimension and then Candidate Schema 5 could be dropped, but data warehouse population is part of physical design.

The results of the schema evaluation with the user queries leave us with a conservative scenario of keeping three candidate schemas and a more likely scenario of only needing two. At this juncture we note the three likely schemas before initiating logical schema design. The schemas that are deemed desirable are noted in the Fact_Node_Table. The Fact_Node_Table has an additional field not used during schema creation, the Keep field. This field can be filled in with information about which candidate schemas we currently plan to keep. For our working example, the Keep field would be set to Y for LineItem Event, PartSupp Event, and Customer Event. The others can be set to N to let us know they have been evaluated. The Keep field is used by the logical schema algorithm to determine which conceptual schemas to convert to logical schemas. This Keep field is better than deleting information from the conceptual schema tables because it may be decided later that one of the schemas we originally discarded is indeed necessary. If this is the case, then we still have all the information to automate creation of additional logical schemas in Star model form. The updated Fact_Node_Table is given in Table 3.10.

Fact_Node_Name    Table_name    Keep
LineItem Event    LineItem      Y
PartSupp Event    PartSupp      Y
Part Event        Part          N
Orders Event      Orders        N
Customer Event    Customer      Y
Supplier Event    Supplier      N

Table 3.10: Fact_Node_Table, Evaluated for Schemas to Use for Data Warehouse

Thus far we have created six candidate conceptual schemas and narrowed those likely to be of the most use to the users down to three. Not all of the work on the conceptual schemas is finished. Additional steps that need further user input or knowledge of a designer are discussed in the next section.

3.4 Manual Refinement of Conceptual Schemas

Data warehouse schema generation cannot be entirely automated. In the algorithms of Section 3.2, we show that a majority of the work can be automated. There are additional user steps to further refine the candidate schemas. Most of the additional user steps require little knowledge of the existing OLTP database schema; knowledge of the user needs is more important. It makes sense at this point to evaluate user needs more closely and refine the schemas further. This provides us with a complete conceptual schema prior to starting any logical design. These are manual tasks that may need further user input. In Chapter 4, an algorithm is proposed to automate changing a conceptual schema into a logical one. Because the generation of a logical schema is automated, manual manipulation of the schema can occur after logical schema generation. Saving the manual tasks until after the logical schema is created means that manual changes made to a conceptual schema have to be made to its corresponding logical schema as well and vice versa.

Because the user queries to be answered by the data warehouse are known, we are able to eliminate several of the candidate schemas. Elimination of unnecessary candidate schemas is the first user step. If the user queries are not known, our candidate schema evaluation algorithm indicates that Candidate Schemas 1, 2, and 3 are the best candidates based solely on the number of numeric fields in the OLTP tables. The result of the candidate schema evaluation leads us to choose Candidate Schemas 1, 2, and 5. The evaluation of the candidate schemas against the user queries is considered a manual step because, while the table is created by an automated process, the determination of which schemas to keep based on those results is manual. The best determination of which schemas to keep is likely subjective and should be considered by a designer.

The candidate conceptual schemas still need some refinement by a designer. For example, the date and time dimensions are represented by hexagons and are not broken into the levels comprising the dimensions. It is possible at this point, with further user input, to replace the hexagons with rectangles representing the levels. This is just one example where additional refinement is needed. Steps for further refinement of the conceptual schemas are given below. We show examples and discuss the manual user-driven steps, but these changes are not implemented in the candidate conceptual schemas we use as input to logical design. We recommend completing the manual steps first, but since they can be influenced by designer preference and modeled in varying ways, we have chosen not to clutter the current candidate schemas. The manual refinements are given for completeness of the conceptual design; our logical design in Chapter 4 simply uses the candidate conceptual schemas from Section 3.2.

We have identified seven manual steps for conceptual schema refinement. They are summarized below (numbering starts with 6 because the automated conceptual design steps above ended with 5).

6. If user queries are known, eliminate unnecessary candidate schemas.

7. Inspect measures in each fact node. Are they indeed measures or attributes?

8. What is the necessary grain of data? What date/time information are users interested in? (This step is only necessary in conceptual modeling to determine date/time dimensions. It can be postponed until physical design.)

9. Are other calculated fields necessary?

10. Can schemas be merged?

11. Can any fields be eliminated as not necessary? (May fall into physical design.)

12. Is there any data required that did not exist in the OLTP database?

There may be other additions dictated by need and situation, but the steps above should cover most of the changes that need to be made manually. We provide some examples, but these examples rely on our interpretation of some of the user needs from the TPC-H information. The following is for illustrative purposes only and is not meant as a comprehensive solution for finishing the automatically generated schemas. The basics of Step 6 are described with the evaluation of the schemas in Section 3.3 and are not further explained here.

Step 7 analyzes the measures of a fact node. Some numeric fields are actually attributes (descriptors) rather than measures. Fields such as width, length, and other physical properties are generally attributes. Fields such as size are generally used as limiting criteria in a query rather than as a field to be summed as a measure of the business. In the LineItem Event, there is one measure that is probably an attribute or a key for the OLTP table: the L_LineNumber field. This field is likely used to sequence the items that make up an order. It is not a number that can be used in queries to measure the business. L_LineNumber is non-additive and would not be summed for each item of an order, which is a good indicator that it is an attribute rather than a measure. As part of Step 7, the L_LineNumber field is moved from the fact table and placed in the level node or dimension table. Figure 3.17 shows this change to the facts and dimensions of the ME/R diagrams. Only the portion of the diagram that changes is shown.

Figure 3.17: Candidate Schema 1: ME/R LineNumber Measure Change
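As a small illustration of the table manipulation that Step 7 implies, the Python sketch below moves L_LineNumber from the measures of the LineItem Event fact node to the LineItem level. The list-of-dictionary representation is an assumption of this sketch; only the column names mirror those used by the conceptual schema tables.

def demote_measure(fact_attribute_table, level_attribute_table,
                   fact_node, attribute, target_level):
    # Move a numeric field that is really a descriptor (e.g., L_LineNumber)
    # from the fact node's measures to one of its levels.
    for row in list(fact_attribute_table):
        if (row["fact_node_name"] == fact_node
                and row["attribute_name"] == attribute):
            fact_attribute_table.remove(row)
            level_attribute_table.append({"fact_node_name": fact_node,
                                          "level_name": target_level,
                                          "attribute_name": attribute})

fact_attrs = [{"fact_node_name": "LineItem Event", "attribute_name": "L_LineNumber"},
              {"fact_node_name": "LineItem Event", "attribute_name": "L_ExtendedPrice"}]
level_attrs = []
demote_measure(fact_attrs, level_attrs, "LineItem Event", "L_LineNumber", "LineItem")
# L_LineNumber is now an attribute of the LineItem level rather than a measure.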

Step 8, determining the necessary grain of data, is an optional step. We include it because in many cases the user requirements alone reveal the grain, or level of detail, needed in the data warehouse. If the grain of information is known, the undetermined date and/or time measures can be specified. For example, if the user wants to see part inventory levels at the various suppliers on a weekly basis, a level representing week is necessary. In data warehouse population the grain is important because we now know how often the data need to be captured or what level of detail is needed (i.e., weekly). Users may not care about a day level of the date. They may care about inventory levels on a weekly, monthly, quarterly, and yearly basis. If we add this grain information to Candidate Schema 2, the resulting ME/R diagram is given in Figure 3.18, where the cloud dimension is replaced with a dimension of actual levels.

Figure 3.18: Candidate Schema 2: ME/R with Date Dimension Defined
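One possible recording of this refined date dimension in the conceptual schema tables is sketched below in Python. The level and attribute names (Week, Month, Quarter, Year and their fields) are assumptions chosen to match the weekly-grain example above; only the table and column names follow our design. The level attached to the fact node carries the grain (week), and each coarser level lists its finer level as parent.

date_levels = [
    # (fact_node_name, level_name, parent_level_name)
    ("PartSupp Event", "Week",    None),       # level attached to the fact node
    ("PartSupp Event", "Month",   "Week"),
    ("PartSupp Event", "Quarter", "Month"),
    ("PartSupp Event", "Year",    "Quarter"),
]

date_level_attributes = [
    # (fact_node_name, level_name, attribute_name)
    ("PartSupp Event", "Week",    "Week_Ending_Date"),
    ("PartSupp Event", "Month",   "Month_Name"),
    ("PartSupp Event", "Quarter", "Quarter_Number"),
    ("PartSupp Event", "Year",    "Year_Number"),
]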

The other important aspect of Step 8 is the determination of the levels of interest in the date and/or time dimensions. For some date dimensions, such as the example above, the day is not necessary. For other date columns the user might be interested not only in the date but also in the day of the week (Monday, Tuesday, etc.). The time dimension can be broken into twenty-four one-hour intervals, by shift, or maybe even down to the minute of the day. The grain may give a hint as to this requirement, or it may need to be determined from users.

In Step 9, any calculated fields used on a regular basis are added to the schema. This way the calculation is done before the data is loaded into the warehouse rather than by end-user tools. Calculated fields may replace other fields; if this is the case, the unnecessary fields can be removed. This step is intended specifically for the fact tables. These fields are the numeric fields of most interest to users.

There can be additional calculations in the dimensions, but as those are probably less frequently used, it may be more efficient for the calculations to be performed by end-user tools. In many of the TPC-H queries, LineItem measures are combined to produce a field termed revenue. Because this is used in multiple queries, it is a likely candidate for Step 9. The calculation is sum(L_ExtendedPrice * (1 - L_Discount)). Because sum is an aggregation of all data in the query, we would not use it as part of the new field. Our new revenue field (the revenue of each individual line item) is defined as L_ExtendedPrice * (1 - L_Discount). Figure 3.19 shows this in the ME/R schema. One drawback to the ME/R model is that there is no place to make a notation for definitions such as the calculation that makes up our new revenue field. Such a notation does not exist because ME/R is a conceptual model and is not intended to show functional information such as the actual calculation. The calculation could be an additional field added to the Fact_Attribute_Table, but it would not be shown in the graphical representation as more than another measure.

Figure 3.19: Partial LineItem ME/R with Added Revenue Measure

Another important aspect of Step 9 is adding count fields. In some instances a count may be important; perhaps the number of line items per order is of interest. This can be done in queries, or a counter field could be added to the event node. Count fields tend to be a little more user-friendly even though they are redundant. The actual implementation of the count field is left up to the data warehouse designer.
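As an illustration of the Step 9 refinement, the Python sketch below derives the per-line-item revenue measure before the data would be loaded, so end-user tools only need to sum it. The row layout and sample values are assumptions of this sketch; the formula is the one given above.

def add_revenue(lineitem_row):
    # Derive the new revenue measure from the existing LineItem measures.
    row = dict(lineitem_row)
    row["Revenue"] = row["L_ExtendedPrice"] * (1 - row["L_Discount"])
    return row

row = {"L_ExtendedPrice": 1000.00, "L_Discount": 0.05}
print(add_revenue(row)["Revenue"])   # prints: 950.0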

Step 10 is based on the merging of facts and dimensions as described by Ballard et al. [BH98] and Kimball [K96b, K98]. The merging of either facts or dimensions requires knowledge not only of user requirements but also of the OLTP system. If any of our schemas have a common dimension (a dimension having all of the same levels as a dimension in another schema), they can be merged into fact trees. This would give us more of an enterprise view. For example, suppose a business wanted to evaluate two aspects of its customers: orders and negotiated contracts. It is likely that one of the conceptual schemas would be centered around orders, such as Candidate Schema 4. Another schema might be centered around the negotiated pricing contracts or discounts for various products. These two schemas might have the same customer dimension, as seen in Figure 3.20. The sharing of a dimension gives users a more complete view of their order data in one schema.

Figure 3.20: Two Facts with a Shared Dimension

This can be hazardous because it is possible to create queries across the fact trees that do not fulfill their intended purpose. A query that included both O_TotalPrice and Discount% would be possible because the facts are joined through the Customer level, but it would not be correct if the two date dimensions were of differing granularity. Unless the two date levels are used to pare the query down to the appropriate level, the numeric data would be inconsistent. This sharing of dimension tables is not necessarily bad; it can also be handy for linking various aspects of the enterprise data together. The example above works better if the date level granularity is the same and the facts also share a date. In the schemas generated by our algorithm there is no example of common dimensions.
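The common-dimension condition described above (a dimension having exactly the same levels as a dimension in another schema) can be checked mechanically. The Python sketch below does so for the hypothetical orders and contracts schemas; representing each dimension as a set of level names is an assumption of this sketch.

def shared_dimensions(schema_a, schema_b):
    # Two dimensions can be shared (merged into a fact tree) only when they
    # contain the same set of levels.
    return {frozenset(d) for d in schema_a} & {frozenset(d) for d in schema_b}

orders_schema   = [{"Customer", "Nation", "Region"}, {"Order_Date"}]
contract_schema = [{"Customer", "Nation", "Region"}, {"Contract_Date"}]
for dim in shared_dimensions(orders_schema, contract_schema):
    print(sorted(dim))   # prints: ['Customer', 'Nation', 'Region']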

This merging of dimensions would be most often used with Star schemas or in the logical phase of data warehouse design.

The second form of schema merging is fact merging. In this form of merging, the facts from two candidate conceptual schemas are combined into a single schema. We may find that merging some of our candidate schemas (or parts of them) allows additional queries to be answered. This may be the case when a table is not part of a many relationship to another table and some information from that table fulfills a query. In our evaluation summary given in Table 3.9, Candidate Schema 4 answers several queries, but it only partially answers Q1, Q3, Q10, Q18, and Q22. If we decided to add the numeric fields from LineItem to the Orders Event, we could answer Q1, Q3, and Q18 completely and would now have more of the numeric SELECT fields from Q10 and Q22. In this example we would have to be careful with the physical design of the Orders Event so as not to change the granularity of the TotalPrice attribute. The numeric attributes of LineItem Event are per ordered item. The TotalPrice attribute is the price of all order items comprising an order, so multiple order items contribute to a single TotalPrice value in the fact table. The TotalPrice attribute then becomes a non-additive measure; it is per customer for a time frame but not per order item.

This type of merge is not likely with our conceptual schema algorithm because the measures being moved to the fact node should already exist in a fact node of another schema, and the fact nodes all represent distinct entities of differing granularity. Thus the other schema, with another entity as a fact node, should fulfill the user needs. This type of merge is useful if there exists a candidate schema that fulfills one and only one query. If another candidate schema could partially answer this query, it is possible that a merge of this type could modify the schema to also fulfill the needs of this query. Another instance where fact merging would be beneficial is where the entities that provide the measures for two fact nodes from candidate schemas are related in the OLTP schema by a one-to-one relationship. A one-to-one relationship means that the measures should carry equivalent weight, and because they are one-to-one, the additional dimensions that would appear in the schemas could be useful in a single schema.

The merging of these two candidate schemas creates a new schema that is a compilation of the attributes in the original two schemas. The new fact node has all measures from both previous fact nodes and keys to all dimensions that exist in the two schemas. The levels containing the attributes of the OLTP schema entities that comprise the fact nodes can be merged into one dimension.

Step 11 removes fields that never hold information of interest. In a few entities, such as Region, it is unlikely that the comment field holds information of value; if so, it can be removed. There may also be fields still in existence in an OLTP database that are no longer used by any application. We would not want to bring across these empty fields.

The last manual step is Step 12. This addresses instances where user requirements reference data that is not stored in the enterprise. For example, a car lot may see a decline in sales on rainy days. The users want to track weekly sales while looking at how many days of that week were rainy. The OLTP system may not have data about the weather. In this case an outside source of information is integrated into the data warehouse. Because the data is not in the OLTP data sources used as input, this step cannot be automated.

3.5 Summary

This chapter showcases a methodology for converting an OLTP schema to a conceptual schema for a data warehouse. We start with the inputs of an OLTP schema and user queries to be performed on the data warehouse. An algorithm for creating the candidate conceptual schemas is given and illustrated. These schemas are then evaluated against the user queries. We then provide a guideline for manual refinement of the conceptual schemas based on designer knowledge of the environment and further definition of user needs.

In Chapter 2, we meet our first objective by choosing a top-down architecture approach to data warehousing. In Chapter 3, we complete two more objectives and answer a portion of a third. As mentioned in Chapter 2 and shown here, we choose the ME/R model for conceptual modeling of our data warehouse.

In Section 3.2, we propose and illustrate an algorithm for automated translation of an OLTP schema to a data warehouse conceptual schema in ME/R form. In Section 3.3, we provide a semantic means to evaluate candidate conceptual schemas based on user-driven requirements as represented by queries. In addition to meeting several of our objectives, we provide a guideline for manual refinement of conceptual schemas in Section 3.4. In Chapter 3, we design the first schemas of the schema design phase, the conceptual schemas. At the conclusion of Chapter 3 we have three candidate conceptual schemas to convert to logical schemas. Chapter 4 provides and illustrates an algorithm for converting the candidate conceptual schemas created in this chapter into logical schemas in Star form.

Chapter 4: Developing a Logical Data Warehouse Schema

This chapter defines an algorithm for generating a logical schema in Star form from a conceptual schema in ME/R form. An automated method for converting a conceptual schema to a logical schema is given in Section 4.1. Section 4.2 provides a guideline for manual refinements to a logical schema. Section 4.3 is a summary of the work in this chapter and a discussion of possible alterations to the algorithms presented.

4.1 Logical Schema Generation

After a set of candidate conceptual schemas is selected, the corresponding logical schemas can be created. The Star modeling notation, which is widely used in industry, is used here for logical schemas. The ME/R model provides hierarchical information about the data represented by the schema. The Star model condenses these levels into a single dimension. This single dimension does not show the underlying hierarchical logic inherent in the dimension, but it provides a structure centered around the business measures that facilitates writing and processing of queries. Making a single dimension from multiple levels reduces the joins needed during query evaluation. The steps involved in creating a candidate logical schema from a conceptual one are:

1. A fact node becomes a fact table. Fields of a fact node become measure fields of the fact table.

2. For each level node attached to an event node, add a dimension table to the logical schema. The relationship between the fact table and dimension table is represented by a foreign key and primary key relationship in the two tables.

3. Each field of the level is added to the dimension table.

4. The sub-levels of the current level node are visited. Each column of the sub-levels is added to the dimension table as a field.

These four steps encompass the creation of Star schemas from candidate ME/R schemas. Each ME/R schema results in one Star schema. The algorithms given in Figures 4.1 and 4.2 perform the four steps on the tables created in the conceptual schema design. The steps are indicated by the parenthesized numbers at the right of the pseudocode statements.

The logical schemas created could again be represented in tabular form, but we only illustrate the graphical versions here along with a discussion of manual refinement and alternate approaches. The tabular representation is straightforward to create and is similar to that of the conceptual schema. The selected candidate conceptual schemas from the previous chapter are used here to illustrate the algorithms and discussion.

Logical Schema Creation Algorithm

Input Parameters:
    Fact_Node_Table         // Table defining the fact nodes of ME/R schemas.
    Fact_Attribute_Table    // Table defining the attributes of the fact nodes for the ME/R schema.
    Level_Table             // Table defining the levels of the ME/R schema.
    Level_Attribute_Table   // Defines the attributes of the Levels.

Output Parameters:
    Fact Table              // Logical schema fact nodes.
    Dimension Table         // Logical schema dimensions.

Variables:
    fact_node[]             // Array of conceptual fact nodes to be evaluated.
    fact_attrib[]           // Array of Attribute_Name columns from Fact_Attribute_Table.
    level[]                 // Array of Level_Name columns from Level_Table.
    level_attrib[]          // Array of Attribute_Name columns from Level_Attribute_Table.
    cloud                   // Yes/No field to denote that level to be created is under review.

Method:
    fact_node[] := select_cand_schemas (Fact_Node_Table)                              (1)
    for each fact_node[a]
        create_fact_table (fact_node[a], Fact Table)
        fact_attrib[] := select_fact_attributes (fact_node[a], Fact_Attribute_Table)
        for each fact_attrib[m]
            add_measures (fact_attrib[m], fact_node[a], Fact Table)
        end for
        level[] := select_level (fact_node[a], Level_Table)                           (2)
        for each level[p]
            cloud := check_review (fact_node[a], level[p], Level_Table)
            create_dimension_table (level[p], cloud, Dimension Table)
            add_keys (level[p], fact_node[a], Dimension Table)
            level_attrib[] := select_level_attribute (level[p], fact_node[a],         (3)
                                                      Level_Attribute_Table)
            for each level_attrib[x]
                add_attrib_to_dimension (level[p], level_attrib[x], Dimension Table)
            end for
            Add_Level_To_Dimension (fact_node[a], level[p], level[p], Level_Table,    (4)
                                    Level_Attribute_Table, Dimension Table)
        end for
    end for
end algorithm

Figure 4.1: Automated Logical Schema Creation Algorithm

Add Level to Dimension Procedure for Logical Schema Creation Algorithm

Procedure Add_Level_To_Dimension (fact_node, parent_level, dimension, Level_Table,
                                  Level_Attribute_Table, Dimension Table)

Input Parameters:
    Level_Table             // Table defining the levels of the ME/R schema.
    Level_Attribute_Table   // Defines the attributes of the Levels.
    fact_node               // Conceptual fact node being evaluated.
    parent_level            // Level passed to procedure, now used as parent level.
    dimension               // Name of dimension table, level connected to fact node.

In/Out Parameters:
    Dimension Table

Variables:
    level[]                 // Array of Level_Name columns from Level_Table.
    level_attrib[]          // Array of Attribute_Name columns from Level_Attribute_Table.

Method:
    level[] := select_sub_level (Level_Table, parent_level, fact_node)
    for each level[g]
        level_attrib[] := select_level_attrib (Level_Attribute_Table, fact_node, level[g])
        for each level_attrib[k]
            add_level_attrib_to_dimension (dimension, level_attrib[k], Dimension Table)
        end for
        Add_Level_To_Dimension (fact_node, level[g], dimension, Level_Table,
                                Level_Attribute_Table, Dimension Table)
    end for
end procedure

Figure 4.2: Add_Level_to_Dimension Procedure for Automated Logical Schema Creation
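To make the control flow of Figures 4.1 and 4.2 concrete, the following compact Python sketch performs the same four steps. It is an illustration only: the list-of-dictionary table layout, the lowercase key names, and the returned schema structure are assumptions of this sketch (the conceptual schema tables are relational in our design), and the under-review (cloud) handling of the date and time dimensions is omitted for brevity.

def create_logical_schemas(fact_node_table, fact_attribute_table,
                           level_table, level_attribute_table):
    star_schemas = []
    # (1) Only conceptual schemas marked Keep = 'Y' are converted.
    for fact_node in (row for row in fact_node_table if row["keep"] == "Y"):
        name = fact_node["fact_node_name"]
        fact_table = {"name": name, "measures": [], "foreign_keys": []}
        for attr in fact_attribute_table:
            if attr["fact_node_name"] == name:
                fact_table["measures"].append(attr["attribute_name"])
        dimensions = []
        # (2) Each level attached directly to the fact node becomes a dimension table.
        for level in level_table:
            if level["fact_node_name"] == name and level["parent_level_name"] is None:
                dim = {"name": level["level_name"],
                       "key": level["level_name"] + "_Key",
                       "attributes": []}
                fact_table["foreign_keys"].append(dim["key"])
                # (3) Add the attributes of the level itself.
                add_level_attributes(name, level["level_name"], dim,
                                     level_attribute_table)
                # (4) Roll the attributes of all sub-levels into the same dimension.
                add_sub_levels(name, level["level_name"], dim,
                               level_table, level_attribute_table)
                dimensions.append(dim)
        star_schemas.append({"fact": fact_table, "dimensions": dimensions})
    return star_schemas

def add_level_attributes(fact_node_name, level_name, dim, level_attribute_table):
    for attr in level_attribute_table:
        if (attr["fact_node_name"] == fact_node_name
                and attr["level_name"] == level_name):
            dim["attributes"].append(attr["attribute_name"])

def add_sub_levels(fact_node_name, parent_level, dim,
                   level_table, level_attribute_table):
    # Visit every sub-level of parent_level and fold its attributes, and those
    # of its own sub-levels, into the same dimension table.
    for level in level_table:
        if (level["fact_node_name"] == fact_node_name
                and level["parent_level_name"] == parent_level):
            add_level_attributes(fact_node_name, level["level_name"], dim,
                                 level_attribute_table)
            add_sub_levels(fact_node_name, level["level_name"], dim,
                           level_table, level_attribute_table)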

Using ME/R Candidate Schemas 1, 2, and 5 from Chapter 3, we create Star schemas using the algorithms in Figures 4.1 and 4.2. We use the Candidate Schema LineItem Event (Figure 3.10) to illustrate the algorithm. The other ME/R schemas, Candidate Schemas 2 and 5, are given in completed Star form in Figures 4.10 and 4.11.

Starting with Step 1, there are three entries in the Fact_Node_Table with a Keep value of Y. The first row to process is LineItem Event. This becomes the name of the fact table for the Star schema being created. The five numeric fields (attributes) of the LineItem Event become measures of the fact table. These attributes are found by querying the Fact_Attribute_Table where the fact_node_name = LineItem Event. The fact table for the Star schema created by Step 1 is shown in Figure 4.3.

Figure 4.3: LineItem Event Fact Table

Step 2 adds a dimension table to the diagram for each level node attached to the event node. These level nodes can be found in the Level_Table where the fact_node_name = LineItem Event. We start by processing the LineItem level. A dimension table called LineItem is created as shown in Figure 4.4. A primary key is created in the LineItem dimension, and a corresponding foreign key is placed in the LineItem Event fact table. These become blind keys: fields not analyzed or useful to users, used only to associate instances of a dimension with the measures being analyzed.

Figure 4.4: Addition of LineItem Dimension to Star Model of Candidate Schema 1
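The blind keys introduced in Step 2 receive their values only when the warehouse is populated, which belongs to physical design; the minimal Python sketch below is included only to clarify their role. The key generator, field names, and sample values are assumptions of this sketch.

import itertools

surrogate_keys = itertools.count(1)

def load_dimension_row(dimension_rows, attributes):
    # Assign the next arbitrary integer as the dimension row's blind key and
    # return it so it can be stored as the foreign key in the fact row.
    key = next(surrogate_keys)
    dimension_rows[key] = attributes
    return key

lineitem_dim = {}
fact_row = {"LineItem_Key": load_dimension_row(lineitem_dim,
                                               {"L_ShipMode": "TRUCK"}),
            "L_ExtendedPrice": 1000.00}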

Step 3 continues processing of the LineItem level node. In this step each attribute of the level is added to the dimension. The attributes are found in the Level_Attribute_Table where fact_node_name = LineItem Event and level_name = LineItem. The LineItem level with attribute fields added is shown in Figure 4.5.

Figure 4.5: Addition of Columns for LineItem Dimension

The LineItem level node has no relationships to other sub-level nodes. This can be determined by looking for rows in the Level_Table where fact_node_name = LineItem Event and parent_level_name = LineItem. In this case Step 4 can be skipped because no such rows exist. Step 4 is discussed in more detail in the next level example. Starting again with Steps 2 and 3, the Orders level results in the schema shown in Figure 4.6. In this figure, the Orders level has been added to create an Orders dimension with keys and attributes.

Figure 4.6: Orders Dimension for Candidate Schema 1

Querying the Level_Table for a parent_level_name = Orders shows that sub-levels exist, so Step 4 is processed for this level. The attributes of these sub-levels will be added to the Orders dimension. The first sub-level is Customer. The five Customer attributes are added into the Orders dimension. Here the Step 4 sub-procedure is called recursively to determine whether Customer has additional sub-levels. The Customer level has an additional sub-level, Nation. The attributes of the Nation level are also added to the Orders dimension. Following this same sequence of steps, the Region attributes are added as well. These changes to the Star schema are shown in Figure 4.7.

Figure 4.7: Star Schema with Completed Orders Dimension

Figure 4.8 is a diagram of the Star schema after adding the PartSupp level and its sub-levels. Nation and Region attributes are also included as part of this dimension. In the Star schema, the sub-level attributes are rolled up into the highest level to create a dimension table. At this point the Customer and Supplier nodes cannot share a single Nation level because the individual levels no longer exist. The attributes of the level are part of the dimension, and parts of dimensions cannot be shared.

Figure 4.8: PartSupp Dimension Added to Star Schema of Candidate Schema 1

Returning to Step 2, the date and time levels remain to be processed. The sub-levels and attributes that make up these nodes are not known, so we represent them in the Star schema as dimensions consisting of only keys at this point. The levels that eventually make up each dimension, including their attributes, are added when additional user needs are known. The completed Star schema is given in Figure 4.9.

Figure 4.9: Candidate Schema 1 as a Star Schema
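Because Figure 4.9 is graphical, a rough textual rendering of the completed Star schema may be helpful. The structure below follows the walkthrough above; the measure names are the numeric TPC-H LineItem attributes and the dimension contents are abbreviated, so the exact spellings should be treated as assumptions of this rendering.

candidate_schema_1_star = {
    "fact": {
        "name": "LineItem Event",
        "foreign_keys": ["LineItem_Key", "Orders_Key", "PartSupp_Key",
                         "Date_Key", "Time_Key"],
        # L_LineNumber remains a measure here because the manual Step 7
        # refinement was not applied to the schemas used for logical design.
        "measures": ["L_Quantity", "L_ExtendedPrice", "L_Discount",
                     "L_Tax", "L_LineNumber"],
    },
    "dimensions": {
        "LineItem": "non-numeric LineItem attributes",
        "Orders":   "Orders attributes plus rolled-up Customer, Nation, Region",
        "PartSupp": "PartSupp attributes plus rolled-up Part, Supplier, Nation, Region",
        "Date":     "key only until its levels are determined",
        "Time":     "key only until its levels are determined",
    },
}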

The remaining candidate schemas (Candidate Schemas 2 and 5) are represented in Figures 4.10 and 4.11. The date and time dimensions are modeled slightly differently in these schemas. The dimensions are not even known to exist because there are no date and time fields associated with the schemas. We add them in conceptual schema creation because data warehouse schemas tend to rely on some sort of date field in order to compare measures. The conceptual schema


More information

Data Warehouse and Data Mining

Data Warehouse and Data Mining Data Warehouse and Data Mining Lecture No. 07 Terminologies Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Database

More information

Management Information Systems

Management Information Systems Foundations of Business Intelligence: Databases and Information Management Lecturer: Richard Boateng, PhD. Lecturer in Information Systems, University of Ghana Business School Executive Director, PearlRichards

More information

The Data Organization

The Data Organization C V I T F E P A O TM The Data Organization Best Practices Metadata Dictionary Application Architecture Prepared by Rainer Schoenrank January 2017 Table of Contents 1. INTRODUCTION... 3 1.1 PURPOSE OF THE

More information

Data Warehouse and Data Mining

Data Warehouse and Data Mining Data Warehouse and Data Mining Lecture No. 02 Introduction to Data Warehouse Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

Data Warehouses. Yanlei Diao. Slides Courtesy of R. Ramakrishnan and J. Gehrke

Data Warehouses. Yanlei Diao. Slides Courtesy of R. Ramakrishnan and J. Gehrke Data Warehouses Yanlei Diao Slides Courtesy of R. Ramakrishnan and J. Gehrke Introduction v In the late 80s and early 90s, companies began to use their DBMSs for complex, interactive, exploratory analysis

More information

Data-Driven Driven Business Intelligence Systems: Parts I. Lecture Outline. Learning Objectives

Data-Driven Driven Business Intelligence Systems: Parts I. Lecture Outline. Learning Objectives Data-Driven Driven Business Intelligence Systems: Parts I Week 5 Dr. Jocelyn San Pedro School of Information Management & Systems Monash University IMS3001 BUSINESS INTELLIGENCE SYSTEMS SEM 1, 2004 Lecture

More information

DATA VAULT MODELING GUIDE

DATA VAULT MODELING GUIDE DATA VAULT MODELING GUIDE Introductory Guide to Data Vault Modeling GENESEE ACADEMY, LLC 2012 Authored by: Hans Hultgren DATA VAULT MODELING GUIDE Introductory Guide to Data Vault Modeling Forward Data

More information

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing. About the Tutorial A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries and decision making. This

More information

Overview of Reporting in the Business Information Warehouse

Overview of Reporting in the Business Information Warehouse Overview of Reporting in the Business Information Warehouse Contents What Is the Business Information Warehouse?...2 Business Information Warehouse Architecture: An Overview...2 Business Information Warehouse

More information

Data Analysis and Data Science

Data Analysis and Data Science Data Analysis and Data Science CPS352: Database Systems Simon Miner Gordon College Last Revised: 4/29/15 Agenda Check-in Online Analytical Processing Data Science Homework 8 Check-in Online Analytical

More information

STRATEGIC INFORMATION SYSTEMS IV STV401T / B BTIP05 / BTIX05 - BTECH DEPARTMENT OF INFORMATICS. By: Dr. Tendani J. Lavhengwa

STRATEGIC INFORMATION SYSTEMS IV STV401T / B BTIP05 / BTIX05 - BTECH DEPARTMENT OF INFORMATICS. By: Dr. Tendani J. Lavhengwa STRATEGIC INFORMATION SYSTEMS IV STV401T / B BTIP05 / BTIX05 - BTECH DEPARTMENT OF INFORMATICS LECTURE: 05 (A) DATA WAREHOUSING (DW) By: Dr. Tendani J. Lavhengwa lavhengwatj@tut.ac.za 1 My personal quote:

More information

Teradata Aggregate Designer

Teradata Aggregate Designer Data Warehousing Teradata Aggregate Designer By: Sam Tawfik Product Marketing Manager Teradata Corporation Table of Contents Executive Summary 2 Introduction 3 Problem Statement 3 Implications of MOLAP

More information