Logical Design A logical design is conceptual and abstract. It is not necessary to deal with the physical implementation details at this stage.

2 Logical Design A logical design is conceptual and abstract. It is not necessary to deal with the physical implementation details at this stage. You need only define the types of information specified by your requirements. One technique you can use to model your logical information requirements is entity-relationship (ER) modeling. ER modeling involves identifying important data (entities), the properties of these entities (attributes), and how they are related to one another (relationships). For modeling purposes, an entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity that helps define the uniqueness of the entity. In relational databases, an attribute maps to a column. To ensure that your data is consistent, you should use unique identifiers. A unique identifier is added to tables so that you can differentiate between occurrences of the same item when it appears in different places. In practice, this is usually a primary key. Although entity-relationship diagramming has traditionally been associated with highly normalized models such as OLTP applications, the technique is still useful for data warehouse design in the form of dimensional modeling. Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables. In dimensional modeling, you identify which information belongs to a central fact table and which information belongs to its associated dimension tables.
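
The following is a minimal sketch, in Python, of how these logical pieces — one fact table whose composite primary key is made up of foreign keys, plus dimension tables with descriptive attributes — might be captured as plain data structures before any physical design work. The table and column names are illustrative assumptions, not taken from any particular schema.

```python
# A minimal sketch of a logical dimensional model, expressed as plain Python
# data structures. The entity and attribute names (sales, customers, products,
# times) are illustrative only.

fact_table = {
    "name": "sales",
    # Composite primary key made up of foreign keys to the dimensions.
    "keys": ["cust_id", "prod_id", "time_id"],
    # Numeric measures to be analyzed.
    "measures": ["quantity_sold", "amount"],
}

dimension_tables = {
    "customers": {"key": "cust_id", "attributes": ["name", "city", "segment"]},
    "products":  {"key": "prod_id", "attributes": ["name", "category"]},
    "times":     {"key": "time_id", "attributes": ["day", "month", "quarter", "year"]},
}

def describe(fact, dims):
    """Print the logical design: one fact table plus its dimension tables."""
    print(f"Fact table {fact['name']}: key={fact['keys']}, measures={fact['measures']}")
    for name, dim in dims.items():
        print(f"  Dimension {name}: key={dim['key']}, attributes={dim['attributes']}")

describe(fact_table, dimension_tables)
```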

3 Logical Design (continued) You identify business subjects or fields of data, define relationships between business subjects, and name the attributes for each subject. Your logical design should include: a set of entities and attributes corresponding to fact tables and dimension tables, and a model showing how operational data from your source systems maps into subject-oriented information in your target data warehouse schema. You can create the logical design using a pen and paper, or you can use a design tool such as Oracle Warehouse Builder (specifically designed for modeling the ETL process) or Oracle Designer (a general purpose modeling tool).

4 Data Warehousing Schemas A schema is a collection of database objects that includes tables, views, indexes, and synonyms. You can arrange schema objects in the schema models designed for data warehousing in a variety of ways. Most data warehouses use a dimensional model. The model of your source data and the requirements of your users help you design the data warehouse schema. You can sometimes get the source model from your enterprise data model and reverse-engineer the logical data model for the data warehouse from this. The physical implementation of the logical data warehouse model may require some changes to adapt it to your system parameters: size of machine, number of users, storage capacity, type of network, and software. A common data warehouse schema model is the star schema. However, there are other schema models that are commonly used for data warehouses. The most prevalent of these schema models is the third normal form (3NF) schema. The snowflake schema is a type of star schema, but slightly more complex. Additionally, some data warehouse schemas are neither star schemas nor 3NF schemas, but share characteristics of both; these are referred to as hybrid schema models. The important thing to remember when designing your schema is not to get lost in theory and academic comparisons. These days, most successful data warehouses employ a hybrid approach to schemas.

5 Schema Characteristics The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table. A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates efficient execution plans (star transformation) for them. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data is grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema may be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. Although this saves space, it increases the number of dimension tables and requires more foreign key joins. This results in more complex queries and reduced query performance.

6 Schema Characteristics (continued) A third normal form (3NF) schema is a classical relational database modeling technique that minimizes data redundancy through normalization. A relation can be considered to be in 3NF if none of its non-primary-key attributes are duplicated in other tables. When compared to a star schema, a 3NF schema typically has a larger number of tables due to this normalization process. 3NF schemas are typically chosen for large data warehouses, especially environments with significant data-loading requirements that are used to feed data marts and execute long-running queries.

7 Star Schema Model Normalization is not always a good thing when dealing with large amounts of data. Although it is ideal for data updates, inserts, deletes, and integrity, it can slow down processing. To speed up processing, you can denormalize data into a star schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central fact table. The center of the star consists of one or more fact tables and the points of the star are the dimension tables. This kind of schema can be more natural to nontechnical end users who are more familiar with logical entities rather than entities and relationships. A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse and a number of much smaller dimension tables, each of which contains information about the entries for a particular attribute in the fact table. A star query is a join between a fact table and several dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. A typical fact table contains keys and measures. For example, in the Sales History schema, the fact table, sales, contains the measures quantity_sold, amount, and cost, and the keys cust_id, time_id, prod_id, channel_id, and promo_id. The dimension tables are customers, times, products, channels, and promotions.
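
To make the shape of a star query concrete, here is a hedged, runnable sketch that uses Python's sqlite3 module as a stand-in for an Oracle database. It builds a tiny version of the sales star described above (only the customer, product, and time dimensions, with invented sample rows and column names) and runs a star query that joins the fact table to each dimension but never joins the dimensions to each other.

```python
import sqlite3

# A runnable sketch of a tiny star schema and a star query, using SQLite as a
# stand-in for Oracle. Only three of the five dimension keys mentioned above
# (cust_id, prod_id, time_id) are used, and all names and data are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, cust_name TEXT, cust_city TEXT);
CREATE TABLE products  (prod_id INTEGER PRIMARY KEY, prod_name TEXT, prod_category TEXT);
CREATE TABLE times     (time_id INTEGER PRIMARY KEY, calendar_month TEXT, calendar_year INTEGER);
-- Fact table: foreign keys to each dimension plus numeric measures.
CREATE TABLE sales (
    cust_id INTEGER REFERENCES customers(cust_id),
    prod_id INTEGER REFERENCES products(prod_id),
    time_id INTEGER REFERENCES times(time_id),
    quantity_sold INTEGER,
    amount REAL,
    PRIMARY KEY (cust_id, prod_id, time_id)   -- composite key made of foreign keys
);
INSERT INTO customers VALUES (1, 'Acme Ltd', 'London'), (2, 'Bravo Inc', 'New York');
INSERT INTO products  VALUES (10, 'Optical Disk', 'Storage'), (11, 'Marking Pen', 'Stationery');
INSERT INTO times     VALUES (100, '2024-01', 2024), (101, '2024-02', 2024);
INSERT INTO sales     VALUES (1, 10, 100, 5, 50.0), (2, 11, 101, 3, 9.0), (1, 11, 100, 2, 6.0);
""")

# A star query: the fact table is joined to each dimension; the dimensions are
# not joined to each other.
rows = con.execute("""
    SELECT c.cust_city, p.prod_category, t.calendar_month, SUM(s.amount)
    FROM sales s
    JOIN customers c ON s.cust_id = c.cust_id
    JOIN products  p ON s.prod_id = p.prod_id
    JOIN times     t ON s.time_id = t.time_id
    GROUP BY c.cust_city, p.prod_category, t.calendar_month
""").fetchall()
for row in rows:
    print(row)
```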

8 Star Schema Model (continued) The products dimension table, for example, contains information about each product number that appears in the fact table. The main advantages of star schemas are that they: Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design Provide highly optimized performance for typical star queries Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contain dimension tables. The most natural way to model a data warehouse is as a star schema, where only one join establishes the relationship between the fact table and any one of the dimension tables. A star schema optimizes performance by keeping queries simple and providing fast response time. All the information about each level is stored in one row. Star schemas do have some inherent difficulties. It is possible for the central fact table to grow very large, with an upper limit of the product of the number of rows in each dimension table. Also, the dimension tables are no longer normalized, so they are larger and harder to maintain with lots of duplicate data.

9 Snowflake Schema Model If your business needs require more normalization, you can employ a snowflake schema, which is a star schema with some of the features of third normal form (3NF) data. The snowflake schema is a more complex data warehouse model than a star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been normalized into multiple smaller tables instead of one large table. For example, a product dimension table in a star schema may be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. Although this saves space, it increases the number of dimension tables and requires more foreign key joins. This results in more complex queries and reduced query performance. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and the joining of smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increased number of lookup tables. Although snowflake schemas are unnecessary when the dimension tables are small, a business having large dimension tables containing millions of rows can use snowflake schemas to significantly improve performance.
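
The sketch below, again using sqlite3 as a stand-in and invented sample data, shows the snowflaked product dimension mentioned above — products, product_category, and product_manufacturer — and the extra foreign-key joins that the snowflake introduces compared with a single denormalized dimension table.

```python
import sqlite3

# A sketch of snowflaking the product dimension: the single star-schema
# dimension is split into products, product_category, and product_manufacturer.
# All names and data are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_category     (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_manufacturer (manufacturer_id INTEGER PRIMARY KEY, manufacturer_name TEXT);
CREATE TABLE products (
    prod_id INTEGER PRIMARY KEY,
    prod_name TEXT,
    category_id INTEGER REFERENCES product_category(category_id),
    manufacturer_id INTEGER REFERENCES product_manufacturer(manufacturer_id)
);
INSERT INTO product_category     VALUES (1, 'Storage'), (2, 'Stationery');
INSERT INTO product_manufacturer VALUES (1, 'Acme Media'), (2, 'PenCo');
INSERT INTO products VALUES (10, 'Optical Disk', 1, 1), (11, 'Marking Pen', 2, 2);
""")

# Resolving a product's category and manufacturer now takes extra foreign-key
# joins that a denormalized star-schema dimension would not need.
for row in con.execute("""
    SELECT p.prod_name, c.category_name, m.manufacturer_name
    FROM products p
    JOIN product_category     c ON p.category_id = c.category_id
    JOIN product_manufacturer m ON p.manufacturer_id = m.manufacturer_id
"""):
    print(row)
```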

10 Snowflake Schema Model (continued) However, a potential problem that you may encounter with snowflake schemas is that they may start to show signs of the performance problems of 3NF queries. Note: It is suggested that you choose a star schema over a snowflake schema unless you have a clear business reason to choose the snowflake schema.

14 Data Warehousing Objects Fact tables and dimension tables are the two types of objects commonly found in dimensional data warehouse schemas. A fact table is large and typically has two types of columns: those that contain numeric facts (often called measurements) and those that are foreign keys to dimension tables. Measures are the data that you want to analyze, such as total_sales or unit_cost. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data that can be analyzed and examined. Examples of fact tables include SALES, COST, and PROFIT. Facts are generated by events that occurred in the past and are unlikely to change, regardless of how they are analyzed. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Fact tables are usually deep but not wide. You must decide the granularity of the fact table: what level of detail do you want? A transactional grain is the finest level; an aggregated grain holds more summarized data. The choice of grain can also impact the dimension table attributes. Finer grains are better for market-basket analysis, where you want to identify affinity between products; however, a finer grain means more dimension tables and more rows in the fact table.
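
The following is a small sketch of the grain decision: the same sales events kept at a transactional grain versus rolled up to a coarser (product, month) grain. The field names and figures are invented for illustration.

```python
from collections import defaultdict

# Transactional grain: one row per sale (maximum detail, most rows).
transactions = [
    {"prod": "Optical Disk", "month": "2024-01", "qty": 5, "amount": 50.0},
    {"prod": "Optical Disk", "month": "2024-01", "qty": 2, "amount": 20.0},
    {"prod": "Marking Pen",  "month": "2024-02", "qty": 3, "amount": 9.0},
]

# Aggregated grain: one row per (product, month); fewer rows, less detail.
aggregated = defaultdict(lambda: {"qty": 0, "amount": 0.0})
for t in transactions:
    key = (t["prod"], t["month"])
    aggregated[key]["qty"] += t["qty"]
    aggregated[key]["amount"] += t["amount"]

print(len(transactions), "transaction-grain rows")
print(len(aggregated), "aggregated-grain rows")
for key, measures in aggregated.items():
    print(key, measures)
```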

15 Data Warehousing Objects (continued) Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetic addition; a common example is sales. Non-additive facts cannot be added at all; an example is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others; an example is inventory levels, where you cannot tell what a level means simply by looking at it. You must define a fact table for each star schema. From a modeling standpoint, the primary key of the fact table is usually a composite key that is made up of all its foreign keys. However, it is common to add a surrogate key, particularly to deal with slowly changing dimensions. For example, the product Floppy Disk may have a natural key of x001, which is held in the fact table as a foreign key relationship to the product dimension in the data warehouse. What impact is there on the data warehouse if, because Floppy Disk is no longer a product produced by the company, the product description for the same natural key x001 is changed to Optical Disk in the production system? The fact data relating to the old product Floppy Disk is broken. By introducing a surrogate key for the product dimension table, we can preserve the relationship with the old product by inserting a new row for Optical Disk with the next surrogate key value in the dimension table. By doing this, we have preserved the fact data relating to the old product name.

16 Data Warehousing Objects (continued) Dimension tables, also known as lookup or reference tables, contain the relatively static data in the data warehouse. Dimension tables store the information that you normally use to constrain queries. Dimension tables are usually textual and descriptive, and you can use them as the row headers of the result set. Examples are CUSTOMERS and PRODUCTS. A dimension table is wide but not typically deep: it may have more than 50 attributes but relatively few rows (although in some cases it could have thousands to millions). For a star schema, the dimension data is not normalised. Dimensions can support drill-downs and roll-ups (also known as drill-ups) where they exhibit natural hierarchies, e.g. drilling down from total sales by year to quarter, month, week, and day. Relationships guarantee the integrity of business information. An example is that if a business sells something, there is obviously a customer and a product. Designing a relationship between the sales information in the fact table and the dimension tables PRODUCTS and CUSTOMERS enforces the business rules in databases. Commonly, a surrogate key is used as the primary key rather than the natural key because of slowly changing dimensions (SCDs). SCD Type 1: overwrite the data in the dimension. This does not maintain history but is good for correcting errors; most data warehouses start out with Type 1 as the default. SCD Type 2: add a new dimension record to preserve history. You must generalize the primary key by replacing it with a surrogate key and adding start and end effective-date columns. SCD Type 3: some user-assigned attributes can legitimately have more than one assigned value depending on the observer's viewpoint. For example, in a stationery shop a marking pen could be assigned to the household goods category or the art supplies category. A new alternate-category column is added to the dimension to facilitate this; however, this approach does not scale gracefully beyond a few choices.
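
The following is a sketch of SCD Type 2 handling applied to the Floppy Disk/Optical Disk example above: instead of overwriting the description for natural key x001, a new row with the next surrogate key and effective dates is inserted. The data structures and dates are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProductRow:
    surrogate_key: int
    natural_key: str
    description: str
    effective_from: str
    effective_to: Optional[str]  # None means this is the current version

# Initial state of the product dimension.
product_dim = [ProductRow(1, "x001", "Floppy Disk", "2000-01-01", None)]

def apply_scd_type2(dim, natural_key, new_description, change_date):
    """Close off the current row for the natural key and insert a new version."""
    current = next(r for r in dim if r.natural_key == natural_key and r.effective_to is None)
    current.effective_to = change_date
    dim.append(ProductRow(max(r.surrogate_key for r in dim) + 1,
                          natural_key, new_description, change_date, None))

apply_scd_type2(product_dim, "x001", "Optical Disk", "2024-01-01")
for row in product_dim:
    print(row)
# Old fact rows keep pointing at surrogate key 1 (Floppy Disk); new fact rows
# use surrogate key 2 (Optical Disk), so history is preserved.
```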

17 Dimensions and Hierarchies A dimension is a structure composed of one or more hierarchies that categorizes data. Dimensions are descriptive labels that provide supplemental information about facts and are stored in dimension tables. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time. Dimension data is typically collected at the lowest level of detail and then aggregated into higher-level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies. Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a TIME dimension, a hierarchy may aggregate data from the month level to the quarter level to the year level. Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the PRODUCT dimension, there may be two hierarchies: one for product categories and the other for product suppliers.
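
The sketch below illustrates aggregating along the time hierarchy described above (month to quarter to year). The sales figures are invented for illustration.

```python
from collections import defaultdict

# Lower-level (monthly) totals; values are illustrative only.
monthly_sales = {"2024-01": 120.0, "2024-02": 80.0, "2024-04": 50.0, "2024-11": 30.0}

def month_to_quarter(month: str) -> str:
    """Map a 'YYYY-MM' label to its quarter, e.g. '2024-02' -> '2024-Q1'."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def roll_up(values: dict, key_fn) -> dict:
    """Aggregate lower-level values into higher-level totals."""
    totals = defaultdict(float)
    for key, value in values.items():
        totals[key_fn(key)] += value
    return dict(totals)

quarterly = roll_up(monthly_sales, month_to_quarter)       # month -> quarter
yearly    = roll_up(quarterly, lambda q: q.split("-")[0])  # quarter -> year
print(quarterly)
print(yearly)
```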

18 Defining Dimensions and Hierarchies The example in the slide describes a single hierarchy within the time dimension, but it is possible to have multiple hierarchies. For example, another hierarchy can be created to link sales date with week or season. Note that by creating a dimension, you just create metadata that the Oracle server can use afterward, for example, during query rewrite. It does not mean that you enforce any of the relationships described in the newly created dimension. That is why constraints can still be used, whenever possible, to maintain dimension validity. After defining the dimension, you can validate it by using the DBMS_DIMENSION.VALIDATE_DIMENSION procedure.

19 Current approaches to dimensional design mostly take a top-down approach. From the demand side, the models are developed from user query and analysis requirements. However, from the supply side, to a large extent a data warehouse is simply a repackaging of operational data in a more accessible form. Therefore, dimensional modelling design is highly constrained and limited by what data is available from the operational systems. The transformation of an ER model to dimensional form takes place in four steps: Step 1 Classify Entities; Step 2 Design High-Level Star Schemas (identify the star schemas required, define the level of summarisation, and identify the relevant dimensions); Step 3 Detailed Fact Table Design; Step 4 Detailed Dimension Table Design.

20 Step 1 Classify Entities. Transaction Entities record details of business events (e.g. orders, shipments, airline reservations); most BI applications focus on these events to identify patterns, trends, and potential problems. Component Entities are directly related to a transaction entity by a one-to-many relationship. They are involved in the business event and answer the who, what, where, how, and why questions about the event. Classification Entities are related to a component entity by a chain of one-to-many relationships. These define embedded hierarchies in the data model and are used to classify component entities. Step 2 Design High-Level Star Schemas. Identify Star Schemas Required: each transaction entity is a candidate for a star schema. The process of creating star schemas is one of sub-setting, i.e. dividing a large and complex model into manageable-sized chunks. Note that there is not always a one-to-one correspondence between transaction entities and star schemas. For instance, not all transactions will be important for decision making, and user input is required to identify the transaction entities that are important. When transaction entities are connected in a master-detail structure, they should be combined into a single star schema, e.g. Order and OrderItem. Define Level of Summarisation: deciding on the level of granularity is one of the most critical decisions in star schema design. A fine (or transaction-level) grain is unsummarised: each fact table row corresponds to a single transaction. This grain provides maximum flexibility but has storage implications. A coarse grain is summarised, perhaps by a subset of dimensions or dimensional attributes; each row in the fact table corresponds to multiple transactions. Less storage is required here, but summarisation loses information and can limit the types of analyses. Most data warehouse environments have a combination of unsummarised and summarised data. An important design decision with respect to integration of the star schemas is that all the star schemas should share common dimensions, called conformed dimensions. This ensures that users can drill across from one star schema to another. Identify Relevant Dimensions: component entities associated with each transaction entity represent candidate dimensions. There is not always a one-to-one mapping; all component entities may not be relevant for the purposes of analysis or for the level of granularity. Date and/or time appear as explicit dimensions in most star schemas but are not normally represented as entities in operational systems. If a non-transaction granularity is chosen, the dimensions required will be determined by how the transactions are summarized (often a subset of the dimensions used in the transaction-level star schema).
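
The sketch below applies the classification rules above to a small, invented ER model. The entity names, the list of one-to-many relationships, and the choice of transaction entities are all assumptions made for illustration; only the classification logic follows the definitions given here.

```python
# Each relationship is (one side, many side) of a one-to-many relationship.
entities = {"Order", "OrderItem", "Customer", "Product", "ProductCategory", "Region"}
one_to_many = [
    ("Customer", "Order"),
    ("Order", "OrderItem"),
    ("Product", "OrderItem"),
    ("ProductCategory", "Product"),
    ("Region", "Customer"),
]
transaction_entities = {"Order", "OrderItem"}   # record business events (given)

# Component entities: directly related to a transaction entity on the one side.
component = {one for one, many in one_to_many
             if many in transaction_entities and one not in transaction_entities}

# Classification entities: reachable from a component entity via further
# one-to-many links (they define embedded hierarchies).
classification = set()
frontier = set(component)
while frontier:
    frontier = {one for one, many in one_to_many
                if many in frontier and one not in transaction_entities}
    classification |= frontier

print("Component entities:", component)            # candidate dimensions
print("Classification entities:", classification)  # later collapsed into dimensions
# Order and OrderItem form a master-detail pair, so they would be combined into
# a single candidate star schema.
```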

21 Step 3 Detailed Fact Table Design. Define the Key: for a fact table it will be a composite key made up of the dimension keys, which act as foreign keys to the dimension tables. Define the Facts: the facts are determined by what is available in the operational data. These non-key attributes of the fact table are popularly called measures. Facts should be numeric and at the same grain. For performance reasons, it is common to store derived values (pre-calculations) in the fact table, e.g. gross pay minus deductions = net pay. While we can derive net pay from the gross pay and the deductions, it makes sense to store net pay to improve query performance. There are three types of measures: 1) fully additive, 2) semi-additive, and 3) non-additive. Where possible, one should convert non-additive and semi-additive facts to fully additive facts. In some cases fact tables can be factless, where one simply wants to capture that a particular event has occurred, e.g. a crime or a student registration. Step 4 Detailed Dimension Table Design. Define the Dimension Primary Key: this should be a simple, generalized numeric key. This facilitates preservation of history, especially in regard to slowly changing dimensions (natural keys may be reused over time). One should still retain the natural key as part of the table definition. The key should be generalised to preserve historical data, as the primary keys in the operational data may be reused in the OLTP system; e.g. for a student dimension, introduce a generalized numeric key but retain the natural key X. Collapse Hierarchies: dimension tables are formed by collapsing or denormalising the hierarchies defined by classification entities. Dimension tables are wide and can have 100+ attributes. This introduces redundancy in the data in the form of transitive dependencies, i.e. breaking the 3NF rule. Replace Codes and Abbreviations with Descriptive Text: for understandability and readability, codes and abbreviations in the source data should be removed and replaced by descriptive text. This is also called rounding out the dimension tables.
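
The following is a sketch of storing a pre-calculated (derived) measure in the fact table, as suggested above for the gross pay/net pay example: net_pay is computed once at load time rather than at query time. The payroll figures and key values are invented for illustration.

```python
def build_payroll_fact_row(employee_sk: int, period_sk: int,
                           gross_pay: float, deductions: float) -> dict:
    """Build one fact row; all keys are surrogate keys into dimension tables."""
    return {
        "employee_sk": employee_sk,         # surrogate key, not the natural key
        "period_sk": period_sk,
        "gross_pay": gross_pay,             # fully additive measure
        "deductions": deductions,           # fully additive measure
        "net_pay": gross_pay - deductions,  # derived, stored for query speed
    }

fact_rows = [
    build_payroll_fact_row(1, 202401, 3000.0, 750.0),
    build_payroll_fact_row(2, 202401, 2500.0, 500.0),
]
# Queries can now sum net_pay directly instead of recomputing it row by row.
print(sum(r["net_pay"] for r in fact_rows))
```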

22 Examples Transaction entities: Order, Order Item, and Stock Level. Component entities: Customer, Employee, RetailOutlet, Product, Warehouse, and Delivery Method. There are also 31 classification entities.

23 I/O Performance in Data Warehouses Input/output (I/O) performance should always be a key consideration for data warehouse designers and administrators. The typical workload in a data warehouse is especially I/O intensive, with operations such as large data loads and index builds, creation of materialized views, and queries over large volumes of data. The underlying I/O system for a data warehouse should be designed to meet these heavy requirements. One of the major causes of data warehouse performance issues is poor I/O configuration. Database administrators who have previously managed other systems need to pay more attention to the I/O configuration for a data warehouse than they may have previously done for other environments. Although data warehouses usually require large storage systems, storage configurations should be chosen on the basis of I/O bandwidth. Every component of the I/O system should provide enough bandwidth, including the physical disks, the I/O channels, and the I/O adapters. (As a rule, at least 200 MB per second of I/O bandwidth per gigahertz of processing power will be needed.) When considering I/O in high-performance OLTP environments, the critical factor is often random I/Os per second; however, in data warehouses, the critical factor is often sequential I/O throughput. The sequential throughput is usually bounded by the number of active channels between the hosts and the disk arrays.

25 Performance of Sequential I/Os Unlike many OLTP databases whose throughput comprises many small I/Os, data warehouse drive arrays generally see random large I/Os spread across the devices. This type of throughput is known as multiuser sequential workload. Acceptable multiuser sequential throughput requires that large I/Os up to 1 megabyte in size be issued to disks. However, it is common for the host operating system, device drivers, or storage array to fracture these large I/Os into smaller I/Os. For example, default Linux configurations often fracture I/Os into smaller ones (up to 32 KB). This level of I/O fracturing can have a disastrous effect on the total throughput. Therefore, it is important that you use a version of Linux or UNIX with host bus adapters and drives capable of handling 128 KB I/Os or larger. A lot of attention is paid to file system disk fragmentation, but you must remember that in a database environment, I/O fracturing is at least as important.

26 Minimizing I/O Requests Intelligent partitioning can help you tune SQL statements to avoid unnecessary index and table scans (using partition pruning). In the query example in the slide, only the data for March, April, and May needs to be accessed. The unnecessary partitions are pruned, so only the partitions corresponding to March, April, and May are accessed. In this example, the partition pruning results in a two-times gain in performance, because three partitions are being scanned instead of six. In many cases, the actual gains from partition pruning can be much more dramatic. Consider a business query that examines data from one month in a partitioned table containing 36 months of historical data. Partition pruning works in conjunction with all other performance features. A query can take advantage of partition pruning while taking advantage of other features such as parallelism and indexing. You can also improve the performance of massive join operations when large amounts of data (for example, several million rows) are joined together by using partition-wise joins. Finally, partitioning data greatly improves the manageability of very large databases and dramatically reduces the time required for administrative tasks such as backup and restore. Granularity in a partitioning scheme can be easily changed by splitting or merging partitions. Thus, if a table's data is skewed to fill some partitions more than others, the ones that contain more data can be split to achieve a more even distribution.
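
The following is a sketch of the partition-pruning idea above: sales rows are stored in one partition per month, and a query for March through May scans only those partitions. The month labels and amounts are invented for illustration; a real database prunes partitions transparently inside the optimizer.

```python
# One partition per month; each partition holds (product, amount) rows.
partitions = {
    "2024-01": [("prodA", 10.0)], "2024-02": [("prodA", 12.0)],
    "2024-03": [("prodB", 7.0)],  "2024-04": [("prodB", 9.0)],
    "2024-05": [("prodA", 4.0)],  "2024-06": [("prodC", 3.0)],
}

def total_sales(partitioned_data, months_wanted):
    """Scan only the partitions that can contain matching rows."""
    pruned = {m: rows for m, rows in partitioned_data.items() if m in months_wanted}
    print(f"Scanning {len(pruned)} of {len(partitioned_data)} partitions")
    return sum(amount for rows in pruned.values() for _, amount in rows)

# Only the March, April, and May partitions are read; the rest are pruned.
print(total_sales(partitions, {"2024-03", "2024-04", "2024-05"}))
```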

27 Minimizing I/O Requests (continued) Bitmap indexes are ideally suited for data warehousing. In fact, bitmap indexes should be the most common type of index within a data warehouse. Most people who have used any sort of relational database are familiar with B-tree indexes. However, B-tree indexes rarely provide significant performance benefits for data warehouse queries and require large amounts of disk space; bitmap indexes, meanwhile, are often an order of magnitude smaller than B-tree indexes and are also much more effective for data warehouse queries. The advantages of using bitmap indexes are greatest for low-cardinality columns, that is, columns in which the number of distinct values is small compared to the number of rows in the table. A gender column, which has only two distinct values (male and female), is ideal for a bitmap index. However, data warehouse administrators can also choose to build bitmap indexes on columns with much higher cardinalities. Bitmap indexes specifically provide a mechanism for efficiently doing set-based logic. For example, consider the simple data warehouse query: How many of my customers live in New York, are between the ages of 30 and 40, and bill more than $100 per month? One way to process this query is to scan the entire table and examine each row against all three conditions.

28 Minimizing I/O Requests (continued) When base tables contain a large amount of data, it is an expensive and time-consuming process to compute the required aggregates or to compute joins between these tables. In such cases, queries can take minutes or even hours to return the answer. Because materialized views contain already precomputed aggregates and joins, the Oracle Database server employs an extremely powerful process called query rewrite to quickly answer the query using materialized views. One of the major benefits of creating and maintaining materialized views is the ability to take advantage of query rewrite, which transforms a SQL statement expressed in terms of tables or views into a statement accessing one or more materialized views that are defined on the detail tables. The transformation is transparent to the end user or application, requiring no intervention and no reference to the materialized view in the SQL statement. Because query rewrite is transparent, materialized views can be added or dropped just like indexes without invalidating the SQL in the application code.
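
The sketch below illustrates the query-rewrite idea using sqlite3 as a stand-in: a summary table plays the role of a materialized view, and the aggregate query is answered from it instead of from the detail table. In Oracle the rewrite is transparent to the application; here it is shown explicitly, and all table names and data are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (prod_id INTEGER, amount REAL);
INSERT INTO sales VALUES (10, 50.0), (10, 20.0), (11, 9.0), (11, 6.0);
-- Precomputed aggregate, refreshed when the detail data is loaded; this plays
-- the role of a materialized view in this sketch.
CREATE TABLE mv_sales_by_product AS
    SELECT prod_id, SUM(amount) AS total_amount FROM sales GROUP BY prod_id;
""")

detail_query = "SELECT prod_id, SUM(amount) FROM sales GROUP BY prod_id"
rewritten    = "SELECT prod_id, total_amount FROM mv_sales_by_product"

# Both statements return the same answer; the rewritten one avoids re-scanning
# and re-aggregating the (potentially huge) detail table.
print(con.execute(detail_query).fetchall())
print(con.execute(rewritten).fetchall())
```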

29 Minimizing I/O Requests (continued) A better way to evaluate this query is to apply set-based logic: Find the set of customers who live in New York, the set of customers between 30 and 40, and the set of customers who bill more than $100, and then do an intersection of those three sets. This is exactly the functionality provided by bitmap indexes, an efficient mechanism for doing set-based manipulations of data. Bitmap indexes are ideal for a wide range of data warehouse queries. The star transformation is a powerful optimization technique that relies upon implicitly rewriting (or transforming) the SQL of the original star query. The end user never needs to know any of the details about the star transformation. Oracle's query optimizer automatically chooses the star transformation where appropriate. A prerequisite of the star transformation is that there be a single-column bitmap index on every join column of the fact table. These join columns include all foreign key columns. The star transformation is a query transformation aimed at executing star queries efficiently. Oracle processes a star query using two basic phases. The first phase retrieves exactly the necessary rows from the fact table (the result set). Because this retrieval utilizes bitmap indexes, it is very efficient. The second phase joins this result set to the dimension tables.
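
The following is a sketch of the set-based logic that bitmap indexes provide for the query above (customers in New York, aged 30 to 40, billing over $100), using Python integers as bitmaps: bit i is set when customer row i satisfies a predicate, and the intersection of the three sets is a single bitwise AND. The customer rows are invented for illustration.

```python
customers = [
    {"city": "New York", "age": 35, "bill": 150},   # row 0
    {"city": "Boston",   "age": 32, "bill": 200},   # row 1
    {"city": "New York", "age": 51, "bill": 120},   # row 2
    {"city": "New York", "age": 30, "bill": 180},   # row 3
]

def build_bitmap(rows, predicate):
    """One bitmap per predicate: set bit i if row i matches."""
    bitmap = 0
    for i, row in enumerate(rows):
        if predicate(row):
            bitmap |= 1 << i
    return bitmap

in_new_york = build_bitmap(customers, lambda r: r["city"] == "New York")
aged_30_40  = build_bitmap(customers, lambda r: 30 <= r["age"] <= 40)
bills_100   = build_bitmap(customers, lambda r: r["bill"] > 100)

# Intersecting the three sets is a single bitwise AND over the bitmaps.
result = in_new_york & aged_30_40 & bills_100
matching_rows = [i for i in range(len(customers)) if result & (1 << i)]
print("Matching customer rows:", matching_rows)   # rows 0 and 3
print("Count:", bin(result).count("1"))
```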

30 We have seen some of these steps when deriving dimensional models from an ER model earlier. Step 1: Choosing the process. The process (function) refers to the subject matter of a particular data mart. The first data mart built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions. Step 2: Choosing the grain. Deciding what a record of the fact table is to represent is critical. The dimensions of the fact table must be identified; the grain decision for the fact table also determines the grain of each dimension table. Also include time as a core dimension, which is always present in star schemas. Step 3: Identifying and conforming the dimensions. Dimensions set the context for asking questions about the facts in the fact table. If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other, if drilling across the data marts is desired. A dimension used in more than one data mart is referred to as being conformed. To provide fast access and intuitive "drill down" capabilities to data originating from multiple operational systems, it is often necessary to replicate dimensional data in data warehouses and in data marts. Examples of obvious conformed dimensions include Customer, Location, Organization, Time, and Product. Step 4: Choosing the facts. The grain of the fact table determines which facts can be used in the data mart. Facts should be numeric and additive. Unusable facts include (1) non-numeric facts, (2) non-additive facts, and (3) facts at a different granularity from the other facts in the table.
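
The sketch below illustrates the conformed-dimension rule from Step 3: a dimension shared by two data marts must be exactly the same or one must be a subset of the other. The dimension contents are invented for illustration.

```python
def is_conformed(dim_a: set, dim_b: set) -> bool:
    """True if the two dimension versions are identical or one is a subset of the other."""
    return dim_a <= dim_b or dim_b <= dim_a

# Each dimension version is a set of (natural key, description) rows.
sales_mart_product     = {("x001", "Optical Disk"), ("x002", "Marking Pen")}
returns_mart_product   = {("x001", "Optical Disk")}   # subset: conformed
inventory_mart_product = {("x001", "Floppy Disk")}    # conflicting description: not conformed

print(is_conformed(sales_mart_product, returns_mart_product))    # True
print(is_conformed(sales_mart_product, inventory_mart_product))  # False
```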

31 Step 5: Storing pre-calculations in the fact table. Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations, i.e. summary tables and materialized views. Step 6: Rounding out the dimension tables. Text descriptions are added to the dimension tables and should be as intuitive and understandable to the users as possible. The usefulness of a data mart is determined by the scope and nature of the attributes of the dimension tables. Step 7: Choosing the duration of the database. Duration measures how far back in time the fact table goes. Very large fact tables raise at least two very significant data warehouse design issues: it is often difficult to source increasingly old data from OLTP systems, and it is mandatory that the old versions of the important dimensions be used, not the most current versions (remember the slowly changing dimensions problem). Step 8: Tracking slowly changing dimensions. The slowly changing dimension problem means that the proper description of the old dimension data must be used with the old fact data. Often, a generalized key must be assigned to important dimensions in order to distinguish multiple snapshots of dimensions over a period of time. Step 9: Deciding the query priorities and the query modes. The most critical physical design issues affecting the end user's perception include the physical sort order of the fact table on disk and the presence of pre-stored summaries or aggregations. Additional physical design issues include administration, backup, indexing performance, and security.


More information

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2)

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2) SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay Lecture #10 Process Modelling DFD, Function Decomp (Part 2) Let us continue with the data modeling topic. So far we have seen

More information

Processing of Very Large Data

Processing of Very Large Data Processing of Very Large Data Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first

More information

Overview. DW Performance Optimization. Aggregates. Aggregate Use Example

Overview. DW Performance Optimization. Aggregates. Aggregate Use Example Overview DW Performance Optimization Choosing aggregates Maintaining views Bitmapped indices Other optimization issues Original slides were written by Torben Bach Pedersen Aalborg University 07 - DWML

More information

CHAPTER 3 Implementation of Data warehouse in Data Mining

CHAPTER 3 Implementation of Data warehouse in Data Mining CHAPTER 3 Implementation of Data warehouse in Data Mining 3.1 Introduction to Data Warehousing A data warehouse is storage of convenient, consistent, complete and consolidated data, which is collected

More information

Teradata Analyst Pack More Power to Analyze and Tune Your Data Warehouse for Optimal Performance

Teradata Analyst Pack More Power to Analyze and Tune Your Data Warehouse for Optimal Performance Data Warehousing > Tools & Utilities Teradata Analyst Pack More Power to Analyze and Tune Your Data Warehouse for Optimal Performance By: Rod Vandervort, Jeff Shelton, and Louis Burger Table of Contents

More information

The Evolution of Data Warehousing. Data Warehousing Concepts. The Evolution of Data Warehousing. The Evolution of Data Warehousing

The Evolution of Data Warehousing. Data Warehousing Concepts. The Evolution of Data Warehousing. The Evolution of Data Warehousing The Evolution of Data Warehousing Data Warehousing Concepts Since 1970s, organizations gained competitive advantage through systems that automate business processes to offer more efficient and cost-effective

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Data Warehousing and Decision Support

Data Warehousing and Decision Support Data Warehousing and Decision Support Chapter 23, Part A Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1 Introduction Increasingly, organizations are analyzing current and historical

More information

A Data Warehouse Implementation Using the Star Schema. For an outpatient hospital information system

A Data Warehouse Implementation Using the Star Schema. For an outpatient hospital information system A Data Warehouse Implementation Using the Star Schema For an outpatient hospital information system GurvinderKaurJosan Master of Computer Application,YMT College of Management Kharghar, Navi Mumbai ---------------------------------------------------------------------***----------------------------------------------------------------

More information

Two Success Stories - Optimised Real-Time Reporting with BI Apps

Two Success Stories - Optimised Real-Time Reporting with BI Apps Oracle Business Intelligence 11g Two Success Stories - Optimised Real-Time Reporting with BI Apps Antony Heljula October 2013 Peak Indicators Limited 2 Two Success Stories - Optimised Real-Time Reporting

More information

Oracle Database 10g: Introduction to SQL

Oracle Database 10g: Introduction to SQL ORACLE UNIVERSITY CONTACT US: 00 9714 390 9000 Oracle Database 10g: Introduction to SQL Duration: 5 Days What you will learn This course offers students an introduction to Oracle Database 10g database

More information

Data Warehousing and Decision Support

Data Warehousing and Decision Support Data Warehousing and Decision Support [R&G] Chapter 23, Part A CS 4320 1 Introduction Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business

More information

Data Vault Partitioning Strategies WHITE PAPER

Data Vault Partitioning Strategies WHITE PAPER Dani Schnider Data Vault ing Strategies WHITE PAPER Page 1 of 18 www.trivadis.com Date 09.02.2018 CONTENTS 1 Introduction... 3 2 Data Vault Modeling... 4 2.1 What is Data Vault Modeling? 4 2.2 Hubs, Links

More information

Data Warehousing. Seminar report. Submitted in partial fulfillment of the requirement for the award of degree Of Computer Science

Data Warehousing. Seminar report.  Submitted in partial fulfillment of the requirement for the award of degree Of Computer Science A Seminar report On Data Warehousing Submitted in partial fulfillment of the requirement for the award of degree Of Computer Science SUBMITTED TO: SUBMITTED BY: www.studymafia.org www.studymafia.org Preface

More information

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI Department of Information Technology IT6702 Data Warehousing & Data Mining Anna University 2 & 16 Mark Questions & Answers Year / Semester: IV / VII Regulation:

More information

Data Warehouse and Data Mining

Data Warehouse and Data Mining Data Warehouse and Data Mining Lecture No. 02 Introduction to Data Warehouse Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

CHAPTER 6 DATABASE MANAGEMENT SYSTEMS

CHAPTER 6 DATABASE MANAGEMENT SYSTEMS CHAPTER 6 DATABASE MANAGEMENT SYSTEMS Management Information Systems, 10 th edition, By Raymond McLeod, Jr. and George P. Schell 2007, Prentice Hall, Inc. 1 Learning Objectives Understand the hierarchy

More information

Analytics: Server Architect (Siebel 7.7)

Analytics: Server Architect (Siebel 7.7) Analytics: Server Architect (Siebel 7.7) Student Guide June 2005 Part # 10PO2-ASAS-07710 D44608GC10 Edition 1.0 D44917 Copyright 2005, 2006, Oracle. All rights reserved. Disclaimer This document contains

More information

Application software office packets, databases and data warehouses.

Application software office packets, databases and data warehouses. Introduction to Computer Systems (9) Application software office packets, databases and data warehouses. Piotr Mielecki Ph. D. http://www.wssk.wroc.pl/~mielecki piotr.mielecki@pwr.edu.pl pmielecki@gmail.com

More information

SMD149 - Operating Systems - File systems

SMD149 - Operating Systems - File systems SMD149 - Operating Systems - File systems Roland Parviainen November 21, 2005 1 / 59 Outline Overview Files, directories Data integrity Transaction based file systems 2 / 59 Files Overview Named collection

More information

Product Documentation SAP Business ByDesign August Analytics

Product Documentation SAP Business ByDesign August Analytics Product Documentation PUBLIC Analytics Table Of Contents 1 Analytics.... 5 2 Business Background... 6 2.1 Overview of Analytics... 6 2.2 Overview of Reports in SAP Business ByDesign... 12 2.3 Reports

More information

Essentials of Database Management

Essentials of Database Management Essentials of Database Management Jeffrey A. Hoffer University of Dayton Heikki Topi Bentley University V. Ramesh Indiana University PEARSON Boston Columbus Indianapolis New York San Francisco Upper Saddle

More information

MICROSOFT BUSINESS INTELLIGENCE

MICROSOFT BUSINESS INTELLIGENCE SSIS MICROSOFT BUSINESS INTELLIGENCE 1) Introduction to Integration Services Defining sql server integration services Exploring the need for migrating diverse Data the role of business intelligence (bi)

More information

MOC 20463C: Implementing a Data Warehouse with Microsoft SQL Server

MOC 20463C: Implementing a Data Warehouse with Microsoft SQL Server MOC 20463C: Implementing a Data Warehouse with Microsoft SQL Server Course Overview This course provides students with the knowledge and skills to implement a data warehouse with Microsoft SQL Server.

More information

Hierarchies in a multidimensional model: From conceptual modeling to logical representation

Hierarchies in a multidimensional model: From conceptual modeling to logical representation Data & Knowledge Engineering 59 (2006) 348 377 www.elsevier.com/locate/datak Hierarchies in a multidimensional model: From conceptual modeling to logical representation E. Malinowski *, E. Zimányi Department

More information

DATA STRUCTURES USING C

DATA STRUCTURES USING C DATA STRUCTURES USING C File Management Chapter 9 2 File Concept Contiguous logical address space Types: Data numeric character binary Program 3 File Attributes Name the only information kept in human-readable

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

Data Vault Brisbane User Group

Data Vault Brisbane User Group Data Vault Brisbane User Group 26-02-2013 Agenda Introductions A brief introduction to Data Vault Creating a Data Vault based Data Warehouse Comparisons with 3NF/Kimball When is it good for you? Examples

More information

Complete. The. Reference. Christopher Adamson. Mc Grauu. LlLIJBB. New York Chicago. San Francisco Lisbon London Madrid Mexico City

Complete. The. Reference. Christopher Adamson. Mc Grauu. LlLIJBB. New York Chicago. San Francisco Lisbon London Madrid Mexico City The Complete Reference Christopher Adamson Mc Grauu LlLIJBB New York Chicago San Francisco Lisbon London Madrid Mexico City Milan New Delhi San Juan Seoul Singapore Sydney Toronto Contents Acknowledgments

More information

Data Warehouse and Mining

Data Warehouse and Mining Data Warehouse and Mining 1. is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions. A. Data Mining. B. Data Warehousing. C. Web Mining. D. Text

More information

The strategic advantage of OLAP and multidimensional analysis

The strategic advantage of OLAP and multidimensional analysis IBM Software Business Analytics Cognos Enterprise The strategic advantage of OLAP and multidimensional analysis 2 The strategic advantage of OLAP and multidimensional analysis Overview Online analytical

More information

Physical Design. Elena Baralis, Silvia Chiusano Politecnico di Torino. Phases of database design D B M G. Database Management Systems. Pag.

Physical Design. Elena Baralis, Silvia Chiusano Politecnico di Torino. Phases of database design D B M G. Database Management Systems. Pag. Physical Design D B M G 1 Phases of database design Application requirements Conceptual design Conceptual schema Logical design ER or UML Relational tables Logical schema Physical design Physical schema

More information

CTL.SC4x Technology and Systems

CTL.SC4x Technology and Systems in Supply Chain Management CTL.SC4x Technology and Systems Key Concepts Document This document contains the Key Concepts for the SC4x course, Weeks 1 and 2. These are meant to complement, not replace,

More information