Data Warehouses: Logical Design of a Multidimensional Database Using a Tabular Schema

Star Schema

In a star schema, each dimension table has a single-part primary key that links to one part of the multipart primary key in the fact table.

Diagram (Star Schema):
  Dimension 1: d1_key1, Att1, Att2
  Dimension 2: d2_key1
  Dimension 3: d3_key1
  Dimension 4: d4_key1
  Fact Table: d1_key1, d2_key1, d3_key1, d4_key1, fact1, fact2
Dimension tables are mainly descriptive and textual. The fact table is mainly numeric and additive.

Star Schema Example 1

  Time Dimension: Day_of_week, Day_number_of_month, Week_number_in_year, Month, Quarter, Year, Holiday_flag, Weekday_flag
  Fact Table: Product_key, Store_key, Dollars_sold, Units_sold
  Product Dimension: Product_key, Description, Brand, Category
  Store Dimension: Store_key, Store_name, Address, Floor_plan_type
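The key structure described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (product_dim, store_dim, time_dim, sales_fact), the time_key column and all sample values are hypothetical and do not come from the slides, and only a few of the time attributes are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,      -- single-part dimension key
    description TEXT, brand TEXT, category TEXT
);
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY,
    store_name TEXT, address TEXT, floor_plan_type TEXT
);
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,         -- assumed key column, not listed on the slide
    day_of_week TEXT, month INTEGER, quarter TEXT, year INTEGER
);
CREATE TABLE sales_fact (
    time_key     INTEGER NOT NULL REFERENCES time_dim(time_key),
    product_key  INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key    INTEGER NOT NULL REFERENCES store_dim(store_key),
    dollars_sold REAL,
    units_sold   INTEGER,
    -- multipart primary key: one part per dimension
    PRIMARY KEY (time_key, product_key, store_key)
);
""")

conn.execute("INSERT INTO product_dim VALUES (1, 'Cola 330ml', 'Acme', 'Drinks')")
conn.execute("INSERT INTO store_dim VALUES (1, 'Downtown', '1 Main St', 'A')")
conn.execute("INSERT INTO time_dim VALUES (1, 'Monday', 1, 'Q1', 1995)")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 12.5, 10)")

def try_insert(row):
    """Return True if the fact row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

orphan_ok = try_insert((1, 99, 1, 5.0, 2))     # product_key 99 has no dimension row
duplicate_ok = try_insert((1, 1, 1, 7.0, 3))   # key combination already present
fact_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

Both extra inserts should be rejected: the first because one part of the fact key has no matching dimension row, the second because the combination of dimension keys already identifies an existing fact row.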
Star Schema Example 2 [diagram]

Star Schema Example 3 [diagram]

Reminder: Normal Forms

Normalization seeks to eliminate data redundancy: a transaction that changes any data needs to touch the database in only one place (optimized for updates).

The Standard Template Query

  SELECT p.brand, SUM(f.dollars), SUM(f.units)
  FROM sales f, product p, time t
  WHERE f.product_key = p.product_key
    AND f.time_key = t.time_key
    AND t.quarter = '1 Q 1995'
  GROUP BY p.brand
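To see the template query run, here is a minimal sketch using Python's sqlite3 module. The table and column names follow the query above; the sample rows, the second quarter label and the added ORDER BY (used only to make the output order deterministic) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE time (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE sales (product_key INTEGER, time_key INTEGER, dollars REAL, units INTEGER);

-- Hypothetical sample data
INSERT INTO product VALUES (1, 'Acme'), (2, 'Zest');
INSERT INTO time VALUES (1, '1 Q 1995'), (2, '2 Q 1995');
INSERT INTO sales VALUES (1, 1, 10.0, 2), (1, 1, 5.0, 1), (2, 1, 7.5, 3), (1, 2, 100.0, 9);
""")

# The template: join the fact table to each dimension it needs,
# constrain on dimension attributes, group by dimension attributes,
# and aggregate the additive facts.
template_query = """
SELECT p.brand, SUM(f.dollars), SUM(f.units)
FROM sales f, product p, time t
WHERE f.product_key = p.product_key
  AND f.time_key = t.time_key
  AND t.quarter = ?
GROUP BY p.brand
ORDER BY p.brand
"""

rows = conn.execute(template_query, ("1 Q 1995",)).fetchall()
print(rows)  # [('Acme', 15.0, 3), ('Zest', 7.5, 3)]
```

The sale in the second quarter is filtered out by the constraint on the time dimension, so only first-quarter facts contribute to the sums.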
On the other hand, normalization has drawbacks for query access:

1. Query specification is complex. Without normalization the schema is much clearer to the user (simple query structures).
2. Poor access efficiency. A normalized design is by far the worst for most query access. It is optimized for key-based, record-at-a-time inquiry, or for table-level queries that efficiently use the provided indexes.

Resisting Normalization

1. Eliminate redundancy? Eliminating duplicate rows is generally good. However, eliminating "redundant" attributes in a star schema dimension table actually destroys its high access efficiency. Time saving (browsing performance) is much more critical in a data warehouse.
2. Save space? This corollary to eliminating redundancy is a holdover from another era. The relative impact of storage on cost is way down, and the loss of access efficiency has a far greater cost impact. Furthermore, the fact table in a dimensional schema is naturally highly normalized. Disk space saving due to normalization is typically less than 1%.
3. Support efficient update? This does not apply at all. A data warehouse is nonvolatile: there are no updates of data, only data loading. The load methods for relational tables in a star schema design can actually be more efficient than a load of normalized transaction data and snowflaked reference data.

Diagram (ER - BCNF):
  Division: Division_id, Division_desc
  Dept: Dept_desc, Division_id
  Region: Region_id, Region_desc
  Market: Market_desc, Region_id
  Facts: Week_id

Why Does Normalization of Dimensions Not Save Space?

A typical example:
  Fact table data size:          30GB
  Fact table index size:         20GB
  Largest dimension table size:  0.1GB
  Savings by normalization:      0.05GB
  Total size before:             51GB
  Total size after:              50.5GB
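The "less than 1%" claim can be checked against the example figures with a few lines of arithmetic. This is a rough sketch that takes the numbers at face value; it computes the relative saving twice, once from the two stated totals and once from the stated saving of 0.05GB, since those two figures do not describe the same difference.

```python
# Figures copied from the example above (in GB).
total_before_gb = 51.0
total_after_gb = 50.5
stated_saving_gb = 0.05

# Relative saving implied by the difference between the two totals.
saving_from_totals_pct = 100 * (total_before_gb - total_after_gb) / total_before_gb

# Relative saving implied by the stated saving figure.
saving_from_stated_pct = 100 * stated_saving_gb / total_before_gb

print(f"from totals: {saving_from_totals_pct:.2f}%")
print(f"from stated saving: {saving_from_stated_pct:.2f}%")
```

Either way the result stays below 1% of the total, because the fact table and its index dominate the storage and normalization only touches the comparatively tiny dimension tables.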
Diagram (Dimensional, Denormalized):
  Dept. Lookup: Dept_desc, Division_desc
  Market Lookup: Market_desc, Region_desc
  Facts: Week_id

Snowflake Schema

In a snowflake schema, one or more dimension tables are decomposed into multiple tables, with the subordinate dimension tables joined to a primary dimension table instead of to the fact table. In other words, it is a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

Diagram (Snowflake Schema, Large Hierarchy):
  Customer: Name
  Demographic: Income_Level, Age_Level, Sex
  Fact: amount
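The difference between a snowflaked dimension and its flattened star counterpart can be sketched as follows, again with Python's sqlite3 module. The dept_id and amount columns, the table names and all sample rows are hypothetical additions for illustration; the diagrams only show the description and ID attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked: the department table points to a separate division table.
CREATE TABLE division (division_id INTEGER PRIMARY KEY, division_desc TEXT);
CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_desc TEXT,
                   division_id INTEGER REFERENCES division(division_id));

-- Star: one flattened lookup table repeats the division description per department.
CREATE TABLE dept_lookup (dept_id INTEGER PRIMARY KEY, dept_desc TEXT, division_desc TEXT);

CREATE TABLE facts (week_id INTEGER, dept_id INTEGER, amount REAL);

INSERT INTO division VALUES (1, 'Food'), (2, 'Home');
INSERT INTO dept VALUES (10, 'Bakery', 1), (11, 'Dairy', 1), (20, 'Garden', 2);
INSERT INTO dept_lookup VALUES (10, 'Bakery', 'Food'), (11, 'Dairy', 'Food'),
                               (20, 'Garden', 'Home');
INSERT INTO facts VALUES (1, 10, 100.0), (1, 11, 50.0), (1, 20, 30.0), (2, 10, 20.0);
""")

# Snowflake: two joins are needed to reach the division description.
snowflake_sql = """
SELECT v.division_desc, SUM(f.amount)
FROM facts f
JOIN dept d ON f.dept_id = d.dept_id
JOIN division v ON d.division_id = v.division_id
GROUP BY v.division_desc
ORDER BY v.division_desc
"""

# Star: a single join to the flattened lookup table is enough.
star_sql = """
SELECT d.division_desc, SUM(f.amount)
FROM facts f
JOIN dept_lookup d ON f.dept_id = d.dept_id
GROUP BY d.division_desc
ORDER BY d.division_desc
"""

snowflake_rows = conn.execute(snowflake_sql).fetchall()
star_rows = conn.execute(star_sql).fetchall()
print(snowflake_rows == star_rows)  # True
```

Both layouts answer the same question with the same result. The snowflake stores 'Food' once but needs an extra join per hierarchy level, while the star repeats 'Food' in every department row and reaches it in one join.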
Diagram (Mini-Dimension):
  Customer: Name
  Demographic: Income_Level, Age_Level, Sex
  Fact: amount

Star Schemas or Snowflake Schemas?

Both star and snowflake schemas can represent the same dimensional models; the difference is in their RDBMS implementations.

Snowflake schemas support ease of dimension maintenance because they are more normalized. Star schemas are easier for direct user access and often support simpler and more efficient queries.

The decision to model a dimension as a star or a snowflake depends on the nature of the dimension itself, such as how frequently it changes and which of its elements change, and often involves evaluating tradeoffs between ease of use and ease of maintenance.

In most designs, star schemas are preferable to snowflake schemas because they involve fewer joins for information retrieval.

Surrogate Keys

A surrogate key is the primary key for a dimension table and is independent of any keys provided by source data systems. Surrogate keys are created and maintained in the data warehouse and should not encode any information about the contents of records; automatically increasing integers make good surrogate keys. The original key for each record may be carried in the dimension table but is not used as the primary key.

Benefits: a layer of isolation between the data warehouse and the source systems; simple numeric keys; the ability to handle ambiguous IDs.
Drawback: increased ETL processing.

Dimension Keys Using Original Operational Keys

Benefit: reduced transformation effort.
Drawbacks:
  Compound and textual keys.
  Dependency on the source systems (OLTP). For instance, what happens if the operational system creates a new key when a customer changes address, while we don't want to create a new customer?
  Ambiguous IDs coming from different sources: multiple application systems; worldwide companies with many branches, where each branch uses its own customer numbering; companies that have gone through mergers or acquisitions.
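The way a surrogate key isolates the warehouse from ambiguous operational IDs can be sketched as below. Everything here is a hypothetical illustration: the customer_dim layout, the source_system and source_id columns, the lookup_or_create helper and the branch names are assumptions, not something the slides prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customer_dim (
    customer_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, carries no meaning
    source_system TEXT NOT NULL,   -- which operational system supplied the record
    source_id     TEXT NOT NULL,   -- original operational key, kept only as an attribute
    name          TEXT,
    UNIQUE (source_system, source_id)
)
""")

def lookup_or_create(source_system, source_id, name):
    """Return the surrogate key for an operational key, creating a row if it is unseen."""
    row = conn.execute(
        "SELECT customer_key FROM customer_dim "
        "WHERE source_system = ? AND source_id = ?",
        (source_system, source_id),
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO customer_dim (source_system, source_id, name) VALUES (?, ?, ?)",
        (source_system, source_id, name),
    )
    return cur.lastrowid

k1 = lookup_or_create("branch_A", "1001", "Dana")
k2 = lookup_or_create("branch_B", "1001", "Omar")  # same operational ID, different customer
k3 = lookup_or_create("branch_A", "1001", "Dana")  # repeated load of the first customer
row_count = conn.execute("SELECT COUNT(*) FROM customer_dim").fetchone()[0]
```

The two branches both use ID 1001, yet the warehouse assigns them distinct surrogate keys, and reloading the same customer returns the existing key. The lookup step is also where the extra ETL work mentioned as the drawback comes from.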
Time/Date Dimension

For hourly time granularity, the hour breakdown can be incorporated into the date dimension or placed in a separate dimension. Business needs influence this design decision.

If the main use is to extract contiguous chunks of time that cross day boundaries (for example 11/24/2000 10 p.m. to 11/25/2000 6 a.m.), then it is easier if the hour and day are in the same dimension. However, it is easier to analyze cyclical and recurring daily events if they are in separate dimensions.

Unless there is a clear reason to combine date and hour in a single dimension, it is generally better to keep them in separate dimensions!

Time/Date Dimension (continued)

A date dimension with one record per day will suffice if users do not need time granularity finer than a single day. A date-by-day dimension table will contain 365 records per year (366 in leap years).

A separate time dimension table should be constructed if a fine time granularity, such as minute or second, is needed. A time dimension table of one-minute granularity will contain 1,440 rows for a day, and a table of seconds will contain 86,400 rows for a day. If the exact event time is needed, it should be stored in the fact table.

When a separate time dimension is used, the fact table contains one foreign key for the date dimension and another for the time dimension. Separate date and time dimensions simplify many filtering operations. For example, summarizing data for a range of days requires joining only the date dimension table to the fact table. Analyzing cyclical data by time period within a day requires joining just the time dimension table. The date and time dimension tables can both be joined to the fact table when a specific time range is needed.
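The row counts quoted above are easy to reproduce. The following sketch generates a date dimension with one row per day and a time dimension with one row per minute; the function names, column names and the choice of attributes are assumptions made for illustration.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def build_date_dimension(year):
    """Return one row per calendar day of the given year."""
    rows = []
    day = date(year, 1, 1)
    key = 1  # plain running integer used as the surrogate key
    while day.year == year:
        rows.append({
            "date_key": key,
            "date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "weekday_flag": day.weekday() < 5,  # Monday to Friday
        })
        day += timedelta(days=1)
        key += 1
    return rows

def build_time_dimension():
    """Return one row per minute of a day."""
    return [
        {"time_key": hour * 60 + minute + 1, "hour": hour, "minute": minute}
        for hour in range(24)
        for minute in range(60)
    ]

date_dim_2000 = build_date_dimension(2000)   # leap year
date_dim_2001 = build_date_dimension(2001)
time_dim = build_time_dimension()

# A recurring daily window (10 p.m. to 6 a.m.) is a filter on the time dimension alone.
night_rows = [r for r in time_dim if r["hour"] >= 22 or r["hour"] < 6]

print(len(date_dim_2000), len(date_dim_2001), len(time_dim), len(night_rows))
# 366 365 1440 480
```

Keeping the two dimensions separate keeps both tables small: a combined date-and-minute dimension would need 1,440 rows for every day it covers, whereas the separate time dimension stays at 1,440 rows regardless of how many years the date dimension spans.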