Introduction to Data Warehousing and Business Intelligence

Overview
- Why Business Intelligence?
- Data analysis problems
- Data Warehouse (DW) introduction
- A tour of the coming DW lectures
- DW Applications

Loosely covers [Jarke et al.] chapter 1
Original slides were written by Torben Bach Pedersen
Aalborg University 2007 - DWML course

What is Business Intelligence (BI)?
- BI is different from Artificial Intelligence (AI)
  - AI systems make decisions for the users
  - BI systems help the users make the right decisions, based on available data
- Combination of technologies
  - Data Warehousing (DW)
  - On-Line Analytical Processing (OLAP)
  - Data Mining (DM)
  - Data Visualization (VIS)
  - Decision Analysis (what-if)
  - Customer Relationship Management (CRM)

BI Is Important
- Worldwide BI revenue in 2005 = US$ 5.7 billion
  - 10% growth each year
- The Web makes BI more necessary
  - Customers do not appear physically in the store
  - Customers can change to other stores more easily
- Thus: Know your customers using data and BI!
  - Utilize Web logs, analyze customer behavior in more detail than before (e.g., what was not bought?)
  - Combine web data with traditional customer data

Data Analysis Problems
- The same data found in many different systems
  - Example: customer data across different departments
  - The same concept is defined differently
- Heterogeneous sources
  - Relational DBMS, On-Line Transaction Processing (OLTP)
  - Unstructured data in files (e.g., MS Excel) and documents (e.g., MS Word)
- Data is suited for operational systems
  - Accounting, billing, etc.
  - Do not support analysis across business functions
- Data quality is bad
  - Missing data, imprecise data, different use of systems
- Data are volatile
  - Data deleted in operational systems (6 months)
  - Data change over time: no historical information

Data Warehousing
- Solution: new analysis environment (DW) where data are
  - Subject oriented (versus function oriented)
  - Integrated (logically and physically)
  - Time variant (data can always be related to time)
  - Stable (data not deleted, several versions)
  - Supporting management decisions (different organization)
- A good DW is a prerequisite for successful BI
- Getting multidimensional data into the DW: data from the operational systems are
  - Extracted
  - Cleansed
  - Transformed
  - Aggregated?
  - Loaded into DW

DW: Purpose and Definition
- DW is a store of information organized in a unified data model
- Data collected from a number of different sources
  - Finance, billing, web logs, personnel, ...
- The purpose of a data warehouse (DW) is to support decision making
- Easy to perform advanced analyses
  - Ad-hoc analyses and reports
  - Data mining: discovery of hidden patterns and trends

DW Architecture: Data as Materialized Views
[Figure: Existing databases and systems (OLTP); (Global) Data Warehouse; (Local) Data Marts; New databases and systems (OLAP); applications: OLAP, Data mining, Visualization]
- Analogy: suppliers, supermarket, customers
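
The path listed under "Data Warehousing" (extracted, cleansed, transformed, loaded) can be illustrated with a very small sketch. Everything concrete here is an assumption made for illustration: the two source systems, their field names, the sample records, the `UNKNOWN` default and the in-memory dictionary standing in for the DW. It only shows how the same customer concept, defined differently in two departments, might be merged into one unified record.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical operational sources: the same customer concept is stored
# with different field names and formats in two departments.
billing_rows = [
    {"cust_no": "3302", "name": "ANNA JENSEN", "country": "DK"},
    {"cust_no": "3303", "name": "bo larsen", "country": None},
]
web_rows = [
    {"customer_id": 3302, "email": "anna@example.org"},
    {"customer_id": 3304, "email": ""},
]


@dataclass
class Customer:
    """Unified (integrated) customer record used in the sketch DW."""
    customer_id: int
    name: Optional[str]
    country: str
    email: Optional[str]


def extract():
    """Extract: read the raw rows from each source system."""
    return billing_rows, web_rows


def cleanse(value, default):
    """Cleanse: replace a missing or empty value with an explicit default."""
    return value if value not in (None, "") else default


def transform(billing, web):
    """Transform: merge both sources into one record per customer id."""
    merged = {}
    for row in billing:
        cid = int(row["cust_no"])
        merged[cid] = Customer(
            cid, row["name"].title(), cleanse(row["country"], "UNKNOWN"), None
        )
    for row in web:
        cid = row["customer_id"]
        email = cleanse(row["email"], None)
        if cid in merged:
            merged[cid].email = email
        else:
            merged[cid] = Customer(cid, None, "UNKNOWN", email)
    return merged


def load(warehouse, customers):
    """Load: add the unified records to the (in-memory) warehouse."""
    warehouse.update(customers)


warehouse = {}
load(warehouse, transform(*extract()))
for customer in warehouse.values():
    print(customer)
```

A real ETL flow would work in bulk against a staging area rather than row by row in memory; the sketch only separates the steps so each one is visible.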

Quick review of normalized database
Original table (column values as listed on the slide):
  Customer ID: 3302, 3303
  Product:     Beer, Rice, Beer, Wheat
  Category:    Beverage, Beverage
  Price:       6.00, 4.00, 6.00, 5.00
  Date:        05-02-2007, 07-02-2007
Normalized, table 1 (sales):
  Customer ID: 3302, 3303
  ProductID:   013, 052, 013, 067
  Date:        05-02-2007, 07-02-2007
Normalized, table 2 (products):
  ProductID:   013, 052, 067
  Product:     Beer, Rice, Wheat
  Category:    Beverage
  Price:       6.00, 4.00, 5.00
- Normalized database avoids
  - Redundant data
  - Modification anomalies
- How to get the original table? (join them)
- No redundancy in OLTP, controlled redundancy in OLAP

OLTP vs. OLAP
                          OLTP                  OLAP
  Target                  operational needs     business analysis
  Data                    small, operational    large, historical
  Model                   normalized            denormalized/multidimensional
  Query language          SQL                   not unified
  Queries                 small                 large
  Updates                 frequent and small    infrequent and batch
  Transactional recovery  necessary             not necessary
  Optimized for           update operations     query operations

Queries hard or infeasible for OLTP
- Business analysis
  - In the past five years, which product is the most profitable?
  - On which public holiday do we have the largest sales?
  - Do the sales of dairy products increase over time?
  - Is there any pattern (correlation) between the sales of beers and the sales of diapers?

Function- vs. Subject Orientation
[Figure labels: Function-oriented systems; DW (all subjects, integrated); Subject-oriented systems (selected subjects); Bus architecture]
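
For the normalized-database review above, the "join them" step can be sketched with Python's built-in `sqlite3` module. The table and column names are assumptions, and so is the pairing of customers and dates with products, because the row alignment is not recoverable from the slide. The categories of Rice and Wheat are left as NULL since the slide text does not give them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (
    product_id TEXT PRIMARY KEY,
    product    TEXT NOT NULL,
    category   TEXT,              -- NULL where the slide gives no value
    price      REAL NOT NULL
);
CREATE TABLE Sales (
    customer_id INTEGER NOT NULL,
    product_id  TEXT NOT NULL REFERENCES Products(product_id),
    sale_date   TEXT NOT NULL
);
""")

# Each product (and its price) is stored exactly once: no redundancy.
conn.executemany("INSERT INTO Products VALUES (?, ?, ?, ?)", [
    ("013", "Beer", "Beverage", 6.00),
    ("052", "Rice", None, 4.00),
    ("067", "Wheat", None, 5.00),
])

# Assumed row alignment: two purchases per customer.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", [
    (3302, "013", "05-02-2007"),
    (3302, "052", "05-02-2007"),
    (3303, "013", "07-02-2007"),
    (3303, "067", "07-02-2007"),
])

# Joining the two normalized tables reconstructs the original wide table.
original = conn.execute("""
    SELECT s.customer_id, p.product, p.category, p.price, s.sale_date
    FROM Sales s
    JOIN Products p ON p.product_id = s.product_id
    ORDER BY s.customer_id, p.product_id
""").fetchall()

for row in original:
    print(row)
```

Changing the price of Beer now means updating one row in `Products`, which is the modification anomaly the normalized design avoids.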

n x m versus n + m
[Figure: several D-App nodes, annotated "inflexible, expensive"]

Top-down vs. Bottom-up
- Top-down:
  1. Design of DW
  2. Design of data marts (DMs)
- In-between:
  1. Design of DW for DM 1
  2. Design of DM 2 and integration with DW
  3. Design of DM 3 and integration with DW
  4. ...
- Bottom-up:
  1. Design of DMs
  2. Maybe integration of DMs in DW
  3. Maybe no DW

Multidimensional database design
- Motivation: Why not use ER model?
- Cubes: Dimensions, Facts, Measures
- OLAP queries
- Advanced multidimensional modeling
  - Mainly handling changes in dimensions
- MS SQL Server and Analysis Services

Cube Example
[Figure: bar chart of total Sales (axis 0 to 350) by Year (2000, 2001) and City (Aalborg, Copenhagen)]
- Text-based results are difficult for managers to understand
- Why Cube?
  - Good for visualization
  - Multidimensional, intuitive
  - Supports OLAP operations
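
The cube idea above (dimensions, facts, measures) can be sketched in a few lines of plain Python. The fact rows, the `product` dimension and all sales figures below are invented for illustration and are not the values from the slide's chart. The sketch aggregates one measure with SUM and shows two of the operations a cube supports: rolling up to less detail and slicing on one dimension value.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, city, product, sales).
# The first three fields are dimensions, the last one is the measure.
facts = [
    (2000, "Aalborg", "Beer", 40),
    (2000, "Aalborg", "Rice", 10),
    (2000, "Copenhagen", "Beer", 70),
    (2001, "Aalborg", "Beer", 55),
    (2001, "Copenhagen", "Rice", 30),
    (2001, "Copenhagen", "Beer", 95),
]
DIMENSIONS = ("year", "city", "product")


def build_cube(rows, dims):
    """Aggregate the sales measure (SUM) over the chosen dimensions."""
    positions = [DIMENSIONS.index(d) for d in dims]
    cube = defaultdict(int)
    for row in rows:
        cube[tuple(row[i] for i in positions)] += row[-1]
    return dict(cube)


def slice_rows(rows, dim, value):
    """Slice: keep only the facts where one dimension has a fixed value."""
    i = DIMENSIONS.index(dim)
    return [row for row in rows if row[i] == value]


by_year_city = build_cube(facts, ("year", "city"))   # starting level
by_year = build_cube(facts, ("year",))               # roll up: less detail
grand_total = build_cube(facts, ())                  # roll up to the top
year_2000 = build_cube(slice_rows(facts, "year", 2000), ("city",))  # slice

print(by_year_city)
print(by_year)
print(grand_total)
print(year_2000)
```

Drilling down is the reverse direction: calling `build_cube` with more dimensions, for example `("year", "city", "product")`.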

On-Line Analytical Processing (OLAP)
- On-Line Analytical Processing
  - Interactive analysis
  - Explorative discovery
  - Fast response times required
- OLAP operations
  - Aggregation, e.g., SUM
  - Starting level, (Year, City)
  - Roll Up: Less detail
  - Drill Down: More detail
  - Slice/Dice: Selection, Year=2000
[Figure: OLAP example with cell values 102, 250, 20, 25, 70, 57; label "All Time"]

Performance Optimization
- Performance optimization
  - Fine tune performance for important queries
  - Aggregates, indexing, other optimizations (environment, partitioning)
- Using aggregates
  - How can aggregates improve performance?
- Choosing aggregates
  - Which aggregates should we materialize?
- Maintaining views
  - How do we keep the (aggregate) views up to date?
- Bitmapped indices

Materialization Example
- Imagine 1 billion sales rows, 1000 products, 100 locations

    CREATE VIEW TotalSales (pid, locid, total) AS
      SELECT s.pid, s.locid, SUM(s.sales)
      FROM Sales s
      GROUP BY s.pid, s.locid;

- The materialized view has at most 1000 x 100 = 100,000 rows
- Rewrite the query to use the view

    SELECT p.category, SUM(s.sales)
    FROM Products p, Sales s
    WHERE p.pid = s.pid
    GROUP BY p.category;

- Rewritten to

    SELECT p.category, SUM(t.total)
    FROM Products p, TotalSales t
    WHERE p.pid = t.pid
    GROUP BY p.category;

- The rewritten query reads 100,000 rows instead of 1 billion, so it can become up to about 10,000 times faster!

Extract, Transform, Load (ETL)
- Getting multidimensional data into the DW
  - Extract
  - Transformations / cleansing
  - Load
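
Returning to the Materialization Example above, the query rewrite can be checked on a small scale with Python's built-in `sqlite3` module. Two assumptions apply: SQLite has no materialized views, so the sketch stores the aggregate with `CREATE TABLE ... AS SELECT` as a stand-in, and the data (20 products, 5 locations, 10,000 random sales rows, four made-up categories) is tiny compared with the slide's scenario. The point is only that both queries return the same totals while the rewritten one reads far fewer rows.

```python
import random
import sqlite3

random.seed(0)  # reproducible sample data

N_PRODUCTS, N_LOCATIONS, N_SALES = 20, 5, 10_000

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (pid INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE Sales (pid INTEGER, locid INTEGER, sales INTEGER);
""")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?)",
    [(pid, f"cat{pid % 4}") for pid in range(N_PRODUCTS)],
)
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [
        (random.randrange(N_PRODUCTS), random.randrange(N_LOCATIONS),
         random.randint(1, 100))
        for _ in range(N_SALES)
    ],
)

# Stand-in for the materialized view: the aggregate is computed once
# and stored as an ordinary table.
conn.execute("""
CREATE TABLE TotalSales AS
SELECT s.pid AS pid, s.locid AS locid, SUM(s.sales) AS total
FROM Sales s
GROUP BY s.pid, s.locid
""")

# Original query: scans every sales row.
direct = conn.execute("""
SELECT p.category, SUM(s.sales)
FROM Products p, Sales s
WHERE p.pid = s.pid
GROUP BY p.category
ORDER BY p.category
""").fetchall()

# Rewritten query: scans only the pre-aggregated rows.
rewritten = conn.execute("""
SELECT p.category, SUM(t.total)
FROM Products p, TotalSales t
WHERE p.pid = t.pid
GROUP BY p.category
ORDER BY p.category
""").fetchall()

view_rows = conn.execute("SELECT COUNT(*) FROM TotalSales").fetchone()[0]
print("rows in Sales:", N_SALES, "rows in TotalSales:", view_rows)
print("same result:", direct == rewritten)
```

The rewrite is valid because SUM can be computed from partial sums. The stored table is not refreshed automatically when `Sales` changes, which is the "maintaining views" question raised on the Performance Optimization slide.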

Data's Way To The DW
- Extraction
  - Extract from many heterogeneous systems
- Staging area
  - Large, sequential bulk operations => flat files best?
- Cleansing
  - Data checked for missing parts and erroneous values
  - Default values provided and out-of-range values marked
- Transformation
  - Data transformed to decision-oriented format
  - Data from several sources merged, optimized for querying
- Aggregation?
  - Are individual business transactions needed in the DW?
- Loading into DW
  - Large bulk loads rather than SQL INSERTs
  - Fast indexing (and pre-aggregation) required

DW Applications: Visualization
- Graphical presentation of complex results
- Color, size, and form help to give a better overview

DW Applications: Data Mining
- Data mining is automatic knowledge discovery
  - Roots in AI and statistics
- Classification
  - Partition data into pre-defined classes
- Prediction
  - Predict/estimate unknown value based on similar cases
- Clustering
  - Partition data into groups so that the similarity within individual groups is greatest and the similarity between groups is smallest
- Association rule
  - Find associations/dependencies between data that occur together
  - Rules: A -> B (c%, s%): if A occurs, B occurs with confidence c and support s
- Important to choose the granularity for mining
  - No useful results at too small granularity (shirt brand, ...)

Data Mining Examples
- Wal-Mart: USA's largest supermarket chain
  - Has DW with all ticket item sales for the last 5 years (huge!)
  - Uses DW and mining intensively to gain business advantages
  - Analysis of association within sales tickets
  - Discovery: Beer and diapers on the same ticket
    - Men buy diapers, and must just have a beer
    - Put the expensive beers next to the diapers
- Wal-Mart's suppliers use the DW to optimize delivery
  - The supplier puts the product on the shelf
  - The supplier only gets paid when the product is sold
- Web log mining
  - What is the association between time of day and requests?
  - What user groups use my site?
  - How many requests does my site get in a month? (Yahoo)
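
The association-rule notation A -> B (c%, s%) from the data mining slide can be made concrete with a short sketch. It assumes the usual definitions: support is the fraction of all tickets that contain both A and B, and confidence is the fraction of tickets containing A that also contain B. The five tickets below are invented for illustration and are not Wal-Mart data.

```python
from fractions import Fraction

# Hypothetical sales tickets: each ticket is the set of items bought together.
tickets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


s = support({"diapers", "beer"}, tickets)
c = confidence({"diapers"}, {"beer"}, tickets)
print(f"diapers -> beer: support {float(s):.0%}, confidence {float(c):.0%}")
```

Exact fractions are used so the two measures can be compared without rounding; a mining tool would search over many candidate rules and keep those above chosen support and confidence thresholds.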

Common DW Issues
- Metadata management
  - Need to understand data = metadata needed
  - Greater need than in OLTP applications as raw data is used
  - Need to know about: data definitions, dataflow, transformations, versions, usage, security
- DW project management
  - DW projects are large and different from ordinary SW projects
    - 12-36 months and US$ 1+ million per project
  - Data marts are smaller and safer (bottom-up approach)
  - Reasons for failure
    - Lack of proper design methodologies
    - High HW+SW cost (not so much anymore)
    - Deployment problems (lack of training)
    - Organizational change is hard (new processes, data ownership, ...)
    - Ethical issues (security, privacy, ...)

Summary
- Why Business Intelligence?
- Data analysis problems
- Data Warehouse (DW) introduction
- Analysis technologies that use the DW
  - OLAP
  - Data mining
  - Visualization
- BI can provide many advantages to your organization
  - A good DW is a prerequisite for BI
  - But a DW is a means rather than a goal: it is only when it is heavily used that success is achieved

DWML Mini Project and Exam
- Performed in groups of ~4 persons
- Documented in report of 20 pages
- Deadline: April 20
  - But every part should be done when indicated on home page
- Basis for discussion at the oral exam (20 mins per person)
  - Maximum 4 persons at a time in exam
- Exam also covers literature
  - Not just mini project
  - Questions in theoretical background, too

DWML Software
- Groups to be formed today!
  - Inform MLY about the groups at 16.00
- MS software via MSDNAA
  - Talk to msdnaa@cs.aau.dk about accounts
- DW software
  - MS SQL Server 2005 RDBMS
  - MS Analysis Services, Integration Services, Reporting Services
  - Read the mini-project webpage (part 1c) for installation details
- Data mining software
  - Presented by Thomas D. Nielsen