Handout 12    CS-605    Spring 17

Data Warehousing and Analytics

Operational (aka transactional) system: a system that is used to run a business in real time, based on current data; also called a system of record.

Informational (analytical) system: a system designed to support decision making based on historical point-in-time and prediction data, for complex queries or data-mining applications.
  o Collect business operational data.
  o Reduce it to a form that can be used to analyze the behavior of the business.
  o Not limited to databases, but often uses database technology.

Data warehouse (simple definition): an archival database for decision support.

Operational Databases
  - Support day-to-day business operations
  - Read/writeable: records may be inserted, updated, deleted
  - Not as big as the ones used for decision support

Decision Support Databases
  - Hold historical information integrated from multiple sources
  - Primarily read-only
  - Updating limited to
      o Load
      o Refresh
      o (i.e. inserts, some deletes, almost never updates)
  - Include a temporal component
  - Tend to be very large (especially when storing transaction data)
  - Integrity not a big concern
  - Usually designed in an ad hoc manner

Queries
  - Often involve complex logical expressions in WHERE
  - Require access to many kinds of facts/business objects, i.e. may require many joins
  - Functionally complex: may involve complex statistical computations
  - Analytically complex: rarely answered in one query

Data Warehouse: a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes.
  - Subject-oriented: e.g. customers, patients, students, products
  - Integrated: consistent naming conventions, formats, encoding structures; from multiple and heterogeneous organizational data sources
  - Time-variant: can study trends and changes
  - Nonupdatable (nonvolatile): read-only, periodically refreshed

Data Mart: a data warehouse that is limited in scope, intended for use by a smaller, more specialized group of people.

Creating a Data Warehouse - ETL (Extract, Transform, Load)
  - Need to integrate uncoordinated and inconsistent multiple databases in organizations.
  - Need to separate operational and informational systems and data to improve performance of data management.

Extract
  - Static extract = capturing a snapshot of the source data at a point in time
  - Incremental extract = capturing changes that have occurred since the last static extract

Scrub/Cleanse
  - Uses pattern recognition and AI techniques to upgrade data quality
  - Problems: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

[Figure 9-1 from MDM: Examples of heterogeneous data]

  - Establishing standard abbreviations and identifiers, replacing synonyms

Transform and consolidate
  - Convert data from the format of the operational system to the format of the data warehouse
  - Split/combine source records
  - Synchronize time information, e.g.:
      o customer-revenue data stored by fiscal quarter
      o customer-salesperson data stored by calendar quarter
      o can't tell which salesperson is responsible for what part of the customer revenue
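
The extract and cleanse steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `modified` timestamp field, and the `STATE_CODES` synonym table are all hypothetical, and real cleansing tools do far more than a dictionary lookup.

```python
from datetime import datetime

# Hypothetical operational rows; "modified" is an assumed last-change timestamp.
SOURCE = [
    {"id": 1, "state": "Calif.", "modified": datetime(2017, 1, 5)},
    {"id": 2, "state": "CA", "modified": datetime(2017, 2, 10)},
    {"id": 3, "state": "new york", "modified": datetime(2017, 3, 1)},
]

# Hypothetical synonym table mapping variants to one standard abbreviation.
STATE_CODES = {"calif.": "CA", "ca": "CA", "new york": "NY", "ny": "NY"}


def static_extract(rows):
    """Static extract: a snapshot (copy) of every source row."""
    return [dict(r) for r in rows]


def incremental_extract(rows, since):
    """Incremental extract: only rows changed after the previous extract."""
    return [dict(r) for r in rows if r["modified"] > since]


def cleanse(rows):
    """Replace synonyms with the standard abbreviation; unknown values pass through."""
    cleaned = []
    for r in rows:
        r = dict(r)  # copy, so the source rows are left untouched
        r["state"] = STATE_CODES.get(r["state"].strip().lower(), r["state"])
        cleaned.append(r)
    return cleaned


snapshot = cleanse(static_extract(SOURCE))
changes = cleanse(incremental_extract(SOURCE, since=datetime(2017, 2, 1)))
```

Here `snapshot` holds all three rows with standardized state codes, while `changes` holds only the rows modified after the assumed previous extract date.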

Load/Index
  - Place transformed data into the warehouse and create indexes
  - Move the data
  - Initial / Refresh mode: bulk rewriting of target data at periodic intervals
  - Check uniqueness constraints
  - CPU-intensive process, especially if many indices are present; dropping and then rebuilding the indices could help

Several Common Data Warehouse Architectures
  - Generic Two-Level Architecture
  - Independent Data Mart
  - Dependent Data Mart and Operational Data Store
  - Logical Data Mart and @ctive Warehouse

Generic Two-Level Architecture
  - Operational databases / one company-wide warehouse
  - Benefit: single integrated view of organizational data
  - Problem: extraction is periodic, so data in the warehouse is not completely current

Independent Data Mart
  - Multiple data marts: mini-warehouses, limited in scope
  - No single consolidated warehouse
  - Benefits: easier to create than one integrated warehouse
  - Problems: redundancy, extra work in ETL for each data mart, potential lack of consistency, complex querying across multiple data marts
  - Users of individual marts must themselves provide an integrated view; this is difficult and does not add up to having a single warehouse with a well-defined, known structure

Dependent Data Mart and Operational Data Store
  - Data loaded from the Operational Data Store to a single Data Warehouse, and from the Data Warehouse to the Data Marts
  - Benefits: single ETL, no redundancy

Logical Data Mart and @ctive Warehouse
  - Data marts are logical views of the warehouse
  - Works well when the data warehouse is not too large
  - Used in e-commerce applications
  - Problems: performance degrades with increasing size of the warehouse
  - Benefits: data in marts always current, no redundancy in storage/ETL
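
A logical data mart can be illustrated with an ordinary SQL view. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table `warehouse_sales`, the view `east_mart`, and the sample rows are assumptions made up for this example, not part of any particular warehouse product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table holding data for all regions.
conn.execute(
    "CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("East", "widget", 100.0), ("West", "widget", 250.0)],
)

# The "data mart" is only a view: no data is copied out of the warehouse.
conn.execute(
    "CREATE VIEW east_mart AS "
    "SELECT product, amount FROM warehouse_sales WHERE region = 'East'"
)


def east_total(connection):
    """Total sales as seen through the logical data mart."""
    return connection.execute("SELECT SUM(amount) FROM east_mart").fetchone()[0]


before = east_total(conn)
conn.execute("INSERT INTO warehouse_sales VALUES ('East', 'gadget', 40.0)")
after = east_total(conn)  # the mart reflects the new warehouse row immediately
```

Because the view is evaluated against the warehouse at query time, the mart is always current and stores nothing redundantly; the cost is that every mart query runs against the full warehouse, which is why performance degrades as the warehouse grows.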

Data Warehouse Structure

Star schema:
  - Dimension tables (often de-normalized for performance reasons) describe major business subjects + Time Period.
  - Fact table: an associative entity of the dimensions. Contains factual and quantitative summary data.

Examples (from MDM)

  Fact table provides statistics for sales broken down by product, period and store dimensions.
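
As a rough sketch of the star schema described above, the following builds three dimension tables and one fact table in an in-memory SQLite database and runs a typical roll-up query. All table names, column names, and sample values are hypothetical; they only mirror the product / period / store example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per business subject, keyed by a surrogate key.
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: associative entity of the dimensions plus quantitative measures.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (product_key, period_key, store_key)
);

INSERT INTO product VALUES (1, 'turkey'), (2, 'stuffing');
INSERT INTO period  VALUES (1, 2016, 3), (2, 2016, 4);
INSERT INTO store   VALUES (1, 'Boston'), (2, 'Albany');
INSERT INTO sales VALUES
    (1, 1, 1, 10,  200.0),
    (1, 2, 1, 50, 1000.0),
    (2, 2, 1, 30,   90.0),
    (1, 2, 2, 20,  400.0);
""")

# Sales dollars broken down by store and quarter: join the fact table to
# the dimensions it references, then aggregate the measure.
rows = conn.execute("""
    SELECT s.city, p.quarter, SUM(f.dollars_sold)
    FROM sales f
    JOIN store  s ON s.store_key  = f.store_key
    JOIN period p ON p.period_key = f.period_key
    GROUP BY s.city, p.quarter
    ORDER BY s.city, p.quarter
""").fetchall()
```

The fact table's primary key is the combination of the dimension keys, which is what makes it an associative entity of the dimensions.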

Issues:

Dimension table keys must be surrogate (non-intelligent and non-business related) for the following reasons:
  - Object descriptions may change over time, e.g. it is decided to change the size of the product with business number 20
  - Length/format consistency
  - Across multiple organizational databases, the same product may have different identification numbers/primary keys

Granularity of Fact Table: what level of detail do you want?
  - Transactional grain: finest level, enter every transaction into the warehouse
  - Aggregated grain: more summarized, enter just summary data
  - Finer grain => better analysis capability
  - More dimension tables => more rows in fact table

Modeling dates:

Technologies

Data Mining: knowledge discovery using a blend of statistical, AI, and computer graphics techniques
  - Explain observed events or conditions
      o why a sudden increase in turkey sales?
  - Confirm hypotheses
      o do turkey sales increase in November?
      o do more students take Literature courses as sophomores than as juniors?
  - Explore data for new or unexpected relationships
      o what else are the customers that buy turkeys in November likely to buy?
      o which group of customers is likely to be interested in a product?

Data visualization: representing data in graphical/multimedia formats for analysis.
  - Often used in conjunction with data mining
  - Helps identify trends and patterns
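
Returning to the fact-table granularity issue above, the difference between transactional and aggregated grain can be shown with a small Python sketch. The transaction tuples and the choice of a monthly roll-up are assumptions made for illustration only.

```python
from collections import defaultdict

# Transactional grain: one row per individual sale (hypothetical data).
# Each tuple is (date, product, store, amount).
transactions = [
    ("2016-11-20", "turkey", "Boston", 25.0),
    ("2016-11-20", "turkey", "Boston", 30.0),
    ("2016-11-21", "turkey", "Boston", 28.0),
    ("2016-11-21", "stuffing", "Boston", 4.0),
]


def aggregate_by_month(rows):
    """Aggregated grain: one summary row per (month, product, store)."""
    totals = defaultdict(float)
    for date, product, store, amount in rows:
        month = date[:7]  # "YYYY-MM" prefix of an ISO date string
        totals[(month, product, store)] += amount
    return dict(totals)


monthly = aggregate_by_month(transactions)
```

The aggregated table has fewer rows than the transactional one, but it can no longer answer questions such as which items were bought together in one transaction; that is the trade-off behind "finer grain => better analysis capability".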

Big Data
  - evolving term
  - usually refers to a voluminous amount of structured, semi-structured and unstructured data
  - can be mined for information

Analytics
  o Systematic analysis and interpretation of data, typically using mathematical, statistical, and computational tools, to improve our understanding of a real-world domain.

Big data characteristics

The Five Vs of Big Data
  - Volume: much larger quantity of data than typical for relational databases
  - Variety: lots of different data types and formats
  - Velocity: data comes at a very fast rate (e.g. mobile sensors, web click stream)
  - Veracity: traditional data quality methods don't apply; how to judge the data's accuracy and relevance?
  - Value: big data is valuable to the bottom line, and for fostering good organizational actions and decisions

  - Schema on Read, rather than Schema on Write
      o Schema on Write: preexisting data model, how traditional databases are designed (relational databases)
      o Schema on Read: data model determined later, depends on how you want to use it
      o Capture and store the data, and worry about how you want to use it later
  - Data Lake
      o A large integrated repository for internal and external data that does not follow a predefined schema
      o Capture everything, dive in anywhere, flexible access

NoSQL = Not Only SQL databases
  - A category of recently introduced data storage and retrieval technologies not based on the relational model
  - Supports schema on read
  - Largely open source
  - BASE: basically available, soft state, eventually consistent
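
The schema-on-read idea above can be sketched with plain JSON strings in Python. The records, field names, and the two reader functions are hypothetical; the point is only that the data is stored as captured and a structure is imposed when it is read.

```python
import json

# "Capture and store": heterogeneous raw records, no schema enforced on write.
raw_store = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "device": "mobile"}',
    '{"sensor": 17, "temp_c": 21.5}',
]


def read_clicks(store):
    """Impose a 'clicks per user' schema at read time; skip records that do not fit."""
    result = {}
    for line in store:
        record = json.loads(line)
        if "user" in record:
            result[record["user"]] = record.get("clicks", 0)
    return result


def read_temperatures(store):
    """A different schema over the same stored data: sensor readings only."""
    readings = []
    for line in store:
        record = json.loads(line)
        if "sensor" in record:
            readings.append((record["sensor"], record["temp_c"]))
    return readings


clicks = read_clicks(raw_store)
temps = read_temperatures(raw_store)
```

Under schema on write, the third record would have been rejected (or forced into a predefined table) at load time; here each reader decides which records and fields matter for its own use.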