Data warehouses Decision support The multidimensional model OLAP queries


Traditional DBMSs are used by organizations to maintain data recording day-to-day operations: On-line Transaction Processing (OLTP). Such databases have driven the growth of the DBMS industry, and will continue to be important. Organizations are increasingly analyzing current and historical data to identify useful patterns: market research to target products to particular market segments (including political polling), and decision support for high-level decision making. This is On-line Analytic Processing (OLAP).

High-level decision making requires an overall view of all aspects of an organization. This can be prohibitively expensive with a large, global, distributed database, so many organizations have created large consolidated data warehouses: information from several databases is consolidated into a data warehouse by copying tables from many sources into one location. The focus is on identifying patterns and global analysis, so having current data is not a priority.

[Diagram: data from External Data Sources and Operational Databases is extracted, cleaned, transformed, loaded, and refreshed into the Data Warehouse, which is described by a Metadata Repository.]

Extract data from operational databases and other sources. Clean data to minimize errors, and fill in missing information where possible. Transform data to reconcile semantic mismatches, typically by defining a view over the data sources. Load data by materializing, and storing, the views created in the transformation stage: sort the data, generate summary information, and partition the data and build indexes to increase efficiency.
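The four stages above can be sketched end to end; every source, field name, and conversion rate here is a hypothetical stand-in:

```python
# Toy ETL pipeline: extract from two "sources", clean, transform, load.
# All names (source_a, source_b, amount fields, EUR_TO_USD) are invented.

EUR_TO_USD = 1.1  # assumed fixed rate, used to reconcile currencies

source_a = [{"id": 1, "amount_usd": 100.0}, {"id": 2, "amount_usd": None}]
source_b = [{"id": 3, "amount_eur": 50.0}]

def extract():
    """Extract raw rows from each operational source."""
    return list(source_a), list(source_b)

def clean(rows):
    """Drop rows with missing measures (a real cleaner might fill them in)."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows_a, rows_b):
    """Reconcile the semantic mismatch: convert EUR amounts to USD."""
    unified = [{"id": r["id"], "amount": r["amount_usd"]} for r in rows_a]
    unified += [{"id": r["id"], "amount": r["amount_eur"] * EUR_TO_USD}
                for r in rows_b]
    return unified

def load(rows):
    """Materialize the view: sort the data and generate summary information."""
    warehouse = sorted(rows, key=lambda r: r["id"])
    summary = {"row_count": len(warehouse),
               "total": sum(r["amount"] for r in warehouse)}
    return warehouse, summary

a, b = extract()
warehouse, summary = load(transform(clean(a), clean(b)))
print(summary)
```

The row with a missing amount is dropped in the cleaning stage, and the EUR row is converted during transformation before loading.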

Semantic integration: eliminate mismatches between data from many sources, e.g. different schemas or currencies. Heterogeneous sources: data has to be accessed from a variety of source formats. Load, refresh, purge: data must be loaded and periodically refreshed, and data that is too old should be purged. Metadata management: the source, loading date, and other information should be maintained for all data in the warehouse.

Traditional SQL queries are inadequate for typical decision support queries. WHERE clauses often contain many AND and OR conditions, and SQL handles OR conditions poorly. Many statistical functions are not supported by SQL, so queries must be embedded in host language programs. Many queries involve conditions over time, or aggregation of data over time. Users often need to submit several related queries, and SQL does not support optimization of such families of queries.

Some systems support querying where the data is considered to be a multidimensional array; typical queries involve group-by and aggregation operators, complex conditions, and statistical functions: OLAP applications. Some DBMSs are designed to support OLAP queries as well as traditional SQL queries: relational DBMSs optimized for decision support. A third class of analysis tools supports exploratory data analysis: identifying interesting patterns in the data, or data mining.

OLAP applications use ad hoc, complex queries involving group-by and aggregation operators. Typically OLAP queries are considered in terms of a multidimensional data model: the data can be represented as a multidimensional array. The focus is on a collection of numeric measures. Each measure depends on a set of dimensions, and has one value for each combination of values of those dimensions.

[Diagram: a 3-D array of sales values, with dimensions product_id, location_id, and time_id; a slice of the array is shown where location_id = 1.]

Multidimensional data can be stored physically in a multidimensional array (MOLAP); the array must be disk-resident and persistent. Alternatively, the data can be stored in relations (ROLAP). The relation which relates the dimensions to a particular measure is known as the fact table. Each dimension (location, product, etc.) can have additional attributes stored in a dimension table. Fact tables are usually much larger than dimension tables.

Dimensions can be considered as hierarchies, and attributes have positions in the hierarchies. [Diagram: a time hierarchy in which a date rolls up to a week or a month, months to quarters, and quarters to years; and a location hierarchy in which a city rolls up to a province, and provinces to a country.] In many OLAP applications it is necessary to store information about time. Such data often cannot be represented by an SQL date or timestamp, e.g. date, week, month, quarter, year, holiday status, etc.

Star schemas are a common pattern in OLAP DBs, so called because the dimension tables form a star pattern around the fact table. The majority of the data is in the fact table, which should have no redundancy and be in BCNF. To reduce size, system-generated dimension identifiers are used as the primary keys of dimensions, e.g. a product ID instead of a product name. Dimension tables are often not normalized; as they are static, update, insert, and delete anomalies are not an issue.
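As a sketch, a small star schema can be built in SQLite; the table and key names follow the examples in this text, while the rows inserted are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: small, static, keyed by system-generated identifiers.
cur.execute("CREATE TABLE Location (locationID INTEGER PRIMARY KEY,"
            " city TEXT, province TEXT)")
cur.execute("CREATE TABLE Product (productID INTEGER PRIMARY KEY,"
            " name TEXT, category TEXT)")
# Fact table: the bulk of the data, one measure value per key combination.
cur.execute("""CREATE TABLE Sales (
    productID INTEGER REFERENCES Product,
    locationID INTEGER REFERENCES Location,
    sales REAL)""")

cur.execute("INSERT INTO Location VALUES (1,'Vancouver','BC'), (2,'Toronto','Ont')")
cur.execute("INSERT INTO Product VALUES (10,'milk','dairy'), (11,'cookies','snacks')")
cur.executemany("INSERT INTO Sales VALUES (?, ?, ?)",
                [(10, 1, 63.0), (11, 1, 38.0), (10, 2, 81.0), (11, 2, 107.0)])

# A typical star join: aggregate the fact table by a dimension attribute.
cur.execute("""SELECT L.province, SUM(S.sales)
               FROM Sales S JOIN Location L ON S.locationID = L.locationID
               GROUP BY L.province ORDER BY L.province""")
result = cur.fetchall()
print(result)
```

The fact table carries only dimension keys and the measure; descriptive attributes such as city or category live in the dimension tables and are reached through the star join.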

Influenced by SQL and by spreadsheets. A measure is commonly aggregated over one or more dimensions, e.g. find total sales; find total sales for each city, or province; find the top five products ranked by sales. Aggregation may be at different levels of a dimension hierarchy: roll-up total sales by city to get sales by province, or drill-down from total sales by province to get sales by city. It is possible to drill-down on different dimensions.

Pivoting sales on location and time produces the total sales for each location and time. The result is a cross-tabulation:

            BC    Ont   total
    2001    63     81     144
    2002    38    107     145
    2003    75     35     110
    total  176    223     399

Slicing a dataset is an equality selection on a dimension; dicing a dataset is a range selection.

The queries shown below give the cross-tabulation shown previously.

Total:

    SELECT SUM(S.sales)
    FROM Sales S

Details:

    SELECT T.year, L.province, SUM(S.sales)
    FROM Sales S, Times T, Location L
    WHERE S.timeID = T.timeID AND S.locationID = L.locationID
    GROUP BY T.year, L.province

Sub-totals:

    SELECT T.year, SUM(S.sales)
    FROM Sales S, Times T
    WHERE S.timeID = T.timeID
    GROUP BY T.year

    SELECT L.province, SUM(S.sales)
    FROM Sales S, Location L
    WHERE S.locationID = L.locationID
    GROUP BY L.province

The cross-tabulation shown previously is a roll-up on the location and time dimensions, with the sub-totals being roll-ups on time, and on location. Each roll-up query corresponds to an SQL GROUP BY query with different grouping criteria. With three dimensions (time, location, and product) how many such queries are there? If there are k dimensions, there are 2^k possible GROUP BY queries generated by pivoting. These can be generated by an SQL:1999 CUBE query.
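The 2^k grouping lists can be enumerated directly; a small sketch, using the dimension names from the example:

```python
from itertools import combinations

def grouping_sets(dimensions):
    """Enumerate every subset of the dimensions --
    one GROUP BY attribute list per subset."""
    return [list(subset)
            for r in range(len(dimensions) + 1)
            for subset in combinations(dimensions, r)]

sets_ = grouping_sets(["time", "location", "product"])
print(len(sets_))  # 2^3 = 8 grouping lists, from {} up to all three dimensions
```

The empty subset corresponds to the grand total, and the full subset to the most detailed grouping.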

SQL:1999 extended the GROUP BY construct to provide better OLAP support. The GROUP BY clause with the CUBE keyword is equivalent to a collection of GROUP BY statements, with one GROUP BY statement for each subset of the k dimensions. The ROLLUP keyword can also be used; it distinguishes between the GROUP BY attributes, so subsets where the first attribute is null are not included in the result (except where all attributes are null).

SELECT T.year, L.province, SUM(S.sales)
FROM Sales S, Times T, Location L
WHERE S.timeID = T.timeID AND S.locationID = L.locationID
GROUP BY CUBE (T.year, L.province)

    T.year  L.province  SUM(S.sales)
    2001    BC            63
    2001    Ont           81
    2001    null         144
    2002    BC            38
    2002    Ont          107
    2002    null         145
    2003    BC            75
    2003    Ont           35
    2003    null         110
    null    BC           176   <- these two rows are not included in the
    null    Ont          223      equivalent rollup query (GROUP BY ROLLUP)
    null    null         399

The eight grouping lists for the set {product, time, location} are shown below. Note that the CUBE operation calculates the specified grouping and its child nodes: {product, location, time}, {product, location}, {product, time}, {location, time}, {product}, {location}, {time}, {}.

Trend analysis examples: find the percentage change in the total monthly sales of each product; find the top five products ranked by total sales; find the trailing n day moving average of sales (for each day, compute the average daily sales over the preceding n days). The first two queries are hard to express in SQL, and the third is impossible if n is a parameter of the query. The SQL:1999 WINDOW clause allows such queries, over a table viewed as a sequence.

SELECT L.province, T.month, AVG(S.sales) OVER W AS movavg
FROM Sales S, Times T, Location L
WHERE S.timeID = T.timeID AND S.locationID = L.locationID
WINDOW W AS (PARTITION BY L.province
             ORDER BY T.month
             RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
                       AND INTERVAL '1' MONTH FOLLOWING)

The FROM and WHERE clauses form a normal SQL query; call this query Temp. Temp is the sales rows with the attributes of time and location. The WINDOW clause then defines the windows over Temp.

Temp is partitioned by the PARTITION BY clause. The result has one row for each row in the partition, rather than one row for each partition. The partitions are sorted by the ORDER BY clause. The WINDOW clause groups nearby records: value based, using RANGE (as in the example), or based on the number of rows, using the ROWS clause. The aggregate function is computed for each row over its corresponding group, i.e. its window. There are also new aggregate functions, such as RANK and its variants.
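The value-based (RANGE) window semantics can be sketched in plain code: each row's window is every row of the same partition whose month lies within one month of its own. The rows below are invented:

```python
# Each row: (province, month_number, sales). Hypothetical data.
rows = [("BC", 1, 10.0), ("BC", 2, 20.0), ("BC", 3, 60.0), ("Ont", 1, 5.0)]

def moving_avg(rows, span=1):
    """For each row, average sales over rows of the same partition
    whose month is within `span` of the current month (RANGE semantics)."""
    out = []
    for prov, month, _ in rows:
        window = [s for p, m, s in rows
                  if p == prov and abs(m - month) <= span]
        out.append((prov, month, sum(window) / len(window)))
    return out

result = moving_avg(rows)
print(result)
```

Note that the window of the first and last month of a partition is smaller, since there is no preceding or following month; the Ont partition never mixes with the BC rows.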

In analyzing trends it is common to want to find the 10 (or n) best, or worst, performers, e.g. the top ten best-selling products: sort sales by product and return the answer in descending order. If there are a million products this is very wasteful. Instead, guess (!) a sales value c such that all top ten performers are better than c, but many more are less than c, and add the selection sales > c.

SELECT P.productID, P.productName, S.sales
FROM Product P, Sales S
WHERE P.productID = S.productID
  AND S.locationID = 1 AND S.timeID = 3
ORDER BY S.sales DESC
OPTIMIZE FOR 10 ROWS

The OPTIMIZE FOR construct is supported by some DBMSs; the cutoff value (c) is chosen by the optimizer. Choosing the cutoff can be tricky, and the effectiveness of the approach depends on how accurately the cutoff can be estimated.
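The cutoff idea can be sketched outside SQL: filter on the guessed cutoff c first, and fall back to a full scan if the guess was too aggressive. The data and cutoff values are invented:

```python
def top_n(sales, n, cutoff_guess):
    """Return the n largest sales values, using a guessed cutoff to
    avoid sorting everything; redo without the cutoff if it was too high."""
    candidates = [s for s in sales if s > cutoff_guess]
    if len(candidates) < n:          # guess eliminated too much
        candidates = list(sales)     # fall back to the full scan
    return sorted(candidates, reverse=True)[:n]

sales = [3, 99, 42, 7, 85, 1, 64, 28]
print(top_n(sales, 3, cutoff_guess=40))
```

A good guess means only a handful of candidates are sorted; a bad guess costs a second pass, which mirrors the trickiness of choosing the cutoff noted above.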

SELECT L.province, AVG(S.sales)
FROM Sales S, Location L
WHERE S.locationID = L.locationID
GROUP BY L.province

This query may be expensive if the tables are large. If speed is of the essence, it is possible to return data before the query is complete: either return the current running average, or use sampling and other statistical techniques to return an approximation. Note that the algorithms must be non-blocking.
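The running-average alternative can be sketched as a non-blocking generator: each input row updates the estimate, which can be reported at any time before the scan finishes. The values are invented:

```python
def running_average(stream):
    """Yield the current average after each row -- a non-blocking
    aggregate whose intermediate results approximate the final answer."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count

estimates = list(running_average([10.0, 20.0, 60.0]))
print(estimates)
```

The final element equals the exact average; the earlier elements are the progressively refined estimates a user would see mid-query.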

The OLAP environment motivates some different implementation techniques. Indexing is very important in OLAP systems: interactive response time is desired for queries over very large databases. OLAP systems are mostly read, and rarely updated, which reduces the cost of maintaining indexes. New indexing techniques have been developed for OLAP systems: bitmap indexes and join indexes.

A bitmap index can speed up queries on sparse columns, those that have few possible values. One bit is allocated for each possible value. The indexes can be used to answer some queries directly, e.g. how many male customers have a rating of 3? AND the M and 3 bitmaps and count the 1s.

    sex index                        rating index
    M  F    id   name  sex  rating   1  2  3  4  5
    1  0    112  Sam   M    5        0  0  0  0  1
    0  1    113  Sue   F    3        0  0  1  0  0
    0  1    121  Ann   F    2        0  1  0  0  0
    1  0    131  Bob   M    3        0  0  1  0  0
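A sketch of the bitmap idea using the table above, with each bitmap held as a Python integer (one bit per row):

```python
# Rows in the order of the table: (name, sex, rating).
rows = [("Sam", "M", 5), ("Sue", "F", 3), ("Ann", "F", 2), ("Bob", "M", 3)]

def build_bitmap(rows, col):
    """One bitmap (an int) per distinct value of column `col`;
    bit i is set when row i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[col]] = bitmaps.get(row[col], 0) | (1 << i)
    return bitmaps

sex_index = build_bitmap(rows, 1)
rating_index = build_bitmap(rows, 2)

# How many male customers have a rating of 3?  AND the bitmaps, count the 1s.
matches = sex_index["M"] & rating_index[3]
print(bin(matches).count("1"))  # only Bob is both male and rated 3
```

The AND and the population count are both cheap bit operations, which is why bitmap indexes answer such counting queries without touching the base table.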

Joins are often expensive operations, and join indexes can be built to speed up specific join queries. A join index contains the record IDs of matching records from different tables, e.g. the sales, products, and locations of all sales in B.C.: the index would contain the sales record IDs and their matching product and location record IDs, with only locations where province = "BC" included. The number of such indexes can be a problem where there are many similar queries.

To reduce the number of join indexes, separate indexes can be created on selected columns, with the record IDs of dimension table records that meet the condition, and the record IDs of matching fact table records. The separate join indexes then have to be combined, using record ID intersection, to compute a join query. The intersection can be performed more efficiently if the new indexes are bitmap indexes, particularly if the selection columns are sparse. The result is a bitmapped join index.

Decision support is a rapidly growing area of database use and research It involves the creation of large, consolidated data repositories called data warehouses Warehouses are queried using sophisticated analysis techniques Complex multidimensional queries influenced by both SQL and spreadsheets New techniques for database design, indexing, view maintenance and querying must be supported

Data mining consists of finding interesting trends in large datasets; it is related to an area of statistics called exploratory data analysis. Such patterns should be identified with minimal user input. The knowledge discovery process has four steps: data selection (find the target subset of the data); data cleaning (remove noise and outliers, transform fields to common units, and prepare the data for analysis); data mining (apply data mining algorithms to find interesting trends or patterns); and evaluation (present the results to end users).

A market basket is a collection of items purchased by a customer in a single transaction. Retailers commonly want to know which items are purchased together, to identify marketing opportunities. An itemset is a set of items bought in a transaction, and the support of an itemset is the fraction of transactions that contain all the items in the itemset, e.g. if {milk, cookies} has 60% support then 60% of all transactions contain both milk and cookies. We may be interested in single-item itemsets, as they identify frequently purchased items.

The a priori property is that every subset of a frequent itemset is also a frequent itemset. The algorithm can proceed iteratively by first identifying frequent itemsets with only one item. Each single-item itemset can then be extended with another item to generate larger candidate itemsets. Each iteration of the algorithm scans the transactions once, increasing the candidate itemsets by one item. The algorithm can be improved by only considering additional items that are themselves frequent itemsets. The minimum support level has to be specified by the user.
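A compact sketch of the iterative algorithm described above, on invented transactions; each pass grows the frequent itemsets by one item, extending only with items that are themselves frequent:

```python
def apriori(transactions, min_support):
    """Iteratively grow frequent itemsets one item at a time; candidates
    are built only from frequent itemsets (the a priori property)."""
    n = len(transactions)
    support = lambda itemset: sum(itemset <= t for t in transactions) / n

    all_items = {i for t in transactions for i in t}
    frequent_items = [i for i in sorted(all_items)
                      if support(frozenset([i])) >= min_support]
    frequent = [frozenset([i]) for i in frequent_items]
    result = set(frequent)
    while frequent:
        # extend each frequent itemset with one more frequent single item
        candidates = {fs | {i} for fs in frequent for i in frequent_items
                      if i not in fs}
        frequent = [c for c in candidates if support(c) >= min_support]
        result.update(frequent)
    return result

tx = [frozenset(t) for t in
      [{"milk", "cookies"}, {"milk", "cookies", "bread"},
       {"milk", "bread"}, {"cookies"}, {"milk", "cookies"}]]
freq = apriori(tx, min_support=0.6)
print(sorted(map(sorted, freq)))
```

Here bread has support 2/5 and is pruned in the first pass, so no candidate containing it is ever generated, while {milk, cookies} survives with support 3/5.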

Find items that customers have purchased more than five times:

SELECT P.customerID, P.item, SUM(P.quantity)
FROM Purchases P
GROUP BY P.customerID, P.item
HAVING SUM(P.quantity) > 5

If the number of {customerID, item} pairs is large the relation may have to be sorted or hashed, but the result set is probably small: just the tip of the iceberg. The query will waste time computing all of the groups, even though only a few will meet the HAVING condition. A modification of the a priori property suggests that we only need to consider customers that have purchased more than 5 items, and items that have been purchased more than 5 times.

An association rule is a rule of the form {milk} => {cookies}, which states that if milk is purchased in a transaction then it is likely that cookies are also purchased in that transaction. There are two measures for association rules. The support is the percentage of transactions that contain LHS ∪ RHS, the same as the support for that itemset. The confidence is a measure of the strength of the rule: the percentage of times that cookies are purchased whenever milk is purchased, or support(LHS ∪ RHS) / support(LHS).

Users ask for association rules with given minimum support and confidence. First, all frequent itemsets with the specified minimum support are found, as discussed previously. Once the frequent itemsets have been produced they are divided into LHS and RHS, and the confidence measure is then tested for each possible LHS and RHS combination of the qualifying itemsets. The most expensive part of the algorithm is identifying the frequent itemsets.
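The division into LHS and RHS described above can be sketched as follows; the supports dictionary stands in for the output of the frequent-itemset pass, and its values are invented:

```python
from itertools import combinations

def rules(frequent, supports, min_conf):
    """Split each frequent itemset into LHS => RHS and keep the rules whose
    confidence = support(LHS u RHS) / support(LHS) meets the threshold."""
    out = []
    for itemset in frequent:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = itemset - lhs
                conf = supports[itemset] / supports[lhs]
                if conf >= min_conf:
                    out.append((set(lhs), set(rhs), conf))
    return out

# Hypothetical supports from a previous frequent-itemset pass.
supports = {frozenset({"milk"}): 0.8,
            frozenset({"cookies"}): 0.8,
            frozenset({"milk", "cookies"}): 0.6}
found = rules([frozenset({"milk", "cookies"})], supports, min_conf=0.7)
print(found)
```

Both {milk} => {cookies} and {cookies} => {milk} qualify here, since each has confidence 0.6 / 0.8; only the confidence test is new work, which is why the frequent-itemset pass dominates the cost.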

Association rules can be applied to sets of days by using the date field as a grouping attribute. In calendric market basket analysis the user specifies a group of dates, or calendar, to test the rule over. Sequential patterns can also be analyzed, where a customer purchases a given sequence of itemsets. Care must be taken when using association rules for predictive purposes: rules like {milk} => {cookies} may not be causal.

Finding causal relationships can be hard: although two events (or purchases) may be correlated, there may not be a causal relationship between them. Each possible combination of causal relationships can be considered as a model of the world, and each model is assigned a score based on its consistency with the observed data. Bayesian networks are graphs that can be used to describe such models. The number of models is exponential in the number of variables, so only some subset can be considered.

An insurance company may want to predict whether or not customers are high risk. What information can they use to do this? e.g. if a male aged between 16 and 25 drives a truck, he is high risk. There is one attribute whose value is to be predicted, the dependent attribute, and a number of predictor attributes. The general form of such rules is:

P1(X1) ∧ P2(X2) ∧ ... ∧ Pk(Xk) => Y = c

where the Pi are predicates and the Xi are predictor attributes.

The form of the predicates depends on the type of the predictor attribute. If the attribute is numerical, then numerical computations can be performed, and the predicate is of the form low <= Xi <= high. If the attribute is categorical, then we must test whether values are equal, and the predicate is of the form Xi ∈ {v1, ..., vj}. In the insurance example, age is numerical, while car type and risk rating are categorical:

(16 <= age <= 25) ∧ (car ∈ {truck}) => hirisk = true

Rules are classified by the type of the dependent attribute: categorical gives classification rules, numerical gives regression rules.
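The example rule can be checked directly; a small sketch, with each record represented as a dict:

```python
def high_risk(record):
    """(16 <= age <= 25) AND (car in {truck})  =>  hirisk = true:
    one numerical predicate and one categorical predicate."""
    return 16 <= record["age"] <= 25 and record["car"] in {"truck"}

print(high_risk({"age": 20, "car": "truck"}))   # both predicates hold
print(high_risk({"age": 40, "car": "truck"}))   # numerical predicate fails
```

The numerical predicate is a range test and the categorical predicate is set membership, matching the two predicate forms above.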

Support and confidence can be defined for classification and regression rules. The support for a condition C is the percentage of records that satisfy C, and the support for a rule C1 => C2 is the support for C1 ∧ C2. The confidence of a rule C1 => C2 is the percentage of records satisfying C1 that also satisfy C2. Classification and regression rules differ from association rules in that they consider more than one set-valued field as the left-hand side of the rule.

A collection of classification rules can be represented as a decision tree. Each internal node of the tree is labeled with a predictor attribute, referred to as a splitting attribute, and outgoing edges are labeled with predicates. Leaf nodes are labeled with values of the dependent attribute. Decision trees are constructed in two phases. In the growth phase an overly large tree is built: the tree is built by repeatedly splitting on the best remaining splitting criterion, and the database is then partitioned on that criterion. The tree is then pruned to remove overspecialized rules.

The goal is to partition a set of records into groups such that records in a group are similar, and records that belong to different groups are dissimilar. Each group is called a cluster, and each record should belong to only one cluster. Partitional clustering algorithms partition the data into groups based on some criterion. Hierarchical clustering algorithms generate a sequence of partitions: in the first partition each cluster consists of one record, and the algorithm then merges two clusters in each step.
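A minimal sketch of the hierarchical scheme described above, using one-dimensional records and centroid distance (both choices are illustrative assumptions): start with one cluster per record and repeatedly merge the two closest clusters.

```python
def agglomerate(points, k):
    """Merge the two nearest clusters (by centroid distance) until k remain."""
    clusters = [[p] for p in points]          # one record per cluster
    centroid = lambda c: sum(c) / len(c)
    while len(clusters) > k:
        # find the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: abs(centroid(clusters[ij[0]]) -
                                      centroid(clusters[ij[1]])))
        clusters[i] += clusters.pop(j)        # merge cluster j into cluster i
    return [sorted(c) for c in clusters]

result = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], k=3)
print(result)
```

Stopping the merging at different values of k yields the sequence of partitions the text refers to, from one cluster per record down to a single cluster.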