Data Quality. Data Cleaning and Integration. Data Cleaning. Data Preprocessing. Handling Missing Values. Disguised Missing Data?



2014-05-06


J. Pei: Big Data Analytics -- Data Cleaning and Integration

Data Quality
  Accuracy
  Completeness
  Consistency
  Timeliness
  Believability
  Interpretability

Data Preprocessing
  Processing data before an analytic task
    Improve data quality
    Transform data to facilitate the target task
  Major tasks
    Data cleaning
    Data integration
    Data reduction
    Data transformation

Data Cleaning
  The process of detecting and correcting corrupt or inaccurate records in data
    Handling missing values
    Smoothing data

Handling Missing Values
  Ignore records with missing values
  Fill in missing values
    Manually
    Using a global constant
    Using a measure of central tendency for the attribute, such as the mean, median, or mode
    Using the central tendency of the class
    Using the most probable value

Disguised Missing Data?
  Online forms
  Disguised missing data are missing data entries that are not explicitly represented as such, but instead appear as potentially valid data values
  Example: information about "State" is missing, and "Alabama" is used as the disguise
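The fill-in strategies listed under "Handling Missing Values" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing entries are represented as None, and the function names, the sample ages, and the class labels are made up for the example, not taken from the slides.

```python
from statistics import mean, median, mode

def fill_missing(values, strategy="mean", constant=None):
    """Replace None entries using a global constant or a central tendency."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

def fill_by_class_mean(values, labels):
    """Replace each None with the mean of the observed values of the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical data: ages with two missing entries and a class label per record.
ages = [30, None, 20, 50, None]
labels = ["a", "a", "b", "b", "b"]

filled_median = fill_missing(ages, "median")
filled_class = fill_by_class_mean(ages, labels)
```

Filling by class uses only the records that share the class label, so the two missing ages receive different values here, whereas the attribute-wide median assigns the same value to both.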

Disguised Missing Data Is Misleading
  Wrong conclusions
  Unreasonable results

Types of Disguised Missing Data
  Randomly choose a valid value as the disguise
  A small number of values are chosen as disguises
  [Figure: two bar charts of the number of customers by state (Alabama, Ohio, Washington), distinguishing real values from disguised missing values]

Problem Definition
  Cleaning disguised missing data
    Given a table T with attributes A and an integer k
    For each attribute A_i, output k candidates of frequently used disguise values
  Examples
    "Alabama" in state
    0 in blood pressure
    21 in age

Ideas
  Observation 1: frequently used disguises
    A small number of values are frequently used as the disguises
  Observation 2: missing at random
    Missing data are often distributed randomly
    A random subset of the whole database
  [Figure: bar chart of the number of customers by state (Alabama, Ohio, Washington)]

General Framework
  For each attribute A
    For each frequent value v in A
      Compute the maximal embedded unbiased sample contained in T_v
    Return the k values with the best (in both quality and size) embedded unbiased sample

  Id  State    Age  Gender
  1   Alabama  30   M
  2   Alabama  30   M
  3   Alabama  30   F
  4   Alabama  20   F
  5   Ohio     20   F
  6   Ohio     20   F

Smoothing Noisy Data
  Noise: a random error or variance in a measured variable
  Smoothing noise, removing noise


Binning
  Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
  Partition into (equal-frequency) bins:
    Bin 1: 4, 8, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 28, 34
  Smoothing by bin means:
    Bin 1: 9, 9, 9
    Bin 2: 22, 22, 22
    Bin 3: 29, 29, 29
  Smoothing by bin boundaries:
    Bin 1: 4, 4, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 25, 34

Regression

Outlier Analysis

Data Cleaning as a Process
  Data discrepancy detection
    Use metadata (e.g., domain, range, dependency, distribution)
    Check field overloading
    Check the uniqueness rule, consecutive rule, and null rule
    Use commercial tools
      Data scrubbing: use simple domain knowledge (e.g., postal code, spell check) to detect errors and make corrections
      Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
  Data migration and integration
    Data migration tools: allow transformations to be specified
    ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
  Integration of the two processes
    Iterative and interactive (e.g., Potter's Wheel)

Data Integration
  Combining data from multiple (autonomous and heterogeneous) sources
  Providing a unified view
  Why is data integration hard?
    Systems challenges
    Data logical organization challenges
    Social and administrative challenges

Data Integration System Architecture
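The binning example on the price data can be reproduced with a short Python sketch. The function names are illustrative, and the sketch assumes that the number of values is divisible by the number of bins and that a value equally close to both boundaries goes to the lower one (the slides do not say how ties are broken).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With the nine prices above, `by_means` and `by_boundaries` match the smoothed bins listed in the Binning slide.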


Wrappers
  Computer programs that extract content from a particular data source and transform it into a target form, such as a relational table
  Example: CMS (content management system) wrapper

    <html>
    <head> <title>%page_title%</title> </head>
    <body>
    %page_content%
    <P> %page_powered_by%
    </body>
    </html>

How to Build Wrappers?
  Manual construction
  Machine learning based methods: learning schemas from training data
    Supervised learning approaches
    Unsupervised learning approaches

Schema Matching and Mapping
  Schema matching: finding the semantic correspondences between attributes in data sources and those in the mediated schema
    Example: attribute name in source S1 corresponds to attributes firstname and surname in the mediated schema
    Name-based matching
    Instance-based matching
  Schema mapping: transforming attribute values from sources to the mediated schema
    Example: a query or a program extracting name values from source S1 and forming firstname and surname values for the mediated schema

Entity Detection and Recognition
  Entity detection: identify atomic elements in text or other data and classify them into predefined categories such as person names, locations, and organizations
  Entity disambiguation: identify entities carrying the same name

Example

Data Provenance
  The data about how a data entry came to be
  Also known as data lineage or pedigree
  The annotation approach: a series of annotations describing how each data item was produced
  The graph of data relationships approach: connecting sources and deriving new data items via mappings

Deep / Hidden Web
  Sites that are difficult for a crawler to find
    Probably over 100 times larger than the traditionally indexed web
  Three major categories of sites in the deep web
    Private sites: intentionally private, with no incoming links, or may require login
    Form results: only accessible by entering data into a form, e.g., airline ticket queries
      Hard to detect changes behind a form
    Scripted pages: using JavaScript, Flash, or another client-side language in the web page
      A crawler needs to execute the script, which can slow down crawling significantly
  Deep web is different from dynamic pages
    Wikis dynamically generate web pages but are easy to crawl
    Private sites are static but cannot be crawled

Jian Pei: Big Data Analytics -- Multidimensional Analysis

Multidimensional Analysis

Outline
  Why multidimensional analysis?
  Multidimensional analysis principle
  OLAP
  OLAP indexes

Dimensions
  "An aspect or feature of a situation, problem, or thing, a measurable extent of some kind" (dictionary)
  Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
    Objects are compared in selected dimensions/attributes
  More often than not, objects have more dimensions/attributes than one is interested in and can handle

Multi-dimensional Analysis
  Find interesting patterns in multi-dimensional subspaces
    Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)
    Different patterns may be manifested in different subspaces
  Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction, a set of features for all objects
  Different subspaces may manifest different patterns

OLAP
  Conceptually, we may explore all possible subspaces for interesting patterns
  What patterns are interesting?
  How can we explore all possible subspaces systematically and efficiently?
  Fundamental problems in analytics and data mining
  Aggregates and group-bys are frequently used in data analysis and summarization
    SELECT time, altitude, AVG(temp) FROM weather GROUP BY time, altitude;
    In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times and group-bys are used 20 times
  Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently

OLAP Operations
  Roll up (drill up): summarize data by climbing up a hierarchy or by dimension reduction
    (Day, Store, Product type, SUM(sales)) -> (Month, City, *, SUM(sales))
  Drill down (roll down): the reverse of roll-up, from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions
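The roll-up from (Day, Store, Product type, SUM(sales)) to (Month, City, *, SUM(sales)) can be illustrated with a small Python sketch. The fact rows, the store-to-city mapping, and the ISO date strings are all hypothetical; the sketch only shows the two ingredients named in the slide, climbing a hierarchy (day to month, store to city) and dropping a dimension (product type).

```python
from collections import defaultdict

# Hypothetical fact rows: (day, store, product_type, sales).
facts = [
    ("2014-05-01", "store1", "tv", 10),
    ("2014-05-02", "store1", "pc", 20),
    ("2014-05-02", "store2", "tv", 5),
    ("2014-06-01", "store3", "tv", 7),
]

# Hypothetical store -> city hierarchy.
store_city = {"store1": "Vancouver", "store2": "Vancouver", "store3": "Burnaby"}

def roll_up(rows):
    """Aggregate SUM(sales) at the (month, city, *) level."""
    totals = defaultdict(int)
    for day, store, _product_type, sales in rows:
        month = day[:7]                # climb the time hierarchy: day -> month
        city = store_city[store]       # climb the location hierarchy: store -> city
        totals[(month, city, "*")] += sales   # product type is generalized to *
    return dict(totals)

rolled = roll_up(facts)
```

Drilling down is the reverse direction: it needs the detailed rows (or a finer-grained aggregate), because the rolled-up totals alone cannot be split back into days, stores, or product types.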


Other Operations
  Dice: pick specific values or ranges on some dimensions
  Pivot: rotate a cube, changing the order of dimensions in visual analysis

Relational Representation
  If there are n dimensions, there are 2^n possible aggregation columns
  [Figure: roll-up by model, by year, by color in a table]

Difficulties
  Many group-bys are needed
    6 dimensions -> 2^6 = 64 group-bys
  In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!

Dummy Value ALL

CUBE
  SALES
    Model  Year  Color  Sales
    Chevy  1990  red    5
    Chevy  1990  white  87
    Chevy  1990  blue   62
    Chevy  1991  red    54
    Chevy  1991  white  95
    Chevy  1991  blue   49
    Chevy  1992  red    31
    Chevy  1992  white  54
    Chevy  1992  blue   71
    Ford   1990  red    64
    Ford   1990  white  62
    Ford   1990  blue   63
    Ford   1991  red    52
    Ford   1991  white  9
    Ford   1991  blue   55
    Ford   1992  red    27
    Ford   1992  white  62
    Ford   1992  blue   39

  SELECT Model, Year, Color, SUM(sales) AS Sales
  FROM Sales
  WHERE Model IN ('Ford', 'Chevy') AND Year BETWEEN 1990 AND 1992
  GROUP BY CUBE(Model, Year, Color);

  DATA CUBE
    Model  Year  Color  Sales
    Chevy  1990  blue   62
    Chevy  1990  red    5
    Chevy  1990  white  87
    Chevy  1990  ALL    154
    Chevy  1991  blue   49
    Chevy  1991  red    54
    Chevy  1991  white  95
    Chevy  1991  ALL    198
    Chevy  1992  blue   71
    Chevy  1992  red    31
    Chevy  1992  white  54
    Chevy  1992  ALL    156
    Chevy  ALL   blue   182
    Chevy  ALL   red    90
    Chevy  ALL   white  236
    Chevy  ALL   ALL    508
    Ford   1990  blue   63
    Ford   1990  red    64
    Ford   1990  white  62
    Ford   1990  ALL    189
    Ford   1991  blue   55
    Ford   1991  red    52
    Ford   1991  white  9
    Ford   1991  ALL    116
    Ford   1992  blue   39
    Ford   1992  red    27
    Ford   1992  white  62
    Ford   1992  ALL    128
    Ford   ALL   blue   157
    Ford   ALL   red    143
    Ford   ALL   white  133
    Ford   ALL   ALL    433
    ALL    1990  blue   125
    ALL    1990  red    69
    ALL    1990  white  149
    ALL    1990  ALL    343
    ALL    1991  blue   104
    ALL    1991  red    106
    ALL    1991  white  104
    ALL    1991  ALL    314
    ALL    1992  blue   110
    ALL    1992  red    58
    ALL    1992  white  116
    ALL    1992  ALL    284
    ALL    ALL   blue   339
    ALL    ALL   red    233
    ALL    ALL   white  369
    ALL    ALL   ALL    941

Semantics of ALL
  ALL is a set
    Model.ALL = ALL(Model) = {Chevy, Ford}
    Year.ALL = ALL(Year) = {1990, 1991, 1992}
    Color.ALL = ALL(Color) = {red, white, blue}
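To make the semantics of GROUP BY CUBE concrete, here is a naive Python sketch that computes the data cube of the SALES table. It is an illustration only, not how a database system evaluates the query: each base row simply adds its measure to the 2^n cells obtained by keeping each dimension value or replacing it with ALL. The function name `data_cube` is made up for the example.

```python
from collections import defaultdict
from itertools import product

# The SALES base table: (Model, Year, Color, Sales).
sales = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87), ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54), ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54), ("Chevy", 1992, "blue", 71),
    ("Ford", 1990, "red", 64), ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9), ("Ford", 1991, "blue", 55),
    ("Ford", 1992, "red", 27), ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]

def data_cube(rows, n_dims):
    """Return {cell: SUM(measure)} over all 2^n_dims group-bys."""
    cube = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Every dimension is either kept or generalized to ALL.
        for mask in product((False, True), repeat=n_dims):
            cell = tuple("ALL" if generalized else value
                         for value, generalized in zip(dims, mask))
            cube[cell] += measure
    return dict(cube)

cube = data_cube(sales, 3)
total_sales = cube[("ALL", "ALL", "ALL")]
chevy_1990 = cube[("Chevy", 1990, "ALL")]
```

The cube has 3 x 4 x 4 = 48 cells, since each of Model, Year, and Color gains the extra value ALL (Model has 2 values, Year and Color have 3 each).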

OLTP Versus OLAP
                      OLTP                                                      OLAP
  users               clerk, IT professional                                    knowledge worker
  function            day-to-day operations                                     decision support
  DB design           application-oriented                                      subject-oriented
  data                current, up-to-date, detailed, flat relational, isolated  historical, summarized, multidimensional, integrated, consolidated
  usage               repetitive                                                ad hoc
  access              read/write, index/hash on primary key                     lots of scans
  unit of work        short, simple transaction                                 complex query
  # records accessed  tens                                                      millions
  # users             thousands                                                 hundreds
  DB size             100 MB to GB                                              100 GB to TB
  metric              transaction throughput                                    query throughput, response

What Is a Data Warehouse?
  "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
  Data warehousing: the process of constructing and using data warehouses

Subject-Oriented
  Organized around major subjects, such as customer, product, sales
  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
  Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Integrated
  Integrating multiple, heterogeneous data sources
    Relational databases, flat files, on-line transaction records
  Data cleaning and data integration
    Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
      E.g., hotel price: currency, tax, breakfast covered, etc.
    When data is moved to the warehouse, it is converted

Time Variant
  The time horizon for the data warehouse is significantly longer than that of operational systems
    Operational databases: current value data
    Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
  Every key structure in the data warehouse contains an element of time, explicitly or implicitly
    But the key of operational data may or may not contain a time element

Nonvolatile
  A physically separate store of data transformed from the operational environment
  Operational updates of data do not occur in the data warehouse environment
    Does not require transaction processing, recovery, and concurrency control mechanisms
    Requires only two operations in data accessing
      Initial loading of data
      Access of data

Why Separate Data Warehouse?
  High performance for both
    Operational DBMS: tuned for OLTP
    Warehouse: tuned for OLAP
  Different functions and different data
    Historical data: data analysis often uses historical data that operational databases do not typically maintain
    Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources

Star Schema
  Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, state_or_province, country

Snowflake Schema
  Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city: city_key, city, state_or_province, country

Fact Constellation
  Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type

(Good) Aggregate Functions
  Distributive: there is a function G() such that
    $F(\{X_{i,j}\}) = G(\{F(\{X_{i,j} \mid i = 1, \dots, I\}) \mid j = 1, \dots, J\})$
    Examples: COUNT(), MIN(), MAX(), SUM()
    G = SUM() for COUNT()
  Algebraic: there is an M-tuple valued function G() and a function H() such that
    $F(\{X_{i,j}\}) = H(\{G(\{X_{i,j} \mid i = 1, \dots, I\}) \mid j = 1, \dots, J\})$
    Examples: AVG(), standard deviation, MaxN(), MinN()
    For AVG(), G() records the sum and the count; H() adds these two components and divides to produce the global average

Holistic Aggregate Functions
  There is no constant bound on the size of the storage needed to describe a subaggregate
    There is no constant M such that an M-tuple characterizes the computation $F(\{X_{i,j} \mid i = 1, \dots, I\})$
  Examples: Median(), MostFrequent() (also called the Mode()), and Rank()
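The difference between a distributive and an algebraic function can be seen on AVG(). In the Python sketch below, `partial` plays the role of G() (it records the sum and the count of one partition) and `combine_avg` plays the role of H(). The function names and the partitioning of the values are illustrative assumptions.

```python
def partial(values):
    """G(): summarize one partition by the 2-tuple (sum, count)."""
    return (sum(values), len(values))

def combine_avg(partials):
    """H(): add up the sums and the counts, then divide."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Two partitions of unequal size (hypothetical split of the price data).
partitions = [[4, 8], [15, 21, 21, 24, 25, 28, 34]]

global_avg = combine_avg([partial(p) for p in partitions])

# Treating AVG as if it were distributive (averaging the partition averages)
# gives a different, wrong result when partitions differ in size.
avg_of_avgs = sum(sum(p) / len(p) for p in partitions) / len(partitions)

# COUNT is distributive with G = SUM: add up the partition counts.
global_count = sum(len(p) for p in partitions)
```

The fixed-size tuple (sum, count) is what makes AVG() algebraic: each partition is summarized by a constant amount of storage, which is not possible for holistic functions such as Median().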

Index Requirements in OLAP
  Data is read only
    (Almost) no insertion or deletion
  Query types
    Point query: looking up one specific tuple (rare)
    Range query: returning the aggregate of a (large) set of tuples, with group by
    Complex queries: need specific algorithms and index structures, will be discussed later

OLAP Query Example
  In table (cust, gender, ...), find the total number of male customers
  Method 1: scan the table once
  Method 2: build a B+ tree index on attribute gender; still need to access all tuples of male customers
  Can we get the count without scanning many tuples, not even all tuples of male customers?

Bitmap Index
  For n tuples, a bitmap index has n bits and can be packed into $\lceil n/8 \rceil$ bytes and $\lceil n/32 \rceil$ words
  From a bit to the row-id: the j-th bit of the p-th byte -> row-id = p*8 + j

  cust   gender
  Jack   M
  Cathy  F
  Nancy  F

Using Bitmap to Count
  shcount[] contains the number of bits in the entry subscript
    shcount[ ] = 4

    count = 0;
    for (i = 0; i < SHNUM; i++)
        count += shcount[b[i]];

Advantages of Bitmap Index
  Efficient in space
  Ready for logic composition
    C = C1 AND C2
    Bitmap operations can be used
  Bitmap index only works for categorical data with low cardinality
    Naively, we need 50 bits per entry to represent the state of a customer in the US
    How to represent a sale in dollars?

Bit-Sliced Index
  A sale amount can be written as an integer number of pennies, and then represented as a binary number of N bits
    24 bits is good for up to $167,772.15, appropriate for many stores
  A bit-sliced index is N bitmaps
    Tuple j is set in bitmap k if the k-th bit in its binary representation is on
    The space cost of a bit-sliced index is the same as storing the data directly
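The bitmap index and the table-lookup counting described above can be sketched in Python as follows. This is an illustration under assumptions: the lookup table is indexed by bytes rather than by the larger units that SHNUM and shcount[] suggest, and the gender and city columns are made-up sample data.

```python
def build_bitmap(column, value):
    """One bit per tuple, packed into ceil(n/8) bytes; bit j of byte p is row p*8 + j."""
    bitmap = bytearray((len(column) + 7) // 8)
    for row_id, v in enumerate(column):
        if v == value:
            bitmap[row_id // 8] |= 1 << (row_id % 8)
    return bitmap

# Precomputed lookup table: BYTE_COUNT[b] is the number of 1 bits in byte b.
BYTE_COUNT = [bin(b).count("1") for b in range(256)]

def count_bits(bitmap):
    """Count matching tuples by table lookup, without touching the base table."""
    return sum(BYTE_COUNT[b] for b in bitmap)

def bitmap_and(a, b):
    """Logic composition C = C1 AND C2 as a bytewise AND."""
    return bytearray(x & y for x, y in zip(a, b))

# Hypothetical columns of a customer table with 10 tuples.
gender = ["M", "F", "F", "M", "M", "F", "M", "M", "F", "M"]
city = ["V", "V", "B", "B", "V", "V", "B", "V", "V", "B"]

male = build_bitmap(gender, "M")
in_v = build_bitmap(city, "V")

n_male = count_bits(male)
n_male_in_v = count_bits(bitmap_and(male, in_v))
```

Ten tuples fit in two bytes, and both counts are obtained from the bitmaps alone.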


Using Indexes
  SELECT SUM(sales) FROM Sales WHERE C;
  Tuples satisfying C are identified by a bitmap B
  Direct access to rows to calculate SUM: scan the whole table once
  B+ tree: find the tuples from the tree
  Projection index: only scan attribute sales
  Bit-sliced index: get the sum as the sum over k of COUNT(B AND B_k) * 2^k

Cost Comparison
  Traditional value-list index (B+ tree) is costly in both I/O and CPU time
    Not good for OLAP
  Bit-sliced index is efficient in I/O
  Other case studies in [O'Neil and Quass, SIGMOD'97]

Horizontal or Vertical Storage
  A fact table for data warehousing is often fat
    Tens or even hundreds of dimensions/attributes
  A query is often about only a few attributes
  Horizontal storage: tuples are stored one by one
  Vertical storage: tuples are stored by attributes
  [Figure: a table with attributes A_1, A_2, ..., A_100 and tuples (x_1, x_2, ..., x_100) through (z_1, z_2, ..., z_100), shown in horizontal and in vertical layout]

Horizontal Versus Vertical
  Find the information of tuple t
    Typical in OLTP
    Horizontal storage: get the whole tuple in one search
    Vertical storage: search 100 lists
  Find SUM(a_100) GROUP BY {a_22, a_83}
    Typical in OLAP
    Horizontal storage (no index): search all tuples, O(100n), where n is the number of tuples
    Vertical storage: search 3 lists, O(3n), 3% of the cost of the horizontal storage method
  Projection index: vertical storage

Rolling-up/Drilling-down Analysis
  [Figure: roll up by model, by year, by color]

Pivot
  Not a table, many NULL values, no key

Extending GROUP BY
  SELECT Manufacturer, Year, Month, Day, Color, Model, SUM(price) AS Revenue
  FROM Sales
  GROUP BY Manufacturer,
    ROLLUP Year(Time) AS Year, Month(Time) AS Month, Day(Time) AS Day,
    CUBE Color, Model;
  [Figure: Manufacturer; ROLLUP over Year, Month, Day; Model x Color cubes]
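The bit-sliced evaluation of SELECT SUM(sales) ... WHERE C can be sketched as follows. The sketch assumes Python integers as bitmaps (bit j stands for tuple j), made-up sale amounts in pennies, and illustrative function names; it only demonstrates the identity SUM = sum over k of COUNT(B AND B_k) * 2^k.

```python
def build_bit_slices(values, n_bits):
    """slices[k] is a bitmap whose bit j is set if bit k of values[j] is on."""
    slices = [0] * n_bits
    for j, v in enumerate(values):
        for k in range(n_bits):
            if (v >> k) & 1:
                slices[k] |= 1 << j
    return slices

def bit_sliced_sum(slices, selection):
    """Sum the selected tuples using only bitmap ANDs and bit counts."""
    return sum(bin(selection & s).count("1") << k for k, s in enumerate(slices))

# Hypothetical sale amounts in pennies, one per tuple.
sales_pennies = [500, 1250, 99, 2000]
slices = build_bit_slices(sales_pennies, 24)

# Bitmap B for condition C: tuples 0, 1, and 3 qualify.
selection = 0b1011
total = bit_sliced_sum(slices, selection)
```

Only the N slice bitmaps and the selection bitmap are read; the individual sale values are never fetched, which is why the approach is efficient in I/O.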

DATA CUBE
  The SALES base table, the GROUP BY CUBE(Model, Year, Color) query, and the resulting data cube are the same as on the earlier CUBE slide

MOLAP
  [Figure: a three-dimensional data cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum)]

Pros and Cons
  Easy to implement
  Fast retrieval
  Many entries may be empty if data is sparse
  Costly in space

ROLAP: Data Cube in Table
  A multi-dimensional database
  Base table
    Dimensions              Measure
    Store  Product  Season  Sales
    S1     P1       Spring  6
    S1     P2       Spring  12
    S2     P1       Fall    9
  Cubing produces
    Dimensions              Measure
    Store  Product  Season  AVG(Sales)
    S1     P1       Spring  6
    S1     P2       Spring  12
    S2     P1       Fall    9
    S1     *        Spring  9
    ...
    *      *        *       9

Observations
  Once a base table (A, B, C) is sorted by A-B-C, the aggregates (*,*,*), (A,*,*), (A,B,*), and (A,B,C) can be computed with one scan and 4 counters
  To compute other aggregates, we can sort the base table in some other orders

How to Sort the Base Table?
  General sorting in main memory: O(n log n)
  Counting in main memory: O(n), linear in the number of tuples in the base table
  How to sort 1 million integers in the range 0 to 100?
    Set up 101 counters, one per value, and initialize them to 0
    Scan the integers once and count the occurrences of each value in 0 to 100
    Scan the integers again and put the integers in the right places
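The two-scan counting approach from "How to Sort the Base Table?" can be written as a short Python sketch. The function name and the sample input are illustrative; the value range 0 to max_value is assumed to be known in advance, as in the slide.

```python
def counting_sort(values, max_value):
    """Sort integers in the range 0..max_value in O(n + max_value) time."""
    counts = [0] * (max_value + 1)
    for v in values:                      # first scan: count each value
        counts[v] += 1

    # starts[v] is the output position of the first occurrence of v.
    starts = [0] * (max_value + 1)
    for v in range(1, max_value + 1):
        starts[v] = starts[v - 1] + counts[v - 1]

    out = [None] * len(values)
    for v in values:                      # second scan: place each value
        out[starts[v]] = v
        starts[v] += 1
    return out

sorted_values = counting_sort([100, 0, 42, 7, 42, 0], 100)
```

Because the second scan places items in input order, the same idea extends to sorting base-table tuples by a low-cardinality dimension while keeping ties in their original order.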


Iceberg Cube
  In a data cube, many aggregate cells are trivial
    Having an aggregate too small
  Iceberg query

Monotonic Iceberg Condition
  If COUNT(a, b, *) < 100, then COUNT(a, b, c) < 100 for any c
  For cells c_1 and c_2, c_1 is called an ancestor of c_2 if, in all dimensions where c_1 takes a non-* value, c_2 agrees with c_1
    (a, b, *) is an ancestor of (a, b, c)
  An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can honor P

Pushing Monotonic Conditions
  BUC searches the aggregates bottom-up in a depth-first manner
  Only when a monotonic condition holds should the descendants of the current node be expanded

How to Push Non-Monotonic Ones?
  Condition P(c) = AVG(price) >= 800 AND COUNT(*) >= 50 is not monotonic
  BUC cannot push such a constraint

Ideas
  Let AVG_k(price) be the average of the top-k tuples
    AVG_k(price) >= 800 is a monotonic condition
    If the top-10 average of (Vancouver, *, *) is less than 800, the top-10 average of (Vancouver, laptop, *) cannot be 800 or more
  AVG_k(price) >= 800 can be a filter for AVG(price) >= 800
    If AVG_k(price) < 800, then AVG(price) < 800
    Generally, AVG() <= AVG_k()

Minimal Cubing
  Computing only a shell of a data cube
    Only compute and materialize low-dimensional cuboids, with dimensionality < k (k << n)
    Save space and cubing time
    Indexing the shell cells as well as their covers, the tuples contributing to the shell cells
  Query answering
    Using the shell cells and their intersections to compute the non-materialized cells
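Pushing a monotonic iceberg condition into the search can be sketched as below. This is a simplified BUC-style recursion written for illustration, not the published BUC algorithm in full: it starts from (*, ..., *), partitions the covered tuples one dimension at a time, and expands a cell only if COUNT(*) >= min_count holds. The function name and the three sample tuples are assumptions.

```python
def iceberg_cube(rows, n_dims, min_count):
    """Return {cell: COUNT(*)} for all cells with COUNT(*) >= min_count."""
    result = {}

    def expand(part, cell, start_dim):
        # `part` holds exactly the tuples covered by `cell`.
        if len(part) < min_count:
            return  # monotonic condition failed: no descendant can satisfy it
        result[tuple(cell)] = len(part)
        for d in range(start_dim, n_dims):
            groups = {}
            for row in part:
                groups.setdefault(row[d], []).append(row)
            for value, group in groups.items():
                cell[d] = value
                expand(group, cell, d + 1)
            cell[d] = "*"  # restore before moving on to the next dimension

    expand(rows, ["*"] * n_dims, 0)
    return result

# Hypothetical base table with dimensions (store, product, season).
base = [("S1", "P1", "s"), ("S1", "P2", "s"), ("S2", "P1", "f")]
frequent = iceberg_cube(base, 3, min_count=2)
```

With min_count = 2, the partition for (S2, *, *) has a single tuple, so neither it nor any of its descendants is ever materialized. A condition such as AVG(price) >= 800 cannot be used for the early return in `expand`, because a descendant may satisfy it even when the current cell does not.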

A Data Cube Is Often Huge
  10 dimensions, cardinality 20 for each dimension -> 21^10 = 16,679,880,978,201 possible tuples in the cube
  Even if only 1/1,000 of the possible tuples are not empty, there are still more than 16 billion tuples

Compression of Data Cubes
  Traditional compression methods, e.g., zip
    High compression ratio
    The compression cannot be queried directly
  Requirements for data cube compression
    The compression can be queried efficiently
    High compression ratio
  Lossless compression and lossy compression

Redundancy in Data Cube
  A base table with only one tuple (a_1, ..., a_100, 1000) and aggregate function SUM()
  The data cube contains 2^100 tuples!
    Every query about SUM() returns 1000
  A data cube or a sub-cube may be populated by a single tuple: a base single tuple
  We do not need to pre-compute and store all aggregates

A Little More General Case
  A base table with two tuples, t_1 = (a_1, a_2, b_3, b_4, 100) and t_2 = (a_1, a_2, c_3, c_4, 1000), and aggregate function SUM()
  (a_1, a_2, *, *), (a_1, *, *, *), (*, a_2, *, *), and (*, *, *, *) all have sum 1100, since they are populated by the group of tuples {t_1, t_2}: base group tuples

Semantic Compression
  Can we summarize a data cube so that the summarization can be browsed and understood effectively?
    The summarization itself is a compression
    The compression preserves the roll-up/drill-down relation
    Directly query-able and browse-able for OLAP
  Syntactic compression
    Not preserving the roll-up/drill-down semantics
    Directly query-able for some queries, but may not be directly browse-able for OLAP

Cube Cell Lattice
  Observation: many cells may have the same aggregate values
  Can we summarize the semantics of the cube by grouping cells by aggregate values?
  Cell lattice with AVG(Sales), where s = Spring and f = Fall:
    Base cells:  (S1,P1,s):6  (S1,P2,s):12  (S2,P1,f):9
    One *:       (S1,*,s):9  (S1,P1,*):6  (*,P1,s):6  (S1,P2,*):12  (*,P2,s):12  (S2,*,f):9  (S2,P1,*):9  (*,P1,f):9
    Two *:       (S1,*,*):9  (*,*,s):9  (*,P1,*):7.5  (*,P2,*):12  (*,*,f):9  (S2,*,*):9
    Apex:        (*,*,*):9

A Naïve Attempt
Put all cells with the same aggregate value into one class.
The result is not a lattice anymore!
Anomaly: the roll-up/drill-down semantics is lost.
[Figure: the cube lattice with cells grouped into labeled classes]

A Better Partitioning
Quotient cube: a partitioning that preserves the roll-up/drill-down semantics.
[Figure: the cube lattice with cells grouped into labeled classes]

Why Is Semantic Compression Useful?
OLAP browsing.
[Figures: the cube lattice with class labels C1 through C5]

Goals
Given a cube, characterize a good way (the quotient cube way) of partitioning its cells into classes such that:
  The partition generates a reduced lattice preserving the roll-up/drill-down semantics.
  The partition is optimal: the number of classes is as small as possible.
Compute, index, and store quotient cubes efficiently to answer OLAP queries.

Why Equivalent Aggregate Values?
Two cells have equivalent aggregate values if they cover the same set of tuples in the base table.
[Figure: the cube lattice above the tuples in the base table]
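The grouping of cells by the tuples they cover can be sketched in a few lines of Python. This is only an illustrative sketch, not the quotient cube algorithm from the lecture: it enumerates every non-empty cell of the three-tuple base table used in the slides and groups the cells by cover. The names `BASE`, `generalizations`, `cover_partition`, and `avg` are made up for this example, and the cover classes it prints are not claimed to match the class labels in the figures.

```python
from collections import defaultdict
from itertools import product

# Base table from the slides: (Store, Product, Season) -> Sales.
BASE = [
    (("S1", "P1", "s"), 6),
    (("S1", "P2", "s"), 12),
    (("S2", "P1", "f"), 9),
]


def generalizations(dims):
    """Yield every cell a tuple rolls up to: each dimension kept or replaced by '*'."""
    for mask in product((False, True), repeat=len(dims)):
        yield tuple("*" if starred else value for value, starred in zip(dims, mask))


def cover_partition(base):
    """Group all non-empty cells by their cover (the set of base tuple indices)."""
    cover = defaultdict(set)
    for index, (dims, _sales) in enumerate(base):
        for cell in generalizations(dims):
            cover[cell].add(index)
    classes = defaultdict(list)
    for cell, covered in cover.items():
        classes[frozenset(covered)].append(cell)
    return classes


def avg(base, covered):
    """AVG(Sales) over the base tuples in a cover."""
    return sum(base[i][1] for i in covered) / len(covered)


classes = cover_partition(BASE)
for covered, cells in sorted(classes.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(covered), avg(BASE, covered), sorted(cells))
```

On this base table the sketch finds 18 non-empty cells in six cover classes. Two different covers, tuples {0, 1} and all three tuples, both average to 9, which illustrates that equal aggregate values do not by themselves imply equal covers.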


Cover Partition
For a cell c, a tuple t in the base table is in c's cover if t can be rolled up to c.
E.g., Cov(S1,*,spring) = {(S1,P1,spring), (S1,P2,spring)}

Store   Product   Season   Sales
S1      P1        Spring   6
S1      P2        Spring   12
S2      P1        Fall     9
(Store, Product, and Season are the dimensions; Sales is the measure.)

Cover Partitions & Aggregates
All cells in a class of the cover partition carry the same aggregate value with respect to any aggregate function.
But cells in a class of MIN() may have different covers.
For COUNT() and SUM() (positive), cover equivalence coincides with aggregate equivalence.

Quotient Cube
A quotient cube is a quotient lattice of the cube lattice such that:
  Each class is convex and connected.
  All cells in a class carry the identical aggregate value w.r.t. a given aggregate function.
The quotient cube preserves the roll-up/drill-down semantics.

Multi-Criteria Decision Problems
[Figure: an example with two dimensions and the preferences on them]
Multidimensional decision problems have a long history: more than 2300 years.
Multidimensional decision problems are often challenging.

Skyline: Best Tradeoffs
Two dimensions: distance to water and height.
Skyline: the buildings that are not dominated by any other buildings in both dimensions.
[Figure: SFU Harbor Center]

Skyline: Formal Definition
A set of objects S in an n-dimensional space D = (D1, ..., Dn).
  Numeric dimensions for illustration in this talk.
For u, v ∈ S, u dominates v if u is better than v in one dimension, and u is not worse than v in any other dimension.
  For illustration in this talk, the smaller the better.
u ∈ S is a skyline object if u is not dominated by any other object in S.
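The dominance test and the skyline definition translate directly into code. The following is a minimal sketch under the slide's convention that smaller values are better; it uses a naive quadratic scan rather than any of the algorithms cited in the lecture. The function names and the (price, travel time) sample points are invented for illustration.

```python
def dominates(u, v):
    """u dominates v: u is not worse in any dimension and strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def skyline(points):
    """Return the points that are not dominated by any other point (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, travel time) pairs, smaller is better in both dimensions.
trips = [(1, 9), (3, 4), (4, 6), (6, 2), (7, 7)]
print(skyline(trips))
```

Here (4, 6) and (7, 7) are both dominated by (3, 4), so the sketch keeps (1, 9), (3, 4), and (6, 2). A point never dominates itself, because the strict-improvement condition fails on equal coordinates.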

Example
[Figure: points u and v and the skyline points, plotted by price and travel time]

Skyline Computation
First investigated as the maximum vector problem in [Kung et al. JACM 1975].
  An O(n log^(d-2) n) time algorithm for d ≥ 4 and an O(n log n) time algorithm for d = 2 and 3.
Divide-and-conquer-based methods: DD&C, LD&C, FLET.
Skyline computation in the database context:
  Data cannot be held in main memory.
  External algorithms.

Skyline Computation on Large DB
A rule of thumb in database research: scalability on large databases.
Index-based methods:
  Using bitmaps and the relationships between the skyline and the minimum coordinates of individual points, by Tan et al.
  Using nearest-neighbor search, by Kossmann et al.
  The progressive branch-and-bound method, by Papadias et al.
Index-free methods:
  Divide-and-conquer and block nested loops, by Borzsonyi et al.
  Sort-first-skyline (SFS), by Chomicki et al.

Full Space Skyline Is Not Enough!
Skylines in subspaces:
  Skyline in the space (# stops, price, travel-time).
  If one does not care about # stops, how can we derive the superior trade-offs between price and travel-time from the full space skyline?
Sky cube: computing skylines in all nonempty subspaces (Yuan et al., VLDB 05).
  A database/data warehousing approach.
  Any subspace skyline queries can be answered (efficiently).

Sky Cube
[Figure: sky cube]

Understanding Skylines
Both Wilt Chamberlain and Michael Jordan are in the full space skyline of the Great NBA Players.
Data mining/exploration-driven questions:
  Which merits, respectively, really make them outstanding?
  How are they different?
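To make the sky cube idea concrete, the sketch below computes the skyline in every nonempty subspace by brute force. It is an assumption-laden illustration, not the method of Yuan et al.: the function names and the four (# stops, price, travel time) flights are hypothetical, smaller is taken as better in every dimension, and the work grows exponentially with the number of dimensions.

```python
from itertools import combinations


def dominates(u, v, dims):
    """u dominates v in the subspace given by the dimension indices in dims."""
    return all(u[d] <= v[d] for d in dims) and any(u[d] < v[d] for d in dims)


def subspace_skyline(points, dims):
    """Naive skyline restricted to the chosen dimensions."""
    return [p for p in points if not any(dominates(q, p, dims) for q in points)]


def sky_cube(points, n_dims):
    """Map every nonempty subspace (a tuple of dimension indices) to its skyline."""
    cube = {}
    for size in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), size):
            cube[dims] = subspace_skyline(points, dims)
    return cube


# Hypothetical flights as (# stops, price, travel time).
flights = [(0, 500, 6), (1, 300, 8), (2, 250, 7), (1, 400, 9)]
cube = sky_cube(flights, 3)
for dims, result in cube.items():
    print(dims, result)
```

With three dimensions there are seven nonempty subspaces. In this toy data the flight (1, 300, 8) is in the full-space skyline but drops out of the (price, travel time) skyline, because (2, 250, 7) beats it once the number of stops is ignored.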

Redundancy in Sky Cube
Does it just happen that skylines in multiple subspaces are identical?

Mining Decisive Subspaces
Decisive subspaces: the minimal combinations of factors that determine the (subspace) skyline membership of an object.
Examples:
  Total rebounds for Chamberlain.
  For Jordan, (total points, total rebounds, total assists) and (games played, total points, total assists).
Details in [Pei et al., VLDB 2005].

Database & Data Mining Can Meet
Conceptually, computing skylines in all subspaces.
Only computing skyline groups and their decisive subspaces.
Concise representation, leading to fast algorithms [Pei et al., ACM TODS 2006].
Improvement: borrowing frequent itemset mining techniques to speed up computation in high dimensional spaces [Pei et al., ICDE 2007].

DB Extensions and Applications
Improving database query answering:
  Efficient skyline query answering in subspaces [Tao et al., ICDE 2006].
  Effective summary of skyline: distance-based representative skyline [Tao et al., ICDE 2009].
Extensions in data types:
  Probabilistic skylines on uncertain data [Pei et al., VLDB 2007].
  Interval skyline queries on time series [Jiang and Pei, ICDE 2009].

Dynamic User Preferences
Personalized Recommendations
Different customers may have different preferences.

Favorable Facet Mining
A set of points in a multidimensional space.
  Fully ordered attributes: the preference orders are fixed, e.g., price, star-level, and quality.
  (Categorical) partially ordered attributes: the preference orders are not fully determined, e.g., airlines, hotel groups, and property types.
  Some templates may apply, e.g., single houses > semi-detached houses.
Favorable facets of a point p: the partial orders that make p in the skyline.

Monotonicity of Partial Orders
If p is not in the skyline with respect to a partial order R, p is not in the skyline with respect to any partial order stronger than R.

Minimal Disqualifying Conditions
For a point p, a most general partial order that disqualifies p from the skyline is a minimal disqualifying condition (MDC).
  Any partial order stronger than an MDC cannot make p in the skyline.
How to compute MDCs efficiently?
  MDC-O: computing MDCs on the fly.
  MDC-M: materializing MDCs.
Details in [Wong et al., KDD 2007].

Skyline Warehouse on Preferences
Materializing all MDCs and precomputing skylines.
Using an Implicit Preference Order tree (IPO-tree) index.
Can answer skyline queries online with respect to any user preferences.
Details in [Wong et al., VLDB 2008].

Learning User Preferences
Realtors selling realties: a typical multi-criteria decision problem.
  User preferences on multiple dimensions: location, size, price, style, age, developer, ...
  Thousands of realties.
How can a realtor learn a user's preferences on dimensions?
Give a user a short list of realties and ask the user to pick the ones (s)he is/is not interested in.
  An interesting realty: a skyline point in the short list.
  An uninteresting realty: a non-skyline point in the short list.

Mining Preferences from Examples
Given a set of example points labeled skyline or non-skyline in a multidimensional space, can we learn the preferences on attributes?
Favorable facets are for one superior example only.
Mining the minimal satisfying preference sets (SPS).
  The simplest hypotheses that fit the superior and inferior examples.



Learning Methods
Complexity:
  The SPS existence problem is NP-hard.
  The minimal SPS problem is NP-hard.
A greedy approach:
  The term-based greedy algorithm.
  The condition-based greedy algorithm.
Details in [Jiang et al., KDD 08].

Multidimensional Analysis of Logs
Look-up: What are the top-5 electronics that were most popularly searched by the users in the US in December, 2009?
Reverse look-up: What are the group-bys in time and region where Apple iPad was popularly searched for?
Different users/applications may bear different concept hierarchies in mind in their multidimensional analysis.

A Topic-Concept Cube Approach
[Figure]

A Successful Case Study
[Figure]


Università degli Studi di Milano
Master Degree in Computer Science
Information Management course
Teacher: Alberto Ceselli
Lecture 14: 18/11/2014
Data Mining: Concepts and Techniques (3rd ed.) Chapter