Data Quality. Data Cleaning and Integration. Data Cleaning. Data Preprocessing. Handling Missing Values. Disguised Missing Data?


J. Pei: Big Data Analytics -- Data Cleaning and Integration

Data Quality
  Accuracy
  Completeness
  Consistency
  Timeliness
  Believability
  Interpretability

Data Preprocessing
  Processing data before an analytic task
    Improve data quality
    Transform data to facilitate the target task
  Major tasks
    Data cleaning
    Data integration
    Data reduction
    Data transformation

Data Cleaning
  The process of detecting and correcting corrupt or inaccurate records in data
    Handling missing values
    Smoothing data

Handling Missing Values
  Ignore records with missing values
  Fill in missing values
    Manually
    Using a global constant
    Using a measure of central tendency for the attribute, such as the mean, median, or mode
    Using the central tendency of the class
    Using the most probable value

Disguised Missing Data?
  Online forms
  Disguised missing data are missing entries that are not explicitly represented as missing, but instead appear as potentially valid data values
  Example: information about "State" is missing, and "Alabama" is used as the disguise
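
Two of the fill-in strategies listed under Handling Missing Values (attribute mean and class mean) can be sketched as follows. This is a minimal illustration, not the lecture's code: the function names are hypothetical, and missing entries are assumed to be represented as None.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of the observed values in the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

filled = fill_with_mean([10, None, 30])
filled_by_class = fill_with_class_mean(
    [10, None, 30, None, 50], ["a", "a", "b", "b", "b"]
)
```

Note that a class with no observed value at all would raise a KeyError here; a real pipeline would need a fallback such as the global mean.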

Disguised Missing Data Is Misleading
  Wrong conclusions
  Unreasonable results
  [Figure: number of customers by state (Alabama, Ohio, Washington), real values versus disguised missing values]

Types of Disguised Missing Data
  A valid value is chosen at random as the disguise
  A small number of values are chosen as disguises

Problem Definition
  Cleaning disguised missing data
    Given a table T with attributes A and an integer k
    For each attribute A_i, output k candidates for frequently used disguise values
  Examples
    "Alabama" in state
    0 in blood pressure
    21 in age

Ideas
  Observation 1: frequently used disguises
    A small number of values are frequently used as disguises
  Observation 2: missing at random
    Missing data are often distributed randomly
    A random subset of the whole database
  [Figure: number of customers by state (Alabama, Ohio, Washington)]

General Framework
  For each attribute A
    For each frequent value v in A
      Compute the maximal embedded unbiased sample contained in T_v
    Return the k values with the best (in both quality and size) embedded unbiased samples

  Id  State    Age  Gender
  1   Alabama  30   M
  2   Alabama  30   M
  3   Alabama  30   F
  4   Alabama  20   F
  5   Ohio     20   F
  6   Ohio     20   F

Smoothing Noisy Data
  Noise: a random error or variance in a measured variable
  Smoothing noise: removing noise

Binning
  Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
  Partition into (equal-frequency) bins:
    Bin 1: 4, 8, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 28, 34
  Smoothing by bin means:
    Bin 1: 9, 9, 9
    Bin 2: 22, 22, 22
    Bin 3: 29, 29, 29
  Smoothing by bin boundaries:
    Bin 1: 4, 4, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 25, 34

Regression
  [Figure]

Outlier Analysis
  [Figure]

Data Cleaning as a Process
  Data discrepancy detection
    Use metadata (e.g., domain, range, dependency, distribution)
    Check field overloading
    Check the uniqueness rule, consecutive rule, and null rule
    Use commercial tools
      Data scrubbing: use simple domain knowledge (e.g., postal codes, spell checking) to detect errors and make corrections
      Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
  Data migration and integration
    Data migration tools: allow transformations to be specified
    ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
  Integration of the two processes
    Iterative and interactive (e.g., Potter's Wheel)

Data Integration
  Combining data from multiple (autonomous and heterogeneous) sources
  Providing a unified view
  Why is data integration hard?
    Systems challenges
    Data logical organization challenges
    Social and administrative challenges

Data Integration System Architecture
  [Figure]
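
The Binning example on this page can be reproduced with a short sketch. The function names are hypothetical, and the code assumes the input is already sorted and that its length is a multiple of the bin size, as in the price example.

```python
def equal_frequency_bins(sorted_values, bin_size):
    """Split sorted values into consecutive bins of bin_size items each."""
    return [sorted_values[i:i + bin_size]
            for i in range(0, len(sorted_values), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Ties between the two boundaries are resolved toward the lower boundary here; the slides do not say which way ties should go.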

Wrappers
  Computer programs that extract content from a particular data source and transform it into a target form, such as a relational table
  Example: CMS (content management system) wrapper

    <html>
    <head> <title> %page_title%</title> </head>
    <body>
    %page_content%
    <P> %page_powered_by%
    </body>
    </html>

How to Build Wrappers?
  Manual construction
  Machine learning based methods: learning schemas from training data
    Supervised learning approaches
    Unsupervised learning approaches

Schema Matching and Mapping
  Schema matching: finding the semantic correspondences between attributes in the data sources and those in the mediated schema
    Example: attribute name in source S1 corresponds to attributes firstname and surname in the mediated schema
    Name based matching
    Instance based matching
  Schema mapping: transforming attribute values from the sources to the mediated schema
    Example: a query or a program extracting name values from source S1 and forming firstname and surname values for the mediated schema

Entity Detection and Recognition
  Entity detection: identify atomic elements in text or other data and classify them into predefined categories such as person names, locations, organizations, etc.
  Entity disambiguation: identify entities carrying the same name

Example
  [Figure]

Data Provenance
  The data about how a data entry came to be
    Also known as data lineage/pedigree
  The annotation approach: a series of annotations describing how each data item was produced
  The graph of data relationships approach: connecting sources and deriving new data items via mapping

Deep / Hidden Web
  Sites that are difficult for a crawler to find
    Probably over 100 times larger than the traditionally indexed web
  Three major categories of sites in the deep web
    Private sites: intentionally private, with no incoming links, or may require login
    Form results: only accessible by entering data into a form, e.g., airline ticket queries
      Hard to detect changes behind a form
    Scripted pages: using JavaScript, Flash, or another client-side language in the web page
      A crawler needs to execute the script, which can slow down crawling significantly
  The deep web is different from dynamic pages
    Wikis dynamically generate web pages but are easy to crawl
    Private sites are static but cannot be crawled

Jian Pei: Big Data Analytics -- Multidimensional Analysis

Outline
  Why multidimensional analysis?
  Multidimensional analysis principle
  OLAP
  OLAP indexes

Dimensions
  "An aspect or feature of a situation, problem, or thing, a measurable extent of some kind" (dictionary definition)
  Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
    Objects are compared in selected dimensions/attributes
  More often than not, objects have more dimensions/attributes than one is interested in and can handle

Multi-dimensional Analysis
  Find interesting patterns in multi-dimensional subspaces
    Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)
    Different patterns may be manifested in different subspaces
  Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction
    A set of features for all objects
    Different subspaces may manifest different patterns

OLAP
  Conceptually, we may explore all possible subspaces for interesting patterns
  What patterns are interesting?
  How can we explore all possible subspaces systematically and efficiently?
    Fundamental problems in analytics and data mining
  Aggregates and group-bys are frequently used in data analysis and summarization

    SELECT time, altitude, AVG(temp)
    FROM weather
    GROUP BY time, altitude;

    In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times and group-bys are used 20 times
  Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently

OLAP Operations
  Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
    (Day, Store, Product type, SUM(sales)) -> (Month, City, *, SUM(sales))
  Drill down (roll down): the reverse of roll-up, going from a higher level summary to a lower level summary or detailed data, or introducing new dimensions

Other Operations
  Dice: pick specific values or ranges on some dimensions
  Pivot: rotate a cube, changing the order of dimensions in visual analysis

Relational Representation
  If there are n dimensions, there are 2^n possible aggregation columns
  [Figure: roll-up by model, by year, by color in a table]

Difficulties
  Many group-bys are needed
    6 dimensions -> 2^6 = 64 group-bys
  In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!

Dummy Value ALL
  [Figure]

CUBE

  SALES
  Model  Year  Color  Sales
  Chevy  1990  red    5
  Chevy  1990  white  87
  Chevy  1990  blue   62
  Chevy  1991  red    54
  Chevy  1991  white  95
  Chevy  1991  blue   49
  Chevy  1992  red    31
  Chevy  1992  white  54
  Chevy  1992  blue   71
  Ford   1990  red    64
  Ford   1990  white  62
  Ford   1990  blue   63
  Ford   1991  red    52
  Ford   1991  white  9
  Ford   1991  blue   55
  Ford   1992  red    27
  Ford   1992  white  62
  Ford   1992  blue   39

    SELECT Model, Year, Color, SUM(sales) AS Sales
    FROM Sales
    WHERE Model IN ('Ford', 'Chevy')
      AND Year BETWEEN 1990 AND 1992
    GROUP BY CUBE(Model, Year, Color);

  DATA CUBE
  Model  Year  Color  Sales
  Chevy  1990  blue   62
  Chevy  1990  red    5
  Chevy  1990  white  87
  Chevy  1990  ALL    154
  Chevy  1991  blue   49
  Chevy  1991  red    54
  Chevy  1991  white  95
  Chevy  1991  ALL    198
  Chevy  1992  blue   71
  Chevy  1992  red    31
  Chevy  1992  white  54
  Chevy  1992  ALL    156
  Chevy  ALL   blue   182
  Chevy  ALL   red    90
  Chevy  ALL   white  236
  Chevy  ALL   ALL    508
  Ford   1990  blue   63
  Ford   1990  red    64
  Ford   1990  white  62
  Ford   1990  ALL    189
  Ford   1991  blue   55
  Ford   1991  red    52
  Ford   1991  white  9
  Ford   1991  ALL    116
  Ford   1992  blue   39
  Ford   1992  red    27
  Ford   1992  white  62
  Ford   1992  ALL    128
  Ford   ALL   blue   157
  Ford   ALL   red    143
  Ford   ALL   white  133
  Ford   ALL   ALL    433
  ALL    1990  blue   125
  ALL    1990  red    69
  ALL    1990  white  149
  ALL    1990  ALL    343
  ALL    1991  blue   104
  ALL    1991  red    106
  ALL    1991  white  104
  ALL    1991  ALL    314
  ALL    1992  blue   110
  ALL    1992  red    58
  ALL    1992  white  116
  ALL    1992  ALL    284
  ALL    ALL   blue   339
  ALL    ALL   red    233
  ALL    ALL   white  369
  ALL    ALL   ALL    941

Semantics of ALL
  ALL is a set
    Model.ALL = ALL(Model) = {Chevy, Ford}
    Year.ALL = ALL(Year) = {1990, 1991, 1992}
    Color.ALL = ALL(Color) = {red, white, blue}
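
What GROUP BY CUBE(Model, Year, Color) computes on the SALES table can be sketched naively: every base row contributes its measure to each of the 2^3 cells obtained by replacing any subset of its dimension values with ALL. The function name `cube` is hypothetical, and this brute-force loop only illustrates the semantics; it is not how a database evaluates the query.

```python
from collections import defaultdict
from itertools import product

SALES = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87), ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54), ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54), ("Chevy", 1992, "blue", 71),
    ("Ford", 1990, "red", 64), ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9), ("Ford", 1991, "blue", 55),
    ("Ford", 1992, "red", 27), ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]

def cube(rows, n_dims):
    """Sum the measure for every cell of the data cube (SUM aggregate only)."""
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Each mask says which dimensions are generalized to ALL.
        for mask in product([False, True], repeat=n_dims):
            cell = tuple("ALL" if generalized else value
                         for value, generalized in zip(dims, mask))
            totals[cell] += measure
    return dict(totals)

data_cube = cube(SALES, 3)
```

With 2 models, 3 years, and 3 colors, the cube has (2+1) x (3+1) x (3+1) = 48 cells, matching the DATA CUBE table.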

OLTP Versus OLAP

                      OLTP                                  OLAP
  users               clerk, IT professional                knowledge worker
  function            day to day operations                 decision support
  DB design           application-oriented                  subject-oriented
  data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                      flat relational, isolated             integrated, consolidated
  usage               repetitive                            ad-hoc
  access              read/write, index/hash on prim. key   lots of scans
  unit of work        short, simple transaction             complex query
  # records accessed  tens                                  millions
  # users             thousands                             hundreds
  DB size             100MB-GB                              100GB-TB
  metric              transaction throughput                query throughput, response

What Is a Data Warehouse?
  "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
  Data warehousing: the process of constructing and using data warehouses

Subject-Oriented
  Organized around major subjects, such as customer, product, sales
  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
  Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Integrated
  Integrating multiple, heterogeneous data sources
    Relational databases, flat files, on-line transaction records
  Data cleaning and data integration
    Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
      E.g., hotel price: currency, tax, breakfast covered, etc.
    When data is moved to the warehouse, it is converted

Time Variant
  The time horizon for the data warehouse is significantly longer than that of operational systems
    Operational databases: current value data
    Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
  Every key structure in the data warehouse contains an element of time, explicitly or implicitly
    But the key of operational data may or may not contain a time element

Nonvolatile
  A physically separate store of data transformed from the operational environment
  Operational updates of data do not occur in the data warehouse environment
    Does not require transaction processing, recovery, and concurrency control mechanisms
    Requires only two operations in data accessing
      Initial loading of data
      Access of data

Why Separate Data Warehouse?
  High performance for both systems
    Operational DBMS: tuned for OLTP
    Warehouse: tuned for OLAP
  Different functions and different data
    Historical data: data analysis often uses historical data that operational databases do not typically maintain
    Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources

Star Schema
  Sales fact table: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, state_or_province, country

Snowflake Schema
  Sales fact table: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city: city_key, city, state_or_province, country

Fact Constellation
  Sales fact table: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
  Shipping fact table: time_key, item_key, shipper_key, from_location, to_location
    Measures: dollars_cost, units_shipped
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type

(Good) Aggregate Functions
  Distributive: there is a function G() such that
    $F(\{X_{i,j}\}) = G(\{F(\{X_{i,j} \mid i = 1, \ldots, I\}) \mid j = 1, \ldots, J\})$
    Examples: COUNT(), MIN(), MAX(), SUM()
    G = SUM() for COUNT()
  Algebraic: there is an M-tuple valued function G() and a function H() such that
    $F(\{X_{i,j}\}) = H(\{G(\{X_{i,j} \mid i = 1, \ldots, I\}) \mid j = 1, \ldots, J\})$
    Examples: AVG(), standard deviation, MaxN(), MinN()
    For AVG(), G() records the sum and count of each subset; H() adds up these two components across subsets and divides to produce the global average

Holistic Aggregate Functions
  There is no constant bound on the size of the storage needed to describe a sub-aggregate
    There is no constant M such that an M-tuple characterizes the computation $F(\{X_{i,j} \mid i = 1, \ldots, I\})$
  Examples: Median(), MostFrequent() (also called Mode()), and Rank()
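
The distributive and algebraic definitions above can be checked on a small partitioned data set. In this sketch the helper names `partial_avg` and `combine_avg` are hypothetical stand-ins for G() and H() in the AVG() case, and the three partitions are made-up illustration data.

```python
partitions = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
all_values = [x for part in partitions for x in part]

# Distributive: SUM of the partial SUMs, and SUM of the partial COUNTs.
total_sum = sum(sum(part) for part in partitions)
total_count = sum(len(part) for part in partitions)

def partial_avg(part):
    """G(): summarize one partition as a fixed-size (sum, count) pair."""
    return (sum(part), len(part))

def combine_avg(pairs):
    """H(): add up sums and counts across partitions, then divide."""
    s = sum(p[0] for p in pairs)
    n = sum(p[1] for p in pairs)
    return s / n

global_avg = combine_avg([partial_avg(part) for part in partitions])
```

A holistic function such as the median has no such fixed-size summary: in general the partial medians are not enough to recover the global median.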

Index Requirements in OLAP
  Data is read only
    (Almost) no insertion or deletion
  Query types
    Point query: looking up one specific tuple (rare)
    Range query: returning the aggregate of a (large) set of tuples, with group by
    Complex queries: need specific algorithms and index structures, to be discussed later

OLAP Query Example
  In table (cust, gender, ...), find the total number of male customers
  Method 1: scan the table once
  Method 2: build a B+ tree index on attribute gender; still need to access all tuples of male customers
  Can we get the count without scanning many tuples, not even all tuples of male customers?

Bitmap Index
  For n tuples, a bitmap index has n bits and can be packed into $\lceil n/8 \rceil$ bytes and $\lceil n/32 \rceil$ words
  From a bit to the row-id: the j-th bit of the p-th byte -> row-id = p*8 + j

  cust   gender
  Jack   M
  Cathy  F
  Nancy  F

Using Bitmap to Count
  shcount[] contains the number of bits in the entry subscript
    shcount[ ] = 4

    count = 0;
    for (i = 0; i < SHNUM; i++)
        count += shcount[b[i]];

Advantages of Bitmap Index
  Efficient in space
  Ready for logic composition
    C = C1 AND C2
    Bitmap operations can be used
  A bitmap index only works for categorical data with low cardinality
    Naively, we need 50 bits per entry to represent the state of a customer in the US
    How to represent a sale in dollars?

Bit-Sliced Index
  A sale amount can be written as an integer number of pennies, and then represented as a binary number of N bits
    24 bits is good for up to $167,772.15, appropriate for many stores
  A bit-sliced index is N bitmaps
    Tuple j is set in bitmap k if the k-th bit in its binary representation is on
    The space cost of a bit-sliced index is the same as storing the data directly
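
The bitmap counting idea on this page (a precomputed table of bit counts, plus AND for combining conditions) can be sketched as follows. Several things here are assumptions for illustration: the table is indexed by bytes rather than by the slide's unit of SHNUM entries, bitmaps are held as Python integers with bit row_id standing for row row_id, and the `city` column is made up.

```python
# SHCOUNT[v] = number of 1 bits in the byte value v (the role of shcount[]).
SHCOUNT = [bin(v).count("1") for v in range(256)]

def build_bitmaps(column):
    """One bitmap per distinct value; bit row_id is set if the row has that value."""
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps

def count_bits(bitmap):
    """Count set bits one byte at a time using the lookup table."""
    count = 0
    while bitmap:
        count += SHCOUNT[bitmap & 0xFF]
        bitmap >>= 8
    return count

gender = build_bitmaps(["M", "F", "F"])                       # Jack, Cathy, Nancy
city = build_bitmaps(["Vancouver", "Vancouver", "Burnaby"])   # hypothetical column

female_count = count_bits(gender["F"])
# C = C1 AND C2: female customers in Vancouver, without touching the rows.
female_vancouver = count_bits(gender["F"] & city["Vancouver"])
```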

Using Indexes

    SELECT SUM(sales) FROM Sales WHERE C;

  Tuples satisfying C are identified by a bitmap B
  Direct access to rows to calculate SUM: scan the whole table once
  B+ tree: find the tuples from the tree
  Projection index: only scan attribute sales
  Bit-sliced index: get the sum from $\sum_k \mathrm{COUNT}(B \text{ AND } B_k) \cdot 2^k$

Cost Comparison
  A traditional value-list index (B+ tree) is costly in both I/O and CPU time
    Not good for OLAP
  A bit-sliced index is efficient in I/O
  Other case studies in [O'Neil and Quass, SIGMOD 97]

Horizontal or Vertical Storage
  A fact table for data warehousing is often fat
    Tens or even hundreds of dimensions/attributes
  A query is often about only a few attributes
  Horizontal storage: tuples are stored one by one
  Vertical storage: tuples are stored by attributes
  [Figure: a table with attributes A_1, A_2, ..., A_100 and tuples (x_1, x_2, ..., x_100), ..., (z_1, z_2, ..., z_100), stored horizontally and vertically]

Horizontal Versus Vertical
  Find the information of tuple t
    Typical in OLTP
    Horizontal storage: get the whole tuple in one search
    Vertical storage: search 100 lists
  Find SUM(a_100) GROUP BY {a_22, a_83}
    Typical in OLAP
    Horizontal storage (no index): search all tuples, O(100n), where n is the number of tuples
    Vertical storage: search 3 lists, O(3n), 3% of the cost of the horizontal storage method
  Projection index: vertical storage

Rolling-up/Drilling-down Analysis
  Roll up by model, by year, by color
  [Figure: a pivot table. Not a relational table: many NULL values, no key]

Extending GROUP BY

    SELECT Manufacturer, Year, Month, Day, Color, Model,
           SUM(price) AS Revenue
    FROM Sales
    GROUP BY Manufacturer,
             ROLLUP Year(Time) AS Year, Month(Time) AS Month, Day(Time) AS Day,
             CUBE Color, Model;

  [Figure: Manufacturer; Year, Mo, Day; Model x Color cubes]
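
The bit-sliced way of answering SELECT SUM(sales) ... WHERE C can be sketched like this: slice k is a bitmap of the rows whose value has bit k set, and the sum is the count of selected rows in each slice weighted by 2^k. The function names and the sample amounts are hypothetical; bitmaps are again Python integers with bit row_id standing for row row_id.

```python
def build_bit_slices(values, n_bits):
    """Slice k has bit row_id set if bit k of values[row_id] is on."""
    slices = [0] * n_bits
    for row_id, value in enumerate(values):
        for k in range(n_bits):
            if (value >> k) & 1:
                slices[k] |= 1 << row_id
    return slices

def bit_sliced_sum(slices, selection):
    """SUM over selected rows = sum_k COUNT(selection AND slice_k) * 2^k."""
    return sum(bin(selection & s).count("1") << k
               for k, s in enumerate(slices))

sales_pennies = [500, 8700, 6200, 5400]   # made-up sale amounts in pennies
slices = build_bit_slices(sales_pennies, 24)

selection = 0b1011                         # bitmap B: rows 0, 1, and 3 satisfy C
selected_sum = bit_sliced_sum(slices, selection)
```

Only the 24 slice bitmaps and the selection bitmap are read; the individual rows are never fetched.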

DATA CUBE
  [Figure: the SALES table, the GROUP BY CUBE(Model, Year, Color) query, and the resulting data cube, as in the CUBE example earlier]

MOLAP
  [Figure: a three-dimensional array with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum)]

Pros and Cons
  Easy to implement
  Fast retrieval
  Many entries may be empty if data is sparse
  Costly in space

ROLAP: Data Cube in Table
  A multi-dimensional database

  Base table
  Store  Product  Season  Sales
  S1     P1       Spring  6
  S1     P2       Spring  12
  S2     P1       Fall    9

  After cubing
  Store  Product  Season  AVG(Sales)
  S1     P1       Spring  6
  S1     P2       Spring  12
  S2     P1       Fall    9
  S1     *        Spring  9
  ...
  *      *        *       9

Observations
  Once a base table (A, B, C) is sorted by A-B-C, aggregates (*,*,*), (A,*,*), (A,B,*) and (A,B,C) can be computed with one scan and 4 counters
  To compute other aggregates, we can sort the base table in some other orders

How to Sort the Base Table?
  General sorting in main memory: O(n log n)
  Counting in main memory: O(n), linear in the number of tuples in the base table
    How to sort 1 million integers in the range 1 to 100?
      Set up 100 counters and initialize them to 0
      Scan the integers once, counting the occurrences of each value in 1 to 100
      Scan the integers again, putting the integers in the right places
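
The counting approach under How to Sort the Base Table? can be sketched as below. The function name and its `lo`/`hi` parameters are hypothetical. One simplification is assumed: because plain integers carry no payload, the second pass writes the values out from the counters instead of re-scanning the input; with full tuples, the second scan would place each tuple at an offset derived from the counters.

```python
def counting_sort(values, lo, hi):
    """Sort integers known to lie in [lo, hi] in O(n + (hi - lo)) time."""
    counts = [0] * (hi - lo + 1)          # one counter per possible value
    for v in values:                       # first scan: count occurrences
        counts[v - lo] += 1
    result = []
    for offset, c in enumerate(counts):    # emit each value c times, in order
        result.extend([lo + offset] * c)
    return result

sorted_values = counting_sort([3, 1, 100, 2, 1], 1, 100)
```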

Iceberg Cube
  In a data cube, many aggregate cells are trivial
    Having an aggregate too small
  Iceberg query

Monotonic Iceberg Condition
  If COUNT(a, b, *) < 100, then COUNT(a, b, c) < 100 for any c
  For cells c_1 and c_2, c_1 is called an ancestor of c_2 if in all dimensions where c_1 takes a non-* value, c_2 agrees with c_1
    (a,b,*) is an ancestor of (a,b,c)
  An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can satisfy P

Pushing Monotonic Conditions
  BUC searches the aggregates bottom-up in a depth-first manner
  The descendants of the current node are expanded only when the monotonic condition holds

How to Push Non-Monotonic Ones?
  Condition P(c) = AVG(price) >= 800 AND COUNT(*) >= 50 is not monotonic
  BUC cannot push such a constraint

Ideas
  Let AVG_k(price) be the average of the top-k tuples
    AVG_k(price) >= 800 is a monotonic condition
    If the top-10 average of (Vancouver, *, *) is less than 800, the top-10 average of (Vancouver, laptop, *) cannot be 800 or more
  AVG_k(price) >= 800 can be a filter for AVG(price) >= 800
    If AVG_k(price) < 800, then AVG(price) < 800
    Generally, AVG() <= AVG_k()

Minimal Cubing
  Computing only a shell of a data cube
    Only compute and materialize low dimensional cuboids, dimensionality < k (k << n)
    Saves space and cubing time
  Indexing the shell cells as well as their covers (the tuples contributing to the shell cells)
  Query answering
    Using the shell cells and their intersections to compute the non-materialized cells
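
Pushing a monotonic condition in the BUC style can be sketched for the condition COUNT(*) >= min_count. This is a simplified illustration under stated assumptions, not the published BUC algorithm: the function name is hypothetical, partitioning uses dictionaries rather than sorting, and the three-row base table is borrowed from the store/product/season example.

```python
def iceberg_cube(rows, n_dims, min_count):
    """Return {cell: count} for all cells with count >= min_count."""
    result = {}

    def expand(cell, subset, start):
        if len(subset) < min_count:
            return                      # monotonic pruning: skip all descendants
        result[cell] = len(subset)
        for d in range(start, n_dims):  # specialize one more dimension
            groups = {}
            for row in subset:
                groups.setdefault(row[d], []).append(row)
            for value, group in groups.items():
                expand(cell[:d] + (value,) + cell[d + 1:], group, d + 1)

    expand(("*",) * n_dims, rows, 0)
    return result

base = [("S1", "P1", "s"), ("S1", "P2", "s"), ("S2", "P1", "f")]
iceberg = iceberg_cube(base, 3, 2)
```

Starting from (*,*,*) and adding one dimension at a time, a cell is expanded only if it passes the count threshold, so (S2,*,*) and everything below it is never visited here.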

A Data Cube Is Often Huge
  10 dimensions, cardinality 20 for each dimension -> 21^10 = 16,679,880,978,201 possible tuples in the cube
  Even if only 1/1,000 of the possible tuples are not empty, there are still more than 16 billion tuples

Compression of Data Cubes
  Traditional compression methods, e.g., zip
    High compression ratio
    The compression cannot be queried directly
  Requirements for data cube compression
    The compression can be queried efficiently
    High compression ratio
  Lossless compression and lossy compression

Redundancy in Data Cube
  A base table with only one tuple (a_1, ..., a_100, 1000) and aggregate function SUM()
  The data cube contains 2^100 tuples!
    Every query about SUM() returns 1000
  A data cube or a sub-cube may be populated by a single tuple: a base single tuple
    We do not need to pre-compute and store all aggregates

A Little More General Case
  A base table with two tuples, t_1 = (a_1, a_2, b_3, b_4, 100) and t_2 = (a_1, a_2, c_3, c_4, 1000), and aggregate function SUM()
  (a_1, a_2, *, *), (a_1, *, *, *), (*, a_2, *, *) and (*, *, *, *) all have sum 1100, since they are populated by the group of tuples {t_1, t_2}: base group tuples

Semantic Compression
  Can we summarize a data cube so that the summarization can be browsed and understood effectively?
    The summarization itself is a compression
    The compression preserves the roll-up/drill-down relation
    Directly query-able and browse-able for OLAP
  Syntactic compression
    Does not preserve the roll-up/drill-down semantics
    Directly query-able for some queries, but may not be directly browse-able for OLAP

Cube Cell Lattice
  Observation: many cells may have the same aggregate values
  Can we summarize the semantics of the cube by grouping cells by aggregate values?

  Cell lattice (cell: AVG(Sales)), from the base cells up to the apex:
    (S1,P1,s):6  (S1,P2,s):12  (S2,P1,f):9
    (S1,*,s):9  (S1,P1,*):6  (*,P1,s):6  (S1,P2,*):12  (*,P2,s):12  (S2,*,f):9  (S2,P1,*):9  (*,P1,f):9
    (S1,*,*):9  (*,*,s):9  (*,P1,*):7.5  (*,P2,*):12  (*,*,f):9  (S2,*,*):9
    (*,*,*):9

A Naïve Attempt
  Put all cells with the same aggregate value into a class
  The result is not a lattice anymore!
  Anomaly: the roll-up/drill-down semantics is lost
  [Figure: the cube cell lattice with cells grouped into classes by aggregate value]

A Better Partitioning
  Quotient cube: a partitioning that preserves the roll-up/drill-down semantics
  [Figure: the cube cell lattice partitioned into classes C1 to C5]

Why Is Semantic Compression Useful?
  OLAP browsing
  [Figure: browsing the class lattice (C1 to C5) and drilling into class C3, which contains (S2,P1,f):9, (S2,*,f):9, (S2,P1,*):9, (*,P1,f):9, (*,*,f):9 and (S2,*,*):9]

Goals
  Given a cube, characterize a good way (the quotient cube way) of partitioning its cells into classes such that
    The partition generates a reduced lattice preserving the roll-up/drill-down semantics
    The partition is optimal: the number of classes is as small as possible
  Compute, index, and store quotient cubes efficiently to answer OLAP queries

Why Equivalent Aggregate Values?
  Two cells have equivalent aggregate values if they cover the same set of tuples in the base table
  [Figure: the cube cell lattice, with the tuples of the base table at the bottom level]

Cover Partition
  For a cell c, a tuple t in the base table is in c's cover if t can be rolled up to c
    E.g., Cov(S1,*,spring) = {(S1,P1,spring), (S1,P2,spring)}

  Store  Product  Season  Sales
  S1     P1       Spring  6
  S1     P2       Spring  12
  S2     P1       Fall    9

Cover Partitions & Aggregates
  All cells in a cover partition carry the same aggregate value with respect to any aggregate function
    But cells in a class of MIN() may have different covers
  For COUNT() and SUM() (positive), cover equivalence coincides with aggregate equivalence

Quotient Cube
  A quotient cube is a quotient lattice of the cube lattice such that
    Each class is convex and connected
    All cells in a class carry the identical aggregate value w.r.t. a given aggregate function
  A quotient cube preserves the roll-up/drill-down semantics

Multi-Criteria Decision Problems
  [Figure: an example with two dimensions and preferences]
  Multidimensional decision problems have a long history: more than 2300 years
  Multidimensional decision problems are often challenging

Skyline: Best Tradeoffs
  Two dimensions: distance to water and height
  Skyline: the buildings that are not dominated by any other buildings in both dimensions
  [Figure: SFU Harbor Center]

Skyline: Formal Definition
  A set of objects S in an n-dimensional space D = (D_1, ..., D_n)
    Numeric dimensions for illustration in this talk
  For u, v in S, u dominates v if u is better than v in one dimension and u is not worse than v in any other dimension
    For illustration in this talk, the smaller the better
  u in S is a skyline object if u is not dominated by any other object in S
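
The cover partition defined on this page can be computed by brute force for the three-tuple base table: enumerate every cell, compute its cover, and group cells with identical covers. The function names are hypothetical, seasons are abbreviated to "s" and "f" as in the lattice figures, and exhaustive enumeration is only feasible for a toy example like this one.

```python
from itertools import product

BASE = [("S1", "P1", "s", 6), ("S1", "P2", "s", 12), ("S2", "P1", "f", 9)]

def cover(cell, base):
    """Indexes of the base tuples that roll up to the given cell."""
    return frozenset(
        i for i, t in enumerate(base)
        if all(c == "*" or c == v for c, v in zip(cell, t))
    )

def cover_partition(base, n_dims):
    """Group all non-empty cells by their cover."""
    domains = [sorted({t[d] for t in base}) + ["*"] for d in range(n_dims)]
    classes = {}
    for cell in product(*domains):
        cov = cover(cell, base)
        if cov:                          # skip cells covering no base tuple
            classes.setdefault(cov, []).append(cell)
    return classes

classes = cover_partition(BASE, 3)
```

Since cells in one class share the same base tuples, they agree on every aggregate, which is the first statement under Cover Partitions & Aggregates.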

Example
  [Figure: scatter plot of price vs. travel time, with the skyline points highlighted and two points u and v marked]

Skyline Computation
  First investigated as the maximum vector problem in [Kung et al. JACM 1975]
    An $O(n \log^{d-2} n)$ time algorithm for $d \ge 4$ and an $O(n \log n)$ time algorithm for $d = 2$ and $3$
  Divide-and-conquer-based methods: DD&C, LD&C, FLET
  Skyline computation in database context
    Data cannot be held in main memory
    External algorithms

Skyline Computation on Large DB
  A rule of thumb in database research: scalability on large databases
  Index-based methods
    Using bitmaps and the relationships between the skyline and the minimum coordinates of individual points, by Tan et al.
    Using nearest-neighbor search by Kossmann et al.
    The progressive branch-and-bound method by Papadias et al.
  Index-free methods
    Divide-and-conquer and block nested loops by Borzsonyi et al.
    Sort-first-skyline (SFS) by Chomicki et al.

Full Space Skyline Is Not Enough!
  Skylines in subspaces
    Skyline in space (# stops, price, travel-time)
    If one does not care about # stops, how can we derive the superior trade-offs between price and travel-time from the full space skyline?
  Sky cube: computing skylines in all nonempty subspaces (Yuan et al., VLDB 05)
    A database/data warehousing approach
    Any subspace skyline queries can be answered (efficiently)

Sky Cube
  [Figure]

Understanding Skylines
  Both Wilt Chamberlain and Michael Jordan are in the full space skyline of the Great NBA Players
  Data mining/exploration-driven questions
    Which merits, respectively, really make them outstanding?
    How are they different?
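The sky cube idea (skylines in all nonempty subspaces) can be illustrated with a brute-force sketch. This is not the algorithm of Yuan et al.; it simply recomputes a naive skyline for each subset of dimensions. The flight data and all function names are hypothetical, and smaller values are assumed to be better in every dimension.

```python
from itertools import combinations


def dominates(u, v, dims):
    """u dominates v in the subspace `dims` (smaller is better)."""
    no_worse = all(u[d] <= v[d] for d in dims)
    strictly_better = any(u[d] < v[d] for d in dims)
    return no_worse and strictly_better


def subspace_skyline(points, dims):
    """Points not dominated by any other point when only `dims` are compared."""
    return [p for p in points if not any(dominates(q, p, dims) for q in points)]


def sky_cube(points, n_dims):
    """Map every nonempty subspace (a tuple of dimension indexes) to its skyline."""
    cube = {}
    for k in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), k):
            cube[dims] = subspace_skyline(points, dims)
    return cube


if __name__ == "__main__":
    # Hypothetical flights: (# stops, price, travel time)
    flights = [(0, 900, 10), (1, 500, 8), (2, 400, 12), (1, 600, 7)]
    for dims, sky in sky_cube(flights, 3).items():
        print(dims, sky)
```

In this toy data the non-stop flight (0, 900, 10) is in the full space skyline only because of its number of stops; once that dimension is dropped, it is dominated in (price, travel time). Materializing all 2^n - 1 subspace skylines this way is exponential in the number of dimensions, which is why the slides point to more concise representations.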

Redundancy in Sky Cube
  Does it just happen that skylines in multiple subspaces are identical?

Mining Decisive Subspaces
  Decisive subspaces: the minimal combinations of factors that determine the (subspace) skyline membership of an object
  Examples
    Total rebounds for Chamberlain
    For Jordan, (total points, total rebounds, total assists) and (games played, total points, total assists)
  Details in [Pei et al., VLDB 2005]

Database & Data Mining Can Meet
  Conceptually, computing skylines in all subspaces
  Only computing skyline groups and their decisive subspaces
    Concise representation, leading to fast algorithms [Pei et al., ACM TODS 2006]
  Improvement: borrowing frequent itemset mining techniques to speed up computation in high dimensional spaces [Pei et al., ICDE 2007]

DB Extensions and Applications
  Improving database query answering
    Efficient skyline query answering in subspaces [Tao et al., ICDE 2006]
    Effective summary of skyline: distance-based representative skyline [Tao et al., ICDE 2009]
  Extensions in data types
    Probabilistic skylines on uncertain data [Pei et al., VLDB 2007]
    Interval skyline queries on time series [Jiang and Pei, ICDE 2009]

Dynamic User Preferences
  [Figure]

Personalized Recommendations
  Different customers may have different preferences

Favorable Facet Mining
  A set of points in a multidimensional space
    Fully ordered attributes: the preference orders are fixed, e.g., price, star-level, and quality
    (Categorical) Partially ordered attributes: the preference orders are not fully determined, e.g., airlines, hotel groups, and property types
      Some templates may apply, e.g., single houses > semi-detached houses
  Favorable facets of a point p: the partial orders that make p in the skyline

Monotonicity of Partial Orders
  If p is not in the skyline with respect to a partial order R, then p is not in the skyline with respect to any partial order stronger than R

Minimal Disqualifying Conditions
  For a point p, a most general partial order that disqualifies p from the skyline is a minimal disqualifying condition (MDC)
  Any partial order stronger than an MDC cannot make p in the skyline
  How to compute MDCs efficiently?
    MDC-O: computing MDCs on the fly
    MDC-M: materializing MDCs
  Details in [Wong et al., KDD 2007]

Skyline Warehouse on Preferences
  Materializing all MDCs and precomputing skylines
  Using an Implicit Preference Order tree (IPO-tree) index
  Can answer skyline queries online with respect to any user preferences
  Details in [Wong et al., VLDB 2008]

Learning User Preferences
  Realtors selling realties: a typical multi-criteria decision problem
    User preferences on multiple dimensions: location, size, price, style, age, developer, ...
    Thousands of realties
  How can a realtor learn a user's preferences on dimensions?
    Give a user a short list of realties and ask the user to pick the ones (s)he is/is not interested in
    An interesting realty: a skyline point in the short list
    An uninteresting realty: a non-skyline point in the short list

Mining Preferences from Examples
  Given a set of example points labeled skyline or non-skyline in a multidimensional space, can we learn the preferences on attributes?
    Favorable facets are for one superior example only
  Mining the minimal satisfying preference sets (SPS)
    The simplest hypotheses that fit the superior and inferior examples
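To make the role of a partial order concrete, the sketch below checks whether a candidate preference fits a set of labeled examples. It is a deliberately simplified setting, not the method of Wong et al. or Jiang et al.: each point has one numeric attribute (price, smaller is better) and one categorical attribute, and a preference is assumed to be a transitively closed set of (better, worse) category pairs. All names and data are hypothetical.

```python
def dominates(u, v, pref):
    """u, v are (price, category) pairs; pref is a set of (better, worse)
    category pairs, assumed to form a transitively closed strict partial order."""
    price_no_worse = u[0] <= v[0]
    cat_better = (u[1], v[1]) in pref
    cat_no_worse = u[1] == v[1] or cat_better
    strictly_better = u[0] < v[0] or cat_better
    return price_no_worse and cat_no_worse and strictly_better


def skyline(points, pref):
    """Points not dominated by any other point under the preference `pref`."""
    return [p for p in points if not any(dominates(q, p, pref) for q in points)]


def satisfies(points, superior, inferior, pref):
    """True if every superior example is a skyline point and every inferior
    example is a non-skyline point under `pref`."""
    sky = set(skyline(points, pref))
    return all(p in sky for p in superior) and all(p not in sky for p in inferior)


if __name__ == "__main__":
    # Hypothetical realties: (price in $1000s, property type)
    homes = [(300, "single"), (350, "semi"), (320, "condo")]
    weak = {("single", "semi")}
    strong = weak | {("single", "condo")}
    for pref in (set(), weak, strong):
        print(sorted(pref), skyline(homes, pref))
```

With no preference on property type all three homes are skyline points. Adding single > semi disqualifies (350, "semi"), and the stronger order that also ranks single above condo keeps it disqualified, which matches the monotonicity property stated earlier. A learner would search for a smallest such preference set that makes `satisfies` return True.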

Learning Methods
  Complexity
    The SPS existence problem is NP-hard
    The minimal SPS problem is NP-hard
  A greedy approach
    The term-based greedy algorithm
    The condition-based greedy algorithm
  Details in [Jiang et al., KDD 08]

Multidimensional Analysis of Logs
  Look-up: What are the top-5 electronics that were most popularly searched by the users in the US in December, 2009?
  Reverse look-up: What are the group-bys in time and region where Apple iPad was popularly searched for?
  Different users/applications may bear different concept hierarchies in mind in their multidimensional analysis

A Topic-Concept Cube Approach
  [Figure]

A Successful Case Study
  [Figure]


More information

DATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843

DATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843 DATA CUBE : A RELATIONAL AGGREGATION OPERATOR GENERALIZING GROUP-BY, CROSS-TAB AND SUB-TOTALS SNEHA REDDY BEZAWADA CMPT 843 WHAT IS A DATA CUBE? The Data Cube or Cube operator produces N-dimensional answers

More information

CHAPTER 3 Implementation of Data warehouse in Data Mining

CHAPTER 3 Implementation of Data warehouse in Data Mining CHAPTER 3 Implementation of Data warehouse in Data Mining 3.1 Introduction to Data Warehousing A data warehouse is storage of convenient, consistent, complete and consolidated data, which is collected

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Data Warehouse. Asst.Prof.Dr. Pattarachai Lalitrojwong

Data Warehouse. Asst.Prof.Dr. Pattarachai Lalitrojwong Data Warehouse Asst.Prof.Dr. Pattarachai Lalitrojwong Faculty of Information Technology King Mongkut s Institute of Technology Ladkrabang Bangkok 10520 pattarachai@it.kmitl.ac.th The Evolution of Data

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

DATA WAREHOUING UNIT I

DATA WAREHOUING UNIT I BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009

More information

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems Data Warehousing & Mining CPS 116 Introduction to Database Systems Data integration 2 Data resides in many distributed, heterogeneous OLTP (On-Line Transaction Processing) sources Sales, inventory, customer,

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

CS 245: Database System Principles. Warehousing. Outline. What is a Warehouse? What is a Warehouse? Notes 13: Data Warehousing

CS 245: Database System Principles. Warehousing. Outline. What is a Warehouse? What is a Warehouse? Notes 13: Data Warehousing Recall : Database System Principles Notes 3: Data Warehousing Three approaches to information integration: Federated databases did teaser Data warehousing next Mediation Hector Garcia-Molina (Some modifications

More information

Data Mining & Analytics Data Mining Reference Model Data Warehouse Legal and Ethical Issues. Slides by Michael Hahsler

Data Mining & Analytics Data Mining Reference Model Data Warehouse Legal and Ethical Issues. Slides by Michael Hahsler Data Mining & Analytics Data Mining Reference Model Data Warehouse Legal and Ethical Issues Slides by Michael Hahsler Data Mining & Analytics Analytics is the discovery and communication of meaningful

More information

UNIT 2 Data Preprocessing

UNIT 2 Data Preprocessing UNIT 2 Data Preprocessing Lecture Topic ********************************************** Lecture 13 Why preprocess the data? Lecture 14 Lecture 15 Lecture 16 Lecture 17 Data cleaning Data integration and

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Outline. Managing Information Resources. Concepts and Definitions. Introduction. Chapter 7

Outline. Managing Information Resources. Concepts and Definitions. Introduction. Chapter 7 Outline Managing Information Resources Chapter 7 Introduction Managing Data The Three-Level Database Model Four Data Models Getting Corporate Data into Shape Managing Information Four Types of Information

More information

MSCIT 5210/MSCBD 5002: Knowledge Discovery and Data Mining

MSCIT 5210/MSCBD 5002: Knowledge Discovery and Data Mining MSCIT 5210/MSCBD 5002: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei 2012 Han, Kamber &

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Data Mining and Analytics. Introduction

Data Mining and Analytics. Introduction Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data

More information

Data Warehousing and OLAP

Data Warehousing and OLAP Data Warehousing and OLAP INFO 330 Slides courtesy of Mirek Riedewald Motivation Large retailer Several databases: inventory, personnel, sales etc. High volume of updates Management requirements Efficient

More information

Processing of Very Large Data

Processing of Very Large Data Processing of Very Large Data Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SHRI ANGALAMMAN COLLEGE OF ENGINEERING & TECHNOLOGY (An ISO 9001:2008 Certified Institution) SIRUGANOOR,TRICHY-621105. DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year / Semester: IV/VII CS1011-DATA

More information

OLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube

OLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube OLAP2 outline Multi Dimensional Data Model Need for Multi Dimensional Analysis OLAP Operators Data Cube Demonstration Using SQL Multi Dimensional Data Model Multi dimensional analysis is a popular approach

More information

CS614 - Data Warehousing - Midterm Papers Solved MCQ(S) (1 TO 22 Lectures)

CS614 - Data Warehousing - Midterm Papers Solved MCQ(S) (1 TO 22 Lectures) CS614- Data Warehousing Solved MCQ(S) From Midterm Papers (1 TO 22 Lectures) BY Arslan Arshad Nov 21,2016 BS110401050 BS110401050@vu.edu.pk Arslan.arshad01@gmail.com AKMP01 CS614 - Data Warehousing - Midterm

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.

More information

Data Warehousing and Data Mining

Data Warehousing and Data Mining Data Warehousing and Data Mining Lecture 3 Efficient Cube Computation CITS3401 CITS5504 Wei Liu School of Computer Science and Software Engineering Faculty of Engineering, Computing and Mathematics Acknowledgement:

More information