Data Mining and Analytics Introduction
Data Mining
Data mining refers to extracting or mining knowledge from large amounts of data. It is also termed Knowledge Discovery from Data (KDD). Mostly, data mining is viewed as an essential step in the process of knowledge discovery.
Knowledge discovery, shown in Figure 1, consists of the following steps:
o Data cleaning
o Data integration
o Data selection
o Data transformation
o Data mining
o Pattern evaluation
o Knowledge presentation
Figure 1: Data mining as a step in the process of knowledge discovery
The first four steps are different forms of data preprocessing, in which the data are prepared for mining. The data mining step may interact with the user or a knowledge base. Interesting patterns are presented to the user and can be stored as new knowledge in a separate knowledge base.
Kinds of Data
Data for mining can be taken from a number of data repositories, including relational databases, data warehouses, transactional databases, spatial databases, time-series databases, multimedia databases, legacy databases, and the WWW.
Relational database
A relational database is a collection of tables, each of which is assigned a unique name. Each table has a set of attributes (columns or fields) and stores a large set of tuples (records or rows).
Data warehouse
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed through a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. Data warehouse systems are well suited for online analytical processing (OLAP). OLAP operations such as drill-down and roll-up allow the user to view the data at different levels of summarization.
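A roll-up can be imitated with ordinary grouping. Below is a minimal sketch using pandas (assumed available) on a hypothetical sales table; a real OLAP engine would operate on a precomputed data cube instead.

    import pandas as pd

    # Hypothetical sales data with two dimensions (country, city) and a measure.
    sales = pd.DataFrame({
        "country": ["USA", "USA", "Canada", "Canada"],
        "city":    ["NYC", "Chicago", "Toronto", "Ottawa"],
        "amount":  [120, 80, 50, 70],
    })

    # Drill-down view: totals at the finer (country, city) level.
    detail = sales.groupby(["country", "city"])["amount"].sum()

    # Roll-up: aggregate the city dimension away, leaving country-level totals.
    rollup = sales.groupby("country")["amount"].sum()
    print(detail, rollup, sep="\n\n")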
Spatial databases
In addition to usual data, spatial databases store geographical information such as maps and global or regional positioning data. Such databases present new challenges to data mining algorithms.
Multimedia databases
Multimedia databases include video, image, audio, and text media. They can be stored in object-oriented databases or simply in a file system. Multimedia data are characterized by high dimensionality, which makes data mining even more challenging. Mining multimedia repositories may require computer vision, image interpretation, computer graphics, and natural language processing methodologies.
World Wide Web (WWW)
The WWW is the most heterogeneous and dynamic repository available. A large number of authors and publishers contribute continuously to its growth, and a large number of users access it daily. Data on the WWW are organized in interconnected documents, which may be video, text, audio, or even raw data. The WWW comprises three major components: the content of the web, the structure of the web, and the usage of the web.
The content of the web comprises the documents available; the structure of the web comprises the relationships between documents; and the usage of the web describes how and when resources are accessed.
Data Mining Functionalities
What kinds of patterns can be mined? Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. They fall into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
Concept/Class Description: Characterization and Discrimination
Data characterization summarizes the data of the class under study (also called the target class). Data discrimination compares the target class with one or a set of comparative classes (also called the contrasting classes). Examples of each are given in the textbook.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data. The kinds of frequent patterns include frequent itemsets, frequent sequential patterns, and frequent structured patterns. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
Association Analysis
Suppose, as a marketing manager of AllElectronics, you would like to determine which items are frequently purchased together within the same transactions. An example rule mined from the AllElectronics transactional database is

buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]

where X is a variable representing a customer.
A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all the transactions under analysis show that computer and software are purchased together. This association rule involves a single attribute or predicate (buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules; rules that contain more than one predicate are referred to as multidimensional association rules.
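As a toy illustration (the transactions are invented), support and confidence for a rule like the one above can be computed directly by counting:

    # Each transaction is the set of items bought together (invented data).
    transactions = [
        {"computer", "software"},
        {"computer"},
        {"computer", "software", "printer"},
        {"printer"},
    ]

    n = len(transactions)
    both = sum(1 for t in transactions if {"computer", "software"} <= t)
    computer = sum(1 for t in transactions if "computer" in t)

    support = both / n            # fraction of all transactions containing both items
    confidence = both / computer  # P(buys software | buys computer)
    print(f"support = {support:.0%}, confidence = {confidence:.0%}")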
Classification and Prediction
Classification constructs models (functions) that describe and distinguish classes or concepts, in order to predict the class of objects whose class label is unknown. For example, in the weather judgment problem the class is play or don't play; in the contact lens recommendation problem the class is the lens type.
The derived model can be represented in various forms: a decision tree, a neural network, or if-then rules. Suppose that, instead of predicting categorical response labels for each store item, you would like to predict the amount of revenue that each item will generate during an upcoming sale at AllElectronics, based on previous sales data. This is prediction. Numeric prediction is a variant of classification learning in which the outcome is a numeric value rather than a category.
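A minimal classification sketch with scikit-learn (assumed available); the tiny weather-like dataset and its integer feature encoding are invented for illustration:

    from sklearn.tree import DecisionTreeClassifier

    # Features: [outlook, humidity] encoded as small integers (invented data).
    X = [[0, 0], [0, 1], [1, 0], [2, 1], [2, 0], [1, 1]]
    y = ["play", "dont_play", "play", "dont_play", "play", "play"]

    # Fit a shallow decision tree, one of the model forms listed above.
    model = DecisionTreeClassifier(max_depth=2).fit(X, y)

    # Predict the class of an object whose class label is unknown.
    print(model.predict([[0, 0]]))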
Cluster Analysis
Clustering groups similar instances into clusters. The objects are clustered based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity: clusters are formed so that objects within a cluster have high similarity to one another, but are very dissimilar to objects in other clusters.
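A minimal clustering sketch with scikit-learn's k-means (assumed available), applied to two invented groups of 2-D points:

    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # group near (1, 1)
                       [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])  # group near (8, 8)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(km.labels_)            # cluster assignment for each point
    print(km.cluster_centers_)   # centroids; points share a cluster with similar points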
Outlier Analysis
In some applications, such as fraud detection, rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier mining.
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Examples include stock market analysis and inventory control.
Are all the patterns interesting?
Data mining may generate thousands of patterns, and not all of them are interesting. What makes a pattern interesting? A pattern is interesting if it is: 1) easily understood by humans; 2) valid on new or test data with some degree of certainty; 3) potentially useful; 4) novel; or 5) validates some hypothesis that a user seeks to confirm.
Can a data mining system generate all of the interesting patterns? This question refers to the completeness of a data mining algorithm. Do we need to find all of the interesting patterns? Association rule mining is an example where completeness is commonly expected. Can a data mining system generate only interesting patterns? Ideally, the system would generate only the interesting patterns, which is an optimization problem for data mining.
Data Mining Task Primitives
The data mining primitives specify the following:
o Task-relevant data
o Kind of knowledge to be mined
o Background knowledge
o Interestingness measures
o Knowledge presentation and visualization techniques to be used for displaying the discovered patterns
Issues in Data Mining
1) Mining methodology and user interaction issues:
o Mining different kinds of knowledge in databases
o Interactive mining of knowledge at multiple levels of abstraction
o Incorporation of background knowledge
o Data mining query languages and ad hoc data mining
o Presentation and visualization of data mining results
o Handling noisy or incomplete data
o Pattern evaluation
2) Performance issues:
o Efficiency and scalability of data mining algorithms
o Parallel, distributed, and incremental mining algorithms
3) Issues related to the diversity of database types:
o Handling relational and complex types of data
o Mining information from heterogeneous databases and global information systems
Descriptive Data Summarization
Descriptive data summarization helps us learn the characteristics of the data better: the central tendency and the dispersion of the data. Measures of central tendency include mean, median, mode, and midrange. Measures of data dispersion include quartiles, interquartile range (IQR), and variance.
Mean (an algebraic measure; sample mean vs. population mean): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
o Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
o Trimmed mean: the mean obtained after chopping extreme values from both ends
Median (a holistic measure): the middle value if there is an odd number of values, or the average of the middle two values otherwise
o Estimated by interpolation for grouped data: $median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c$, where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ is the sum of the frequencies of the intervals below it, $f_{median}$ is the frequency of the median interval, and $c$ is its width
Mode: the value that occurs most frequentlyly in the data; a distribution may be unimodal, bimodal, or trimodal
o Empirical formula for moderately skewed data: $mean - mode \approx 3 \times (mean - median)$
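These central-tendency measures can be computed directly; here is a minimal Python sketch on an invented sample (the weights are likewise invented):

    import statistics as st

    data = [1, 2, 2, 3, 4, 5, 30]        # invented sample; 30 is an extreme value
    weights = [1, 1, 1, 2, 2, 1, 1]      # invented weights for the weighted mean

    mean = st.mean(data)
    weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)
    median = st.median(data)             # middle value (mean of middle two if n is even)
    mode = st.mode(data)                 # most frequent value
    midrange = (min(data) + max(data)) / 2

    # Trimmed mean: chop one extreme value from each end, then average the rest.
    trimmed_mean = st.mean(sorted(data)[1:-1])
    print(mean, weighted_mean, median, mode, midrange, trimmed_mean)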
Symmetric vs. Skewed Data
[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]
In symmetric data the mean, median, and mode coincide; in skewed data the mean is pulled toward the longer tail.
Measuring the Dispersion of Data
Quartiles, outliers, and boxplots:
o Quartiles: $Q_1$ (25th percentile), $Q_3$ (75th percentile)
o Inter-quartile range: $IQR = Q_3 - Q_1$
o Five-number summary: min, $Q_1$, median, $Q_3$, max
o Boxplot: the ends of the box are the quartiles, the median is marked by a line, whiskers extend from the box, and outliers are plotted individually
o Outlier: usually, a value more than $1.5 \times IQR$ above $Q_3$ or below $Q_1$
Variance and standard deviation (sample: $s$; population: $\sigma$):
o Sample variance (algebraic, scalable to compute): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
o Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
o The standard deviation $s$ (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$)
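A minimal NumPy sketch of these dispersion measures on an invented sample (NumPy assumed available):

    import numpy as np

    data = np.array([6, 7, 8, 9, 10, 11, 12, 13, 40])   # 40 is a likely outlier

    q1, med, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    five_number = (data.min(), q1, med, q3, data.max())  # min, Q1, median, Q3, max

    # The usual boxplot rule: flag values beyond 1.5 x IQR from the quartiles.
    outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

    s2 = data.var(ddof=1)   # sample variance (n - 1 in the denominator)
    s = data.std(ddof=1)    # sample standard deviation
    print(five_number, iqr, outliers, s2, s)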
Properties of the Normal Distribution Curve
For the normal (distribution) curve:
o From $\mu - \sigma$ to $\mu + \sigma$: contains about 68% of the measurements ($\mu$: mean, $\sigma$: standard deviation)
o From $\mu - 2\sigma$ to $\mu + 2\sigma$: contains about 95% of the measurements
o From $\mu - 3\sigma$ to $\mu + 3\sigma$: contains about 99.7% of the measurements
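These percentages can be checked numerically; a quick sketch with SciPy (assumed available):

    from scipy.stats import norm

    for k in (1, 2, 3):
        frac = norm.cdf(k) - norm.cdf(-k)   # probability mass within mu +/- k*sigma
        print(f"within {k} sigma: {frac:.4f}")
    # Prints approximately 0.6827, 0.9545, and 0.9973.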
Boxplot Analysis
The five-number summary of a distribution consists of the median, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order minimum, Q1, median, Q3, maximum. In a boxplot:
o The data are represented with a box
o The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
o The median is marked by a line within the box
o Whiskers: two lines outside the box extend to the minimum and maximum (or, when outliers are plotted individually, to the most extreme observations within 1.5 x IQR of the quartiles)
Visualization of Data Dispersion: Boxplot Analysis
Histogram Analysis
A histogram for an attribute A partitions the distribution of A into disjoint subsets, or buckets. Typically, the width of each bucket is uniform. Each bucket is represented by a rectangle whose height is equal to the count or relative frequency of the values in the bucket. If A is numeric, the term histogram is preferred; for categorical data the chart is usually called a bar chart.
Quantile Plot
A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it displays all of the data for the given attribute, allowing the user to assess both the overall behavior and unusual occurrences. Second, it plots quantile information.
Quantile-Quantile (Q-Q) Plot
A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another.
Scatter Plot
A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numerical attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as a point in the plane.
Loess Curve
A loess curve is another important exploratory graphic aid that adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence. The word loess is short for local regression.
[Figures: scatter plots of positively correlated, negatively correlated, and uncorrelated data]
Data Preprocessing
Data are preprocessed in order to deal with noisy, incomplete, and inconsistent data. The preprocessing techniques are as follows.
Data cleaning removes noisy data, handles missing values, and resolves inconsistent data. Missing values can be handled by the following methods:
1) Ignore the tuple
2) Fill in the missing value manually
3) Use a global constant to fill in the missing value
4) Use the attribute mean to fill in the missing value
5) Use the attribute mean for all samples belonging to the same class as the given tuple
6) Use the most probable value to fill in the missing value
Noisy data can be smoothed by the following methods:
1) Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, or bin boundaries (see the sketch after this list)
2) Regression: linear regression involves finding the best line to fit two attributes, so that one attribute can be used to predict the other; multiple linear regression is an extension of linear regression in which more than two attributes are involved
3) Clustering: outliers may be detected by clustering, where similar values are organized into groups, or clusters, and values falling outside the clusters may be considered outliers
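A minimal sketch of equal-frequency binning with smoothing by bin means, on an invented sorted sample with three values per bin:

    # Equal-frequency binning (3 values per bin), then smoothing by bin means.
    data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # invented attribute values

    bins = [data[i:i + 3] for i in range(0, len(data), 3)]
    smoothed = [sum(b) / len(b) for b in bins for _ in b]
    print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
    print(smoothed)   # each value replaced by its bin mean: 9.0, 22.0, 29.0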
Data Integration
Data integration merges data from different data stores. Issues to be considered during data integration include:
1) Entity identification problem: identifying real-world entities across multiple data sources
2) Redundancy: redundant data often occur when multiple databases are integrated; redundant attributes may be detected by correlation analysis
3) Detection and resolution of data value conflicts: for the same real-world entity, attribute values from different sources may differ, due to differences in representation, scaling, or encoding
Data Transformation
The data are transformed or consolidated into forms appropriate for mining. It involves the following:
1) Smoothing: remove noise from the data
2) Aggregation: summarization, data cube construction
3) Generalization of the data: concept hierarchy climbing
4) Normalization: attribute values are scaled to fall within a small, specified range (a sketch of the three methods appears after this list)
o Min-max normalization: $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
o Z-score normalization: $v' = \frac{v - \bar{A}}{\sigma_A}$, where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute A
o Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
5) Attribute construction: new attributes are constructed from the given ones
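A minimal sketch applying the three normalization methods to one invented attribute:

    # Min-max, z-score, and decimal-scaling normalization (invented values).
    v = [200, 300, 400, 600, 986]
    mn, mx = min(v), max(v)
    mean = sum(v) / len(v)
    std = (sum((x - mean) ** 2 for x in v) / len(v)) ** 0.5

    new_min, new_max = 0.0, 1.0                        # target range for min-max
    min_max = [(x - mn) / (mx - mn) * (new_max - new_min) + new_min for x in v]
    z_score = [(x - mean) / std for x in v]

    # Decimal scaling: smallest j such that every scaled value is below 1 in magnitude.
    j = 0
    while max(abs(x) for x in v) / 10 ** j >= 1:
        j += 1
    decimal = [x / 10 ** j for x in v]
    print(min_max, z_score, decimal, sep="\n")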
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the dataset that is much smaller in volume, yet closely maintains the integrity of the original data. The strategies used for data reduction are as follows:
1) Data cube aggregation
2) Attribute subset selection, which includes the following techniques:
o Stepwise forward selection
o Stepwise backward elimination
o Combination of forward selection and backward elimination
o Decision tree induction
3) Dimensionality reduction: two effective methods are the following:
o Wavelet transforms: the length, L, of the data vector must be an integer power of 2 (padding with 0s when necessary); each transform applies two functions, smoothing and difference, to pairs of data points, resulting in two sets of data of length L/2; the two functions are applied recursively until the desired length is reached
o Principal component analysis (PCA): given N data vectors from n dimensions, find $k \le n$ orthogonal vectors (principal components) that can best be used to represent the data; works for numeric data only; used when the number of dimensions is large
4) Numerosity reduction: some of the techniques for numerosity reduction are as follows:
o Regression and log-linear models: in linear regression, the data are modeled to fit a straight line, y = wx + b; multiple linear regression allows a response variable, y, to be modeled as a linear function of two or more predictor variables; log-linear models approximate discrete multidimensional probability distributions
o Histograms: divide the data into buckets and store the average (or sum) for each bucket; partitioning rules include:
  - Equal-width: equal bucket range
  - Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
  - V-optimal: the histogram with the least variance (a weighted sum over the original values that each bucket represents)
  - MaxDiff: for β buckets, set bucket boundaries between the pairs of adjacent values having the β−1 largest differences
o Clustering: partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter); this can be very effective if the data are clustered, but not if the data are smeared
o Sampling: obtain a small sample s to represent the whole data set N (sketched below); common schemes are:
  - Simple random sample without replacement (SRSWOR) of size s
  - Simple random sample with replacement (SRSWR) of size s
  - Cluster sample
  - Stratified sample
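A minimal sketch of SRSWOR, SRSWR, and a stratified sample on an invented dataset (the even/odd strata are purely illustrative):

    import random

    random.seed(0)                       # for reproducibility
    N = list(range(100))                 # the "whole data set" (invented)
    s = 10                               # sample size

    srswor = random.sample(N, s)                    # without replacement
    srswr = [random.choice(N) for _ in range(s)]    # with replacement

    # Stratified sample: draw proportionally from each stratum (here: even/odd).
    strata = {"even": [x for x in N if x % 2 == 0],
              "odd":  [x for x in N if x % 2 == 1]}
    stratified = [x for group in strata.values()
                  for x in random.sample(group, s // 2)]
    print(srswor, srswr, stratified, sep="\n")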
Discretization and Concept Hierarchy Generation
Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Discretization methods can be supervised vs. unsupervised, and split (top-down) vs. merge (bottom-up); discretization can be performed recursively on an attribute.
o Binning (covered above): top-down split, unsupervised
o Histogram analysis (covered above): top-down split, unsupervised
o Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
Entropy-based discretization (supervised, top-down split): given a set of samples S and a boundary T partitioning S into $S_1$ and $S_2$, the information after partitioning is
$I(S, T) = \frac{|S_1|}{|S|} Entropy(S_1) + \frac{|S_2|}{|S|} Entropy(S_2)$
Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of $S_1$ is
$Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
where $p_i$ is the probability of class i in $S_1$. The boundary that minimizes I(S, T) is selected, and the process is applied recursively (a small entropy computation is sketched below).
Interval merging by $\chi^2$ analysis (supervised, bottom-up merge): find the best neighboring intervals and merge them recursively to form larger intervals. In ChiMerge, each distinct value of a numerical attribute A is initially considered to be one interval; $\chi^2$ tests are performed for every pair of adjacent intervals, and adjacent intervals with the least $\chi^2$ values are merged, since low $\chi^2$ values for a pair indicate similar class distributions. This merging proceeds recursively until a predefined stopping criterion is met.
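A minimal sketch computing the entropy formula above for one candidate boundary T (the class labels are invented):

    import math

    def entropy(labels):
        # Class-distribution entropy: -sum(p_i * log2(p_i)) over the classes present.
        n = len(labels)
        return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                    for c in set(labels))

    # Invented class labels on either side of a candidate boundary T.
    S1 = ["yes", "yes", "yes", "no"]     # samples with attribute value < T
    S2 = ["no", "no", "yes"]             # samples with attribute value >= T

    n = len(S1) + len(S2)
    I = len(S1) / n * entropy(S1) + len(S2) / n * entropy(S2)
    print(round(I, 3))                   # about 0.857; choose the T minimizing I(S, T)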
Discretization by Intuitive Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:
o If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equal-width intervals
o If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
o If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Example of the 3-4-5 Rule
Suppose profits range from −$351 (Min) to $4,700 (Max), with the 5th percentile (Low) at −$159 and the 95th percentile (High) at $1,838.
Step 1: Work with Low and High rather than Min and Max, to avoid distortion by outliers.
Step 2: The most significant digit (msd) is at the $1,000 position, so round Low down to −$1,000 and High up to $2,000.
Step 3: The range (−$1,000, $2,000] covers 3 distinct values at the msd, so partition it into 3 equal-width intervals: (−$1,000, $0], ($0, $1,000], ($1,000, $2,000].
Step 4: Adjust the boundaries to fit Min and Max: since Min = −$351, the lower boundary of the first interval is moved in to −$400; since Max = $4,700 exceeds $2,000, a new interval ($2,000, $5,000] is added. Each top-level interval can then be partitioned recursively by the same rule: (−$400, $0] into 4 sub-intervals of width $100, ($0, $1,000] into 5 sub-intervals of width $200, ($1,000, $2,000] into 5 sub-intervals of width $200, and ($2,000, $5,000] into 3 sub-intervals of width $1,000.
Concept Hierarchy Generation for Categorical Data
o Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country
o Specification of a hierarchy for a set of values by explicit data grouping, e.g., {Urbana, Champaign, Chicago} < Illinois
o Specification of only a partial set of attributes, e.g., only street < city, and not the others
o Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values, e.g., for the set of attributes {street, city, state, country}, the attribute with the most distinct values (street) is placed at the lowest level of the hierarchy
Summary of Unit I
o Data mining as a step in knowledge discovery
o Introduction to data mining
o Kinds of data
o Data mining functionalities
o Interesting patterns
o Data mining task primitives
o Issues in data mining
o Data preprocessing techniques