Data Mining — Sharma Chakravarthy, IT Laboratory and CSE Department


Data Mining
Sharma Chakravarthy
IT Laboratory and CSE Department
The University of Texas at Arlington
sharma@cse.uta.edu
http://itlab.uta.edu/sharma

Outline
- Overview
- Association rules
- Database mining: association rule and graph mining
- Graph mining: one or two approaches
- Will try to discuss three papers
- Email classification application

Data Mining Motivation
Fraud division, some large telephone company:
"There are 10 billion records on 10 million customers in the main database. With all this information we have about our customers and all the calls they make, can't you just ask the database to figure out which lines have been set up temporarily and exhibited similar calling patterns in the same time periods? The information is in there, I just know it."
How do we find these guys?
"The key in business is to know something that nobody else knows." (Aristotle Onassis)

Problem
The find-similar problem just described is hard. Why?
- Massive amounts of data
- More and more online data stores (e.g., Web, corporate databases, etc.)
- No easy way to describe what to look for (e.g., What products need to be improved? Which books won't be checked out and can be taken off the shelves?)
- Traditional, interactive approaches fail: size of data, different purposes

Another Example
Marketing cellular phones:
- Churn is too high; turnover after the initial contract is too high
- What is a good strategy?
- Giving a new phone to everyone is too expensive (and wasteful)
- Bringing back customers after they leave is very difficult

What to Do
A few months before the contract expires, if one can predict which customers are likely to quit:
- Give an incentive to those who are likely to quit
- Don't do anything for those who are NOT likely to quit
How do I predict future behavior?
- Corporate palm reading! Human intuition!!
- Data mining (DM) or knowledge discovery (KDD)

Data Mining
Data mining (DM) is part of the knowledge discovery process, carried out to extract valid patterns and relationships in very large data sets.
- Usually you don't know what to look for; it is like a voyage into the unknown
- Regarded as unsupervised learning from basic facts (axioms) and data
- Roots in AI and statistics
- Uses techniques from machine learning, pattern recognition, statistics, databases, visualization, etc.

Another Definition
Data mining is the iterative and interactive process of discovering valid, novel, useful, previously unknown, and understandable patterns or models in massive data sets.
"The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [Frawley et al., 1992]

Constituents of Data Mining
- There is an element of discovery; what is discovered may be counter-intuitive even to the expert
- Exhaustive scan/processing of the available data
- Verification of a conjecture or hypothesis

Characteristics
Automated extraction of predictive information from large data sets. Key words:
- Automated
- Extraction
- Predictive
- Large data sets
A methodology is assumed (typically statistical).

Data Mining Enablers
- Reduced cost of storage
- Reduced cost of processing
- Ability to store, process, and manage large volumes of data (e.g., data warehouses, the Internet)
- New techniques such as association rules, sequence data processing, and text mining
However, scalability, visualization of results, and filtering of very large outputs are new issues!

Data Mining Has Come About Due To
Convergence of multiple technologies:
- Increase in computing power
- Application of statistical/machine learning algorithms
- Improved data management

Causality and Correlation
The two are different! Which one does mining try to identify?

Drivers
- Today, business intelligence (BI) systems are as important to corporations as transaction systems were earlier
- Mass personalization and better utilization of data
- Identify new and profitable markets, and channels to enter them
- Increase customer loyalty, profitability, lifetime value
- Decrease risk

AI and Statistics
If DM is rooted in AI and statistics, what is the need for DM?
- AI traditionally dealt with small samples; the emphasis was on learning, extrapolation, and generalization
- The emphasis in DM is on processing the actual data, not just samples!
- DM tries to leverage the data collected and accumulated, and derive tangible rules/conclusions (generalization is also possible)

Tower of Babel
Many communities, each with its own vocabulary: statistics, pattern recognition, machine learning, AI, databases, visualization.

Machine Learning
Observation -> Analysis -> Theory -> Prediction
Either the predictions are correct, in which case the theory is corroborated, or the predictions are wrong: new theory or exceptions!

DM vs. Machine Learning
- ML methods form the core of DM
- The amount of data makes a (big) difference
- Accessing examples can be a problem
- Missing values and incomplete data
- DM has more modest goals: automating the tedious discovery tasks

DM vs. Statistics
- Similar goals; different methods
- Amount of data
- DM as a preliminary stage for statistical analysis
- Challenge to DM: better ties with statistics

Data Mining Is NOT
- Data warehousing
- Ad hoc query/reporting
- Online Analytical Processing (OLAP)
- Data visualization
- Agents/mediators, pervasive computing, ...

What DM Is Not Likely to Do
Substitute for human intuition and discovery:
- I don't think a DM system will (ever?) discover e = mc^2
- I don't think DM will (ever?) discover PV = RT
- I don't think DM will (ever?) discover gravity or Newton's laws of motion
- It may discover new black holes!
- The value of pi is data-driven, but its intuition is not!

Applications
- Customer profiling: find new customers, ...
- Market basket analysis: manage inventory
- Risk analysis: insurance, loan, stock, ...
- Text analysis: library, search, ...
- Fraud detection
- CRM, scientific discovery, forecasting, ...

DM Applications vs. DM (effort breakdown)
- Problem, goal, and task definition (10%)
- Data warehousing: data collection and organization (50%)
- Data mining: data analysis and knowledge discovery (30%)
- Decision support / optimization: assess pros and cons, take actions (10%)

DM vs. DW
- DW makes DM a lot cheaper
- DM is one of the reasons for DW

OLAP vs. Data Mining
- OLAP is verification-driven: e.g., sales in CA vs. FL in Q1 of 2003
- DM is discovery-driven: e.g., why is Microsoft making so much money? Will Google be a successful IPO?
- OLAP is user driven: the analyst generates a hypothesis and uses OLAP to verify it (e.g., people with high debt are bad credit risks)
- A data mining tool generates the hypothesis: the tool performs the exploration (e.g., find risk factors for granting credit) and discovers new patterns that analysts didn't think of (e.g., debt-to-income ratio)
- OLAP and DM complement each other

How Is Data Mining Used?
Use data to build a model of the real world (domain of interest) describing patterns and relationships. Models are used in two ways:
- Guide business decisions (e.g., determine the layout of shelves in a grocery store)
- Make predictions (e.g., which recipients to include on a mailing list)
It is not magic: you still need to understand the data, its semantics, and statistics!

Things to Keep in Mind
- Misinterpretation of results
- Statistical significance
- Dirty data
- Too much information generated
- Legality
- Privacy/ethics

Traditional Data Analysis
DB -> query, graphics, statistics, reporting, ...

Data Mining Process
- Identify the necessary data and the granularity of each field
- Choose preprocessing and mining techniques
- Use tools to complement mining
- Interpret results
Note: this is an iterative process.

DM Process
- Assess and transform (DW)
- Select: reduces cost, increases speed
- Explore: summarize, segment, visualize
- Modify: data filtering, variable selection
- Model: regression, neural nets, decision trees, associations, sequences
- Assess (BI)

Data Mining Cycle
DB -> Preprocess -> Select -> Transform -> Mine -> Analysis -> Rethink (and iterate)

A Word About Data Quality
- Mining can be tolerant of some noise, but noise may lead to poor or even erroneous results
- Some common problems: missing fields, outliers or incorrect data, statistical significance
- Data warehouse integration and cleaning are a prerequisite for data mining; recall the integration process with its cleansing steps

Data Pyramid (bottom to top)
- Data sources
- Data warehouse / data marts
- Data exploration: querying, statistics, ...
- Data mining
- Visualization / analysis

Types of Data Analysis
- Supervised: classification, prediction
- Unsupervised: clustering, exploration, relevance
- Correlation rules (association rules)
- Time-series analysis
- Text classification/filtering
- Graph mining

DM Approaches
- Driven by business problems
- Optimize existing solutions/markets
- Find new markets

Predictive Modeling
A black box that makes predictions about the future based on information from the past and present. Usually a large number of inputs is available: e.g., the last few years' sales data and last month's sales data feed a model that produces this month's projection and this year's prediction.

Models
Some models are better than others:
- Understandability
- Accuracy
Models range from easy to understand to "how do I interpret the results?":
decision trees, rules, kNN, regression analysis, neural networks (easier -> harder)

Model Details
Qualitative use gives the analyst an understanding of the rules/classification:
- If 35 < age < 50 then buy expensive cars
- Now, with the recession, the above rule may change to: if 25 < age < 35 then trade your expensive car for an average car
- Interaction with the model and visualization

Using a Model
1999 data -> data mining system -> model; Sep 2000 data -> model -> Nov 2000 prediction

Using a Model (Quantitative)
An automated process; classification/scoring is done periodically (every month, whenever a mailing is done, ...):
- Classification into a finite set
- Estimation of a continuous numerical value (e.g., total worth of a customer)
- Scoring (a probability value)

Model Quality
Test the model on new data; error decreases as the amount (and representativeness) of training data grows.

Cross Validation
- Divide the data into n sets (of equal size)
- Use set i for validating, and build the model using sets 1, 2, ..., i-1, i+1, ..., n
- Repeat the above process for i from 1 through n
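The n-fold procedure above can be sketched in a few lines. This is a minimal illustration, not from the slides; the names (n_fold_error, train_fn, error_fn) are made up for the example, and the model-building and error-measuring steps are passed in as callables.

```python
import statistics

def n_fold_error(data, labels, train_fn, error_fn, n=5):
    """n-fold cross validation: hold out fold i, build the model on the
    remaining n-1 folds, and average the n validation errors."""
    folds = [list(range(i, len(data), n)) for i in range(n)]  # round-robin split
    errors = []
    for i in range(n):
        held_out = set(folds[i])
        train = [(data[j], labels[j]) for j in range(len(data)) if j not in held_out]
        test = [(data[j], labels[j]) for j in folds[i]]
        model = train_fn(train)
        errors.append(error_fn(model, test))
    return statistics.mean(errors)
```

Round-robin splitting keeps the folds roughly equal in size; in practice the data is usually shuffled first so that each fold is representative.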

Application of Statistics
- Techniques have been waiting for technology to catch up
- Statisticians have been doing small-scale data mining for decades
- Good data mining is the intelligent application of statistical processes (plus some new ones)
- Emphasis on scalability, handling large data sets, interactive capability, visualization, and integration with databases

Discussion: Technique Families
- Classification: neural networks, decision trees, kNN, Bayes, SVM
- Clustering: k-means; non-hierarchical and hierarchical methods
- Prediction: linear regression, multivariate regression
- Association rules (market-basket analysis): Apriori algorithm, FP-tree, use of taxonomies
- Time series analysis, sequence detection: clustering, significant interval discovery, event patterns
- Text filtering: topic identification, classification, filtering
Choosing the right approach for the domain of interest is an important and difficult task.

Data Mining Problems
- Classification: multiple categories (large/medium/small), value prediction, scoring
- Clustering/segmentation
- Association rule extraction (market basket analysis)
- Sequence detection (ordered, temporal data)
- Graph mining (for applications where structure is important and needs to be taken into account)

Data Mining Models (contd.)
- Classification (predicting): classifies a data item into one of the predefined classes
- Regression and time series analysis (forecasting): uses a series of existing values to forecast what continuous values will be
- Clustering (description of patterns): finds clusters that consist of similar records
- Association analysis and sequence discovery (description of behavior): discovers rules describing items that occur together in a given event or record
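To make the Apriori algorithm named above concrete, here is a small sketch of its level-wise search (the function name and data layout are my own for illustration): an itemset can only be frequent if every subset of it is frequent, so each level's candidates are generated from, and pruned against, the previous level.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Apriori-style level-wise search for frequent itemsets.
    transactions: list of frozensets; min_support: fraction in [0, 1]."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in items if support(s) >= min_support}
    while level:
        frequent.update({s: support(s) for s in level})
        # Join step: merge pairs of frequent k-itemsets that differ in one item.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune step: every k-subset of a surviving (k+1)-candidate must be
        # frequent, and the candidate itself must meet the support threshold.
        level = {c for c in candidates
                 if all(frozenset(s) in frequent
                        for s in combinations(c, len(c) - 1))
                 and support(c) >= min_support}
    return frequent
```

The prune step is where Apriori saves work: a candidate whose subsets were not all frequent is discarded without ever scanning the transactions for it.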

Classification
- Classify the input records based on attribute values
- Requires a training set and a class attribute
- Decision tree classifiers / neural-net classifiers
- Tree generation: SLIQ, SPRINT, CLOUDS, C4.5
- Tree pruning: MDL

Classification (contd.)
A process of building a model from a training set that classifies new data based upon attribute values. Popular classification models are neural networks, decision trees, and kNN. Classification models are widely used to solve business problems such as the creation of mailing lists for marketing purposes.

Approach
- Examine a collection of cases for which the group they belong to is already known
- Inductively determine the pattern of attributes or characteristics that identifies the group to which each case belongs
- The pattern can be used to understand the data as well as to predict how new instances will be classified

Neural Network Model
- Very loosely based on biology
- Inputs (I1, I2) are transformed via a network of processors into an output (O1)
- A processor combines weighted inputs and produces an output value

Neural Network
- A linear combination of inputs: simple linear regression
- A thresholded linear combination of inputs: the classic perceptron
- A non-linear combination of inputs: multi-layer neural networks, with the inputs fully connected to a hidden layer, and the hidden layer connected to the output layer
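The classic perceptron above fits in a few lines. A minimal sketch (function names are mine, not from the slides): the output is 1 when the weighted sum of the inputs exceeds a threshold, and each misclassification nudges the weights toward the example.

```python
def perceptron_train(examples, epochs=20, lr=0.1):
    """Classic perceptron: a thresholded linear combination of inputs.
    examples: list of (input_tuple, target) pairs with targets 0 or 1."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, target in examples:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out                      # 0 when correct
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def perceptron_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

For linearly separable data (such as a logical AND of the inputs) this update rule is guaranteed to converge; XOR, a non-linear combination, is exactly what it cannot learn and why multi-layer networks exist.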

Learning
Weights are adjusted by observing errors on the output and propagating the adjustments back through the fully connected hidden layer: back propagation.

Neural Network Issues
- Difficult to understand: the relationship between weights and variables is complicated, and there is no intuitive understanding of the results
- Training time: error depends on the sample size and the amount of effort spent fine-tuning
- Pre-processing of the data is often required

Decision Trees
A major data mining approach. Given one attribute (e.g., wealth), try to predict its value for new people by means of some of the other available attributes.
- Applies to categorical outputs
- Categorical attribute: an attribute which takes on two or more discrete values; also known as a symbolic attribute
- Real attribute: a column of real numbers

Decision Tables
1-d tables, 2-d tables, 3-d tables or cubes, ... but the number increases exponentially as the number of attributes increases:
- For 16 attributes, the number of 3-d tables is 16 choose 3 = 560
- For 100 attributes, it is 100 choose 3 = 161,700
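The decision-table count for 100 attributes (161,700) is the binomial coefficient C(100, 3): each 3-d table is an unordered choice of 3 of the attributes. The same formula can be checked quickly with Python's math.comb (the helper name is mine):

```python
from math import comb

def num_3d_tables(num_attributes):
    """Each 3-d decision table picks an unordered triple of attributes,
    so the count is C(n, 3) = n * (n - 1) * (n - 2) / 6."""
    return comb(num_attributes, 3)

print(num_3d_tables(16))   # 560
print(num_3d_tables(100))  # 161700
```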

Training Data Set
Tid | Job      | Age | Salary | Class
 0  | Self     | 30  | 30K    | C
 1  | Industry | 35  | 40K    | C
 2  | Univ.    | 50  | 70K    | B
 3  | Self     | 40  | 60K    | A
 4  | Univ.    | 30  | 70K    | B
 5  | Industry | 35  | 60K    | B
 6  | Self     | 35  | 60K    | A
 7  | Self     | 30  | 70K    | A

Sample Decision Tree (classification example)
- Salary <= 50K: Class C
- Salary > 50K, Job in {Univ., Industry}: Class B
- Salary > 50K, Job = Self, Age <= 40: Class A
- Salary > 50K, Job = Self, Age > 40: Class C

Centralized Decision Tree Induction Algorithm
1. Select a random subset (window) of the given instances
2. Repeat:
   - Build the decision tree that explains the current window
   - Find the exceptions to this decision tree among the remaining instances
   - Form a new window from the current window and the exceptions to the decision tree
3. Until there are no exceptions

Selection Criteria
- Entropy/information gain (Quinlan 1993)
- Gain ratio (used in C4.5)
- Gini index (used in CART)
- MDL (minimum description length)
A decision tree can also be seen as nested if/then rules.

Types of Decision Trees
- CHAID: Chi-Square Automatic Interaction Detection; Kass (1980); n-way splits; categorical variables
- CART: Classification and Regression Trees; Breiman, Friedman, Olshen, and Stone (1984); binary splits; continuous variables
- C4.5: Quinlan (1993); also used for rule induction
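Of the selection criteria listed, entropy/information gain is the easiest to show in code. A minimal sketch (the function names are mine): entropy H(S) = -sum p_i * log2(p_i) over the class proportions, and a split's gain is the drop in entropy it achieves.

```python
from math import log2

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(attr_values, labels):
    """Gain(S, A) = H(S) - sum_v (|S_v| / |S|) * H(S_v): how much
    splitting on this attribute reduces the class entropy."""
    n = len(labels)
    remainder = 0.0
    for v in set(attr_values):
        subset = [y for x, y in zip(attr_values, labels) if x == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder
```

A perfectly separating attribute recovers the full entropy as gain, while an attribute whose values are independent of the class has gain zero; decision tree induction greedily picks the attribute with the highest gain at each node.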

Nearest Neighbor Classification: the 1-NN Rule
We wish to label some observed pattern x with some class category θ. Two possible situations with respect to x and θ may occur:
- We may have complete statistical knowledge of the distribution of the observation x and category θ. In this case, a standard Bayes analysis yields an optimal decision procedure.
- We may have no knowledge of the distribution of x and θ aside from that provided by pre-classified samples. In this case, a decision to classify x into category θ depends only on a collection of correctly classified samples.
The nearest neighbor rule is concerned with the latter case. Such problems fall in the domain of non-parametric statistics; under these conditions, no classification procedure is optimal with respect to all underlying statistics.

The k-NN Rule
If the number of pre-classified points is large, it makes good sense to use, instead of the single nearest neighbor, the majority vote of the nearest k neighbors. This method is referred to as the k-NN rule. The k-NN rule only requires:
- An integer k
- A set of labeled examples (training data)
- A metric to measure closeness
The number k should be:
- Large, to minimize the probability of misclassifying
- Small (with respect to the number of samples), so that the points are close enough to x to give an accurate estimate of its true class
Disadvantages:
- Large storage requirements
- Computationally intensive
- Highly susceptible to the curse of dimensionality
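The three ingredients just listed — an integer k, labeled examples, and a closeness metric — fit in a few lines. A minimal sketch with Euclidean distance as the metric (the function name is mine):

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_classify(x, examples, k):
    """k-NN rule: majority vote of the k labeled examples closest to x.
    examples: list of (point, label) pairs; points are coordinate tuples."""
    neighbors = sorted(examples, key=lambda ex: dist(x, ex[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

Note the disadvantages from the slide are visible here: every query sorts the entire training set (computationally intensive), and all examples must be kept around (large storage).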

The k-nn Rule Classification 1-NNR versus k-nnr The use of large values of k has two main advantages Yields smoother decision regions Provides probabilistic information The ratio of examples for each class gives information about the ambiguity of the decision However, large values of k are detrimental It destroys the locality of the estimation since farther examples are taken into account In addition, it increases the computational burden Genetic algorithms Rough set approach Fuzzy set approach 69 70 Prediction Prediction Linear regression is used to make predictions about a single value. Simple linear regression involves discovering the equation for a line that most nearly fits the given data. That linear equation is then used to predict values for the data Example 1: A cost modeler wants to find the prospective cost for a new contract based on the data collected from previous contracts. Example 2: If the university authorities want to predict a student's grade on a freshman college calculus midterm based on his/her SAT score, then they may apply linear regression. Linear regression assumes that the expected value of the output for a given an input, E[y x], is linear. Simplest case: y = c + a*x where a and c can be computed from the data set And can be applied to any new value of x 71 72 18

Clustering
Unsupervised mining: segments a database into different groups. The goal is to find groups whose members share a notion of similarity:
- Members in each cluster are as similar as possible
- Members in different clusters have as few commonalities as possible
Unlike classification, you don't know at the start what the clusters will be, or around which attributes the data will cluster. Business analysts will need to analyze the clusters.

A Very Simple Clustering Algorithm
1. Choose k objects at random (or with maximum separation) and make them clusters
2. For each object in data set D, find the nearest cluster and assign the object to it
3. Find the new cluster center for each cluster
4. Repeat steps 2 and 3 until the cluster assignments no longer change
Problems:
- How to choose k? A huge problem! Meta approaches to choosing k are computationally intensive
- How to define nearness/similarity?

Clustering Algorithms and Similarity
Typically distance, i.e., Euclidean distance in multidimensional space, is used as the nearness or similarity measure, but it can differ based on the domain; domain knowledge is important. Typically, the square-error criterion is used for convergence of the iterative algorithm.
- Example 1: Setting up ATM machines. Euclidean distance is not useful; the driving distance between the population and the ATM is more important.
- Example 2: Battlefield movement. Obstacles in the path (hills, terrain) need to be taken into account.
Computing the distance may not be easy! Domain knowledge is important.
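The four steps above are essentially the k-means method. A minimal sketch, with Euclidean distance assumed as the similarity measure for illustration (the function name and parameters are mine):

```python
import random
from math import dist

def k_means(points, k, iterations=100, seed=0):
    """Very simple clustering: pick k objects at random as initial centers,
    assign every point to its nearest center, recompute each center as the
    mean of its cluster, and repeat until the assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # step 1: k random objects
    assignment = None
    for _ in range(iterations):
        # Step 2: assign each point to the nearest cluster center.
        new_assignment = [min(range(k), key=lambda i: dist(p, centers[i]))
                          for p in points]
        if new_assignment == assignment:      # step 4: converged
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for i in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == i]
            if cluster:                       # keep the old center if emptied
                centers[i] = tuple(sum(c) / len(cluster)
                                   for c in zip(*cluster))
    return centers, assignment
```

The choice of k is passed in rather than discovered, mirroring the "how to choose k?" problem noted above, and swapping dist for a domain-specific measure (driving distance, terrain cost) changes the clusters accordingly.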

Clustering (contd.)
A process of maximizing intra-cluster similarity while minimizing inter-cluster similarity. Requirements on clustering:
- Minimal requirements of domain knowledge
- Discovery of clusters with arbitrary shape
- Good efficiency on large databases

Families of Clustering Algorithms
- Partitioning algorithms: the k-means method (clusters represented by their center of gravity) and the k-medoid method (clusters represented by a central object)
- Hierarchical algorithms (e.g., the minimal spanning tree algorithm): the agglomerative approach (bottom-up) and the divisive approach (top-down)
- Density-based methods (for noisy data): DBSCAN, OPTICS, DENCLUE
- Grid-based methods (using a multiresolution grid data structure): STING, CLIQUE

More on Clustering
Clustering has been studied in many fields, including sociology, statistics, machine learning, and biology. Scalability was not a design goal: data was assumed to fit in main memory and the focus was on improving cluster quality, so classical methods do not scale to large data sets. Recently, a new set of algorithms has placed greater emphasis on scalability.

DM Summary
Technology (computation speed; fast, cheap, and large storage) has moved many of these approaches into mainstream usage. They can now be applied to actual data sets instead of samples: click-stream analysis, recommendation systems, and web search are all very large data size problems. We will also address the data filtering/aggregation problem in stream data processing, where we deal with large amounts of continuous data.

Questions???