Knowledge Discovery and Data Mining

Computer Science 591Y
Department of Computer Science
University of Massachusetts Amherst
February 3, 2005

Topics
  Tasks (definition, example, and notes)
    Classification
    Regression
    Dependency finding
    Clustering
    Anomaly detection
  How to identify different tasks

Tasks
  Classification: predicting a discrete variable
  Regression: predicting a continuous variable
  Dependency finding: describing relationships among variables
  Clustering: describing groupings of instances
  Anomaly detection: identifying unusual instances

Classification

Classification defined
  For a set of data instances, each of which is characterized by variables X = {x_1, x_2, ..., x_n}, assign each instance a value of y, where y is a discrete variable with a finite number of values.

Example method
  Classification trees
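
As a minimal sketch of the idea behind a classification tree, the code below fits a tree with a single split (often called a decision stump) on one numeric variable. The data, the function names, and the choice of training error as the split criterion are all illustrative assumptions; practical tree learners use several variables, deeper trees, and other split criteria.

```python
from collections import Counter


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(xs, ys):
    """Fit a one-level classification tree on a single numeric variable.

    Tries every midpoint between consecutive distinct x values as a split
    threshold and keeps the one with the fewest training errors.
    Returns (threshold, label_if_x_at_or_below, label_if_x_above).
    """
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        l_label, r_label = majority(left), majority(right)
        errors = sum(y != l_label for y in left) + sum(y != r_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_label, r_label)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def predict(stump, x):
    """Assign a discrete class label y to an instance with value x."""
    t, l_label, r_label = stump
    return l_label if x <= t else r_label


# Hypothetical data: one variable x and a discrete label y per instance.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = ["a", "a", "a", "b", "b", "b"]
stump = fit_stump(xs, ys)
print(stump, predict(stump, 2.5))
```

The learned model is just a threshold and two labels, which is what makes trees easy to read as a knowledge representation.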

Example application: Handwriting
  Prof. R. Manmatha in the UMass CS department has constructed a system for information retrieval from handwritten documents.
  The system uses a learned classifier to recognize word images that match preclassified word images.
  Querying for "Alexandria" in George Washington's handwritten correspondence retrieves:
    [figure: retrieved word images]
  And also some similar-looking errors:
    [figure: incorrectly retrieved word images]

Class labels, ranks, and probabilities
  Different classification tasks can require different levels of model output.
    Class labels: crisp class boundaries only
    Ranking: allows for exploration of many potential class boundaries
    Probabilities: allows for more refined reasoning about sets of instances
  Each requires a progressively more accurate model (e.g., a poor probability estimator can still produce an accurate ranking).
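
The point about progressively more accurate models can be illustrated with a small sketch. The probabilities below are made up, and squaring them is just one hypothetical way to mimic a badly calibrated estimator: the probability values become noticeably wrong, yet the ranking and (for this threshold) the class labels are unchanged.

```python
# Hypothetical estimates of P(class = "match") for five word images.
reference = {"w1": 0.9, "w2": 0.8, "w3": 0.4, "w4": 0.2, "w5": 0.1}

# A poorly calibrated estimator: squaring distorts every probability
# but preserves their order.
poor = {k: p ** 2 for k, p in reference.items()}


def ranking(scores):
    """Instance ids sorted from most to least likely to be in the class."""
    return sorted(scores, key=scores.get, reverse=True)


def labels(scores, threshold=0.5):
    """Crisp class labels from one fixed decision boundary."""
    return {k: p >= threshold for k, p in scores.items()}


same_ranking = ranking(reference) == ranking(poor)
same_labels = labels(reference) == labels(poor)
max_error = max(abs(reference[k] - poor[k]) for k in reference)
print(same_ranking, same_labels, round(max_error, 2))
```

With a different threshold or a stronger distortion the labels could change too; the ranking survives any order-preserving distortion.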

Alternative formulations
  Non-mutually exclusive categories
    A model can identify multiple categories for each instance (e.g., player-managers in sports or actor-writer-directors in movies).
  Hierarchical classification
    A model can identify a set of hierarchically organized classes (e.g., book : fiction : mystery).

Regression

Regression defined
  For a set of data instances, each of which is characterized by variables X = {x_1, x_2, ..., x_n}, assign each instance a value of y, where y is a continuous variable.

Example: Least-squares regression
  Least-squares linear regression minimizes the sum of squared deviations of the predictions from the actual values of y.
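
For a single predictor variable, the least-squares line has a closed-form solution. The sketch below uses that formula on made-up data; the function names and the numbers are illustrative assumptions, not part of the lecture.

```python
def least_squares(xs, ys):
    """Intercept and slope minimizing sum((y - (intercept + slope*x))**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(xs, ys, intercept, slope):
    """Sum of squared deviations of the predictions from the actual y values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: y is a continuous variable that grows roughly linearly in x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
intercept, slope = least_squares(xs, ys)
print(intercept, slope, sse(xs, ys, intercept, slope))
```

Any other line gives a larger value of `sse` on the same data, which is what "least squares" means.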

Regression application: Polling trends
  One election-watcher (www.electoral-vote.com) analyzed polling data using linear regression to account for temporal trends.
  The models were re-estimated each day for the last several weeks of the election, based on polls from the previous 30 days.
  Each day, the predicted poll results were aggregated to project the overall outcome on Election Day.

Polling trends (continued)
  Bush 256, Kerry 238

Regression vs. classification
  Don't confuse knowledge representation and task description.
    Linear equations can be used to predict the probability of a given class (classification).
    Classification trees can be used to predict the distribution of a continuous variable (regression).
  Focus on the goal of the analysis, rather than on the form of the model.

Dependency finding

Dependency finding defined
  Identify and summarize the statistical dependencies among a large collection of variables or items.
    { Cheerios, milk, apple, cookies }
    { chicken, chicken, chicken, chicken, Coke }
    { milk, milk, cookies }
  Milk and cookies are frequently purchased together.
  Multiple packages of chicken are frequently purchased together.

Example method: Dependency networks
  Use one or more predictive modeling techniques (e.g., decision trees for classification) to identify what other variables are correlated with each variable in a data set.
  Draw a graph in which each variable is a vertex and edges represent dependence. This is called a dependency network.
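
The shopping-basket example can be made concrete by counting co-occurrences. This is a simple counting sketch, not the dependency-network method; the variable names are illustrative, and three baskets are of course far too few to establish that any pattern is real.

```python
from collections import Counter
from itertools import combinations

# The three baskets from the example above.
baskets = [
    ["Cheerios", "milk", "apple", "cookies"],
    ["chicken", "chicken", "chicken", "chicken", "Coke"],
    ["milk", "milk", "cookies"],
]

pair_support = Counter()    # number of baskets containing both items of a pair
repeat_support = Counter()  # number of baskets containing an item more than once

for basket in baskets:
    counts = Counter(basket)
    # Each unordered pair of distinct items is counted once per basket.
    for pair in combinations(sorted(counts), 2):
        pair_support[pair] += 1
    for item, n in counts.items():
        if n > 1:
            repeat_support[item] += 1

print(pair_support[("cookies", "milk")], repeat_support["chicken"])
```

Pairs with high support relative to the number of baskets are candidate dependencies that would still need to be checked against new data.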

Example application: Profiling users
  Researchers at Microsoft Research analyzed data from Media Metrix, containing demographic and internet-use data for about 5000 individuals during the month of January 1997.
  They summarized the dependencies in the form of a dependency network.
  D. Heckerman, D. Maxwell Chickering, C. Meek, R. Rounthwaite, C. Kadie (2001). Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Microsoft Research. MSR-TR-2000-1.

  [figure: dependency network, with annotations]
    The dependencies among the variables produced a large graph.
    There were dependencies among the demographic characteristics of users and among the sites visited,
    but there was only one strong dependency between the demographic characteristics and the sites visited.

Issues
  Spurious dependencies
    It is very easy for KD algorithms to find spurious ("false") dependencies in any given data set.
    The key is whether discovered dependencies generalize to new data.
    We will discuss hypothesis tests in a later lecture.
  Confounding effects
    Examining only pairwise dependencies can be misleading, because two variables can appear dependent yet be independent given their relationships to other variables.
    We will discuss conditional independence in a later lecture.

Clustering

Clustering defined
  Partitioning a set of instances into a fixed number of subsets that are relatively homogeneous;
  Identifying a small number of instances that are representative exemplars; or
  Identifying a family of overlapping distributions that define the density of instances with a given set of variable values.

Example method: k-means clustering
  Select k points (called seeds)
  Iterate until little change in seed locations
    Assign each instance to nearest seed, forming clusters
    Replace each seed with centroid of points in its cluster
  www.oefai.at/~elias/ma/documentation.html
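
The k-means steps above can be sketched in a few lines. The points, the starting seeds, the tolerance, and the handling of empty clusters are assumptions made for illustration; real implementations usually choose seeds randomly and run several restarts.

```python
import math


def kmeans(points, seeds, max_iter=100, tol=1e-9):
    """Plain k-means starting from the given seeds.

    Returns (seeds, clusters). The clusters are the assignments made in the
    final iteration, which match the returned seeds once the loop converges.
    """
    seeds = [tuple(s) for s in seeds]
    clusters = []
    for _ in range(max_iter):
        # Assign each instance to its nearest seed, forming clusters.
        clusters = [[] for _ in seeds]
        for p in points:
            nearest = min(range(len(seeds)), key=lambda i: math.dist(p, seeds[i]))
            clusters[nearest].append(p)
        # Replace each seed with the centroid of the points in its cluster
        # (an empty cluster keeps its old seed).
        new_seeds = []
        for seed, cluster in zip(seeds, clusters):
            if cluster:
                new_seeds.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_seeds.append(seed)
        # Stop when the seed locations change very little.
        shift = max(math.dist(a, b) for a, b in zip(seeds, new_seeds))
        seeds = new_seeds
        if shift <= tol:
            break
    return seeds, clusters


# Hypothetical 2-D instances forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = kmeans(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

The result depends on the starting seeds, which is one reason clusterings can be hard to evaluate.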

Example: Clustering whiskeys
  Two researchers in Canada applied "an array of statistical methods to a database derived from a connoisseur's description of these liquors. The taster's literary descriptions of Scotches were turned into a numerical database (109 Scotches x 68 binary variables). A first classification was produced by distance computation and hierarchical clustering."
  F. Lapointe and P. Legendre (1994). A classification of pure malt Scotch whiskies. Applied Statistics 43(1):237-257.

Clustering whiskeys (continued)
  [figure]

Issues
  Clustering can be used to assist other tasks
    Classification and regression: identifying subgroups that should be modeled separately
    Anomaly detection: identifying groups that anomalies lie far from
  Clustering can be very difficult to evaluate
    Prediction tasks (e.g., regression) have a clear method of evaluating accuracy
    Clusters are less obviously correct or incorrect

Anomaly detection

Anomaly detection defined
  Identify individual instances that differ substantially from nearly all other instances in the data.
  Also called outlier detection.

Example: Computer security
  Lane and Brodley looked for anomalies in Unix command sequences in order to identify unauthorized users of computer accounts.
  T. Lane and C. Brodley (1999). Temporal sequence learning and data reduction for anomaly detection. ACM Transactions on Information and System Security 2(3):295-331.

Computer Security (continued)
  [figure]

Issues
  Anomalies vs. infrequent behaviors
    With insufficient training data, infrequent combinations of variable values may look like anomalies.
  Using the joint distribution
    Anomalies are rarely evident from a single variable. Instead, algorithms have to examine the joint distribution of several variables simultaneously.
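
A small sketch can show why the joint distribution matters. The event log below is entirely hypothetical (it is not the Lane and Brodley data or method), and the 5% threshold is an arbitrary assumption: the one unusual event looks ordinary on each variable separately and only stands out when both are considered together.

```python
from collections import Counter

# Hypothetical log of (time_of_day, host) pairs for one user account.
events = (
    [("day", "office")] * 50
    + [("night", "home")] * 49
    + [("night", "office")]
)

n = len(events)
joint = Counter(events)                   # frequency of each combination
by_time = Counter(t for t, _ in events)   # marginal frequency of time_of_day
by_host = Counter(h for _, h in events)   # marginal frequency of host


def rare_jointly(event, threshold=0.05):
    """True if this combination of values is infrequent in the log."""
    return joint[event] / n < threshold


def rare_marginally(event, threshold=0.05):
    """True if either value is infrequent when looked at on its own."""
    t, h = event
    return by_time[t] / n < threshold or by_host[h] / n < threshold


anomalies = [e for e in joint if rare_jointly(e)]
print(anomalies, [rare_marginally(e) for e in anomalies])
```

With only a handful of logged events, many legitimate combinations would fall under the threshold too, which is the "infrequent behaviors" caveat above.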

How to identify different tasks
  Consider how you will evaluate success
  Prediction vs. description vs. identification
    Prediction: classification or regression
    Description: clustering or dependency finding
    Identification: anomaly detection
  Discrete vs. continuous predictions
    Discrete: classification
    Continuous: regression
  Instances vs. sets
    Predictions about instances: classification, regression, and anomaly detection
    Descriptions of sets of instances: clustering and dependency finding

The central role of output
  Classification: classes, rankings, or probability distributions
  Regression: continuous values or probability distributions for continuous values
  Dependency finding: dependencies
  Clustering: sets or descriptions of similar cases
  Anomaly detection: anomalous instances

Don't confuse representation and task
  Equations can produce both probability estimates (classification) and continuous values (regression).
  Trees can produce both predicted classes and means and variances that are predictions of a continuous variable.
  Rules can be used for classification but also represent the output of algorithms for dependency finding.
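
To illustrate how one representation can serve two tasks, the sketch below uses the same linear equation for regression and for classification. Passing the linear output through a logistic function, as in logistic regression, is an assumption chosen for illustration (the slides do not name a specific method), and the coefficients are made up.

```python
import math


def linear(x, intercept, slope):
    """A linear equation in one variable: the shared representation."""
    return intercept + slope * x


def predict_value(x, intercept, slope):
    """Regression: the equation's output is the predicted continuous y."""
    return linear(x, intercept, slope)


def predict_class_probability(x, intercept, slope):
    """Classification: the logistic function maps the equation's output
    to an estimate of P(class = 1 | x), a number between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-linear(x, intercept, slope)))


# Hypothetical coefficients for each task.
value = predict_value(2.0, intercept=1.0, slope=0.5)
prob = predict_class_probability(2.0, intercept=-1.0, slope=0.5)
print(value, prob)
```

The form of the model is the same in both functions; only the goal of the analysis, and therefore the interpretation of the output, differs.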

The centrality of probability estimation
  Many tasks can reduce to probability estimation
  Examples
    Classification: select the most probable class
    Ranking: rank by probability of a specific class
    Dependency finding: learn a joint probability model and examine the model to identify associations
    Anomaly detection: use a joint model to identify data instances that occur but that the model says are unlikely
  But you don't always need a full probability model to do each of these tasks
    Simpler models sometimes suffice

Choosing a good task specification
  Try simple specifications first
  Iterate
  As a thought experiment, ask how the problem might be reduced to probability estimation
  Consider whether the problem encompasses several distinct tasks