Knowledge Discovery and Data Mining


Computer Science 591Y
Department of Computer Science, University of Massachusetts Amherst
February 3, 2005

Topics
  Tasks (definition, example, and notes)
    Classification
    Regression
    Dependency finding
    Clustering
    Anomaly detection
  How to identify different tasks

Tasks
  Classification: predicting a discrete variable
  Regression: predicting a continuous variable
  Dependency finding: describing relationships among variables
  Clustering: describing groupings of instances
  Anomaly detection: identifying unusual instances

Classification

Classification defined
  For a set of data instances, each of which is characterized by variables X = {x_1, x_2, ..., x_n}, assign each instance a value of y, where y is a discrete variable with a finite number of values.

Example method: classification trees
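
The definition above can be made concrete with a small classification tree. This is a minimal sketch, assuming numeric variables, Gini impurity as the split criterion, and a made-up six-instance data set; the function names are illustrative, and the slides do not prescribe a particular tree-growing procedure.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (variable index, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree; leaves are class labels, internal nodes are tuples."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def classify(tree, x):
    """Follow the splits until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

# Hypothetical data: each instance has variables (x_1, x_2) and a discrete y.
rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 7.0), (6.0, 1.0), (7.0, 3.0), (8.0, 2.0)]
labels = ["a", "a", "a", "b", "b", "b"]
tree = build_tree(rows, labels)
print(tree)
print(classify(tree, (2.5, 6.0)), classify(tree, (7.5, 2.0)))
```

On this toy data a single split on the first variable separates the two classes, so the tree has one internal node and two leaves.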

Example application: handwriting
  Prof. R. Manmatha in the UMass CS department has constructed a system for information retrieval from handwritten documents.
  The system uses a learned classifier to recognize word images that match preclassified word images.
  Querying for "Alexandria" in George Washington's handwritten correspondence retrieves matching word images, and also some similar-looking errors.
  [Figures: retrieved word images; similar-looking errors]

Class labels, ranks, and probabilities
  Different classification tasks can require different levels of model output:
    Class labels: crisp class boundaries only
    Ranking: allows for exploration of many potential class boundaries
    Probabilities: allows for more refined reasoning about sets of instances
  Each requires a progressively more accurate model (e.g., a poor probability estimator can still produce an accurate ranking).
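
The last point can be checked on a few numbers. The sketch below assumes five instances with made-up "true" class probabilities and a deliberately poor estimator that squeezes every estimate into a narrow band; all values and function names are hypothetical.

```python
# Hypothetical true probabilities of the positive class for five instances.
true_p = [0.9, 0.7, 0.5, 0.3, 0.1]
# A poorly calibrated estimator: every estimate lies between 0.51 and 0.60.
est_p = [0.60, 0.58, 0.55, 0.52, 0.51]

def ranking(scores):
    """Indices of the instances, sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def class_labels(scores, threshold=0.5):
    """Crisp labels: 1 if the score reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]

# The probability estimates are far from the true values on average...
mean_abs_error = sum(abs(t - e) for t, e in zip(true_p, est_p)) / len(true_p)
print(mean_abs_error)

# ...yet the ranking of instances is identical.
print(ranking(true_p) == ranking(est_p))

# A fixed 0.5 boundary gives different labels, but the ranking lets us
# explore other boundaries and find one that recovers the same labels.
print(class_labels(true_p), class_labels(est_p), class_labels(est_p, 0.54))
```

Here the estimator is useless for reasoning about expected counts of positives, but it orders the instances correctly, which is all a ranking task needs.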

Alternative formulations
  Non-mutually exclusive categories: a model can identify multiple categories for each instance (e.g., player-managers in sports or actor-writer-directors in movies).
  Hierarchical classification: a model can identify a set of hierarchically organized classes (e.g., book : fiction : mystery).

Regression

Regression defined
  For a set of data instances, each of which is characterized by variables X = {x_1, x_2, ..., x_n}, assign each instance a value of y, where y is a continuous variable.

Example: least-squares regression
  Least-squares linear regression minimizes the sum of squared deviations of the predictions from the actual values of y.
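
For a single predictor, the least-squares line has a closed form. The sketch below assumes one variable x and a handful of made-up points; `fit_line` and `sse` are illustrative names, not part of the course material.

```python
def fit_line(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sse(xs, ys, slope, intercept):
    """Sum of squared deviations of the predictions from the actual y values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
# Any other line has a larger sum of squared deviations on this data.
print(sse(xs, ys, slope, intercept) < sse(xs, ys, slope + 0.1, intercept))
print(sse(xs, ys, slope, intercept) < sse(xs, ys, slope, intercept - 0.1))
```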

Regression application: polling trends
  One election-watcher (www.electoral-vote.com) analyzed polling data using linear regression to account for temporal trends.
  The models were re-estimated each day for the last several weeks of the election, based on polls from the previous 30 days.
  Each day, the predicted poll results were aggregated to project the overall outcome on Election Day.

Polling trends (continued)
  [Figure: Bush 256, Kerry 238]

Regression vs. classification
  Don't confuse knowledge representation and task description.
    Linear equations can be used to predict the probability of a given class (classification).
    Classification trees can be used to predict the distribution of a continuous variable (regression).
  Focus on the goal of the analysis, rather than on the form of the model.

Dependency finding

Dependency finding defined
  Identify and summarize the statistical dependencies among a large collection of variables or items.
    { Cheerios, milk, apple, cookies }
    { chicken, chicken, chicken, chicken, Coke }
    { milk, milk, cookies }
  Milk and cookies are frequently purchased together.
  Multiple packages of chicken are frequently purchased together.

Example method: dependency networks
  Use one or more predictive modeling techniques (e.g., decision trees for classification) to identify what other variables are correlated with each variable in a data set.
  Draw a graph where each variable is a vertex and edges represent dependence. This is called a dependency network.
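
As a rough illustration, the three baskets above can be summarized by counting. This sketch uses plain co-occurrence counts rather than the predictive models that a dependency network relies on, and the `min_count` cutoff for drawing an edge is an arbitrary assumption.

```python
from collections import Counter
from itertools import combinations

# The three example baskets from the slide.
baskets = [
    ["Cheerios", "milk", "apple", "cookies"],
    ["chicken", "chicken", "chicken", "chicken", "Coke"],
    ["milk", "milk", "cookies"],
]

# Count how many baskets contain each pair of distinct items.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

# Count how many baskets contain more than one unit of the same item.
repeat_counts = Counter()
for basket in baskets:
    for item, units in Counter(basket).items():
        if units > 1:
            repeat_counts[item] += 1

# Keep an edge between two items if they co-occur in at least min_count baskets.
min_count = 2  # assumed cutoff, chosen only for this toy data
edges = sorted(pair for pair, c in pair_counts.items() if c >= min_count)

print(edges)
print(dict(repeat_counts))
```

With only three baskets these counts are illustrative at best; whether such a dependency is real depends on how it holds up on new data.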

Example application: profiling users
  Researchers at Microsoft Research analyzed data from Media Metrix containing demographic and internet-use data for about 5000 individuals during the month of January 1997.
  They summarized the dependencies in the form of a dependency network.
  D. Heckerman, D. Maxwell Chickering, C. Meek, R. Rounthwaite, C. Kadie (2001). Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Microsoft Research. MSR-TR-2000-1.
  [Figure annotations: There were dependencies among the demographic variables, and the dependencies among the sites visited produced a large graph, but there was only one strong dependency between the demographic characteristics of users and the sites visited.]

Issues
  Spurious dependencies
    It is very easy for KD algorithms to find spurious ("false") dependencies in any given data set.
    The key is whether discovered dependencies generalize to new data.
    We will discuss hypothesis tests in a later lecture.
  Confounding effects
    Examining only pairwise dependencies can be misleading, because two variables can appear dependent yet be independent given their relationships to other variables.
    We will discuss conditional independence in a later lecture.
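
The confounding effect is easy to reproduce with a small table of counts. In this sketch the binary variables A, B, and Z and all of the counts are made up: A and B each depend on Z, but not on each other once Z is known.

```python
# Hypothetical counts of instances, keyed by (z, a, b).
counts = {
    (0, 1, 1): 4,  (0, 1, 0): 16, (0, 0, 1): 16, (0, 0, 0): 64,
    (1, 1, 1): 64, (1, 1, 0): 16, (1, 0, 1): 16, (1, 0, 0): 4,
}

def p_b_given_a(a, z=None):
    """Estimate P(B=1 | A=a), optionally restricted to instances with Z=z."""
    rows = {k: v for k, v in counts.items()
            if k[1] == a and (z is None or k[0] == z)}
    total = sum(rows.values())
    return sum(v for k, v in rows.items() if k[2] == 1) / total

# Pairwise view: B looks strongly dependent on A.
print(p_b_given_a(1), p_b_given_a(0))            # 0.68 vs 0.32

# Conditional view: within each value of Z, A tells us nothing about B.
print(p_b_given_a(1, z=0), p_b_given_a(0, z=0))  # 0.2 vs 0.2
print(p_b_given_a(1, z=1), p_b_given_a(0, z=1))  # 0.8 vs 0.8
```

An analysis that looked only at the pair (A, B) would report a dependency that disappears once Z is taken into account.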

Clustering

Clustering defined
  Partitioning a set of instances into a fixed number of subsets that are relatively homogeneous;
  identifying a small number of instances that are representative exemplars; or
  identifying a family of overlapping distributions that define the density of instances with a given set of variable values.

Example method: k-means clustering
  Select k points (called seeds).
  Iterate until there is little change in seed locations:
    Assign each instance to the nearest seed, forming clusters.
    Replace each seed with the centroid of the points in its cluster.
  www.oefai.at/~elias/ma/documentation.html
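
The k-means steps above can be sketched in a few lines. This version assumes Euclidean distance, seeds supplied by the caller, and a small made-up data set; the tolerance and iteration cap are arbitrary choices, and `math.dist` requires Python 3.8 or later.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, seeds, tol=1e-9, max_iter=100):
    """Return the final seed locations and the clusters assigned to them."""
    seeds = list(seeds)
    for _ in range(max_iter):
        # Assign each instance to its nearest seed, forming clusters.
        clusters = [[] for _ in seeds]
        for p in points:
            nearest = min(range(len(seeds)), key=lambda i: math.dist(p, seeds[i]))
            clusters[nearest].append(p)
        # Replace each seed with the centroid of its cluster
        # (an empty cluster keeps its old seed).
        new_seeds = [centroid(c) if c else s for c, s in zip(clusters, seeds)]
        shift = max(math.dist(s, t) for s, t in zip(seeds, new_seeds))
        seeds = new_seeds
        if shift <= tol:  # little change in seed locations: stop
            break
    return seeds, clusters

# Hypothetical two-dimensional data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_seeds, clusters = kmeans(points, seeds=[(0, 0), (10, 10)])
print(final_seeds)
print(clusters)
```

The result depends on the initial seeds; a different starting choice can converge to a different partition.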

Example: clustering whiskeys
  Two researchers in Canada applied "an array of statistical methods to a database derived from a connoisseur's description of these liquors. The taster's literary descriptions of Scotches were turned into a numerical database (109 Scotches x 68 binary variables). A first classification was produced by distance computation and hierarchical clustering."
  F. Lapointe and P. Legendre (1994). A classification of pure malt Scotch whiskies. Applied Statistics 43(1):237-257.

Clustering whiskeys (continued)
  [Figure]

Issues
  Clustering can be used to assist other tasks
    Classification and regression: identifying subgroups that should be modeled separately
    Anomaly detection: identifying groups that anomalies lie far from
  Clustering can be very difficult to evaluate
    Prediction tasks (e.g., regression) have a clear method of evaluating accuracy.
    Clusters are less obviously correct or incorrect.

Anomaly detection

Anomaly detection defined
  Identify individual instances that differ substantially from nearly all other instances in the data.
  Also called outlier detection.

Example: computer security
  Lane and Brodley looked for anomalies in Unix command sequences in order to identify unauthorized users of computer accounts.
  T. Lane and C. Brodley (1999). Temporal sequence learning and data reduction for anomaly detection. ACM Transactions on Information and System Security 2(3):295-331.

Computer security (continued)
  [Figure]

Issues
  Anomalies vs. infrequent behaviors
    With insufficient training data, infrequent combinations of variable values may look like anomalies.
  Using the joint distribution
    Anomalies are rarely evident from a single variable. Instead, algorithms have to examine the joint distribution of several variables simultaneously.
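
The point about the joint distribution can be seen in a small example. The sketch below assumes 100 made-up observations of two categorical variables (a time of day and an action) and an arbitrary rarity threshold of 2%; none of this comes from the Lane and Brodley study.

```python
from collections import Counter

# Hypothetical observations of (time, action).
observations = (
    [("day", "read")] * 48
    + [("night", "write")] * 48
    + [("day", "write")] * 3
    + [("night", "read")] * 1
)
n = len(observations)

joint = Counter(observations)
time_counts = Counter(t for t, _ in observations)
action_counts = Counter(a for _, a in observations)

def marginal_anomalies(threshold=0.02):
    """Single values whose relative frequency falls below the threshold."""
    rare = [v for v, c in time_counts.items() if c / n < threshold]
    rare += [v for v, c in action_counts.items() if c / n < threshold]
    return sorted(rare)

def joint_anomalies(threshold=0.02):
    """Value combinations whose relative frequency falls below the threshold."""
    return sorted(combo for combo, c in joint.items() if c / n < threshold)

print(marginal_anomalies())  # every individual value is common
print(joint_anomalies())     # but one combination is rare
```

With so few observations, the rare combination could just as well be an infrequent but legitimate behavior, which is the first issue listed above.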

How to identify different tasks
  Consider how you will evaluate success.
  Prediction vs. description vs. identification
    Prediction: classification or regression
    Description: clustering or dependency finding
    Identification: anomaly detection
  Discrete vs. continuous predictions
    Discrete: classification
    Continuous: regression
  Instances vs. sets
    Predictions about instances: classification, regression, and anomaly detection
    Descriptions of sets of instances: clustering and dependency finding

The central role of output
  Classification: classes, rankings, or probability distributions
  Regression: continuous values, or probability distributions for continuous values
  Dependency finding: dependencies
  Clustering: sets or descriptions of similar cases
  Anomaly detection: anomalous instances

Don't confuse representation and task
  Equations can produce both probability estimates (classification) and continuous values (regression).
  Trees can produce both predicted classes and means and variances that are predictions of a continuous variable.
  Rules can be used for classification, but they also represent the output of algorithms for dependency finding.

The centrality of probability estimation
  Many tasks can reduce to probability estimation. Examples:
    Classification: select the most probable class.
    Ranking: rank by the probability of a specific class.
    Dependency finding: learn a joint probability model and examine the model to identify associations.
    Anomaly detection: use a joint model to identify data instances that occur but that the model says are unlikely.
  But you don't always need a full probability model to do each of these tasks. Simpler models sometimes suffice.

Choosing a good task specification
  Try simple specifications first.
  Iterate.
  As a thought experiment, ask how the problem might be reduced to probability estimation.
  Consider whether the problem encompasses several distinct tasks.
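
As one way to run that thought experiment, the sketch below derives four different outputs from a single estimated joint distribution. The table of counts over a discrete variable x and a binary class y is made up, and the 0.5 and 0.05 thresholds are arbitrary assumptions.

```python
# Hypothetical joint counts, keyed by (x, y).
counts = {
    ("a", 1): 40, ("a", 0): 10,
    ("b", 1): 12, ("b", 0): 36,
    ("c", 1): 0,  ("c", 0): 2,
}
total = sum(counts.values())
xs = sorted({x for x, _ in counts})

def p_x(x):
    """Estimated probability of observing the value x."""
    return (counts[(x, 0)] + counts[(x, 1)]) / total

def p_y1_given_x(x):
    """Estimated probability of class 1 given x."""
    return counts[(x, 1)] / (counts[(x, 0)] + counts[(x, 1)])

# Classification: select the most probable class for each x.
predicted = {x: int(p_y1_given_x(x) >= 0.5) for x in xs}

# Ranking: order the values of x by the probability of class 1.
ranked = sorted(xs, key=p_y1_given_x, reverse=True)

# Dependency finding: compare P(x, y) with P(x) * P(y); a ratio far from 1
# suggests an association between x = "a" and y = 1.
p_y1 = sum(counts[(x, 1)] for x in xs) / total
lift_a1 = (counts[("a", 1)] / total) / (p_x("a") * p_y1)

# Anomaly detection: values that occur but that the model says are unlikely.
unlikely = [x for x in xs if p_x(x) < 0.05]

print(predicted, ranked, round(lift_a1, 3), unlikely)
```

A full joint table like this is rarely practical with many variables, which is one reason simpler models sometimes suffice.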