Data Warehousing and Machine Learning

1 Data Warehousing and Machine Learning: Introduction. Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008 (DWML)

5 What is Data Mining? Definitions
- Knowledge Discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [Frawley, Piatetsky-Shapiro, Matheus 1991].
- Data Mining is a step in the KDD process consisting of applying computational techniques that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data [Fayyad, Piatetsky-Shapiro, Smyth 1996].
- Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [Hand, Mannila, Smyth 2001].
- The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience [Mitchell, 1997].

6 What is Data Mining? Data Mining vs. Machine Learning
- Different roots: information extraction vs. intelligent machines.
- Today there is a very large overlap of techniques and applications.
- Some remaining differences: emphasis on large datasets (DM), theoretical analysis of learnability (ML), ...
- For this course: Data Mining ≈ Machine Learning.

10 What is Data Mining? Data Mining in practice
(Diagram: real-life data → preprocess; off-the-shelf algorithm → adapt; then evaluate + iterate. Labels: general algorithmic methods; data/domain specific operations.)

11 What is Data Mining? Background
- Developed by a four-member consortium in an EU project.
- Members of the consortium: Teradata (NCR), SPSS (statistical software), DaimlerChrysler, OHRA (insurance and banking).
- The consortium was supported by a special interest group composed of over 300 organizations involved in data mining projects.
Aim (quoted from the CRISP-DM project): "The CRISP-DM project has developed an industry- and tool-neutral Data Mining process model. [...] this project defined and validated a data mining process that is applicable in diverse industry sectors. This will make large data mining projects faster, cheaper, more reliable and more manageable. Even small scale data mining investigations will benefit from using CRISP-DM."

12 What is Data Mining? Phases of the CRISP-DM Process Model (illustration not reproduced)

13 What is Data Mining? Business/Data understanding
- Vision: Data Mining extracts whatever interesting hidden information there is in the data.
- Reality: Data Mining techniques solve several types of well-defined tasks.
- Reality: The data used must support the task at hand.
- Reality: The data miner must understand the background of the data in order to select an appropriate data mining technique.

14 What is Data Mining? Our Focus (figure not reproduced)

15 What is Data Mining? Selecting the Modeling Technique
- Universe of techniques (defined by tool)
- Techniques appropriate for problem
- Political requirements (management, understandability)
- Constraints (time, data characteristics, staff training/knowledge)
- Tool(s) selected

16 Types of Tasks and Models
Prediction (Supervised Learning)
- Task: predict some (unobserved) target variable based on observed values of attribute variables.
- Regression, if the target is continuous; Classification, if the target is discrete.
- Models e.g.: Decision trees, Neural networks, Bayesian (classification) networks, ...
Clustering
- Task: identify coherent subgroups in data.
- Models e.g.: k-means, hierarchical clustering, ...
Association analysis
- Task: identify patterns of co-occurrence of attribute values.
- Models: Apriori and extensions.
Visualization (Exploratory Data Analysis)
- Task: find intelligible visualization of relevant data properties.
- Models: Graphs, plots, ...

20 Example: Regression. Nutritional rating of cereals
Data: nutritional information and ratings for 77 cereals.
Task: find the best linear approximation of the dependency of rating on sugars.
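The regression task can be sketched with an ordinary least-squares line fit. This is only an illustration: the `(sugars, rating)` pairs below are made up (the 77-cereal data is not reproduced in the slides), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up (sugars in grams, rating) pairs; NOT the actual cereal data.
sugars = [0, 3, 6, 9, 12]
ratings = [60.0, 52.5, 45.0, 37.5, 30.0]

intercept, slope = fit_line(sugars, ratings)
print(f"rating ~ {intercept:.2f} + ({slope:.2f}) * sugars")
```

With real data the points would scatter around the line; the fitted slope then summarizes how the rating tends to change per gram of sugar.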

21 Example: Classification. Text Categorization
The Association for Computing Machinery (ACM) maintains a subject classification scheme for computer science research papers. Part of the subject hierarchy (1998 version):
I. Computing Methodologies
  I.2 Artificial Intelligence
    I.2.6 Learning
      - Analogies
      - Concept learning
      - Connectionism and neural nets
      - Induction
      - Knowledge acquisition
      - Language acquisition
      - Parameter learning
Papers are manually classified by authors or editors.
Data: collection of classified papers (full text or abstracts).
Task: build a classifier that automatically assigns a subject index to new, unclassified papers.

22 Example: Classification. Spam Filtering
Spam filtering in Mozilla: the user trains the mail reader to recognize spam by manually labeling incoming mails as spam/no spam.
Data: collection of user-classified emails (full text).
Task: build a classifier that automatically categorizes an incoming email as spam/no spam.

23 Example: Classification. Character Recognition
An example of a Pattern Recognition problem (pattern recognition is an older discipline than data mining, but can now also be seen as a sub-area of data mining).
Data: collection of handwritten characters, correctly labeled.
Task: build a classifier that identifies new handwritten characters.

24 Example: Classification. Credit Rating
From existing customer data, predict whether a person applying for a new loan will repay or default on the loan.
Data: existing customer records with attributes like age, employment type, income, ... and information on payback history.
Task: build a classifier that predicts whether a new customer will repay the loan.

25 Examples: Clustering. Text Categorization
Web mining: automatically detect similarity between web pages (e.g. to support search engines or automatic construction of internet directories).
Data: the WWW.
Task: construct a (similarity) model for pages on the WWW.

26 Examples: Clustering. Bioinformatics: Phylogenetic Trees
From biological data, construct a model of evolution.
(Figure: tree over Lactococcus lactis, Caulobacter crescentus, Bacillus halodurans, Bacillus subtilis, Rattus norvegicus, Pan troglodytes, Homo sapiens.)
Data: e.g. genome sequences of different animal species.
Task: construct a hierarchical model of similarity between the species.

27 Examples: Association Analysis. Association Rules
Data: transaction data.
Task: infer association rules.
Transaction  Items bought
1            Beer, Soap, Milk, Butter
2            Beer, Chips, Butter
3            Milk, Spaghetti, Butter, Tomatoes
Example rules: {Beer} ⇒ {Chips}, {Spaghetti, Tomatoes} ⇒ {Wine}, ...
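A candidate rule such as {Beer} ⇒ {Chips} can be scored against the transactions. The sketch below uses support and confidence, the usual measures in Apriori-style association analysis; the slide itself does not name them, so treat the measures and the helper names `support` and `confidence` as assumptions.

```python
# The three transactions from the slide, as sets of items.
transactions = [
    {"Beer", "Soap", "Milk", "Butter"},
    {"Beer", "Chips", "Butter"},
    {"Milk", "Spaghetti", "Butter", "Tomatoes"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


beer_support = support({"Beer"}, transactions)
beer_chips_conf = confidence({"Beer"}, {"Chips"}, transactions)
print(f"support(Beer) = {beer_support:.2f}")
print(f"confidence(Beer => Chips) = {beer_chips_conf:.2f}")
```

Beer occurs in two of the three transactions and Chips in one of those two, so the rule holds in half of the cases where it applies.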

28 Tools
WEKA
- Free open source Java toolbox.
- Many methods, good interface.
Clementine
- Commercial system, Windows only.
- Many methods, good interface, integrated use of MS SQL Server.
For all toolboxes: easy use of methods can be dangerous; correct interpretation of results requires understanding of the methods. Documentation is essential (and often a weak point...)!

29 Data Warehousing and Machine Learning: Decision trees. Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008

31 Classification: A high-level view
(Diagram: attributes SubAllCap, TrustSend, InvRet, Body adult, Body zambia, each yes/no → Classifier → Spam yes/no.)

32 Classification: A high-level view
(Diagram: attributes Cell, Cell, Cell, ..., Cell → Classifier → Symbol A..Z, 0..9.)

33 Classification: Labeled Data
Rows are instances (cases, examples); columns are attributes (features, predictor variables) followed by the class variable (target variable).
SubAllCap  TrustSend  InvRet  ...  B zambia  Spam
y          n          n       ...  n         y
n          n          n       ...  n         n
n          y          n       ...  n         y
n          n          n       ...  n         n
A second example has attributes Cell-1, Cell-2, Cell-3, ..., Cell-324 and class variable Symbol (e.g. B, Z).
(In principle, any attribute can become the designated class variable.)

34 Classification: Attribute Types
Each attribute (including the class variable) has associated with it a set of possible values or states. E.g.
- States(A) = {yes, no}
- States(A) = {red, blue, green}
- States(A) = {010100, ...}
- States(A) = $\mathbb{R}$
Terminology:
- States(A) finite: A is called discrete.
- States(A) = $\mathbb{R}$: A is called continuous or numeric.
- States(A) = $\mathbb{N}$: A can be interpreted as continuous ($\mathbb{N} \subset \mathbb{R}$), or made discrete by replacing $\mathbb{N}$ e.g. with {1, 2, ..., 100, > 100} (few data mining methods are specifically adapted to integer-valued attributes).

35 Classification: Complete/Incomplete Data
Complete data (columns: Name, Gender, DoB, Income, Customer since, Last Purchase): records for Thomas Jensen (m), Jens Nielsen (m), Lene Hansen (f), Ulla Sørensen (f), with every field filled in.
Incomplete data: the same records with some fields unknown, marked "?".

36 Classification
Classification data in general:
- Attributes: variables $A_1, A_2, \ldots, A_n$ (discrete or continuous).
- Class variable: variable $C$. Always discrete: $\mathrm{States}(C) = \{c_1, \ldots, c_l\}$ (set of class labels).
A (complete data) classifier is a mapping $C : \mathrm{States}(A_1, \ldots, A_n) \to \mathrm{States}(C)$.
A classifier able to handle incomplete data provides mappings for subsets $\{A_{i_1}, \ldots, A_{i_k}\}$ of $\{A_1, \ldots, A_n\}$: $C : \mathrm{States}(A_{i_1}, \ldots, A_{i_k}) \to \mathrm{States}(C)$.
A classifier partitions attribute-value space (also: instance space) into subsets labelled with class labels.

37 Classification: Iris dataset
Measurements of petal length/width (PL, PW) and sepal length/width (SL, SW) for 150 flowers of 3 different species of Iris. First reported in: Fisher, R.A., "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7 (1936).
Attributes: SL, SW, PL, PW. Class variable: Species (Setosa, Versicolor, Virginica).

39 Classification: Labeled data in instance space
(Figure: scatter plot of the labeled data with classes Virginica, Versicolor and Setosa, together with the partition defined by a classifier.)

40 Classification: Decision Regions
- Axis-parallel linear: e.g. Decision Trees
- Piecewise linear: e.g. Naive Bayes
- Nonlinear: e.g. Neural Network

41 Classification: Classifiers differ in...
- Model space: types of partitions and their representation.
- How they compute the class label corresponding to a point in instance space (the actual classification task).
- How they are learned from data.
Some important types of classifiers:
- Decision trees
- Naive Bayes classifier
- Other probabilistic classifiers (TAN, ...)
- Neural networks
- K-nearest neighbors

42 Decision Trees: Example
Attributes: height $\in [0, 2.5]$, sex $\in \{m, f\}$. Class labels: {tall, short}.
(Figures: partition of instance space into tall and short regions; representation of the same partition by a decision tree over s and h.)

43 Decision Trees
A decision tree is a tree
- whose internal nodes are labeled with attributes,
- whose leaves are labeled with class labels,
- whose edges going out from a node labeled with attribute A are labeled with subsets of States(A), such that all labels combined form a partition of States(A).
Possible partitions e.g.:
- States(A) = $\mathbb{R}$: $[-\infty, 2.3[,\ [2.3, \infty]$ or $[-\infty, 1.9[,\ [1.9, 3.5[,\ [3.5, \infty]$
- States(A) = {a, b, c}: {a}, {b}, {c} or {a, b}, {c}

44 Decision Trees: Decision tree classification
Each point in the instance space is sorted into a leaf by the decision tree. It is classified according to the class label at that leaf.
(Figure: the example tree with root s, branches m and f, tests on h with thresholds 1.8 and 1.7, and leaves short/tall; the instance [m, 1.85] is sorted into a leaf labeled tall.)
$C([m, 1.85]) = \text{tall}$
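Sorting an instance into a leaf can be sketched as a walk down a nested structure. The tree below is an assumed reading of the figure (root tests sex; the m branch splits height at 1.8 and the f branch at 1.7); the dictionary layout and the `classify` helper are illustrative, not the course's representation.

```python
# Assumed reading of the figure: the assignment of thresholds to branches
# (1.8 for m, 1.7 for f) is a reconstruction, not confirmed by the slides.
TREE = {
    "attribute": "s",
    "branches": {
        "m": {"attribute": "h", "threshold": 1.8,
              "below": "short", "at_or_above": "tall"},
        "f": {"attribute": "h", "threshold": 1.7,
              "below": "short", "at_or_above": "tall"},
    },
}


def classify(instance, node=TREE):
    """Follow edges from the root until a leaf (a plain class label) is reached."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        if "threshold" in node:
            # Continuous attribute: two outgoing edges, < and >= threshold.
            node = node["below"] if value < node["threshold"] else node["at_or_above"]
        else:
            # Discrete attribute: one outgoing edge per state.
            node = node["branches"][value]
    return node


print(classify({"s": "m", "h": 1.85}))
```

Each internal node consumes one attribute value, so classification cost is bounded by the depth of the tree.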

45 Decision Trees: How to learn a decision tree?
Given a dataset:
Id.  Savings  Assets  Income ($ 1000s)  Credit Risk
1    Medium   High    75                Good
2    Low      Low     50                Bad
3    High     Medium  25                Bad
4    Medium   Medium  50                Good
5    Low      Medium  100               Good
6    High     High    25                Good
7    Low      Low     25                Bad
8    Medium   Medium  75                Good
We want to build a decision tree that
- is small,
- has high classification accuracy.

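46 Decision Trees: Some simple candidate trees
Each leaf lists the training cases sorted into it and their class counts (G = Good, B = Bad).
- Savings: L → cases 2,5,7 (G:1, B:2); M → cases 1,4,8 (G:3, B:0); H → cases 3,6 (G:1, B:1)
- Assets: L → cases 2,7 (G:0, B:2); M → cases 3,4,5,8 (G:3, B:1); H → cases 1,6 (G:2, B:0)
- Income ≤ 50 → cases 2,3,4,6,7 (G:2, B:3); Income > 50 → cases 1,5,8 (G:3, B:0)
- Income ≤ 25 → cases 3,6,7 (G:1, B:2); Income > 25 → cases 1,2,4,5,8 (G:4, B:1)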

48 Decision Trees: How accurate are these trees?
Accurate trees have pure class label distributions at the leaves:
- pure: (2,0), (0,2), (3,0)
- impure: (1,2), (3,1), (2,3), (2,2), (1,1)
Entropy, a measure of impurity: for $S = (x_1, x_2, \ldots, x_n)$ with $x = \sum_{i=1}^{n} x_i$:
$\mathrm{Entropy}(S) = -\sum_{i=1}^{n} \frac{x_i}{x} \log_2\left(\frac{x_i}{x}\right)$ (taking $0 \log_2 0 = 0$)
- $\mathrm{Entropy}(2,0) = \mathrm{Entropy}(0,2) = \mathrm{Entropy}(3,0) = -(1 \log_2 1 + 0 \log_2 0) = 0$
- $\mathrm{Entropy}(3,1) = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) = 0.311 + 0.5 = 0.811$
- $\mathrm{Entropy}(2,3) = -(0.4 \log_2 0.4 + 0.6 \log_2 0.6) = 0.529 + 0.442 = 0.97$
- $\mathrm{Entropy}(2,2) = \mathrm{Entropy}(1,1) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 0.5 + 0.5 = 1.0$
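The entropy of a class count vector can be checked numerically. A minimal sketch, assuming counts are passed as a tuple and that zero counts contribute nothing (the convention 0 · log2(0) = 0); the function name `entropy` is illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a class count vector such as (3, 1)."""
    total = sum(counts)
    # Zero counts are skipped, which implements 0 * log2(0) = 0.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


for counts in [(2, 0), (3, 1), (2, 3), (2, 2)]:
    print(counts, round(entropy(counts), 3))
```

Pure leaves give 0 and an even two-class split gives 1, the extremes for two classes.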

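49 Decision Trees: Information Gain
Two candidate attributes, with the class counts at each leaf:
- A: true (8,2), false (1,1)
- B: L (2,0), M (5,1), H (2,2)
Leaf entropies: Entropy(8,2) = 0.722, Entropy(1,1) = 1, Entropy(2,0) = 0, Entropy(5,1) = 0.650, Entropy(2,2) = 1.
Expected entropy (leaf entropies weighted by the fraction of cases in each leaf):
- A: $\frac{10}{12} \cdot 0.722 + \frac{2}{12} \cdot 1 = 0.768$
- B: $\frac{2}{12} \cdot 0 + \frac{6}{12} \cdot 0.650 + \frac{4}{12} \cdot 1 = 0.658$
Data entropy: Entropy(9,3) = 0.811
Information gain (data entropy minus expected entropy):
- A: 0.811 − 0.768 = 0.043
- B: 0.811 − 0.658 = 0.153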

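50 Decision Trees: Expected entropies
Class counts at the leaves are written (G, B).
- Savings: L (1,2), M (3,0), H (1,1); expected entropy $\frac{3}{8} \cdot 0.918 + \frac{3}{8} \cdot 0 + \frac{2}{8} \cdot 1 = 0.594$
- Assets: L (0,2), M (3,1), H (2,0); expected entropy $\frac{2}{8} \cdot 0 + \frac{4}{8} \cdot 0.811 + \frac{2}{8} \cdot 0 = 0.406$
- Income ≤ 50 (2,3), > 50 (3,0); expected entropy $\frac{5}{8} \cdot 0.971 + \frac{3}{8} \cdot 0 = 0.607$
- Income ≤ 25 (1,2), > 25 (4,1); expected entropy $\frac{3}{8} \cdot 0.918 + \frac{5}{8} \cdot 0.722 = 0.796$
Information gains are Entropy(5,3) = 0.954 minus expected entropies.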
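Root selection can be reproduced by computing the information gain of each candidate split on the eight training cases. A sketch under the assumption that the candidates are exactly the four splits shown (Savings, Assets, Income ≤ 50, Income ≤ 25); `DATA`, `CANDIDATES` and `information_gain` are illustrative names.

```python
from collections import Counter, defaultdict
from math import log2

# (savings, assets, income in $1000s, credit risk) for cases 1..8.
DATA = [
    ("Medium", "High", 75, "Good"),
    ("Low", "Low", 50, "Bad"),
    ("High", "Medium", 25, "Bad"),
    ("Medium", "Medium", 50, "Good"),
    ("Low", "Medium", 100, "Good"),
    ("High", "High", 25, "Good"),
    ("Low", "Low", 25, "Bad"),
    ("Medium", "Medium", 75, "Good"),
]


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, key):
    """Data entropy minus expected entropy after splitting rows by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[-1])
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - expected


CANDIDATES = {
    "Savings": lambda r: r[0],
    "Assets": lambda r: r[1],
    "Income<=50": lambda r: r[2] <= 50,
    "Income<=25": lambda r: r[2] <= 25,
}

gains = {name: information_gain(DATA, key) for name, key in CANDIDATES.items()}
best = max(gains, key=gains.get)
for name, g in gains.items():
    print(f"{name}: {g:.3f}")
print("best root:", best)
```

The split with the lowest expected entropy has the highest gain, so the ranking of candidates is the same either way.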

51 Decision Trees: After the second (and final) ID3 iteration
- Assets = L: cases 2,7 (G:0, B:2) → bad
- Assets = M: split on Savings
  - Savings = L: case 5 (G:0, B:1) → bad
  - Savings = M: cases 4,8 (G:2, B:0) → good
  - Savings = H: case 3 (G:0, B:1) → bad
- Assets = H: cases 1,6 (G:2, B:0) → good

53 Decision Trees: Splitting continuous attributes
- Sort the continuous values in increasing order.
- Candidate split points are midpoints between adjacent values.
- Define a new attribute based on the candidate split point with highest gain.
Example (Income from the credit risk data):
Income:  25    25    25    50    50    75    75    100
Class:   Bad   Good  Bad   Good  Bad   Good  Good  Good
Split:   12.5  37.5  62.5  87.5  112.5
Gain:    0     0.159 0.348 0.092 0
$\mathrm{Entropy}(S) = -\frac{3}{8} \log_2\left(\frac{3}{8}\right) - \frac{5}{8} \log_2\left(\frac{5}{8}\right) = 0.954$
Expected entropy of each split, with Gain = Entropy(S) − Split:
- $\mathrm{Split}(S, I = 12.5) = \frac{0}{8}\,\mathrm{Entropy}(S, I \le 12.5) + \frac{8}{8}\,\mathrm{Entropy}(S, I > 12.5) = 0.954$, so the gain is 0.
- $\mathrm{Split}(S, I = 37.5) = \frac{3}{8}\,\mathrm{Entropy}(S, I \le 37.5) + \frac{5}{8}\,\mathrm{Entropy}(S, I > 37.5) = 0.796$
- $\mathrm{Split}(S, I = 62.5) = \frac{5}{8}\,\mathrm{Entropy}(S, I \le 62.5) + \frac{3}{8}\,\mathrm{Entropy}(S, I > 62.5) = 0.607$
- $\mathrm{Split}(S, I = 87.5) = \frac{7}{8}\,\mathrm{Entropy}(S, I \le 87.5) + \frac{1}{8}\,\mathrm{Entropy}(S, I > 87.5) = 0.862$
- $\mathrm{Split}(S, I = 112.5) = \frac{8}{8}\,\mathrm{Entropy}(S, I \le 112.5) + \frac{0}{8}\,\mathrm{Entropy}(S, I > 112.5) = 0.954$, so the gain is 0.
Thus, we get an attribute with states ≤ 62.5 and > 62.5.
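The search for a split point can be sketched as a loop over midpoints. Assumptions: only midpoints between adjacent distinct values are tried (split points outside the observed range separate nothing and have gain 0), and the income and class values are those of the eight training cases; `best_split` is an illustrative name.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, gain) for the midpoint split with the highest gain."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best_threshold, best_gain = None, -1.0
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [c for v, c in zip(values, labels) if v <= threshold]
        right = [c for v, c in zip(values, labels) if v > threshold]
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if base - expected > best_gain:
            best_threshold, best_gain = threshold, base - expected
    return best_threshold, best_gain


# Income ($1000s) and credit risk for training cases 1..8.
incomes = [75, 50, 25, 50, 100, 25, 25, 75]
risks = ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"]

threshold, gain = best_split(incomes, risks)
print(f"split at {threshold} with gain {gain:.3f}")
```

The chosen threshold turns Income into a two-state attribute that can compete with the discrete attributes in the usual gain comparison.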

60 Decision Trees: ID3 algorithm for decision tree learning
- Determine the attribute A with highest information gain (for continuous attributes: also determine the split value).
- Construct a decision tree with root A and one leaf for each value of A (two leaves if A is continuous).
- For a non-pure leaf L: determine the attribute B with highest information gain for the data sorted into L. Replace L with a subtree consisting of root B and one leaf for each value of B (two leaves if B is continuous).
- Continue until all leaves are pure, or some other termination condition applies (e.g. possible information gains below a given threshold).
- Label each leaf with the class label that is most frequent among the data sorted into the leaf.
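The steps above can be sketched as a short recursive procedure. This is a simplified variant, not the course's implementation: it handles discrete attributes only (Savings and Assets from the credit risk table), creates leaves only for attribute values that occur in the data, and uses a hypothetical `min_gain` threshold as the termination condition.

```python
from collections import Counter
from math import log2

# Training cases 1..8, discrete attributes only.
ROWS = [
    {"Savings": "Medium", "Assets": "High", "Risk": "Good"},
    {"Savings": "Low", "Assets": "Low", "Risk": "Bad"},
    {"Savings": "High", "Assets": "Medium", "Risk": "Bad"},
    {"Savings": "Medium", "Assets": "Medium", "Risk": "Good"},
    {"Savings": "Low", "Assets": "Medium", "Risk": "Good"},
    {"Savings": "High", "Assets": "High", "Risk": "Good"},
    {"Savings": "Low", "Assets": "Low", "Risk": "Bad"},
    {"Savings": "Medium", "Assets": "Medium", "Risk": "Good"},
]


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def gain(rows, attr, target):
    """Information gain of splitting rows on attr."""
    expected = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        expected += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - expected


def id3(rows, attrs, target, min_gain=1e-9):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority  # pure leaf, or nothing left to split on
    best = max(attrs, key=lambda a: gain(rows, a, target))
    if gain(rows, best, target) < min_gain:
        return majority  # termination condition: gain below threshold
    rest = [a for a in attrs if a != best]
    children = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        children[value] = id3(subset, rest, target, min_gain)
    return (best, children)


tree = id3(ROWS, ["Savings", "Assets"], "Risk")
print("root attribute:", tree[0])
```

Because each recursive call only sees the data sorted into its leaf, the gain of an attribute is re-evaluated locally at every level.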

61 Decision Trees: Pros and Cons
+ Easy to interpret.
+ Efficient learning methods.
- Difficulties with handling missing data.

62 Overfitting: The problem
Learned tree: Assets = L → bad; Assets = M → Savings (L → bad, M → good, H → bad); Assets = H → good.
Predictions made by the learned model:
- Assets=M, Savings=M ⇒ Risk=good
- Assets=M, Savings=H ⇒ Risk=bad
Observations:
- The training data contained a single case with Assets=M, Savings=H.
- This case had the (uncharacteristic?) class label Risk=bad.
- The model is overfitted to the training data.
- With the prediction Assets=M, Savings=H ⇒ Risk=good we will likely obtain a higher accuracy on future cases.

63 Overfitting: The general problem
- Complex models will represent properties of the training data very precisely.
- The training data may contain some peculiar properties that are not representative of the domain.
- The model will then not perform optimally in classifying future instances.
(Figure: classification error as a function of model size, with one curve for training data and one for future data.)

64 Overfitting: Decision Tree Pruning
To prevent overfitting, extensions of ID3 (C4.5, C5.0) add a pruning step after the tree construction:
- Data is split into a training set and a test set.
- The decision tree is learned using training data only.
- Pruning: for an internal node A, replace the subtree rooted at A with a leaf if this reduces the classification error on the test set.

65 Overfitting: Example
Full tree: Assets = L → bad; Assets = M → Savings (L → bad, M → good, H → bad); Assets = H → good.
Pruned tree: Assets = L → bad; Assets = M → good; Assets = H → good.
Test data (showing only cases with Assets=M):
Id.  Savings  Assets  Income  Risk
9    High     Medium  50      Good
10   Low      Medium  50      Bad
11   High     Medium  75      Good
12   Medium   Medium  50      Good
Accuracy of full tree on test data: 50%
Accuracy of pruned tree on test data: 75%
⇒ prune the Savings node.
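The two accuracies can be re-derived by applying both trees to the four test cases. A small sketch in which the trees are hard-coded as functions (`full_tree`, `pruned_tree`) rather than learned; the function names and the tuple layout are illustrative.

```python
# Test cases 9..12 as (savings, assets, actual risk); all have Assets = Medium.
TEST_CASES = [
    ("High", "Medium", "Good"),
    ("Low", "Medium", "Bad"),
    ("High", "Medium", "Good"),
    ("Medium", "Medium", "Good"),
]


def full_tree(savings, assets):
    """Unpruned tree: the Assets = Medium branch tests Savings."""
    if assets == "Low":
        return "Bad"
    if assets == "High":
        return "Good"
    return {"Low": "Bad", "Medium": "Good", "High": "Bad"}[savings]


def pruned_tree(savings, assets):
    """Pruned tree: the Savings subtree is replaced by the leaf Good."""
    return {"Low": "Bad", "Medium": "Good", "High": "Good"}[assets]


def accuracy(tree, cases):
    """Fraction of cases whose predicted label equals the actual label."""
    correct = sum(tree(savings, assets) == risk for savings, assets, risk in cases)
    return correct / len(cases)


full_acc = accuracy(full_tree, TEST_CASES)
pruned_acc = accuracy(pruned_tree, TEST_CASES)
print(f"full: {full_acc:.0%}, pruned: {pruned_acc:.0%}")
```

The pruning rule compares exactly these two numbers: the subtree is replaced because the leaf makes fewer errors on the held-out cases.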

66 Overfitting: Model Tuning with Test Set
(Diagram: the data is split into Train and Test; a model is learned on Train and applied to Test; the result is used to tune the parameter setting; with the tuned setting a final model is learned from all the data.)
- Models can be adjusted or tuned (e.g. pruning subtrees, setting model parameters).
- Tuning can be an iterative process that requires repeated evaluations on the test set.
- A final model is learned using all the data.
- Problem: part of the data is wasted as test set.

68 Overfitting: Cross Validation
- Partition the data into n subsets or folds (typically: n = 10).
- For each setting of the tuning parameter:
  - for i = 1 to n: learn a model using folds 1, ..., i−1, i+1, ..., n as training data, and measure performance on fold i
  - model performance = average performance on the n test sets
- Choose the parameter setting with best performance.
- Learn the final model with the chosen parameter setting using the whole available data.
Cross Validation is also used for final evaluation of a learned model.
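The procedure above can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the data is a made-up one-dimensional toy set, the learner is a small k-nearest-neighbors classifier standing in for any tunable model, k is the tuning parameter, and folds are formed by taking every n-th item.

```python
def k_fold_indices(n_items, n_folds):
    """Assign item i to fold i mod n_folds; returns one index list per fold."""
    return [[i for i in range(n_items) if i % n_folds == f] for f in range(n_folds)]


def cross_validate(data, n_folds, learn, param):
    """Average accuracy over the folds for one setting of the tuning parameter."""
    scores = []
    for test_idx in k_fold_indices(len(data), n_folds):
        held_out = set(test_idx)
        train = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]
        model = learn(train, param)  # learn on all folds except the held-out one
        correct = sum(model(x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def learn_knn(train, k):
    """Stand-in learner: k-nearest neighbors on one numeric attribute."""
    def predict(x):
        nearest = sorted(train, key=lambda d: abs(d[0] - x))[:k]
        votes = [label for _, label in nearest]
        return max(sorted(set(votes)), key=votes.count)
    return predict


# Made-up toy data: (attribute value, class label).
data = [(x, "low" if x < 5 else "high") for x in range(10)]

results = {k: cross_validate(data, 5, learn_knn, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
final_model = learn_knn(data, best_k)  # final model uses all available data
print(results, "chosen k:", best_k)
```

Every case is used for testing exactly once and for training n − 1 times, which avoids permanently setting aside part of the data.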


More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Machine Learning: Algorithms and Applications Mockup Examination

Machine Learning: Algorithms and Applications Mockup Examination Machine Learning: Algorithms and Applications Mockup Examination 14 May 2012 FIRST NAME STUDENT NUMBER LAST NAME SIGNATURE Instructions for students Write First Name, Last Name, Student Number and Signature

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/01/12 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA. Data Mining Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA January 13, 2011 Important Note! This presentation was obtained from Dr. Vijay Raghavan

More information

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect BEOP.CTO.TP4 Owner: OCTO Revision: 0001 Approved by: JAT Effective: 08/30/2018 Buchanan & Edwards Proprietary: Printed copies of

More information

劉介宇 國立台北護理健康大學 護理助產研究所 / 通識教育中心副教授 兼教師發展中心教師評鑑組長 Nov 19, 2012

劉介宇 國立台北護理健康大學 護理助產研究所 / 通識教育中心副教授 兼教師發展中心教師評鑑組長 Nov 19, 2012 劉介宇 國立台北護理健康大學 護理助產研究所 / 通識教育中心副教授 兼教師發展中心教師評鑑組長 Nov 19, 2012 Overview of Data Mining ( 資料採礦 ) What is Data Mining? Steps in Data Mining Overview of Data Mining techniques Points to Remember Data mining

More information

Summary. Machine Learning: Introduction. Marcin Sydow

Summary. Machine Learning: Introduction. Marcin Sydow Outline of this Lecture Data Motivation for Data Mining and Learning Idea of Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication and Regression Examples Data:

More information

Data Mining Classification - Part 1 -

Data Mining Classification - Part 1 - Data Mining Classification - Part 1 - Universität Mannheim Bizer: Data Mining I FSS2019 (Version: 20.2.2018) Slide 1 Outline 1. What is Classification? 2. K-Nearest-Neighbors 3. Decision Trees 4. Model

More information

Experimental Design + k- Nearest Neighbors

Experimental Design + k- Nearest Neighbors 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Experimental Design + k- Nearest Neighbors KNN Readings: Mitchell 8.2 HTF 13.3

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008 An Introduction to WEKA Explorer In part from: Yizhou Sun 2008 What is WEKA? Waikato Environment for Knowledge Analysis It s a data mining/machine learning tool developed by Department of Computer Science,,

More information

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Classification: Basic Concepts, Decision Trees, and Model Evaluation Classification: Basic Concepts, Decision Trees, and Model Evaluation Data Warehousing and Mining Lecture 4 by Hossen Asiful Mustafa Classification: Definition Given a collection of records (training set

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 06/0/ Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

Classification with Decision Tree Induction

Classification with Decision Tree Induction Classification with Decision Tree Induction This algorithm makes Classification Decision for a test sample with the help of tree like structure (Similar to Binary Tree OR k-ary tree) Nodes in the tree

More information

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Thomas Giraud Simon Chabot October 12, 2013 Contents 1 Discriminant analysis 3 1.1 Main idea................................

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Eric Medvet 16/3/2017 1/77 Outline Machine Learning: what and why? Motivating example Tree-based methods Regression trees Trees aggregation 2/77 Teachers Eric Medvet Dipartimento

More information

A Brief Introduction to Data Mining

A Brief Introduction to Data Mining A Brief Introduction to Data Mining L. Torgo ltorgo@dcc.fc.up.pt Departamento de Ciência de Computadores Faculdade de Ciências / Universidade do Porto Feb, 2017 What is Data Mining? Introduction A possible

More information

Supervised and Unsupervised Learning (II)

Supervised and Unsupervised Learning (II) Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised

More information

Classification and Regression

Classification and Regression Classification and Regression Announcements Study guide for exam is on the LMS Sample exam will be posted by Monday Reminder that phase 3 oral presentations are being held next week during workshops Plan

More information

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data. Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss

More information

INTRODUCTION TO DATA MINING

INTRODUCTION TO DATA MINING INTRODUCTION TO DATA MINING 1 Chiara Renso KDDLab - ISTI CNR, Italy http://www-kdd.isti.cnr.it email: chiara.renso@isti.cnr.it Knowledge Discovery and Data Mining Laboratory, ISTI National Research Council,

More information

KTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. K-means, knn

KTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. K-means, knn KTH ROYAL INSTITUTE OF TECHNOLOGY Lecture 14 Machine Learning. K-means, knn Contents K-means clustering K-Nearest Neighbour Power Systems Analysis An automated learning approach Understanding states in

More information

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten Representing structural patterns: Plain Classification rules Decision Tree Rules with exceptions Relational solution Tree for Numerical Prediction Instance-based presentation Reading Material: Chapter

More information

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control. What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem

More information

CSE 626: Data mining. Instructor: Sargur N. Srihari. Phone: , ext. 113

CSE 626: Data mining. Instructor: Sargur N. Srihari.   Phone: , ext. 113 CSE 626: Data mining Instructor: Sargur N. Srihari E-mail: srihari@cedar.buffalo.edu Phone: 645-6164, ext. 113 1 What is Data Mining? Different perspectives: CSE, Business, IT As a field of research in

More information

Intro to Artificial Intelligence

Intro to Artificial Intelligence Intro to Artificial Intelligence Ahmed Sallam { Lecture 5: Machine Learning ://. } ://.. 2 Review Probabilistic inference Enumeration Approximate inference 3 Today What is machine learning? Supervised

More information

Ripple Down Rule learner (RIDOR) Classifier for IRIS Dataset

Ripple Down Rule learner (RIDOR) Classifier for IRIS Dataset Ripple Down Rule learner (RIDOR) Classifier for IRIS Dataset V.Veeralakshmi Department of Computer Science Bharathiar University, Coimbatore, Tamilnadu veeralakshmi13@gmail.com Dr.D.Ramyachitra Department

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

What Is Data Mining? CMPT 354: Database I -- Data Mining 2 Data Mining What Is Data Mining? Mining data mining knowledge Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data CMPT

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Python With Data Science

Python With Data Science Course Overview This course covers theoretical and technical aspects of using Python in Applied Data Science projects and Data Logistics use cases. Who Should Attend Data Scientists, Software Developers,

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information

An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs

An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs An Introduction to Cluster Analysis Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs zhaoxia@ics.uci.edu 1 What can you say about the figure? signal C 0.0 0.5 1.0 1500 subjects Two

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 10 - Classification trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey

More information

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44 Data Mining Piotr Paszek piotr.paszek@us.edu.pl Introduction (Piotr Paszek) Data Mining DM KDD 1 / 44 Plan of the lecture 1 Data Mining (DM) 2 Knowledge Discovery in Databases (KDD) 3 CRISP-DM 4 DM software

More information

A SURVEY ON DATA MINING TECHNIQUES FOR CLASSIFICATION OF IMAGES

A SURVEY ON DATA MINING TECHNIQUES FOR CLASSIFICATION OF IMAGES A SURVEY ON DATA MINING TECHNIQUES FOR CLASSIFICATION OF IMAGES 1 Preeti lata sahu, 2 Ms.Aradhana Singh, 3 Mr.K.L.Sinha 1 M.Tech Scholar, 2 Assistant Professor, 3 Sr. Assistant Professor, Department of

More information

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei Data Mining Chapter 1: Introduction Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei 1 Any Question? Just Ask 3 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Output: Knowledge representation Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter of Data Mining by I. H. Witten and E. Frank Decision tables Decision trees Decision rules

More information

Input: Concepts, Instances, Attributes

Input: Concepts, Instances, Attributes Input: Concepts, Instances, Attributes 1 Terminology Components of the input: Concepts: kinds of things that can be learned aim: intelligible and operational concept description Instances: the individual,

More information

Performance Evaluation of Various Classification Algorithms

Performance Evaluation of Various Classification Algorithms Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***----------------------------------------------------------

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

Part I. Instructor: Wei Ding

Part I. Instructor: Wei Ding Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set

More information

k Nearest Neighbors Super simple idea! Instance-based learning as opposed to model-based (no pre-processing)

k Nearest Neighbors Super simple idea! Instance-based learning as opposed to model-based (no pre-processing) k Nearest Neighbors k Nearest Neighbors To classify an observation: Look at the labels of some number, say k, of neighboring observations. The observation is then classified based on its nearest neighbors

More information

Random Forest A. Fornaser

Random Forest A. Fornaser Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University

More information

Classification with Diffuse or Incomplete Information

Classification with Diffuse or Incomplete Information Classification with Diffuse or Incomplete Information AMAURY CABALLERO, KANG YEN Florida International University Abstract. In many different fields like finance, business, pattern recognition, communication

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Output: Knowledge representation Tables Linear models Trees Rules

More information

Chapter 4 Data Mining A Short Introduction

Chapter 4 Data Mining A Short Introduction Chapter 4 Data Mining A Short Introduction Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining 3. Clustering 4. Classification Data Mining - 2 2 1. Data Mining Overview

More information

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?

More information

Basic Data Mining Technique

Basic Data Mining Technique Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm

More information

Jarek Szlichta

Jarek Szlichta Jarek Szlichta http://data.science.uoit.ca/ Approximate terminology, though there is some overlap: Data(base) operations Executing specific operations or queries over data Data mining Looking for patterns

More information

Decision tree learning

Decision tree learning Decision tree learning Andrea Passerini passerini@disi.unitn.it Machine Learning Learning the concept Go to lesson OUTLOOK Rain Overcast Sunny TRANSPORTATION LESSON NO Uncovered Covered Theoretical Practical

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

An Introduction to Data Mining in Institutional Research. Dr. Thulasi Kumar Director of Institutional Research University of Northern Iowa

An Introduction to Data Mining in Institutional Research. Dr. Thulasi Kumar Director of Institutional Research University of Northern Iowa An Introduction to Data Mining in Institutional Research Dr. Thulasi Kumar Director of Institutional Research University of Northern Iowa AIR/SPSS Professional Development Series Background Covering variety

More information

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Model Selection Matt Gormley Lecture 4 January 29, 2018 1 Q&A Q: How do we deal

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV Subject Name: Elective I Data Warehousing & Data Mining (DWDM) Subject Code: 2640005 Learning Objectives: To understand

More information

A Classifier with the Function-based Decision Tree

A Classifier with the Function-based Decision Tree A Classifier with the Function-based Decision Tree Been-Chian Chien and Jung-Yi Lin Institute of Information Engineering I-Shou University, Kaohsiung 84008, Taiwan, R.O.C E-mail: cbc@isu.edu.tw, m893310m@isu.edu.tw

More information

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING CS 7265 BIG DATA ANALYTICS INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, PhD Computer Science,

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

AMOL MUKUND LONDHE, DR.CHELPA LINGAM

AMOL MUKUND LONDHE, DR.CHELPA LINGAM International Journal of Advances in Applied Science and Engineering (IJAEAS) ISSN (P): 2348-1811; ISSN (E): 2348-182X Vol. 2, Issue 4, Dec 2015, 53-58 IIST COMPARATIVE ANALYSIS OF ANN WITH TRADITIONAL

More information

DATA WAREHOUING UNIT I

DATA WAREHOUING UNIT I BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW. Ana Azevedo and M.F. Santos

KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW. Ana Azevedo and M.F. Santos KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW Ana Azevedo and M.F. Santos ABSTRACT In the last years there has been a huge growth and consolidation of the Data Mining field. Some efforts are being done

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining José Hernández-Orallo Dpto. de Sistemas Informáticos y Computación Universidad Politécnica de Valencia, Spain jorallo@dsic.upv.es Roma, 14-15th May 2009 1 Outline Motivation.

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 03 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information