IV. MODELS FROM DATA: Data mining
- Julia Carter
- 5 years ago
Outline
A) THEORETICAL BACKGROUND
1. Knowledge discovery in data bases (KDD)
2. Data mining
- Data
- Patterns
- Data mining algorithms
B) PRACTICAL IMPLEMENTATIONS
3. Applications
- Equations
- Decision trees
- Rules
Methods
[Diagram: classic predictive methods (data models: Hypothesis, Model, Knowledge) versus hypothesis discovery in data (Data, Model, New knowledge)]
Marko Debeljak

Knowledge discovery in data bases (KDD)
Frawley et al. (1991): KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
How to find patterns in data?
Data mining (DM) is the central step in the KDD process, concerned with applying computational techniques to actually find patterns in the data (15-25% of the effort of the overall KDD process). It sits between two other steps:
- step 1: preparing data for DM (data preprocessing)
- step 3: evaluating the discovered patterns (results of DM)
Knowledge discovery in data bases (KDD)
When can patterns be treated as knowledge?
Frawley et al. (1991): A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user's criteria) is called knowledge.
- Condition 1: Discovered patterns should be valid on new data with some degree of certainty (typically prescribed by the user).
- Condition 2: The patterns should potentially lead to some useful actions (according to user-defined utility criteria).

Knowledge discovery in data bases (KDD)
What can KDD and data mining contribute to environmental sciences (ES) (e.g. agronomy, forestry, ecology, ...)?
Environmental sciences deal with complex, unpredictable natural systems (e.g. arable, forest and water ecosystems) in order to answer complex questions.
The amount of environmental data being collected is increasing exponentially.
KDD was purposefully designed to cope with such complex questions about complex systems, such as:
- Understanding the domain/system studied (e.g., gene flow, seed bank, life cycle, ...)
- Predicting future values of system variables of interest (e.g., the rate of out-crossing with GM plants at location x at time y, seedbank dynamics, ...)
What is data mining?
Data mining focuses on the discovery of previously unknown knowledge and integrates machine learning.
Machine learning focuses on description and prediction, based on known properties learned from empirical training data (examples) using computer algorithms.
Learning from examples is called inductive learning.
If the goal of inductive learning is to obtain a model that predicts the value of the target variable from learning examples, it is called predictive or supervised learning.

Data mining (DM)
What is data mining from a technical perspective?
Data mining is the process of automatically searching large volumes of data for patterns using algorithms.
Data mining is the application of machine learning techniques to data analysis problems.
The most relevant notions of data mining:
1. Data
2. Patterns
3. Data mining algorithms
Data mining (DM) - data
- PROPOSITIONAL data mining: data stored in one flat table. Each example is represented by a fixed number of attributes. Loss of information due to aggregation.
- RELATIONAL data mining: data stored in original tables or relations. No loss of information due to aggregation.
- DATA STREAM mining: data are not stored at all but flow continuously through an algorithm. Each example can be propositional or relational.
[Table illustration: objects (rows) described by properties of objects (columns): Distance (m), Wind direction (°), Wind speed (m/s), Out-crossing rate (%)]
Marko Debeljak, WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, April 2007
Data mining (DM) - pattern
2. What is a pattern?
A pattern is defined as a statement (expression) in a given language that describes relationships among the facts (attributes, elements) in a subset of the given data and is (in some sense) simpler than the enumeration of all facts in the subset (Frawley et al. 1991, Fayyad et al. 1996). A pattern is therefore a model.
Classes of patterns considered in data mining:
A. equations,
B. decision trees, relational decision trees,
C. association, classification, and regression rules.

Data mining (DM) - pattern
A. Equations
To predict the value of a target (dependent) variable as a linear or nonlinear combination of the input (independent) variables:
- Algebraic equations
To predict the behavior of dynamic systems, which change their rates over time:
- Difference equations
- Differential equations
Data mining (DM) - pattern
B. Decision trees
To predict the value of one or several target (dependent) variables from the values of other (independent) variables by a decision tree.
A decision tree has a hierarchical structure, where:
- each internal node contains a test on one or more independent variables,
- each branch corresponds to an outcome of the test (critical values of the independent variable(s)),
- each leaf gives a prediction for the value of the dependent (predicted) variable(s).

Data mining (DM) - pattern
The decision tree is called:
- A classification tree: the class value in the leaf is discrete (a finite set of nominal values), e.g., (yes, no), (spec. A, spec. B, ...)
- A regression tree: the class value in the leaf is a constant (infinite set of values), e.g., 120, 220, 312, ...
- A model tree: the leaf contains a linear model predicting the class value (piecewise linear function): out-crossing rate = 12.3 distance wind speed wind direction
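The node, branch and leaf structure described above can be sketched in a few lines of Python. This is only an illustration: the attribute names, thresholds and class labels in `toy_tree` are hypothetical and are not taken from any model in these slides, and tests are restricted to a single numeric attribute per node.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    # A class label (classification tree) or a constant (regression tree).
    prediction: object


@dataclass
class Node:
    attribute: str                  # independent variable tested in this node
    threshold: float                # critical value of that variable
    low: "Union[Leaf, Node]"        # branch taken when value <= threshold
    high: "Union[Leaf, Node]"       # branch taken when value > threshold


def predict(tree, example):
    """Follow the branches from the root until a leaf is reached."""
    while isinstance(tree, Node):
        value = example[tree.attribute]
        tree = tree.low if value <= tree.threshold else tree.high
    return tree.prediction


# Hypothetical toy tree (all names and numbers are assumptions).
toy_tree = Node(
    "distance", 10.0,
    low=Node("wind_speed", 2.0, low=Leaf("below"), high=Leaf("above")),
    high=Leaf("below"),
)

print(predict(toy_tree, {"distance": 5.0, "wind_speed": 3.0}))
```

Replacing the constant in `Leaf` with a small linear function of the attributes would turn the same structure into a model tree.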
Data mining (DM) - pattern
C. Rules
To perform association analysis between attributes, discovered by association rules.
A rule denotes a pattern of the form: IF conjunction of conditions THEN conclusion.
- For classification rules, the conclusion assigns one of the possible discrete values to the class (a finite set of nominal values), e.g., (yes, no), (spec. A, spec. B, spec. D)
- For predictive rules, the conclusion gives a prediction for the value of the target (class) variable (infinite set of values), e.g., 120, 220, 312, ...

Data mining (DM) - algorithm
3. What is a data mining algorithm?
Algorithm in general:
- A procedure (a finite set of well-defined instructions) for accomplishing some task, which terminates in a defined end state.
Data mining algorithm:
- A computational process defined by a Turing machine (Gurevich et al. 2000) for finding patterns in data.
Data mining (DM) - algorithm
What kinds of algorithms do we use for discovering patterns?
The selection of the algorithm depends on the problem at hand:
1. Equations = linear and multiple regression, equation discovery
2. Decision trees = top-down induction of decision trees
3. Rules = rule induction

Data mining (DM) - algorithm
1. Linear and multiple regression
Bivariate linear regression: the predicted variable C (the class in machine learning terminology; it may be continuous or discrete) can be expressed as a linear function of one attribute A:
$C = \alpha + \beta A$
Multiple regression: the predicted variable C (continuous or discrete) can be expressed as a linear function of a multi-dimensional attribute vector $(A_i)$:
$C = \sum_{i=1}^{n} \beta_i A_i$
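As a minimal sketch of the bivariate case, the coefficients α and β of C = α + βA can be estimated by ordinary least squares. The slides do not say which estimation method is used, so least squares is an assumption here, and the data values are made up for illustration.

```python
def fit_line(a_values, c_values):
    """Ordinary least squares estimate of alpha and beta in C = alpha + beta * A."""
    n = len(a_values)
    mean_a = sum(a_values) / n
    mean_c = sum(c_values) / n
    # Sum of squared deviations of A, and of cross products of A and C.
    s_aa = sum((a - mean_a) ** 2 for a in a_values)
    s_ac = sum((a - mean_a) * (c - mean_c) for a, c in zip(a_values, c_values))
    beta = s_ac / s_aa
    alpha = mean_c - beta * mean_a
    return alpha, beta


# Made-up data lying exactly on C = 1 + 2 * A.
alpha, beta = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(alpha, beta)
```

Multiple regression generalises this to several attributes; in practice it is usually solved with a linear algebra routine rather than by hand.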
Data mining (DM) - algorithm
2. Top-down induction of decision trees
The decision tree is induced by the Top-Down Induction of Decision Trees (TDIDT) algorithm (Quinlan, 1986).
Tree construction proceeds recursively, starting with the entire set of training examples (the entire table). At each step, an attribute is selected as the root of the (sub)tree and the current training set is split into subsets according to the values of the selected attribute.

Data mining (DM) - algorithm
3. Rule induction
A rule that correctly classifies some examples is constructed first.
The positive examples covered by the rule are removed from the training set, and the process is repeated until no more examples remain.
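The recursive TDIDT procedure described above can be sketched as follows. The slides do not state how the splitting attribute is selected; this sketch assumes the entropy-based (information gain) criterion and handles nominal attributes only. The tiny dataset and its attribute names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_attribute(rows, attributes, target):
    """Pick the attribute whose split leaves the lowest weighted entropy."""
    def remainder(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(attributes, key=remainder)


def tdidt(rows, attributes, target):
    """Return a nested dict {attribute: {value: subtree}} or a class label (leaf)."""
    labels = [row[target] for row in rows]
    # Stop when the subset is pure or no attributes remain: predict the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in sorted({row[attr] for row in rows}):
        subset = [row for row in rows if row[attr] == value]
        branches[value] = tdidt(subset, remaining, target)
    return {attr: branches}


# Hypothetical training table (values are assumptions for illustration).
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
tree = tdidt(rows, ["outlook", "windy"], "play")
print(tree)
```

Real implementations such as C4.5 add handling of numeric attributes, missing values and pruning, none of which is shown here.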
Data mining (DM) - Statistics
Data mining vs. statistics
Common to both approaches: reasoning FROM properties of a data sample TO properties of a population.

Data mining (DM): machine learning - statistics
Statistics
- Hypothesis testing when certain theoretical expectations about the data distribution, independence, random sampling, sample size, etc. are satisfied.
- Main approach: best fit to all the available data.
Data mining
- Automated construction of understandable patterns and structured models.
- Main approach: structuring the data space, heuristic search for decision trees and rules covering (parts of) the data space.
DATA MINING CASE STUDIES

Applications: ecological modeling
Propositional and relational supervised data mining:
- Simple data mining
- Data mining of time series
- Spatial data mining
1. Equations:
- Algebraic equations
- Differential equations
2. Single and multi-target decision trees:
- Classification trees
- Regression trees
- Model trees (single target only)
Applications: ecological modeling
- POPULATION DYNAMICS
- HABITAT MODELLING
- GENE FLOW MODELLING
- RISK MODELLING

Algebraic equations: SOIL WATER FLOW
- Problem: prediction of drainage water
- Type of pattern: algebraic equation
- Algorithm: CIPER
PCQE Database
- Experimental site La Jaillière, Western France
- Owned by ARVALIS
- Shallow silt clay soils
- 11 fields are observed
- Field size about ha
Introduction Domain & Problem Related work Methodology Data sources Experimental design Results Discussion Further work

PCQE Database (continued)
- Agricultural practices: fertilization, irrigation, phytochemical protection, harvesting, tillage
- Slope
- Water flow: drainage, runoff
- 25 campaigns ( )
- A campaign is defined as the period starting on 01 SEP and finishing on 31 AUG of the following year
DRAINAGE predictive model - CIPER
- Polynomial equations induced on data for a whole campaign
- Algorithm: CIPER
- Evaluation: leave-one-out approach
[Table: evaluation results per test field; columns: Fields, Test field, Std. Dev., RMSE, RRSE, Corr. coeff. (r); the values are not recoverable]

Predictive models
Model (All/T4): [Equation: polynomial for Drainage with terms in RainfallA1, Temp, DrainageN1, CDCoef, Slope and Runoff, with powers up to 3; the coefficients are not recoverable]
Predictive models

Constraint algebraic equations: GENE FLOW
- Problem: prediction of gene flow
- Type of pattern: constraint algebraic equation
- Algorithm: Lagramge

Constraint algebraic equations: GENE FLOW
Experimental design: Federal Biological Research Centre, BBA, D
[Figure: field design 2000, showing the transgenic field (donor, GMO corn), the non-transgenic field (recipient, NT corn), access paths, the direction of drilling, the system of coordinates for the sampling points, and % of outcrossing; legible labels include 96 points, 60 cobs and 2500 kernels]
Differential equations: COMMUNITY STRUCTURE
- Problem: time-dependent ecosystem processes
- Type of pattern: differential equations
- Algorithm: Lagramge

Differential equations: COMMUNITY STRUCTURE
Data: 1995 to
- Phosphorus: water inflow, outflow, respiration, growth
- Phytoplankton: growth, respiration, sedimentation, grazing
- Zooplankton: feeds on phytoplankton, respiration, mortality
EQUATIONS
- Algebraic: CIPER
- Constraint algebraic: Lagramge
- Differential: Lagramge
SUITABLE for predictions, UNSUITABLE for interpretation

Decision trees: HABITAT MODELS
Decision trees: HABITAT MODELS
- Problem: classification of habitats (suitable, unsuitable)
- Type of pattern: classification decision trees
- Algorithm: J4.8
[Figure: observed locations of BBs]
Decision trees: HABITAT MODELS
The training dataset
Positive examples:
- Locations of bear sightings (Hunting Association; telemetry)
- Females only
- Using home-range (HR) areas instead of raw locations
- Narrower HR for optimal habitat, wider for maximal
Negative examples:
- Sampled from the unsuitable part of the study area
- Stratified random sampling
- Different land cover types equally accounted for

Decision trees: HABITAT MODELS
Propositional dataset (Present: 1, Absent: 0):
1,73,26,0,0,1,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,0,0,0,4123,0,0,0,0,63,211,11,11,11,83,213,213,0,0,
,62,37,0,0,2,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,1,53,0,3640,0,0,0,-1347,63,211,11,11,11,83,213,213,11,89,
,0,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,6,82,0,10404,0,2074,-309,48,0,0,11,11,11,83,83,83,0,20,
,0,100,0,0,1,76,0,16,71,0,12,0,0,0,0,0,0,0,0,0,0,0,1,6,82,0,7500,0,1661,-319,-942,0,0,11,11,11,0,0,0,0,20,
,8,91,0,0,1,52,0,59,41,0,0,0,0,0,0,0,4,0,0,0,0,5,1,6,82,0,6500,0,1505,-166,879,9,57,11,11,11,281,281,281,0,20,
,3,0,86,9,0,75,0,33,67,0,0,0,0,0,0,1,2,0,0,0,0,0,1,2,54,0,0,0,465,-66,-191,4,225,11,31,31,41,72,272,60,619,
,34,65,0,0,2,51,9,76,9,5,1,4,1,0,1,0,29,0,0,0,0,0,1,2,54,0,3000,0,841,-111,-264,34,220,11,41,41,151,141,112,60,619,
,100,0,0,0,3,52,0,86,6,3,5,9,6,7,38,40,0,0,0,0,0,0,1,17,64,0,8062,0,932,-603,-71,100,337,11,41,41,171,232,202,4,24,
[Figures: the model for maximal habitat; the model for optimal habitat]

Decision trees: HABITAT MODELS
- Map of maximal habitat (39% of SLO territory)
- Map of optimal habitat (13% of SLO territory)
Multi-target predictions: COMMUNITY STRUCTURE
Decision trees: COMMUNITY STRUCTURE
- Problem: temporal prediction of height and cover of crop and weeds
- Type of pattern: a) multi-target regression trees / time series; b) constraint multi-target regression clustering trees
- Algorithm: CLUS
Data
- 130 sites, monitored every 7 to 14 days for 5 months (2665 samples: 1322 conventional, 1333 HT OSR observations)
- Each sample (observation) is described by 65 attributes
- Original data collected by the Centre for Ecology and Hydrology, Rothamsted Research and SCRI within the Farm Scale Evaluation Program (2000, 2001, 2002)

Results scenario A: Multiple target regression tree
- Target: Avg Crop Covers, Avg Weed Covers
- Excluded attributes: /
- Constraints: MinimalInstances = 64.0; MaxSize = 15
- Predictive power: Corr.Coef.: , RMSE: , RRMSE: ,
Results scenario B: Constraint predictive clustering trees for time series
- Syntactic constraint
- Target: Avg Weed Covers (time series)
- Scenario 3.9
- Constraints: Syntactic, MinInstances = 32
- Predictive power: TSRMSExval: 4.98, TSRMSEtrain: 4.86, ICVtrain:
Relational data mining: GENE FLOW MODELLING
Marko Debeljak, WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, April 2007
- Problem: classification of fields as above or below 0.9% of outcrossing
- Type of pattern: relational classification decision tree
- Algorithm: TILDE
Spatial temporal relations
- 2004: 40 GM fields, 7 non-GM fields, 181 sampling points
- 2005: 17 GM fields, 4 non-GM fields, 127 sampling points
- 2006: 43 GM fields, 4 non-GM fields, 112 sampling points

Relational data mining: GENE FLOW MODELLING
Data scattered over several tables or relations:
- A table storing general information on each field (e.g., area)
- A table storing the cultivation techniques for each field and each year
- A table storing the relations (e.g., distance) between fields
Relational data mining: GENE FLOW MODELLING
Relational database system: PostGIS
Relational data mining: building the model
Relational data analysis:
- Algorithm Tilde (Blockeel and De Raedt, 1998; De Raedt et al., 2001), an upgrade of the C4.5 algorithm (Quinlan, 1993) for classification decision trees
- The algorithm is included in the ACE-ilProlog data mining system

Relational data mining: results
[Figures: models for Threshold 0.01%, Threshold 0.45% and Threshold 0.9%]
Multi-target regression model: RISK MODELLING
Decision trees: RISK MODELLING
- Problem: prediction of soil exposed to disturbances
- Type of pattern: multi-target regression trees (resistance, resilience)
- Algorithm: CLUS
Multi-target regression model: RISK MODELLING
The dataset:
- Soil samples taken at 26 locations throughout SCO
- The flat table of data: 26 of 18 data entries
- Physical properties: soil texture (sand, silt, clay)
- Chemical properties: pH, C, N, SOM (soil organic matter)
- FAO soil classification: Order and Suborder
- Physical resilience: resistance to compression (1/Cc), recovery from compression (Ce/Cc), overburden stress (e.g.), recovery from overburden stress after two day cycles (eg2dc)
- Biological resilience: heat, copper
Multi-target regression model: RISK MODELLING
Macaulay Institute (Aberdeen): soils data attributes and maps:
- Approximately soil profiles held in database
- Descriptions of over soil horizons
Decision trees
- Single or multiple decision trees
- Classification, regression, model trees
- Propositional and relational data
- Temporal and spatial data
SUITABLE for predictions, SUITABLE for interpretation
Conclusions
What can data mining do for you?
Knowledge discovered by analyzing data with DM techniques can help:
- Understand the domain studied
- Make predictions/classifications
- Support decision processes in environmental management

Conclusions
What can data mining not do for you?
- The law of information conservation (garbage in, garbage out)
- The knowledge we are seeking to discover has to come from the combination of data and background knowledge
- If we have very little data of very low quality and no background knowledge, no form of data analysis will help
Conclusions
Side-effects?
- Discovering problems with the data during analysis: missing values, erroneous values, inappropriately measured variables
- Identifying new opportunities: new problems to be addressed, recommendations on what data to collect and how

DATA MINING data preprocessing: DATA FORMAT
- File extension: .arff
- This is a plain text format; files should be edited with editors such as Notepad, TextPad or WordPad (which do not add extra formatting information)
- The file consists of:
  - Title: @relation NameOfDataset
  - List of attribute definitions: @attribute AttName AttType, where AttType can be numeric or a nominal list of categorical values, e.g., {red, green, blue}
  - @data (on a separate line), followed by the actual data in comma-separated value (.csv) format
DATA MINING data preprocessing
Example of a file in ARFF format:
    @relation ...
    @attribute outlook {sunny, overcast, ...}
    @attribute temperature ...
    @attribute humidity ...
    @attribute windy {TRUE, ...}
    @attribute play {yes, ...}
    @data
    sunny,85,85,false,no
    sunny,80,90,true,no
    overcast,83,86,false,yes

DATA MINING data preprocessing: Excel
- Attributes (variables) in columns and cases in rows
- Use a decimal POINT, not a decimal COMMA, for numbers
- Save the Excel sheet as a CSV file
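To make the file layout concrete, the following sketch writes a small table in ARFF form from Python. The helper `to_arff` is hypothetical (it is not part of WEKA). The relation name and the nominal values `rainy`, `FALSE` and `no` are assumptions added to complete the example, and the nominal values are written in one consistent case, which differs from the lower-case `false`/`true` in the slide's data rows.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text: @relation line, @attribute lines, then @data and CSV rows.

    attributes is a list of (name, type) pairs; type is either the string
    "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, att_type in attributes:
        if isinstance(att_type, (list, tuple)):
            # Nominal attribute: list the allowed categorical values in braces.
            att_type = "{" + ", ".join(att_type) + "}"
        lines.append(f"@attribute {name} {att_type}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Assumed attribute declarations; only the data rows follow the slide.
attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("humidity", "numeric"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]
rows = [
    ("sunny", 85, 85, "FALSE", "no"),
    ("sunny", 80, 90, "TRUE", "no"),
    ("overcast", 83, 86, "FALSE", "yes"),
]
arff_text = to_arff("weather", attributes, rows)
print(arff_text)
```

Writing `arff_text` to a file with the `.arff` extension gives a plain text file of the kind described above.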
DATA MINING data preprocessing: TextPad, Notepad
- Open the CSV file
- Delete … at the beginning of lines and save (just save, don't change the format)
- Change all ; to ,
- Numbers must use a decimal point (.) and not a decimal comma (,)
- Save the file as a CSV file (don't change the format)

DATA MINING Hands-on exercises
2. Data mining
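The manual clean-up steps above can also be scripted. The sketch below assumes a semicolon-delimited export in which numbers use a decimal comma. It parses the file with the semicolon delimiter first, so that decimal commas are not confused with the new field separator. The function name `clean_csv` and the sample values are hypothetical.

```python
import csv
import io
import re

# A cell such as "2,5" or "-0,31": digits, one comma, digits.
DECIMAL_COMMA = re.compile(r"^-?\d+,\d+$")


def clean_csv(text):
    """Convert ';'-delimited text with decimal commas to ','-delimited text with decimal points."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        cells = []
        for cell in row:
            cell = cell.strip()
            if DECIMAL_COMMA.match(cell):
                cell = cell.replace(",", ".")
            cells.append(cell)
        writer.writerow(cells)
    return out.getvalue()


raw = "distance;rate\n2,5;0,31\n10;0,02\n"
print(clean_csv(raw))
```

A blind search-and-replace of `;` by `,` before fixing the decimal commas would merge the two meanings of the comma, which is why the order of the steps matters.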
DATA MINING data preprocessing: WEKA
- Open the CSV file in WEKA
- Select algorithm and attributes
- Perform data mining

How to select the best classification tree?
Performance of the classification tree:
- Classification accuracy: (correctly classified examples) / (all examples)
- True positive rate
- False positive rate
- Confusion matrix: a matrix showing actual and predicted classifications
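For a two-class problem, the performance measures listed above follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions (true positive rate = TP / (TP + FN), false positive rate = FP / (FP + TN)); the label vectors are made up for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, TN, FN) for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


# Made-up actual and predicted classes.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="yes")
accuracy = (tp + tn) / (tp + fp + tn + fn)   # correctly classified / all examples
tp_rate = tp / (tp + fn)                     # share of actual positives found
fp_rate = fp / (fp + tn)                     # share of actual negatives misclassified
print(tp, fp, tn, fn, accuracy, tp_rate, fp_rate)
```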
How to select the best classification tree?
Classification trees: J48
Interpretable size:
- Pruned or unpruned
- Minimal number of objects per leaf
Each leaf shows two numbers: the number of instances classified in this leaf and the number of instances INCORRECTLY classified in this leaf. The following forms can appear:
- (13): no incorrectly classified instances
- (3.5/0.5): due to missing values (?), where instances are split into fractions
- (0.0): a split on a nominal attribute where one or more of the values do not occur in the subset of instances at the node in question

How to select the best regression / model tree?
The performance of the regression / model tree:
How to select the best regression / model tree?
The interpretable size:
- Pruned or unpruned
- Minimal number of objects per leaf
Each leaf shows two numbers:
- The number of instances that REACH this leaf
- The root mean squared error (RMSE) of the predictions from the leaf's linear model for the instances that reach the leaf, expressed as a percentage of the global standard deviation of the class attribute (i.e. the standard deviation of the class attribute computed from all the training data). The percentages do not sum to 100%. The smaller this value, the better.

Accuracy and error
Avoid overfitting the data by tree pruning.
Pruned trees are:
- Less accurate (percentage of correct classifications) on training data
- More accurate when classifying unseen data
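The second leaf number described above can be sketched as a ratio of two quantities. This is an illustration under assumptions: the standard deviation here is the population form (dividing by n), which may differ from what a particular tool uses, and all values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of a set of predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def leaf_error_percent(leaf_actual, leaf_predicted, all_training_targets):
    """RMSE within a leaf as a percentage of the global standard deviation of the class attribute."""
    n = len(all_training_targets)
    mean = sum(all_training_targets) / n
    # Population standard deviation (divides by n): an assumption of this sketch.
    global_sd = math.sqrt(sum((y - mean) ** 2 for y in all_training_targets) / n)
    return 100.0 * rmse(leaf_actual, leaf_predicted) / global_sd


# Made-up class values: five training examples, two of which reach this leaf.
all_targets = [1.0, 2.0, 3.0, 4.0, 5.0]
percent = leaf_error_percent([1.0, 2.0], [1.5, 1.5], all_targets)
print(round(percent, 2))
```

Because each leaf is compared with the same global standard deviation, the percentages of different leaves are independent of each other and need not add up to 100%.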
How to prune optimally?
- Pre-pruning: stop growing the tree, e.g. when a data split is not statistically significant or too few examples are in a split (minimum number of objects in a leaf)
- Post-pruning: grow the full tree, then post-prune (confidence factor, for classification trees)

Optimal accuracy
10-fold cross-validation is a standard classifier evaluation method used in machine learning:
- Break the data into 10 sets of size n/10.
- Train on 9 sets and test on 1.
- Repeat 10 times and take the mean accuracy.
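The three cross-validation steps above can be sketched without any library. The `train_majority` learner is a deliberately trivial placeholder (it always predicts the most frequent training class) standing in for a real tree learner, and the dataset is made up. The sketch assumes at least as many examples as folds.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples, labels, train, k=10, seed=0):
    """Mean accuracy over k folds; train(xs, ys) must return a function example -> label."""
    accuracies = []
    for test_idx in k_fold_indices(len(examples), k, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(examples) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train(train_x, train_y)
        correct = sum(model(examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def train_majority(xs, ys):
    """Placeholder learner: always predict the most frequent training class."""
    majority = Counter(ys).most_common(1)[0][0]
    return lambda example: majority


# Made-up data: 20 examples, 14 of class "yes" and 6 of class "no".
examples = list(range(20))
labels = ["yes"] * 14 + ["no"] * 6
mean_accuracy = cross_validate(examples, labels, train_majority, k=10)
print(mean_accuracy)
```

Any learner that follows the same `train(xs, ys)` convention can be substituted for the placeholder.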
More informationData Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3
Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?
More informationData Mining. Decision Tree. Hamid Beigy. Sharif University of Technology. Fall 1396
Data Mining Decision Tree Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 24 Table of contents 1 Introduction 2 Decision tree
More informationWEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov
WEKA: Practical Machine Learning Tools and Techniques in Java Seminar A.I. Tools WS 2006/07 Rossen Dimov Overview Basic introduction to Machine Learning Weka Tool Conclusion Document classification Demo
More informationInternational Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X
Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,
More informationClassification. Instructor: Wei Ding
Classification Decision Tree Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Preliminaries Each data record is characterized by a tuple (x, y), where x is the attribute
More informationData Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules
Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms
More informationData Mining Algorithms: Basic Methods
Algorithms: The basic methods Inferring rudimentary rules Data Mining Algorithms: Basic Methods Chapter 4 of Data Mining Statistical modeling Constructing decision trees Constructing rules Association
More informationIntegrated Math I. IM1.1.3 Understand and use the distributive, associative, and commutative properties.
Standard 1: Number Sense and Computation Students simplify and compare expressions. They use rational exponents and simplify square roots. IM1.1.1 Compare real number expressions. IM1.1.2 Simplify square
More informationBasic Data Mining Technique
Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm
More informationA Systematic Overview of Data Mining Algorithms
A Systematic Overview of Data Mining Algorithms 1 Data Mining Algorithm A well-defined procedure that takes data as input and produces output as models or patterns well-defined: precisely encoded as a
More informationInducer: a Rule Induction Workbench for Data Mining
Inducer: a Rule Induction Workbench for Data Mining Max Bramer Faculty of Technology University of Portsmouth Portsmouth, UK Email: Max.Bramer@port.ac.uk Fax: +44-2392-843030 Abstract One of the key technologies
More informationData Mining Practical Machine Learning Tools and Techniques
Output: Knowledge representation Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter of Data Mining by I. H. Witten and E. Frank Decision tables Decision trees Decision rules
More information7. Decision or classification trees
7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,
More informationEcological Modelling
Ecological Modelling 220 (2009) 1063 1072 Contents lists available at ScienceDirect Ecological Modelling journal homepage: www.elsevier.com/locate/ecolmodel Modelling the outcrossing between genetically
More informationData Analytics and Boolean Algebras
Data Analytics and Boolean Algebras Hans van Thiel November 28, 2012 c Muitovar 2012 KvK Amsterdam 34350608 Passeerdersstraat 76 1016 XZ Amsterdam The Netherlands T: + 31 20 6247137 E: hthiel@muitovar.com
More informationWeka ( )
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised
More informationMachine Learning in Real World: C4.5
Machine Learning in Real World: C4.5 Industrial-strength algorithms For an algorithm to be useful in a wide range of realworld applications it must: Permit numeric attributes with adaptive discretization
More informationData Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier
Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio
More informationNominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN
NonMetric Data Nominal Data So far we consider patterns to be represented by feature vectors of real or integer values Easy to come up with a distance (similarity) measure by using a variety of mathematical
More informationAdvanced learning algorithms
Advanced learning algorithms Extending decision trees; Extraction of good classification rules; Support vector machines; Weighted instance-based learning; Design of Model Tree Clustering Association Mining
More informationLecture 25: Review I
Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,
More informationCS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor
CS513-Data Mining Lecture 2: Understanding the Data Waheed Noor Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan Waheed Noor (CS&IT, UoB, Quetta) CS513-Data Mining
More informationAnalytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.
Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied
More informationData Mining. Covering algorithms. Covering approach At each stage you identify a rule that covers some of instances. Fig. 4.
Data Mining Chapter 4. Algorithms: The Basic Methods (Covering algorithm, Association rule, Linear models, Instance-based learning, Clustering) 1 Covering approach At each stage you identify a rule that
More informationPart I. Instructor: Wei Ding
Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set
More informationMIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018
MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationData Mining Course Overview
Data Mining Course Overview 1 Data Mining Overview Understanding Data Classification: Decision Trees and Bayesian classifiers, ANN, SVM Association Rules Mining: APriori, FP-growth Clustering: Hierarchical
More informationLecture outline. Decision-tree classification
Lecture outline Decision-tree classification Decision Trees Decision tree A flow-chart-like tree structure Internal node denotes a test on an attribute Branch represents an outcome of the test Leaf nodes
More informationWhat is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry
Data Mining Andrew Kusiak Intelligent Systems Laboratory 2139 Seamans Center The University of Iowa Iowa City, IA 52242-1527 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Tel. 319-335 5934
More informationInternational Journal of Software and Web Sciences (IJSWS)
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International
More informationData Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank
Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Implementation: Real machine learning schemes Decision trees Classification
More informationBITS F464: MACHINE LEARNING
BITS F464: MACHINE LEARNING Lecture-16: Decision Tree (contd.) + Random Forest Dr. Kamlesh Tiwari Assistant Professor Department of Computer Science and Information Systems Engineering, BITS Pilani, Rajasthan-333031
More informationSummary. Machine Learning: Introduction. Marcin Sydow
Outline of this Lecture Data Motivation for Data Mining and Learning Idea of Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication and Regression Examples Data:
More informationClassification and Regression
Classification and Regression Announcements Study guide for exam is on the LMS Sample exam will be posted by Monday Reminder that phase 3 oral presentations are being held next week during workshops Plan
More information(Refer Slide Time: 0:51)
Introduction to Remote Sensing Dr. Arun K Saraf Department of Earth Sciences Indian Institute of Technology Roorkee Lecture 16 Image Classification Techniques Hello everyone welcome to 16th lecture in
More informationA Two Stage Zone Regression Method for Global Characterization of a Project Database
A Two Stage Zone Regression Method for Global Characterization 1 Chapter I A Two Stage Zone Regression Method for Global Characterization of a Project Database J. J. Dolado, University of the Basque Country,
More informationAggregation and Selection in Relational Data Mining
in Relational Data Mining Celine Vens Anneleen Van Assche Hendrik Blockeel Sašo Džeroski Department of Computer Science - K.U.Leuven Department of Knowledge Technologies - Jozef Stefan Institute, Slovenia
More informationData Mining and Analytics. Introduction
Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data
More informationInduction of Decision Trees
Induction of Decision Trees Blaž Zupan and Ivan Bratko magixfriuni-ljsi/predavanja/uisp An Example Data Set and Decision Tree # Attribute Class Outlook Company Sailboat Sail? 1 sunny big small yes 2 sunny
More informationCredit card Fraud Detection using Predictive Modeling: a Review
February 207 IJIRT Volume 3 Issue 9 ISSN: 2396002 Credit card Fraud Detection using Predictive Modeling: a Review Varre.Perantalu, K. BhargavKiran 2 PG Scholar, CSE, Vishnu Institute of Technology, Bhimavaram,
More informationTour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers
Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background
More informationData Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input
Data Mining 1.3 Input Fall 2008 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be learned. Characterized
More informationPSS718 - Data Mining
Lecture 5 - Hacettepe University October 23, 2016 Data Issues Improving the performance of a model To improve the performance of a model, we mostly improve the data Source additional data Clean up the
More informationBasis Functions. Volker Tresp Summer 2017
Basis Functions Volker Tresp Summer 2017 1 Nonlinear Mappings and Nonlinear Classifiers Regression: Linearity is often a good assumption when many inputs influence the output Some natural laws are (approximately)
More informationA System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment
A System for Managing Experiments in Data Mining A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Greeshma
More informationFuzzy Partitioning with FID3.1
Fuzzy Partitioning with FID3.1 Cezary Z. Janikow Dept. of Mathematics and Computer Science University of Missouri St. Louis St. Louis, Missouri 63121 janikow@umsl.edu Maciej Fajfer Institute of Computing
More informationNominal Data. May not have a numerical representation Distance measures might not make sense PR, ANN, & ML
Decision Trees Nominal Data So far we consider patterns to be represented by feature vectors of real or integer values Easy to come up with a distance (similarity) measure by using a variety of mathematical
More informationA Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York
A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine
More informationWhat is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry
Data Mining Andrew Kusiak Intelligent Systems Laboratory 2139 Seamans Center The University it of Iowa Iowa City, IA 52242-1527 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Tel. 319-335
More informationCSE4334/5334 DATA MINING
CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy
More informationData Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification
Data Mining 3.3 Fall 2008 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rules With Exceptions Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms
More informationDecision trees. Decision trees are useful to a large degree because of their simplicity and interpretability
Decision trees A decision tree is a method for classification/regression that aims to ask a few relatively simple questions about an input and then predicts the associated output Decision trees are useful
More informationClassification and Regression Trees
Classification and Regression Trees David S. Rosenberg New York University April 3, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 April 3, 2018 1 / 51 Contents 1 Trees 2 Regression
More informationCse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University
Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before
More informationChapter 3: Data Mining:
Chapter 3: Data Mining: 3.1 What is Data Mining? Data Mining is the process of automatically discovering useful information in large repository. Why do we need Data mining? Conventional database systems
More informationCOMP 465: Data Mining Classification Basics
Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised
More informationArtificial Intelligence. Programming Styles
Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More information9. Conclusions. 9.1 Definition KDD
9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]
More information