IV. MODELS FROM DATA. MODELLING METHODS: Data mining


IV. MODELS FROM DATA: Data mining

MODELLING METHODS: Data mining

Outline
A) THEORETICAL BACKGROUND
1. Knowledge discovery in data bases (KDD)
2. Data mining: Data; Patterns; Data mining algorithms
B) PRACTICAL IMPLEMENTATIONS
3. Applications: Equations; Decision trees; Rules

Knowledge discovery in data bases (KDD)

What is KDD? Frawley et al., 1991: KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
How to find patterns in data? Data mining (DM) is the central step in the KDD process, concerned with applying computational techniques to actually find patterns in the data (15-25% of the effort of the overall KDD process). The other steps of the process include:
- step 1: preparing data for DM (data preprocessing)
- step 3: evaluating the discovered patterns (results of DM)

Knowledge discovery in data bases (KDD)

When can the patterns be treated as knowledge? Frawley et al. (1991): A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user's criteria) is called knowledge.
Condition 1: Discovered patterns should be valid on new data with some degree of certainty (typically prescribed by the user).
Condition 2: The patterns should potentially lead to some useful actions (according to user-defined utility criteria).

Knowledge discovery in data bases (KDD)

What may KDD contribute to environmental sciences (ES) (e.g. agronomy, forestry, ecology)? ES deal with complex, unpredictable natural systems (e.g. arable, forest and water ecosystems) in order to get answers to complex questions. The amount of collected environmental data is increasing exponentially. KDD was purposely designed to cope with such complex questions about complex systems, like:
- understanding the domain/system studied (e.g., gene flow, seed bank, life cycle)
- predicting future values of system variables of interest (e.g., rate of out-crossing with GM plants at location x at time y, seedbank dynamics)

Data mining (DM)

What is data mining? Data mining is the process of automatically searching large volumes of data for patterns using algorithms. Data mining vs. machine learning: data mining is the application of machine learning techniques to data analysis problems.
The most relevant notions of data mining: 1. Data, 2. Patterns, 3. Data mining algorithms.

Data mining (DM) - data

1. What is data? According to Fayyad et al. (1996): Data is a set of facts, e.g., cases in a database. Data in DM is given in a single flat table:
- rows: objects or records (examples in ML)
- columns: properties of objects (attributes, features in ML)
which is then used as input to a data mining algorithm. Example of such a table: objects (sampling points) in rows, properties of objects in columns, e.g. Distance (m), Wind direction (°), Wind speed (m/s), Out-crossing rate (%).

Data mining (DM) - pattern

2. What is a pattern? A pattern is defined as: A statement (expression) in a given language, that describes (relationships among) the facts in a subset of the given data and is (in some sense) simpler than the enumeration of all facts in the subset (Frawley et al. 1991, Fayyad et al. 1996).
Classes of patterns considered in DM (depending on the data mining task at hand):
1. equations,
2. decision trees,
3. association, classification, and regression rules

Data mining (DM) - pattern

1. Equations. To predict the value of a target (dependent) variable as a linear or non-linear combination of the input (independent) variables.
Linear equations involving:
- two variables: straight lines in a two-dimensional space
- three variables: planes in a three-dimensional space
- more variables: hyper-planes in multidimensional spaces
Nonlinear equations involving:
- two variables: curves in a two-dimensional space
- three variables: surfaces in a three-dimensional space
- more variables: hyper-surfaces in multidimensional spaces

Data mining (DM) - pattern

2. Decision trees. To predict the value of one or several target (dependent) variables (the class) from the values of the other (independent) variables (attributes) by a decision tree. A decision tree is a hierarchical structure, where:
- each internal node contains a test on an attribute,
- each branch corresponds to an outcome of the test,
- each leaf gives a prediction for the value of the class variable.

Data mining (DM) - pattern

A decision tree is called:
A classification tree: the class value in a leaf is discrete (finite set of nominal values), e.g., (yes, no), (spec. A, spec. B, ...)
A regression tree: the class value in a leaf is a constant (from an infinite set of values), e.g., 120, 220, 312, ...
A model tree: a leaf contains a linear model predicting the class value (a piece-wise linear function), e.g., out-crossing rate expressed as a linear function of distance, wind speed and wind direction.
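A decision or model tree pattern can be read as a set of nested tests. The following minimal Java sketch (the attribute names, threshold and coefficients are invented for illustration, not taken from any model in these slides) shows the three ingredients listed above: an internal node testing an attribute, branches for the test outcomes, and leaves that hold linear models predicting the out-crossing rate.

```java
// Minimal sketch of prediction with a (model) tree: an internal node tests an
// attribute, branches correspond to test outcomes, leaves hold the prediction.
// Attribute names, threshold and coefficients are made up for illustration.
public class ModelTreeSketch {

    // Leaf of a model tree: a linear model over the input attributes.
    static double leafModel(double intercept, double[] coeffs, double[] x) {
        double y = intercept;
        for (int i = 0; i < coeffs.length; i++) {
            y += coeffs[i] * x[i];
        }
        return y;
    }

    // The tree itself: one test on "distance", two leaves with different linear models.
    static double predictOutCrossingRate(double distance, double windSpeed, double windDirection) {
        double[] x = {distance, windSpeed, windDirection};
        if (distance <= 10.0) {                                          // internal node: test on an attribute
            return leafModel(12.3, new double[]{-0.4, 0.8, 0.01}, x);    // leaf 1: linear model
        } else {
            return leafModel(1.5, new double[]{-0.02, 0.1, 0.0}, x);     // leaf 2: linear model
        }
    }

    public static void main(String[] args) {
        System.out.println(predictOutCrossingRate(5.0, 3.2, 180.0));
        System.out.println(predictOutCrossingRate(50.0, 3.2, 180.0));
    }
}
```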

Data mining (DM) - pattern

3. Rules. To perform association analysis between attributes, discovered as association rules. A rule denotes a pattern of the form: IF Conjunction of conditions THEN Conclusion.
For classification rules, the conclusion assigns one of the possible discrete values to the class (finite set of nominal values): e.g., (yes, no), (spec. A, spec. B, spec. D).
For predictive rules, the conclusion gives a prediction for the value of the target (class) variable (infinite set of values): e.g., 120, 220, 312, ...

Data mining (DM) - algorithm

3. What is a data mining algorithm?
Algorithm in general: a procedure (a finite set of well-defined instructions) for accomplishing some task which will terminate in a defined end-state.
Data mining algorithm: a computational process defined by a Turing machine (Gurevich et al. 2000) for finding patterns in data.

Data mining (DM) - algorithm

What kind of algorithms do we use for discovering patterns? It depends on the goals:
1. Equations = Linear and multiple regression
2. Decision trees = Top-down induction of decision trees
3. Rules = Rule induction
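To make the IF Conjunction of conditions THEN Conclusion form concrete, the sketch below applies an ordered list of classification rules: the first rule whose conditions are satisfied supplies the class value, and a default class is returned if no rule fires. The rules, attributes and class values are invented for illustration.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of applying classification rules of the form
// IF <conjunction of conditions> THEN <class>. The rules are invented.
public class RuleListSketch {

    record Example(double distance, double windSpeed) {}
    record Rule(Predicate<Example> condition, String conclusion) {}

    static String classify(List<Rule> rules, Example e, String defaultClass) {
        for (Rule r : rules) {
            if (r.condition().test(e)) {   // conjunction of conditions satisfied
                return r.conclusion();     // conclusion assigns a discrete class value
            }
        }
        return defaultClass;               // no rule fired
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule(e -> e.distance() <= 10 && e.windSpeed() > 2, "high out-crossing"),
            new Rule(e -> e.distance() <= 25, "moderate out-crossing")
        );
        System.out.println(classify(rules, new Example(5, 3), "low out-crossing"));
        System.out.println(classify(rules, new Example(100, 3), "low out-crossing"));
    }
}
```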

Data mining (DM) - algorithm

1. Linear and multiple regression
Bivariate linear regression: the predicted variable C (the class in ML terms, which may be continuous or discrete) is expressed as a linear function of one attribute A:
C = α + β · A
Multiple regression: the predicted variable C is expressed as a linear function of a multi-dimensional attribute vector (A_i):
C = Σ_{i=1..n} β_i · A_i

Data mining (DM) - algorithm

2. Top-down induction of decision trees
A decision tree is induced by the Top-Down Induction of Decision Trees (TDIDT) algorithm (Quinlan, 1986). Tree construction proceeds recursively, starting with the entire set of training examples (the entire table). At each step, an attribute is selected as the root of the (sub)tree and the current training set is split into subsets according to the values of the selected attribute.

Data mining (DM) - algorithm

3. Rule induction
A rule that correctly classifies some examples is constructed first. The positive examples covered by the rule are removed from the training set and the process is repeated until no more examples remain.
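For the bivariate regression C = α + β · A above, the coefficients can be fitted by ordinary least squares: β = cov(A, C) / var(A) and α = mean(C) - β · mean(A). A plain-Java sketch with toy data (the attribute and class values are invented):

```java
// Ordinary least squares fit of the bivariate model C = alpha + beta * A.
// beta = cov(A, C) / var(A), alpha = mean(C) - beta * mean(A). Toy data only.
public class BivariateRegression {

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};           // attribute A (e.g., distance)
        double[] c = {2.1, 3.9, 6.2, 8.1, 9.8}; // class C (e.g., out-crossing rate)

        double meanA = mean(a), meanC = mean(c);
        double cov = 0, var = 0;
        for (int i = 0; i < a.length; i++) {
            cov += (a[i] - meanA) * (c[i] - meanC);
            var += (a[i] - meanA) * (a[i] - meanA);
        }
        double beta = cov / var;
        double alpha = meanC - beta * meanA;
        System.out.printf("C = %.3f + %.3f * A%n", alpha, beta);
    }

    static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }
}
```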

Data mining (DM) - Statistics

Data mining vs. statistics. Common to both approaches: reasoning FROM properties of a data sample TO properties of a population.

Data mining (DM) - Statistics

Statistics: hypothesis testing, when certain theoretical expectations about the data distribution, independence, random sampling, sample size, etc. are satisfied. Main approach: best fitting all the available data.
Data mining: automated construction of understandable patterns and structured models. Main approach: structuring the data space, heuristic search for decision trees and rules covering (parts of) the data space.

DATA MINING CASE STUDIES

Practical implementations

Each class of the described patterns is illustrated with examples of applications:
1. Equations: Algebraic equations; Differential equations
2. Decision trees: Classification trees; Regression trees; Model trees
3. Predictive rules

Applications - Difference equations; Algebraic equations: CIPER

Applications - Algebraic equations: Materials and methods
Measured radial increments: 8 trees, 69 years old.
Hydrological conditions (HMS Lendava; monthly data on minimal, average and maximum values): Ledava River levels, groundwater levels.
Management data (thinning; m3/y removed from the stand; Forestry Unit Lendava).
Dataset - Meteorological conditions (monthly data, HMS Lendava):
- time of solar radiation (h), precipitation (mm), ET (mm)
- number of days with white frost, number of days with snow
- T: max, average, min
- cumulative T > 0 °C, > 5 °C and > 10 °C
- number of days with: minT > 0 °C, minT < -10 °C, minT < -4 °C, minT > 25 °C, maxT > 10 °C, maxT > 25 °C
Monthly data + aggregated data (AMJ, MJJ, JJA, MJJA etc.)
Σ: 333 attributes; 35 years

Applications - Algebraic equations

52 different combinations of attributes were tested; Σ: 124 models. Results table: RRSE and number of equation elements per experiment (jnj3_2m, jnj3_3s, jnj3_1s, jnj3_4m, jnj2_2, jly_4xl).

Applications - Algebraic equations

Model jnj3_2m: RadialGrowthIncrement expressed as a polynomial in the attributes minL8-10, maxL8-10, minL4-7, t-sun4-7, t-sun8-10 and d-wf-4-7. Reported: the Relative Root Squared Error and the correlation (linear regression R2) between average measured (r-aver8) and modelled increments.

Applications - Algebraic equations

Model jnj3_2m (figure)
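The RRSE reported for these models is the relative root squared error: the root squared error of the model divided by the root squared error of always predicting the mean of the target, so values well below 1 indicate a model that clearly beats the mean predictor. A minimal sketch of this standard definition (toy numbers only):

```java
// Relative Root Squared Error: root squared error of the model divided by the
// root squared error of a predictor that always returns the mean of the actuals.
public class Rrse {

    static double rrse(double[] actual, double[] predicted) {
        double mean = 0;
        for (double y : actual) mean += y;
        mean /= actual.length;

        double modelSq = 0, meanSq = 0;
        for (int i = 0; i < actual.length; i++) {
            modelSq += Math.pow(predicted[i] - actual[i], 2);
            meanSq  += Math.pow(mean - actual[i], 2);
        }
        return Math.sqrt(modelSq / meanSq);
    }

    public static void main(String[] args) {
        double[] actual    = {1.0, 2.0, 3.0, 4.0};
        double[] predicted = {1.1, 1.9, 3.2, 3.8};
        System.out.println(rrse(actual, predicted));  // well below 1 for a good model
    }
}
```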

Applications - Algebraic equations

Algebraic equations: Lagramge

Applications - Algebraic equations

Data sources: Federal Biological Research Centre (BBA), Braunschweig, Germany (2000, 2001); Slovenian Agricultural Institute (KIS), Slovenia (2006).
Plants involved:
BBA: transgenic maize (var. Acrobat, glufosinate-tolerant line) - donor; non-transgenic maize field (var. Anjou) - receptor.
KIS: yellow-kernel variety of maize (hybrid Bc462, simulating a transgenic maize variety) - donor; white-kernel variety of maize (variety Bc38W, simulating a non-GM variety) - receptor.

Applications - Algebraic equations

Experiment design: 96 sampling points, 60 cobs, 2500 kernels. Field design 2000 (figure): transgenic field (donor) and non-transgenic field (recipient), with access paths, sampling points, a system of coordinates for the sampling points, the direction of drilling of the GM maize, field dimensions in metres, and % of outcrossing measured at the receptor (non-transgenic) plants.

Applications - Algebraic equations: outcrossing rate

Selected attributes:
- % of outcrossing
- cardinal direction of the sampling point from the center of the donor field
- visual angle between the sampling point and the donor field
- distance from the sampling point to the center of the donor field
- the shortest distance between the sampling point and the donor field
- % of appropriate wind direction (exposure time)
- length of the wind ventilation route
- wind velocity

Applications - Algebraic equations (figures)

Applications - Algebraic equations (model results, figures)

Applications - Differential equations: Lagramge

Applications - Differential equations (figures)

Applications - Differential equations

Phosphorus: water in-flow, out-flow, respiration, growth

Applications - Differential equations

Phytoplankton: growth, respiration, sedimentation, grazing

Applications - Differential equations

Zooplankton: feeds on phytoplankton; respiration, mortality

Applications - Differential equations

Applications - Classification trees: habitat models
Classification trees: J48

Applications - Classification trees: habitat models
Observed locations of brown bears (BBs)

Applications - Classification trees: habitat models

The training dataset
Positive examples:
- locations of bear sightings (Hunting association; telemetry)
- females only
- using home-range (HR) areas instead of raw locations
- narrower HR for optimal habitat, wider for maximal
Negative examples:
- sampled from the unsuitable part of the study area
- stratified random sampling
- different land cover types equally accounted for

Applications - Classification trees: habitat models

Dataset: a flat table of comma-separated attribute values, one row per example (e.g. 1,73,26,0,0,1,88,0,2,70,7,20,...), with the class encoded as Present: 1 / Absent: 0.

Applications - Classification trees: habitat models

The model for optimal habitat (figure)
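The negative (pseudo-absence) examples described above come from stratified random sampling over land-cover types. The sketch below illustrates the idea with invented land-cover classes and coordinates: the same number of points is drawn from each stratum of the unsuitable area, so the land-cover types are equally represented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of stratified random sampling of negative (pseudo-absence) examples:
// the same number of points is drawn from each land-cover stratum. Data invented.
public class StratifiedNegativeSampling {

    static List<int[]> sample(Map<String, List<int[]>> pointsByLandCover,
                              int perStratum, long seed) {
        Random rnd = new Random(seed);
        List<int[]> negatives = new ArrayList<>();
        for (List<int[]> stratum : pointsByLandCover.values()) {
            List<int[]> copy = new ArrayList<>(stratum);   // copy before shuffling
            Collections.shuffle(copy, rnd);
            negatives.addAll(copy.subList(0, Math.min(perStratum, copy.size())));
        }
        return negatives;
    }

    public static void main(String[] args) {
        Map<String, List<int[]>> unsuitable = Map.of(
            "arable",  List.of(new int[]{1, 1}, new int[]{1, 2}, new int[]{2, 2}),
            "urban",   List.of(new int[]{9, 9}, new int[]{9, 8}),
            "wetland", List.of(new int[]{5, 5}, new int[]{5, 6}, new int[]{6, 6})
        );
        System.out.println(sample(unsuitable, 2, 42).size());  // 2 per land-cover type
    }
}
```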

Applications - Classification trees: habitat models
The model for maximal habitat (figure)

Applications - Classification trees: habitat models
Map of optimal habitat (13% of Slovenian territory)

Applications - Classification trees: habitat models
Map of maximal habitat (39% of Slovenian territory)

Applications - Multi-target classification: outcrossing rate

Multi-target classification model (Clus): Modelling pollen dispersal of genetically modified oilseed rape within the field. Marko Debeljak, Claire Lavigne, Sašo Džeroski, Damjan Demšar. In: 90th ESA Annual Meeting [jointly with the] IX International Congress of Ecology, August 7-12, 2005, Montréal, Canada. Abstracts. [S.l.]: ESA, 2005, p.

Applications - Multi-target classification: outcrossing rate

Experiment design: field for receptors (90 × 90 m), 3 × 3 m grid = 841 nodes. Donors: MF transgenic oilseed rape B004.oxy (10 × 10 m). The field was planted with MF oilseed rape; seeds of MS oilseed rape FU58B004 were planted at each node. Measured: % MS outcrossing and % MF outcrossing.

Applications - Multi-target classification: outcrossing rate

Selected attributes for modelling:
- rate of outcrossing of MS and MF receptor plants [rate per 1000]
- cardinal direction of the sampling point from the center of the donor field [rad]
- visual angle between the sampling point and the donor field [rad]
- distance from the sampling point to the center of the donor field [m]
- the shortest distance between the sampling point and the donor field [m]
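A multi-target (multi-objective) tree predicts both target variables, here the MS and MF out-crossing rates, with a single model: every leaf stores one prototype value per target instead of a single prediction. A hand-coded sketch with an invented split and invented leaf values, just to show the structure:

```java
// Sketch of a multi-target tree: one tree, two targets (MS and MF out-crossing
// rates). Each leaf holds a prototype (e.g., a mean) for every target at once.
// The split and the leaf values are invented for illustration.
public class MultiTargetTreeSketch {

    record Prediction(double msRate, double mfRate) {}

    static Prediction predict(double distanceToDonor, double visualAngle) {
        if (distanceToDonor <= 15.0) {                 // single test, shared by both targets
            if (visualAngle > 1.0) {
                return new Prediction(38.0, 4.1);      // leaf prototype: [MS, MF]
            }
            return new Prediction(21.0, 2.5);
        }
        return new Prediction(3.0, 0.4);
    }

    public static void main(String[] args) {
        Prediction p = predict(8.0, 1.3);
        System.out.println("MS: " + p.msRate() + "  MF: " + p.mfRate());
    }
}
```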

Applications - Multi-target classification: outcrossing rate

Number of examples: 817. Correlation coefficients for MF and MS (figure).

Applications - Multi-target regression model: soil resilience

Applications - Multi-target regression trees: soil resilience

The dataset: soil samples taken at 26 locations throughout Scotland. The flat table of data: 26 by 18 data entries.

Applications - Multi-target regression trees: soil resilience

The dataset:
- physical properties: soil texture (sand, silt, clay)
- chemical properties: pH, C, N, SOM (soil organic matter)
- FAO soil classification: Order and Suborder
- physical resilience: resistance to compression (1/Cc), recovery from compression (Ce/Cc), overburden stress (eg), recovery from overburden stress after two-day cycles (eg2dc)
- biological resilience: heat, copper

Applications - Multi-target regression trees: soil resilience

Different scenarios and multi-target regression models have been constructed, e.g. a model predicting the resistance and resilience of soils to copper perturbation.

Applications - Multi-target regression trees: soil resilience

The increasing importance of mapping soil functions to advise on land use and environmental management: to make a map of soil resilience for Scotland. The models serve as filters for existing GIS datasets about physical and chemical properties of Scottish soils.

21 Applications Multi-target regression trees: soil resilience Macaulay Institute (Aberdeen): soils data attributes and maps: Approximately soil profiles held in database Descriptions of over soil horizons 61 Application 62 Application 63 21

Application (figure)

USING RELATIONAL DECISION TREES TO MODEL FLEXIBLE CO-EXISTENCE MEASURES IN A MULTI-FIELD SETTING
Marko Debeljak 1, Aneta Trajanov 1, Daniela Stojanova 1, Florence Leprince 2, Sašo Džeroski 1
1: Jožef Stefan Institute, Ljubljana, Slovenia
2: ARVALIS-Institut du végétal, Montardon, France

Introduction
Initial questions: To what extent will GM maize grown on Greens genetically interfere with the maize on Yellows? Will this interference be small enough to allow co-existence?
Field map with maize varieties, sowing dates and field sizes (e.g. DKC6041 YG sown on 26/04, 8 ha; Pr34N44 YG sown on 20/04, 24 rows; Pr33A46).
Marko Debeljak, WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, April 2007

Relational data preprocessing: GIS
Marko Debeljak, WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, April 2007

Relational data mining results
Out-crossing rate: threshold 0.9 %
Marko Debeljak, WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, April 2007

Data

130 sites, monitored every 7 to 14 days for 5 months (2665 samples: 1322 conventional and 1333 HT OSR observations). Each sample (observation) is described with 65 attributes. Original data collected by the Centre for Ecology and Hydrology, Rothamsted Research and SCRI within the Farm Scale Evaluation Program (2000, 2001, 2002).
Marko Debeljak, ISEM '09, Quebec, Canada, 6-9 October 2009

Results scenario B: Multiple target regression tree
Target: Avg Crop Covers, Avg Weed Covers. Excluded attributes: /. Constraints: MinimalInstances = 64.0; MaxSize = 15. Predictive power reported as Corr.Coef., RMSE and RRMSE.
Marko Debeljak, ISEM '09, Quebec, Canada, 6-9 October 2009

Results scenario D: Constraint predictive clustering trees for time series, including TS clusters for crop (CLUS); syntactic constraint
Marko Debeljak, ISEM '09, Quebec, Canada, 6-9 October 2009

Results scenario D: Constraint predictive clustering trees for time series, including TS clusters for crop (CLUS)
Target: Avg Weed Covers (time series). Scenario 3.9. Constraints: syntactic, MinInstances = 32. Predictive power: TSRMSExval: 4.98, TSRMSEtrain: 4.86, ICVtrain:
Marko Debeljak, ISEM '09, Quebec, Canada, 6-9 October 2009

Results scenario D: Constraint predictive clustering trees for time series, including TS clusters for crop (CLUS) (figures)
Marko Debeljak, ISEM '09, Quebec, Canada, 6-9 October 2009

Applications - Rules (figures)

Applications - Rules
The simulations were run with the first GENESYS version (published 2001, evaluated 2005, studied in sensitivity analyses in 2004 and 2005). Only one field plan was used, maximising pollen and seed dispersal.

Applications - Rules

Large-risk field pattern (figure)

Applications - Rules

Variables describing the simulations:
- simulation number
- genetic variables
- for each field (1 to 35), the cultivation techniques of years -3, -2, -1, 0
- for each field (1 to 35), the number of years since the last GM oilseed rape crop
- the number of years since the last non-GM oilseed rape crop
- proportion of GM seeds in non-GM oilseed rape of field 14 at year 0
TOTAL NUMBER OF VARIABLES:

Applications - Rules

Run of the experiment: each simulation started with an empty seed bank and lasted 25 years, but only the last 4 years were kept in the files for data mining. TOTAL NUMBER of simulations on the field pattern without the borders:

Applications - Rules

Non-aggregated data: CUBIST. Use 60% of data for training; each rule must cover >= 1% of cases; maximum of 10 rules.

Rule 1: [29499 cases, mean, range 0 to, est err]
if SowingDateY0F14 > 258
then PropGMinField14 =

Rule 2: [12726 cases, mean, range e-07 to, est err]
if SowingDateY0F14 > 258 and SowingDateY0F14 <= 277
then PropGMinField14 = YearsSinceLastGMcrop_F14

Applications - Rules

Rule 4: [22830 cases, mean, range 0 to, est err]
if SowingDateY0F14 <= 258 and YearsSinceLastGMcrop_F14 > 2
then PropGMinField14 = SowingDateY0F14 + SowingDensityY0F14

Applications - Rules

Rule 10: [1911 cases, mean, est err]
if TillageSoilBedPrepY0F14 in {0, 2} and SowingDateY0F14 <= 258 and SowingDensityY0F14 <= 55 and YearsSinceLastGMcrop_F14 <= 2
then PropGMinField14 = a long linear model in YearsSinceLastGMcrop_F14, SowingDateY0F14, SowingDensityY0F14, sowing dates, sowing densities and harvest losses of earlier years (Y-1, Y-2, Y-3) in several fields, cutting variables (1cutting, 2cutting) and herbicide-efficiency variables (EfficHerb, EfficHerbGMvolun), with coefficients on the order of 1e-4 to 1e-5
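CUBIST-style predictive rules like the ones above pair a condition with a linear model. The sketch below applies such a rule set by averaging the linear-model predictions of all rules whose conditions a case satisfies; this is a simplification of what CUBIST does, and the rules and coefficients are invented (only the attribute names follow the ones above).

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

// Sketch of applying CUBIST-style predictive rules: each rule has a condition
// and a linear model; the prediction for a case is the average over the linear
// models of all rules that cover it. Rules and coefficients are invented.
public class CubistStyleRules {

    record Rule(Predicate<Map<String, Double>> condition,
                ToDoubleFunction<Map<String, Double>> linearModel) {}

    static double predict(List<Rule> rules, Map<String, Double> x) {
        double sum = 0;
        int covering = 0;
        for (Rule r : rules) {
            if (r.condition().test(x)) {                 // rule covers this case
                sum += r.linearModel().applyAsDouble(x); // its linear model contributes
                covering++;
            }
        }
        return covering == 0 ? Double.NaN : sum / covering;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule(x -> x.get("SowingDateY0F14") > 258,
                     x -> 0.002),
            new Rule(x -> x.get("SowingDateY0F14") <= 258 && x.get("YearsSinceLastGMcrop_F14") > 2,
                     x -> 0.01 - 2.0e-5 * x.get("SowingDateY0F14") + 1.0e-4 * x.get("SowingDensityY0F14"))
        );
        Map<String, Double> field = Map.of(
            "SowingDateY0F14", 250.0, "YearsSinceLastGMcrop_F14", 3.0, "SowingDensityY0F14", 55.0);
        System.out.println(predict(rules, field));
    }
}
```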

Applications - Rules

Non-aggregated data: CUBIST. Options: use 60% of data for training; each rule must cover >= 1% of cases; maximum of 10 rules. Target attribute: 'PropGMinField14'.
Evaluation on training data (60000 cases): average error, relative error 0.47, correlation coefficient 0.77.
Evaluation on test data (40000 cases): average error, relative error 0.49, correlation coefficient.

Conclusions

What can data mining do for you? Knowledge discovered by analyzing data with DM techniques can help:
- understand the domain studied
- make predictions/classifications
- support decision processes in environmental management

Conclusions

What can data mining not do for you?
The law of information conservation (garbage in, garbage out): the knowledge we are seeking to discover has to come from the combination of data and background knowledge. If we have very little data of very low quality and no background knowledge, no form of data analysis will help.

Conclusions

Side-effects? Discovering problems with the data during analysis: missing values, erroneous values, inappropriately measured variables. Identifying new opportunities: new problems to be addressed, recommendations on what data to collect and how.

DATA MINING - Hands-on exercises
1. Data preprocessing

DATA MINING - data preprocessing

DATA FORMAT
File extension: .arff. This is a plain-text format; files should be edited with editors such as Notepad, TextPad or WordPad (that do not add extra formatting information).
The file consists of: @relation NameOfDataset, a list of @attribute AttName AttType declarations (AttType can be numeric, or nominal, i.e. a list of categorical values such as {red, green, blue}), each in a separate line, followed by @data and the actual data in comma-separated value (CSV) format.

DATA MINING - data preprocessing

DATA
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes

DATA MINING - data preprocessing

Excel
- attributes (variables) in columns and cases in lines
- use a decimal POINT and not a decimal COMMA for numbers
- save the Excel sheet as a CSV file

DATA MINING - data preprocessing

TextPad, Notepad
- open the CSV file
- delete the extra characters at the beginning of lines and save (just save, don't change the format)
- change all ';' to ','
- numbers must have a decimal dot (.) and not a decimal comma (,)
- save the file as a CSV file (don't change the format)
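Instead of editing the CSV export by hand, the conversion to ARFF can also be done with WEKA's converter classes. A sketch with placeholder file names; setFieldSeparator is assumed to be available (it is in recent WEKA releases) for semicolon-separated exports.

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

// Sketch: load a CSV export (e.g., from Excel) and save it as ARFF using WEKA.
// File names are placeholders.
public class CsvToArff {

    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setFieldSeparator(";");                // if the export uses ';' instead of ','
        loader.setSource(new File("outcrossing.csv"));
        Instances data = loader.getDataSet();         // attribute types are inferred

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("outcrossing.arff"));
        saver.writeBatch();
    }
}
```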

DATA MINING - Hands-on exercises
2. Data mining

DATA MINING - WEKA
- open the CSV file in WEKA
- select the algorithm and the attributes
- perform data mining

How to select the best classification tree?
Performance of the classification tree:
- classification accuracy: (correctly classified examples) / (all examples)
- true positive rate
- false positive rate
- confusion matrix: a matrix showing actual and predicted classifications
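The same steps (load the data, pick an algorithm, inspect accuracy, TP/FP rates and the confusion matrix) can also be scripted against the WEKA Java API instead of the Explorer. A sketch with a placeholder file name, evaluating a J48 tree on the data it was built from:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: build a J48 classification tree in WEKA and inspect its performance
// on the training data (accuracy, TP/FP rate, confusion matrix).
public class J48Sketch {

    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        J48 tree = new J48();                           // pruned tree, default options
        tree.buildClassifier(data);
        System.out.println(tree);                       // the tree itself, with leaf counts

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);                 // training-set evaluation only
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
        System.out.println("TP rate (class 0): " + eval.truePositiveRate(0));
        System.out.println("FP rate (class 0): " + eval.falsePositiveRate(0));
        System.out.println(eval.toMatrixString("Confusion matrix"));
    }
}
```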

How to select the best classification tree?

Classification trees: J48
Interpretable size: pruned or unpruned; minimal number of objects per leaf.
Each leaf is annotated with the number of instances that reach it and the number of those instances that are incorrectly classified. This can appear as:
- (13): no incorrectly classified instances
- (3.5/0.5): fractional counts due to missing values (?), where instances are split into fractions
- (0.0): a split on a nominal attribute where one or more of the values do not occur in the subset of instances at the node in question

How to select the best regression / model tree?
Performance of the regression / model tree (figure).

How to select the best regression / model tree?
The interpretable size: pruned or unpruned; minimal number of objects per leaf.
Each leaf reports the number of instances that REACH this leaf and the root of the mean squared error (RMSE) of the predictions from the leaf's linear model for the instances that reach the leaf, expressed as a percentage of the global standard deviation of the class attribute (i.e. the standard deviation of the class attribute computed from all the training data). These percentages do not sum to 100%. The smaller this value, the better.
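For regression and model trees the corresponding WEKA classes report exactly these kinds of measures. A sketch (placeholder file name; the M5P setting shown is an assumption, not the configuration used in the case studies) that builds an M5P model tree and prints the correlation coefficient, RMSE and relative squared error on the training data:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: build an M5P model tree for a numeric class attribute and report the
// usual regression measures on the training data. File name is a placeholder.
public class M5PSketch {

    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("radial_increment.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // numeric class = last attribute

        M5P tree = new M5P();
        tree.setMinNumInstances(4);                     // minimal number of objects per leaf
        tree.buildClassifier(data);
        System.out.println(tree);                       // model tree with linear models in the leaves

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);                 // training-data estimates only
        System.out.println("Correlation coefficient: " + eval.correlationCoefficient());
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
        System.out.println("Relative squared error: " + eval.rootRelativeSquaredError() + " %");
    }
}
```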

Accuracy and error

Avoid overfitting the data by tree pruning. Pruned trees are:
- less accurate (percentage of correct classifications) on the training data
- more accurate when classifying unseen data

How to prune optimally?
Pre-pruning: stop growing the tree, e.g., when a data split is not statistically significant or too few examples are in a split (minimum number of objects in a leaf).
Post-pruning: grow the full tree, then post-prune (confidence factor for classification trees).

Optimal accuracy
10-fold cross-validation is a standard classifier evaluation method used in machine learning:
- break the data into 10 sets of size n/10
- train on 9 of the sets and test on 1
- repeat 10 times and take the mean accuracy
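In WEKA this whole procedure is a single call on an Evaluation object. A sketch with a placeholder file name, cross-validating a pruned J48 tree:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: 10-fold cross-validation of a J48 tree -- the data are split into 10
// folds, the tree is trained on 9 and tested on the remaining one, 10 times,
// and the accuracies are averaged. File name is a placeholder.
public class CrossValidationSketch {

    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("habitat.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f);   // post-pruning confidence factor
        tree.setMinNumObj(2);              // minimum number of objects per leaf

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println("Cross-validated accuracy: " + eval.pctCorrect() + " %");
        System.out.println(eval.toSummaryString());
    }
}
```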
