Homework: Data Mining


This homework sheet will test your knowledge of data mining using R.

a) Load the file titanic.csv into R as follows. This dataset provides information on the survival of the passengers of the Titanic.

titanic <- as.data.frame(read.csv("titanic.csv", header = TRUE, sep = ","))
titanic$survived <- as.factor(titanic$survived)

# Remove NA observations
titanic <- na.omit(titanic[, c("survived", "pclass", "sex", "age",
                               "sibsp", "parch", "fare", "embarked")])

# Number of observations
nrow(titanic)
## [1] 1045

# List of column names
colnames(titanic)
## [1] "survived" "pclass"   "sex"      "age"      "sibsp"    "parch"
## [7] "fare"     "embarked"

# View first rows of the dataset
head(titanic)

Column Name  Interpretation
survived     Survival (0 = No; 1 = Yes)
pclass       Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
sex          Sex
age          Age
sibsp        Number of Siblings/Spouses Aboard
parch        Number of Parents/Children Aboard
fare         Passenger Fare
embarked     Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
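Before modeling, it is worth verifying that each column was imported with the intended type; a quick check (a sketch, output omitted):

# Check variable types: survived should be a factor, age and fare numeric
str(titanic)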

b) Use the k-nearest neighbor method with k = 3 to predict whether a 35-year-old person from 1st class survived.

library(class)
train <- as.data.frame(cbind(titanic$age, titanic$pclass))
knn(train, c(35, 1), titanic$survived, k = 3)
## [1] 1
## Levels: 0 1
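Note that knn() computes plain Euclidean distances, so age (spanning decades) dominates pclass (ranging only from 1 to 3). A variant with standardized features (a sketch; this normalization step is an addition, not part of the original sheet):

# Standardize both features, then apply the same scaling to the query point
train.scaled <- scale(train)
query <- c(35, 1)
query.scaled <- (query - attr(train.scaled, "scaled:center")) /
  attr(train.scaled, "scaled:scale")
knn(train.scaled, query.scaled, titanic$survived, k = 3)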

c) Build a decision tree that shows which people on the Titanic survived. Use only the variables pclass, age, sex, fare, sibsp, parch and embarked. Plot and interpret your result.

library(rpart)
library(party)
library(partykit)
dt <- rpart(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
            method = "class", data = titanic)
plot(as.party(dt))

[Decision tree plot: the root splits on sex; further splits use age, pclass, sibsp, fare and embarked, with survival proportions shown in the leaf nodes.]

Generally speaking, the strongest distinguishing power comes from the root node. Thus, sex as the root node determines survival to a large extent. The next nodes are age and pclass, respectively. Finally, one can see that females traveling in 1st/2nd class have a very high survival rate.

d) Prune the decision tree. Plot the pruned tree and interpret your result. Use R to calculate the number of nodes in the pruned tree.

# Optimal size according to the cross-validation error
which.min(dt$cptable[, "xerror"])
## 5
## 5
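The cross-validation errors underlying this choice come from the complexity table of the fitted tree, which can also be printed directly (a sketch, output omitted):

# Complexity parameter table: one row per subtree size, with the
# cross-validated relative error in the xerror column
printcp(dt)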

pruned <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])
plot(as.party(pruned))

[Pruned tree plot: the root again splits on sex, followed by splits on age, pclass, sibsp, fare and embarked.]

Sex remains the most distinguishing feature. Older males (above 9.5 years of age) have a relatively low chance of surviving, in contrast to women from 1st/2nd class.
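One way to let R count the nodes of the pruned tree (a sketch; the frame of an rpart object contains one row per node, leaves included):

# Number of nodes in the pruned tree (internal nodes + leaves)
nrow(pruned$frame)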

e) Does the decision tree support the hypothesis that women and children were saved first? Why is it difficult to analyze this hypothesis with OLS?

Looking at the positions of age and sex in the tree, we can see that older males have, overall, a very low probability of survival. As we clearly have a non-linear effect, this hypothesis is difficult to verify with a linear model. As a remedy, one has to use e.g. interaction terms.

f) Split the data into two subsets for training (20 %) and testing (80 %). Recalculate the above pruned decision tree and measure the prediction performance using the confusion matrix, as well as accuracy, precision, sensitivity and the F1 score.

intrain <- runif(nrow(titanic)) < 0.2
dt <- rpart(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
            method = "class", data = titanic[intrain, ])
pruned <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])
pred <- predict(pruned, titanic[!intrain, ], type = "class")

# Confusion matrix
cm <- table(pred = pred, true = titanic$survived[!intrain])
cm

# Accuracy
sum(diag(cm)) / sum(cm)
## [1] 0.793

# Precision
cm[1, 1] / (cm[1, 1] + cm[1, 2])
## [1] 0.7624

# Sensitivity
cm[1, 1] / (cm[1, 1] + cm[2, 1])
## [1] 0.945

# F1 score
2 * cm[1, 1] / (2 * cm[1, 1] + cm[1, 2] + cm[2, 1])
## [1] 0.8439
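The same four metrics are recomputed for every model below; a sketch of a small helper function, assuming the 2x2 confusion-matrix layout used throughout with the first row/column as the positive class:

# Compute accuracy, precision, sensitivity and F1 from a 2x2 confusion matrix
evaluate <- function(cm) {
  accuracy    <- sum(diag(cm)) / sum(cm)
  precision   <- cm[1, 1] / (cm[1, 1] + cm[1, 2])
  sensitivity <- cm[1, 1] / (cm[1, 1] + cm[2, 1])
  f1          <- 2 * precision * sensitivity / (precision + sensitivity)
  c(accuracy = accuracy, precision = precision,
    sensitivity = sensitivity, f1 = f1)
}

evaluate(cm)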

g) Split the data into two subsets for training (20 %) and testing (80 %). Train a neural network with 10 nodes in the hidden layer. Use only the variables pclass, age, sex, fare, sibsp, parch and embarked. Calculate the confusion matrix, as well as accuracy, precision, sensitivity and the F1 score.

library(nnet)
intrain <- runif(nrow(titanic)) < 0.2
ann <- nnet(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
            data = titanic[intrain, ], size = 10, maxit = 100, rang = 0.1,
            decay = 5e-4)
## # weights: 111
## (training trace omitted)
## stopped after 100 iterations

pred <- predict(ann, titanic[!intrain, ], type = "class")

# Confusion matrix
cm <- table(pred = pred, true = titanic$survived[!intrain])
cm

# Accuracy
sum(diag(cm)) / sum(cm)
## [1] 0.7739

# Precision
cm[1, 1] / (cm[1, 1] + cm[1, 2])
## [1] 0.7683

# Sensitivity
cm[1, 1] / (cm[1, 1] + cm[2, 1])
## [1] 0.885

# F1 score
2 * cm[1, 1] / (2 * cm[1, 1] + cm[1, 2] + cm[2, 1])
## [1] 0.8226
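Both the random split via runif() and the random initial weights of nnet() make this result change from run to run. Fixing a seed beforehand makes it reproducible (a sketch; the seed value is arbitrary):

# Fix the RNG state so the split and the initial weights are reproducible
set.seed(42)
intrain <- runif(nrow(titanic)) < 0.2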

h) Now train a neural network with 50 nodes in the hidden layer and a maximum of 1000 iterations (maxit = 1000). Compare the performance to the previous neural network. What is the cause of the drop in prediction performance, even though we increased both the number of neurons and the training effort?

ann2 <- nnet(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
             data = titanic[intrain, ], size = 50, maxit = 1000, rang = 0.1,
             decay = 5e-4)
## # weights: 551
## (training trace omitted)
## stopped after 1000 iterations

pred <- predict(ann2, titanic[!intrain, ], type = "class")

# Confusion matrix
cm <- table(pred = pred, true = titanic$survived[!intrain])
cm

# Accuracy
sum(diag(cm)) / sum(cm)
## [1] 0.7577

# Precision
cm[1, 1] / (cm[1, 1] + cm[1, 2])
## [1] 0.772

# Sensitivity
cm[1, 1] / (cm[1, 1] + cm[2, 1])
## [1] 0.8382

# F1 score
2 * cm[1, 1] / (2 * cm[1, 1] + cm[1, 2] + cm[2, 1])
## [1] 0.837

Both accuracy and precision decrease. A possible cause of this behavior is overfitting: with only 7 input variables, a total of 50 neurons in the hidden layer might be too many.
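To substantiate the overfitting explanation, one can compare in-sample and out-of-sample accuracy; a large gap between the two is the typical symptom (a sketch, reusing the objects from above):

# Accuracy on the training data (in-sample)
pred.train <- predict(ann2, titanic[intrain, ], type = "class")
mean(pred.train == titanic$survived[intrain])

# Accuracy on the test data (out-of-sample), for comparison
pred.test <- predict(ann2, titanic[!intrain, ], type = "class")
mean(pred.test == titanic$survived[!intrain])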

i) Split the data into two subsets for training (20 %) and testing (80 %). Train a support vector machine and predict the survival. Use only the variables pclass, age, sex, fare, sibsp, parch and embarked. Calculate the confusion matrix, as well as accuracy, precision, sensitivity and the F1 score.

library(e1071)
intrain <- runif(nrow(titanic)) < 0.2
model.svm <- svm(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
                 data = titanic[intrain, ], type = "C-classification")
pred <- predict(model.svm, titanic[!intrain, ])

# Confusion matrix
cm <- table(pred = pred, true = titanic$survived[!intrain])
cm

# Accuracy
sum(diag(cm)) / sum(cm)
## [1] 0.7893

# Precision
cm[1, 1] / (cm[1, 1] + cm[1, 2])
## [1] 0.7884

# Sensitivity
cm[1, 1] / (cm[1, 1] + cm[2, 1])
## [1] 0.883

# F1 score
2 * cm[1, 1] / (2 * cm[1, 1] + cm[1, 2] + cm[2, 1])
## [1] 0.838
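The svm() call above relies on the default radial kernel with default cost and gamma. These hyperparameters can be tuned by a cross-validated grid search, e.g. with tune.svm() from e1071 (a sketch; the grid values below are illustrative, not from the original sheet):

# Grid search over gamma and cost with 10-fold cross-validation
tuned <- tune.svm(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
                  data = titanic[intrain, ],
                  gamma = 10^(-3:0), cost = 10^(0:2))
tuned$best.parameters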

j) Which of the above machine learning approaches has the best performance?

Model                               Accuracy  Precision  Sensitivity  F1
Decision Tree (Pruned)              0.793     0.7624     0.945        0.8439
Neural Network (10 Hidden Neurons)  0.7739    0.7683     0.885        0.8226
Support Vector Machine              0.7893    0.7884     0.883        0.838

In the given setting, one achieves the highest accuracy with decision trees, while support vector machines have the highest precision rate. The best trade-off between precision and sensitivity originates from using the SVM.

k) Plot the receiver operating characteristic (ROC) curve for the trained SVM.

library(pROC)
pred <- predict(model.svm, titanic[!intrain, ], decision.values = TRUE)
dv <- attributes(pred)$decision.values
plot.roc(as.numeric(titanic$survived[!intrain]), dv, xlab = "1 - Specificity")

[ROC curve: sensitivity plotted against 1 - specificity.]

## Area under the curve: 0.84
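plot.roc() reports the AUC when the fitted object is printed; the same value can also be computed directly (a sketch):

# Build the ROC object explicitly and extract the area under the curve
roc.svm <- roc(as.numeric(titanic$survived[!intrain]), as.numeric(dv))
auc(roc.svm)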

l) Perform a k-means clustering to determine the origin of wines. Use k = 3 means with n = 10 random initializations. What is the within-cluster sum of squares (WCSS) value? As input data we use the dataset wines that is included in the kohonen package.

library(kohonen)
data(wines)

The dataset contains 177 rows and thirteen columns; the object vintages contains the class labels. For compatibility with older versions of the package, the variable wine.classes is retained too. This data is the result of chemical analyses of wines grown in the same region of Italy (Piedmont) but derived from three different cultivars: Nebbiolo, Barbera and Grignolino grapes. The wine from the Nebbiolo grape is called Barolo. The data contains the quantities of several constituents found in each of the three types of wines, as well as some spectroscopic variables.

head(wines)
## (first rows omitted; the thirteen columns are alcohol, malic acid, ash,
##  ash alkalinity, magnesium, tot. phenols, flavonoids, non-flav. phenols,
##  proanth, col. int., col. hue, OD ratio and proline)

km <- kmeans(wines, 3, nstart = 10)
km
## K-means clustering with 3 clusters of sizes 62, 46, 69
##
## Cluster means and clustering vector: (numeric output omitted)
##
## Within cluster sum of squares by cluster: (values omitted)
## (between_SS / total_SS = 86.5 %)
##
## Available components:
## [1] "cluster"      "centers"      "totss"        "withinss"
## [5] "tot.withinss" "betweenss"    "size"

The WCSS value is obtained from the tot.withinss component:

# Total within-cluster sum of squares (WCSS)
km$tot.withinss

m) Use the above result from the clustering procedure to plot the data points and clusters in a two-dimensional plot showing only the dimensions alcohol and ash.

plot(wines[, "alcohol"], wines[, "ash"], col = km$cluster)
points(km$centers[, "alcohol"], km$centers[, "ash"],
       col = 1:nrow(km$centers), pch = 8)

[Scatter plot of ash against alcohol, with points colored by cluster and the cluster centers marked by stars.]
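Note that the wine variables live on very different scales (proline runs into the hundreds and thousands, while the color hue is around one), so the unscaled Euclidean distances are dominated by the large-valued columns. A common variant, not used in the solution above, is to standardize the columns first and compare the resulting clusters with the true origins (a sketch):

# k-means on standardized columns; table() contrasts clusters with vintages
km.scaled <- kmeans(scale(wines), 3, nstart = 10)
table(cluster = km.scaled$cluster, vintages)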

n) What is the recommended number of clusters according to the elbow plot?

ev <- c()
for (i in 1:15) {
  km <- kmeans(wines, i, nstart = 10)
  ev[i] <- km$betweenss / km$totss
}
plot(1:15, ev, type = "l", xlab = "Number of Clusters", ylab = "Explained Variance")

[Elbow plot: explained variance against the number of clusters.]

The recommended number seems to be 3 clusters, matching the data description, which names three origins.
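Equivalently, the elbow can be read from the within-cluster sum of squares itself, which links back to the WCSS from l); a sketch:

# WCSS for k = 1..15; the kink ("elbow") suggests the number of clusters
wcss <- sapply(1:15, function(k) kmeans(wines, k, nstart = 10)$tot.withinss)
plot(1:15, wcss, type = "b", xlab = "Number of Clusters",
     ylab = "Within-Cluster Sum of Squares")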
