RENR690-SDM. Zihaohan Sang. March 26, 2018


Intro

This lab walks through the key steps of species distribution modeling (SDM): data preparation, model fitting, and final evaluation. In this tutorial, I introduce two machine learning methods (KNN and SVM) and cover model evaluation criteria (accuracy, among others).

Species distribution modeling (SDM) is also known as climate envelope modeling, habitat modeling, and ecological niche modeling. The aim of SDM is to estimate how similar the conditions at any site are to the conditions at places of known occurrence (and perhaps of non-occurrence). The most common application of SDM is to predict the probability of species presence with climate data as predictors.

Step1. Load Data

We are using example files that are installed with the dismo package. The system.file() function returns the path where the dismo package is installed on your computer; we then create a RasterStack from the layers stored there. The example uses data representing bioclimatic variables from the WorldClim database (Hijmans et al., 2004). In this lab we use the following bioclimatic variables as predictors:

bio1: annual mean temperature (°C)
bio12: annual precipitation (mm)
bio16: precipitation of the wettest three months (mm)
bio17: precipitation of the driest three months (mm)
bio5: max temperature of the warmest month (°C)
bio6: min temperature of the coldest month (°C)
bio7: temperature annual range, bio5 - bio6 (°C)
bio8: mean temperature of the wettest three months (°C)

    knitr::opts_chunk$set(warning = FALSE, message = FALSE)
    ## install the following packages for this tutorial
    ## install.packages(c('raster', 'rgdal', 'dismo'))
    require(dismo)
    require(raster)
    require(rgdal)
    ### load environmental values, and combine all variables into one raster stack
    files <- list.files(path = paste(system.file(package = "dismo"), '/ex', sep = ""),
                        pattern = "grd", full.names = TRUE)[1:8]
    predictors <- stack(files)
    plot(predictors) # plot each bioclimatic variable
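
As a quick check on the file lookup described above, the sketch below prints what system.file() finds in the dismo example folder. It is only an illustration: ex_dir and grd_files are names introduced here, and the code falls back to a message when dismo is not installed.

```r
# system.file() returns "" when the package or the sub-folder cannot be found
ex_dir <- system.file("ex", package = "dismo")

if (nzchar(ex_dir)) {
  # list the raster header files in the example folder
  grd_files <- list.files(ex_dir, pattern = "\\.grd$", full.names = TRUE)
  print(basename(grd_files))
} else {
  message("dismo is not installed; run install.packages('dismo') first")
}
```

The example folder typically also holds a categorical biome layer, which would explain why the tutorial keeps only the first eight files. Treat that as an assumption and compare it with the names printed on your machine.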

Step2. Data Preparation

It is critical to collect a sufficient number of occurrence records that document presence (and perhaps absence or abundance) of the species of interest. The biggest assumption of SDM is that the dataset provides useful information about the ecological requirements of the species. Data quality directly influences model performance (garbage in, garbage out). Now it is time to take a look at the data.

    ## load occurrence records of the species
    dat <- read.csv("dat.csv") # survey records; pd represents occurrence (1: presence; 0: absence)
    dat2 <- dat[, -3] ## just keep the coordinates of the points
    clim <- extract(predictors, dat2) # extract climatic values from the raster stack we just built
    # column-combine pd and the climate variables,
    # and convert the result to a data frame for later analysis
    sdmdat <- data.frame(cbind(pd = dat[, 3], clim))
    sdmdat$pd <- as.factor(sdmdat$pd)
    str(sdmdat)
    summary(sdmdat)

    # visually check the data points
    require(maptools)
    data("wrld_simpl") # load the global map
    plot(predictors, 1)
    plot(wrld_simpl, add = TRUE)
    points(lat ~ lon, data = dat, col = c("gray65", "blue")[sdmdat$pd], pch = 20, cex = 0.9)
    legend(-58, 28, legend = c("absence", "presence"), col = c("gray65", "blue"), pch = 20, bty = "n")

    # pairs plot of the climate variables
    pairs(sdmdat[, 2:6], cex = 0.8, col = c("gray", "blue")[sdmdat$pd])
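
Survey points that fall outside the raster coverage (for example in the ocean) come back from extract() as NA, and such rows can cause trouble at the model fitting stage. A minimal sketch of a missing-value and class-balance check follows. The data frame toy is an invented stand-in for sdmdat, with made-up values used purely for illustration.

```r
# Toy stand-in for sdmdat; all values are invented for illustration only
toy <- data.frame(
  pd    = factor(c(0, 0, 1, 0, 1, 0)),
  bio1  = c(250, NA, 180, 262, 175, 240),
  bio12 = c(1200, NA, 900, 1500, 870, 1100)
)

n_missing <- sum(!complete.cases(toy))  # rows with at least one NA
toy_clean <- na.omit(toy)               # drop those rows
class_counts <- table(toy_clean$pd)     # absences (0) versus presences (1)

print(n_missing)
print(class_counts)
```

The same three calls could be run on sdmdat itself. A strongly unbalanced class table is worth keeping in mind when reading accuracy values later on.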

Step3. Machine Learning Methods

To assess model performance fairly and to detect overfitting, the common approach is to split the dataset into two subsets: one for training the model (60%-70% of the data) and the rest for testing.

    # create a list of 70% of the rows in the original dataset we can use for training
    # install.packages(c("caret", "rpart"))
    require(caret)
    validation_index <- caret::createDataPartition(sdmdat$pd, p = 0.70, list = FALSE)
    # select 30% of the data for validation/testing
    validation <- sdmdat[-validation_index, ]
    # use the remaining 70% of the data for training the models
    dataset <- sdmdat[validation_index, ]
    dim(dataset)
    ## [1]

Great! It is time to create some models of the data and estimate their accuracy on unseen data.

SVM (Support Vector Machine)

SVM is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of predictor variables), with the value of each feature being the value of a particular coordinate. As in CART, the aim is to find boundaries that split the data into the two differently classified groups. The classifier is the boundary that has the largest distance to the closest points of each of the two groups.

KNN (K-Nearest Neighbors)

KNN can be used for both classification and regression problems. However, it is more widely used for classification problems in industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors. A case is assigned to the class that is most common among its K nearest neighbors, as measured by a distance function. The distance function can be the Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables and the last one for categorical variables. If K = 1, the case is simply assigned to the class of its nearest neighbor.
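
The majority vote described above can be sketched in a few lines of base R. This is a minimal illustration, not what caret runs internally: knn_vote() is a hypothetical helper written for this sketch, the training values are invented, and the predictors are standardized before the Euclidean distance is computed.

```r
# knn_vote() is a hypothetical helper for illustration, not a caret function.
# It classifies one new case by majority vote among its k nearest training cases.
knn_vote <- function(train_x, train_y, new_x, k = 3) {
  # standardize with the training means and standard deviations
  mu  <- colMeans(train_x)
  sdv <- apply(train_x, 2, sd)
  z_train <- sweep(sweep(train_x, 2, mu, "-"), 2, sdv, "/")
  z_new   <- (new_x - mu) / sdv
  # Euclidean distance from the new case to every training case
  d <- sqrt(rowSums(sweep(z_train, 2, z_new, "-")^2))
  nearest <- order(d)[seq_len(k)]
  # count the class labels of the k nearest cases; ties go to the first level
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Invented training data: two predictors, three absences and three presences
train_x <- matrix(c(10, 100,
                    12, 110,
                    11, 105,
                    25, 300,
                    27, 320,
                    26, 310), ncol = 2, byrow = TRUE)
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

print(knn_vote(train_x, train_y, new_x = c(24, 290), k = 3))
```

Using an odd k with two classes avoids tied votes, which this sketch would otherwise resolve in favor of the first factor level.
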
Data preprocessing is also necessary before using KNN: the data should be normalized, and KNN is relatively sensitive to outliers and noise.

    # SVM (method "svmRadial" relies on the kernlab package)
    set.seed(7)
    fit.svm <- train(pd ~ ., data = dataset, method = "svmRadial", preProcess = c('scale', 'center'))

    # KNN
    set.seed(7)
    fit.knn <- train(pd ~ ., data = dataset, method = "knn", preProcess = c('scale', 'center'))

Step4. Model Evaluation

But how do we evaluate a model? Accuracy and Kappa are the default metrics used to evaluate algorithms on binary and multi-class classification datasets in the caret package. Accuracy is the percentage of correctly classified samples out of all samples. Kappa, or Cohen's Kappa, is like classification accuracy, except that it is normalized at the baseline of random chance on your dataset. It is a more useful measure on problems with imbalanced classes (e.g. when the split between classes 0 and 1 is such that you can achieve 70% accuracy by predicting that all instances belong to class 0). In this example, KNN and SVM both struggle to identify suitable (high probability) sites for the species accurately, as their low Kappa values show.

The Area Under the ROC Curve (AUC), sensitivity, and specificity are also suitable for binary classification problems. The AUC shows a model's ability to discriminate between the positive and negative groups. An AUC of 1 represents a model that made every prediction correctly. An AUC of 0.5 represents a model that is as good as random chance. The ROC curve can be broken down into sensitivity and specificity. Sensitivity is the true positive rate: the proportion of cases from the positive class that were predicted correctly. Specificity is the true negative rate: the proportion of cases from the negative class that were predicted correctly.

    ## 1. fit.svm
    pred.svm <- predict(fit.svm, validation)
    # caret's signature is confusionMatrix(data = predicted, reference = observed);
    # this call passes the observed classes first, so the rows and columns of the
    # table are transposed relative to their labels
    caret::confusionMatrix(validation$pd, pred.svm)
    ## Confusion Matrix and Statistics
    ##
    ##           Reference
    ## Prediction 0 1
    ##
    ##
    ## Accuracy :
    ## 95% CI : (0.7816, )
    ## No Information Rate :
    ## P-Value [Acc > NIR] : 1
    ## Kappa :
    ## Mcnemar's Test P-Value : 4.402e-05
    ## Sensitivity :
    ## Specificity :
    ## Pos Pred Value :
    ## Neg Pred Value :
    ## Prevalence :
    ## Detection Rate :
    ## Detection Prevalence :
    ## Balanced Accuracy :
    require(ROCR)

    # convert predicted and observed classes to 0/1 for ROCR
    pred.pd <- ifelse(pred.svm == "1", 1, 0)
    test.pd <- ifelse(validation$pd == "1", 1, 0)
    predic <- ROCR::prediction(pred.pd, test.pd)
    perf <- performance(predic, "tpr", "fpr")
    ac <- performance(predic, 'auc')
    ac
    ## Slot "y.values":
    ## [[1]]
    ## [1]
    plot(perf)
    abline(0, 1, col = "gray50", lty = 2)

    ## 2. fit.knn
    pred.knn <- predict(fit.knn, validation)
    # observed classes are passed first here, although confusionMatrix() expects
    # (data = predicted, reference = observed), so the table is transposed
    caret::confusionMatrix(validation$pd, pred.knn)
    pred.pd <- ifelse(pred.knn == "1", 1, 0)
    test.pd <- ifelse(validation$pd == "1", 1, 0)
    predic <- ROCR::prediction(pred.pd, test.pd)
    perf <- performance(predic, "tpr", "fpr")
    ac <- performance(predic, 'auc')
    ac
    ## Slot "y.values":
    ## [[1]]
    ## [1]
    plot(perf)
    abline(0, 1, col = "gray50", lty = 2)

It is clear that neither method captures the ecological niche of the species well. Tuning the models may help to improve performance; otherwise we should try other methods or add relevant predictors.
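
To make the evaluation criteria concrete, here is a base R sketch that computes accuracy, sensitivity, specificity, Cohen's Kappa, and AUC by hand. The observed labels and predicted probabilities are invented for illustration, presence (1) is treated as the positive class, and the 0.5 threshold is an assumption. Unlike the ROCR calls above, which are fed hard 0/1 class predictions, the AUC here is computed from probabilities, so it reflects the ranking of sites across all thresholds.

```r
# Invented observations and predicted presence probabilities, illustration only
obs  <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
prob <- c(0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05)
pred <- as.integer(prob >= 0.5)   # hard classes at an assumed 0.5 threshold

# confusion table with predictions in rows and observations in columns
tab <- table(pred = factor(pred, levels = c(0, 1)),
             obs  = factor(obs,  levels = c(0, 1)))
tp <- tab["1", "1"]; tn <- tab["0", "0"]
fp <- tab["1", "0"]; fn <- tab["0", "1"]
n  <- sum(tab)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)     # true positive rate
specificity <- tn / (tn + fp)     # true negative rate

# Cohen's Kappa: observed agreement corrected for agreement expected by chance
p_chance    <- sum(rowSums(tab) * colSums(tab)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

# AUC: probability that a random presence is ranked above a random absence
r   <- rank(prob)
n1  <- sum(obs == 1)
n0  <- sum(obs == 0)
auc <- (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa, auc = auc), 3))
```

In this toy case accuracy looks high mainly because absences dominate, while Kappa is noticeably lower, which is the pattern the text warns about for imbalanced classes.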
