Package 'ESKNN'

September 13, 2015

Type: Package
Title: Ensemble of Subset of K-Nearest Neighbours Classifiers for Classification and Class Membership Probability Estimation
Version: 1.0
Date: 2015-09-13
Author: Asma Gul, Aris Perperoglou, Zardad Khan, Osama Mahmoud, Werner Adler, Miftahuddin Miftahuddin, and Berthold Lausen
Maintainer: Asma Gul <agul@essex.ac.uk>
Description: Functions for classification and class membership probability estimation are given. The issue of non-informative features in the data is addressed by an ensemble method: a few optimal models are selected from an initially large set of base k-nearest neighbours (kNN) models, each generated on a subset of features from the training data. A two-stage assessment is applied when selecting the optimal models for the ensemble in the training functions. The prediction functions return class outcomes and class membership probability estimates for the test data. The package also includes measures of classification error and the Brier score, for the classification and probability estimation tasks respectively.
Imports: caret, stats
LazyLoad: yes
License: GPL (>= 2)
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2015-09-13 09:22:47

R topics documented:

    ESkNN-package
    esknnClass
    esknnProb
    hepatitis
    Predict.esknnClass
    Predict.esknnProb
    sonar
    Index
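For orientation, the quick-start below condenses the hepatitis example documented later in this manual: recode the class labels to the (0,1) factor expected by the training function, split the data once, train with esknnClass() and predict with Predict.esknnClass(). The set.seed() call and the stored split index idx are additions made here for illustration, so that the example is reproducible and the test set is the exact complement of the training set.

# Quick start, condensed from the hepatitis example documented below
library(ESKNN)

data(hepatitis)
dat <- hepatitis
dat$Class <- as.factor(as.numeric(dat$Class) - 1)   # recode 1/2 to 0/1

set.seed(1)
idx   <- sample(1:nrow(dat), 0.7 * nrow(dat))       # one index, reused for both parts
train <- dat[idx, ]
test  <- dat[-idx, ]

model <- esknnClass(train[, names(train) != "Class"],
                    train[, names(train) == "Class"], k = NULL)
res   <- Predict.esknnClass(model,
                            test[, names(test) != "Class"],
                            test[, names(test) == "Class"], k = NULL)
res$ConfMatrix
res$ClassError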

ESkNN-package           Ensemble of Subset of K-Nearest Neighbours Classifiers for
                        Classification and Class Membership Probability Estimation

Description

Functions for building an ensemble of optimal k-nearest neighbours (kNN) models for classification and class membership probability estimation are provided, addressing the issue of non-informative features in the data. A set of base kNN models is generated and a subset of these models is selected for the ensemble based on their individual and combined performance. Out-of-bag data and an independent training data set are used to assess the models individually and collectively. Class labels and class membership probability estimates are returned by the prediction functions; other measures, such as the confusion matrix, classification error rate, and Brier score, are also returned.

Details

    Package:  ESKNN
    Type:     Package
    Version:  1.0
    Date:     2015-09-13
    License:  GPL (>= 2)

Author(s)

Asma Gul, Aris Perperoglou, Zardad Khan, Osama Mahmoud, Miftahuddin Miftahuddin, Werner Adler, and Berthold Lausen

Maintainer: Asma Gul <agul@essex.ac.uk>

References

Gul, A., Perperoglou, A., Khan, Z., Mahmoud, O., Miftahuddin, M., Adler, W. and Lausen, B. (2014), Ensemble of subset of k-nearest neighbours classifiers, journal name to appear.
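The selection idea described above can be pictured with a minimal sketch: grow m base kNN models, each on a bootstrap sample restricted to a random feature subset of size ss, score every model on its out-of-bag observations, and keep the best q percent as the ensemble. This is not the package's implementation; the helper name sketch_esknn, the use of class::knn, the assumption of a numeric feature matrix, and the simplification to a single out-of-bag accuracy ranking (rather than the package's two-stage individual-plus-combined assessment) are all assumptions of the sketch.

# Conceptual sketch only: rank bootstrap/feature-subset kNN models by out-of-bag accuracy
library(class)  # provides knn()

sketch_esknn <- function(x, y, k = 3, m = 101, q = 20, ss = floor(ncol(x) / 3)) {
  # x: numeric matrix or data frame of features; y: binary factor of class labels
  models  <- vector("list", m)
  oob_acc <- numeric(m)
  for (i in seq_len(m)) {
    boot  <- sample(nrow(x), replace = TRUE)        # bootstrap sample of rows
    feats <- sample(ncol(x), ss)                    # random feature subset
    oob   <- setdiff(seq_len(nrow(x)), boot)        # out-of-bag rows
    pred  <- knn(train = x[boot, feats, drop = FALSE],
                 test  = x[oob,  feats, drop = FALSE],
                 cl = y[boot], k = k)
    oob_acc[i]   <- mean(pred == y[oob])            # score this base model
    models[[i]]  <- list(rows = boot, features = feats)  # kNN is lazy: the "model" is its data
  }
  keep <- order(oob_acc, decreasing = TRUE)[seq_len(ceiling(m * q / 100))]
  list(models = models[keep], oob_accuracy = oob_acc[keep])
}

# e.g., with a numeric design matrix X and binary factor y:
# fit <- sketch_esknn(X, y, k = 3, m = 101, q = 20)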

esknnClass              Train ensemble of subset of k-nearest neighbours classifiers for
                        classification

Description

Constructs m models and searches for the optimal models for classification.

Usage

esknnClass(xtrain, ytrain, k = NULL, q = NULL, m = NULL, ss = NULL)

Arguments

xtrain   A matrix or data frame of dimension n x d, where n is the number of training observations and d is the number of features.
ytrain   A vector of class labels of the training data. Class labels should be a factor with two levels (0,1), represented by the variable Class in the data.
k        Number of nearest neighbours to be considered; when NULL the default is k = 3.
q        Percentage of models to be selected from the initial set of m models.
m        Number of models to be generated in the first stage; when NULL the default is m = 501.
ss       Feature subset size to be selected from the d features for each bootstrap sample; when NULL the default is (number of features)/3.

Value

trainfinal   List of the extracted optimal models.
fsfinal      List of the features used in each selected model.

Author(s)

Asma Gul <agul@essex.ac.uk>

References

Gul, A., Perperoglou, A., Khan, Z., Mahmoud, O., Miftahuddin, M., Adler, W. and Lausen, B. (2014), Ensemble of subset of kNN classifiers, journal name to appear.

See Also

Predict.esknnClass

Examples

# Load the data
data(hepatitis)
data <- hepatitis

# Divide the data into testing and training parts
Class <- data[, names(data) == "Class"]
data$Class <- as.factor(as.numeric(Class) - 1)
ind <- sample(1:nrow(data), 0.7 * nrow(data))   # store the index so train and test do not overlap
train <- data[ind, ]
test  <- data[-ind, ]
ytrain <- train[, names(train) == "Class"]
xtrain <- train[, names(train) != "Class"]
xtest  <- test[, names(test) != "Class"]
ytest  <- test[, names(test) == "Class"]

# Train esknnClass
model <- esknnClass(xtrain, ytrain, k = NULL)

# Predict on test data
resClass <- Predict.esknnClass(model, xtest, ytest, k = NULL)

# Returned objects are the predicted class labels, confusion matrix and classification error
resClass$PredClass
resClass$ConfMatrix
resClass$ClassError


esknnProb               Train the ensemble of subset of k-nearest neighbours classifiers for
                        estimation of class membership probability

Description

This function selects a subset of optimal models from a set of m models, initially generated on bootstrap samples with random feature subsets from the training data, for class membership probability estimation. The values of the hyperparameters, for example the proportion of best models to keep from the initial m models, can be specified by the user; otherwise the default values are used.

Usage

esknnProb(xtrain, ytrain, k = NULL, q = NULL, m = NULL, ss = NULL)

Arguments

xtrain   A matrix or data frame of dimension n x d, where n is the number of training observations and d is the number of features.
ytrain   A vector of class labels for the training data. Class labels should be a factor with two levels (0,1), represented by the variable Class in the data.
k        Number of nearest neighbours to be considered; when NULL the default is k = 3.
q        Percentage of models to be selected from the initial set of m models.

m        Number of models to be generated in the first stage; when NULL the default is m = 501.
ss       Feature subset size to be selected from the d features for each bootstrap sample; when NULL the default is (number of features)/3.

Value

trainfinal   List of the extracted optimal models.
fsfinal      List of the features used in each selected model.

Author(s)

Asma Gul <agul@essex.ac.uk>

References

Gul, A., Perperoglou, A., Khan, Z., Mahmoud, O., Miftahuddin, M., Adler, W. and Lausen, B. (2014), Ensemble of subset of kNN classifiers, journal name to appear.

See Also

Predict.esknnProb

Examples

# Load the data
data(sonar)
data <- sonar
# Divide the data into testing and training parts
Class <- data[, names(data) == "Class"]
data$Class <- as.factor(as.numeric(Class) - 1)
ind <- sample(1:nrow(data), 0.7 * nrow(data))   # store the index so train and test do not overlap
train <- data[ind, ]
test  <- data[-ind, ]
ytrain <- train[, names(train) == "Class"]
xtrain <- train[, names(train) != "Class"]
xtest  <- test[, names(test) != "Class"]
ytest  <- test[, names(test) == "Class"]

# Train esknnProb on training data
model <- esknnProb(xtrain, ytrain, k = NULL)

# Predict on test data
resProb <- Predict.esknnProb(model, xtest, ytest, k = NULL)

## Returned objects
resProb$PredProb
resProb$BrierScore


hepatitis               Hepatitis data set

Description

This data set is about hepatitis disease and is obtained from the UCI machine learning repository. There are 155 observations in total; this data set consists of the 80 observations that remain after removing observations with missing values. There are 19 features/attributes, of which 13 are binary and 6 are discrete valued. The observations are categorized into two classes, "die" and "live": 13 observations are in class "die" and 67 in class "live".

Usage

data(hepatitis)

Format

A data frame with 80 observations on the following 20 variables.

Age             Age of the patients in years, from 20 to 80 years.
Sex             Gender of the patient, a factor at two levels coded by 1 (male) and 2 (female).
Steroid         Steroid treatment, a factor at two levels coded by 1 (yes) and 2 (no).
Antivirals      Antiviral medication, a factor at two levels 1 (yes) and 2 (no).
Fatigue         Fatigue, a frequent and disabling symptom reported by patients with chronic hepatitis, a factor at two levels 1 (yes) and 2 (no).
Malaise         Malaise, one of the symptoms of hepatitis, a factor at two levels 1 (yes) and 2 (no).
Anorexia        Anorexia, loss of appetite, a factor at two levels 1 (yes) and 2 (no).
LiverBig        Whether the liver is enlarged or fatty, a factor at two levels 1 (yes) and 2 (no).
LiverFirm       A factor at two levels 1 (yes) and 2 (no).
SpleenPalpable  Splenomegaly, an enlargement of the spleen, a factor at two levels 1 (yes) and 2 (no).
Spiders         Enlarged blood vessels that resemble little spiders, a factor at two levels 1 (yes) and 2 (no).
Ascites         Ascites, the presence of excess fluid in the peritoneal cavity, a factor at two levels 1 (yes) and 2 (no).
Varices         A factor at two levels 1 (yes) and 2 (no).
Bilirubin       Bilirubin, a substance made when the body breaks down old red blood cells; a continuous feature.
AlkPhosphate    Alkaline phosphatase, an enzyme made in liver cells and bile ducts; a discrete valued feature giving the alkaline phosphatase level.
Sgot            A discrete valued feature.
AlbuMin         A continuous feature.
ProTime         A discrete valued feature.
Histology       A factor at two levels 1 (yes) and 2 (no).
Class           A factor at two levels 1 (die) or 2 (live).
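Note that the training functions in this package expect the binary class to be coded as a (0,1) factor, whereas the stored Class variable is coded 1 (die) / 2 (live). The recoding used throughout the package examples is:

data(hepatitis)
# Recode Class from 1/2 to the 0/1 factor expected by esknnClass()/esknnProb()
hepatitis$Class <- as.factor(as.numeric(hepatitis$Class) - 1)
table(hepatitis$Class)   # 0 corresponds to "die", 1 to "live"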

Source

This data set is available at: https://archive.ics.uci.edu/ml/datasets/hepatitis

Examples

data(hepatitis)
str(hepatitis)


Predict.esknnClass      Class predictions from ensemble of subset of k-nearest neighbours
                        classifiers

Description

Classification prediction for test data using a trained esknnClass object.

Usage

Predict.esknnClass(optModels, xtest, ytest = NULL, k = NULL)

Arguments

optModels   An object of class esknnClass.
xtest       A matrix or data frame of test set features/attributes.
ytest       Optional: a vector of class labels for the test data, one per test observation. Should be binary (0,1), represented by the variable Class in the data. If provided, the confusion matrix and classification error rate are returned.
k           Number of nearest neighbours considered; the same value as used for training in esknnClass.

Value

PredClass    A vector of predicted classes of the test set observations.
ConfMatrix   Confusion matrix: a matrix of cross-classification counts based on the estimated class labels and the true class labels of the test observations. Returned if ytest is given.
ClassError   Classification error rate of the classifier for the test set observations. Returned if ytest is provided.
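For reference, the classification error rate is the complement of accuracy and can be recovered from the confusion matrix. A small illustrative check, assuming res holds the output of Predict.esknnClass as in the examples below:

# Illustrative check, assuming res <- Predict.esknnClass(model, xtest, ytest, k = NULL)
cm  <- res$ConfMatrix                 # cross-classification counts
err <- 1 - sum(diag(cm)) / sum(cm)    # misclassified proportion
err                                   # should agree with res$ClassError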

Author(s)

Asma Gul <agul@essex.ac.uk>

References

Gul, A., Perperoglou, A., Khan, Z., Mahmoud, O., Miftahuddin, M., Adler, W. and Lausen, B. (2014), Ensemble of subset of kNN classifiers, journal name to appear.

See Also

esknnClass

Examples

# Load the data
data(hepatitis)
data <- hepatitis
# Split the data into testing and training parts
Class <- data[, names(data) == "Class"]
data$Class <- as.factor(as.numeric(Class) - 1)
ind <- sample(1:nrow(data), 0.7 * nrow(data))   # store the index so train and test do not overlap
train <- data[ind, ]
test  <- data[-ind, ]
ytrain <- train[, names(train) == "Class"]
xtrain <- train[, names(train) != "Class"]
xtest  <- test[, names(test) != "Class"]
ytest  <- test[, names(test) == "Class"]

# Train esknnClass using the training data
model <- esknnClass(xtrain, ytrain, k = NULL)

# Predict on test data
resClass <- Predict.esknnClass(model, xtest, ytest, k = NULL)

# Returned objects are the predicted class labels, confusion matrix and classification error
resClass$PredClass
resClass$ConfMatrix
resClass$ClassError


Predict.esknnProb       Prediction function returning class membership probability estimates

Description

This function provides class membership probability estimates for the test set observations.

Usage

Predict.esknnProb(optModels, xtest, ytest, k = NULL)

Arguments

optModels   An object of class esknnProb.
xtest       A matrix or data frame of test set features/attributes.
ytest       Optional: a vector of class labels for the test data. Class labels should be a factor with two levels (0,1), represented by the variable Class in the data. The Brier score is returned if this vector is given.
k           Number of nearest neighbours considered; the same value should be used as for training in esknnProb.

Value

PredProb     A vector of estimated class membership probabilities of the test set observations.
BrierScore   The Brier score based on the estimated probabilities and the true class labels of the test set observations. Returned if ytest is given.
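As a reminder of the measure, the Brier score for binary outcomes is the mean squared difference between the predicted probability and the observed 0/1 label. A minimal sketch of the calculation, not the package's internal code, assuming phat holds probabilities of class 1 (as returned in PredProb) and ytest is the 0/1 factor used throughout the examples:

# Minimal sketch of the Brier score computation (illustrative only)
brier <- function(phat, ytest) {
  y <- as.numeric(as.character(ytest))   # 0/1 factor -> numeric 0/1
  mean((phat - y)^2)                     # mean squared deviation
}
# e.g. brier(resProb$PredProb, ytest)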

Author(s)

Asma Gul <agul@essex.ac.uk>

References

Gul, A., Perperoglou, A., Khan, Z., Mahmoud, O., Miftahuddin, M., Adler, W. and Lausen, B. (2014), Ensemble of subset of kNN classifiers, journal name to appear.

See Also

esknnProb

Examples

# Load the data
data(sonar)
data <- sonar
# Divide the data into testing and training parts
Class <- data[, names(data) == "Class"]   # the class variable must be a factor in (0,1)
data$Class <- as.factor(as.numeric(Class) - 1)
ind <- sample(1:nrow(data), 0.7 * nrow(data))   # store the index so train and test do not overlap
train <- data[ind, ]
test  <- data[-ind, ]
ytrain <- train[, names(train) == "Class"]
xtrain <- train[, names(train) != "Class"]
xtest  <- test[, names(test) != "Class"]
ytest  <- test[, names(test) == "Class"]

# Train esknnProb
model <- esknnProb(xtrain, ytrain, k = NULL)

# Predict on test data
resProb <- Predict.esknnProb(model, xtest, ytest, k = NULL)

## Returned objects
resProb$PredProb
resProb$BrierScore


sonar                   Sonar, Mines vs. Rocks

Description

This data set is a collection of sonar signals, coded as 60 continuous attributes on 208 observations. The sonar signals were obtained from a variety of aspect angles, spanning 90 degrees for mines and 180 degrees for rocks. The task is to classify the sonar signals into two categories: signals bounced off a "rock" or off a "metal cylinder". Each pattern is a set of 60 continuous numbers in the range 0.0 to 1.0, where each number represents the energy within a particular frequency band, integrated over a certain period of time. Of the 208 observations, 111 were obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions and are labelled "M", while 97 patterns obtained from rocks under similar conditions are labelled "R".

Usage

data(sonar)

Format

A data frame with 208 observations on 60 features/attributes in two classes. All the features are numerical and the class is nominal.

Source

This data set is available at: ftp://ftp.ics.uci.edu/pub/machine-learning-databases and http://sci2s.ugr.es/keel/dataset.php?cod=85

References

Gorman, R. P., and Sejnowski, T. J. (1988). Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1, 75-89.

Friedrich Leisch and Evgenia Dimitriadou (2010). mlbench: Machine Learning Benchmark Problems. R package version 2.1-1.

Examples

data(sonar)
str(sonar)

Index

Topic Predict.esknnProb
    Predict.esknnProb
Topic datasets
    hepatitis
    sonar
Topic esknnClass
    esknnClass
    esknnProb
Topic esknn
    esknnClass
    esknnProb
    Predict.esknnProb
Topic packages
    Predict.esknnClass
Topic package
    ESkNN-package

ESkNN (ESkNN-package)
ESkNN-package
esknnClass
esknnProb
hepatitis
Predict.esknnClass
Predict.esknnProb
sonar