Package RPEnsemble. October 7, 2017

Similar documents
Package snn. August 23, 2015

Package RLT. August 20, 2018

Package RandPro. January 10, 2018

Package crossval. R topics documented: March 31, Version Date Title Generic Functions for Cross Validation

Package obliquerf. February 20, Index 8. Extract variable importance measure

Package clusterrepro

Package FastKNN. February 19, 2015

Package RaPKod. February 5, 2018

Package fusionclust. September 19, 2017

Package kirby21.base

Package flam. April 6, 2018

The supclust Package

Comparison of Linear Regression with K-Nearest Neighbors

Package randomglm. R topics documented: February 20, 2015

Package bisect. April 16, 2018

k-nn classification with R QMMA

Package hsiccca. February 20, 2015

Package DTRlearn. April 6, 2018

Network Traffic Measurements and Analysis

Package ssa. July 24, 2016

Package TANDEM. R topics documented: June 15, Type Package

Stat 4510/7510 Homework 4

Package hmeasure. February 20, 2015

Package RAMP. May 25, 2017

Package MixSim. April 29, 2017

Package lolr. April 13, Type Package Title Linear Optimal Low-Rank Projection Version 2.0 Date

Package bayescl. April 14, 2017

Package ESKNN. September 13, 2015

Package rngsetseed. February 20, 2015

Package mgc. April 13, 2018

Package densityclust

Package nmslibr. April 14, 2018

Stat 602X Exam 2 Spring 2011

Package missforest. February 15, 2013

Package nodeharvest. June 12, 2015

Package CepLDA. January 18, 2016

Package Rtsne. April 14, 2017

Package svmpath. R topics documented: August 30, Title The SVM Path Algorithm Date Version Author Trevor Hastie

Package RCA. R topics documented: February 29, 2016

Package FPDclustering

Overview. Non-Parametrics Models Definitions KNN. Ensemble Methods Definitions, Examples Random Forests. Clustering. k-means Clustering 2 / 8

SUPERVISED LEARNING WITH SCIKIT-LEARN. How good is your model?

Package kernelfactory

Package rferns. November 15, 2017

Package anidom. July 25, 2017

Package LVGP. November 14, 2018

Package subsemble. February 20, 2015

Package BiocNeighbors

Package BASiNET. October 2, 2018

Package dkdna. June 1, Description Compute diffusion kernels on DNA polymorphisms, including SNP and bi-allelic genotypes.

Package TVsMiss. April 5, 2018

Homework 3: Solutions

Package JOUSBoost. July 12, 2017

Lecture 25: Review I

Package penalizedlda

Package cwm. R topics documented: February 19, 2015

Package ECctmc. May 1, 2018

Package randomforest.ddr

Package rocc. February 20, 2015

Package areaplot. October 18, 2017

Package Combine. R topics documented: September 4, Type Package Title Game-Theoretic Probability Combination Version 1.

Package spark. July 21, 2017

Package CONOR. August 29, 2013

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Package smoother. April 16, 2015

Package subtype. February 20, 2015

Package Modeler. May 18, 2018

Package nonlinearicp

Package SCORPIUS. June 29, 2018

HW7 Solutions. Gabe Hope. May The accuracy of the nearest neighbor classifier was: 0.9, and the classifier took 2.55 seconds to run.

Package CuCubes. December 9, 2016

Package optimus. March 24, 2017

10/5/2017 MIST.6060 Business Intelligence and Data Mining 1. Nearest Neighbors. In a p-dimensional space, the Euclidean distance between two records,

Package clusternomics

Package longclust. March 18, 2018

Package ipft. January 4, 2018

Package rmi. R topics documented: August 2, Title Mutual Information Estimators Version Author Isaac Michaud [cre, aut]

Package esabcv. May 29, 2015

Package datasets.load

EPL451: Data Mining on the Web Lab 5

Package sfc. August 29, 2016

Package rstream. R topics documented: June 21, Version Date Title Streams of Random Numbers

Package PUlasso. April 7, 2018

Package texteffect. November 27, 2017

How do we obtain reliable estimates of performance measures?

Package KRMM. R topics documented: June 3, Type Package Title Kernel Ridge Mixed Model Version 1.0 Author Laval Jacquin [aut, cre]

Package ArCo. November 5, 2017

Package ordinalclust

Package glassomix. May 30, 2013

Package PrivateLR. March 20, 2018

Package glinternet. June 15, 2018

The Curse of Dimensionality

Package gggenes. R topics documented: November 7, Title Draw Gene Arrow Maps in 'ggplot2' Version 0.3.2

Nonparametric Classification Methods

Package StVAR. February 11, 2017

Package cattonum. R topics documented: May 2, Type Package Version Title Encode Categorical Features

Package strat. November 23, 2016

Package GLDreg. February 28, 2017

Package MonteCarlo. April 23, 2017

Package acebayes. R topics documented: November 21, Type Package

Transcription:

Version 0.4 Date 2017-10-06 Title Random Projection Ensemble Classification Package RPEnsemble October 7, 2017 Author Maintainer Timothy I. Cannings <cannings@marshall.usc.edu> Implements the methodology of ``Cannings, T. I. and Samworth, R. J. (2017) Randomprojection ensemble classification, J. Roy. Stat. Soc., Ser. B. (with discussion), 79, 959-- 1035''. The random projection ensemble classifier is a general method for classification of highdimensional data, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. The random projections are divided into non-overlapping blocks, and within each block the projection yielding the smallest estimate of the test error is selected. The random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Depends R (>= 3.4.2), MASS, parallel Imports class, stats License GPL-3 URL http:, http://www-bcf.usc.edu/~cannings/ NeedsCompilation no Repository CRAN Date/Publication 2017-10-07 14:28:08 UTC R topics documented: RPEnsemble-package.................................... 2 Other.classifier....................................... 3 R............................................... 4 RPalpha........................................... 5 RPChoose.......................................... 6 RPChooseSS........................................ 7 RPEnsembleClass...................................... 9 1

2 RPEnsemble-package RPGenerate......................................... 10 RPModel.......................................... 11 RPParallel.......................................... 12 Index 14 RPEnsemble-package Random Projection Ensemble Classification Implements the methodology of Cannings and Samworth (2015). The random projection ensemble classifier is a very general method for classification of high-dimensional data, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. The random projections are divided into non-overlapping blocks, and within each block the projection yielding the smallest estimate of the test error is selected. The random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Details RPChoose chooses the projection from a block of size B2 that minimises an estimate of the test error (see Cannings and Samworth, 2015, Section 3), and classifies the training and test sets using the base classifier on the projected data. RPParallel makes many calls to RPChoose in parallel. RPalpha chooses the best empirical value of alpha (see Cannings and Samworth, 2015, Section 5.1). RPEnsembleClass combines the results of many base classifications to classify the test set. The method can be used with any base classifier, any test error estimate and any distribution of the random projections. This package provides code for the following options: Classifiers linear discriminant analysis, quadratic discriminant analysis and the k-nearest neighbour classifier. Error estimates resubstitution and leave-one-out, we also provide code for the samplesplitting method described in Cannings and Samworth (2015, Section 7) (this can be done by setting estmethod = samplesplit). Projection distribution Haar, Gaussian or axis-aligned projections. The package provides the option to add your own base classifier and estimation method, this can be done by editing the code in the function Other.classifier. Moreover, one could edit the RPGenerate function to generate projections from different distributions. Maintainer: Timothy I. Cannings <t.cannings@statslab.cam.ac.uk>

Other.classifier 3 #generate data from Model 1 set.seed(101) Train <- RPModel(2, 50, 100, 0.5) Test <- RPModel(2, 100, 100, 0.5) #Classify the training and test set for B1 = 10 independent projections, each #one carefully chosen from a block of size B2 = 10, using the "knn" base #classifier and the leave-one-out test error estimate Out <- RPParallel(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2, B1 = 10, B2 = 10, base = "knn", projmethod = "Haar", estmethod = "loo", splitsample = FALSE, k = seq(1, 25, by = 3), clustertype = "Default") #estimate the class 1 prior probability phat <- sum(train$y == 1)/50 #choose the best empirical value of the voting threshold alpha alphahat <- RPalpha(RP.out = Out, Y = Train$y, p1 = phat) #combine the base classifications Class <- RPEnsembleClass(RP.out = Out, n = 50, n.test = 100, p1 = phat, alpha = alphahat) #calculate the error mean(class!= Test$y) #Code for sample splitting version of the above #n.val <- 25 #s <- sample(1:50,25) #OutSS <- RPParallel(XTrain = Train$x[-s,], YTrain = Train$y[-s], #XVal = Train$x[s,], YVal = Train$y[s], XTest = Test$x, d = 2, #B1 = 50, B2 = 10, base = "knn", projmethod = "Haar", estmethod = "samplesplit", #k = seq(1,13, by = 2), clustertype = "Fork", cores = 1) #alphahatss <- RPalpha(RP.out = OutSS, Y = Train$y[s], p1 = phat) #ClassSS <- RPEnsembleClass(RP.out = OutSS, n.val = 25, n.test = 100, #p1 = phat, samplesplit = TRUE, alpha = alphahatss) #mean(classss!= Test$y) Other.classifier The users favourite classifier User defined code to convert existing R code for classification to the correct format Other.classifier(x, grouping, xtest, CV,...)

4 R x An n by p matrix containing the training dataset grouping A vector of length n containing the training data classes xtest An n.test by p test dataset CV If TRUE perform cross-validation (or otherwise) to classify training set. If FALSE, classify test set.... Optional arguments e.g. tuning parameters Details User editable code for your choice of base classifier. class a vector of classes of the training or test set R A rotation matrix The 100 by 100 rotation matrix used in Model 2 in Cannings and Samworth (2016). data(r) Format A 100 by 100 rotation matrix Cannings, T. I. and Samworth, R. J. (2016) Random projection ensemble classification. http: data(r) head(r%*%t(r))

RPalpha 5 RPalpha Choose alpha Chooses the best empirical value of the cutoff alpha, based on the leave-one-out, resubstitution or sample-split estimates of the class labels. RPalpha(RP.out, Y, p1) RP.out Y p1 The result of a call to RPParallel Vector of length n or n.val containing the training or validation dataset classes (Empirical) prior probability Details See precise details in Cannings and Samworth (2015, Section 5.1). alpha The value of alpha that minimises the empirical error See Also RPParallel Train <- RPModel(1, 50, 100, 0.5) Test <- RPModel(1, 100, 100, 0.5) Out <- RPParallel(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2, B1 = 10, B2 = 10, base = "LDA", projmethod = "Haar", estmethod = "training", cores = 1) alpha <- RPalpha(RP.out = Out, Y = Train$y, p1 = sum(train$y == 1)/length(Train$y)) alpha

6 RPChoose RPChoose Chooses projection Chooses a the best projection from a set of size B2 based on a test error estimate, then classifies the training and test sets using the chosen projection. RPChoose(XTrain, YTrain, XTest, d, B2 = 10, base = "LDA", k = c(3,5), projmethod = "Haar", estmethod = "training",...) XTrain YTrain XTest d B2 base k Details Note projmethod estmethod An n by p matrix containing the training data feature vectors A vector of length n of the classes (either 1 or 2) of the training data An n.test by p matrix of the test data The lower dimension of the image space of the projections The block size The base classifier one of "knn","lda","qda" or "other" The options for k if base is "knn" Either "Haar", "Gaussian" or "axis" Method for estimating the test errors to choose the projection: either training error "training" or leave-one-out "loo"... Optional further arguments if base = "other" Randomly projects the the data B2 times. Chooses the projection yielding the smallest estimate of the test error. Classifies the training set (via the same method as estmethod) and test set using the chosen projection. Returns a vector of length n + n.test: the first n entries are the estimated classes of the training set, the last n.test are the estimated classes of the test set. Resubstitution method unsuitable for the k-nearest neighbour classifier.

RPChooseSS 7 See Also RPParallel, RPChooseSS, lda, qda, knn set.seed(100) Train <- RPModel(1, 50, 100, 0.5) Test <- RPModel(1, 100, 100, 0.5) Choose.out5 <- RPChoose(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2, B2 = 5, base = "QDA", projmethod = "Haar", estmethod = "loo") Choose.out10 <- RPChoose(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2, B2 = 10, base = "QDA", projmethod = "Haar", estmethod = "loo") sum(choose.out5[1:50]!= Train$y) sum(choose.out10[1:50]!= Train$y) sum(choose.out5[51:150]!= Test$y) sum(choose.out10[51:150]!= Test$y) RPChooseSS A sample splitting version of RPChoose Chooses the best projection based on an estimate of the test error of the classifier with training data (XTrain, YTrain), the estimation method counts the number of errors made on the validation set (XVal, YVal). RPChooseSS(XTrain, YTrain, XVal, YVal, XTest, d, B2 = 100, base = "LDA", k = c(3, 5), projmethod = "Haar",...) XTrain YTrain XVal YVal XTest d B2 base An n by p matrix containing the training data feature vectors A vector of length n of the classes (either 1 or 2) of the training data An n.val by p matrix containing the validation data feature vectors A vector of length n.val of the classes (either 1 or 2) of the validation data An n.test by p matrix of the test data feature vectors The lower dimension of the image space of the projections The block size The base classifier one of "knn","lda","qda" or "other"

8 RPChooseSS k The options for k if base = "knn" projmethod Either "Haar", "Gaussian" or "axis"... Optional further arguments if base = "other" Details Maps the the data using B2 random projections. For each projection the validation set is classified using the the training set and the projection yielding the smallest number of errors over the validation set is retained. The validation set and test set are then classified using the chosen projection. Returns a vector of length n.val + n.test: the first n.val entries are the estimated classes of the validation set, the last n.test are the estimated classes of the test set. See Also RPParallel, RPChoose, lda, qda, knn set.seed(100) Train <- RPModel(1, 50, 100, 0.5) Validate <- RPModel(1, 50, 100, 0.5) Test <- RPModel(1, 100, 100, 0.5) Choose.out5 <- RPChooseSS(XTrain = Train$x, YTrain = Train$y, XVal = Validate$x, YVal = Validate$y, XTest = Test$x, d = 2, B2 = 5, base = "QDA", projmethod = "Haar") Choose.out10 <- RPChooseSS(XTrain = Train$x, YTrain = Train$y, XVal = Validate$x, YVal = Validate$y, XTest = Test$x, d = 2, B2 = 10, base = "QDA", projmethod = "Haar") sum(choose.out5[1:50]!= Validate$y) sum(choose.out10[1:50]!= Validate$y) sum(choose.out5[51:150]!= Test$y) sum(choose.out10[51:150]!= Test$y)

RPEnsembleClass 9 RPEnsembleClass Classifies the test set using the random projection ensemble classifier Performs a biased majority vote over B1 base classifications to assign the test set. RPEnsembleClass(RP.out, n, n.val, n.test, p1, samplesplit, alpha,...) RP.out The result of a call to RPParallel n Training set sample size n.test Test set sample size n.val Validation set sample size p1 Prior probability estimate samplesplit TRUE if using sample-splitting method alpha The voting threshold... Optional further arguments if base = "other" Details An observation in the test set is assigned to class 1 if B1*alpha or more of the base classifications are class 1 (otherwise class 2). A vector of length n.test containing the class predictions of the test set (either 1 or 2). See Also RPParallel, RPalpha, RPChoose

10 RPGenerate Train <- RPModel(1, 50, 100, 0.5) Test <- RPModel(1, 100, 100, 0.5) Out <- RPParallel(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2, B1 = 50, B2 = 10, base = "LDA", projmethod = "Haar", estmethod = "training", clustertype = "Default") Class <- RPEnsembleClass(RP.out = Out, n = length(train$y), n.test = nrow(test$x), p1 = sum(train$y == 1)/length(Train$y), splitsample = FALSE, alpha = RPalpha(Out, Y = Train$y, p1 = sum(train$y == 1)/length(Train$y))) mean(class!= Test$y) RPGenerate Generates random matrices Generates B2 random p by d matrices according to Haar measure, Gaussian or axis-aligned projections RPGenerate(p = 100, d = 10, method = "Haar", B2 = 10) p d method B2 The original data dimension The lower dimension Projection distribution, either "Haar" for Haar distributed projections, "Gaussian" for Gaussian distributed projections with i.i.d. N(0,1/p) entries, "axis" for uniformly distributed axis aligned projections, or "other" for user defined method the number of projections returns B2 p by d random matrices as a single p by d*b2 matrix

RPModel 11 R1 <- RPGenerate(p = 20, d = 2, "Haar", B2 = 3) t(r1)%*%r1 R2 <- RPGenerate(p = 20, d = 2, "Gaussian", B2 = 3) t(r2)%*%r2 R3 <- RPGenerate(p = 20, d = 2, "axis", B2 = 3) colsums(r3) rowsums(r3) RPModel Generate pairs (x,y) from joint distribution Generates data from the models described in Cannings and Samworth (2016) RPModel(Model.No, n, p, Pi = 1/2) Model.No n p Pi Model Number Sample size Data dimension Class one prior probability x An n by p data matrix n observations of the p-dimensional features y A vector of length n containing the classes (either 1 or 2) Note Models 1 and 2 require p = 100 or 1000. Cannings, T. I. and Samworth, R. J. (2016) Random projection ensemble classification. http:

12 RPParallel Data <- RPModel(Model.No = 1, 100, 100, Pi = 1/2) table(data$y) colmeans(data$x[data$y==1,]) colmeans(data$x[data$y==2,]) RPParallel Chooses a projection from each block in parallel Makes B1 calls to RPChoose or RPChooseSS in parallel and returns the results as a matrix. RPParallel(XTrain, YTrain, XVal, YVal, XTest, d, B1 = 500, B2 = 50, base = "LDA",projmethod = "Gaussian", estmethod = "training", k = c(3,5,9), clustertype = "Default", cores = 1, machines = NULL, seed = 1,... ) XTrain YTrain XVal YVal XTest d B1 B2 base k projmethod estmethod clustertype cores machines seed An n by p matrix containing the training data feature vectors A vector of length n containing the classes (either 1 or 2) of the training data An n.val by p matrix containing the validation data feature vectors A vector of length n.val of the classes (either 1 or 2) of the validation data An n.test by p matrix containing the test data feature vectors The lower dimension of the image space of the projections The number of blocks The size of each block The base classifier one of "knn","lda","qda" or "other" The options for k if base is "knn" "Haar", "Gaussian" or "axis" Method for estimating the test errors to choose the projection: either training error "training", leave-one-out "loo", or sample split "samplesplit" The type of cluster: "Default" uses just one core, "Fork" uses a single machine, "Socket" uses many machines. Note "Fork" and "Socket" are not supported on windows. Required only if clustertype==fork: the number of computer cores to use (note: cores > 1 not supported on Windows) Required only if clustertype==socket: the names of the machines to use e.g. c("computer1", "Computer2") (not supported on Windows) If not NULL, sets random seed for reproducible results... Optional further arguments if base = "other"

RPParallel 13 Details Makes B1 calls to RPChoose or RPChooseSS in parallel. If estmethod == "training" or "loo", then returns an n+n.test by B1 matrix, each row containing the result of a call to RPChoose. If estmethod == "samplesplit", then returns an n.val+n.test by B1 matrix, each row containing the result of a call to RPChooseSS. See Also RPChoose, RPChooseSS Train <- RPModel(1, 50, 100, 0.5) Test <- RPModel(1, 100, 100, 0.5) Out <- RPParallel(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2, B1 = 10, B2 = 10, base = "LDA", projmethod = "Haar", estmethod = "training") colmeans(out)

Index Other.classifier, 2, 3 R, 4 RPalpha, 2, 5, 9 RPChoose, 2, 6, 7 9, 12, 13 RPChooseSS, 7, 7, 12, 13 RPEnsemble (RPEnsemble-package), 2 RPEnsemble-package, 2 RPEnsembleClass, 2, 9 RPGenerate, 2, 10 RPModel, 11 RPParallel, 2, 5, 7 9, 12 14