Package 'KernelKnn'

January 16, 2018


Type Package
Title Kernel k Nearest Neighbors
Version 1.0.8
Date 2018-01-16
Author Lampros Mouselimis <mouselimislampros@gmail.com>
Maintainer Lampros Mouselimis <mouselimislampros@gmail.com>
BugReports https://github.com/mlampros/kernelknn/issues
URL https://github.com/mlampros/kernelknn
Description Extends the simple k-nearest neighbors algorithm by incorporating numerous kernel functions and a variety of distance metrics. The package takes advantage of 'RcppArmadillo' to speed up the calculation of distances between observations.
License MIT + file LICENSE
LazyData TRUE
Depends R (>= 2.10.0)
Imports Rcpp (>= 0.12.5)
LinkingTo Rcpp, RcppArmadillo
Suggests testthat, covr, knitr, rmarkdown
VignetteBuilder knitr
RoxygenNote 6.0.1
NeedsCompilation yes
Repository CRAN
Date/Publication 2018-01-16 18:24:51 UTC

R topics documented:

    Boston
    distMat.KernelKnn
    distMat.KernelKnnCV
    distMat.knn.index.dist
    ionosphere
    KernelKnn
    KernelKnnCV
    knn.index.dist
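As a quick orientation, the sketch below combines the two ideas the package exposes: a distance metric and a kernel that weights the neighbors. It is a minimal sketch using the Boston data documented later in this manual; the metric and kernel names ('manhattan', 'tricube') are among the valid choices listed under the KernelKnn topic.

library(KernelKnn)

data(Boston)

X = Boston[, -ncol(Boston)]
y = Boston[, ncol(Boston)]

# kernel-weighted 5-nearest-neighbor regression with manhattan distances
# and a tricube kernel (see the KernelKnn topic for all valid choices)
preds = KernelKnn(X, TEST_data = NULL, y = y, k = 5, method = 'manhattan',
                  weights_function = 'tricube', regression = TRUE)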

Boston                   Boston Housing Data (Regression)

Description

Housing values in suburbs of Boston.

Usage

data(Boston)

Format

A data frame with 506 instances and 14 attributes (including the response attribute, "medv"):

crim      per capita crime rate by town
zn        proportion of residential land zoned for lots over 25,000 sq.ft.
indus     proportion of non-retail business acres per town
chas      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox       nitric oxides concentration (parts per 10 million)
rm        average number of rooms per dwelling
age       proportion of owner-occupied units built prior to 1940
dis       weighted distances to five Boston employment centres
rad       index of accessibility to radial highways
tax       full-value property-tax rate per $10,000
ptratio   pupil-teacher ratio by town
black     1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat     percentage of lower status of the population
medv      median value of owner-occupied homes in $1000s

Source

This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University.

Creator: Harrison, D. and Rubinfeld, D.L., "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, 81-102, 1978.

References

https://archive.ics.uci.edu/ml/datasets/housing

Examples

data(Boston)

X = Boston[, -ncol(Boston)]
y = Boston[, ncol(Boston)]


distMat.KernelKnn        kernel k-nearest-neighbors using a distance matrix

Description

kernel k-nearest-neighbors using a distance matrix

Usage

distMat.KernelKnn(DIST_mat, TEST_indices = NULL, y, k = 5, h = 1,
  weights_function = NULL, regression = F, threads = 1, extrema = F,
  Levels = NULL, minimize = T)

Arguments

DIST_mat          a distance matrix (square matrix) having a diagonal filled with either zeros (0) or NAs (missing values)

TEST_indices      a numeric vector specifying the indices of the test data in the distance matrix (row-wise or column-wise). If the parameter equals NULL then no test data is included in the distance matrix

y                 a numeric vector (in classification the labels must be numeric from 1:Inf). It is assumed that if TEST_indices is not NULL then the length of y equals the number of rows of the train data ( nrow(DIST_mat) - length(TEST_indices) ), otherwise length(y) == nrow(DIST_mat)

k                 an integer specifying the k-nearest-neighbors

h                 the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)

weights_function  there are various ways of specifying the kernel function. See the details section.

regression        a boolean (TRUE, FALSE) specifying if regression or classification should be performed

threads           the number of cores to be used in parallel (openmp will be employed)

extrema           if TRUE then the minimum and maximum values from the k-nearest-neighbors will be removed (can be thought of as outlier removal)

Levels            a numeric vector. In case of classification the unique levels of the response variable are necessary

minimize          either TRUE or FALSE. If TRUE then lower values will be considered as relevant for the k-nearest search, otherwise higher values

Details

This function takes a distance matrix (a square matrix whose diagonal is filled with 0 or NA) as input. If the TEST_indices parameter is NULL then the predictions for the train data will be returned, whereas if the TEST_indices parameter is not NULL then the predictions for the test data will be returned.

There are three possible ways to specify the weights function. 1st option: if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option: the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, one can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option: a user-defined kernel function.

Value

a vector (if regression is TRUE), or a data frame with class probabilities (if regression is FALSE)

Author(s)

Lampros Mouselimis

Examples

data(Boston)

X = Boston[, -ncol(Boston)]
y = Boston[, ncol(Boston)]
dist_obj = dist(X)
dist_mat = as.matrix(dist_obj)

out = distMat.KernelKnn(dist_mat, TEST_indices = NULL, y, k = 5, regression = TRUE)
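As an illustration of the TEST_indices mechanism described above, the following sketch treats the last 106 rows of the Boston data as unlabeled test observations (an arbitrary split chosen only for this example), supplies a manhattan distance matrix computed externally with dist(), and passes the responses of the remaining rows as y.

data(Boston)

X = Boston[, -ncol(Boston)]
y_all = Boston[, ncol(Boston)]

idx_test = 401:506                                     # arbitrary hold-out block for illustration

dist_mat = as.matrix(dist(X, method = "manhattan"))    # any precomputed dissimilarity can be supplied

preds_test = distMat.KernelKnn(dist_mat, TEST_indices = idx_test,
                               y = y_all[-idx_test], k = 5,
                               weights_function = 'tricube', regression = TRUE)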

distMat.KernelKnnCV      cross validated kernel-k-nearest-neighbors using a distance matrix

Description

cross validated kernel-k-nearest-neighbors using a distance matrix

Usage

distMat.KernelKnnCV(DIST_mat, y, k = 5, folds = 5, h = 1,
  weights_function = NULL, regression = F, threads = 1, extrema = F,
  Levels = NULL, minimize = T, seed_num = 1)

Arguments

DIST_mat          a distance matrix (square matrix) having a diagonal filled with either zeros (0) or NAs (missing values)

y                 a numeric vector (in classification the labels must be numeric from 1:Inf)

k                 an integer specifying the k-nearest-neighbors

folds             the number of cross-validation folds (must be greater than 1)

h                 the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)

weights_function  there are various ways of specifying the kernel function. See the details section.

regression        a boolean (TRUE, FALSE) specifying if regression or classification should be performed

threads           the number of cores to be used in parallel (openmp will be employed)

extrema           if TRUE then the minimum and maximum values from the k-nearest-neighbors will be removed (can be thought of as outlier removal)

Levels            a numeric vector. In case of classification the unique levels of the response variable are necessary

minimize          either TRUE or FALSE. If TRUE then lower values will be considered as relevant for the k-nearest search, otherwise higher values

seed_num          a numeric value specifying the seed of the random number generator

Details

This function takes a number of arguments (including the number of cross-validation folds) and it returns predicted values and indices for each fold.

There are three possible ways to specify the weights function. 1st option: if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option: the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, one can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option: a user-defined kernel function.

Value

a list of length 2. The first sublist is a list of predictions (the length of the list equals the number of folds). The second sublist is a list with the indices for each fold.

Author(s)

Lampros Mouselimis

Examples

## Not run:

data(ionosphere)

X = ionosphere[, -c(2, ncol(ionosphere))]
y = as.numeric(ionosphere[, ncol(ionosphere)])
dist_obj = dist(X)
dist_mat = as.matrix(dist_obj)

out = distMat.KernelKnnCV(dist_mat, y, k = 5, folds = 3, Levels = unique(y))

## End(Not run)
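The Value section states that the returned object is a list whose first sublist holds the fold predictions and whose second sublist holds the fold indices. Assuming each fold's predictions are aligned with its indices, a per-fold mean squared error for a regression run on the Boston data can be computed as in the sketch below (the sublists are accessed positionally; no element names are assumed).

data(Boston)

X = Boston[, -ncol(Boston)]
y = Boston[, ncol(Boston)]
dist_mat = as.matrix(dist(X))

cv = distMat.KernelKnnCV(dist_mat, y, k = 5, folds = 3, regression = TRUE, seed_num = 1)

# cv[[1]] : list of per-fold predictions, cv[[2]] : list of per-fold row indices
fold_mse = sapply(seq_along(cv[[1]]), function(i) {
  mean((y[ cv[[2]][[i]] ] - cv[[1]][[i]])^2)
})
fold_mse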

distMat.knn.index.dist   indices and distances of k-nearest-neighbors using a distance matrix

Description

indices and distances of k-nearest-neighbors using a distance matrix

Usage

distMat.knn.index.dist(DIST_mat, TEST_indices = NULL, k = 5,
  threads = 1, minimize = T)

Arguments

DIST_mat          a distance matrix (square matrix) having a diagonal filled with either zeros (0) or NAs (missing values)

TEST_indices      a numeric vector specifying the indices of the test data in the distance matrix (row-wise or column-wise). If the parameter equals NULL then no test data is included in the distance matrix

k                 an integer specifying the k-nearest-neighbors

threads           the number of cores to be used in parallel (openmp will be employed)

minimize          either TRUE or FALSE. If TRUE then lower values will be considered as relevant for the k-nearest search, otherwise higher values

Details

This function takes a number of arguments and it returns the indices and distances of the k-nearest-neighbors for each observation. If TEST_indices is NULL then the indices and distances for the DIST_mat data will be returned, whereas if TEST_indices is not NULL then the indices and distances for the test data only will be returned.

Value

a list of length 2. The first sublist returns the indices and the second the distances of the k nearest neighbors for each observation. If TEST_indices is NULL the number of rows of each sublist equals the number of rows in the DIST_mat data. If TEST_indices is not NULL the number of rows of each sublist equals the length of the input TEST_indices.

Author(s)

Lampros Mouselimis

Examples

data(Boston)

X = Boston[, -ncol(Boston)]
dist_obj = dist(X)
dist_mat = as.matrix(dist_obj)

out = distMat.knn.index.dist(dist_mat, TEST_indices = NULL, k = 5)


ionosphere               Johns Hopkins University Ionosphere database (binary classification)

Description

This radar data was collected by a system in Goose Bay, Labrador. The system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. See the paper for more details. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass through the ionosphere.

Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system. Instances in this database are described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal.

Usage

data(ionosphere)

Format

A data frame with 351 instances and 35 attributes (including the class attribute, "class")

Details

Sigillito, V. G., Wing, S. P., Hutton, L. V., & Baker, K. B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10, 262-266.

They investigated using backprop and the perceptron training algorithm on this database. Using the first 200 instances for training, which were carefully split almost 50 percent positive and 50 percent negative, they found that a "linear" perceptron attained 90.7 percent, a "non-linear" perceptron attained 92 percent, and backprop an average of over 96 percent accuracy on the remaining 150 test instances, consisting of 123 "good" and only 24 "bad" instances. (There was a counting error or some mistake somewhere; there are a total of 351 rather than 350 instances in this domain.) Accuracy on "good" instances was much higher than for "bad" instances. Backprop was tested with several different numbers of hidden units (in [0,15]) and incremental results were also reported (corresponding to how well the different variants of backprop did after a periodic number of epochs).

David Aha (aha@ics.uci.edu) briefly investigated this database. He found that nearest neighbor attains an accuracy of 92.1 percent, that Ross Quinlan's C4 algorithm attains 94.0 percent (no windowing), and that IB3 (Aha & Kibler, IJCAI-1989) attained 96.7 percent (parameter settings: 70 percent and 80 percent for acceptance and dropping respectively).

Source

Donor: Vince Sigillito (vgs@aplcen.apl.jhu.edu)

Date: 1989

Source: Space Physics Group, Applied Physics Laboratory, Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20723

References

https://archive.ics.uci.edu/ml/datasets/ionosphere

Examples

data(ionosphere)

X = ionosphere[, -ncol(ionosphere)]
y = ionosphere[, ncol(ionosphere)]

KernelKnn                kernel k-nearest-neighbors

Description

This function utilizes kernel k nearest neighbors to predict new observations.

Usage

KernelKnn(data, TEST_data = NULL, y, k = 5, h = 1, method = "euclidean",
  weights_function = NULL, regression = F, transf_categ_cols = F,
  threads = 1, extrema = F, Levels = NULL)

Arguments

data              a data frame or matrix

TEST_data         a data frame or matrix (it can also be NULL)

y                 a numeric vector (in classification the labels must be numeric from 1:Inf)

k                 an integer specifying the k-nearest-neighbors

h                 the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)

method            a string specifying the method. Valid methods are 'euclidean', 'manhattan', 'chebyshev', 'canberra', 'braycurtis', 'pearson_correlation', 'simple_matching_coefficient', 'minkowski' (by default the order p of the minkowski parameter equals k), 'hamming', 'mahalanobis', 'jaccard_coefficient', 'Rao_coefficient'

weights_function  there are various ways of specifying the kernel function. See the details section.

regression        a boolean (TRUE, FALSE) specifying if regression or classification should be performed

transf_categ_cols a boolean (TRUE, FALSE) specifying if the categorical columns should be converted to numeric or to dummy variables

threads           the number of cores to be used in parallel (openmp will be employed)

extrema           if TRUE then the minimum and maximum values from the k-nearest-neighbors will be removed (can be thought of as outlier removal)

Levels            a numeric vector. In case of classification the unique levels of the response variable are necessary

Details

This function takes a number of arguments and it returns the predicted values. If TEST_data is NULL then the predictions for the train data will be returned, whereas if TEST_data is not NULL then the predictions for the TEST_data will be returned.

There are three possible ways to specify the weights function. 1st option: if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option: the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, one can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option: a user-defined kernel function.

Value

a vector (if regression is TRUE), or a data frame with class probabilities (if regression is FALSE)

Author(s)

Lampros Mouselimis

Examples

data(Boston)

X = Boston[, -ncol(Boston)]
y = Boston[, ncol(Boston)]

out = KernelKnn(X, TEST_data = NULL, y, k = 5, method = 'euclidean', regression = TRUE)
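The example above covers regression. For classification the Value section says a data frame of class probabilities is returned, so predicted labels can be recovered with max.col, as in the sketch below on the ionosphere data. The sketch assumes the probability columns follow the order of the class levels; the second, constant attribute is dropped as in the other ionosphere examples in this manual.

data(ionosphere)

X = ionosphere[, -c(2, ncol(ionosphere))]
y = as.numeric(ionosphere[, ncol(ionosphere)])       # labels coded as 1:2, as required

probs = KernelKnn(X, TEST_data = NULL, y = y, k = 5, method = 'euclidean',
                  weights_function = 'gaussian', regression = FALSE,
                  Levels = unique(y))

pred_labels = max.col(probs, ties.method = "first")  # column i assumed to map to class level i
table(observed = y, predicted = pred_labels)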

KernelKnnCV              kernel-k-nearest-neighbors using cross-validation

Description

This function performs kernel k nearest neighbors regression and classification using cross-validation.

Usage

KernelKnnCV(data, y, k = 5, folds = 5, h = 1, method = "euclidean",
  weights_function = NULL, regression = F, transf_categ_cols = F,
  threads = 1, extrema = F, Levels = NULL, seed_num = 1)

Arguments

data              a data frame or matrix

y                 a numeric vector (in classification the labels must be numeric from 1:Inf)

k                 an integer specifying the k-nearest-neighbors

folds             the number of cross-validation folds (must be greater than 1)

h                 the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)

method            a string specifying the method. Valid methods are 'euclidean', 'manhattan', 'chebyshev', 'canberra', 'braycurtis', 'pearson_correlation', 'simple_matching_coefficient', 'minkowski' (by default the order p of the minkowski parameter equals k), 'hamming', 'mahalanobis', 'jaccard_coefficient', 'Rao_coefficient'

weights_function  there are various ways of specifying the kernel function. See the details section.

regression        a boolean (TRUE, FALSE) specifying if regression or classification should be performed

transf_categ_cols a boolean (TRUE, FALSE) specifying if the categorical columns should be converted to numeric or to dummy variables

threads           the number of cores to be used in parallel (openmp will be employed)

extrema           if TRUE then the minimum and maximum values from the k-nearest-neighbors will be removed (can be thought of as outlier removal)

Levels            a numeric vector. In case of classification the unique levels of the response variable are necessary

seed_num          a numeric value specifying the seed of the random number generator

Details

This function takes a number of arguments (including the number of cross-validation folds) and it returns predicted values and indices for each fold.

There are three possible ways to specify the weights function. 1st option: if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option: the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, one can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option: a user-defined kernel function.

Value

a list of length 2. The first sublist is a list of predictions (the length of the list equals the number of folds). The second sublist is a list with the indices for each fold.

Author(s)

Lampros Mouselimis

Examples

## Not run:

data(ionosphere)

X = ionosphere[, -c(2, ncol(ionosphere))]
y = as.numeric(ionosphere[, ncol(ionosphere)])

out = KernelKnnCV(X, y, k = 5, folds = 3, regression = FALSE, Levels = unique(y))

## End(Not run)
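Building on the example above, the sketch below turns the cross-validated output into per-fold accuracies. The two sublists are accessed positionally, as described in the Value section; it is assumed that each fold's class-probability rows are aligned with that fold's indices and that the probability columns follow the order of the class levels.

## Not run:

cv = KernelKnnCV(X, y, k = 5, folds = 3, regression = FALSE, Levels = unique(y))

fold_acc = sapply(seq_along(cv[[1]]), function(i) {
  pred = max.col(cv[[1]][[i]], ties.method = "first")   # most probable class per row
  mean(pred == y[ cv[[2]][[i]] ])                        # compare with the held-out labels
})
fold_acc

## End(Not run)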

knn.index.dist           indices and distances of k-nearest-neighbors

Description

This function returns the k nearest indices and distances of each observation.

Usage

knn.index.dist(data, TEST_data = NULL, k = 5, method = "euclidean",
  transf_categ_cols = F, threads = 1)

Arguments

data              a data frame or matrix

TEST_data         a data frame or matrix (it can also be NULL)

k                 an integer specifying the k-nearest-neighbors

method            a string specifying the method. Valid methods are 'euclidean', 'manhattan', 'chebyshev', 'canberra', 'braycurtis', 'pearson_correlation', 'simple_matching_coefficient', 'minkowski' (by default the order p of the minkowski parameter equals k), 'hamming', 'mahalanobis', 'jaccard_coefficient', 'Rao_coefficient'

transf_categ_cols a boolean (TRUE, FALSE) specifying if the categorical columns should be converted to numeric or to dummy variables

threads           the number of cores to be used in parallel (openmp will be employed)

Details

This function takes a number of arguments and it returns the indices and distances of the k-nearest-neighbors for each observation. If TEST_data is NULL then the indices and distances for the train data will be returned, whereas if TEST_data is not NULL then the indices and distances for the TEST_data will be returned.

Value

a list of length 2. The first sublist returns the indices and the second the distances of the k nearest neighbors for each observation. If TEST_data is NULL the number of rows of each sublist equals the number of rows in the train data. If TEST_data is not NULL the number of rows of each sublist equals the number of rows in the TEST_data.

Author(s)

Lampros Mouselimis

Examples

data(Boston)

X = Boston[, -ncol(Boston)]

out = knn.index.dist(X, TEST_data = NULL, k = 4, method = 'euclidean', threads = 1)
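To make the structure of the returned list concrete, the sketch below reuses the result of the example above. It accesses the indices and distances positionally (per the Value section, and assuming they come back as matrices with one row per observation and k columns), builds a simple unweighted average-of-neighbors estimate of medv, and inspects the neighbor distances of the first observation. Whether an observation counts itself among its own neighbors is not specified here, so the estimate is purely illustrative.

y = Boston[, ncol(Boston)]

nn_idx  = out[[1]]   # assumed: matrix of neighbor indices (one row per observation, k columns)
nn_dist = out[[2]]   # assumed: matrix of neighbor distances, same shape

# unweighted 4-nearest-neighbor estimate of medv for every observation
medv_hat = rowMeans(matrix(y[nn_idx], nrow = nrow(nn_idx)))

nn_dist[1, ]         # distances of the 4 nearest neighbors of the first observation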

Index

Topic datasets
    Boston
    ionosphere

Boston
distMat.KernelKnn
distMat.KernelKnnCV
distMat.knn.index.dist
ionosphere
KernelKnn
KernelKnnCV
knn.index.dist