arxiv: v2 [cs.lg] 13 Jun 2016

Size: px

Start display at page:

Download "arxiv: v2 [cs.lg] 13 Jun 2016"

Meredith Powell
5 years ago
Views:

1 Eicient Estimation o or the Nearest Neighbors Class o Methods - unpublished wor rom 20 - Alesander Lodwich, Faisal Shaait and Thomas Breuel arxiv: v2 [cs.lg] 3 Jun 206 contact ino: alesander[at]lodwich.net Abstract. The Nearest Neighbors (NN) method has received much attention in the past decades, where some theoretical bounds on its perormance were identiied and where practical optimizations were proposed or maing it wor airly well in high dimensional spaces and on large datasets. From countless experiments o the past it became widely accepted that the value o has a signiicant impact on the perormance o this method. However, the eicient optimization o this parameter has not received so much attention in literature. Today, the most common approach is to cross-validate or bootstrap this value or all values in question. This approach orces distances to be recomputed many times, even i eicient methods are used. Hence, estimating the optimal can become expensive even on modern systems. Frequently, this circumstance leads to a sparse manual search o. In this paper we want to point out that a systematic and thorough estimation o the parameter can be perormed eiciently. The discussed approach relies on large matrices, but we want to argue, that in practice a higher space complexity is oten much less o a problem than repetitive distance computations. Keywords: nearest neighbor, optimization, benchmaring Introduction Since the introduction o the Nearest Neighbor (NN) method by Fix and Hodges in 95 [] a lot o dierent variants o it have appeared in order to mae it suitable to dierent scenarios. The most notable improvements were done in terms o adaptive distance metrics [2][3][4], ast access via space partitioning (paced R* trees [5], d-trees[6], X-trees[7], SPY-TEC [8]), nowledge base (prototype) pruning ([9],[0]) or classiication based on sensitive distributed data ([3],[4],[5]). An overview over the state-o-the-art in nearest neighbor techniques is given in []. The continuing richness o investigation wor into nearest neighbor can be explained with the omnipresence o CBR (Case Based Reasoning) type o problems [2] or just rom the practical point o view o its massive parallelizability or simply populartiy. Ater all nearest neighbor is conceptually easy to understand - many students get to learn the nearest neighbor as the irst classiier.

2 All o the beorehand mentioned advances evolve around the question o how to optimize the NN s distance measure or retrieving. The method s apparent laziness might be the reason why a ast preoptimization o has not been paid a lot attention to. The proper choice o is an important actor or achieving maximum perormance o a NN. However, as we will show, conventional optimization via cross-validation or bootstrapping is slow and promises potential or being sped up. Thereore, we will devote this paper to the concept o ast optimization. There is wor addressing this issue in an alternative way by introducing incremental NN classiiers based on dierent types o trees. This class o nearest neighbors attempts to eliminate the inluence o by choosing the right ad hoc. The rationale behind this method is that or classiication tass the exact majority o a speciic class label is not necessarily interesting. It is only interesting when nearest neighbor is used as a density estimator. In other cases it is generally enough to eel sae about which class label rules the nearest set. This class o nearest neighbor starts by polling a minimal amount o nearest neighbors. Then it analyzes the labels and i the retrieved collection o labels is indecisive it will poll more nearest neighbors until it considers the collection decisive enough. Naturally, the method does not scale to small values o, because e.g. a = will never be indecisive. This is a problem because we now rom experiments that small are oten optimal. Incremental nearest neighbors have their strength in very large databases where typical queries do not need to compute all relative distances. During a lietime o a large database some distances might not be computed at all. The lazy distance computation mae incremental nearest neighbor methods ideal candidates or real time tass operating on large volatile data. However, in a cross validation setup which needs to compute all distances incremental NNs cannot play out their strengths. Because o the restrictions and intended use we exempted incremental methods rom investigation in this paper. We organize this paper in three parts. In the irst part we will study the options to mae the estimation o as ast as possible, in the second part we will experimentally compare the result with the conventional approach and in the last part we will draw a conclusion. 2 Stating the Problem The NN is commonly considered a classiier rom the area o supervised learning theory. In this theory there exists a training unction T that delivers a model m based on a set o options o, a matrix o example values V and a vector o labels l o respective size (m = T (o, V, l)). In turn, the model m is used in a classiication unction C that is presented with a matrix o new examples V and its tas is to deliver a vector o new labels l (l = C(m, V )). T and C must be related but there is no restriction on what the model can be. It can be a set o complex items, a matrix, a vector or just a single value, indeed.

3 From the perspective o a human the purpose o a model is to mae predictions about the uture. In order to be able to do this the human brain requires a simpliication o the world. Only with the simpliication o the world to a reduced number o variables and rules it can compute uture states aster than they occur. In case o NN the model m consists o the data and o the smoothing parameter. This s, that the unction T is an identity unction between model and parameters. This conlicts with the common notion o a model because the production o models commonly implies reduction. However, the collected data already are a reduction o the world! From a practical point o view they can be considered as representative model states o the world and the data in the model is used to predict the class variable o a new vector beore it is actually recorded. This its perectly well with the original notion o models. However, the model is only ixed, when is ixed. According to the ramewor, ixing the model (getting the right value or ) is the job o the unction T. Many training techniques in machine learning use optimization strategies developed or real numbers and open parameter spaces. This is not suitable or as it is an integer value and has nown let and right limits: an ideal candidate or ull search. Most ramewors or pattern recognition oer macro optimization or the remaing model parameters that express themsel in the initial training options o. The structural compatibility o the NN with the pattern recognition ramewors macro optimization unctions seduces the users very oten to macro optimize. The consequence o this is that NN must compute distances repetitively as it cannot assume that speciic vectors will simply exchange their role between training and testing in the uture. In case o NN this is exceptionally regrettable. What does macro optimization or the computational complexity? Here we assume - but without loose o generality - that the dataset V is o size n (nrows in matrix V ) and can be exactly divided into equally sized partitions ready to be rearranged into dierent train and test setups. Since everything is being recomputed the computational complexity or this ind o cross-validation o or a brute NN is O (ˆ n ). 2 ˆ is the size o the tested range o and since we are considering ull search we accept that ˆ depends on the size n o the dataset and the number o olds. This s that the ˆ = n. The scan o nearest sets yields a partial complexity o O( n 2 ). Hence, or ull ( ( ) 2 search the total complexity is O n 3 + n ). 2 In order to reduce this high complexity it is necessary to optimize NN within the train unction T as it has all necessary inormation about the relationships among the examples and the labels. This s that T (o = {}, V, l) should become T (o = {i}, V, l) where i is a vector o partition indices o the ind (0, 0, 0,...,,,,...,,,...) T. Now, the train unction T can utilize the act that no new data will arrive during the training and all possible distance requests can be computed in advance. Distances within the same partition need not be computed as they will

4 be never requested. These ields can be set to ininity (alternatively they can be illed up with the largest value ound in the matrix + ). The lower triangle is symmetrical to the upper triangle o the matrix because vectors between two points have the same norm. The distance matrix D has a structure as shown in igure. The size o D is n n 2. By D(column, row) we the component distance and by D(column, row) 2 we the associated label. Alone this redesign causes ) the complexity o the distance computations to get reduced to O. However this beneit is achieved at the expense o ( n2 2 higher memory use. The brute NN has a space complexity o O(n), now the space complexity has risen to O(n 2 ). The next step is to sort the vectors horizontally according to their distance. Although collecting the best solutions would be aster or a single run it s or a range o that you eectively obtain the insertion sort. Since there exist aster sorting algorithms we choose to sort but by using a dierent algorithm. The astest algorithm or doing so is the quic sort. Its average complexity is O (n log(n)). In the worst case scenario the sorting complexity o this method is O ( n 2). In that case n rows will be sorted with O ( n 2). This s that the worst time complexity so ar is O ( ) O n2 2 + n2 log(n). ( n2 2 + n3 ) and average case is Distance Matrix n test train set set Fold ininite values necessary computations Fold 2 Fold 3 Fold 4 n mirrored values Fold 5 { {{ { { Data Segments (a) Distance matrix and its computation requiring segments or the cross validation o (b) Sorted distance matrix. Values are being sorted big values to the right Fig. : The distance matrix consists o tuples o data and label ((d, l)). It can be sorted horizontally without loosing the relationship between vector pair and ground truth.

5 In order to obtain nearest neighbors or each vector indexed by the row a counting matrix M := N n s is initialized with zeros. s is the number o symbols or classes. For each row in D and M and or the columns =.. n in D the counters or the speciic class label is increased. More precisely, or every row r =..n and or every tested the counters ( M (D(, r) 2, r) are updated by. The complexity o this operation is O n ). 2 For the overall method ( 3 this adds up to O 2 n ). 2 In parallel the level o correct classiication must be computed because ater every modiication o M the state or the smaller is lost. Thereore a matrix A := N n or recording the number o correct classications is required. How is this number computed? At every round o the nearest neighbor candidate computations M contains in each o its rows a vector that tells how many labels o speciic ind are in the nearest neighbors set. The classiication label is l r = argmax M r. The complexity or this operation is O(n 2 s). s This simple method is ambiguous by nature, as there can be many labels that are represented by the same amount o vectors in a nearby set. Computer implementations preer to return the symbol with the smalest coding. However, it is possible to have a shadow matrix S := R n s that is the sum o the distances observed or each class label in the set so ar. The rule or computing S is the same as or M with the dierence that instead o adding ones to the matrix you add distances. When symbol requency is ambiguous (argmax returns more than one value) it is possible to use S to ind which samples are closer overall. Because o the speciic interest into ast optimization the simple argmax processing is used. Now, every l r is compared or equality with l r (ground truth) and the binary result is added to A(, i r ). The argmax A will return best. is obtained by averaging. ( Considering all parts o the ) algorithm together the overall complexity is O n 2 + n 2 log(n) + n 2 s O ( n 2 log(n) ) s Time required or validating the optimal time 000s 00s New KD Brute BD 0s number o cross validation olds Fig. 2: Experimentally established relationship between the number o olds and the algorithm speed.

6 3 Experiments The method or ast computation (AutoNN) was tested against three other algorithms rom the ANN library.[6]: brute, d-tree and bd-tree NN with deault settings. The AutoNN and its competitors perormed a complete cross validation run on the ad, diabetes, gene, glass, heart, heartc, horse, ionosphere, iris, mushrooms, soybean, STATLOG australian, STATLOG german, STATLOG heart, STATLOG SAT, STATLOG segment STATLOG shuttle, STATLOG vehicle, thyroid, waveorm and wine datasets with 3, 5, 0 and 20 olds. The goal o the experiment was the measurement o the time required to complete the ull course o testing dierent. The data was separated into stratiied partitions which were used in dierent conigurations in order to obtain a training and a testing set. The AutoKNN computes the classiication results or all while the other algorithms are bound to use a logarithmic search. By logarithmic search the ollowing schema is t: {, 2, 3, 4, 5, 6, 7, 8, 0, 00, 200,..., 000,..., n}. This schema is practically motivated and rational under the assumption that the inluence o additional labels on the result diminishes with higher values o. Practical consideration is primarily test time. Example: while AutoNN required 5,35s or a complete scan based on the ad dataset, on same data exact brute NN needed 353,7s in logarithmic mode and 9595,7s in ull mode. The use o the logarithmic mode maes results with exactly the same values impossible. However, the dierences in resulting and thus in accuracy were absolutely negligible so that the results are directly comparable nonetheless. We added the experiment times or all databases up to a total or each cross validation size. The results are shown in the igure and the table under 2. Time measurements were perormed on a AMD Phenom II 965 with 8GB o RAM with a Linux ernel. The algorithms are implemented in C/C++ and were compiled with gcc with O3 option. Only core algorithm operation was measured and all time or additional I/O was ignored. For best comparability, ANN library sources were statically included. 4 Discussion and Conclusion The Nearest Neighbor approach is considered user riendly and is requently used or data mining, classiication and regression tass. It is embedded into many automatic environments that mae use o NN s lexibility. Although NN has been used, analyzed and advanced or almost six decades a repeating question can not be answered by current literature: What is the astest way to estimate the right value or and what are the expenses or doing so. The approach chosen here is to move the esimation away rom the meta ramewor right into the training unction T. The advantage o this is that additional inormation about the data can be made. This additional inormation allows to precompute the distances among all vectors without waste and to reuse them numerous times. From this design change which is nown to practitioners

7 but not( discussed in literature a reduction in time complexity can be observed ( ) ) 2 ( ) rom O n 3 + n 2 3 to O 2 n 2 + n 2 log(n) + n 2 s in average case. The experiments show, that this has signiicant impact on the speed o the estimation tas. The comparison between d-tree NN and the proposed approach proves moreover that having a better time complexity saves practically more time than an eicient distance measure or this tas. The cost o this improvement is a higher space complexity (now O(n 2 )). In order to esimtate the practical impact o this complexity exchange we studied the contents o the UCI repository [7]. The UCI should be a reasonable crossover o the problems people ace in real lie. Out o 62 datasets we ound that 90% o them have less than 50K examples, 80% o them have less than 0K examples and hal o the UCI s datasets has less than 000 examples (or exact distribution see Fig. 3). These sizes can be easily handled on higher class commodity computers. This leads to the conclusion that turning in space complexity or time complexity is a good choice most o the time. Future implementations should oer an integrated searching. The results also show that the so ound values or can be transered not only to other exact NN but also to approximate NN woring on d-tree and bd-tree models. Sample Sizes o Datasets in the UCI Repository No. o Examples,0 0 8,0 0 7,0 0 6,0 0 5,0 0 4,0 0 3,0 0 2,0 0 0,0% 20,0% 40,0% 60,0% 80,0% 00,0% (62 on Oct., 5th [85 Entries]) Fraction o Datasets in the UCI Repository Fig. 3: Dataset sizes in the UCI repository

8 5 Acnowledgments This wor has been made possible through the unding o the PaREn (Pattern Recognition and Engineering [8]) project by the BMBF (Federal Ministry o Education and Research, Germany). Reerences. Fix, E., Hodges, J.L. Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School o Aviation Medicine, Randolph Field, Texas, Carlotta Domeniconi, Dimitrios Gunopulos, Jing Peng. Adaptive Metric nearest Neighbor Classiication. CVPR, Vol., pp.57 (2000) 3. Jing Peng, Douglas R. Heisteramp, H. K. Dai. Adaptive Kernel Metric Nearest Neighbor Classiication. ICPR, Vol. 3, pp (2002) 4. T. Deselaers, R. Paredes, E. Vidal, H. Ney. Learning Weighted Distances or Relevance Feedbac in Image Retrieval. ICPR 2008, Tampa, Florida, USA (08/2/2008) 5. I. Kamel, C. Faloutsos. On Pacing R-trees. Proceedings o the 2nd Conerence on Inormation and Knowledge Management (CIKM), Washington DC (993) 6. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest. Introduction to Algorithms. ch. 0, MIT Press and McGraw-Hill (2009) 7. S. Berchtold, D. Keim, H.-P. Kriegel. The X-tree: An index structure or highdimensional data. In Proceedings o 22th International Conerence on Very Large Databases (VLDB96), pp 2839, Morgan Kaumann (996) 8. Dong-Ho Lee, Hyoung-Joo Kim. An Eicient Technique or Nearest-Neighbor Query Processing on the SPY-TEC. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp , IEEE Educational Activities Department,Piscataway, NJ, USA (2003) 9. Jose Salvador-Sanchez, Filiberto Pla, Francesco J. Ferri. Prototype Selection or the Nearest-Neighbor Rule Through Proximity Graphs. PRL, Vol. 8, No. 6, pp (997) 0. U. Lipowezy. Selection o the optimal prototype subset or -NN classiication PRL, Vol. 9, No. 0, pp (998). Keith Price Bibliography - Nearest Neighbor Literature Overview 2. David W. Aha. The Omnipresence o Case-Based Reasoning in Science and Application. Journal on Knowledge-Based Systems, Vol, pp (998) 3. Young, B., Bhatnagar, R.: Secure K-NN Algorithm or Distributed Databases. University o Cincinnati (2006) 4. Shanec, M., Yongdae, K., Kumar, V.: Privacy Preserving Nearest Neighbor Search. Dept. o Computer Science Minneapolis (2006) 5. Kantarcoglu, M., Cliton, C.: Privately Computing a Distributed -nn Classiier. LNCS, Vol. 3202, (2004) 6. Mount, D. M.: Approximate Nearest Neighbor Library., Department o Computer Science and Institute or Advanced Computer Studies, University o Maryland (200) 7. UCI Machine Learning Repository: 8. Pattern Recognition and Engineering (PaREn), DFKI, BMBF Project: ( )

9 9. Janic V. Frasch and Alesander Lodwich and Faisal Shaait and Thomas M. Breuel A Bayes-true data generator or evaluation o supervised and unsupervised learning methods. Pattern Recognition Letters, volume 32:, p , 20, doi: 6/j.patrec APPENDIX A Inluence o on the NN Perormance The ollowing diagrams are results rom NN based on natural and synthetic datasets. Synthetic datasets were obtained using WGKS [9] The standard deviation was estimated based on a 0x cross-validation. The diagrams are non linear. Sections o little change are compressed, hence x-axis are discontinuous.

10 NN Perormance Heart Database (0x-validation) NNPerormance Glass Database (0x-validation) (acc) (acc) NN Perormance Wine Dataset (0x-validation) NN Perormance Waveorm Dataset (0x-validation) (acc)

11 .00 NN Perormance Thyroid Database (0x-validation) NN Perormance Statlog's Vehicle Database (0x-validation) NN Perormance Statlog's Shuttle Database (0x-validation).20 NN Perormance Statlog's Segment Database (0x-validation)

12 .00 NN Perormance Statlog's SAT Database (0x-validation) NN Perormance Statlog's German Finance Database (0x-validation) NN Perormance Statlog's Australian Financial Dataset (0x-validation) NN Perormance Soybean Dataset (0x-validation)

13 NN Perormance NN Perormance Mushrooms Database (0x-validation) Random 6.67% o MNIST60 (0x-validation) NN Perormance Iris Dataset (0x-validation) NN Perormance MNIST0K (0x-validation)

14 .00 NN Perormance Ionosphere Dataset (0x-validation) NN Perormance Horse Dataset (0x-validation) NN Perormance NN Perormance Heart C Dataset (0x-validation) Gene Dataset (0x-validation)

15 NN Perormance Diabetes Dataset (0x-validation) NN Perormance Ad Dataset (0x-validation) NN Perormance WGKS 2D 20% (0x-validation) NN Perormance WGKS 2D 0% (0x-validation)

16 NN Perormance NN Perormance WGKS 4D 30% (0x-validation) WGKS 4D 20% (0x-validation) NN Perormance WGKS 4D 0% (0x-validation) NN Perormance WGKS 2D 30% (0x-validation)

Individualized Error Estimation for Classification and Regression Models

Individualized Error Estimation for Classification and Regression Models Krisztian Buza, Alexandros Nanopoulos, Lars Schmidt-Thieme Abstract Estimating the error of classification and regression models