ONLINE SEMI-SUPERVISED GROWING NEURAL GAS


International Journal of Neural Systems, Vol. 0, No. 0 (April, 2000) © World Scientific Publishing Company

ONLINE SEMI-SUPERVISED GROWING NEURAL GAS

OLIVER BEYER, Semantic Computing Group, CITEC, Bielefeld University, Bielefeld, Germany, obeyer@cit-ec.uni-bielefeld.de
PHILIPP CIMIANO, Semantic Computing Group, CITEC, Bielefeld University, Bielefeld, Germany, cimiano@cit-ec.uni-bielefeld.de

Received (to be inserted by Publisher)

Abstract

In this paper we introduce Online Semi-supervised Growing Neural Gas (OSSGNG), a novel online semi-supervised classification approach based on Growing Neural Gas (GNG). Existing semi-supervised classification approaches based on GNG require that the training data is explicitly stored, as the labeling is performed a posteriori, after the training phase. As our main contribution, we present an approach that relies on online labeling and prediction functions to process labeled and unlabeled data uniformly and in an online fashion, without the need to store any of the training examples explicitly. We show, on the one hand, that using on-the-fly labeling strategies does not significantly deteriorate the performance of classifiers based on GNG, while circumventing the need to explicitly store training examples. Armed with this result, we then present a semi-supervised extension of GNG that relies on the above mentioned online labeling functions to label unlabeled examples and incorporate them into the model on-the-fly. As an important result, we show that OSSGNG performs as well as previous semi-supervised extensions of GNG which rely on offline labeling strategies (SSGNG). We also show that OSSGNG compares favorably to other state-of-the-art semi-supervised learning approaches on standard benchmarking datasets.

1. Introduction

Traditionally, in machine learning one distinguishes between supervised and unsupervised learning approaches. Supervised approaches assume that the task is to learn a function f : X → Y assigning data points from some space X to a finite set of given categories Y. Data labeled with these categories is used to learn a model for this function that minimizes the empirical risk of making an erroneous assignment. Unsupervised machine learning approaches, in particular clustering, rely on unlabeled data as well as on a given similarity metric to find natural groups in data. Techniques from semi-supervised learning (SSL) (see Chapelle et al. 6) have blurred the distinction between these two learning paradigms and have become especially interesting as more and more data becomes available of which, however, only a small fraction can be manually labeled due to the high cost incurred. On the one hand, semi-supervised learning has been shown to improve the performance of supervised classification approaches by factoring in unlabeled data (see Nigam et al. 27). On the other hand, semi-supervised learning has also been shown to improve clustering by factoring in labeled data that can be used as constraints to guide the search for an optimal clustering of the data (see Wagstaff et al. 35).

Approaches based on topological maps, e.g. Self-Organizing Maps (SOMs) 18 or Growing Neural Gas (GNG) 12, have been successfully applied to clustering problems by representing a high-dimensional input space in a low-dimensional and interpretable feature map. Growing Neural Gas, for example, when used with unlabeled data, will learn natural categories and thus the inherent topology of the data in an incremental fashion. GNG features the advantages of unsupervised approaches, which can learn categories for which no labeled data is given. Approaches such as GNG are ideal in life-long learning settings where neither the categories can be assumed to be fixed a priori nor labels can be assumed to be available for all categories. With appropriate extensions, topological maps such as SOMs or GNG can, however, also be trained with labeled data and thus be used in classification tasks 20,21,37. This requires appropriate labeling functions that assign labels to neurons of the network as well as prediction functions that assign labels to unseen examples. The labels themselves can be used to merely label neurons, thus not influencing the clustering itself, or be exploited by some discriminative learning process in order to minimize the risk of assigning a wrong label, as for example in Learning Vector Quantization (LVQ) 19.

In the context of GNG, mainly offline labeling techniques have been proposed so far, i.e. the labeling is performed in batch mode after all the training data has been processed. This requires the explicit storage of training data and thus runs counter to the online nature of Growing Neural Gas. Training in batch mode has also been argued to be disadvantageous in some scenarios. First, in many applications we find massive streams of data that can not be stored on standard hardware anymore (see Gaber et al. 13). Online clustering algorithms (see Barbakh et al. 2) thus become especially relevant in the context of stream data mining, as data can not be processed in batch mode or in several passes and the model needs to be updated on-the-fly instead. Second, it has even been shown that online learning can render the training more efficient by using the model to generate new training examples which are closer to the desired solution, thus allowing a more efficient exploration of the parameter search space (see Rolf et al. 29). Further, there are scenarios (e.g. interactive learning) and tracking applications where batch learning is simply not suitable (see Steil et al. 30 and Mandic et al. 24).

In this paper, we investigate Growing Neural Gas as a classification algorithm. In particular, we propose an extension of the standard Growing Neural Gas algorithm to an online semi-supervised classifier. On the one hand, we propose an extension of Growing Neural Gas into an online classifier by introducing a step that updates the labels of the winner neuron after each data point has been processed. This circumvents the need to store all the labeled training data explicitly. Second, we propose a further extension of the Online Growing Neural Gas classifier into a semi-supervised classifier which leverages unlabeled data by labeling unlabeled data points on-the-fly. Our contributions are in particular the following:
1. We propose an extension of the original Growing Neural Gas algorithm to an online classifier by an additional step that uses an appropriate labeling function to assign or recompute the label of a winner neuron after each seen data point. We also extend the approach with an appropriate prediction function that allows us to assign labels to unseen data.

2. We show that the proposed extension does not yield significantly worse results compared to an offline version of GNG in which the labeling is computed in batch mode once the training phase has ended.

3. We investigate several candidate functions that can be used to realize the labeling and prediction functions, showing that a memory-free version performs as well as candidates that partially store examples or frequency counts.

4. We extend the proposed Online Growing Neural Gas classifier to a semi-supervised algorithm that predicts labels for unlabeled examples and incorporates these labeled examples into the model on-the-fly.

5. We show that this online and semi-supervised variant of Growing Neural Gas (OSSGNG) outperforms the offline variant presented by Zaki et al. 37 (SSGNG) on a number of datasets. We also compare the performance of our approach to state-of-the-art semi-supervised classification algorithms, showing that our approach can compete with such approaches, outperforming them on a number of datasets.

The plan for the paper is as follows: in Section 2 we briefly introduce the standard Growing Neural Gas algorithm. In Section 3, we present our extension of Growing Neural Gas into an online classification approach. In particular, we discuss several alternatives of how the prediction and labeling functions can be implemented and present empirical results comparing the performance of these functions. In Section 4, we show how the extension proposed in Section 3 can be used in a semi-supervised setting and evaluate the approach on a number of benchmarking datasets for semi-supervised learning, showing that our approach can compete with state-of-the-art semi-supervised learning approaches. Before concluding, we discuss related work in Section 5.

2. Growing Neural Gas

Growing Neural Gas (GNG) 12 is an incremental self-organizing approach which is capable of representing a high-dimensional input space in a low-dimensional feature map. It belongs to the family of topological maps such as Self-Organizing Maps (SOM) 18 or Neural Gas (NG) 25. Typically, SOM and GNG are used for visualization tasks in a number of domains 15,20,37,10, as the neurons, which represent prototypes, are easy to understand and interpret. Like SOM and NG, GNG is a Competitive Learning approach based on the winner-takes-all (WTA) principle. This means that in every iteration step, the algorithm determines the neuron which is closest to the presented input stimulus. Although the main idea behind SOM, NG and GNG is similar, there are some important differences which set GNG apart. First of all, Growing Neural Gas combines the ideas of Growing Cell Structures (GCS) 11 and Competitive Hebbian Learning (CHL) 26. It shares the growing character of GCS in the sense that, starting from a small network, neurons are successively inserted into the network and can also be removed if they are identified as being superfluous. This is an advantage compared to SOM and NG, as there is no need to fix the network size in advance. Inspired by CHL, GNG also integrates temporal synaptic links between neurons, which are introduced between a winner neuron and a second winner neuron. These links are temporal in the sense that they are subject to aging during the iteration steps of the algorithm and are removed when they get too old. The main difference compared to SOM and NG is the fact that the adaptation strength of the network is constant over time and fixed by the two parameters e_b and e_n, i.e. the adaptation strength for the winner neuron and its neighbors, respectively. Furthermore, only the best-matching neuron and its topological neighbors are adapted, such that there is no global optimization of the network.

In the following we briefly describe the individual steps of the GNG algorithm as proposed by Fritzke 12. The algorithm is depicted in Figure 1 (modulo step 4, which is part of our GNG extension presented in Section 3.2). In the first step (1), the algorithm starts with two neurons, randomly placed in the feature space.
(2) The first stimulus x ∈ R^n of the input space (first training example) is presented to the network. (3) The two neurons n_1 and n_2 which minimize the Euclidean distance to x are identified as first and second winner. (5) The age of all edges that connect n_1 to other neurons is increased by 1. In step (6), the local error variable error(n_1) of n_1 is updated. This error variable will be used later in order to set the location for a newly inserted node. In step (7), n_1 and its topological neighbors are adapted towards x by fractions e_b and e_n, respectively. (8) A new connection between n_1 and n_2 is created and the age of the edge is set to 0. (9) All edges with an age greater than a_max as well as all neurons without any connecting edge are removed. (10) Depending on the iteration and the parameter λ, a new node r is inserted into the network. It is inserted half-way between the neuron q with the highest local error and its topological neighbor f having the largest error among all neighbors of q.

In addition, the connection between q and f is removed and both neurons are connected to r. In step (11), the error variables of all nodes are decreased by a factor β. (12) The algorithm stops if the stop criterion is met, i.e., the maximal network size or some other performance measure has been reached.

3. Classification with GNG

As already mentioned, one typical approach to turn GNG into a classifier is to extend the algorithm with appropriate labeling and prediction functions that assign labels to neurons as well as labels to unseen examples. In this section, we first present a set of offline labeling functions that have been proposed in the context of other topological map approaches, but have not been systematically investigated before. Further, we introduce a set of online labeling approaches and prediction approaches that are based on linkage strategies used in cluster analysis. We present experimental results on three datasets comparing the performance of offline and online labeling strategies, coming to the conclusion that online labeling strategies do not yield significantly worse results compared to the offline labeling strategies. As using offline strategies for labeling neurons requires the storage of training examples, the application of online strategies is preferable, particularly as they produce comparable results, as conveyed by our experimental results.

3.1. Offline labeling methods

In order to apply GNG to a classification task, we require two functions: i) a neuron labeling function l : N → C, where C is the set of class labels, and ii) a prediction function pred : X → C, where X is the input space. We analyze the following offline neuron labeling functions as proposed by Lau et al. 22. They are offline in the sense that they assume that the pairs (x, l_x) with x ∈ X_train ⊆ X and l_x ∈ C seen in the training phase are explicitly stored:

Minimal-distance method (min-dist): According to this strategy, neuron n_i adopts the label l_x of the closest data point x ∈ X_train:

l_{min-dist}(n_i) = l_{x^*},  where  x^* = \arg\min_{x \in X_{train}} \| n_i - x \|^2

Average-distance method (avg-dist): According to this strategy, we assign to neuron n_i the label of the category c that minimizes the average distance to all data points labeled with category c:

l_{avg-dist}(n_i) = \arg\min_c \frac{1}{|X(c)|} \sum_{k=1}^{|X(c)|} \| n_i - x_k \|^2

where X(c) = \{x \in X_{train} \mid l_x = c\} is the set of all examples labeled with c.

Majority method (majority): According to this strategy, we label neuron n_i with the category c having the highest overlap (in terms of data points belonging to category c) with the data points in the Voronoi cell of n_i. We denote the set of data points in the Voronoi cell of n_i within the topological map as v(n_i) = \{x \in X_{train} \mid \forall n_j, j \neq i : \| n_j - x \|^2 \geq \| n_i - x \|^2\}. The majority strategy can be formalized as follows:

l_{majority}(n_i) = \arg\max_c |X(c) \cap v(n_i)|

In addition to the neuron labeling strategy, we need to define prediction functions that assign labels to unseen examples. These prediction functions are inspired by linkage strategies typically used in cluster analysis 17,1,33:

Single-linkage: According to this prediction strategy, a new data point x_new is labeled with the category c of the winner neuron n that minimizes the distance to this new example:

pred_{single}(x_{new}) = \arg\min_c ( \min_{n \in N(c)} \| n - x_{new} \|^2 )

where N(c) = \{n \in N \mid l(n) = c\} is the set of all neurons labeled with category c according to one of the above mentioned neuron labeling functions.
According to this strategy, a data point thus adopts the label of the winner neuron. In combination with the majority strategy, this is an often used a posteriori labeling strategy.
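To make the offline labeling functions and the single-linkage prediction above concrete, the following Python sketch shows one possible realization. It is only an illustration under our own naming conventions (label_min_dist, label_avg_dist, label_majority and predict_single_linkage are hypothetical helper names), not the implementation used in our experiments; class labels are assumed to be non-negative integers.

```python
import numpy as np

def label_min_dist(neuron, X_train, y_train):
    """min-dist: adopt the label of the closest stored training point."""
    d = np.linalg.norm(X_train - neuron, axis=1)
    return y_train[np.argmin(d)]

def label_avg_dist(neuron, X_train, y_train):
    """avg-dist: adopt the category with minimal average distance to the neuron."""
    best_c, best_avg = None, np.inf
    for c in np.unique(y_train):
        avg = np.mean(np.linalg.norm(X_train[y_train == c] - neuron, axis=1))
        if avg < best_avg:
            best_c, best_avg = c, avg
    return best_c

def label_majority(neurons, X_train, y_train):
    """majority: label each neuron by the most frequent class in its Voronoi cell."""
    winners = np.argmin(
        np.linalg.norm(X_train[:, None, :] - neurons[None, :, :], axis=2), axis=1)
    labels = {}
    for i in range(len(neurons)):
        cell = y_train[winners == i]
        labels[i] = None if len(cell) == 0 else int(np.bincount(cell).argmax())
    return labels

def predict_single_linkage(x_new, neurons, neuron_labels):
    """single-linkage: adopt the label of the overall closest labeled neuron."""
    d = np.linalg.norm(neurons - x_new, axis=1)
    return neuron_labels[int(np.argmin(d))]
```

In this reading, applying label_majority to all neurons after training and then predicting with predict_single_linkage corresponds to the often used a posteriori labeling strategy mentioned above.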

Average-linkage: Following this strategy, example x_new adopts the label of the category c having the minimal average distance to the example:

pred_{avg}(x_{new}) = \arg\min_c \frac{1}{|N(c)|} \sum_{k=1}^{|N(c)|} \| n_k - x_{new} \|^2

Complete-linkage: According to this prediction strategy, a new data point x_new is labeled with the category c of the neuron n that minimizes the maximal distance to this new example:

pred_{compl}(x_{new}) = \arg\min_c ( \max_{n \in N(c)} \| n - x_{new} \|^2 )

3.2. Online labeling strategies for GNG

In order to extend GNG into an online classification algorithm, we extend the basic GNG by a step in which the label of the presented stimulus is assigned on-the-fly, without the requirement of an additional labeling phase. We denote the winner neuron for data point x by w(x). All prediction strategies are local in the sense that they do not consider any neighboring neurons besides the winner neuron w(x). As the labeling is performed on-the-fly, the label assigned to a neuron can change over time, so that the labeling function is dependent on the number of examples the network has seen and has the following form: l : N × T → C. We will simply write l^t(n_i) to denote the label assigned to neuron n_i after having seen t data points.

Relabeling method (relabel): According to this very simple strategy, the winner neuron w(x) adopts the label of x:

l^t_{relabel}(n_i) = l_x,  where n_i = w(x)

Frequency-based method (freq): In this labeling method we realize a memory for each neuron. We assume that each neuron stores information about how often a data point of a certain category has been assigned to n_i after t examples have been presented to the network. This frequency freq_t(c, n_i) is updated on-the-fly and does not require the storage of training examples, and thus represents a very restricted form of memory. According to this strategy, a neuron is labeled by the category which maximizes this frequency, i.e.

l^t_{freq}(n_i) = \arg\max_c freq_t(c, n_i)

Limited-distance method (limit): According to this strategy, we also implement a simple memory that stores the distance of the data point that was closest to the neuron in question. We denote this data point as min_t(n_i) and the corresponding distance as θ_t(n_i) = \| min_t(n_i) - n_i \|^2. The winner neuron w(x) adopts the category label l_x of the data point x if the distance between them is lower than θ_t(w(x)). Only in the case of a smaller distance will θ_t(n_i) be updated with the new distance.

l^t_{limit}(n_i) = \begin{cases} l_x, & \text{if } \| n_i - x \|^2 \leq \theta_t(n_i) \\ l^{t-1}_{limit}(n_i), & \text{otherwise} \end{cases}

Online labeling Growing Neural Gas (OGNG)

1. Start with two units i and j at random positions w_i, w_j in the input space.
2. Present an input vector x ∈ R^n from the input data.
3. Find the nearest unit n_1 (winner) and the second nearest unit n_2 (second winner).
4. Assign the label of x to n_1 according to the selected labeling strategy.
5. Increment the age of all edges emanating from n_1.
6. Update the local error variable by adding the squared distance between w_{n_1} and x:
   error(n_1) = error(n_1) + \| w_{n_1} - x \|^2
7. Move n_1 and all its topological neighbors (i.e. all the nodes connected to n_1 by an edge) towards x by fractions e_b and e_n of the distance:
   \Delta w_{n_1} = e_b (x - w_{n_1})
   \Delta w_n = e_n (x - w_n)  for all direct neighbors n of n_1.
8. If n_1 and n_2 are connected by an edge, set the age of the edge to 0 (refresh). If there is no such edge, create one.
9. Remove edges having an age greater than a_max. If this results in nodes having no emanating edges, remove them as well.
10. If the number of input vectors presented or generated so far is an integer multiple of a parameter λ, insert a new node r as follows:
    Determine the unit q with the largest error.
    Among the neighbors of q, find the node f with the largest error.
    Insert a new node r halfway between q and f:
    w_r = (w_q + w_f) / 2

    Create edges between r and q, and between r and f. Remove the edge between q and f.
    Decrease the local error variables of q and f by multiplying them with a constant α. Set the error variable of r to the new error variable of q.
11. Decrease the local error variables of all nodes i by a factor β.
12. If the stopping criterion is not met, go back to step (2). (For our experiments, the stopping criterion has been set to be the maximum network size.)

Fig. 1. GNG algorithm with extension for online labeling.

3.3. Experiments and results

We compare and evaluate the above mentioned labeling strategies (online vs. offline in particular) on three classification data sets: i) an artificial data set generated following a Gaussian distribution, ii) the ORL face database 31 and iii) the image segmentation data set of the UCI machine learning database 4. We briefly describe these datasets in what follows:

Artificial data set (ART): The first data set is a two-dimensional Gaussian mixture distribution with 6 classes located at [0,6], [-2,2], [2,2], [0,-6], [-2,-2], [2,-2]. The data points of each class are distributed according to a Gaussian distribution with a standard deviation of 1.

ORL face database (ORL): The second data set is the ORL face database containing 400 frontal images of humans performing different gestures. The data set consists of 40 individuals showing 10 gestures each. We downscaled each image and applied a principal component analysis (PCA) to reduce the number of dimensions from 2576 to 60, capturing 86.65% of the original variance.

Image Segmentation data set (SEG): The image segmentation data set consists of 2310 instances from 7 randomly selected outdoor images (brick-face, sky, foliage, cement, window, path, grass). Each instance includes 19 attributes that describe a 3×3 region within one of the images.

In order to compare the different labeling strategies to each other, we set the parameters for GNG as follows: insertion parameter λ = 300; maximum age a_max = 120; adaptation parameter for the winner e_b = 0.2; adaptation parameter for the neighborhood e_n = 0.006; error variable decrease α = 0.5; error variable decrease β. These parameters have been empirically determined on a trial and error basis, and a different choice of parameters might lead to very different results. In our case the algorithm stops when a network size of 100 neurons is reached. For our experiments we randomly sampled 10 training/test sets consisting of 4 labeled examples per category. The accuracy is averaged over all 10 test folds. The reason for using only four examples per category is that the classification problems under consideration are so simple that by using more examples any strategy yields nearly perfect results, thus rendering a comparison meaningless.

Our results are shown in Table 1. The table shows the classification accuracy for various configurations of labeling methods (min-dist, avg-dist, majority, relabel, freq, limit) and prediction strategies (single-linkage, average-linkage, complete-linkage), averaged over the three different data sets. We evaluated the accuracy of each labeling method combined with the three prediction strategies (rows of the table). Therefore, we consider the results of 54 experiments overall. The results license the following conclusions:

Comparison of offline labeling strategies: According to Table 1, there is no offline labeling method which significantly outperforms the others.
Comparing the accuracy results averaged over all prediction strategies, the majority method is the most effective labeling method as it provides the highest accuracy with 77.59%, followed by the min-dist method with 76.27% and the avg-dist method with 74.28%. Concerning the prediction strategies, the single-linkage prediction strategy shows the best results averaged over all methods with 81.41%, followed by the average-linkage prediction strategy with an accuracy of 77.65%. The complete-linkage prediction yielded the worst results with an averaged accuracy of 69.07%. The results of all 54 experiments are available online.

Table 1: Classification accuracy for the offline (upper part) and online (lower part) labeling strategies combined with the prediction strategies, averaged over the three data sets (ART, ORL, SEG), trained with 4 labeled data points of each category (best averaged results are marked).

Comparison of online labeling strategies: According to Table 1, all three online labeling strategies are almost equal in their classification performance. The limit method performs slightly better compared to the other two methods and achieves an accuracy of 78.15%, followed by the freq method with an accuracy of 78.05% and the relabel method with an accuracy of 77.88%. As with the offline labeling strategies, it is also the case here that the single-linkage prediction is the best choice with an accuracy of 83.30%, followed by the average-linkage prediction with an accuracy of 80.90% and the complete-linkage prediction with an accuracy of 69.88%.

Online vs. offline labeling strategies: Comparing the averaged accuracy of all labeling methods in Table 1, the results show that there is no significant difference between them in terms of performance. The online labeling methods even provide a slightly higher accuracy.

Impact of memory: Strategies relying on some sort of memory (e.g. storing the frequency of seen labels as in the freq method) do not perform significantly better than a simple memory-free method (the relabel method) that makes decisions on the basis of new data points only. This shows that the implementation of a label memory does not enhance the classifier's performance.

3.4. Discussion

The results of our experiments show that using online labeling strategies does not significantly deteriorate the performance of a classifier based on GNG in comparison to using offline labeling strategies. An open question is in fact in how far the labels of the neurons actually differ from each other when using online vs. offline labeling strategies. If the labels overlap to a high degree, this would explain why the accuracy of both approaches is comparable. In order to shed light on this issue, we compared the labels assigned to neurons using the online labeling strategies with those assigned by offline labeling strategies at the end of the training phase, quantifying the percentage of neurons for which both methods agree on the label. We carried out this analysis using the single-linkage prediction strategy (averaged over all three datasets) as it was the best performing strategy in our experiments described above. The results are summarized in Table 2. We can see that in general there is a very high agreement in the labels assigned by the different labeling strategies, i.e. the labels are the same for over 85% of the neurons independent of the methods compared. This shows that the online and the offline labeling strategies ultimately assign almost the same labels to the neurons, and thus explains the closeness of the results in terms of classification performance.
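Before turning to the semi-supervised extension, the sketch below summarizes how the three online labeling strategies of Section 3.2 can be realized as constant-time updates of the winner neuron, without storing any training examples. The class and method names are our own illustration (class labels are assumed to be hashable, e.g. integers or strings), not code from the paper.

```python
import numpy as np
from collections import defaultdict

class OnlineNeuronLabel:
    """Per-neuron state for the relabel, freq and limit strategies (illustrative only)."""

    def __init__(self):
        self.label = None                 # current label l_t(n_i)
        self.freq = defaultdict(int)      # class -> count of stimuli assigned to this neuron
        self.theta = np.inf               # smallest squared distance seen so far (limit method)

    def update(self, x, label_x, weight, strategy="limit"):
        """Update the winner neuron's label after stimulus x carrying label label_x."""
        if strategy == "relabel":         # adopt the label of the current stimulus
            self.label = label_x
        elif strategy == "freq":          # adopt the most frequently seen label
            self.freq[label_x] += 1
            self.label = max(self.freq, key=self.freq.get)
        elif strategy == "limit":         # adopt the label only if x is the closest stimulus so far
            dist = float(np.sum((weight - x) ** 2))
            if dist <= self.theta:
                self.theta = dist
                self.label = label_x
```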

Table 2. Comparison of the percentage of agreement on labels between the online (relabel, freq, limit) and offline (min-dist, avg-dist, majority) labeling strategies.

4. Semi-supervised learning with Growing Neural Gas

The results in the previous section have shown that we can extend GNG by a dedicated online labeling step while yielding a satisfactory performance. Armed with these results, we have extended the approach presented in the previous section by a semi-supervised learning component that assigns labels to unseen examples on-the-fly and incorporates these labeled examples into the model. In contrast to previous semi-supervised extensions of GNG (i.e. the approach by Zaki et al. 37), no separate steps are thus required. In the approach of Zaki et al., two steps are iterated: in a first step, the network is trained using labeled examples only. Then, labels are assigned to neurons using an offline labeling approach. Finally, in a second step, unlabeled examples are classified into the network and labeled appropriately. These two steps are then iterated until the labeling converges, similar to the Expectation-Maximization (EM) approach 14. Compared to the approach of Zaki et al. as a baseline, we show that our online approach performs at least as well.

In the following, we first discuss our baseline, the SSGNG approach by Zaki et al., in more detail. Next, we present our own approach, which we call Online Semi-supervised Growing Neural Gas (OSSGNG). In Section 4.3, we present experimental results showing that our approach outperforms the SSGNG approach by Zaki et al. Further, we also show that our approach compares favorably to state-of-the-art semi-supervised learning approaches on a set of standard benchmarking datasets.

4.1. Semi-supervised Growing Neural Gas

The Semi-supervised Growing Neural Gas (SSGNG) algorithm, proposed by Zaki et al. 37, extends GNG to a classifier that can be trained with both labeled and unlabeled examples. For our purposes, SSGNG will be used as the baseline in our experiments. It is inspired by the EM algorithm and therefore the learning process is separated into different phases, as shown in Figure 2.

Semi-supervised Growing Neural Gas

1. Given L and U, let L′ = {} represent an initially empty set of newly labeled data.
2. Present L to the GNG algorithm and train the network only with L.
3. Label all the nodes of the GNG network according to L.
4. Present an input x_j from U iteratively and compute the Euclidean distance between x_j and every node of the GNG network: distance = \| w_n - x_j \|^2
5. Label x_j according to the class label of the winner node. Remove x_j from the current unlabeled dataset U and add x_j to the newly labeled dataset L′.
6. If all unlabeled data has been labeled, go to 7, otherwise go back to 4.
7. Present L and L′ together to the GNG classifier, retrain the classifier with L ∪ L′ and evaluate the new classification performance.
8. Check the labels of L′; if they become stable during successive iterations, stop. Otherwise go back to step 4.

Fig. 2. Semi-supervised Growing Neural Gas (SSGNG) algorithm.

In the following, we describe each step of the SSGNG algorithm in detail. (1) The algorithm starts with a given set of labeled and unlabeled training examples L, U ⊆ X, as well as an initially empty set of newly labeled data L′. The set L′ holds all examples from U that have been labeled during the last iteration step of the training process.
In the next step (2), the GNG network is trained only with L. In this step, the clustering is performed only on the basis of the feature vectors of L, without taking the label information into account. (3) Labels are assigned to each neuron of the network after the GNG training has finished. As Zaki et al. do not explicitly describe their labeling strategy, in our experiments we use the minimal-distance method from Section 3.1. In steps (4-6), the label for each unlabeled example of the training set is predicted by the network. Thus, the label of the neuron minimizing the Euclidean distance to the unlabeled data point is adopted by the latter.

This step is similar to the expectation step (E-step) of the EM algorithm. Additionally, all newly labeled examples of U are added to L′. (7) In this step, the GNG network retrains with L ∪ L′. (8) In the last step, the algorithm stops if the labels of U stabilize during the iterations in the sense that the labels do not change anymore.

The main disadvantage of SSGNG is the fact that labels are assigned to each neuron a posteriori, after the end of the training phase. Thus, the approach is not able to process a continuous stream of labeled and unlabeled training examples. Furthermore, labeled and unlabeled examples are processed in different phases and therefore need to be stored until the SSGNG training ends. Another disadvantage of SSGNG is that a minimal set of labeled examples for each class is crucial for the training. Our online version of Semi-supervised Growing Neural Gas, presented in the next section, circumvents these problems.

4.2. Online Semi-supervised Growing Neural Gas

In order to extend Growing Neural Gas into a semi-supervised classifier, we add two steps (steps 4 and 5) to the original GNG algorithm, as shown in Figure 3. In step (4), in case x is an unlabeled example, a label for x is predicted according to the chosen prediction strategy. The prediction strategy we use is the single-linkage prediction from Section 3.1. In step (5), the label of the presented stimulus is assigned to the winner neuron in each iteration of GNG. The label assignment is performed by an online labeling function, which in our case is the limit method from Section 3.2. In contrast to SSGNG, OSSGNG processes labeled and unlabeled training examples uniformly in every iteration step, with the exception that a label is only predicted for an unlabeled example. This means that OSSGNG is able to solve a classification task after each iteration step. The main advantage of OSSGNG lies in its ability to train in an online fashion without the need of storing training examples explicitly. Further, the OSSGNG algorithm still provides the ability of GNG to perform a clustering on an unlabeled training set. This means that clusters can be formed without knowledge about the categories.

Online Semi-supervised Growing Neural Gas

1. Start with two units i and j at random positions w_i, w_j in the input space.
2. Present an input vector x ∈ R^n from the input data.
3. Find the nearest unit n_1 (winner) and the second nearest unit n_2 (second winner).
4. If the label of x is missing, assign a label to x according to the selected prediction strategy.
5. Assign the label of x to n_1 according to the selected labeling strategy.
6. Increment the age of all edges emanating from n_1.
7. Update the local error variable by adding the squared distance between w_{n_1} and x:
   error(n_1) = error(n_1) + \| w_{n_1} - x \|^2
8. Move n_1 and all its topological neighbors (i.e. all the nodes connected to n_1 by an edge) towards x by fractions e_b and e_n of the distance:
   \Delta w_{n_1} = e_b (x - w_{n_1})
   \Delta w_n = e_n (x - w_n)  for all direct neighbors n of n_1.
9. If n_1 and n_2 are connected by an edge, set the age of the edge to 0 (refresh). If there is no such edge, create one.
10. Remove edges having an age greater than a_max. If this results in nodes having no emanating edges, remove them as well.
11. If the number of input vectors presented or generated so far is an integer multiple of a parameter λ, insert a new node r as follows:
    Determine the unit q with the largest error.
    Among the neighbors of q, find the node f with the largest error.
    Insert a new node r halfway between q and f:
    w_r = (w_q + w_f) / 2
    Create edges between r and q, and between r and f. Remove the edge between q and f.
    Decrease the local error variables of q and f by multiplying them with a constant α. Set the error variable of r to the new error variable of q.
12. Decrease the local error variables of all nodes i by a factor β.
13. If the stopping criterion is not met, go back to step (2). (For our experiments, the stopping criterion has been set to be the maximum network size.)

Fig. 3. GNG algorithm with extension for online semi-supervised learning.
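The following sketch illustrates one OSSGNG iteration (roughly steps 2-10 of Figure 3) using the single-linkage prediction and the limit labeling strategy; node insertion (step 11) and the error decay (step 12) are omitted for brevity. The data layout (weight matrix W, per-neuron label list, edge set, and so on) and the function name ossgng_step are our own assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def ossgng_step(x, label_x, W, labels, theta, edges, ages, error,
                e_b=0.2, e_n=0.006, a_max=100):
    """One OSSGNG iteration (cf. steps 2-10 of Fig. 3). Assumed layout:
    W: (m, d) neuron weights; labels: list of class labels or None;
    theta: (m,) smallest squared distance seen per neuron (limit method);
    edges: set of frozenset({i, j}); ages: dict edge -> age; error: (m,) local errors."""
    d = np.linalg.norm(W - x, axis=1)
    n1, n2 = np.argsort(d)[:2]                      # step 3: winner and second winner

    # Step 4: predict a label for an unlabeled stimulus (single-linkage prediction).
    if label_x is None:
        labelled = [i for i, l in enumerate(labels) if l is not None]
        if labelled:
            label_x = labels[min(labelled, key=lambda i: d[i])]

    # Step 5: online labeling of the winner (limit strategy).
    if label_x is not None and d[n1] ** 2 <= theta[n1]:
        theta[n1] = d[n1] ** 2
        labels[n1] = label_x

    # Steps 6-8: age the winner's edges, accumulate its error, adapt winner and neighbours.
    for e in edges:
        if n1 in e:
            ages[e] += 1
    error[n1] += d[n1] ** 2
    W[n1] += e_b * (x - W[n1])
    for e in edges:
        if n1 in e:
            (n,) = e - {n1}
            W[n] += e_n * (x - W[n])

    # Step 9: refresh or create the edge between the two winners.
    e12 = frozenset((int(n1), int(n2)))
    edges.add(e12)
    ages[e12] = 0

    # Step 10: drop edges that are too old (isolated-node removal, node insertion
    # every λ steps and the global error decay by β are left out of this sketch).
    for e in [e for e in edges if ages[e] > a_max]:
        edges.discard(e)
        ages.pop(e, None)
```

A full implementation would additionally insert a new node halfway between the highest-error unit and its worst neighbor every λ presentations (step 11) and decay all local error variables by β (step 12), exactly as in Figure 3.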

4.3. Experiments and results

We evaluate the OSSGNG algorithm on 6 datasets (3 artificial, 3 real) that have been proposed as benchmarks for semi-supervised classification (Chapelle et al. 6). We use SSGNG as the baseline for our approach and evaluate the classification accuracy on the test sets of the 6 datasets described below in more detail. Except for the BCI dataset, all SSL benchmarking datasets include 1500 data points with 241-dimensional feature vectors. In order to visualize the data, Figure 4 (in the appendix) shows a two-dimensional representation of each dataset, plotting the first two principal components after applying a Principal Component Analysis (PCA). We describe these datasets briefly in what follows:

g241c: This artificial dataset was generated by two unit-variance isotropic Gaussians with their centers having a distance of 2.5 from each other in a random direction. Additionally, all dimensions are standardized in the sense that they are shifted and rescaled to zero mean and unit variance.

g241d: The second artificial dataset is similar to the first one. However, the two classes A, B were split into A_1, A_2 and B_1, B_2. The distance between the subclasses (A_1, B_1) and (A_2, B_2) was set to 2.5 in a random direction, while the inter-class distance between (A_1, A_2) and (B_1, B_2) is 6. The dataset was designed in such a way that these subclasses are not convex, thus making it impossible to separate those classes based on Euclidean distance.

Digit1: In the last artificial dataset, images of the digit 1 were generated. These images are the result of transformations of the digit along five degrees of freedom: two for translations ([-0.13, 0.13] each), one for rotation ([-90°, 90°]), one for line thickness ([0.02, 0.05]), and one for the small line at the bottom ([0, 0.1]). The class labels were set according to the tilt angle, with the boundary corresponding to an upright digit. Additional noise was added in order to make the task slightly more difficult.

USPS: This dataset includes images of handwritten digits. The digits 2 and 5 were assigned to one class, while the remaining digits form the second class. Thus, the dataset is imbalanced with a ratio of 1:4.

COIL: In this dataset the images of 24 objects were partitioned into 6 classes. Each image shows one of the objects from a different angle (in steps of 5 degrees).

BCI: In the last dataset, EEG (electroencephalography) measurements were recorded from 39 electrodes. The 400 data points are collected from 400 trials with subjects moving either the left hand (class -1) or the right hand (class +1). An autoregressive model of order 3 was applied to the resulting 39 time series in order to construct a 117 (39 × 3)-dimensional feature vector.

As we want both GNG variants, SSGNG and OSSGNG, to be comparable, we chose a fixed set of parameters that was proposed by Zaki et al. 37 for SSGNG and used these throughout our experiments. The parameters are thus set as follows: insertion parameter λ = 300; maximum age a_max = 100; adaptation parameter for the winner e_b = 0.2; adaptation parameter for the neighborhood e_n = 0.006; error variable decrease α = 0.5; error variable decrease β. The algorithm stops when a network size of 200 neurons is reached. Our experiments are carried out using a 12-fold cross-validation with 100 labeled examples per fold. This setup corresponds to the setup used in Chapelle et al. 7, where a number of state-of-the-art SSL algorithms were benchmarked. We evaluate the accuracy of all compared algorithms on the test sets, as shown in Table 3. Each row in the table represents the accuracy on the test set averaged over the 12 folds. The best accuracy is marked in each row.
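For concreteness, the following sketch generates data in the spirit of the g241c benchmark described above (two unit-variance isotropic 241-dimensional Gaussians whose centers are 2.5 apart in a random direction, followed by standardization). The actual benchmark of Chapelle et al. is a fixed published dataset, so this code is only an illustration of the construction, with a hypothetical function name and defaults of our own choosing.

```python
import numpy as np

def make_g241c_like(n=1500, d=241, dist=2.5, seed=0):
    """Generate g241c-like data: two unit-variance isotropic Gaussians whose centers
    are `dist` apart in a random direction, then standardized (illustrative only)."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    centers = np.stack([+0.5 * dist * direction, -0.5 * dist * direction])
    y = rng.integers(0, 2, size=n)                       # class labels 0/1
    X = centers[y] + rng.normal(size=(n, d))             # unit-variance isotropic noise
    X = (X - X.mean(axis=0)) / X.std(axis=0)             # shift/rescale to zero mean, unit variance
    return X, y
```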
We also compare to a 1-Nearest-Neighbor (1-NN) classifier and a Support Vector Machine (SVM) classifier using a linear kernel, in order to provide an overall baseline for all SSL approaches. Both classifiers were trained only with the labeled data points of our data sets. The results license the following observations:

Comparison of OSSGNG and SSGNG: According to Table 3, OSSGNG clearly outperforms SSGNG on the datasets g241c, g241d and COIL, while having a comparable performance on the datasets Digit1 and USPS. On average, OSSGNG also has a higher accuracy (77.73%) compared to SSGNG (73.08%). While we can not claim that the differences are significant, it is valid to claim that our online version performs as well as our baseline (SSGNG), while circumventing the need to explicitly store training examples and perform several passes over the data until reaching convergence.

Comparison of OSSGNG and OGNG: The results in Table 3 show that extending OGNG with a semi-supervised component (OSSGNG) can improve its classification performance by up to 7.68% (dataset g241d). There are only two datasets (Digit1 and USPS) for which OSSGNG yields worse results compared to OGNG, albeit these differences are clearly minor. Interestingly, these are also the datasets for which a semi-supervised SVM classifier (TSVM) performs worse than a standard SVM (see Table 4).

Comparison of OSSGNG and standard semi-supervised learning algorithms: We additionally compared our results to the results of standard semi-supervised classification algorithms published by Chapelle et al. 7, namely the Transductive SVM (TSVM) 16 (using a linear kernel), Cluster-Kernel 36, Data-dependency regularization 8 and Low-Density Separation (LDS) 7. We did not reimplement these algorithms, but compared our results to the published results obtained on the same data under the same conditions. The results are summarized in Table 4. On two datasets (g241c, g241d), OSSGNG performs clearly worse compared to the other semi-supervised learning approaches. On the other four datasets (Digit1, USPS, COIL and BCI), the performance of OSSGNG is better than that of a standard SVM and comparable to those of other state-of-the-art semi-supervised learning approaches. On one dataset, BCI, OSSGNG even outperforms all other approaches by far. These results clearly license the conclusion that OSSGNG can compete with other semi-supervised learning approaches.

Table 3. Classification results (accuracy) of a 12-fold cross-validation for OGNG, SSGNG and OSSGNG performed on the 6 datasets (g241c, g241d, Digit1, USPS, COIL, BCI).

4.4. Discussion

Our experiments show the benefit of the semi-supervised OGNG (OSSGNG). It clearly outperforms SSGNG and improves the classification performance of OGNG on 4 out of 6 datasets. The two datasets (Digit1, USPS) on which the semi-supervised approaches (OSSGNG, TSVM) yield worse results compared to their original algorithms (OGNG, SVM) seem to be very easy to classify, as every compared algorithm achieves an accuracy over 90%. The results of the 1-NN approach also license this observation, as it performs much better on those datasets than on the others. It seems that in these cases semi-supervised learning can not improve the classification performance further. For 4 out of 6 datasets, OSSGNG achieves better results compared to a standard SVM (with a linear kernel) while also being comparable to other standard SSL algorithms. It is striking that OSSGNG outperforms all other approaches by far on the BCI dataset. This dataset is characterized by the availability of only few data points (400 in total) as well as by low-dimensional feature vectors (117 dimensions). OSSGNG thus seems to generalize better on low numbers of examples. OGNG and OSSGNG show their worst results on the datasets g241c and g241d. In order to shed light on this observation, we performed a PCA to reduce the dimensionality of the data to the number of principal components that capture 90% of the variance. The results are shown in Table 5.
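As an illustration of this analysis, the number of principal components needed to capture 90% of the variance can be computed as in the following sketch (using scikit-learn); this is our own illustrative code, not the analysis script used for Table 5.

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for_variance(X, threshold=0.90):
    """Return how many principal components are needed to capture `threshold` of the variance."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)
```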
The analysis shows that those two datasets have a much higher complexity (with 192 and 193 components, respectively) compared to the rest, which could be the reason for the weak results of both OGNG and OSSGNG. Their performance on these two datasets is even worse than that of 1-NN, which hints at the fact that some parts of the data are underrepresented in the OGNG/OSSGNG model. This could be due to too few neurons or due to a very sparsely labeled network. In fact, the OSSGNG algorithm does not guarantee that all neurons are actually labeled. It may be possible to achieve better results with OGNG/OSSGNG on these two datasets with a different set of parameters.

Table 4: Classification results (accuracy) of a 12-fold cross-validation for our baselines (1-NN, SVM), OGNG, different standard SSL approaches and OSSGNG performed on the 6 datasets (g241c, g241d, Digit1, USPS, COIL, BCI).

Table 5. Number of principal components that capture 90% of the data variance for the 6 datasets (g241c, g241d, Digit1, USPS, COIL, BCI).

5. Related Work

To our knowledge, there has been no systematic investigation and comparison of different labeling strategies (offline vs. online in particular) for Growing Neural Gas. This is a gap that we have intended to fill. The question of how GNG can be extended to an online classification algorithm has also not been addressed previously. In most cases, offline strategies have been considered that perform the labeling after the training phase has ended and the network has stabilized to some extent, as in the WEBSOM 20,21 and LabelSOM 28 approaches. In both of these approaches, the label assignment is essentially determined by the distance of the labeled training data points to the neurons of the already trained network. Such offline labeling strategies run counter to the online nature of GNG, whose interesting properties are that the network grows over time and only neurons, but no explicit examples, need to be stored in the network. Our results indeed license the claim that extending a clustering algorithm (based on GNG) with online labeling strategies does not yield a worse classification performance compared to using offline labeling functions.

In recent years, there has been substantial work in the area of semi-supervised learning, both in the context of classification and clustering tasks. These approaches have been successfully applied to a number of applications such as text classification, pattern recognition and medical diagnosis 16,37,9. One can distinguish between three main classes of semi-supervised learning approaches: Generative Models, Low-Density Separation and Graph-based Methods. Generative models involve the estimation of the conditional density p(x|y), with p(x) being the density of the input space and p(y) being the density of the label/category space. Typically, the EM algorithm is applied to estimate the parameters of the Gaussian distribution for each class. Existing semi-supervised extensions of GNG 37 in fact build on the EM principle and thus process labeled and unlabeled data in two separate steps. Approaches such as SSGNG and the co-training approach proposed by Blum et al. 5 belong to this class of semi-supervised learning approaches. In contrast, our approach relies on a single (online) step that predicts labels for new examples and incorporates them into the existing model on-the-fly. Approaches based on Low-Density Separation such as the Transductive SVM (TSVM) 16 make use of the unlabeled data to iteratively maximize the margin using labeled and unlabeled data points. To this end, the TSVM is initially trained with only labeled examples and increases the amount of incorporated unlabeled data points iteratively.
The third class of semi-supervised learning algorithms are Graph-based Methods (see Belkin et al. 3).

These approaches organize labeled and unlabeled data points as nodes in a graph, where edges represent the similarity between the single nodes and are thus labeled with the distance between them. Missing distances are typically approximated by the minimal aggregated path over all paths connecting two nodes. Another closely related approach was proposed by Shen et al. 32. They presented a Self-Organizing Incremental Neural Network (SOINN) which provides a growing structure and is capable of processing labeled and unlabeled data. During the learning process, nodes are successively inserted into the network, separated into sub-clusters and merged into bigger clusters. In contrast, our approach relies on simpler yet effective machinery that does not require any additional heuristics. Our focus in this paper has been on extending GNG into a semi-supervised online classifier, such that a comparison with other methods such as SOINN is out of the scope of this paper.

In our work, we have built on the formulation of GNG as introduced by Fritzke 12, providing our extensions on top of the basic algorithm. Our extension to semi-supervised online GNG has been inspired by and based on the SSGNG approach of Zaki et al. 37. Our goal has been to extend their approach into a uniform approach that can work with labeled and unlabeled data and requires neither separate training and labeling phases nor iterative processing. The labeling functions we have empirically examined are based on Lau et al. 22, and the prediction functions we have used are based on standard inter-cluster similarity measures used in clustering approaches.

6. Conclusion

We have presented an extension of GNG to an online semi-supervised classifier which relies on online labeling strategies to assign labels to neurons and to label unlabeled data points on-the-fly. We have shown in particular that using online labeling strategies yields comparable results to using offline labeling strategies. As using offline strategies for labeling neurons requires the storage of training examples, the application of online strategies is preferable, particularly as they produce comparable results, as conveyed by our experimental results. Further, we have shown that the semi-supervised extension of GNG compares favorably to other state-of-the-art semi-supervised learning approaches.

We see two important limitations of our work. First, our results have been obtained on small datasets and relatively simple classification problems. In our experiments comparing online and offline labeling strategies, we have presented results on these datasets using 4 labeled examples only. The reason for this is that the classification problems under consideration are so simple that by using more examples any strategy yields nearly perfect results, thus rendering a comparison meaningless. An obvious avenue for future research is thus to confirm our results on larger datasets for more complex classification problems. We assume that on larger datasets, the simplicity of our approach will have clear advantages, as training and prediction can be carried out efficiently without several passes over the data and without explicitly storing training data. Our approach might thus be especially relevant in the context of stream classification tasks. Second, our approach does not directly exploit the topology of the network for the labeling.
It would also be interesting to investigate the performance of labeling functions that are non-local in the sense that they take the network topology into account in predicting a label. Further, the labeling does not influence the topology of the network. It would thus be interesting to investigate in how far the categorization can be improved by letting the labels influence the synaptic links and thus the clustering process, e.g. by using discriminative learning techniques aimed at reducing the empirical risk of misclassifying an example. Verifying whether the online labeling strategies can be used in the context of other approaches based on topological maps is a further issue to investigate.

Acknowledgments

This project has been funded by the Excellence Initiative of the Deutsche Forschungsgemeinschaft (DFG). Thanks to Barbara Hammer for comments and feedback on a first draft of this paper.


Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

1 Case study of SVM (Rob)

1 Case study of SVM (Rob) DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how

More information

Neural Network Weight Selection Using Genetic Algorithms

Neural Network Weight Selection Using Genetic Algorithms Neural Network Weight Selection Using Genetic Algorithms David Montana presented by: Carl Fink, Hongyi Chen, Jack Cheng, Xinglong Li, Bruce Lin, Chongjie Zhang April 12, 2005 1 Neural Networks Neural networks

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Further Applications of a Particle Visualization Framework

Further Applications of a Particle Visualization Framework Further Applications of a Particle Visualization Framework Ke Yin, Ian Davidson Department of Computer Science SUNY-Albany 1400 Washington Ave. Albany, NY, USA, 12222. Abstract. Our previous work introduced

More information

Time Series Classification in Dissimilarity Spaces

Time Series Classification in Dissimilarity Spaces Proceedings 1st International Workshop on Advanced Analytics and Learning on Temporal Data AALTD 2015 Time Series Classification in Dissimilarity Spaces Brijnesh J. Jain and Stephan Spiegel Berlin Institute

More information

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4

More information

21 Analysis of Benchmarks

21 Analysis of Benchmarks 21 Analysis of Benchmarks In order to assess strengths and weaknesses of different semi-supervised learning (SSL) algorithms, we invited the chapter authors to apply their algorithms to eight benchmark

More information

A SOM-view of oilfield data: A novel vector field visualization for Self-Organizing Maps and its applications in the petroleum industry

A SOM-view of oilfield data: A novel vector field visualization for Self-Organizing Maps and its applications in the petroleum industry A SOM-view of oilfield data: A novel vector field visualization for Self-Organizing Maps and its applications in the petroleum industry Georg Pölzlbauer, Andreas Rauber (Department of Software Technology

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

Toward Part-based Document Image Decoding

Toward Part-based Document Image Decoding 2012 10th IAPR International Workshop on Document Analysis Systems Toward Part-based Document Image Decoding Wang Song, Seiichi Uchida Kyushu University, Fukuoka, Japan wangsong@human.ait.kyushu-u.ac.jp,

More information

A Unified Framework to Integrate Supervision and Metric Learning into Clustering

A Unified Framework to Integrate Supervision and Metric Learning into Clustering A Unified Framework to Integrate Supervision and Metric Learning into Clustering Xin Li and Dan Roth Department of Computer Science University of Illinois, Urbana, IL 61801 (xli1,danr)@uiuc.edu December

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map

Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map Markus Turtinen, Topi Mäenpää, and Matti Pietikäinen Machine Vision Group, P.O.Box 4500, FIN-90014 University

More information

Experimenting with Multi-Class Semi-Supervised Support Vector Machines and High-Dimensional Datasets

Experimenting with Multi-Class Semi-Supervised Support Vector Machines and High-Dimensional Datasets Experimenting with Multi-Class Semi-Supervised Support Vector Machines and High-Dimensional Datasets Alex Gonopolskiy Ben Nash Bob Avery Jeremy Thomas December 15, 007 Abstract In this paper we explore

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Chapter 7: Competitive learning, clustering, and self-organizing maps

Chapter 7: Competitive learning, clustering, and self-organizing maps Chapter 7: Competitive learning, clustering, and self-organizing maps António R. C. Paiva EEL 6814 Spring 2008 Outline Competitive learning Clustering Self-Organizing Maps What is competition in neural

More information

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016 Machine Learning 10-701, Fall 2016 Nonparametric methods for Classification Eric Xing Lecture 2, September 12, 2016 Reading: 1 Classification Representing data: Hypothesis (classifier) 2 Clustering 3 Supervised

More information

Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric.

Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric. CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance

More information

CS 188: Artificial Intelligence Fall 2008

CS 188: Artificial Intelligence Fall 2008 CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley 1 1 Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance

More information

Self-Organizing Maps for cyclic and unbounded graphs

Self-Organizing Maps for cyclic and unbounded graphs Self-Organizing Maps for cyclic and unbounded graphs M. Hagenbuchner 1, A. Sperduti 2, A.C. Tsoi 3 1- University of Wollongong, Wollongong, Australia. 2- University of Padova, Padova, Italy. 3- Hong Kong

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Naïve Bayes for text classification

Naïve Bayes for text classification Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support

More information

Semi-Supervised PCA-based Face Recognition Using Self-Training

Semi-Supervised PCA-based Face Recognition Using Self-Training Semi-Supervised PCA-based Face Recognition Using Self-Training Fabio Roli and Gian Luca Marcialis Dept. of Electrical and Electronic Engineering, University of Cagliari Piazza d Armi, 09123 Cagliari, Italy

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification

An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification An Empirical Study of Hoeffding Racing for Model Selection in k-nearest Neighbor Classification Flora Yu-Hui Yeh and Marcus Gallagher School of Information Technology and Electrical Engineering University

More information

Title. Author(s)Liu, Hao; Kurihara, Masahito; Oyama, Satoshi; Sato, Issue Date Doc URL. Rights. Type. File Information

Title. Author(s)Liu, Hao; Kurihara, Masahito; Oyama, Satoshi; Sato, Issue Date Doc URL. Rights. Type. File Information Title An incremental self-organizing neural network based Author(s)Liu, Hao; Kurihara, Masahito; Oyama, Satoshi; Sato, CitationThe 213 International Joint Conference on Neural Ne Issue Date 213 Doc URL

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

Large Scale Manifold Transduction

Large Scale Manifold Transduction Large Scale Manifold Transduction Michael Karlen, Jason Weston, Ayse Erkan & Ronan Collobert NEC Labs America, Princeton, USA Ećole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland New York University,

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Stability Assessment of Electric Power Systems using Growing Neural Gas and Self-Organizing Maps

Stability Assessment of Electric Power Systems using Growing Neural Gas and Self-Organizing Maps Stability Assessment of Electric Power Systems using Growing Gas and Self-Organizing Maps Christian Rehtanz, Carsten Leder University of Dortmund, 44221 Dortmund, Germany Abstract. Liberalized competitive

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Modification of the Growing Neural Gas Algorithm for Cluster Analysis

Modification of the Growing Neural Gas Algorithm for Cluster Analysis Modification of the Growing Neural Gas Algorithm for Cluster Analysis Fernando Canales and Max Chacón Universidad de Santiago de Chile; Depto. de Ingeniería Informática, Avda. Ecuador No 3659 - PoBox 10233;

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

Lecture 10: Semantic Segmentation and Clustering

Lecture 10: Semantic Segmentation and Clustering Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305

More information

Unsupervised Learning: Clustering

Unsupervised Learning: Clustering Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning

More information

Adaptive Metric Nearest Neighbor Classification

Adaptive Metric Nearest Neighbor Classification Adaptive Metric Nearest Neighbor Classification Carlotta Domeniconi Jing Peng Dimitrios Gunopulos Computer Science Department Computer Science Department Computer Science Department University of California

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

SYMBOLIC FEATURES IN NEURAL NETWORKS

SYMBOLIC FEATURES IN NEURAL NETWORKS SYMBOLIC FEATURES IN NEURAL NETWORKS Włodzisław Duch, Karol Grudziński and Grzegorz Stawski 1 Department of Computer Methods, Nicolaus Copernicus University ul. Grudziadzka 5, 87-100 Toruń, Poland Abstract:

More information

5 Learning hypothesis classes (16 points)

5 Learning hypothesis classes (16 points) 5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Efficient Voting Prediction for Pairwise Multilabel Classification

Efficient Voting Prediction for Pairwise Multilabel Classification Efficient Voting Prediction for Pairwise Multilabel Classification Eneldo Loza Mencía, Sang-Hyeun Park and Johannes Fürnkranz TU-Darmstadt - Knowledge Engineering Group Hochschulstr. 10 - Darmstadt - Germany

More information

Machine Learning : Clustering, Self-Organizing Maps

Machine Learning : Clustering, Self-Organizing Maps Machine Learning Clustering, Self-Organizing Maps 12/12/2013 Machine Learning : Clustering, Self-Organizing Maps Clustering The task: partition a set of objects into meaningful subsets (clusters). The

More information

Accelerometer Gesture Recognition

Accelerometer Gesture Recognition Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

Lecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering

Lecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering CSE 291: Unsupervised learning Spring 2008 Lecture 5 Finding meaningful clusters in data So far we ve been in the vector quantization mindset, where we want to approximate a data set by a small number

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

Global Metric Learning by Gradient Descent

Global Metric Learning by Gradient Descent Global Metric Learning by Gradient Descent Jens Hocke and Thomas Martinetz University of Lübeck - Institute for Neuro- and Bioinformatics Ratzeburger Allee 160, 23538 Lübeck, Germany hocke@inb.uni-luebeck.de

More information

Automatic Group-Outlier Detection

Automatic Group-Outlier Detection Automatic Group-Outlier Detection Amine Chaibi and Mustapha Lebbah and Hanane Azzag LIPN-UMR 7030 Université Paris 13 - CNRS 99, av. J-B Clément - F-93430 Villetaneuse {firstname.secondname}@lipn.univ-paris13.fr

More information

Extract an Essential Skeleton of a Character as a Graph from a Character Image

Extract an Essential Skeleton of a Character as a Graph from a Character Image Extract an Essential Skeleton of a Character as a Graph from a Character Image Kazuhisa Fujita University of Electro-Communications 1-5-1 Chofugaoka, Chofu, Tokyo, 182-8585 Japan k-z@nerve.pc.uec.ac.jp

More information

The Projected Dip-means Clustering Algorithm

The Projected Dip-means Clustering Algorithm Theofilos Chamalis Department of Computer Science & Engineering University of Ioannina GR 45110, Ioannina, Greece thchama@cs.uoi.gr ABSTRACT One of the major research issues in data clustering concerns

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Convergence of Multi-Pass Large Margin Nearest Neighbor Metric Learning

Convergence of Multi-Pass Large Margin Nearest Neighbor Metric Learning Convergence of Multi-Pass Large Margin Nearest Neighbor Metric Learning Christina Göpfert Benjamin Paassen Barbara Hammer CITEC center of excellence Bielefeld University - Germany (This is a preprint of

More information

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Noise-based Feature Perturbation as a Selection Method for Microarray Data Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering

More information

Growing Neural Gas A Parallel Approach

Growing Neural Gas A Parallel Approach Growing Neural Gas A Parallel Approach Lukáš Vojáček 1 and JiříDvorský 2 1 IT4Innovations Centre of Excellence Ostrava, Czech Republic lukas.vojacek@vsb.cz 2 Department of Computer Science, VŠB Technical

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2015

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2015 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2015 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows K-Nearest

More information

PARALLEL CLASSIFICATION ALGORITHMS

PARALLEL CLASSIFICATION ALGORITHMS PARALLEL CLASSIFICATION ALGORITHMS By: Faiz Quraishi Riti Sharma 9 th May, 2013 OVERVIEW Introduction Types of Classification Linear Classification Support Vector Machines Parallel SVM Approach Decision

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

Semi-supervised learning and active learning

Semi-supervised learning and active learning Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners

More information