Using PageRank in Feature Selection

Dino Ienco, Rosa Meo, and Marco Botta
Dipartimento di Informatica, Università di Torino, Italy

Abstract. Feature selection is an important task in data mining because it reduces the data dimensionality and eliminates noisy variables. Traditionally, feature selection has been applied in supervised scenarios rather than in unsupervised ones. Nowadays, the amount of unsupervised data available on the web is huge, motivating an increasing interest in feature selection for unsupervised data. In this paper we present some results in the domain of document categorization. We use the well-known PageRank algorithm to perform a random walk through the feature space of the documents, which allows us to rank, and subsequently select, the features that best represent the data set. Compared with previous work based on information gain, our method allows classifiers to obtain good accuracy especially when few features are retained.

1 Introduction

Every day we work with a large amount of data, the majority of which is unlabelled; almost all the information on the Internet is not labelled, so being able to treat it with unsupervised tasks has become very important. For instance, we may want to automatically categorize documents while knowing that some of the words can be considered noisy variables. The problem is to select the subset of words that best represents the document set without using any information about the class of the documents. This is a typical feature selection problem, in which the documents take as features the set of terms contained in the whole dataset.

Feature selection is widely recognised as an important task in machine learning and data mining [2]. In high-dimensional datasets, feature selection improves algorithm performance and classification accuracy, since the chance of overfitting increases with the number of features. Furthermore, when the curse of dimensionality emerges - especially when the representation of the objects in the feature space is very sparse - feature selection reduces the degradation of the results of clustering and of distance-based k-NN algorithms.

In the supervised approach to feature selection, the existing methods fall into two families: wrapper methods and filter methods. Wrapper techniques evaluate the features using the learning algorithm that will ultimately be employed. Filter-based approaches most commonly explore the correlations between the features and the class label, assign each feature a score, and then rank the features by that score. Feature selection picks the best k features according to their score, and these are used to represent the dataset. Most of the existing filter methods are supervised.

Data variance might be the simplest unsupervised evaluation of the features: a feature with larger variance has a greater ability to separate the objects of different classes into disjoint regions. Along these lines, some works [5], [7] adopt a Laplacian matrix, which projects the original dataset into a different space with some desired properties, and then search the transformed space for the features that best represent a natural partition of the data.

The difference between supervised and unsupervised feature selection lies in the use of class information to guide the search for the best subset of features. Both can be viewed as selecting the features that are consistent with the concepts represented in the data. In supervised learning the concept is related to class affiliation, while in unsupervised learning it is usually related to the similarity between data instances in relevant portions of the dataset. We believe that these intrinsic structures in the data can be captured in a way similar to how PageRank ranks Web pages: by selecting the features that are most correlated with the majority of the other features in the dataset. These features should represent the relevant portions of the dataset - the dataset representatives - while still allowing the marginal characteristics of the data to be discarded.

In this work we propose to use the PageRank formula to select the best features of a dataset in an unsupervised way. The proposed method selects a subset of the original features that:
- represents the relevant characteristics of the data;
- has the highest probability of co-occurrence with the highest number of other features;
- helps to speed up the processing of the data;
- eliminates the noisy variables.

2 Methods

In this section we describe the base technique of our method and the specific approach that we adopt for unsupervised feature selection. The resulting algorithm is a feature selection/ranking algorithm that we call FRRW (Feature Ranking by Random Walking). Indeed, it is based on Random Walks on a graph whose vertices are the features, connected by weighted edges whose weights depend on how often both features recur in the dataset. The basic idea supporting the adoption of a graph-based ranking algorithm is that of voting or recommendation: when a first vertex is connected to a second vertex by a weighted edge, the first vertex votes for the second one in proportion to the weight of the connecting edge. The higher the sum of the weights a vertex obtains from the other vertices, the higher the importance of that vertex in the graph. Furthermore, the importance of a vertex determines the importance of its votes. Random Walks on graphs are a special case of Markov Chains, in which the Markov Chain itself describes the probability of moving between the graph vertices; in our case, it describes the probability of finding instances in the dataset that are characterized by the features.

A Random Walk searches for the stationary state of the Markov Chain, which assigns to each state a probability: the probability of being in that state after an infinite walk on the graph guided by the transition probabilities. Through the Random Walk on the graph, PageRank determines the stationary state vector by an iterative algorithm, i.e. collectively, by aggregation of the transition probabilities between all the graph vertices. PageRank produces a score for each vector component (according to a formula that we report below) and orders the components by their score value; in conclusion, it finds a ranking between the states. Intuitively, this score is proportional to the overall probability of moving into a state from any other state. In our case, the graph states are the features and the score vector represents the stationary distribution over the feature probabilities; in other terms, it is the overall probability of finding each feature in the dataset together with the other features. The framework is general: it can be adapted to different domains by simply modifying the vertex proximity measure that defines the transition probability between two graph vertices.

2.1 PageRank

Our approach is based on the PageRank algorithm [4], a graph-based ranking algorithm used in the Google search engine and in a great number of unsupervised applications. A good definition of PageRank and of one of its applications is given in [3]. PageRank assigns a score to each vertex of the graph: the higher the score of a vertex V_a, the greater its importance, and this importance is determined by the vertices to which V_a is connected.

Fig. 1. A simple example of a term graph.
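Before formalizing the score, it may help to see the iterative aggregation in miniature. The following is a minimal sketch of the power-iteration idea on a small, hand-made weighted graph; the graph, its weights, and the iteration bounds are illustrative assumptions of ours, not data from the paper.

```python
import numpy as np

# Hypothetical symmetric weight matrix of a 4-vertex term graph
# (weights are made up for illustration; the diagonal is zero).
W = np.array([[0.0, 0.5, 0.2, 0.1],
              [0.5, 0.0, 0.4, 0.0],
              [0.2, 0.4, 0.0, 0.3],
              [0.1, 0.0, 0.3, 0.0]])

d = 0.85                      # damping factor, the usual value
score = np.ones(W.shape[0])   # initial scores

# Power iteration: each vertex repeatedly collects the weight-normalized
# votes of its neighbours until the scores stabilize.
for _ in range(100):
    out_weight = W.sum(axis=1)            # total weight leaving each vertex
    votes = W.T @ (score / out_weight)    # sum_b (w_ba / sum_c w_bc) * score_b
    new_score = (1 - d) + d * votes
    if np.allclose(new_score, score, atol=1e-10):
        break
    score = new_score

print(score, np.argsort(-score))          # scores and the induced ranking
```

Each pass lets every vertex redistribute its current score along its edges in proportion to their weights, which is exactly the voting mechanism described above.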

In our problem, we have an undirected and weighted graph G = (V, E), where V is the set of vertices and E ⊆ V × V is the set of edges. An edge connecting two vertices V_a, V_b ∈ V carries a weight denoted by w_ab. A simple example of an undirected and weighted graph is reported in Figure 1. PageRank iteratively determines the score of each vertex V_a in the graph as a weighted contribution of the scores assigned to the vertices V_b connected to V_a, as follows:

WP(V_a) = (1 - d) + d \sum_{V_b \ne V_a} \frac{w_{ba}}{\sum_{V_c \ne V_b} w_{bc}} WP(V_b)

where d is a parameter between 0 and 1 (set to 0.85, the usual value). WP is the resulting score vector, whose i-th component is the score associated to vertex V_i. The greater the score, the greater the importance of the vertex according to its similarity with the other vertices to which it is connected. This algorithm is used in many applications, particularly in NLP tasks such as Word Sense Disambiguation [3].

2.2 Application of FRRW in Document Categorization

We denote by D a dataset of n document instances, D = {d_1, d_2, ..., d_n}. D is obtained after a pre-processing step in which stop-words are eliminated and the stem of each word is obtained by applying the Porter Stemming Algorithm. We denote by T = {t_1, t_2, ..., t_k} the set of the k terms present in the documents of D after pre-processing. Each document d_i has a bag-of-words representation and contains a subset T_{d_i} ⊆ T of terms; vice versa, D_{t_i} ⊆ D denotes the set of documents containing term t_i. We construct a graph G in which each vertex corresponds to a term t_i, and with each edge between two vertices t_i and t_j we associate a weight w_{t_i,t_j} that is a similarity measure between the terms. The term similarity could be computed in many ways; we compute it as the fraction of the documents that contain both terms:

w_{ij} = \frac{|D_{t_i} \cap D_{t_j}|}{|D|}

From graph G, a weighted matrix W is computed in which each element w_ij is the weight associated to the edge between terms t_i and t_j. This matrix is given as input to the PageRank algorithm. As an exemplification, the matrix corresponding to the graph of Figure 1 is reported in Figure 2. Notice that the cells on the diagonal of the matrix are empty, since the PageRank algorithm does not use them: for each graph vertex it considers only the contributions of the other vertices. In this manner we obtain a score vector whose i-th component is the score of term t_i. We order the vector components by their value and extract the corresponding ranking. The underlying idea is that the PageRank algorithm selects as best representatives the features most recommended by the other ones, i.e. those that co-occur with them in the same documents. In the next section we present an empirical evaluation of FRRW.
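Putting the two ingredients together - the co-occurrence weights w_ij and the weighted PageRank iteration - a compact sketch of FRRW could look as follows. This is our reading of the method under stated assumptions (the function name, the toy corpus, and the convergence tolerance are illustrative), not the authors' implementation.

```python
import numpy as np

def frrw_rank(docs):
    """Rank terms by FRRW: weighted PageRank on the term co-occurrence graph.

    docs: list of documents, each given as a set of (already stemmed) terms.
    Returns the terms ordered from most to least representative.
    """
    terms = sorted(set().union(*docs))
    idx = {t: i for i, t in enumerate(terms)}
    k, n = len(terms), len(docs)

    # w_ij = |D_ti ∩ D_tj| / |D|: fraction of documents containing both terms.
    W = np.zeros((k, k))
    for doc in docs:
        for a in doc:
            for b in doc:
                if a != b:
                    W[idx[a], idx[b]] += 1.0
    W /= n

    d = 0.85
    score = np.ones(k)
    for _ in range(100):
        out_w = W.sum(axis=1)
        out_w[out_w == 0] = 1.0          # isolated terms: avoid division by zero
        new = (1 - d) + d * (W.T @ (score / out_w))
        if np.allclose(new, score, atol=1e-10):
            break
        score = new

    return [terms[i] for i in np.argsort(-score)]

# Toy corpus (hypothetical): terms co-occurring with many others rank first.
docs = [{"gold", "silver", "mine"},
        {"gold", "silver", "price"},
        {"network", "router", "price"},
        {"network", "router", "gold"}]
print(frrw_rank(docs))
```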

       1      2      3      4      5      6
  1           S_12   S_13   S_14   S_15   S_16
  2    S_21          S_23   S_24   S_25   S_26
  3    S_31   S_32          S_34   S_35   S_36
  4    S_41   S_42   S_43          S_45   S_46
  5    S_51   S_52   S_53   S_54          S_56
  6    S_61   S_62   S_63   S_64   S_65

Fig. 2. Term Similarity Matrix for the graph of Figure 1.

3 Empirical Evaluation

Our experiments verify the validity of the feature selection as in [7], by checking a posteriori that even a simple classifier, such as 1-NN, is able to correctly classify instances when they are represented by the set of features selected in an unsupervised manner. We therefore use the achieved classification accuracy to estimate the quality of the selected feature set: if the selected feature set is more relevant to the target concept, a classifier should achieve better accuracy.

In our experiments we use the 7sector data [1]. From these data we construct two datasets as a double check. Each dataset contains two well-separated conceptual categories: network versus metals (gold and silver) for the first dataset, and software versus gold and silver for the second. The characteristics of each dataset are shown in Figure 3, together with the accuracy obtained by a 1-NN classifier on all the features.

            Class One   Class Two         N. of features   Accuracy
 dataset1   network     gold and silver                    %
 dataset2   software    gold and silver                    %

Fig. 3. Characteristics of the data sets.

We compare algorithm FRRW with two other feature selection algorithms. The first is IGR, based on Information Gain, which in turn outperforms other measures for term selection, such as Mutual Information, χ² and Term Strength, as shown in [6]. The second is a baseline selector, denoted RanS, which performs a simple random selection of the features.

Our experiments are conceived as follows. We run the three feature selection algorithms on each of the two datasets separately and obtain a ranking of the features from each method. From each of these rankings we select the set of the top-ranked i features, with i varying from 1 to 600; in this way we check the ability of the three algorithms to place the best features at the top of the ranking. For each set of i features chosen by the three methods, we project the dataset, let the same 1-NN classifier predict the class of a set of test instances (with 10-fold cross-validation), and store its accuracy. For RanS we repeat the random selection of i features 50 times for each value of i and average the resulting classifier accuracy. Algorithm 1 sketches the procedure we used for this evaluation.

Algorithm 1 Feature Evaluation Framework
 1: This procedure is for the evaluation of unsupervised feature selection algorithms
 2: FRRW[i], IGR[i]: output vectors storing the accuracy of 1-NN on dataset D projected on the top i features selected by FRRW and IGR, respectively
 3: RanS[i]: output vector storing the average accuracy of 1-NN on a random selection of i features
 4: for each data set D do
 5:   for i = 1 to 600 do
 6:     select top i features by FRRW, IGR
 7:     /* project D on the selected features */
 8:     D_FRRW(i) = π_FRRW(i)(D)
 9:     D_IGR(i) = π_IGR(i)(D)
10:     /* determine 1-NN accuracy and store it in the accuracy vectors */
11:     FRRW[i] = 10-fold CV on D_FRRW(i) using 1-NN
12:     IGR[i] = 10-fold CV on D_IGR(i) using 1-NN
13:     /* determine the average accuracy of 1-NN with RanS */
14:     tempaccuracy = 0
15:     for j = 1 to 50 do
16:       D_RanS(i) = π_RanS(i)(D)
17:       tempaccuracy = tempaccuracy + 10-fold CV on D_RanS(i) using 1-NN
18:     end for
19:     RanS[i] = tempaccuracy / 50
20:   end for
21: end for

The accuracy of 1-NN under the feature selection of the three methods is reported in Figures 4 and 5. With FRRW the reachable accuracy is considerably higher (even higher than the accuracy obtained with the entire set of features); in particular, it is higher with FRRW even when only a small number of features is retained. This behaviour fits our expectations: it means that the top-ranked features are also the most characteristic of the target concept, namely those that allow a good classification. On the other hand, as we increase the number i of features, the accuracy decreases (although with FRRW it remains higher than with the other methods). This means that the ranking of the features induced by FRRW correctly places at the top of the list the features that contribute most to a correct classification, while at the bottom of the list remain just those features that are not useful for characterizing the target concept, because they are the noisy ones.
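The evaluation loop of Algorithm 1 maps naturally onto standard tooling. Below is a minimal sketch assuming scikit-learn, a document-term matrix X with class labels y, and a precomputed ranking of feature indices; the helper names and defaults are ours, not from the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def ranking_accuracy(X, y, ranking, max_features=600):
    """FRRW/IGR branch of Algorithm 1: 1-NN accuracy (10-fold CV) on the
    dataset projected on the top-i ranked features, for i = 1..max_features."""
    acc = {}
    for i in range(1, min(max_features, len(ranking)) + 1):
        X_proj = X[:, ranking[:i]]                    # project D on top-i features
        acc[i] = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                 X_proj, y, cv=10).mean()
    return acc

def rans_accuracy(X, y, i, repeats=50, seed=0):
    """RanS branch: average 1-NN accuracy over 50 random selections of i features."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        cols = rng.choice(X.shape[1], size=i, replace=False)
        scores.append(cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                      X[:, cols], y, cv=10).mean())
    return float(np.mean(scores))
```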

4 Conclusion

In this work we investigated the use of PageRank for feature selection. PageRank is a well-known graph-based ranking algorithm that performs a Random Walk through the feature space and ranks the features according to their similarity with the greatest number of other features. We empirically demonstrated that this technique also works well in an unsupervised task, namely when no class is provided to the feature selection algorithm. We have shown that our method can rank features according to their classification utility without considering the class information itself. This interesting result means that our method can infer the intrinsic characteristics of the dataset, which is particularly useful in all those cases in which no class information is provided.

References

1. 7Sector data.
2. Guyon I and Elisseeff A, An introduction to variable and feature selection, JMLR, 3, (2003).
3. Mihalcea R, Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling, HLT/EMNLP05, (2005).
4. Brin S and Page L, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems, 30, (1998).
5. He X, Cai D and Niyogi P, Laplacian score for feature selection, NIPS05, (2005).
6. Yang Y and Pedersen J O, A comparative study on feature selection in text categorization, ICML97, (1997).
7. Zhao Z and Liu H, Semi-supervised feature selection via spectral analysis, SDM07, (2007).

Fig. 4. Accuracy on dataset1 (y-axis: accuracy; x-axis: number of features; curves: FRRW, IGR, RanS).

Fig. 5. Accuracy on dataset2 (same axes and curves).
