Using PageRank in Feature Selection


Dino Ienco, Rosa Meo, and Marco Botta
Dipartimento di Informatica, Università di Torino, Italy

Abstract. Feature selection is an important task in data mining because it reduces the data dimensionality and eliminates noisy variables. Traditionally, feature selection has been applied in supervised scenarios rather than in unsupervised ones. Nowadays, the amount of unsupervised data available on the web is huge, which motivates an increasing interest in feature selection for unsupervised data. In this paper we present some results in the domain of document categorization. We use the well-known PageRank algorithm to perform a random walk through the feature space of the documents. This allows us to rank, and subsequently choose, those features that best represent the data set. When compared with previous work based on information gain, our method allows classifiers to obtain good accuracy, especially when few features are retained.

1 Introduction

Every day we work with a large amount of data, the majority of which is unlabelled. Almost all the information on the Internet is not labelled, so being able to treat it with unsupervised tasks has become very important. For instance, we would like to automatically categorize documents, and we know that some of the words can be considered noisy variables. The problem is to select the subset of words that best represents the document set without using information about the class of the documents. This is a typical feature selection problem, in which documents take as features the set of terms contained in the whole dataset.

Feature selection is widely recognised as an important task in machine learning and data mining [3]. In high-dimensional datasets, feature selection improves algorithm performance and classification accuracy, since the chance of overfitting increases with the number of features. Furthermore, when the curse of dimensionality emerges - especially when the representation of the objects in the feature space is very sparse - feature selection reduces the degradation of the results of clustering and distance-based k-NN algorithms.

In the supervised approach to feature selection, the existing methods fall into two families: wrapper methods and filter methods. Wrapper techniques evaluate the features using the learning algorithm that will ultimately be employed. Filter-based approaches most commonly explore correlations between the features and the class label, assign each feature a score, and then rank the features by that score. Feature selection picks the best k features according to their score, and these are used to represent the dataset. Most of the existing filter methods are supervised.

Data variance is perhaps the simplest unsupervised criterion for evaluating features: the variance of a feature reflects its ability to separate the objects of different classes into disjoint regions. Along this line, some works [8], [10] adopt a Laplacian matrix, which projects the original dataset into a different space with some desired properties; they then search, in the transformed space, for the features that best represent a natural partition of the data.

The difference between supervised and unsupervised feature selection lies in the use of class information to guide the search for the best subset of features. Both approaches can be viewed as the selection of features that are consistent with the concepts represented in the data. In supervised learning the concept is related to class membership, while in unsupervised learning it is usually related to the similarity between data instances in relevant portions of the dataset. We believe that these intrinsic structures in the data can be captured in a way similar to how PageRank ranks Web pages: by selecting the features that are most correlated with the majority of the other features in the dataset. These features should represent the relevant portions of the dataset - the dataset representatives - while still allowing us to discard the marginal characteristics of the data.

In this work we propose to use the PageRank formula to select the best features of a dataset in an unsupervised way. With the proposed method we are able to select a subset of the original features that:

- represents the relevant characteristics of the data;
- has the highest probability of co-occurrence with the highest number of other features;
- helps to speed up the processing of the data;
- eliminates the noisy variables.

2 Methods

In this section we describe the base technique of our method and the specific approach that we adopt for unsupervised feature selection. The resulting algorithm is a feature selection/ranking algorithm that we call FRRW (Feature Ranking by Random Walking). It is based on Random Walks [5] on a graph whose vertices are the features, connected by weighted edges that depend on how often the two features co-occur in the dataset. The basic idea that supports the adoption of a graph-based ranking algorithm is that of voting or recommendation: when a first vertex is connected to a second vertex by a weighted edge, the first vertex votes for the second one proportionally to the weight of the edge connecting them. The higher the sum of the weights that the second vertex receives from the other vertices, the higher the importance of that vertex in the graph. Furthermore, the importance of a vertex determines the importance of its votes.

Random Walks on graphs are a special case of Markov Chains, where the Markov Chain describes the probability of moving between the graph vertices; in our case, it describes the probability of finding instances in the dataset that are characterized by both features. The Random Walk converges to the stationary state of the Markov Chain, which assigns a probability to each state.

This stationary probability is the probability of being in that state after an infinite walk on the graph, guided by the transition probabilities. Through the Random Walk on the graph, PageRank determines the stationary state vector by an iterative algorithm, i.e. collectively, by aggregation of the transition probabilities between all graph vertices. PageRank produces a score for each vector component (according to formula (1), discussed in Section 2.1) and orders the components by score value; as a result it produces a ranking of the states. Intuitively, this score is proportional to the overall probability of moving into a state from any other state. In our case the graph states are the features, and the score vector represents the stationary distribution over the features; in other terms, it is the overall probability of finding each feature in an instance together with the other features.

The framework is general and can be adapted to different domains: to use the proposed method in another domain we only need to build the graph by assigning each feature of the domain to a vertex and determining the weights on the edges through a suitable proximity measure between the features.

2.1 PageRank

Our approach is based on the PageRank algorithm [7], a graph-based ranking algorithm used in the Google search engine and in a large number of unsupervised applications. A good description of PageRank and of one of its applications is given in [6]. PageRank assigns a score to every vertex of the graph: the higher the score of a vertex V_a, the greater its importance, and the importance is determined by the vertices to which V_a is connected. In our problem we have an undirected, weighted graph G = (V, E), where V is the set of vertices and E ⊆ V × V is the set of edges; an edge connecting vertices V_a, V_b ∈ V carries a weight denoted by w_ab. A simple example of an undirected, weighted graph is reported in Figure 1. PageRank iteratively determines the score of each vertex V_a as a weighted contribution of the scores of the vertices V_b connected to V_a, as follows:

WP(V_a) = (1 - d) + d \sum_{V_b \in adj(V_a)} \frac{w_{ba}}{\sum_{V_c \in adj(V_b)} w_{bc}} \, WP(V_b)    (1)

where d is a parameter between 0 and 1 (set to 0.85, the usual value). WP is the resulting score vector, whose i-th component is the score associated with vertex V_i. The greater the score, the greater the importance of the vertex according to its similarity with the other vertices to which it is connected. This algorithm is used in many applications, particularly in NLP tasks such as Word Sense Disambiguation [6].

Fig. 1. A simple example of a term graph.
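To make the iteration of formula (1) concrete, the following is a minimal sketch, in Python with NumPy, of how the weighted PageRank scores could be computed from a symmetric weight matrix; the function name, the convergence tolerance, and the iteration cap are illustrative assumptions of ours, not details given in the paper.

```python
import numpy as np

def weighted_pagerank(W, d=0.85, n_iter=100, tol=1e-8):
    """Iterate formula (1): WP(a) = (1-d) + d * sum_b [w_ba / sum_c w_bc] * WP(b).

    W is a symmetric (k x k) weight matrix with a zero diagonal, where
    W[b, a] is the weight of the edge between features b and a.
    """
    k = W.shape[0]
    out_strength = W.sum(axis=1)            # sum_c w_bc for every vertex b
    out_strength[out_strength == 0] = 1.0   # guard against isolated vertices
    scores = np.ones(k)                     # initial score vector WP
    for _ in range(n_iter):
        # each vertex b spreads its score to its neighbours, weighted by w_ba
        new_scores = (1 - d) + d * (W.T @ (scores / out_strength))
        if np.abs(new_scores - scores).max() < tol:
            scores = new_scores
            break
        scores = new_scores
    return scores
```

Ranking the features then amounts to sorting the vertices by decreasing score, e.g. `ranking = np.argsort(-scores)`.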

2.2 Application of FRRW in Document Categorization

We denote by D a dataset of n document instances, D = {d_1, d_2, ..., d_n}. D is obtained after a pre-processing step in which stop-words are eliminated and words are reduced to their stems by the Porter Stemming Algorithm [4]. We denote by T = {t_1, t_2, ..., t_k} the set of the k terms that are present in the documents of D after pre-processing. Each document d_i has a bag-of-words representation and contains a subset T_{d_i} ⊆ T of terms; vice versa, D_{t_i} ⊆ D denotes the set of documents containing term t_i.

We construct a graph G in which each vertex corresponds to a term t_i, and with each edge between two vertices t_i and t_j we associate a weight w_{t_i,t_j} that is a similarity measure between the terms. The term similarity could be computed in many ways; we compute it as the fraction of the documents that contain both terms:

w_{ij} = \frac{|D_{t_i} \cap D_{t_j}|}{|D|}

From graph G we compute a matrix W of the edge weights, where the element w_{ij} at row i and column j of W is the weight associated with the edge between terms t_i and t_j. This matrix is given as input to the PageRank algorithm. The diagonal cells of the matrix are left empty, since PageRank does not use them: for each vertex it considers only the contributions of the other vertices. In this manner we obtain a score vector whose i-th component is the score of term t_i; we order the vector components by their value and extract the corresponding ranking. The underlying idea is that the PageRank algorithm selects as best representatives the features that are most recommended by the other ones, because they co-occur with them in the same documents. In the next section we present an empirical evaluation of FRRW.
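Before turning to the evaluation, here is a minimal sketch, again in Python with NumPy, of how the co-occurrence weight matrix W could be assembled from the pre-processed documents; the function and variable names are illustrative assumptions rather than an implementation taken from the paper.

```python
import numpy as np

def cooccurrence_weights(docs, terms):
    """Build W with w_ij = |D_ti intersect D_tj| / |D| and an empty diagonal.

    `docs` is a list of pre-processed documents (each an iterable of terms)
    and `terms` is the vocabulary T extracted from them.
    """
    index = {t: i for i, t in enumerate(terms)}
    k, n = len(terms), len(docs)
    # binary term-document incidence: X[i, j] = 1 if term i occurs in document j
    X = np.zeros((k, n))
    for j, doc in enumerate(docs):
        for t in set(doc):
            if t in index:
                X[index[t], j] = 1.0
    W = (X @ X.T) / n          # number of shared documents, normalised by |D|
    np.fill_diagonal(W, 0.0)   # PageRank ignores the diagonal
    return W
```

The matrix returned by this sketch can be passed directly to a weighted PageRank iteration such as the one outlined after formula (1).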

3 Empirical Evaluation

Our experiments verify the validity of the feature selection, as in [10], by checking a posteriori that even a simple classifier, such as 1-NN, is able to classify instances correctly when they are represented by the set of features selected in an unsupervised manner. We therefore use the achieved classification accuracy to estimate the quality of the selected feature set: if the selected feature set is more relevant to the target concept, a classifier should achieve better accuracy.

In our experiments we use the 7sector data [2], a subset of the company data described in [1]. It contains web pages collected by a crawler; the pages concern the following seven top-level economic sectors from a hierarchy published by Marketguide:

- basic materials sector
- energy sector
- financial sector
- healthcare sector
- technology sector
- transportation sector
- utilities sector

From these data we construct two datasets as a double check. Each dataset contains two well-separated conceptual categories: network and metals (gold and silver) for the former dataset, and software and gold and silver for the latter one. The characteristics of each dataset are shown in Figure 2, together with the accuracy obtained by a 1-NN classifier on all the features.

           Class One   Class Two         N. of features   Accuracy
dataset1   network     gold and silver                           %
dataset2   software    gold and silver                           %

Fig. 2. Characteristics of the data sets

We compare FRRW with two other feature selection algorithms. The first is IGR [9], based on Information Gain, which in turn outperforms other measures for term selection, such as Mutual Information, χ2 and Term Strength, as shown in [9]. The second is a baseline selector, denoted RanS, a simple random selection of the features that we introduce specifically for this work.

Our experiments are conceived as follows. We run the three feature selection algorithms on each of the two datasets separately and obtain a ranking of the features from each method. From each of these rankings we select the set of the top-ranked i features, with i varying from 1 to 600; in this way we check the ability of the three algorithms to place the best features at the top of the ranking. For each set of i features chosen by the three methods, we project the dataset and let the same 1-NN classifier predict the class of a set of test instances (with 10-fold cross-validation), storing its accuracy.
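The following is a minimal sketch of this accuracy loop for a single ranking, written with scikit-learn; it is an illustrative reconstruction of the protocol just described (and formalised in Algorithm 1 below), with function and variable names that are assumptions of ours. The RanS baseline would simply repeat the call on 50 random subsets of i features and average the results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def accuracy_curve(X, y, ranking, max_features=600):
    """Accuracy of 1-NN (10-fold CV) on the dataset projected on the top-i features.

    X is the document-term matrix, y the class labels (used only for the
    a-posteriori evaluation, not for the selection), and `ranking` the feature
    indices ordered by the selector (e.g. FRRW or IGR).
    """
    knn = KNeighborsClassifier(n_neighbors=1)
    accuracies = []
    for i in range(1, max_features + 1):
        X_proj = X[:, ranking[:i]]   # project D on the top-i features
        accuracies.append(cross_val_score(knn, X_proj, y, cv=10).mean())
    return np.array(accuracies)
```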

For RanS we perform the random selection of i features 50 times for each value of i and average the resulting classifier accuracy. Algorithm 1 sketches the procedure used for this evaluation.

Algorithm 1 Feature Evaluation Framework
1: This procedure evaluates unsupervised feature selection algorithms
2: FRRW(i), IGR(i): output vectors storing the accuracy of 1-NN on dataset D projected on the top i features selected by FRRW and IGR respectively
3: RanS(i): output vector storing the average accuracy of 1-NN on a random selection of i features
4: for each dataset D do
5:   for i = 1 to 600 do
6:     select the top i features by FRRW and IGR
7:     /* project D on the selected features */
8:     D_FRRW(i) = Π_FRRW(i)(D)
9:     D_IGR(i) = Π_IGR(i)(D)
10:    /* determine 1-NN accuracy and store it in the accuracy vectors */
11:    FRRW[i] = 10-fold CV on D_FRRW(i) using 1-NN
12:    IGR[i] = 10-fold CV on D_IGR(i) using 1-NN
13:    /* determine the average accuracy of 1-NN with RanS */
14:    tempAccuracy = 0
15:    for j = 1 to 50 do
16:      D_RanS(i) = Π_RanS(i)(D)
17:      tempAccuracy = tempAccuracy + 10-fold CV on D_RanS(i) using 1-NN
18:    end for
19:    RanS[i] = tempAccuracy / 50
20:  end for
21: end for

The accuracy of 1-NN resulting from the feature selection of the three methods is reported in Figure 3 and Figure 4. In these figures we plot the accuracy achieved by the classifier on the instances of dataset 1 and dataset 2 respectively, where the instances are represented by a subset of the features selected by each of the three methods: the x axis reports the number of selected features and the y axis the accuracy obtained by the classifier.

On both datasets our feature selection method induces the same behaviour in the classifier. In the first part of the curve the accuracy grows up to a certain level; after this maximum, the accuracy decreases. This means that the most useful features are those at the top of the ranking: once the set of selected features stops containing only these good features and starts to include the remaining ones, the latter introduce noise into the classification task, which explains the observed degradation of classification accuracy. With the other two methods the accuracy is not only generally lower, but also less stable as the number of features increases. It is clear that with FRRW the reachable accuracy is considerably higher (even higher than the accuracy obtained with the entire set of features), and it is higher with FRRW even when only a small number of features is considered.

This behaviour perfectly fits our expectations: it means that the top-ranked features are also the most characteristic of the target concept, namely those that allow a good classification. On the other hand, as we increase the number i of features, the accuracy decreases (even if with FRRW it remains higher than with the other methods). This means that the ranking induced by FRRW correctly places at the top of the list the features that contribute most to a correct classification, while at the bottom of the list there remain just those features that are not useful for characterizing the target concept, because they are the noisy ones.

4 Conclusion

In this work we investigate the use of PageRank for feature selection. PageRank is a well-known graph-based ranking algorithm that performs a Random Walk through the feature space and ranks the features according to their similarity with the greatest number of other features. We empirically demonstrate that this technique also works well in an unsupervised task (namely, when no class is provided to the feature selection algorithm). We have shown that our method can rank features according to their classification utility without considering the class information itself. This interesting result means that our method can infer the intrinsic characteristics of the dataset, which is particularly useful in all those cases in which no class information is provided. As future work we intend to investigate this feature selection method for unsupervised learning tasks such as clustering.

References

1. Baker L. D. and McCallum A. K., Distributional clustering of words for text classification, SIGIR98, (1998).
2. 7Sector data.
3. Guyon I. and Elisseeff A., An introduction to variable and feature selection, JMLR, 3, (2003).
4. Porter M. F., An algorithm for suffix stripping, Program, (1980).
5. Norris J., Markov Chains, Cambridge University Press.
6. Mihalcea R., Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling, HLT/EMNLP05, (2005).
7. Brin S. and Page L., The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems, 30, (1998).
8. Cai D., He X. and Niyogi P., Laplacian score for feature selection, NIPS05, (2005).
9. Yang Y. and Pedersen J. O., A comparative study on feature selection in text categorization, ICML97, (1997).
10. Zhao Z. and Liu H., Semi-supervised feature selection via spectral analysis, SDM07, (2007).

Fig. 3. Accuracy on dataset1 (accuracy vs. number of features, for FRRW, IGR and RanS).

Fig. 4. Accuracy on dataset2 (accuracy vs. number of features, for FRRW, IGR and RanS).
