Using PageRank in Feature Selection

Dino Ienco, Rosa Meo, and Marco Botta
Dipartimento di Informatica, Università di Torino, Italy

Abstract. Feature selection is an important task in data mining because it reduces the data dimensionality and eliminates noisy variables. Traditionally, feature selection has been applied in supervised scenarios rather than in unsupervised ones. Nowadays, the amount of unsupervised data available on the web is huge, motivating an increasing interest in feature selection for unsupervised data. In this paper we present some results in the domain of document categorization. We use the well-known PageRank algorithm to perform a random walk through the feature space of the documents, which allows us to rank, and subsequently select, the features that best represent the data set. Compared with previous work based on information gain, our method allows classifiers to obtain good accuracy especially when few features are retained.

1 Introduction

Every day we work with a large amount of data, the majority of which is unlabelled; almost all the information on the Internet is not labelled, so being able to treat it with unsupervised tasks has become very important. For instance, we may want to automatically categorize documents while knowing that some of the words can be considered noisy variables. The problem is to select the subset of words that best represents the document set without using any information about the class of the documents. This is a typical feature selection problem, in which the documents take as features the set of terms contained in the whole dataset.

Feature selection is widely recognised as an important task in machine learning and data mining [2]. In high-dimensional datasets, feature selection improves algorithm performance and classification accuracy, since the chance of overfitting increases with the number of features. Furthermore, when the curse of dimensionality emerges - especially when the representation of the objects in the feature space is very sparse - feature selection reduces the degradation of the results of clustering and of distance-based k-NN algorithms.

In the supervised approach to feature selection, the existing methods fall into two families: wrapper methods and filter methods. Wrapper techniques evaluate the features using the learning algorithm that will ultimately be employed. Filter-based approaches most commonly explore the correlations between the features and the class label, assign each feature a score, and then rank the features by that score. Feature selection picks the best k features according to their score, and these are used to represent the dataset. Most of the existing filter methods are supervised.

Data variance might be the simplest unsupervised evaluation of the features: a feature with larger variance has a greater ability to separate the objects of different classes into disjoint regions. Along these lines, some works [5], [7] adopt a Laplacian matrix, which projects the original dataset into a different space with some desired properties, and then search the transformed space for the features that best represent a natural partition of the data.

The difference between supervised and unsupervised feature selection lies in the use of class information to guide the search for the best subset of features. Both can be viewed as selecting the features that are consistent with the concepts represented in the data. In supervised learning the concept is related to class affiliation, while in unsupervised learning it is usually related to the similarity between data instances in relevant portions of the dataset. We believe that these intrinsic structures in the data can be captured in a way similar to how PageRank ranks Web pages: by selecting the features that are most correlated with the majority of the other features in the dataset. These features should represent the relevant portions of the dataset - the dataset representatives - while still allowing the marginal characteristics of the data to be discarded.

In this work we propose to use the PageRank formula to select the best features of a dataset in an unsupervised way. The proposed method selects a subset of the original features that:
- represents the relevant characteristics of the data;
- has the highest probability of co-occurrence with the highest number of other features;
- helps to speed up the processing of the data;
- eliminates the noisy variables.

2 Methods

In this section we describe the base technique of our method and the specific approach that we adopt for unsupervised feature selection. The resulting algorithm is a feature selection/ranking algorithm that we call FRRW (Feature Ranking by Random Walking). Indeed, it is based on Random Walks on a graph whose vertices are the features, connected by weighted edges whose weights depend on how often both features recur in the dataset. The basic idea supporting the adoption of a graph-based ranking algorithm is that of voting or recommendation: when a first vertex is connected to a second vertex by a weighted edge, the first vertex votes for the second one in proportion to the weight of the connecting edge. The higher the sum of the weights a vertex obtains from the other vertices, the higher the importance of that vertex in the graph. Furthermore, the importance of a vertex determines the importance of its votes. Random Walks on graphs are a special case of Markov Chains, in which the Markov Chain itself describes the probability of moving between the graph vertices; in our case, it describes the probability of finding instances in the dataset that are characterized by the features.

A Random Walk searches for the stationary state of the Markov Chain, which assigns to each state a probability: the probability of being in that state after an infinite walk on the graph guided by the transition probabilities. Through the Random Walk on the graph, PageRank determines the stationary state vector by an iterative algorithm, i.e. collectively, by aggregation of the transition probabilities between all the graph vertices. PageRank produces a score for each vector component (according to a formula that we report below) and orders the components by their score value; in conclusion, it finds a ranking between the states. Intuitively, this score is proportional to the overall probability of moving into a state from any other state. In our case, the graph states are the features and the score vector represents the stationary distribution over the feature probabilities; in other terms, it is the overall probability of finding each feature in the dataset together with the other features. The framework is general: it can be adapted to different domains by simply modifying the vertex proximity measure that defines the transition probability between two graph vertices.

2.1 PageRank

Our approach is based on the PageRank algorithm [4], a graph-based ranking algorithm used in the Google search engine and in a great number of unsupervised applications. A good definition of PageRank and of one of its applications is given in [3]. PageRank assigns a score to each vertex of the graph: the higher the score of a vertex V_a, the greater its importance, and this importance is determined by the vertices to which V_a is connected.

Fig. 1. A simple example of a term graph.
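Before formalizing the score, it may help to see the iterative aggregation in miniature. The following is a minimal sketch of the power-iteration idea on a small, hand-made weighted graph; the graph, its weights, and the iteration bounds are illustrative assumptions of ours, not data from the paper.

```python
import numpy as np

# Hypothetical symmetric weight matrix of a 4-vertex term graph
# (weights are made up for illustration; the diagonal is zero).
W = np.array([[0.0, 0.5, 0.2, 0.1],
              [0.5, 0.0, 0.4, 0.0],
              [0.2, 0.4, 0.0, 0.3],
              [0.1, 0.0, 0.3, 0.0]])

d = 0.85                      # damping factor, the usual value
score = np.ones(W.shape[0])   # initial scores

# Power iteration: each vertex repeatedly collects the weight-normalized
# votes of its neighbours until the scores stabilize.
for _ in range(100):
    out_weight = W.sum(axis=1)            # total weight leaving each vertex
    votes = W.T @ (score / out_weight)    # sum_b (w_ba / sum_c w_bc) * score_b
    new_score = (1 - d) + d * votes
    if np.allclose(new_score, score, atol=1e-10):
        break
    score = new_score

print(score, np.argsort(-score))          # scores and the induced ranking
```

Each pass lets every vertex redistribute its current score along its edges in proportion to their weights, which is exactly the voting mechanism described above.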

In our problem, we have an undirected and weighted graph G = (V, E), where V is the set of vertices and E ⊆ V × V is the set of edges. An edge connecting two vertices V_a, V_b ∈ V carries a weight denoted by w_ab. A simple example of an undirected and weighted graph is reported in Figure 1. PageRank iteratively determines the score of each vertex V_a in the graph as a weighted contribution of the scores assigned to the vertices V_b connected to V_a, as follows:

WP(V_a) = (1 - d) + d \sum_{V_b \ne V_a} \frac{w_{ba}}{\sum_{V_c \ne V_b} w_{bc}} WP(V_b)

where d is a parameter between 0 and 1 (set to 0.85, the usual value). WP is the resulting score vector, whose i-th component is the score associated to vertex V_i. The greater the score, the greater the importance of the vertex according to its similarity with the other vertices to which it is connected. This algorithm is used in many applications, particularly in NLP tasks such as Word Sense Disambiguation [3].

2.2 Application of FRRW in Document Categorization

We denote by D a dataset of n document instances, D = {d_1, d_2, ..., d_n}. D is obtained after a pre-processing step in which stop-words are eliminated and the stem of each word is obtained by applying the Porter Stemming Algorithm. We denote by T = {t_1, t_2, ..., t_k} the set of the k terms present in the documents of D after pre-processing. Each document d_i has a bag-of-words representation and contains a subset T_{d_i} ⊆ T of terms; vice versa, D_{t_i} ⊆ D denotes the set of documents containing term t_i. We construct a graph G in which each vertex corresponds to a term t_i, and with each edge between two vertices t_i and t_j we associate a weight w_{t_i,t_j} that is a similarity measure between the terms. The term similarity could be computed in many ways; we compute it as the fraction of the documents that contain both terms:

w_{ij} = \frac{|D_{t_i} \cap D_{t_j}|}{|D|}

From graph G, a weighted matrix W is computed in which each element w_ij is the weight associated to the edge between terms t_i and t_j. This matrix is given as input to the PageRank algorithm. As an exemplification, the matrix corresponding to the graph of Figure 1 is reported in Figure 2. Notice that the cells on the diagonal of the matrix are empty, since the PageRank algorithm does not use them: for each graph vertex it considers only the contributions of the other vertices. In this manner we obtain a score vector whose i-th component is the score of term t_i. We order the vector components by their value and extract the corresponding ranking. The underlying idea is that the PageRank algorithm selects as best representatives the features most recommended by the other ones, i.e. those that co-occur with them in the same documents. In the next section we present an empirical evaluation of FRRW.
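Putting the two ingredients together - the co-occurrence weights w_ij and the weighted PageRank iteration - a compact sketch of FRRW could look as follows. This is our reading of the method under stated assumptions (the function name, the toy corpus, and the convergence tolerance are illustrative), not the authors' implementation.

```python
import numpy as np

def frrw_rank(docs):
    """Rank terms by FRRW: weighted PageRank on the term co-occurrence graph.

    docs: list of documents, each given as a set of (already stemmed) terms.
    Returns the terms ordered from most to least representative.
    """
    terms = sorted(set().union(*docs))
    idx = {t: i for i, t in enumerate(terms)}
    k, n = len(terms), len(docs)

    # w_ij = |D_ti ∩ D_tj| / |D|: fraction of documents containing both terms.
    W = np.zeros((k, k))
    for doc in docs:
        for a in doc:
            for b in doc:
                if a != b:
                    W[idx[a], idx[b]] += 1.0
    W /= n

    d = 0.85
    score = np.ones(k)
    for _ in range(100):
        out_w = W.sum(axis=1)
        out_w[out_w == 0] = 1.0          # isolated terms: avoid division by zero
        new = (1 - d) + d * (W.T @ (score / out_w))
        if np.allclose(new, score, atol=1e-10):
            break
        score = new

    return [terms[i] for i in np.argsort(-score)]

# Toy corpus (hypothetical): terms co-occurring with many others rank first.
docs = [{"gold", "silver", "mine"},
        {"gold", "silver", "price"},
        {"network", "router", "price"},
        {"network", "router", "gold"}]
print(frrw_rank(docs))
```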

       1      2      3      4      5      6
  1           S_12   S_13   S_14   S_15   S_16
  2    S_21          S_23   S_24   S_25   S_26
  3    S_31   S_32          S_34   S_35   S_36
  4    S_41   S_42   S_43          S_45   S_46
  5    S_51   S_52   S_53   S_54          S_56
  6    S_61   S_62   S_63   S_64   S_65

Fig. 2. Term Similarity Matrix for the graph of Figure 1.

3 Empirical Evaluation

Our experiments verify the validity of the feature selection as in [7], by checking a posteriori that even a simple classifier, such as 1-NN, is able to correctly classify instances when they are represented by the set of features selected in an unsupervised manner. We therefore use the achieved classification accuracy to estimate the quality of the selected feature set: if the selected feature set is more relevant to the target concept, a classifier should achieve better accuracy.

In our experiments we use the 7sector data [1]. From these data we construct two datasets as a double check. Each dataset contains two well-separated conceptual categories: network versus metals (gold and silver) for the first dataset, and software versus gold and silver for the second. The characteristics of each dataset are shown in Figure 3, together with the accuracy obtained by a 1-NN classifier on all the features.

            Class One   Class Two         N. of features   Accuracy
 dataset1   network     gold and silver                    %
 dataset2   software    gold and silver                    %

Fig. 3. Characteristics of the data sets.

We compare algorithm FRRW with two other feature selection algorithms. The first is IGR, based on Information Gain, which in turn outperforms other measures for term selection, such as Mutual Information, χ² and Term Strength, as shown in [6]. The second is a baseline selector, denoted RanS, which performs a simple random selection of the features.

Our experiments are conceived as follows. We run the three feature selection algorithms on each of the two datasets separately and obtain a ranking of the features from each method. From each of these rankings we select the set of the top-ranked i features, with i varying from 1 to 600; in this way we check the ability of the three algorithms to place the best features at the top of the ranking. For each set of i features chosen by the three methods, we project the dataset, let the same 1-NN classifier predict the class of a set of test instances (with 10-fold cross-validation), and store its accuracy. For RanS we repeat the random selection of i features 50 times for each value of i and average the resulting classifier accuracy. Algorithm 1 sketches the procedure we used for this evaluation.

Algorithm 1 Feature Evaluation Framework
 1: This procedure is for the evaluation of unsupervised feature selection algorithms
 2: FRRW[i], IGR[i]: output vectors storing the accuracy of 1-NN on dataset D projected on the top i features selected by FRRW and IGR, respectively
 3: RanS[i]: output vector storing the average accuracy of 1-NN on a random selection of i features
 4: for each data set D do
 5:   for i = 1 to 600 do
 6:     select top i features by FRRW, IGR
 7:     /* project D on the selected features */
 8:     D_FRRW(i) = π_FRRW(i)(D)
 9:     D_IGR(i) = π_IGR(i)(D)
10:     /* determine 1-NN accuracy and store it in the accuracy vectors */
11:     FRRW[i] = 10-fold CV on D_FRRW(i) using 1-NN
12:     IGR[i] = 10-fold CV on D_IGR(i) using 1-NN
13:     /* determine the average accuracy of 1-NN with RanS */
14:     tempaccuracy = 0
15:     for j = 1 to 50 do
16:       D_RanS(i) = π_RanS(i)(D)
17:       tempaccuracy = tempaccuracy + 10-fold CV on D_RanS(i) using 1-NN
18:     end for
19:     RanS[i] = tempaccuracy / 50
20:   end for
21: end for

The accuracy of 1-NN under the feature selection of the three methods is reported in Figures 4 and 5. With FRRW the reachable accuracy is considerably higher (even higher than the accuracy obtained with the entire set of features); in particular, it is higher with FRRW even when only a small number of features is retained. This behaviour fits our expectations: it means that the top-ranked features are also the most characteristic of the target concept, namely those that allow a good classification. On the other hand, as we increase the number i of features, the accuracy decreases (although with FRRW it remains higher than with the other methods). This means that the ranking of the features induced by FRRW correctly places at the top of the list the features that contribute most to a correct classification, while at the bottom of the list remain just those features that are not useful for characterizing the target concept, because they are the noisy ones.
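The evaluation loop of Algorithm 1 maps naturally onto standard tooling. Below is a minimal sketch assuming scikit-learn, a document-term matrix X with class labels y, and a precomputed ranking of feature indices; the helper names and defaults are ours, not from the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def ranking_accuracy(X, y, ranking, max_features=600):
    """FRRW/IGR branch of Algorithm 1: 1-NN accuracy (10-fold CV) on the
    dataset projected on the top-i ranked features, for i = 1..max_features."""
    acc = {}
    for i in range(1, min(max_features, len(ranking)) + 1):
        X_proj = X[:, ranking[:i]]                    # project D on top-i features
        acc[i] = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                 X_proj, y, cv=10).mean()
    return acc

def rans_accuracy(X, y, i, repeats=50, seed=0):
    """RanS branch: average 1-NN accuracy over 50 random selections of i features."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        cols = rng.choice(X.shape[1], size=i, replace=False)
        scores.append(cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                      X[:, cols], y, cv=10).mean())
    return float(np.mean(scores))
```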

4 Conclusion

In this work we investigated the use of PageRank for feature selection. PageRank is a well-known graph-based ranking algorithm that performs a Random Walk through the feature space and ranks the features according to their similarity with the greatest number of other features. We empirically demonstrated that this technique also works well in an unsupervised task, namely when no class is provided to the feature selection algorithm. We have shown that our method can rank features according to their classification utility without considering the class information itself. This interesting result means that our method can infer the intrinsic characteristics of the dataset, which is particularly useful in all those cases in which no class information is provided.

References

1. 7Sector data.
2. Guyon I and Elisseeff A, An introduction to variable and feature selection, JMLR, 3, (2003).
3. Mihalcea R, Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling, HLT/EMNLP05, (2005).
4. Brin S and Page L, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems, 30, (1998).
5. He X, Cai D and Niyogi P, Laplacian score for feature selection, NIPS05, (2005).
6. Yang Y and Pedersen J O, A comparative study on feature selection in text categorization, ICML97, (1997).
7. Zhao Z and Liu H, Semi-supervised feature selection via spectral analysis, SDM07, (2007).

Fig. 4. Accuracy on dataset1 (y-axis: accuracy; x-axis: number of features; curves: FRRW, IGR, RanS).

Fig. 5. Accuracy on dataset2 (same axes and curves).
