Using PageRank in Feature Selection

Dino Ienco, Rosa Meo, and Marco Botta
Dipartimento di Informatica, Università di Torino, Italy
{ienco,meo,botta}@di.unito.it

Abstract. Feature selection is an important task in data mining because it reduces the data dimensionality and eliminates noisy variables. Traditionally, feature selection has been applied in supervised scenarios rather than in unsupervised ones. Nowadays, the amount of unsupervised data available on the web is huge, motivating an increasing interest in feature selection for unsupervised data. In this paper we present some results in the domain of document categorization. We use the well-known PageRank algorithm to perform a random walk through the feature space of the documents, which allows us to rank and subsequently choose the features that best represent the data set. When compared with previous work based on information gain, our method allows classifiers to obtain good accuracy, especially when few features are retained.

1 Introduction

Every day we work with a large amount of data, the majority of which is unlabelled: almost all the information on the Internet is not labelled. Being able to treat it with unsupervised tasks has therefore become very important. For instance, we would like to automatically categorize documents, and we know that we can consider some of the words as noisy variables. The problem is to select the subset of words that best represents the document set without using information about the class of the documents. This is a typical feature selection problem, in which the documents take as features the set of terms contained in the whole dataset. Feature selection is widely recognised as an important task in machine learning and data mining [3]. In high-dimensional datasets, feature selection improves algorithm performance and classification accuracy, since the chance of overfitting increases with the number of features. Furthermore, when the curse of dimensionality emerges - especially when the representation of the objects in the feature space is very sparse - feature selection reduces the degradation of the results of clustering and distance-based k-NN algorithms. In the supervised approach to feature selection, the existing methods can be classified into two families: wrapper methods and filter methods. Wrapper techniques evaluate the features using the learning algorithm that will ultimately be employed. Filter-based approaches most commonly explore correlations between the features and the class label, assign to each feature a score, and then rank the features with respect to that score. Feature selection then picks the best k features according to their score, and these will be used to represent the dataset. Most of the existing filter methods are supervised.
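To make the filter scheme concrete, the following is a minimal sketch of a supervised filter selector in Python; it is our illustration, not code from the paper. Features are scored against the class label with scikit-learn's mutual_info_classif, ranked, and the top k are kept; the synthetic data and the choice of k are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy labelled data (assumed): 100 instances, 50 features (e.g. term counts).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 50)).astype(float)
y = rng.integers(0, 2, size=100)

# Filter approach: score every feature against the class label,
# rank the features by score, keep the top k.
k = 10
selector = SelectKBest(score_func=mutual_info_classif, k=k)
X_reduced = selector.fit_transform(X, y)

top_features = np.argsort(selector.scores_)[::-1][:k]
print("top-k feature indices:", top_features)
print("reduced shape:", X_reduced.shape)
```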
Data variance might be the simplest unsupervised evaluation of the features: a feature with greater variance in the dataset has a greater ability to separate the objects of different classes into disjoint regions. Along these lines, some works [8], [10] adopt a Laplacian matrix, which projects the original dataset into a different space with some desired properties; they then search, in the transformed space, for the features that best represent a natural partition of the data. The difference between supervised and unsupervised feature selection lies in the use of information on the class to guide the search for the best subset of features. Both methods can be viewed as a selection of the features that are consistent with the concepts represented in the data. In supervised learning the concept is related to the class affiliation, while in unsupervised learning it is usually related to the similarity between data instances in relevant portions of the dataset. We believe that these intrinsic structures in the data can be captured in a way similar to that in which PageRank ranks Web pages: by selecting the features that are most correlated with the majority of the other features in the dataset. These features should represent the relevant portions of the dataset - the dataset representatives - while still allowing us to discard the marginal characteristics of the data. In this work we propose to use the PageRank formula to select the best features of a dataset in an unsupervised way. With the proposed method we are able to select a subset of the original features that:

- represents the relevant characteristics of the data;
- has the highest probability of co-occurrence with the highest number of other features;
- helps to speed up the processing of the data;
- eliminates the noisy variables.

2 Methods

In this section we describe the base technique of our method and the specific approach that we adopt in the case of unsupervised feature selection. The resulting algorithm is a feature selection/ranking algorithm that we call FRRW (Feature Ranking by Random Walking). It is based on Random Walks [5] on a graph whose vertices are the features, connected by weighted edges whose weights depend on how often the two features co-occur in the dataset. The basic idea supporting the adoption of a graph-based ranking algorithm is that of voting or recommendation: when a first vertex is connected to a second vertex by a weighted edge, the first vertex basically votes for the second one in proportion to the weight of the edge connecting them. The higher the sum of the weights a vertex receives from the other vertices, the higher the importance of that vertex in the graph. Furthermore, the importance of a vertex determines the importance of its votes. Random Walks on graphs are a special case of Markov Chains, in which the Markov Chain itself describes the probability of moving between the graph vertices; in our case, it describes the probability of finding instances in the dataset in which both features occur. Random Walks search for the stationary state of the Markov Chain, which assigns to each state a probability: the probability of being in that state after an infinite walk on the graph guided by the transition probabilities.
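As an illustration of this stationary state (our sketch, not part of the paper), the following Python fragment computes the stationary distribution of a small Markov chain by power iteration, i.e. by repeatedly applying the transition probabilities until the distribution stops changing; the transition matrix is an arbitrary example.

```python
import numpy as np

# Row-stochastic transition matrix of a 3-state Markov chain:
# P[i, j] is the probability of moving from state i to state j.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.5, 0.3, 0.2]])

# Power iteration: propagate a distribution through P until it
# no longer changes; the fixed point pi satisfies pi = pi P.
pi = np.full(3, 1.0 / 3.0)
for _ in range(1000):
    new_pi = pi @ P
    if np.abs(new_pi - pi).sum() < 1e-12:
        break
    pi = new_pi

print(pi)  # stationary distribution: non-negative, sums to 1
```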
Through the Random Walk on the graph, PageRank determines the stationary state vector by an essentially iterative algorithm, i.e. collectively, by aggregation of the transition probabilities between all the graph vertices. PageRank produces a score for each vector component (according to formula (1), which we discuss in Section 2.1) and orders the components by score value, thus finding a ranking of the states. Intuitively, this score is proportional to the overall probability of moving into a state from any other state. In our case, the graph states are the features, and the score vector represents the stationary distribution over the feature probabilities; in other terms, it is the overall probability of finding, in an instance, each feature together with other features. The framework is general and can be adapted to different domains: to apply the proposed method in a different domain, one only needs to build the graph by assigning each feature of the domain to a vertex and by computing the edge weights with a suitable proximity measure between the features.

2.1 PageRank

Our approach is based on the PageRank algorithm [7], a graph-based ranking algorithm already used in the Google search engine and in a great number of unsupervised applications. A good definition of PageRank and of one of its applications is given in [6]. PageRank assigns a score to every vertex of the graph: the higher the score of a vertex V_a, the greater its importance. The importance is determined by the vertices to which V_a is connected. In our problem, we have an undirected and weighted graph G = (V, E), where V is the set of vertices and E ⊆ V × V is the set of edges. An edge connecting two vertices V_a, V_b ∈ V carries a weight denoted by w_ab. A simple example of an undirected and weighted graph is reported in Figure 1. PageRank iteratively determines the score of each vertex V_a in the graph as a weighted contribution of the scores assigned to the vertices V_b connected to V_a, as follows:

WP(V_a) = (1 - d) + d \sum_{V_b \in adj(V_a)} \frac{w_{ba}}{\sum_{V_c \in adj(V_b)} w_{bc}} WP(V_b)    (1)

where d is a parameter between 0 and 1 (set to 0.85, the usual value). WP is the resulting score vector, whose i-th component is the score associated to vertex V_i. The greater the score, the greater the importance of the vertex according to its similarity with the other vertices to which it is connected. This algorithm is used in many applications, particularly in NLP tasks such as Word Sense Disambiguation [6].
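For concreteness, a minimal implementation of formula (1) might look as follows. This is our sketch rather than the authors' code: it assumes the edge weights are stored in a dense symmetric matrix W with a zero diagonal, and all names are ours.

```python
import numpy as np

def weighted_pagerank(W, d=0.85, max_iter=100, tol=1e-8):
    """Iterate formula (1): WP(V_a) = (1 - d) +
    d * sum over neighbours V_b of (w_ba / sum_c w_bc) * WP(V_b),
    on a symmetric weight matrix W whose diagonal is zero."""
    n = W.shape[0]
    out_strength = W.sum(axis=1)           # sum_c w_bc for each vertex b
    out_strength[out_strength == 0] = 1.0  # guard isolated vertices
    wp = np.ones(n)                        # initial scores
    for _ in range(max_iter):
        # Each vertex b spreads its score proportionally to w_ba:
        # new_wp[a] = (1 - d) + d * sum_b (W[b, a] / out_strength[b]) * wp[b]
        new_wp = (1 - d) + d * (W / out_strength[:, None]).T @ wp
        if np.abs(new_wp - wp).max() < tol:
            break
        wp = new_wp
    return wp
```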
Fig. 1. A simple example of a term graph.

2.2 Application of FRRW in Document Categorization

We denote by D a dataset with n document instances, D = {d_1, d_2, ..., d_n}. D is obtained after a pre-processing step in which stop-words are eliminated and the stem of the words is obtained by application of the Porter Stemming Algorithm [4]. We denote by T = {t_1, t_2, ..., t_k} the set of the k terms that are present in the documents of D after pre-processing. Each document d_i has a bag-of-words representation and contains a subset T_di ⊆ T of terms; vice versa, D_ti ⊆ D denotes the set of documents containing term t_i. We construct a graph G in which each vertex corresponds to a term t_i, and with each edge between two vertices t_i and t_j we associate a weight w_{t_i,t_j} that is a similarity measure between the terms. The similarity between terms could be computed in many ways; we compute it as the fraction of the documents that contain both terms:

w_{ij} = \frac{|D_{t_i} \cap D_{t_j}|}{|D|}

From the graph G, a matrix W of the weights at the graph edges is computed, where each element w_ij, at row i and column j of W, corresponds to the weight associated to the edge between terms t_i and t_j. This matrix is given as input to the PageRank algorithm. The cells on the diagonal of the matrix are left empty, since the PageRank algorithm does not use them: for each graph vertex it considers only the contribution of the other vertices. In this manner we obtain a score vector whose i-th component is the score of term t_i. We order the vector components by their value and extract the corresponding ranking. The underlying idea is that the PageRank algorithm selects as best representatives the features that are most recommended by the other ones, since they co-occur with them in the same documents. In the next section we present an empirical evaluation of FRRW.
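Putting the pieces together, the FRRW pipeline described in this section could be rendered as follows. This is our illustration: it reuses the weighted_pagerank function sketched in Section 2.1 above, and the small synthetic binary document-term matrix is an assumption standing in for a real pre-processed corpus.

```python
import numpy as np

# X: binary document-term matrix, X[d, t] = 1 iff term t occurs in document d
# (synthetic stand-in for a pre-processed corpus: 200 documents, 30 terms).
rng = np.random.default_rng(0)
X = (rng.random((200, 30)) < 0.2).astype(float)
n_docs = X.shape[0]

# w_ij = |D_ti ∩ D_tj| / |D|: fraction of documents containing both terms.
W = (X.T @ X) / n_docs
np.fill_diagonal(W, 0.0)  # the diagonal is not used by PageRank

scores = weighted_pagerank(W)        # sketched in Section 2.1
ranking = np.argsort(scores)[::-1]   # terms ordered by decreasing score
top_k = ranking[:10]                 # top-ranked features to retain
print("top-ranked term indices:", top_k)
```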
3 Empirical Evaluation

Our experiments verify the validity of the feature selection, as in [10], by checking a posteriori that even a simple classifier, such as 1-NN, is able to correctly classify the instances of the data when they are represented by the set of features selected in an unsupervised manner. We thus use the achieved classification accuracy to estimate the quality of the selected feature set: if the selected feature set is more relevant to the target concept, a classifier should achieve a better accuracy. In our experiments we use the 7sector data [2], a subset of the company data described in [1]. It contains web pages collected by a crawler, concerning the following seven top-level economic sectors from a hierarchy published by Marketguide: basic materials, energy, financial, healthcare, technology, transportation, and utilities. From these data we construct two datasets for a double check. Each dataset contains two well-separated conceptual categories: network and a metals category (gold and silver) for the former dataset, software and gold and silver for the latter. The characteristics of each dataset are shown in Figure 2, together with the accuracy obtained by a 1-NN classifier on all the features.

           Class One   Class Two         N. of features   Accuracy
dataset1   network     gold and silver                    %
dataset2   software    gold and silver                    %

Fig. 2. Characteristics of the data sets.

We compare the FRRW algorithm with two other feature selection algorithms. The first is IGR [9]: it is based on Information Gain, which in turn outperforms other measures for term selection, such as Mutual Information, χ², and Term Strength, as shown in [9]. The second is a baseline selector, denoted by RanS, a simple random selection of the features that we introduce specifically for this work. Our experiments are conceived as follows. We run the three feature selection algorithms on each of the two datasets separately and obtain a ranking of the features from each method. From each of these rankings we select the set of the top-ranked i features, with i varying from 1 to 600; this checks the ability of the three algorithms to place the best features at the top of the ranking. For each set of i features chosen by the three methods, we project the dataset, let the same 1-NN classifier predict the class on a set of test instances (with 10-fold cross-validation), and store its accuracy. For RanS we perform the random selection of i features 50 times for each value of i and average the resulting classifier accuracy. Algorithm 1 sketches the procedure we used for this evaluation.

Algorithm 1 Feature Evaluation Framework

/* Evaluation of unsupervised feature selection algorithms.
   FRRW[i], IGR[i]: output vectors storing the accuracy of 1-NN on dataset D
   projected on the top i features selected by FRRW and IGR respectively.
   RanS[i]: output vector storing the average accuracy of 1-NN on a random
   selection of i features. */
for each data set D do
    for i = 1 to 600 do
        select the top i features by FRRW and IGR
        /* project D on the selected features */
        D_FRRW(i) = Π_FRRW(i)(D)
        D_IGR(i)  = Π_IGR(i)(D)
        /* determine the 1-NN accuracy and store it in the accuracy vectors */
        FRRW[i] = 10-fold CV on D_FRRW(i) using 1-NN
        IGR[i]  = 10-fold CV on D_IGR(i) using 1-NN
        /* determine the average accuracy of 1-NN with RanS */
        tempAccuracy = 0
        for j = 1 to 50 do
            D_RanS(i) = Π_RanS(i)(D)
            tempAccuracy = tempAccuracy + 10-fold CV on D_RanS(i) using 1-NN
        end for
        RanS[i] = tempAccuracy / 50
    end for
end for
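A compact rendering of Algorithm 1 with scikit-learn could look as follows. This is our sketch rather than the authors' code; rank_frrw and rank_igr stand for precomputed feature rankings such as those produced by FRRW and IGR, and the helper names are ours.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def accuracy_top_i(X, y, ranking, i):
    """10-fold CV accuracy of 1-NN on the dataset projected
    on the top-i features of a given ranking."""
    knn = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(knn, X[:, ranking[:i]], y, cv=10).mean()

def evaluate(X, y, rank_frrw, rank_igr, max_i=600, n_random=50, seed=0):
    """Per-i accuracy curves for FRRW, IGR and the RanS baseline,
    the latter averaged over n_random random selections of i features."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    frrw, igr, rans = [], [], []
    for i in range(1, max_i + 1):
        frrw.append(accuracy_top_i(X, y, rank_frrw, i))
        igr.append(accuracy_top_i(X, y, rank_igr, i))
        acc = [accuracy_top_i(X, y, rng.permutation(n_features), i)
               for _ in range(n_random)]
        rans.append(float(np.mean(acc)))
    return frrw, igr, rans
```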
The accuracy of 1-NN resulting from the feature selection of the three methods is reported in Figure 3 and in Figure 4. These figures plot the accuracy achieved by the classifier on the instances of dataset 1 and dataset 2 respectively, where the instances are represented by a subset of the features selected by each of the three methods: the number of selected features is on the x axis and the accuracy obtained by the classifier on those features is on the y axis. On both datasets our feature selection method induces the same behaviour of the classifier. In the first part of the graph the accuracy grows up to a certain level; after this maximum, the accuracy decreases. This means that the most useful features are those at the top of the ranking: once the set of selected features stops containing only these good features and starts to include the remaining ones, the latter introduce noise into the classification task, as the observed degradation of the classification accuracy shows. With the other two methods, the accuracy is not only generally lower but also less stable as the number of features increases. It is clear that with FRRW the reachable accuracy is considerably higher (even higher than the accuracy obtained with the entire set of features). In particular, it is higher with FRRW even when only a low number of features is considered. This behaviour perfectly fits our expectations: it means that the top-ranked features are also the most characteristic of the target concept, namely those that allow a good classification. On the other hand, as we increase the number i of features, the accuracy decreases (even if with FRRW it remains higher than with the other methods). This means that the ranking of the features induced by FRRW correctly places at the top of the list the features that contribute most to a correct classification, while at the bottom of the list there are just those remaining features that are not useful for the characterization of the target concept, because they are the noisy ones.

4 Conclusion

In this work we investigated the use of PageRank for feature selection. PageRank is a well-known graph-based ranking algorithm that performs a Random Walk through the feature space and ranks the features according to their similarity with the greatest number of other features. We empirically demonstrated that this technique works well also in an unsupervised task (namely, when no class is provided to the feature selection algorithm). We have shown that our method can rank features according to their classification utility without considering the class information itself. This interesting result means that our method can infer the intrinsic characteristics of the dataset, which is particularly useful in all those cases in which no class information is provided. As future work we want to investigate this feature selection method for unsupervised learning, such as the clustering task.

References

1. Baker, L. D. and McCallum, A. K.: Distributional clustering of words for text classification. SIGIR '98 (1998).
2. 7Sector data.
3. Guyon, I. and Elisseeff, A.: An introduction to variable and feature selection. JMLR 3 (2003).
4. Porter, M. F.: An algorithm for suffix stripping. Program (1980).
5. Norris, J.: Markov Chains. Cambridge University Press (1997).
6. Mihalcea, R.: Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. HLT/EMNLP '05 (2005).
7. Brin, S. and Page, L.: The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30 (1998).
8. He, X., Cai, D. and Niyogi, P.: Laplacian score for feature selection. NIPS '05 (2005).
9. Yang, Y. and Pedersen, J. O.: A comparative study on feature selection in text categorization. ICML '97 (1997).
10. Zhao, Z. and Liu, H.: Semi-supervised feature selection via spectral analysis. SDM '07 (2007).
Fig. 3. Accuracy on dataset1 (x axis: number of features; y axis: accuracy; curves: FRRW, IGR, RanS).

Fig. 4. Accuracy on dataset2 (same axes and curves).