Using Google's PageRank Algorithm to Identify Important Attributes of Genes


Golam Morshed Osmani
Ph.D. Student in Software Engineering, Dept. of Computer Science
North Dakota State University, Fargo, ND 58105
Email: Morshed.Osmani@ndsu.edu

Syed (Shawon) M. Rahman, Ph.D.
Assistant Professor, Computer Science & Software Engineering Dept.
University of Wisconsin - Platteville, 1 University Plaza, Platteville, WI 53818
Email: RahmanS@UWPlatt.edu

Abstract

In our research, we have applied PageRank technology to identify important attributes of genes. The Google search engine uses the PageRank algorithm to assign a numerical weight to each webpage and rank search results. We have found that the PageRank algorithm is a very effective method for selecting significant attributes in high-dimensional data, especially gene expression data. The important genes it selects can then be used in clustering gene expression data, yielding better clusterings with minimal time and space. Clustering high-dimensional data requires a lot of resources, both in time and in memory, and some attributes are generally more important than others. We use yeast gene expression data from the Stanford MicroArray Database: four datasets, each with approximately 4905 genes and 5 to 14 expression levels. We calculate the correlation matrix between these genes using their expression levels in the various experiments. We have used Weka (data mining software in Java) to generate and validate a Naïve Bayes classifier, and found that data ranked with the PageRank algorithm produces better classification than the raw data: a classifier built on the raw data classifies attributes with 48-50% accuracy, while the PageRanked data reaches 62-64% accuracy.

1. Introduction

Clustering high-dimensional data requires a lot of resources, both in time and in memory space. Sometimes it is unrealistic or infeasible to cluster a dataset using the whole attribute set, particularly when that set is very large. Our attribute selection reduces both the memory and computation time requirements and makes clustering feasible. At the same time, it tries to improve the clustering result, since we are selecting the important or influential attributes.

Gene expression data has many attributes (genes) associated with it. To cluster effectively within limited time and memory constraints, attribute selection must be used. We apply our technique to select important genes in a gene expression dataset. The key idea behind our model is Google's PageRank [1]: if a page is linked to many other highly ranked pages, it has a high probability of being highly ranked itself. Links between genes are represented by high correlation values between those genes.

In our experiments we use yeast datasets from the Stanford MicroArray Database [2]. This database contains raw gene expression data. We select four categories and extract the raw datasets, then filter them and run our algorithm on the results. Comparing the results of our algorithm against results obtained without it, we find a significant improvement in selecting important genes.

2. Background

2.1. Attribute Selection

Attribute selection is an important problem on which many researchers are working. Its main aim is to select a subset of important attributes from a dataset and use this subset for further exploration. This is particularly useful when the dataset has a huge number of attributes and we need to classify or cluster the dataset using them. Not all attributes have the same importance or influence on the function or on the clustering of the dataset, so we can extract a subset of attributes based on their influence. The criteria we need to ensure are that the selection must not throw away attributes that are important or significant, and that the selected subset should produce better, or at least similar, results compared to using the whole attribute set. Several attribute selection methods and their comparison can be found in [3]. Some approaches focus on supervised classification [4][5]; other approaches [6][7] address unsupervised classification, where we have no prior information with which to evaluate potential solutions. Boudjeloud and Poulet [8] focus on a genetic-algorithm-based search technique for attribute selection.

2.2. PageRank

The PageRank algorithm was developed by the Google founders Larry Page and Sergey Brin. The basic idea behind PageRank is that a page has a high rank if the sum of the ranks of its backlinks is high. This covers both the case where a page has many backlinks and the case where it has a few highly ranked backlinks [1].

[Figure 1: A page connected to many highly ranked pages is itself to be highly ranked (the PageRank principle). The figure shows the page to be ranked connected by network edges to several highly ranked pages.]

Morrison and others [9] use GeneRank, a modified version of PageRank, to evaluate microarray experiment results.

2.3. Naïve Bayes Classifier

A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions [10]. It can be trained very efficiently in a supervised learning setting. Though Naïve Bayes relies on over-simplified assumptions and a simple design, it often works well in complex real-world problem domains. One advantage of the Naïve Bayes classifier is that it requires a comparatively small amount of data to estimate the parameters necessary for classification, and under specific conditions it has been found to perform better than most other classifiers. Zhang and Su [11] used a Naïve Bayes classifier to rank different objects (customers, etc.).
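To make the PageRank idea of Section 2.2 concrete: each node's rank satisfies the recurrence PR(p) = (1-d)/N + d * sum over backlinks q of PR(q)/outdegree(q), where d is a damping factor. Below is a minimal Python sketch of this computation by power iteration. It is an illustration only, not the implementation the paper uses (the authors ran a PHP implementation [12]); the damping factor 0.85 and the convergence tolerance are conventional choices, not values taken from the paper.

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank on a 0/1 adjacency matrix.

    adj[i, j] == 1 means node i links to node j. Dangling nodes
    (no outgoing links) are treated as linking to every node.
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-stochastic transition matrix; dangling rows become uniform.
    trans = np.where(out_deg[:, None] > 0,
                     adj / np.maximum(out_deg, 1)[:, None],
                     1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - d) / n + d * trans.T @ rank
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return new_rank
```

On the symmetric gene-gene link matrix built in Section 3.2.2 below, every link is bidirectional, so a gene's rank is driven by how strongly it is correlated with other well-connected genes.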

3. Algorithm

We follow several steps to obtain our results (illustrative sketches of the main computational steps appear after the corresponding subsections):

3.1 Dataset collection
3.2 For each dataset:
    3.2.1 Build the correlation network
    3.2.2 Generate the bit matrix from the correlation network
    3.2.3 Populate a table denoting the link between each pair of genes identified by the bit at that position
    3.2.4 Apply the PageRank algorithm to that table
    3.2.5 Normalize the ranks
3.3 Merge these datasets into a single dataset with the function column
3.4 Run the Naïve Bayes classifier on this dataset
3.5 Build a single dataset from all the initial datasets with the function column
3.6 Run the Naïve Bayes classifier on this combined dataset
3.7 Compare the results of these two classifications

3.1. Dataset Collection

We collected yeast datasets from the Stanford MicroArray Database [2] using the public login. Four datasets were collected, for the categories Sporulation (7 sample points), Cell-cycle (14 sample points), Transcription (6 sample points) and Stress (5 sample points). Data from the Stanford MicroArray Database (SMD) are in raw, zipped format. For each time point we get an MS Excel file, and from those files the gene expression values are copied into a single Excel file. This gives an Excel file with the gene id and the gene expression values for each experiment in the corresponding column. We then copy all the data into an MS Access database for filtering.

The Sporulation dataset contains 6384 records, while Cell-cycle, Transcription, Stress and Function contain 7680, 8832, 8448 and 7990 records respectively. First we remove duplicate entries. Then we filter the datasets, keeping only those genes present in all four datasets and in the function data. After filtering we get 4905 common genes in each dataset; these are the datasets we work on. Dataset collection requires some manual labor: going through the data to find anomalies, removing duplicates, and transferring from one format to another for further processing. In our case we used MS Excel and MS Access to assemble the base datasets.

3.2. Work on Each Dataset

3.2.1. Build the correlation network

We compute the correlation between every pair of genes in the current dataset, building a 4905x4905 matrix of correlation coefficients. This matrix is symmetric, so the upper and lower halves contain the same information. This step is done in MATLAB: we copy the dataset from the MS Access table to a single tab-delimited text file and run a MATLAB script to calculate the correlation matrix.

3.2.2. Generate the bit matrix from the correlation network

We create a bit matrix from the correlation matrix by applying a cutoff value; here we take 0.5 as the cutoff, though other values may also work. A MATLAB script performs this step.
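The paper performs steps 3.2.1 and 3.2.2 in MATLAB; the following Python/NumPy sketch shows an equivalent computation. The file name and tab-delimited layout are assumptions for illustration, and since the paper does not say whether the cutoff is applied to signed or absolute correlations, the absolute value is used here.

```python
import numpy as np

# Expression matrix: one row per gene, one column per experiment
# (e.g., 4905 x 7 for the Sporulation dataset). Hypothetical file.
expr = np.loadtxt("sporulation.txt", delimiter="\t")

# 3.2.1: gene-by-gene Pearson correlation matrix (4905 x 4905,
# symmetric). np.corrcoef correlates the rows of its input.
corr = np.corrcoef(expr)

# 3.2.2: bit matrix -- two genes are linked when the correlation
# magnitude exceeds the cutoff (0.5 in the paper). Self-links on
# the diagonal are removed.
bits = (np.abs(corr) > 0.5).astype(int)
np.fill_diagonal(bits, 0)

# Step 3.2.3 (next) stores these links in a MySQL table as
# (from, to) tuples; the same edge list can be read off directly:
edges = np.argwhere(bits == 1)  # each row is one (from, to) pair
```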

3.2.3. Populate the MySQL table

We treat each set bit as a link between the two corresponding genes, and populate a MySQL table with this link information as (from, to) tuples. A MATLAB script generates a file of SQL INSERT statements, which we then run from the MySQL command prompt.

3.2.4. Run the PageRank algorithm

We run the PHP implementation of the PageRank algorithm found at [12] on this table and obtain the rank of each gene. A PHP script then reads the rank data from the rank table and copies it to an MS Access table.

3.2.5. Normalize the ranks

We normalize the ranks to lie in the [0, 1] interval.

3.3. Merge These Datasets into a Single Dataset with the Function Column

We merge the ranks of all four datasets into a single dataset and add the function column to this newly merged dataset. This part is done in MS Access.

3.4. Run the Naïve Bayes Classifier on This Dataset

We use Weka [13] to build and validate a classifier. For this purpose we divided our data into a train set and a test set. The train dataset is created from the records with comparatively high rank values in all four tables. The only assumption we make is that the train dataset should have almost equal numbers of 0s and 1s in the function column, to make the classification less biased. Weka uses a special file format, the Attribute-Relation File Format (ARFF); we converted our data into ARFF format and ran Weka to train a Naïve Bayes classifier on the train dataset and validate it on the test dataset.
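The experiments themselves were run in Weka [13]; as a rough Python analogue of this train/validate step (not the authors' tooling), scikit-learn's Gaussian Naïve Bayes can be fitted the same way. The arrays below are random placeholders standing in for the real rank and function data, and the split simply takes the highest-ranked genes (the paper additionally balances 0s and 1s in the training set, which is omitted here for brevity):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# ranks: one row per gene, one normalized PageRank column per
# dataset (Sporulation, Cell-cycle, Transcription, Stress);
# func: the 0/1 function column. Both are placeholders here.
rng = np.random.default_rng(0)
ranks = rng.random((4905, 4))
func = rng.integers(0, 2, size=4905)

# Train on the highest-ranking genes (the paper uses ~108
# training records); test on the remainder.
order = np.argsort(-ranks.sum(axis=1))  # highest-ranked first
train, test = order[:108], order[108:]

clf = GaussianNB().fit(ranks[train], func[train])
accuracy = clf.score(ranks[test], func[test])
print(f"Correctly classified: {accuracy:.2%}")
```

For the raw-data baseline of Section 3.6, the same fit/score calls apply, with the split drawn at random instead of by rank.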

3.5. Build a Single Dataset from All the Initial Datasets with the Function Column

We build a combined raw dataset from all four raw datasets, before taking correlations and ranks. The function column is also added to this dataset.

3.6. Run the Naïve Bayes Classifier on This Combined Dataset

We randomly divide this combined raw dataset into train and test datasets, then train a Naïve Bayes classifier on the raw train dataset and validate it on the raw test dataset. We make some assumptions here: the train dataset has almost the same number of records as the rank-based train dataset, and it contains almost equal numbers of 1s and 0s to minimize bias toward a particular value (in our case 0).

3.7. Compare the Results of These Two Classifications

We compare the results and graphs produced by Weka for the two classifiers. We found a significant improvement in classification accuracy when ranking is used.

4. Data Statistics

Table 1 lists the initial number of records in each raw dataset. The number of genes common to all of these datasets is 4905.

Table 1: Number of records in each raw dataset collected from SMD [2].

Recordset      # of records   # of experiment points
Sporulation    6384           7
Cell-cycle     7680           14
Transcription  8832           6
Stress         8448           5
Function       7990           -

4.1. Compared Data Statistics

In Table 2, we compare the statistics of the raw dataset with those obtained using our algorithm.

Table 2: Comparison of dataset statistics, raw dataset vs. our algorithm.

                                  Raw dataset   Using our algorithm
Train dataset:
  Total # of records              121           108
  # of records with function 1    59            52
  # of records with function 0    62            56
Test dataset:
  Total # of records              4784          4797
  # of records with function 1    1490          1497
  # of records with function 0    3294          3299

4.2. Time Statistics

All statistics were taken on a Pentium Centrino 1.6 GHz with 768 MB RAM:

  Time to generate the correlation matrix and create a MySQL bulk insert file: 2 min 25 sec
  Time to insert into the MySQL table: 32 min 10 sec
  Time to run the PageRank algorithm: 2 hours 10 min

These figures are for the Sporulation dataset only. The other datasets take similar amounts of time, except the Cell-cycle dataset, which takes somewhat less time for the second and third steps.

5. Results and Discussion

We get promising results using our algorithm. Running the Naïve Bayes classifier in Weka on the raw dataset (without applying our algorithm), the result is poor: only 48.43% of the data is correctly classified.

Result of classification using raw data:

  Correctly Classified Instances     2317    48.4323 %
  Incorrectly Classified Instances   2467    51.5677 %

  TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0.42      0.372     0.713       0.42     0.528       0
  0.628     0.58      0.328       0.628    0.431       1

When we apply our algorithm and run Weka again, 61.28% of instances are correctly classified, an improvement of almost 13 percentage points. Our method thus produces better results: genes can be classified according to whether they have a significant impact on the function, and thereby selected for further clustering.

Result of classification after applying our algorithm:

  Correctly Classified Instances     2939    61.2802 %
  Incorrectly Classified Instances   1857    38.7198 %

  TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0.803     0.807     0.687       0.803    0.741       0
  0.193     0.197     0.308       0.193    0.237       1

We make some assumptions that still need to be verified, although our results show that they cause no harm. Extensive study and experiments with different assumptions may reveal the best ones to make. For example, we took 0.5 as the cutoff point when producing the bit matrix from the correlation network; other values could be tried as well. We also used almost equal numbers of 0s and 1s to build the train dataset; our observations show that this reduces bias toward a particular value (in our case 0).

6. Conclusion

In this paper, we have presented a method for selecting important attributes (genes) from a gene expression dataset. Our method uses Google's PageRank algorithm to rank genes based on their correlation network; the high-ranking genes are then used to build a Naïve Bayes classifier. We compared our approach with the raw data and found a significant improvement, showing that our method produces better results than the raw-data method. Further research is needed: finding an optimal or suitable cutoff value would be one avenue for improvement; the values could also be normalized before computing the correlation matrix; and taking the top 100/200 genes irrespective of the number of 0s and 1s, and building a classifier on that, would be another direction. Currently, the classifier fails to train well if the 0s and 1s are not balanced (if their proportion varies much from 1:1).

References

1. L. Page, S. Brin, R. Motwani, and T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford University, January 1998.
2. Stanford MicroArray Database (http://genome-www5.stanford.edu/), retrieved December 10, 2006.
3. H. B. Borges and J. C. Nievola: Attribute Selection Methods Comparison for Classification of Diffuse Large B-Cell Lymphoma. In Proceedings of the Fourth International Conference on Machine Learning and Applications, December 2005.
4. G. John, R. Kohavi, and K. Pfleger: Irrelevant features and the subset selection problem. In Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, pages 121-129, 1994.
5. H. Liu and H. Motoda: Feature Selection for Knowledge Discovery and Data Mining. Kluwer International Series in Engineering and Computer Science, 1998.
6. M. Dash and H. Liu: Feature selection for clustering. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 110-121, 2000.
7. Y. Kim, W. N. Street, and F. Menczer: Evolutionary model selection in unsupervised learning. Volume 6, pages 531-556, IOS Press, 2002.
8. L. Boudjeloud and F. Poulet: Attribute Selection for High Dimensional Data Clustering. In Proceedings of Applied Stochastic Models and Data Analysis, 2005.
9. J. L. Morrison, R. Breitling, D. J. Higham, and D. R. Gilbert: GeneRank: Using search engine technology for the analysis of microarray experiments. BMC Bioinformatics 6: 233, 2005.
10. Naïve Bayes classifier (http://en.wikipedia.org/wiki/naive_bayes_classifier/), retrieved December 10, 2006.
11. H. Zhang and J. Su: Naive Bayesian classifiers for ranking. In Proceedings of the 15th European Conference on Machine Learning (ECML 2004), Springer, 2004.
12. PageRank implementation in PHP (http://gxrank.googlecode.com/svn/trunk/), retrieved December 10, 2006.
13. I. H. Witten and E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.