Using Google's PageRank Algorithm to Identify Important Attributes of Genes


Golam Morshed Osmani
Ph.D. Student in Software Engineering, Dept. of Computer Science
North Dakota State University, Fargo, ND 58105
Email: Morshed.Osmani@ndsu.edu

Syed (Shawon) M. Rahman, Ph.D.
Assistant Professor, Computer Science & Software Engineering Dept.
University of Wisconsin - Platteville, 1 University Plaza, Platteville, WI 53818
Email: RahmanS@UWPlatt.edu

Abstract

In our research, we have applied PageRank technology to identify important attributes of genes. The Google search engine uses the PageRank algorithm to assign a numerical weight to each webpage and rank search results. We have found that the PageRank algorithm is a very effective method for selecting significant attributes in high-dimensional data, especially gene expression data. The important genes it selects can then be used in clustering gene expression data, yielding better clusterings with minimal time and space. Clustering high-dimensional data requires a lot of resources, both in time and in memory, and some attributes are generally more important than others. We use yeast gene expression data from the Stanford MicroArray Database: four datasets, each with approximately 4905 genes and 5 to 14 expression levels. We calculate the correlation matrix between these genes using their expression levels in the various experiments. We have used Weka (data mining software in Java) to generate and validate a Naïve Bayes classifier, and found that data ranked with the PageRank algorithm produces better classification than the raw data: a classifier built on the raw data classifies attributes with 48-50% accuracy, while the PageRanked data reaches 62-64% accuracy.

1. Introduction

Clustering high-dimensional data requires a lot of resources, both in time and in memory space. Sometimes it is unrealistic or infeasible to cluster a dataset using the whole attribute set, particularly when that set is very large. Our attribute selection reduces both the memory and computation time requirements and makes clustering feasible. At the same time, it tries to improve the clustering result, since we are selecting the important or influential attributes.

Gene expression data has many attributes (genes) associated with it. To cluster effectively within limited time and memory constraints, attribute selection must be used. We apply our technique to select important genes in a gene expression dataset. The key idea behind our model is Google's PageRank [1]: if a page is linked to many other highly ranked pages, it has a high probability of being highly ranked itself. Links between genes are represented by high correlation values between those genes.

In our experiments we use yeast datasets from the Stanford MicroArray Database [2]. This database contains raw gene expression data. We select four categories and extract the raw datasets, then filter them and run our algorithm on the results. Comparing the results of our algorithm against results obtained without it, we find a significant improvement in selecting important genes.

2. Background

2.1. Attribute Selection

Attribute selection is an important problem on which many researchers are working. Its main aim is to select a subset of important attributes from a dataset and use this subset for further exploration. This is particularly useful when the dataset has a huge number of attributes and we need to classify or cluster the dataset using them. Not all attributes have the same importance or influence on the function or on the clustering of the dataset, so we can extract a subset of attributes based on their influence. The criteria we need to ensure are that the selection must not throw away attributes that are important or significant, and that the selected subset should produce better, or at least similar, results compared to using the whole attribute set. Several attribute selection methods and their comparison can be found in [3]. Some approaches focus on supervised classification [4][5]; other approaches [6][7] address unsupervised classification, where we have no prior information with which to evaluate potential solutions. Boudjeloud and Poulet [8] focus on a genetic-algorithm-based search technique for attribute selection.

2.2. PageRank

The PageRank algorithm was developed by the Google founders Larry Page and Sergey Brin. The basic idea behind PageRank is that a page has a high rank if the sum of the ranks of its backlinks is high. This covers both the case where a page has many backlinks and the case where it has a few highly ranked backlinks [1].

[Figure 1: A page connected to many highly ranked pages is itself to be highly ranked (the PageRank principle). The figure shows the page to be ranked connected by network edges to several highly ranked pages.]

Morrison and others [9] use GeneRank, a modified version of PageRank, to evaluate microarray experiment results.

2.3. Naïve Bayes Classifier

A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions [10]. It can be trained very efficiently in a supervised learning setting. Though Naïve Bayes relies on over-simplified assumptions and a simple design, it often works well in complex real-world problem domains. One advantage of the Naïve Bayes classifier is that it requires a comparatively small amount of data to estimate the parameters necessary for classification, and under specific conditions it has been found to perform better than most other classifiers. Zhang and Su [11] used a Naïve Bayes classifier to rank different objects (customers, etc.).
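To make the PageRank idea of Section 2.2 concrete: each node's rank satisfies the recurrence PR(p) = (1-d)/N + d * sum over backlinks q of PR(q)/outdegree(q), where d is a damping factor. Below is a minimal Python sketch of this computation by power iteration. It is an illustration only, not the implementation the paper uses (the authors ran a PHP implementation [12]); the damping factor 0.85 and the convergence tolerance are conventional choices, not values taken from the paper.

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank on a 0/1 adjacency matrix.

    adj[i, j] == 1 means node i links to node j. Dangling nodes
    (no outgoing links) are treated as linking to every node.
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-stochastic transition matrix; dangling rows become uniform.
    trans = np.where(out_deg[:, None] > 0,
                     adj / np.maximum(out_deg, 1)[:, None],
                     1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - d) / n + d * trans.T @ rank
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return new_rank
```

On the symmetric gene-gene link matrix built in Section 3.2.2 below, every link is bidirectional, so a gene's rank is driven by how strongly it is correlated with other well-connected genes.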

3. Algorithm

We follow several steps to obtain our results (illustrative sketches of the main computational steps appear after the corresponding subsections):

3.1 Dataset collection
3.2 For each dataset:
    3.2.1 Build the correlation network
    3.2.2 Generate the bit matrix from the correlation network
    3.2.3 Populate a table denoting the link between each pair of genes identified by the bit at that position
    3.2.4 Apply the PageRank algorithm to that table
    3.2.5 Normalize the ranks
3.3 Merge these datasets into a single dataset with the function column
3.4 Run the Naïve Bayes classifier on this dataset
3.5 Build a single dataset from all the initial datasets with the function column
3.6 Run the Naïve Bayes classifier on this combined dataset
3.7 Compare the results of these two classifications

3.1. Dataset Collection

We collected yeast datasets from the Stanford MicroArray Database [2] using the public login. Four datasets were collected, for the categories Sporulation (7 sample points), Cell-cycle (14 sample points), Transcription (6 sample points) and Stress (5 sample points). Data from the Stanford MicroArray Database (SMD) are in raw, zipped format. For each time point we get an MS Excel file, and from those files the gene expression values are copied into a single Excel file. This gives an Excel file with the gene id and the gene expression values for each experiment in the corresponding column. We then copy all the data into an MS Access database for filtering.

The Sporulation dataset contains 6384 records, while Cell-cycle, Transcription, Stress and Function contain 7680, 8832, 8448 and 7990 records respectively. First we remove duplicate entries. Then we filter the datasets, keeping only those genes present in all four datasets and in the function data. After filtering we get 4905 common genes in each dataset; these are the datasets we work on. Dataset collection requires some manual labor: going through the data to find anomalies, removing duplicates, and transferring from one format to another for further processing. In our case we used MS Excel and MS Access to assemble the base datasets.

3.2. Work on Each Dataset

3.2.1. Build the correlation network

We compute the correlation between every pair of genes in the current dataset, building a 4905x4905 matrix of correlation coefficients. This matrix is symmetric, so the upper and lower halves contain the same information. This step is done in MATLAB: we copy the dataset from the MS Access table to a single tab-delimited text file and run a MATLAB script to calculate the correlation matrix.

3.2.2. Generate the bit matrix from the correlation network

We create a bit matrix from the correlation matrix by applying a cutoff value; here we take 0.5 as the cutoff, though other values may also work. A MATLAB script performs this step.
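The paper performs steps 3.2.1 and 3.2.2 in MATLAB; the following Python/NumPy sketch shows an equivalent computation. The file name and tab-delimited layout are assumptions for illustration, and since the paper does not say whether the cutoff is applied to signed or absolute correlations, the absolute value is used here.

```python
import numpy as np

# Expression matrix: one row per gene, one column per experiment
# (e.g., 4905 x 7 for the Sporulation dataset). Hypothetical file.
expr = np.loadtxt("sporulation.txt", delimiter="\t")

# 3.2.1: gene-by-gene Pearson correlation matrix (4905 x 4905,
# symmetric). np.corrcoef correlates the rows of its input.
corr = np.corrcoef(expr)

# 3.2.2: bit matrix -- two genes are linked when the correlation
# magnitude exceeds the cutoff (0.5 in the paper). Self-links on
# the diagonal are removed.
bits = (np.abs(corr) > 0.5).astype(int)
np.fill_diagonal(bits, 0)

# Step 3.2.3 (next) stores these links in a MySQL table as
# (from, to) tuples; the same edge list can be read off directly:
edges = np.argwhere(bits == 1)  # each row is one (from, to) pair
```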

3.2.3. Populate the MySQL table

We treat each set bit as a link between the two corresponding genes, and populate a MySQL table with this link information as (from, to) tuples. A MATLAB script generates a file of SQL INSERT statements, which we then run from the MySQL command prompt.

3.2.4. Run the PageRank algorithm

We run the PHP implementation of the PageRank algorithm found at [12] on this table and obtain the rank of each gene. A PHP script then reads the rank data from the rank table and copies it to an MS Access table.

3.2.5. Normalize the ranks

We normalize the ranks to lie in the [0, 1] interval.

3.3. Merge These Datasets into a Single Dataset with the Function Column

We merge the ranks of all four datasets into a single dataset and add the function column to this newly merged dataset. This part is done in MS Access.

3.4. Run the Naïve Bayes Classifier on This Dataset

We use Weka [13] to build and validate a classifier. For this purpose we divided our data into a train set and a test set. The train dataset is created from the records with comparatively high rank values in all four tables. The only assumption we make is that the train dataset should have almost equal numbers of 0s and 1s in the function column, to make the classification less biased. Weka uses a special file format, the Attribute-Relation File Format (ARFF); we converted our data into ARFF format and ran Weka to train a Naïve Bayes classifier on the train dataset and validate it on the test dataset.
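The experiments themselves were run in Weka [13]; as a rough Python analogue of this train/validate step (not the authors' tooling), scikit-learn's Gaussian Naïve Bayes can be fitted the same way. The arrays below are random placeholders standing in for the real rank and function data, and the split simply takes the highest-ranked genes (the paper additionally balances 0s and 1s in the training set, which is omitted here for brevity):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# ranks: one row per gene, one normalized PageRank column per
# dataset (Sporulation, Cell-cycle, Transcription, Stress);
# func: the 0/1 function column. Both are placeholders here.
rng = np.random.default_rng(0)
ranks = rng.random((4905, 4))
func = rng.integers(0, 2, size=4905)

# Train on the highest-ranking genes (the paper uses ~108
# training records); test on the remainder.
order = np.argsort(-ranks.sum(axis=1))  # highest-ranked first
train, test = order[:108], order[108:]

clf = GaussianNB().fit(ranks[train], func[train])
accuracy = clf.score(ranks[test], func[test])
print(f"Correctly classified: {accuracy:.2%}")
```

For the raw-data baseline of Section 3.6, the same fit/score calls apply, with the split drawn at random instead of by rank.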

3.5. Build a Single Dataset from All the Initial Datasets with the Function Column

We build a combined raw dataset from all four raw datasets, before taking correlations and ranks. The function column is also added to this dataset.

3.6. Run the Naïve Bayes Classifier on This Combined Dataset

We randomly divide this combined raw dataset into train and test datasets, then train a Naïve Bayes classifier on the raw train dataset and validate it on the raw test dataset. We make some assumptions here: the train dataset has almost the same number of records as the rank-based train dataset, and it contains almost equal numbers of 1s and 0s to minimize bias toward a particular value (in our case 0).

3.7. Compare the Results of These Two Classifications

We compare the results and graphs produced by Weka for the two classifiers. We found a significant improvement in classification accuracy when ranking is used.

4. Data Statistics

Table 1 lists the initial number of records in each raw dataset. The number of genes common to all of these datasets is 4905.

Table 1: Number of records in each raw dataset collected from SMD [2].

Recordset      # of records   # of experiment points
Sporulation    6384           7
Cell-cycle     7680           14
Transcription  8832           6
Stress         8448           5
Function       7990           -

4.1. Compared Data Statistics

In Table 2, we compare the statistics of the raw dataset with those obtained using our algorithm.

Table 2: Comparison of dataset statistics, raw dataset vs. our algorithm.

                                  Raw dataset   Using our algorithm
Train dataset:
  Total # of records              121           108
  # of records with function 1    59            52
  # of records with function 0    62            56
Test dataset:
  Total # of records              4784          4797
  # of records with function 1    1490          1497
  # of records with function 0    3294          3299

4.2. Time Statistics

All statistics were taken on a Pentium Centrino 1.6 GHz with 768 MB RAM:

  Time to generate the correlation matrix and create a MySQL bulk insert file: 2 min 25 sec
  Time to insert into the MySQL table: 32 min 10 sec
  Time to run the PageRank algorithm: 2 hours 10 min

These figures are for the Sporulation dataset only. The other datasets take similar amounts of time, except the Cell-cycle dataset, which takes somewhat less time for the second and third steps.

5. Results and Discussion

We get promising results using our algorithm. Running the Naïve Bayes classifier in Weka on the raw dataset (without applying our algorithm), the result is poor: only 48.43% of the data is correctly classified.

Result of classification using raw data:

  Correctly Classified Instances     2317    48.4323 %
  Incorrectly Classified Instances   2467    51.5677 %

  TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0.42      0.372     0.713       0.42     0.528       0
  0.628     0.58      0.328       0.628    0.431       1

When we apply our algorithm and run Weka again, 61.28% of instances are correctly classified, an improvement of almost 13 percentage points. Our method thus produces better results: genes can be classified according to whether they have a significant impact on the function, and thereby selected for further clustering.

Result of classification after applying our algorithm:

  Correctly Classified Instances     2939    61.2802 %
  Incorrectly Classified Instances   1857    38.7198 %

  TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0.803     0.807     0.687       0.803    0.741       0
  0.193     0.197     0.308       0.193    0.237       1

We make some assumptions that still need to be verified, although our results show that they cause no harm. Extensive study and experiments with different assumptions may reveal the best ones to make. For example, we took 0.5 as the cutoff point when producing the bit matrix from the correlation network; other values could be tried as well. We also used almost equal numbers of 0s and 1s to build the train dataset; our observations show that this reduces bias toward a particular value (in our case 0).

6. Conclusion

In this paper, we have presented a method for selecting important attributes (genes) from a gene expression dataset. Our method uses Google's PageRank algorithm to rank genes based on their correlation network; the high-ranking genes are then used to build a Naïve Bayes classifier. We compared our approach with the raw data and found a significant improvement, showing that our method produces better results than the raw-data method. Further research is needed: finding an optimal or suitable cutoff value would be one avenue for improvement; the values could also be normalized before computing the correlation matrix; and taking the top 100/200 genes irrespective of the number of 0s and 1s, and building a classifier on that, would be another direction. Currently, the classifier fails to train well if the 0s and 1s are not balanced (if their proportion varies much from 1:1).

References

1. L. Page, S. Brin, R. Motwani, and T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford University, January 1998.
2. Stanford MicroArray Database (http://genome-www5.stanford.edu/), retrieved December 10, 2006.
3. H. B. Borges and J. C. Nievola: Attribute Selection Methods Comparison for Classification of Diffuse Large B-Cell Lymphoma. In Proceedings of the Fourth International Conference on Machine Learning and Applications, December 2005.
4. G. John, R. Kohavi, and K. Pfleger: Irrelevant features and the subset selection problem. In Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, pages 121-129, 1994.
5. H. Liu and H. Motoda: Feature Selection for Knowledge Discovery and Data Mining. Kluwer International Series in Engineering and Computer Science, 1998.
6. M. Dash and H. Liu: Feature selection for clustering. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 110-121, 2000.
7. Y. Kim, W. N. Street, and F. Menczer: Evolutionary model selection in unsupervised learning. Volume 6, pages 531-556, IOS Press, 2002.
8. L. Boudjeloud and F. Poulet: Attribute Selection for High Dimensional Data Clustering. In Proceedings of Applied Stochastic Models and Data Analysis, 2005.
9. J. L. Morrison, R. Breitling, D. J. Higham, and D. R. Gilbert: GeneRank: Using search engine technology for the analysis of microarray experiments. BMC Bioinformatics 6: 233, 2005.
10. Naïve Bayes classifier (http://en.wikipedia.org/wiki/naive_bayes_classifier/), retrieved December 10, 2006.
11. H. Zhang and J. Su: Naive Bayesian classifiers for ranking. In Proceedings of the 15th European Conference on Machine Learning (ECML 2004), Springer, 2004.
12. PageRank implementation in PHP (http://gxrank.googlecode.com/svn/trunk/), retrieved December 10, 2006.
13. I. H. Witten and E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.