SVM CLASSIFICATION AND ANALYSIS OF MARGIN DISTANCE ON MICROARRAY DATA

A Thesis Presented to The Graduate Faculty of The University of Akron

In Partial Fulfillment of the Requirements for the Degree Master of Science

Ameer Basha Shaik Abdul

May, 2011

SVM CLASSIFICATION AND ANALYSIS OF MARGIN DISTANCE ON MICROARRAY DATA

Ameer Basha Shaik Abdul

Thesis

Approved:
Advisor: Dr. Zhong-Hui Duan
Committee Member: Dr. Chien-Chung Chan
Committee Member: Dr. Yingcai Xiao

Accepted:
Department Chair: Dr. Chien-Chung Chan
Dean of the College: Dr. Chand K. Midha
Dean of the Graduate School: Dr. George R. Newkome

ABSTRACT

Support vector machine (SVM) is a statistical classification algorithm that classifies data by separating two classes with a functional hyper-plane. SVM is known for good performance on noisy and high-dimensional data such as microarray data. A marginal region of the functional hyper-plane, named the danger zone, is defined as the region between two parallel hyper-planes determined by the average distances of the support vectors of the two classes to the functional hyper-plane. The main aim of this study was to determine the effect of the margin distance, the width of the danger zone, on the accuracy of the classifier and to analyze the role of the margin distance in feature selection. The study was carried out using three microarray datasets. For each dataset, the equation of the functional hyper-plane separating the two classes of data was derived and the corresponding support vectors were obtained. The average distances between the support vectors of the two classes and the functional hyper-plane were calculated. The relation between the width of the danger zone and the classification accuracy was investigated, and the rate of change of the margin distance with respect to the number of features used for constructing the support vector machine was also examined. The results indicate that although the correlation between margin and accuracy is not very strong, the rate of change of classification accuracy with respect to margin distance can be employed to determine the optimal number of features for constructing a high-performance support vector machine for classifying microarray samples.

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisor Dr. Zhong-Hui Duan for her continuous guidance throughout the research. With her untiring advice and invaluable help, it has been possible to proceed in the correct direction and successfully complete the study. I would also like to thank my thesis committee members Dr. Chien-Chung Chan and Dr. Yingcai Xiao for their expert suggestions. I am extremely grateful to my parents for having faith in me and for extending their support at all times. In addition, I would also like to thank my friends, Aparna Sriram, Teja Polapragada and Sri Harsha Muppaneni, for their timely help.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
I. INTRODUCTION
   1.1 Gene expression and Microarray technology
   1.2 Support Vector Machine
II. LITERATURE REVIEW
III. MATERIALS AND METHODS
   3.1 Methodology
      3.1.1 Dataset selection
      3.1.2 Training and test dataset generation
      3.1.3 Data preprocessing and feature selection
      3.1.4 Training the SVM
      3.1.5 Determining the equation of hyper-plane
      3.1.6 Distance of a sample to hyper-plane
      3.1.7 Classifying test data
      3.1.8 Obtaining and analyzing the results
IV. RESULTS AND DISCUSSIONS
   4.1 Distances of training samples to the decision boundary
   4.2 Distances of test samples
   4.3 Classification accuracy
   4.4 Relation between classification accuracy and margin value
   4.5 Variation in accuracy and margin with increase in the number of genes
   4.6 Effects of change in number of genes on margin and accuracy
   4.7 Correlation between margin and accuracy
V. CONCLUSIONS
REFERENCES
APPENDICES
   APPENDIX A. GENERATING TRAINING AND TEST DATASETS RANDOMLY
   APPENDIX B. SCALING TRAINING AND TEST DATASET
   APPENDIX C. PERFORMING T-TEST ON THE SCALED TRAINING DATA
   APPENDIX D. CALCULATING THE MARGIN DISTANCE OF SVM CLASSIFIER

LIST OF TABLES

3.1 Format of Leukemia dataset
3.2 Format of Heart disease dataset
3.3 Format of Breast cancer dataset
3.4 Sample distribution of training and test data for the 3 datasets
3.5 Number of genes in each dataset
4.1 Distances of training samples to the hyper-plane for linear SVM kernel
4.2 Distance of test samples to the hyper-plane using linear SVM kernel for Leukemia dataset
4.3 Distance of test samples to the hyper-plane using RBF SVM kernel for Leukemia dataset
4.4 Distance of test samples to the hyper-plane using linear SVM kernel for Heart disease dataset
4.5 Distance of test samples to the hyper-plane using RBF SVM kernel for Heart disease dataset
4.6 Distance of test samples to the hyper-plane using linear SVM kernel for Breast cancer dataset
4.7 Distance of test samples to the hyper-plane using RBF SVM kernel for Breast cancer dataset
4.8 Percentage of misclassified test samples in the danger zone for linear kernel SVM
4.9 Percentage of misclassified test samples in the danger zone for RBF kernel SVM
4.10 Average classification accuracy for linear kernel SVM
4.11 Average classification accuracy for RBF kernel SVM
4.12 Correlation between margin value and classification accuracy

LIST OF FIGURES

1.1 Microarray experiment
1.2 Decision boundary and margin of SVM classifier
3.1 Flow chart representation of the process
3.2 SVM feature space showing the support vectors (s_i), margin distance (m_d) and the danger zone
4.1 Classification accuracy of linear kernel SVM for Leukemia dataset
4.2 Margin value of linear kernel SVM for Leukemia dataset
4.3 Classification accuracy of RBF kernel SVM for Leukemia dataset
4.4 Margin value of RBF kernel SVM for Leukemia dataset
4.5 Classification accuracy of linear kernel SVM for Heart disease dataset
4.6 Margin value of linear kernel SVM for Heart disease dataset
4.7 Classification accuracy of RBF kernel SVM for Heart disease dataset
4.8 Margin value of RBF kernel SVM for Heart disease dataset
4.9 Classification accuracy of linear kernel SVM for Breast cancer dataset
4.10 Margin value of linear kernel SVM for Breast cancer dataset
4.11 Classification accuracy of RBF kernel SVM for Breast cancer dataset
4.12 Margin value of RBF kernel SVM for Breast cancer dataset
4.13 Rate of change of margin with respect to the change in number of genes for Leukemia dataset using linear SVM kernel
4.14 Rate of change of accuracy with respect to the change in number of genes for Leukemia dataset using linear SVM kernel
4.15 Rate of change of margin with respect to the change in number of genes for Leukemia dataset using RBF SVM kernel
4.16 Rate of change of accuracy with respect to the change in number of genes for Leukemia dataset using RBF SVM kernel
4.17 Rate of change of margin with respect to the change in number of genes for Heart disease dataset using linear SVM kernel
4.18 Rate of change of accuracy with respect to the change in number of genes for Heart disease dataset using linear SVM kernel
4.19 Rate of change of margin with respect to the change in number of genes for Heart disease dataset using RBF SVM kernel
4.20 Rate of change of accuracy with respect to the change in number of genes for Heart disease dataset using RBF SVM kernel
4.21 Rate of change of margin with respect to the change in number of genes for Breast cancer dataset using linear SVM kernel
4.22 Rate of change of accuracy with respect to the change in number of genes for Breast cancer dataset using linear SVM kernel
4.23 Rate of change of margin with respect to the change in number of genes for Breast cancer dataset using RBF SVM kernel
4.24 Rate of change of accuracy with respect to the change in number of genes for Breast cancer dataset using RBF SVM kernel

CHAPTER I

INTRODUCTION

Bioinformatics is an emerging field that has its roots in molecular biology, mathematics and computer science. It deals with the generation, management and analysis of biological data obtained from various experiments and techniques, often in very large volumes. The analysis of such enormous biological data requires sophisticated algorithms that can process the data, help visualize it and extract information from it [7]. This led to the evolution of bioinformatics, an interdisciplinary field involving both biologists and computer scientists.

Advancements in the field of bioinformatics have enabled many researchers to analyze the data and understand structural, comparative and functional properties. Some of these advancements are the analysis of genomes and proteins, the identification of metabolic and signaling pathways that define gene-to-gene relationships, and the development of microarray chips and microarray experiments to measure gene expression levels. The availability of the data on public websites and repositories has made it easier to carry out research. NCBI is one such database; it includes DNA and protein sequence data and also allows researchers to contribute their sequences to the database. KEGG and EcoCyc are databases that contain

information about the pathways [7]. To process the data, finely tuned algorithms have been developed over the years and made publicly available; BLAST and CLUSTALW, for example, perform sequence comparison. Algorithms to perform phylogenetic analysis are also available on public websites [8].

One of the major advancements in the field of bioinformatics is the emergence of microarray technology. Microarray technology makes it possible to determine the expression values of several thousand genes simultaneously. The gene expression data are used in various analyses to understand the biological characteristics of the species or the tissue from which the genes were extracted for the experiment. One such analysis is classification of a sample based on the gene expression values obtained from the microarray experiment.

This study focuses on the analysis and calculation of a distance measure and the margin of a support vector machine classifier for microarray datasets. It also studies the effect of the margin value on the classification accuracy and the relation between them. Before we proceed further, a brief introduction to gene expression and microarray technology is provided, followed by a discussion of the support vector machine classifier.

1.1 Gene expression and Microarray technology

The characteristic features and behavior of a biological species largely depend on the genes and the proteins present in it. The proteins obtained from the genes vary depending upon the gene expression levels. Hence, analyzing the expression levels of genes under various conditions helps identify the reason behind abnormalities in diseased

species, in addition to identifying the genes responsible for the abnormality. Microarray technology is used to study and record the gene expression of thousands of genes simultaneously.

A microarray is a chip on which biological substrates are bound to probes present on a silicon chip or a glass slide. The biological substrates can be DNA, protein molecules or carbohydrates, and they decide the type of microarray chip. There are different types of microarrays, such as DNA microarrays, protein microarrays, tissue microarrays and carbohydrate microarrays [9]. DNA microarrays are the ones commonly used to record the expression levels of genes.

Figure 1.1: Microarray experiment (experiment tissue samples → mRNAs extracted → reverse transcription to cDNA → labeling with fluorescent dyes → hybridization on the microarray → laser scanning → scanned image)

Figure 1.1 shows a typical microarray experiment. The target mRNAs (messenger RNAs) of the species whose gene expression is to be measured are reverse transcribed to cDNAs (complementary DNAs). The cDNAs are labeled with fluorescent dyes or radioactive labels and are hybridized on the microarray chip. The chip is left overnight to hybridize. During hybridization, cDNAs bind to their complementary strands present on the microarray chip through base pairing. The chip is then washed to remove any non-specific DNA bindings, and scanned to obtain a digital image. The image is analyzed and processed using image processing and data normalization techniques to record the expression levels of thousands of genes. For a dual-channel microarray chip, both control and experiment cell tissue samples are extracted and colored with different fluorescent dyes. They are then reverse transcribed to cDNAs and hybridized on the dual-channel microarray chip. After hybridization the chip is scanned to obtain an image, which is further processed to obtain the gene expression levels of the experiment tissue samples.

Microarrays have many applications in the medical and biological fields, and various kinds of microarrays are used to obtain the expression levels of biological entities. For example, protein microarrays are used to understand protein-protein, protein-drug and protein-DNA interactions. In medicine, DNA microarrays are used to identify differentially expressed genes. In addition, microarrays are used for drug discovery and to study the changes in gene expression levels in response to drugs [9]. In cancer research, microarrays are used for mutation detection, gene copy number analysis, cancer therapeutics and drug sensitivity studies [10].

Classification is one of the prominent analyses performed on microarray gene expression data. The analysis helps in distinguishing diseased samples and identifying unknown samples based on the gene expression data. In microarray data, features correspond to the genes in the experiments and samples correspond to the microarray experiments. As the data have a large number of features compared to the number of samples, classifying such data is quite a tedious task. Many classifiers tend to underperform and can lead to false discoveries due to the high-dimensional nature of microarray data [11]. Hence there is a need to optimize the classification techniques and fine-tune them to fit microarray data. The next section discusses the SVM classifier and the basic concept it uses for the classification of data.

1.2 Support Vector Machine

Support vector machine (SVM) is gaining popularity for its ability to classify noisy and high-dimensional data. SVM is a statistical learning algorithm that classifies samples using a subset of training samples called support vectors. The idea behind the SVM classifier is that it creates a feature space using the attributes in the training data. It then tries to identify a decision boundary, or hyper-plane, that separates the feature space into two halves where each half contains only the training data points belonging to one category. This is shown in Figure 1.2.

In Figure 1.2 the circular data points belong to one class and the square points belong to another class. SVM tries to find a hyper-plane (H_1 or H_2) that separates the two

categories. As shown in the figure, there may be many hyper-planes that can separate the data. Based on the maximum-margin hyper-plane concept, SVM chooses the best decision boundary that separates the data. Each hyper-plane (H_i) is associated with a pair of supporting hyper-planes (h_i1 and h_i2) that are parallel to the decision boundary (H_i) and pass through the nearest data points. The distance between these supporting planes is called the margin. In the figure, even though both hyper-planes (H_1 and H_2) divide the data points, H_1 has a bigger margin and tends to perform better for the classification of unknown samples than H_2: the bigger the margin, the smaller the generalization error for the classification of unknown samples. Hence, H_1 is preferred over H_2.

Figure 1.2: Decision boundary and margin of SVM classifier [6] (hyper-planes H_1 and H_2 with their supporting hyper-planes h_11, h_12, h_21 and h_22, and the normal vector w)

There are two types of SVMs: (1) linear SVM, which separates the data points using a linear decision boundary, and (2) non-linear SVM, which separates the data

points using a non-linear decision boundary. For a linear SVM the equation of the decision boundary is

w · x + b = 0 (1.1)

where w and x are vectors and the direction of w is perpendicular to the linear decision boundary. The vector w is determined using the training dataset. For any data point x_i that lies above the decision boundary,

w · x_i + b = k, where k > 0, (1.2)

and for any data point x_j that lies below the decision boundary,

w · x_j + b = k', where k' < 0. (1.3)

By rescaling the values of w and b, the equations of the two supporting hyper-planes (h_11 and h_12) can be defined as

h_11: w · x + b = 1 (1.4)
h_12: w · x + b = -1 (1.5)

The distance d between the two supporting hyper-planes (the margin) is obtained from

w · (x_1 - x_2) = 2 (1.6)
d = 2 / ||w|| (1.7)

The objective of the SVM classifier is to maximize the value of d, which is equivalent to minimizing the value of ||w||^2/2. The values of w and b are obtained by solving this quadratic optimization problem.
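To make Equations 1.1-1.7 concrete, the following MATLAB sketch evaluates the margin and the signed distance of a sample to a linear decision boundary. The values of w, b and x here are hypothetical placeholders, not quantities taken from the thesis experiments.

    % Minimal sketch of Equations 1.1-1.7 with hypothetical w and b.
    w = [2; -1];                  % normal vector of the decision boundary (assumed)
    b = 0.5;                      % bias term (assumed)
    x = [1; 3];                   % an arbitrary sample point

    d    = 2 / norm(w);           % margin, Equation 1.7
    f    = dot(w, x) + b;         % w . x + b, Equation 1.1
    dist = f / norm(w);           % signed distance of x to the decision boundary
    fprintf('margin = %.4f, signed distance = %.4f\n', d, dist);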

The optimization is subject to the constraints

w · x_i + b ≥ 1 if y_i = 1 (1.8)
w · x_i + b ≤ -1 if y_i = -1 (1.9)

where y_i is the class variable for x_i. Imposing these restrictions makes the SVM place the training instances with y_i = 1 above the hyper-plane h_11 and the training instances with y_i = -1 below the hyper-plane h_12. The optimization problem can be solved using the Lagrange multiplier method. The objective function to be minimized, in its Lagrangian form, can be written as

L_P = (1/2)||w||^2 - Σ_{i=1}^{N} α_i [y_i (w · x_i + b) - 1] (1.10)

where the α_i are Lagrange multipliers and N is the number of samples [6]. The Lagrange multipliers must be non-negative (α_i ≥ 0). In order to minimize the Lagrangian form, its partial derivatives with respect to w and b are set to zero:

∂L_P/∂w = 0  =>  w = Σ_{i=1}^{N} α_i y_i x_i (1.11)
∂L_P/∂b = 0  =>  Σ_{i=1}^{N} α_i y_i = 0 (1.12)

The problem is transformed to its dual form by substituting Equations 1.11 and 1.12 into the Lagrangian form, Equation 1.10. The dual form is given by

L_D = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j) (1.13)

The training instances for which α_i > 0 lie on the hyper-plane h_11 or h_12 and are called support vectors. Only these training instances are used to obtain the decision boundary parameters w and b; hence the classification of unknown samples is based on the support vectors.

In some cases it is preferable to misclassify some of the training samples (training errors) in order to obtain a decision boundary with a larger margin. A decision boundary with no training errors but a smaller margin may lead to over-fitting and may not classify unknown samples correctly. On the other hand, a decision boundary with a few training errors and a larger margin can classify unknown samples more accurately. Hence there must be a tradeoff between the margin and the number of training errors. The decision boundary thus obtained is called a soft margin. The constraints for the optimization problem still hold but need the addition of slack variables (ξ_i), which account for the soft margin; these slack variables correspond to the error in the decision boundary. A penalty for the training error is also introduced in the objective function in order to balance the margin value against the number of training errors. The objective for the optimization problem becomes the minimization of

||w||^2/2 + C (Σ_i ξ_i)^k (1.14)

where C and k are specified by the user and can be varied depending on the dataset. The constraints for the optimization problem become

w · x_i + b ≥ 1 - ξ_i, if y_i = 1, (1.15)
w · x_i + b ≤ -1 + ξ_i, if y_i = -1. (1.16)

The Lagrange multipliers for the soft margin differ from the Lagrange multipliers of the hard-margin linear decision boundary: the α_i values must be non-negative and also less than or equal to C. Hence the parameter C acts as an upper limit on the error in the decision boundary [6].

A linear SVM performs well on datasets that can be easily separated into two parts by a hyper-plane. But some datasets are complex and difficult to classify using a linear kernel; non-linear SVM classifiers can be used for such complex datasets. The concept behind the non-linear SVM classifier is to transform the dataset into a high-dimensional space where the data can be separated using a linear decision boundary, even though in the original feature space the decision boundary is not linear. The main problem with transforming the dataset to a higher dimension is the increase in the complexity of the classifier. Also, the exact mapping function that can separate the data linearly in the higher-dimensional space is not known. In order to overcome this, a concept called the kernel trick is used to transform the data to the higher-dimensional space. If Φ is the mapping function, then in order to find the linear decision boundary in the transformed higher-dimensional space, the attribute x in Equation 1.13 is replaced with Φ(x). The transformed Lagrangian dual form is given by

L_D = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (Φ(x_i) · Φ(x_j)) (1.17)

The dot product is a measure of similarity between two vectors. The key idea behind the kernel trick is that the dot product is computed analogously in the original and the transformed space. Consider two input instance vectors x_i and x_j in the original space. When mapped to the higher dimension, they are transformed to Φ(x_i) and Φ(x_j) respectively; likewise, the similarity measure is transformed from x_i · x_j in the original space to Φ(x_i) · Φ(x_j) in the higher-dimensional space. The dot product of Φ(x_i) and Φ(x_j) is called the kernel function and is represented by K(x_i, x_j). As the kernel trick treats the dot products in the two spaces as analogous, it allows the kernel function in the transformed space to be computed using the original attribute set. Hence the original non-linear decision boundary equation in the lower-dimensional space is transformed to the equation of a linear decision boundary in the higher-dimensional space, given by

w · Φ(x) + b = 0 (1.18)

A brief overview of the thesis is presented below:

1) Chapter 2 presents the literature review and some of the previous work done on microarray classification using SVM.
2) Chapter 3 focuses on the datasets used in the thesis, the process followed and the application of the process model to the selected datasets.
3) Chapter 4 presents the results obtained from the analyses performed and discusses them.
4) Chapter 5 concludes the thesis by presenting the observations derived from the current study.

CHAPTER II

LITERATURE REVIEW

This chapter discusses previous work done on microarray data using SVM classifiers. It also presents methods implemented to fine-tune the SVM classifier to adapt to high-dimensional microarray data. The chapter starts with a review of the original study performed on the leukemia cancer dataset by Golub et al.

The leukemia cancer dataset was obtained from the Broad Institute of MIT and Harvard [1]. The dataset is divided into a training dataset with 38 samples and a test dataset with 34 samples. The training dataset consists of 27 acute lymphoblastic leukemia (ALL) and 11 acute myeloid leukemia (AML) samples, whereas the test dataset consists of 24 ALL samples and 10 AML samples.

The original study on the leukemia dataset was performed by T. R. Golub et al. They proposed a classification technique to classify the tumor samples using microarray gene expression data. In the study, they ranked the genes based on how well each is correlated with the class distinction of a sample, and neighborhood analysis was used to evaluate a gene ranking based on this correlation. The genes that showed a strong correlation in distinguishing the classes were considered informative genes for classification. Using these genes, a class predictor was developed to classify the samples. Each informative gene provides information and favors a particular class.

The genes vote for one of the classes, and the vote can be weighted. The weight is calculated based on the expression level in the new sample and how well the gene relates to the particular class. The sample is classified to a class by totaling the votes. The voting also helps in determining the prediction strength (PS) of the call, which varies from 0 to 1; the sample is assigned to a class only if the PS value crosses a predefined threshold. The proposed model was tested using a cross-validation technique on the training data, and it correctly classified 36 out of 38 samples. The class predictor was then used to classify a set of independent test data samples, and the model correctly classified 29 out of 34 samples. The median value of the prediction strengths obtained was quite high (PS = 0.77).

The study also focused on identifying cancer classes using clustering. For this, a technique called self-organizing maps was used, which essentially identifies a centroid among the data points grouped together. The training dataset of 38 samples was used to identify two clusters, labeled A1, containing 25 samples, and A2, containing 13 samples. In order to test these clusters, they were used as training data for the class predictor to build a learning model, and the predictor model was evaluated using cross-validation. The accuracy obtained was quite high, with only one error and three uncertain samples. The study proposed an iterative development of a model for building a class predictor, where the data are initially used to build clusters and the clusters are then used to train a class predictor model. The samples misclassified under cross-validation are removed and the new data are used to retrain the class predictor model. This model was used to test independent test data, and the value of prediction strength was

quite high. The study further focused on identifying subclasses within ALL (T-lineage ALL and B-lineage ALL). The prediction strengths obtained for the subclasses of B-ALL were low, indicating that they belong to the same class. They concluded that microarray gene expression data can be used for identifying different classes in cancer, and that a standardized experimental method for obtaining microarray gene expression data will result in improved accuracy [2].

S. Mukherjee et al. in [12] performed classification on the leukemia cancer data [1] using SVMs. The study analyzed the classification ability of SVM on high-dimensional microarray data. They used the feature selection method proposed by Golub et al. in [2], ranked the features, and picked the top 49, 99 and 999 genes for classification. Classification using all 7129 genes in the dataset was also performed. The study proposed two methods: 1) SVM classification without rejections; 2) SVM classification with rejections. The former method classified the dataset using a linear SVM classifier with the top 49, 99 and 999 genes and also with the complete set of 7129 genes. The SVM classifier achieved better accuracy compared to the method proposed by Golub et al.; a non-linear polynomial kernel SVM classifier did not improve accuracy for this dataset. The second method used a confidence threshold value to reject test samples if they lie close to the boundary plane. The confidence threshold was calculated using a Bayesian formulation. The distance of the training samples to the decision boundary was calculated based on a leave-one-out strategy. The distribution function estimate of the distances was obtained using a non-parametric density estimation algorithm, and the confidence level for the classifier was obtained by subtracting the estimate from unity. The distance of a test sample to the decision boundary was calculated, and if it was less than the confidence

level of the classifier, the decision was rejected and the class of the test sample could not be determined. The overall accuracy obtained was 100%, with a few samples rejected in each category of the filtered genes: 4 samples were rejected with the top 49 genes, 2 with the top 99 genes, none with the top 999 genes and 3 with the full set of 7129 genes. They concluded that a linear SVM classifier with rejections based on confidence values performed well on the leukemia cancer dataset [12].

The study conducted by Terrence S. Furey et al. in [5] focused on the SVM classification technique to classify microarray data and also on the validation of cancer tissue samples. The dataset used for the experiment consisted of 31 samples, including cancerous ovarian tissues, normal ovarian tissues and normal non-ovarian tissues. They tried to classify cancerous tissues against normal tissues (both normal ovarian and normal non-ovarian). Most machine learning algorithms tend to underperform with large numbers of features, whereas SVM is known for handling high-dimensional data; hence the full dataset was initially used for classification. The process included classification of the entire dataset using a hold-one-out technique. Later, the features were ranked and the top features were used for classification. The features were ranked based on scores calculated as a ratio: the numerator is the difference between the mean expression values of a gene in normal tissues and in tumor tissues, and the denominator is the sum of the standard deviations over normal tissues and tumor tissues. A linear kernel was used for classification. The top genes by score were then taken to train the SVM, and unknown samples were classified.
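The score just described can be stated compactly in code. The MATLAB sketch below is an illustrative implementation of that ratio on synthetic expression values, not the code used by Furey et al.; the absolute value is taken here so that genes favoring either class rank highly.

    % Illustrative gene ranking by the mean-difference / summed-std score.
    rng(1);
    normal = randn(20, 100);           % 20 normal samples x 100 genes (synthetic)
    tumor  = randn(15, 100) + 0.3;     % 15 tumor samples x 100 genes (synthetic)

    score = abs(mean(normal,1) - mean(tumor,1)) ./ (std(normal,0,1) + std(tumor,0,1));
    [~, order] = sort(score, 'descend');
    topGenes = order(1:5);             % indices of the five highest-scoring genes
    disp(topGenes);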

For the ovarian dataset, two samples (N039 and HWBC3) were repeatedly misclassified. They analyzed these samples by calculating the margin value, which they defined as the distance of the sample from the decision boundary. The margin value for the misclassified sample N039 was relatively large, suggesting that it may have been mislabeled; a review by the biologists indicated that the sample was indeed mislabeled. The other misclassified sample, HWBC3, was determined to be an outlier. Among the top genes extracted by feature selection, three out of five were related to cancer. They proposed that feature selection can be used to extract genes related to cancer, but suggested that it is not fully reliable, as some genes unrelated to cancer were also ranked highly. To generalize the method, they performed the analysis on the leukemia dataset (Golub et al., 1999) and the colon cancer dataset (Alon et al., 1999); the results were comparable to those obtained previously. They concluded that SVM can be used for the classification of microarray data and for analyzing misclassified samples. It was suggested that with the use of a non-linear kernel, classification accuracy might be improved for complex datasets that are otherwise difficult to classify using a simple linear SVM kernel [5].

The study conducted by Sung-Huai Hsieh et al. [3] focused on classification of the leukemia dataset using support vector machines. They proposed an SVM classifier using Information Gain (IG) as the feature selection method. Information gain is the reduction in entropy when the data are partitioned based on a gene. Entropy can be used to determine whether a feature is useful for classification, in addition to determining the correlation within the training dataset. The microarray dataset was divided into two independent sets, training and test data, and feature selection was performed on the training dataset by calculating the IG values of the genes.

Genes with the top IG values were selected for classification. The training data were then scaled to handle outliers and minimize learning-model bias, and the scaled training data were used to train the SVM classifier. A radial basis function (RBF) kernel with grid searching was used in the SVM model, and optimum values for the RBF parameters (penalty C and gamma γ) were determined. The model was tested using cross-validation and also using an independent test dataset. The paper reported that the SVM classifier model with IG feature selection achieved good accuracy (98.10%) [3].
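As an illustration of the information-gain criterion described above, the MATLAB sketch below computes IG for a single gene using a median-split discretization. The discretization scheme and the synthetic data are assumptions for illustration; they are not the exact procedure of Hsieh et al.

    % Illustrative information gain for one gene, discretized at its median.
    rng(2);
    expr = randn(40,1);                        % synthetic expression values for one gene
    y    = [zeros(20,1); ones(20,1)];          % synthetic class labels

    H  = @(p) -sum(p(p>0) .* log2(p(p>0)));    % entropy of a probability vector
    hi = expr > median(expr);                  % median split of the expression values
    pAll = [mean(y==0); mean(y==1)];
    pHi  = [mean(y(hi)==0);  mean(y(hi)==1)];
    pLo  = [mean(y(~hi)==0); mean(y(~hi)==1)];
    IG = H(pAll) - (mean(hi)*H(pHi) + mean(~hi)*H(pLo));
    fprintf('information gain = %.4f bits\n', IG);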

John Phan et al. discussed various parameter optimization techniques for the SVM classifier in order to improve classification [4]. The study focused on identifying genetic markers that can differentiate renal cell carcinoma (RCC) subtypes. The dataset consisted of 13 clear-cell RCC, 4 chromophobe and 3 oncocytoma samples, and the study aimed at distinguishing clear-cell RCC from the other two categories. Initially, SVM was used to rank the genes in order of their ability to distinguish the classes, and the prediction accuracy (prediction rate) was calculated using a leave-one-out strategy. The sequential minimal optimization technique was used for optimizing the SVM. It was suggested that the accuracy obtained by the SVM depends largely on the selected kernel and its parameters. The study focused on the linear and radial basis function (RBF) kernels. Both kernels have a parameter C that corresponds to the penalty for misclassification. The higher the value of C, the larger the penalty, leading the classification model toward over-fitting; conversely, a smaller value of C leads to a more generalized model that may not classify unknown data accurately. In the paper, C was varied from 0.01 to 10 for the linear kernel and from 0.01 to 100 for the RBF kernel, and the optimal value of C was determined while calculating the prediction rate. Based on the average prediction rates obtained by varying these parameters, they proposed optimal values of C of 0.1 for the linear kernel and 1 for the RBF kernel. The optimal value of sigma for the RBF kernel was proposed as the mean distance of the closest 2m neighbors to a data point, where m corresponds to the dimensionality of the data point; the results suggested that the value of sigma calculated in this way yielded the highest average prediction rate. It was stated that the genes selected using this process were found to have biological significance based on gene ontology and the literature [4].

CHAPTER III

MATERIALS AND METHODS

The goal of this study is to determine the confidence level in classifying an unknown sample based on microarray data using an SVM classifier. It also focuses on analyzing different SVM kernels and determining the kernel and parameters best suited for the classification of a given microarray dataset. The analysis is performed using a distance measure, obtained by calculating the distance of a sample to the separating hyper-plane of an SVM classifier.

3.1 Methodology

A flow chart representation of the process followed is shown in Figure 3.1. The rest of this section gives a detailed description of each step of the flow chart.

3.1.1 Dataset selection

Three different microarray datasets are used in the study: i) the leukemia cancer dataset by Golub et al., obtained from the Broad Institute [1]; ii) the heart disease dataset of the Animal Models of Cardiomyopathies project from Cardio Genomics [13]; iii) the breast cancer dataset obtained from NCBI [14].

Figure 3.1: Flow chart representation of the process (dataset selection → random selection of a subset as training data with the remainder as test data → data preprocessing and scaling → feature selection using t-test → training the SVM classifier → determining the hyper-plane equation and calculating the margin distance → classifying the test data and calculating the distances of test samples → obtaining and analyzing the results; the procedure is repeated for each random split)

The leukemia dataset was generated using high-density oligonucleotide microarrays produced by Affymetrix and consists of gene expression profiles belonging to two tumor

sample types (ALL and AML). The dataset as obtained was already divided into training and test data. The entire dataset was combined in order to generate 10 different training and test datasets randomly. The complete dataset consists of 48 ALL samples and 25 AML samples, with a total of 7129 genes for each sample. Each experiment sample has two columns associated with it in the microarray dataset: the first column gives the expression level of the gene in the microarray experiment, and the second column, labeled CALL, helps determine whether the expression value is due to the gene or due to noise. It takes three values, P, M and A, corresponding to presence, marginal presence and absence of a signal [1]. The format of the dataset is shown in Table 3.1.

Table 3.1: Format of Leukemia cancer dataset

Accession Number   Sample 1  CALL  Sample 2  CALL  Sample 3  CALL
A28102_at               151  A          484  A         118  P
AB000114_at              72  A           61  A          16  A
AB000115_at             281  A          118  A         197  M
AB000220_at              36  A           39  A          39  A
AB000381_s_at            29  A           38  A          50  A
AB000409_at            -299  A          -11  A         237  P
AB000410_s_at          -336  A         -116  A        -129  A
AB000449_at              57  A          274  P         311  P
AB000450_at             186  P          245  P         186  P
AB000460_at            1647  P         2128  P        1608  P
AB000462_at             137  A          -82  A         204  P
AB000464_at             803  P         1489  P         322  P
AB000466_at            -894  A         -969  A        -444  A
AB000467_at            -632  A         -909  A        -254  P

The heart disease dataset was generated using the Affymetrix HgU133 Plus 2.0 microarray chip. The dataset comprises 32 files corresponding to diseased samples and 14 files corresponding to normal samples; the files were combined to create a dataset containing 46 samples in total. In this dataset the genes are given as rows. Each sample has one column specifying the gene expression value and a second column specifying the P, A or M CALL value [13]. Table 3.2 shows the format of the heart disease dataset.

Table 3.2: Format of heart disease dataset

Probe_Set_Name   Sample 1  CALL  Sample 2  CALL  Sample 3  CALL
1007_s_at                  P               P               P
1053_at                    P               P               P
117_at                     P               P               P
121_at                     P         1283  P               P
1255_g_at           25.9   A         46.8  A         89.8  A
1294_at             441    P               P               P
1316_at                    P          157  P               P
1320_at             96.9   A               M         72    A
1405_i_at           92.9   P         62.3  A         74.2  A

The breast cancer dataset was generated using the Affymetrix Human Genome U133 Plus 2.0 GeneChip. The dataset has 27 normal samples and 31 breast cancer tumor samples, with the genes given for each sample. A normalized breast cancer dataset was obtained, where the normalization was performed using the quantile normalization technique [14]. Table 3.3 shows the format of the dataset.

Table 3.3: Format of breast cancer dataset

Probe Set ID   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
[rows of probe set IDs with one normalized expression value per sample]

3.1.2 Training and test dataset generation

For each of the 3 datasets, the complete dataset was used to generate training and test data randomly. To generate the test data, a few samples from each category were picked from the entire dataset; the remaining samples formed the training dataset. The samples were chosen randomly for the test data, and the process was repeated 10 times in order to obtain 10 different training and test datasets. This process ensured that the generated datasets differed from each other and also guaranteed the mutual exclusivity of training and test data. The mutual exclusivity eliminates any prior information about the test samples in the trained SVM classifier. The generation of the datasets was automated in MATLAB, and the MATLAB code is presented in Appendix A; a sketch of the split is shown below. The training and test data sample distribution for each of the 3 datasets is shown in Table 3.4.
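A minimal sketch of this stratified random split follows. The sample counts, variable names and synthetic data are illustrative stand-ins, not the thesis's Appendix A code.

    % Illustrative random training/test split, stratified by class.
    rng('shuffle');
    data   = randn(73, 7129);              % synthetic stand-in for 73 samples x 7129 genes
    labels = [ones(48,1); -ones(25,1)];    % e.g. 48 ALL (+1) and 25 AML (-1) samples
    nTestPerClass = 5;                     % illustrative number of test samples per class

    testIdx = [];
    for c = [1, -1]
        idx = find(labels == c);
        idx = idx(randperm(numel(idx)));              % shuffle within the class
        testIdx = [testIdx; idx(1:nTestPerClass)];    % pick a few samples per class
    end
    trainIdx = setdiff((1:numel(labels))', testIdx);  % mutually exclusive by construction

    trainData = data(trainIdx,:);  trainLabels = labels(trainIdx);
    testData  = data(testIdx,:);   testLabels  = labels(testIdx);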

Table 3.4: Sample distribution of training and test data for the 3 datasets

                Leukemia Dataset        Heart Disease Dataset      Breast Cancer Dataset
                ALL   AML   Total       Diseased  Normal  Total    Tumor  Normal  Total
Training Data    …     …     …             …        …       …        …      …      …
Testing Data     …     …     …             …        …       …        …      …      …

3.1.3 Data preprocessing and feature selection

The generated training and test datasets were subjected to data preprocessing. As the first step, the housekeeping genes present in the microarray datasets were removed. The total number of genes and the number of housekeeping genes in the 3 datasets are shown in Table 3.5.

Table 3.5: Number of genes in each dataset

Dataset                   Total number of genes   Number of housekeeping genes
Leukemia Cancer Dataset             …                         …
Heart Disease Dataset               …                         …
Breast Cancer Dataset               …                         …

After removing the housekeeping genes, the numbers of genes remaining in the leukemia cancer, heart disease and breast cancer datasets were 7071, … and …, respectively.

Based on the study conducted by Adarsh Jose et al. [16], in the leukemia and heart disease datasets the genes with more CALL values marked as absent (A) than present (P) were removed. As the breast cancer data had no information about the CALL value, all of its genes were retained for further processing. Threshold values of gene expression for the leukemia cancer and heart disease datasets were also set, with a maximum of … and a minimum of 100; as the breast cancer dataset was already normalized, no threshold values were set for it. The training dataset and test dataset were then scaled separately using the MATLAB implementation of data scaling; both training and test data were scaled to the range -1 to 1 [15]. The MATLAB program for preprocessing, setting the threshold values and scaling the data is presented in Appendix B.

Using all the available genes for classification often results in model over-fitting and computational overhead. Hence, feature selection was performed on the training dataset to obtain informative genes that can be used for classification. Student's t-test was used to identify differentially expressed genes in the two categories. The p-value obtained from Student's t-test gives the probability that the means of the two groups are similar [20]; hence, lower p-values indicate a bigger difference in the gene expression levels of the two categories. The MATLAB implementation of the two-tailed, unequal population variance t-test was used to obtain the p-values. The genes of all the samples in the dataset were sorted in increasing order of p-value, and the top 350 genes from the training set were retained for training the SVM classifiers. Exactly the same genes that were left in the training data after preprocessing and feature selection were selected from the test data in order to test the trained SVM classifiers. The MATLAB code for feature selection is shown in Appendix C.
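The ranking step can be sketched with MATLAB's ttest2 as below. The synthetic data and variable names are assumptions for illustration; the thesis's actual implementation is in Appendix C.

    % Rank genes by two-tailed, unequal-variance t-test p-values.
    rng(3);
    trainData   = randn(63, 7071);                 % synthetic scaled training data
    trainLabels = [ones(43,1); -ones(20,1)];       % synthetic class labels

    classA = trainData(trainLabels ==  1, :);
    classB = trainData(trainLabels == -1, :);
    [~, p] = ttest2(classA, classB, 'Vartype', 'unequal');  % one p-value per gene

    [~, order] = sort(p, 'ascend');                % smaller p = more differentially expressed
    topGenes = order(1:350);                       % retain the top 350 genes
    trainTop = trainData(:, topGenes);             % take the same columns from the test data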

3.1.4 Training the SVM

The preprocessed and scaled training data were used as input to the SVM classifier for training the model. The MATLAB implementation of the SVM classifier, the svmtrain method, was used for training the SVM model. The classifier is a soft-margin support vector machine. Two different kernels were used to train the SVM model: a linear kernel and a Gaussian radial basis function (RBF) kernel. The linear kernel function maps the samples in the training data onto a feature space and determines the optimal maximal-margin hyper-plane that divides the two categories of data. The function used for the linear kernel is

K(x_i, x_j) = x_i · x_j^T (3.1)

where · signifies the dot product between the two (row) vectors. In a Gaussian RBF kernel the data samples are transformed to a high-dimensional, possibly infinite-dimensional, space where the data belonging to the two categories can be differentiated using a linear hyper-plane. The kernel function used for RBF is

K(x_i, x_j) = exp(-γ ||x_i - x_j||^2), γ > 0 (3.2)

where γ is the kernel parameter. A default value of 1 was chosen for γ. As mentioned earlier, determining an optimal hyper-plane is an optimization problem; the quadratic programming optimization technique was used for determining the separating hyper-plane. A default scalar value of 1 was chosen as the BoxConstraintValue for the soft margin. The BoxConstraintValue specifies the value of C in Equation 1.14. The default scalar value was automatically rescaled to N/(2*N_1) for the N_1 data points belonging to group one and to N/(2*N_2) for the N_2 data points belonging to group two, where N is the total number of data points, N = N_1 + N_2. The rescaling accounts for unbalanced groups [21]. The SVM classifier was trained using the data and the known classes of the training data. The input given to the SVM classifier was varied by varying the number of features selected in the training dataset and also by varying the kernel function used by the SVM classifier. The number of features selected was varied from 5 to 350.

3.1.5 Determining the equation of hyper-plane

The output obtained from the trained SVM classifier consists of the SVM parameters: the support vectors, the non-negative Lagrange multipliers of the support vectors (α) and the bias (b). The bias parameter relates to the distance of the separating plane from the origin. The SVM parameters obtained are used to determine the equation of the separating hyper-plane. Theoretically, the equation of the hyper-plane is given by

w · Φ(x) + b = 0 (3.3)

where w is the normal vector to the hyper-plane, Φ(x) is the mapping function and b is the bias value. From Equation 1.11, the normal vector w is determined using the equation

w = Σ_{i=1}^{N} α_i y_i x_i (3.4)

where α_i is the alpha value obtained from the SVM method and y_i is the class value of the data sample x_i. As only the support vectors play a role in determining the hyper-plane, the x_i here are the support vectors; since the alpha values for all samples other than the support vectors are zero, the summation can be carried out over the support vectors alone.
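Sections 3.1.4 and 3.1.5 can be sketched as follows using the svmtrain function the thesis relied on (since removed from MATLAB in favor of fitcsvm). The data are synthetic, and the recovery of w applies to the linear kernel only; note that MATLAB parameterizes the RBF kernel width by rbf_sigma rather than by γ.

    % Train soft-margin SVMs with linear and RBF kernels (cf. Sections 3.1.4-3.1.5).
    rng(4);
    X = [randn(30,350) + 0.5; randn(30,350) - 0.5];   % synthetic scaled training data
    y = [ones(30,1); -ones(30,1)];                    % known classes

    linModel = svmtrain(X, y, 'kernel_function', 'linear', ...
                        'boxconstraint', 1, 'autoscale', false);
    rbfModel = svmtrain(X, y, 'kernel_function', 'rbf', 'rbf_sigma', 1, ...
                        'boxconstraint', 1, 'autoscale', false);

    % svmtrain returns Alpha already signed by class, so Equation 3.4 becomes a
    % single product over the support vectors (linear kernel only):
    w = linModel.Alpha' * linModel.SupportVectors;    % normal vector of the hyper-plane
    b = linModel.Bias;                                % bias term of w.x + b = 0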

3.1.6 Distance of a sample to hyper-plane

After determining the equation of the hyper-plane, w · x + b = 0, the distance of a data point p to the plane can be calculated using

|w · p + b| / ||w|| (3.5)

where w is the normal vector of the plane and ||w|| represents the 2-norm of w. After training the SVM classifier and determining the equation of the hyper-plane, the margin distance was calculated. Theoretically, the margin is defined as the region between the supporting hyper-planes that pass through the support vectors; these supporting hyper-planes are parallel to the decision boundary. As only support vectors are used to determine the separating plane, and ideally the supporting planes pass through the support vectors, determining the distances of the support vectors to the hyper-plane gives an estimate of the margin distance [17, 18, 19].

Figure 3.2 shows the SVM feature space with the support vectors and their distances to the decision boundary. The data points s_1, s_2, s_3 and s_4 shown in the figure are support vectors of the SVM classifier; support vectors belonging to different classes are distinguished using separate colors, red and blue. d_1, d_2, d_3 and d_4 represent the distances of support vectors s_1, s_2, s_3 and s_4 to the separating plane, respectively. Each distance is calculated using Equation 3.5, with p in the equation being the support vector s_i whose distance to the separating plane is to be determined. Generally, the margin distance m_d is determined by taking the arithmetic mean of the distances of the support vectors to the hyper-plane. However, there are instances where support vectors are misclassified, reducing the efficiency of the classifier. In such cases the margin distance is penalized to account for the decrease in accuracy: the distance of a misclassified support vector from the hyper-plane is subtracted from the total distance, rather than added, while calculating the margin distance.

Figure 3.2: SVM feature space showing the support vectors (s_i), margin distance (m_d) and the danger zone

The region parallel to the separating hyper-plane and within a distance of m_d corresponds to the margin area and is shown as the shaded region in the figure. This shaded region is referred to as the danger zone in the rest of the thesis.
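Continuing directly from the training sketch above, the margin distance m_d of this section can be computed as below. The alignment of svmtrain's Alpha sign convention with the sign of w · x + b is an assumption of this sketch; the thesis's own calculation is in Appendix D.

    % Distances of the support vectors (Equation 3.5) and the penalized mean m_d.
    sv    = linModel.SupportVectors;            % support vectors s_i
    svCls = sign(linModel.Alpha);               % class of each s_i (assumed sign convention)
    d     = (sv * w' + b) / norm(w);            % signed distance of each s_i to the plane

    % A support vector on the wrong side of the boundary is treated as
    % misclassified, and its distance is subtracted rather than added.
    ok  = sign(d) == svCls;
    m_d = (sum(abs(d(ok))) - sum(abs(d(~ok)))) / numel(d);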

The region is so named because samples lying in it have a greater chance of being misclassified, and the class value predicted for these samples is uncertain. In order to verify that the shaded region approximately corresponds to the region between the supporting hyper-planes shown in Figure 1.2, the distances of all the training samples to the hyper-plane were calculated. From the distances calculated it was observed that the support vectors are the only samples that fall in the danger zone.

3.1.7 Classifying test data

After training the classifier, the independently generated test dataset was classified in order to test the accuracy of the classifier. The number of genes used for classification was varied from the top 5 to the top 350 genes obtained from the t-test, and the same genes that were used for training the SVM were selected from the test data for classification. The MATLAB implementation of the svmclassify method was used for classification, and the distance of the test samples to the separating plane was determined. The MATLAB code for training the SVM, testing the classifier and calculating the distance of test samples to the hyper-plane is given in Appendix D.

3.1.8 Obtaining and analyzing the results

The distances of the misclassified test samples were compared with the margin value to determine whether they lie in the danger zone. The entire process of training the classifier, determining the margin distance and testing the classifier with an independent test dataset was repeated for each of the 10 randomly generated training and test datasets.
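A sketch of that danger-zone check follows, continuing from the sketches above; svmclassify pairs with svmtrain, and the synthetic test data stand in for the thesis's preprocessed test sets.

    % Classify test samples and flag those falling inside the danger zone.
    Xtest = randn(10, 350);                     % synthetic stand-in for scaled test data
    predicted = svmclassify(linModel, Xtest);   % predicted class labels
    dTest = (Xtest * w' + b) / norm(w);         % Equation 3.5 applied to the test samples
    inDanger = abs(dTest) < m_d;                % predictions in this band are uncertain
    fprintf('%d of %d test samples lie in the danger zone\n', sum(inDanger), numel(dTest));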


More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Automated Microarray Classification Based on P-SVM Gene Selection

Automated Microarray Classification Based on P-SVM Gene Selection Automated Microarray Classification Based on P-SVM Gene Selection Johannes Mohr 1,2,, Sambu Seo 1, and Klaus Obermayer 1 1 Berlin Institute of Technology Department of Electrical Engineering and Computer

More information

Data Mining in Bioinformatics Day 1: Classification

Data Mining in Bioinformatics Day 1: Classification Data Mining in Bioinformatics Day 1: Classification Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines & Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

CLUSTERING IN BIOINFORMATICS

CLUSTERING IN BIOINFORMATICS CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of

More information

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment A System for Managing Experiments in Data Mining A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Greeshma

More information

Lecture 7: Support Vector Machine

Lecture 7: Support Vector Machine Lecture 7: Support Vector Machine Hien Van Nguyen University of Houston 9/28/2017 Separating hyperplane Red and green dots can be separated by a separating hyperplane Two classes are separable, i.e., each

More information

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 Overview The goals of analyzing cross-sectional data Standard methods used

More information

Classification by Nearest Shrunken Centroids and Support Vector Machines

Classification by Nearest Shrunken Centroids and Support Vector Machines Classification by Nearest Shrunken Centroids and Support Vector Machines Florian Markowetz florian.markowetz@molgen.mpg.de Max Planck Institute for Molecular Genetics, Computational Diagnostics Group,

More information

Supervised classification exercice

Supervised classification exercice Universitat Politècnica de Catalunya Master in Artificial Intelligence Computational Intelligence Supervised classification exercice Authors: Miquel Perelló Nieto Marc Albert Garcia Gonzalo Date: December

More information

Lecture Linear Support Vector Machines

Lecture Linear Support Vector Machines Lecture 8 In this lecture we return to the task of classification. As seen earlier, examples include spam filters, letter recognition, or text classification. In this lecture we introduce a popular method

More information

Classification by Support Vector Machines

Classification by Support Vector Machines Classification by Support Vector Machines Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Practical DNA Microarray Analysis 2003 1 Overview I II III

More information

5 Learning hypothesis classes (16 points)

5 Learning hypothesis classes (16 points) 5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated

More information

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Kernels + K-Means Matt Gormley Lecture 29 April 25, 2018 1 Reminders Homework 8:

More information

.. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar..

.. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar.. .. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar.. Machine Learning: Support Vector Machines: Linear Kernel Support Vector Machines Extending Perceptron Classifiers. There are two ways to

More information

Linear methods for supervised learning

Linear methods for supervised learning Linear methods for supervised learning LDA Logistic regression Naïve Bayes PLA Maximum margin hyperplanes Soft-margin hyperplanes Least squares resgression Ridge regression Nonlinear feature maps Sometimes

More information

A Short SVM (Support Vector Machine) Tutorial

A Short SVM (Support Vector Machine) Tutorial A Short SVM (Support Vector Machine) Tutorial j.p.lewis CGIT Lab / IMSC U. Southern California version 0.zz dec 004 This tutorial assumes you are familiar with linear algebra and equality-constrained optimization/lagrange

More information

Support Vector Machines

Support Vector Machines Support Vector Machines SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions 6. Dealing

More information

Gene Expression Based Classification using Iterative Transductive Support Vector Machine

Gene Expression Based Classification using Iterative Transductive Support Vector Machine Gene Expression Based Classification using Iterative Transductive Support Vector Machine Hossein Tajari and Hamid Beigy Abstract Support Vector Machine (SVM) is a powerful and flexible learning machine.

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

12 Classification using Support Vector Machines

12 Classification using Support Vector Machines 160 Bioinformatics I, WS 14/15, D. Huson, January 28, 2015 12 Classification using Support Vector Machines This lecture is based on the following sources, which are all recommended reading: F. Markowetz.

More information

DM6 Support Vector Machines

DM6 Support Vector Machines DM6 Support Vector Machines Outline Large margin linear classifier Linear separable Nonlinear separable Creating nonlinear classifiers: kernel trick Discussion on SVM Conclusion SVM: LARGE MARGIN LINEAR

More information

Committee: Dr. Rosemary Renaut 1 Professor Department of Mathematics and Statistics, Director Computational Biosciences PSM Arizona State University

Committee: Dr. Rosemary Renaut 1 Professor Department of Mathematics and Statistics, Director Computational Biosciences PSM Arizona State University Evaluation of Gene Selection Using Support Vector Machine Recursive Feature Elimination A report presented in fulfillment of internship requirements of the CBS PSM Degree Committee: Dr. Rosemary Renaut

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

Title: Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data

Title: Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data Supplementary material for Manuscript BIOINF-2005-1602 Title: Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data Appendix A. Testing K-Nearest Neighbor and Support

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview 1. Overview of SVMs 2. Margin Geometry 3. SVM Optimization 4. Overlapping Distributions 5. Relationship to Logistic Regression 6. Dealing

More information

Topic 4: Support Vector Machines

Topic 4: Support Vector Machines CS 4850/6850: Introduction to achine Learning Fall 2018 Topic 4: Support Vector achines Instructor: Daniel L Pimentel-Alarcón c Copyright 2018 41 Introduction Support vector machines (SVs) are considered

More information

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering IJIRST International Journal for Innovative Research in Science & Technology Volume 3 Issue 07 December 2016 ISSN (online): 2349-6010 Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 15-1: Support Vector Machines Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

More information

Version Space Support Vector Machines: An Extended Paper

Version Space Support Vector Machines: An Extended Paper Version Space Support Vector Machines: An Extended Paper E.N. Smirnov, I.G. Sprinkhuizen-Kuyper, G.I. Nalbantov 2, and S. Vanderlooy Abstract. We argue to use version spaces as an approach to reliable

More information

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs) Data Mining: Concepts and Techniques Chapter 9 Classification: Support Vector Machines 1 Support Vector Machines (SVMs) SVMs are a set of related supervised learning methods used for classification Based

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Feature Selection for SVMs

Feature Selection for SVMs Feature Selection for SVMs J. Weston, S. Mukherjee, O. Chapelle, M. Pontil T. Poggio, V. Vapnik Barnhill BioInformatics.com, Savannah, Georgia, USA. CBCL MIT, Cambridge, Massachusetts, USA. AT&T Research

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Gene selection through Switched Neural Networks

Gene selection through Switched Neural Networks Gene selection through Switched Neural Networks Marco Muselli Istituto di Elettronica e di Ingegneria dell Informazione e delle Telecomunicazioni Consiglio Nazionale delle Ricerche Email: Marco.Muselli@ieiit.cnr.it

More information

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Michael Tagare De Guzman May 19, 2012 Support Vector Machines Linear Learning Machines and The Maximal Margin Classifier In Supervised Learning, a learning machine is given a training

More information

Robot Learning. There are generally three types of robot learning: Learning from data. Learning by demonstration. Reinforcement learning

Robot Learning. There are generally three types of robot learning: Learning from data. Learning by demonstration. Reinforcement learning Robot Learning 1 General Pipeline 1. Data acquisition (e.g., from 3D sensors) 2. Feature extraction and representation construction 3. Robot learning: e.g., classification (recognition) or clustering (knowledge

More information

Gene expression & Clustering (Chapter 10)

Gene expression & Clustering (Chapter 10) Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Chapter 7 UNSUPERVISED LEARNING TECHNIQUES FOR MAMMOGRAM CLASSIFICATION

Chapter 7 UNSUPERVISED LEARNING TECHNIQUES FOR MAMMOGRAM CLASSIFICATION UNSUPERVISED LEARNING TECHNIQUES FOR MAMMOGRAM CLASSIFICATION Supervised and unsupervised learning are the two prominent machine learning algorithms used in pattern recognition and classification. In this

More information

Course on Microarray Gene Expression Analysis

Course on Microarray Gene Expression Analysis Course on Microarray Gene Expression Analysis ::: Normalization methods and data preprocessing Madrid, April 27th, 2011. Gonzalo Gómez ggomez@cnio.es Bioinformatics Unit CNIO ::: Introduction. The probe-level

More information

Data-analysis problems of interest

Data-analysis problems of interest Introduction 3 Data-analysis prolems of interest. Build computational classification models (or classifiers ) that assign patients/samples into two or more classes. - Classifiers can e used for diagnosis,

More information

Module 4. Non-linear machine learning econometrics: Support Vector Machine

Module 4. Non-linear machine learning econometrics: Support Vector Machine Module 4. Non-linear machine learning econometrics: Support Vector Machine THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Introduction When the assumption of linearity

More information

Support Vector Machines

Support Vector Machines Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose

More information

Lecture 10: SVM Lecture Overview Support Vector Machines The binary classification problem

Lecture 10: SVM Lecture Overview Support Vector Machines The binary classification problem Computational Learning Theory Fall Semester, 2012/13 Lecture 10: SVM Lecturer: Yishay Mansour Scribe: Gitit Kehat, Yogev Vaknin and Ezra Levin 1 10.1 Lecture Overview In this lecture we present in detail

More information

More on Classification: Support Vector Machine

More on Classification: Support Vector Machine More on Classification: Support Vector Machine The Support Vector Machine (SVM) is a classification method approach developed in the computer science field in the 1990s. It has shown good performance in

More information

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes. Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or

More information

Leave-One-Out Support Vector Machines

Leave-One-Out Support Vector Machines Leave-One-Out Support Vector Machines Jason Weston Department of Computer Science Royal Holloway, University of London, Egham Hill, Egham, Surrey, TW20 OEX, UK. Abstract We present a new learning algorithm

More information

Bagging and Boosting Algorithms for Support Vector Machine Classifiers

Bagging and Boosting Algorithms for Support Vector Machine Classifiers Bagging and Boosting Algorithms for Support Vector Machine Classifiers Noritaka SHIGEI and Hiromi MIYAJIMA Dept. of Electrical and Electronics Engineering, Kagoshima University 1-21-40, Korimoto, Kagoshima

More information

Machine Learning: Think Big and Parallel

Machine Learning: Think Big and Parallel Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least

More information

6.034 Notes: Section 8.1

6.034 Notes: Section 8.1 6.034 Notes: Section 8.1 Slide 8.1.1 There is no easy way to characterize which particular separator the perceptron algorithm will end up with. In general, there can be many separators for a data set.

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

CSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18

CSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18 CSE 417T: Introduction to Machine Learning Lecture 22: The Kernel Trick Henry Chai 11/15/18 Linearly Inseparable Data What can we do if the data is not linearly separable? Accept some non-zero in-sample

More information

Basis Functions. Volker Tresp Summer 2017

Basis Functions. Volker Tresp Summer 2017 Basis Functions Volker Tresp Summer 2017 1 Nonlinear Mappings and Nonlinear Classifiers Regression: Linearity is often a good assumption when many inputs influence the output Some natural laws are (approximately)

More information

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION Sandeep Kaur 1, Dr. Sheetal Kalra 2 1,2 Computer Science Department, Guru Nanak Dev University RC, Jalandhar(India) ABSTRACT

More information

Instantaneously trained neural networks with complex inputs

Instantaneously trained neural networks with complex inputs Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2003 Instantaneously trained neural networks with complex inputs Pritam Rajagopal Louisiana State University and Agricultural

More information

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information