SVM CLASSIFICATION AND ANALYSIS OF MARGIN DISTANCE ON MICROARRAY DATA

A Thesis Presented to The Graduate Faculty of The University of Akron

In Partial Fulfillment of the Requirements for the Degree Master of Science

Ameer Basha Shaik Abdul

May, 2011

SVM CLASSIFICATION AND ANALYSIS OF MARGIN DISTANCE ON MICROARRAY DATA

Ameer Basha Shaik Abdul

Thesis

Approved:
Advisor: Dr. Zhong-Hui Duan
Committee Member: Dr. Chien-Chung Chan
Committee Member: Dr. Yingcai Xiao

Accepted:
Department Chair: Dr. Chien-Chung Chan
Dean of the College: Dr. Chand K. Midha
Dean of the Graduate School: Dr. George R. Newkome

ABSTRACT

Support vector machine (SVM) is a statistical classification algorithm that classifies data by separating two classes with a functional hyper-plane. SVM is known for good performance on noisy and high-dimensional data such as microarray data. A marginal region of the functional hyper-plane, named the danger zone, is defined as the region between two parallel hyper-planes determined by the average distances of the support vectors of the two classes to the functional hyper-plane. The main aim of this study was to determine the effect of the margin distance, the width of the danger zone, on the accuracy of the classifier and to analyze the role of the margin distance in feature selection. The study was carried out using three microarray datasets. For each dataset, the equation of the functional hyper-plane separating the two classes of data was derived and the corresponding support vectors were obtained. The average distances between the support vectors of the two classes and the functional hyper-plane were calculated. The relation between the width of the danger zone and the classification accuracy was investigated, and the rate of change of the margin distance with respect to the number of features used for constructing the support vector machine was also examined. The results indicate that although the correlation between margin and accuracy is not very strong, the rate of change of classification accuracy with respect to margin distance can be employed to determine the optimal number of features for constructing a high-performance support vector machine for classifying microarray samples.

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisor Dr. Zhong-Hui Duan for her continuous guidance throughout the research. With her untiring advice and invaluable help, it has been possible to proceed in the correct direction and successfully complete the study. I would also like to thank my thesis committee members Dr. Chien-Chung Chan and Dr. Yingcai Xiao for their expert suggestions. I am extremely grateful to my parents for having faith in me and for extending their support at all times. In addition, I would also like to thank my friends, Aparna Sriram, Teja Polapragada and Sri Harsha Muppaneni, for their timely help.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
I. INTRODUCTION
   1.1 Gene expression and Microarray technology
   1.2 Support Vector Machine
II. LITERATURE REVIEW
III. MATERIALS AND METHODS
   3.1 Methodology
      3.1.1 Dataset selection
      3.1.2 Training and test dataset generation
      3.1.3 Data preprocessing and feature selection
      3.1.4 Training the SVM
      3.1.5 Determining the equation of hyper-plane
      3.1.6 Distance of a sample to hyper-plane
      3.1.7 Classifying test data
      3.1.8 Obtaining and analyzing the results
IV. RESULTS AND DISCUSSIONS
   4.1 Distances of training samples to the decision boundary
   4.2 Distances of test samples
   4.3 Classification accuracy
   4.4 Relation between classification accuracy and margin value
   4.5 Variation in accuracy and margin with increase in the number of genes
   4.6 Effects of change in number of genes on margin and accuracy
   4.7 Correlation between margin and accuracy
V. CONCLUSIONS
REFERENCES
APPENDICES
   APPENDIX A. GENERATING TRAINING AND TEST DATASETS RANDOMLY
   APPENDIX B. SCALING TRAINING AND TEST DATASET
   APPENDIX C. PERFORMING T-TEST ON THE SCALED TRAINING DATA
   APPENDIX D. CALCULATING THE MARGIN DISTANCE OF SVM CLASSIFIER

LIST OF TABLES

3.1 Format of Leukemia dataset
3.2 Format of Heart disease dataset
3.3 Format of Breast cancer dataset
3.4 Sample distribution of training and test data for the 3 datasets
3.5 Number of genes in each dataset
4.1 Distances of training samples to the hyper-plane for linear SVM kernel
4.2 Distance of test samples to the hyper-plane using linear SVM kernel for Leukemia dataset
4.3 Distance of test samples to the hyper-plane using RBF SVM kernel for Leukemia dataset
4.4 Distance of test samples to the hyper-plane using linear SVM kernel for Heart disease dataset
4.5 Distance of test samples to the hyper-plane using RBF SVM kernel for Heart disease dataset
4.6 Distance of test samples to the hyper-plane using linear SVM kernel for Breast cancer dataset
4.7 Distance of test samples to the hyper-plane using RBF SVM kernel for Breast cancer dataset
4.8 Percentage of misclassified test samples in the danger zone for linear kernel SVM
4.9 Percentage of misclassified test samples in the danger zone for RBF kernel SVM
4.10 Average classification accuracy for linear kernel SVM
4.11 Average classification accuracy for RBF kernel SVM
4.12 Correlation between margin value and classification accuracy

LIST OF FIGURES

1.1 Microarray experiment
1.2 Decision boundary and margin of SVM classifier
3.1 Flow chart representation of the process
3.2 SVM feature space showing the support vectors (s_i), margin distance (m_d) and the danger zone
4.1 Classification accuracy of linear kernel SVM for Leukemia dataset
4.2 Margin value of linear kernel SVM for Leukemia dataset
4.3 Classification accuracy of RBF kernel SVM for Leukemia dataset
4.4 Margin value of RBF kernel SVM for Leukemia dataset
4.5 Classification accuracy of linear kernel SVM for Heart disease dataset
4.6 Margin value of linear kernel SVM for Heart disease dataset
4.7 Classification accuracy of RBF kernel SVM for Heart disease dataset
4.8 Margin value of RBF kernel SVM for Heart disease dataset
4.9 Classification accuracy of linear kernel SVM for Breast cancer dataset
4.10 Margin value of linear kernel SVM for Breast cancer dataset
4.11 Classification accuracy of RBF kernel SVM for Breast cancer dataset
4.12 Margin value of RBF kernel SVM for Breast cancer dataset
4.13 Rate of change of margin with respect to the change in number of genes for Leukemia dataset using linear SVM kernel
4.14 Rate of change of accuracy with respect to the change in number of genes for Leukemia dataset using linear SVM kernel
4.15 Rate of change of margin with respect to the change in number of genes for Leukemia dataset using RBF SVM kernel
4.16 Rate of change of accuracy with respect to the change in number of genes for Leukemia dataset using RBF SVM kernel
4.17 Rate of change of margin with respect to the change in number of genes for Heart disease dataset using linear SVM kernel
4.18 Rate of change of accuracy with respect to the change in number of genes for Heart disease dataset using linear SVM kernel
4.19 Rate of change of margin with respect to the change in number of genes for Heart disease dataset using RBF SVM kernel
4.20 Rate of change of accuracy with respect to the change in number of genes for Heart disease dataset using RBF SVM kernel
4.21 Rate of change of margin with respect to the change in number of genes for Breast cancer dataset using linear SVM kernel
4.22 Rate of change of accuracy with respect to the change in number of genes for Breast cancer dataset using linear SVM kernel
4.23 Rate of change of margin with respect to the change in number of genes for Breast cancer dataset using RBF SVM kernel
4.24 Rate of change of accuracy with respect to the change in number of genes for Breast cancer dataset using RBF SVM kernel

CHAPTER I

INTRODUCTION

Bioinformatics is an emerging field that has its roots in molecular biology, mathematics and computer science. It deals with the generation, management and analysis of biological data obtained from various experiments and techniques, often in very large volumes. The analysis of such enormous biological data requires sophisticated algorithms that can process the data, help visualize it and extract information from it [7]. This led to the evolution of bioinformatics, an interdisciplinary field involving both biologists and computer scientists.

Advancements in the field of bioinformatics have enabled many researchers to analyze the data and understand structural, comparative and functional properties. Some of these advancements are the analysis of genomes and proteins, the identification of metabolic and signaling pathways that define gene-to-gene relationships, and the development of microarray chips and microarray experiments to measure gene expression levels. The availability of the data on public websites and repositories has made it easier to carry out research. NCBI is one such database; it includes DNA and protein sequence data and also allows researchers to contribute their sequences to the database. KEGG and EcoCyc are databases that contain

information about the pathways [7]. To process the data, finely tuned algorithms have been developed over the years and made publicly available; BLAST and CLUSTALW, for example, perform sequence comparison. Algorithms to perform phylogenetic analysis are also available on public websites [8].

One of the major advancements in the field of bioinformatics is the emergence of microarray technology. Microarray technology makes it possible to determine the expression values of several thousand genes simultaneously. The gene expression data are used in various analyses to understand the biological characteristics of the species or the tissue from which the genes were extracted for the experiment. One such analysis is classification of a sample based on the gene expression values obtained from the microarray experiment.

This study focuses on the analysis and calculation of a distance measure and the margin of a support vector machine classifier for microarray datasets. It also studies the effect of the margin value on the classification accuracy and the relation between them. Before we proceed further, a brief introduction to gene expression and microarray technology is provided, followed by a discussion of the support vector machine classifier.

1.1 Gene expression and Microarray technology

The characteristic features and behavior of a biological species largely depend on the genes and the proteins present in it. The proteins obtained from the genes vary depending upon the gene expression levels. Hence, analyzing the expression levels of genes under various conditions helps identify the reason behind abnormalities in diseased

species, in addition to identifying the genes responsible for the abnormality. Microarray technology is used to study and record the gene expression of thousands of genes simultaneously.

A microarray is a chip on which biological substrates are bound to probes present on a silicon chip or a glass slide. The biological substrates can be DNA, protein molecules or carbohydrates, and they decide the type of microarray chip. There are different types of microarrays, such as DNA microarrays, protein microarrays, tissue microarrays and carbohydrate microarrays [9]. DNA microarrays are the ones commonly used to record the expression levels of genes.

Figure 1.1: Microarray experiment (experiment tissue samples → mRNAs extracted → reverse transcription to cDNA → labeling with fluorescent dyes → hybridization on the microarray → laser scanning → scanned image)

Figure 1.1 shows a typical microarray experiment. The target mRNAs (messenger RNAs) of the species whose gene expression is to be measured are reverse transcribed to cDNAs (complementary DNAs). The cDNAs are labeled with fluorescent dyes or radioactive labels and are hybridized on the microarray chip. The chip is left overnight to hybridize. During hybridization, cDNAs bind to their complementary strands present on the microarray chip through base pairing. The chip is then washed to remove any non-specific DNA bindings, and scanned to obtain a digital image. The image is analyzed and processed using image processing and data normalization techniques to record the expression levels of thousands of genes. For a dual-channel microarray chip, both control and experiment cell tissue samples are extracted and colored with different fluorescent dyes. They are then reverse transcribed to cDNAs and hybridized on the dual-channel microarray chip. After hybridization the chip is scanned to obtain an image, which is further processed to obtain the gene expression levels of the experiment tissue samples.

Microarrays have many applications in the medical and biological fields, and various kinds of microarrays are used to obtain the expression levels of biological entities. For example, protein microarrays are used to understand protein-protein, protein-drug and protein-DNA interactions. In medicine, DNA microarrays are used to identify differentially expressed genes. In addition, microarrays are used for drug discovery and to study the changes in gene expression levels in response to drugs [9]. In cancer research, microarrays are used for mutation detection, gene copy number analysis, cancer therapeutics and drug sensitivity studies [10].

Classification is one of the prominent analyses performed on microarray gene expression data. The analysis helps in distinguishing diseased samples and identifying unknown samples based on the gene expression data. In microarray data, features correspond to the genes in the experiments and samples correspond to the microarray experiments. As the data have a large number of features compared to the number of samples, classifying such data is quite a tedious task. Many classifiers tend to underperform and can lead to false discoveries due to the high-dimensional nature of microarray data [11]. Hence there is a need to optimize the classification techniques and fine-tune them to fit microarray data. The next section discusses the SVM classifier and the basic concept it uses for the classification of data.

1.2 Support Vector Machine

Support vector machine (SVM) is gaining popularity for its ability to classify noisy and high-dimensional data. SVM is a statistical learning algorithm that classifies samples using a subset of training samples called support vectors. The idea behind the SVM classifier is that it creates a feature space using the attributes in the training data. It then tries to identify a decision boundary, or hyper-plane, that separates the feature space into two halves where each half contains only the training data points belonging to one category. This is shown in Figure 1.2.

In Figure 1.2 the circular data points belong to one class and the square points belong to another class. SVM tries to find a hyper-plane (H_1 or H_2) that separates the two

categories. As shown in the figure, there may be many hyper-planes that can separate the data. Based on the maximum-margin hyper-plane concept, SVM chooses the best decision boundary that separates the data. Each hyper-plane (H_i) is associated with a pair of supporting hyper-planes (h_i1 and h_i2) that are parallel to the decision boundary (H_i) and pass through the nearest data points. The distance between these supporting planes is called the margin. In the figure, even though both hyper-planes (H_1 and H_2) divide the data points, H_1 has a bigger margin and tends to perform better for the classification of unknown samples than H_2: the bigger the margin, the smaller the generalization error for the classification of unknown samples. Hence, H_1 is preferred over H_2.

Figure 1.2: Decision boundary and margin of SVM classifier [6] (hyper-planes H_1 and H_2 with their supporting hyper-planes h_11, h_12, h_21 and h_22, and the normal vector w)

There are two types of SVMs: (1) linear SVM, which separates the data points using a linear decision boundary, and (2) non-linear SVM, which separates the data

points using a non-linear decision boundary. For a linear SVM the equation of the decision boundary is

w · x + b = 0 (1.1)

where w and x are vectors and the direction of w is perpendicular to the linear decision boundary. The vector w is determined using the training dataset. For any data point x_i that lies above the decision boundary,

w · x_i + b = k, where k > 0, (1.2)

and for any data point x_j that lies below the decision boundary,

w · x_j + b = k', where k' < 0. (1.3)

By rescaling the values of w and b, the equations of the two supporting hyper-planes (h_11 and h_12) can be defined as

h_11: w · x + b = 1 (1.4)
h_12: w · x + b = -1 (1.5)

The distance d between the two supporting hyper-planes (the margin) is obtained from

w · (x_1 - x_2) = 2 (1.6)
d = 2 / ||w|| (1.7)

The objective of the SVM classifier is to maximize the value of d, which is equivalent to minimizing the value of ||w||^2/2. The values of w and b are obtained by solving this quadratic optimization problem.
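To make Equations 1.1-1.7 concrete, the following MATLAB sketch evaluates the margin and the signed distance of a sample to a linear decision boundary. The values of w, b and x here are hypothetical placeholders, not quantities taken from the thesis experiments.

    % Minimal sketch of Equations 1.1-1.7 with hypothetical w and b.
    w = [2; -1];                  % normal vector of the decision boundary (assumed)
    b = 0.5;                      % bias term (assumed)
    x = [1; 3];                   % an arbitrary sample point

    d    = 2 / norm(w);           % margin, Equation 1.7
    f    = dot(w, x) + b;         % w . x + b, Equation 1.1
    dist = f / norm(w);           % signed distance of x to the decision boundary
    fprintf('margin = %.4f, signed distance = %.4f\n', d, dist);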

The optimization is subject to the constraints

w · x_i + b ≥ 1 if y_i = 1 (1.8)
w · x_i + b ≤ -1 if y_i = -1 (1.9)

where y_i is the class variable for x_i. Imposing these restrictions makes the SVM place the training instances with y_i = 1 above the hyper-plane h_11 and the training instances with y_i = -1 below the hyper-plane h_12. The optimization problem can be solved using the Lagrange multiplier method. The objective function to be minimized, in its Lagrangian form, can be written as

L_P = (1/2)||w||^2 - Σ_{i=1}^{N} α_i [y_i (w · x_i + b) - 1] (1.10)

where the α_i are Lagrange multipliers and N is the number of samples [6]. The Lagrange multipliers must be non-negative (α_i ≥ 0). In order to minimize the Lagrangian form, its partial derivatives with respect to w and b are set to zero:

∂L_P/∂w = 0  =>  w = Σ_{i=1}^{N} α_i y_i x_i (1.11)
∂L_P/∂b = 0  =>  Σ_{i=1}^{N} α_i y_i = 0 (1.12)

The problem is transformed to its dual form by substituting Equations 1.11 and 1.12 into the Lagrangian form, Equation 1.10. The dual form is given by

L_D = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j) (1.13)

The training instances for which α_i > 0 lie on the hyper-plane h_11 or h_12 and are called support vectors. Only these training instances are used to obtain the decision boundary parameters w and b; hence the classification of unknown samples is based on the support vectors.

In some cases it is preferable to misclassify some of the training samples (training errors) in order to obtain a decision boundary with a larger margin. A decision boundary with no training errors but a smaller margin may lead to over-fitting and may not classify unknown samples correctly. On the other hand, a decision boundary with a few training errors and a larger margin can classify unknown samples more accurately. Hence there must be a tradeoff between the margin and the number of training errors. The decision boundary thus obtained is called a soft margin. The constraints for the optimization problem still hold but need the addition of slack variables (ξ_i), which account for the soft margin; these slack variables correspond to the error in the decision boundary. A penalty for the training error is also introduced in the objective function in order to balance the margin value against the number of training errors. The objective for the optimization problem becomes the minimization of

||w||^2/2 + C (Σ_i ξ_i)^k (1.14)

where C and k are specified by the user and can be varied depending on the dataset. The constraints for the optimization problem become

w · x_i + b ≥ 1 - ξ_i, if y_i = 1, (1.15)
w · x_i + b ≤ -1 + ξ_i, if y_i = -1. (1.16)

The Lagrange multipliers for the soft margin differ from the Lagrange multipliers of the hard-margin linear decision boundary: the α_i values must be non-negative and also less than or equal to C. Hence the parameter C acts as an upper limit on the error in the decision boundary [6].

A linear SVM performs well on datasets that can be easily separated into two parts by a hyper-plane. But some datasets are complex and difficult to classify using a linear kernel; non-linear SVM classifiers can be used for such complex datasets. The concept behind the non-linear SVM classifier is to transform the dataset into a high-dimensional space where the data can be separated using a linear decision boundary, even though in the original feature space the decision boundary is not linear. The main problem with transforming the dataset to a higher dimension is the increase in the complexity of the classifier. Also, the exact mapping function that can separate the data linearly in the higher-dimensional space is not known. In order to overcome this, a concept called the kernel trick is used to transform the data to the higher-dimensional space. If Φ is the mapping function, then in order to find the linear decision boundary in the transformed higher-dimensional space, the attribute x in Equation 1.13 is replaced with Φ(x). The transformed Lagrangian dual form is given by

L_D = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (Φ(x_i) · Φ(x_j)) (1.17)

The dot product is a measure of similarity between two vectors. The key idea behind the kernel trick is that the dot product is computed analogously in the original and the transformed space. Consider two input instance vectors x_i and x_j in the original space. When mapped to the higher dimension, they are transformed to Φ(x_i) and Φ(x_j) respectively; likewise, the similarity measure is transformed from x_i · x_j in the original space to Φ(x_i) · Φ(x_j) in the higher-dimensional space. The dot product of Φ(x_i) and Φ(x_j) is called the kernel function and is represented by K(x_i, x_j). As the kernel trick treats the dot products in the two spaces as analogous, it allows the kernel function in the transformed space to be computed using the original attribute set. Hence the original non-linear decision boundary equation in the lower-dimensional space is transformed to the equation of a linear decision boundary in the higher-dimensional space, given by

w · Φ(x) + b = 0 (1.18)

A brief overview of the thesis is presented below:

1) Chapter 2 presents the literature review and some of the previous work done on microarray classification using SVM.
2) Chapter 3 focuses on the datasets used in the thesis, the process followed and the application of the process model to the selected datasets.
3) Chapter 4 presents the results obtained from the analyses performed and discusses them.
4) Chapter 5 concludes the thesis by presenting the observations derived from the current study.

CHAPTER II

LITERATURE REVIEW

This chapter discusses previous work done on microarray data using SVM classifiers. It also presents methods implemented to fine-tune the SVM classifier to adapt to high-dimensional microarray data. The chapter starts with a review of the original study performed on the leukemia cancer dataset by Golub et al.

The leukemia cancer dataset was obtained from the Broad Institute of MIT and Harvard [1]. The dataset is divided into a training dataset with 38 samples and a test dataset with 34 samples. The training dataset consists of 27 acute lymphoblastic leukemia (ALL) and 11 acute myeloid leukemia (AML) samples, whereas the test dataset consists of 24 ALL samples and 10 AML samples.

The original study on the leukemia dataset was performed by T. R. Golub et al. They proposed a classification technique to classify the tumor samples using microarray gene expression data. In the study, they ranked the genes based on how well each is correlated with the class distinction of a sample, and neighborhood analysis was used to evaluate a gene ranking based on this correlation. The genes that showed a strong correlation in distinguishing the classes were considered informative genes for classification. Using these genes, a class predictor was developed to classify the samples. Each informative gene provides information and favors a particular class.

The genes vote for one of the classes, and the vote can be weighted. The weight is calculated based on the expression level in the new sample and how well the gene relates to the particular class. The sample is classified to a class by totaling the votes. The voting also helps in determining the prediction strength (PS) of the call, which varies from 0 to 1; the sample is assigned to a class only if the PS value crosses a predefined threshold. The proposed model was tested using a cross-validation technique on the training data, and it correctly classified 36 out of 38 samples. The class predictor was then used to classify a set of independent test data samples, and the model correctly classified 29 out of 34 samples. The median value of the prediction strengths obtained was quite high (PS = 0.77).

The study also focused on identifying cancer classes using clustering. For this, a technique called self-organizing maps was used, which essentially identifies a centroid among the data points grouped together. The training dataset of 38 samples was used to identify two clusters, labeled A1, containing 25 samples, and A2, containing 13 samples. In order to test these clusters, they were used as training data for the class predictor to build a learning model, and the predictor model was evaluated using cross-validation. The accuracy obtained was quite high, with only one error and three uncertain samples. The study proposed an iterative development of a model for building a class predictor, where the data are initially used to build clusters and the clusters are then used to train a class predictor model. The samples misclassified under cross-validation are removed and the new data are used to retrain the class predictor model. This model was used to test independent test data, and the value of prediction strength was

quite high. The study further focused on identifying subclasses within ALL (T-lineage ALL and B-lineage ALL). The prediction strengths obtained for the subclasses of B-ALL were low, indicating that they belong to the same class. They concluded that microarray gene expression data can be used for identifying different classes in cancer, and that a standardized experimental method for obtaining microarray gene expression data will result in improved accuracy [2].

S. Mukherjee et al. in [12] performed classification on the leukemia cancer data [1] using SVMs. The study analyzed the classification ability of SVM on high-dimensional microarray data. They used the feature selection method proposed by Golub et al. in [2], ranked the features, and picked the top 49, 99 and 999 genes for classification. Classification using all 7129 genes in the dataset was also performed. The study proposed two methods: 1) SVM classification without rejections; 2) SVM classification with rejections. The former method classified the dataset using a linear SVM classifier with the top 49, 99 and 999 genes and also with the complete set of 7129 genes. The SVM classifier achieved better accuracy compared to the method proposed by Golub et al.; a non-linear polynomial kernel SVM classifier did not improve accuracy for this dataset. The second method used a confidence threshold value to reject test samples if they lie close to the boundary plane. The confidence threshold was calculated using a Bayesian formulation. The distance of the training samples to the decision boundary was calculated based on a leave-one-out strategy. The distribution function estimate of the distances was obtained using a non-parametric density estimation algorithm, and the confidence level for the classifier was obtained by subtracting the estimate from unity. The distance of a test sample to the decision boundary was calculated, and if it was less than the confidence

level of the classifier, the decision was rejected and the class of the test sample could not be determined. The overall accuracy obtained was 100%, with a few samples rejected in each category of the filtered genes: 4 samples were rejected with the top 49 genes, 2 with the top 99 genes, none with the top 999 genes and 3 with the full set of 7129 genes. They concluded that a linear SVM classifier with rejections based on confidence values performed well on the leukemia cancer dataset [12].

The study conducted by Terrence S. Furey et al. in [5] focused on the SVM classification technique to classify microarray data and also on the validation of cancer tissue samples. The dataset used for the experiment consisted of 31 samples, including cancerous ovarian tissues, normal ovarian tissues and normal non-ovarian tissues. They tried to classify cancerous tissues against normal tissues (both normal ovarian and normal non-ovarian). Most machine learning algorithms tend to underperform with large numbers of features, whereas SVM is known for handling high-dimensional data; hence the full dataset was initially used for classification. The process included classification of the entire dataset using a hold-one-out technique. Later, the features were ranked and the top features were used for classification. The features were ranked based on scores calculated as a ratio: the numerator is the difference between the mean expression values of a gene in normal tissues and in tumor tissues, and the denominator is the sum of the standard deviations over normal tissues and tumor tissues. A linear kernel was used for classification. The top genes by score were then taken to train the SVM, and unknown samples were classified.
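The score just described can be stated compactly in code. The MATLAB sketch below is an illustrative implementation of that ratio on synthetic expression values, not the code used by Furey et al.; the absolute value is taken here so that genes favoring either class rank highly.

    % Illustrative gene ranking by the mean-difference / summed-std score.
    rng(1);
    normal = randn(20, 100);           % 20 normal samples x 100 genes (synthetic)
    tumor  = randn(15, 100) + 0.3;     % 15 tumor samples x 100 genes (synthetic)

    score = abs(mean(normal,1) - mean(tumor,1)) ./ (std(normal,0,1) + std(tumor,0,1));
    [~, order] = sort(score, 'descend');
    topGenes = order(1:5);             % indices of the five highest-scoring genes
    disp(topGenes);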

For the ovarian dataset, two samples (N039 and HWBC3) were repeatedly misclassified. They analyzed these samples by calculating the margin value, which they defined as the distance of the sample from the decision boundary. The margin value for the misclassified sample N039 was relatively large, suggesting that it may have been mislabeled; a review by the biologists indicated that the sample was indeed mislabeled. The other misclassified sample, HWBC3, was determined to be an outlier. Among the top genes extracted by feature selection, three out of five were related to cancer. They proposed that feature selection can be used to extract genes related to cancer, but suggested that it is not fully reliable, as some genes unrelated to cancer were also ranked highly. To generalize the method, they performed the analysis on the leukemia dataset (Golub et al., 1999) and the colon cancer dataset (Alon et al., 1999); the results were comparable to those obtained previously. They concluded that SVM can be used for the classification of microarray data and for analyzing misclassified samples. It was suggested that with the use of a non-linear kernel, classification accuracy might be improved for complex datasets that are otherwise difficult to classify using a simple linear SVM kernel [5].

The study conducted by Sung-Huai Hsieh et al. [3] focused on classification of the leukemia dataset using support vector machines. They proposed an SVM classifier using Information Gain (IG) as the feature selection method. Information gain is the reduction in entropy when the data are partitioned based on a gene. Entropy can be used to determine whether a feature is useful for classification, in addition to determining the correlation within the training dataset. The microarray dataset was divided into two independent sets, training and test data, and feature selection was performed on the training dataset by calculating the IG values of the genes.

Genes with the top IG values were selected for classification. The training data were then scaled to handle outliers and minimize learning-model bias, and the scaled training data were used to train the SVM classifier. A radial basis function (RBF) kernel with grid searching was used in the SVM model, and optimum values for the RBF parameters (penalty C and gamma γ) were determined. The model was tested using cross-validation and also using an independent test dataset. The paper reported that the SVM classifier model with IG feature selection achieved good accuracy (98.10%) [3].
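As an illustration of the information-gain criterion described above, the MATLAB sketch below computes IG for a single gene using a median-split discretization. The discretization scheme and the synthetic data are assumptions for illustration; they are not the exact procedure of Hsieh et al.

    % Illustrative information gain for one gene, discretized at its median.
    rng(2);
    expr = randn(40,1);                        % synthetic expression values for one gene
    y    = [zeros(20,1); ones(20,1)];          % synthetic class labels

    H  = @(p) -sum(p(p>0) .* log2(p(p>0)));    % entropy of a probability vector
    hi = expr > median(expr);                  % median split of the expression values
    pAll = [mean(y==0); mean(y==1)];
    pHi  = [mean(y(hi)==0);  mean(y(hi)==1)];
    pLo  = [mean(y(~hi)==0); mean(y(~hi)==1)];
    IG = H(pAll) - (mean(hi)*H(pHi) + mean(~hi)*H(pLo));
    fprintf('information gain = %.4f bits\n', IG);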

John Phan et al. discussed various parameter optimization techniques for the SVM classifier in order to improve classification [4]. The study focused on identifying genetic markers that can differentiate renal cell carcinoma (RCC) subtypes. The dataset consisted of 13 clear-cell RCC, 4 chromophobe and 3 oncocytoma samples, and the study aimed at distinguishing clear-cell RCC from the other two categories. Initially, SVM was used to rank the genes in order of their ability to distinguish the classes, and the prediction accuracy (prediction rate) was calculated using a leave-one-out strategy. The sequential minimal optimization technique was used for optimizing the SVM. It was suggested that the accuracy obtained by the SVM depends largely on the selected kernel and its parameters. The study focused on the linear and radial basis function (RBF) kernels. Both kernels have a parameter C that corresponds to the penalty for misclassification. The higher the value of C, the larger the penalty, leading the classification model toward over-fitting; conversely, a smaller value of C leads to a more generalized model that may not classify unknown data accurately. In the paper, C was varied from 0.01 to 10 for the linear kernel and from 0.01 to 100 for the RBF kernel, and the optimal value of C was determined while calculating the prediction rate. Based on the average prediction rates obtained by varying these parameters, they proposed optimal values of C of 0.1 for the linear kernel and 1 for the RBF kernel. The optimal value of sigma for the RBF kernel was proposed as the mean distance of the closest 2m neighbors to a data point, where m corresponds to the dimensionality of the data point; the results suggested that the value of sigma calculated in this way yielded the highest average prediction rate. It was stated that the genes selected using this process were found to have biological significance based on gene ontology and the literature [4].

CHAPTER III

MATERIALS AND METHODS

The goal of this study is to determine the confidence level in classifying an unknown sample based on microarray data using an SVM classifier. It also focuses on analyzing different SVM kernels and determining the kernel and parameters best suited for the classification of a given microarray dataset. The analysis is performed using a distance measure, obtained by calculating the distance of a sample to the separating hyper-plane of an SVM classifier.

3.1 Methodology

A flow chart representation of the process followed is shown in Figure 3.1. The rest of this section gives a detailed description of each step of the flow chart.

3.1.1 Dataset selection

Three different microarray datasets are used in the study: i) the leukemia cancer dataset by Golub et al., obtained from the Broad Institute [1]; ii) the heart disease dataset of the Animal Models of Cardiomyopathies project from Cardio Genomics [13]; iii) the breast cancer dataset obtained from NCBI [14].

Figure 3.1: Flow chart representation of the process (dataset selection → random selection of a subset as training data with the remainder as test data → data preprocessing and scaling → feature selection using t-test → training the SVM classifier → determining the hyper-plane equation and calculating the margin distance → classifying the test data and calculating the distances of test samples → obtaining and analyzing the results; the procedure is repeated for each random split)

The leukemia dataset was generated using high-density oligonucleotide microarrays produced by Affymetrix and consists of gene expression profiles belonging to two tumor

sample types (ALL and AML). The dataset as obtained was already divided into training and test data. The entire dataset was combined in order to generate 10 different training and test datasets randomly. The complete dataset consists of 48 ALL samples and 25 AML samples, with a total of 7129 genes for each sample. Each experiment sample has two columns associated with it in the microarray dataset: the first column gives the expression level of the gene in the microarray experiment, and the second column, labeled CALL, helps determine whether the expression value is due to the gene or due to noise. It takes three values, P, M and A, corresponding to presence, marginal presence and absence of a signal [1]. The format of the dataset is shown in Table 3.1.

Table 3.1: Format of Leukemia cancer dataset

Accession Number   Sample 1  CALL  Sample 2  CALL  Sample 3  CALL
A28102_at               151  A          484  A         118  P
AB000114_at              72  A           61  A          16  A
AB000115_at             281  A          118  A         197  M
AB000220_at              36  A           39  A          39  A
AB000381_s_at            29  A           38  A          50  A
AB000409_at            -299  A          -11  A         237  P
AB000410_s_at          -336  A         -116  A        -129  A
AB000449_at              57  A          274  P         311  P
AB000450_at             186  P          245  P         186  P
AB000460_at            1647  P         2128  P        1608  P
AB000462_at             137  A          -82  A         204  P
AB000464_at             803  P         1489  P         322  P
AB000466_at            -894  A         -969  A        -444  A
AB000467_at            -632  A         -909  A        -254  P

The heart disease dataset was generated using the Affymetrix HgU133 Plus 2.0 microarray chip. The dataset comprises 32 files corresponding to diseased samples and 14 files corresponding to normal samples; the files were combined to create a dataset containing 46 samples in total. In this dataset the genes are given as rows. Each sample has one column specifying the gene expression value and a second column specifying the P, A or M CALL value [13]. Table 3.2 shows the format of the heart disease dataset.

Table 3.2: Format of heart disease dataset

Probe_Set_Name   Sample 1  CALL  Sample 2  CALL  Sample 3  CALL
1007_s_at                  P               P               P
1053_at                    P               P               P
117_at                     P               P               P
121_at                     P         1283  P               P
1255_g_at           25.9   A         46.8  A         89.8  A
1294_at             441    P               P               P
1316_at                    P          157  P               P
1320_at             96.9   A               M         72    A
1405_i_at           92.9   P         62.3  A         74.2  A

The breast cancer dataset was generated using the Affymetrix Human Genome U133 Plus 2.0 GeneChip. The dataset has 27 normal samples and 31 breast cancer tumor samples, with the genes given for each sample. A normalized breast cancer dataset was obtained, where the normalization was performed using the quantile normalization technique [14]. Table 3.3 shows the format of the dataset.

Table 3.3: Format of breast cancer dataset

Probe Set ID   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
[rows of probe set IDs with one normalized expression value per sample]

3.1.2 Training and test dataset generation

For each of the 3 datasets, the complete dataset was used to generate training and test data randomly. To generate the test data, a few samples from each category were picked from the entire dataset; the remaining samples formed the training dataset. The samples were chosen randomly for the test data, and the process was repeated 10 times in order to obtain 10 different training and test datasets. This process ensured that the generated datasets differed from each other and also guaranteed the mutual exclusivity of training and test data. The mutual exclusivity eliminates any prior information about the test samples in the trained SVM classifier. The generation of the datasets was automated in MATLAB, and the MATLAB code is presented in Appendix A; a sketch of the split is shown below. The training and test data sample distribution for each of the 3 datasets is shown in Table 3.4.
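A minimal sketch of this stratified random split follows. The sample counts, variable names and synthetic data are illustrative stand-ins, not the thesis's Appendix A code.

    % Illustrative random training/test split, stratified by class.
    rng('shuffle');
    data   = randn(73, 7129);              % synthetic stand-in for 73 samples x 7129 genes
    labels = [ones(48,1); -ones(25,1)];    % e.g. 48 ALL (+1) and 25 AML (-1) samples
    nTestPerClass = 5;                     % illustrative number of test samples per class

    testIdx = [];
    for c = [1, -1]
        idx = find(labels == c);
        idx = idx(randperm(numel(idx)));              % shuffle within the class
        testIdx = [testIdx; idx(1:nTestPerClass)];    % pick a few samples per class
    end
    trainIdx = setdiff((1:numel(labels))', testIdx);  % mutually exclusive by construction

    trainData = data(trainIdx,:);  trainLabels = labels(trainIdx);
    testData  = data(testIdx,:);   testLabels  = labels(testIdx);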

Table 3.4: Sample distribution of training and test data for the 3 datasets

                Leukemia Dataset        Heart Disease Dataset      Breast Cancer Dataset
                ALL   AML   Total       Diseased  Normal  Total    Tumor  Normal  Total
Training Data    …     …     …             …        …       …        …      …      …
Testing Data     …     …     …             …        …       …        …      …      …

3.1.3 Data preprocessing and feature selection

The generated training and test datasets were subjected to data preprocessing. As the first step, the housekeeping genes present in the microarray datasets were removed. The total number of genes and the number of housekeeping genes in the 3 datasets are shown in Table 3.5.

Table 3.5: Number of genes in each dataset

Dataset                   Total number of genes   Number of housekeeping genes
Leukemia Cancer Dataset             …                         …
Heart Disease Dataset               …                         …
Breast Cancer Dataset               …                         …

After removing the housekeeping genes, the numbers of genes remaining in the leukemia cancer, heart disease and breast cancer datasets were 7071, … and …, respectively.

Based on the study conducted by Adarsh Jose et al. [16], in the leukemia and heart disease datasets the genes with more CALL values marked as absent (A) than present (P) were removed. As the breast cancer data had no information about the CALL value, all of its genes were retained for further processing. Threshold values of gene expression for the leukemia cancer and heart disease datasets were also set, with a maximum of … and a minimum of 100; as the breast cancer dataset was already normalized, no threshold values were set for it. The training dataset and test dataset were then scaled separately using the MATLAB implementation of data scaling; both training and test data were scaled to the range -1 to 1 [15]. The MATLAB program for preprocessing, setting the threshold values and scaling the data is presented in Appendix B.

Using all the available genes for classification often results in model over-fitting and computational overhead. Hence, feature selection was performed on the training dataset to obtain informative genes that can be used for classification. Student's t-test was used to identify differentially expressed genes in the two categories. The p-value obtained from Student's t-test gives the probability that the means of the two groups are similar [20]; hence, lower p-values indicate a bigger difference in the gene expression levels of the two categories. The MATLAB implementation of the two-tailed, unequal population variance t-test was used to obtain the p-values. The genes of all the samples in the dataset were sorted in increasing order of p-value, and the top 350 genes from the training set were retained for training the SVM classifiers. Exactly the same genes that were left in the training data after preprocessing and feature selection were selected from the test data in order to test the trained SVM classifiers. The MATLAB code for feature selection is shown in Appendix C.
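The ranking step can be sketched with MATLAB's ttest2 as below. The synthetic data and variable names are assumptions for illustration; the thesis's actual implementation is in Appendix C.

    % Rank genes by two-tailed, unequal-variance t-test p-values.
    rng(3);
    trainData   = randn(63, 7071);                 % synthetic scaled training data
    trainLabels = [ones(43,1); -ones(20,1)];       % synthetic class labels

    classA = trainData(trainLabels ==  1, :);
    classB = trainData(trainLabels == -1, :);
    [~, p] = ttest2(classA, classB, 'Vartype', 'unequal');  % one p-value per gene

    [~, order] = sort(p, 'ascend');                % smaller p = more differentially expressed
    topGenes = order(1:350);                       % retain the top 350 genes
    trainTop = trainData(:, topGenes);             % take the same columns from the test data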

3.1.4 Training the SVM

The preprocessed and scaled training data were used as input to the SVM classifier for training the model. The MATLAB implementation of the SVM classifier, the svmtrain method, was used for training the SVM model. The classifier is a soft-margin support vector machine. Two different kernels were used to train the SVM model: a linear kernel and a Gaussian radial basis function (RBF) kernel. The linear kernel function maps the samples in the training data onto a feature space and determines the optimal maximal-margin hyper-plane that divides the two categories of data. The function used for the linear kernel is

K(x_i, x_j) = x_i · x_j^T (3.1)

where · signifies the dot product between the two (row) vectors. In a Gaussian RBF kernel the data samples are transformed to a high-dimensional, possibly infinite-dimensional, space where the data belonging to the two categories can be differentiated using a linear hyper-plane. The kernel function used for RBF is

K(x_i, x_j) = exp(-γ ||x_i - x_j||^2), γ > 0 (3.2)

where γ is the kernel parameter. A default value of 1 was chosen for γ. As mentioned earlier, determining an optimal hyper-plane is an optimization problem; the quadratic programming optimization technique was used for determining the separating hyper-plane. A default scalar value of 1 was chosen as the BoxConstraintValue for the soft margin. The BoxConstraintValue specifies the value of C in Equation 1.14. The default scalar value was automatically rescaled to N/(2*N_1) for the N_1 data points belonging to group one and to N/(2*N_2) for the N_2 data points belonging to group two, where N is the total number of data points, N = N_1 + N_2. The rescaling accounts for unbalanced groups [21]. The SVM classifier was trained using the data and the known classes of the training data. The input given to the SVM classifier was varied by varying the number of features selected in the training dataset and also by varying the kernel function used by the SVM classifier. The number of features selected was varied from 5 to 350.

3.1.5 Determining the equation of hyper-plane

The output obtained from the trained SVM classifier consists of the SVM parameters: the support vectors, the non-negative Lagrange multipliers of the support vectors (α) and the bias (b). The bias parameter relates to the distance of the separating plane from the origin. The SVM parameters obtained are used to determine the equation of the separating hyper-plane. Theoretically, the equation of the hyper-plane is given by

w · Φ(x) + b = 0 (3.3)

where w is the normal vector to the hyper-plane, Φ(x) is the mapping function and b is the bias value. From Equation 1.11, the normal vector w is determined using the equation

w = Σ_{i=1}^{N} α_i y_i x_i (3.4)

where α_i is the alpha value obtained from the SVM method and y_i is the class value of the data sample x_i. As only the support vectors play a role in determining the hyper-plane, the x_i here are the support vectors; since the alpha values for all samples other than the support vectors are zero, the summation can be carried out over the support vectors alone.
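Sections 3.1.4 and 3.1.5 can be sketched as follows using the svmtrain function the thesis relied on (since removed from MATLAB in favor of fitcsvm). The data are synthetic, and the recovery of w applies to the linear kernel only; note that MATLAB parameterizes the RBF kernel width by rbf_sigma rather than by γ.

    % Train soft-margin SVMs with linear and RBF kernels (cf. Sections 3.1.4-3.1.5).
    rng(4);
    X = [randn(30,350) + 0.5; randn(30,350) - 0.5];   % synthetic scaled training data
    y = [ones(30,1); -ones(30,1)];                    % known classes

    linModel = svmtrain(X, y, 'kernel_function', 'linear', ...
                        'boxconstraint', 1, 'autoscale', false);
    rbfModel = svmtrain(X, y, 'kernel_function', 'rbf', 'rbf_sigma', 1, ...
                        'boxconstraint', 1, 'autoscale', false);

    % svmtrain returns Alpha already signed by class, so Equation 3.4 becomes a
    % single product over the support vectors (linear kernel only):
    w = linModel.Alpha' * linModel.SupportVectors;    % normal vector of the hyper-plane
    b = linModel.Bias;                                % bias term of w.x + b = 0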

3.1.6 Distance of a sample to hyper-plane

After determining the equation of the hyper-plane, w · x + b = 0, the distance of a data point p to the plane can be calculated using

|w · p + b| / ||w|| (3.5)

where w is the normal vector of the plane and ||w|| represents the 2-norm of w. After training the SVM classifier and determining the equation of the hyper-plane, the margin distance was calculated. Theoretically, the margin is defined as the region between the supporting hyper-planes that pass through the support vectors; these supporting hyper-planes are parallel to the decision boundary. As only support vectors are used to determine the separating plane, and ideally the supporting planes pass through the support vectors, determining the distances of the support vectors to the hyper-plane gives an estimate of the margin distance [17, 18, 19].

Figure 3.2 shows the SVM feature space with the support vectors and their distances to the decision boundary. The data points s_1, s_2, s_3 and s_4 shown in the figure are support vectors of the SVM classifier; support vectors belonging to different classes are distinguished using separate colors, red and blue. d_1, d_2, d_3 and d_4 represent the distances of support vectors s_1, s_2, s_3 and s_4 to the separating plane, respectively. Each distance is calculated using Equation 3.5, with p in the equation being the support vector s_i whose distance to the separating plane is to be determined. Generally, the margin distance m_d is determined by taking the arithmetic mean of the distances of the support vectors to the hyper-plane. However, there are instances where support vectors are misclassified, reducing the efficiency of the classifier. In such cases the margin distance is penalized to account for the decrease in accuracy: the distance of a misclassified support vector from the hyper-plane is subtracted from the total distance, rather than added, while calculating the margin distance.

Figure 3.2: SVM feature space showing the support vectors (s_i), margin distance (m_d) and the danger zone

The region parallel to the separating hyper-plane and within a distance of m_d corresponds to the margin area and is shown as the shaded region in the figure. This shaded region is referred to as the danger zone in the rest of the thesis.
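Continuing directly from the training sketch above, the margin distance m_d of this section can be computed as below. The alignment of svmtrain's Alpha sign convention with the sign of w · x + b is an assumption of this sketch; the thesis's own calculation is in Appendix D.

    % Distances of the support vectors (Equation 3.5) and the penalized mean m_d.
    sv    = linModel.SupportVectors;            % support vectors s_i
    svCls = sign(linModel.Alpha);               % class of each s_i (assumed sign convention)
    d     = (sv * w' + b) / norm(w);            % signed distance of each s_i to the plane

    % A support vector on the wrong side of the boundary is treated as
    % misclassified, and its distance is subtracted rather than added.
    ok  = sign(d) == svCls;
    m_d = (sum(abs(d(ok))) - sum(abs(d(~ok)))) / numel(d);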

The region is so named because samples lying in it have a greater chance of being misclassified, and the class value predicted for these samples is uncertain. In order to verify that the shaded region approximately corresponds to the region between the supporting hyper-planes shown in Figure 1.2, the distances of all the training samples to the hyper-plane were calculated. From the distances calculated it was observed that the support vectors are the only samples that fall in the danger zone.

3.1.7 Classifying test data

After training the classifier, the independently generated test dataset was classified in order to test the accuracy of the classifier. The number of genes used for classification was varied from the top 5 to the top 350 genes obtained from the t-test, and the same genes that were used for training the SVM were selected from the test data for classification. The MATLAB implementation of the svmclassify method was used for classification, and the distance of the test samples to the separating plane was determined. The MATLAB code for training the SVM, testing the classifier and calculating the distance of test samples to the hyper-plane is given in Appendix D.

3.1.8 Obtaining and analyzing the results

The distances of the misclassified test samples were compared with the margin value to determine whether they lie in the danger zone. The entire process of training the classifier, determining the margin distance and testing the classifier with an independent test dataset was repeated for each of the 10 randomly generated training and test datasets.
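A sketch of that danger-zone check follows, continuing from the sketches above; svmclassify pairs with svmtrain, and the synthetic test data stand in for the thesis's preprocessed test sets.

    % Classify test samples and flag those falling inside the danger zone.
    Xtest = randn(10, 350);                     % synthetic stand-in for scaled test data
    predicted = svmclassify(linModel, Xtest);   % predicted class labels
    dTest = (Xtest * w' + b) / norm(w);         % Equation 3.5 applied to the test samples
    inDanger = abs(dTest) < m_d;                % predictions in this band are uncertain
    fprintf('%d of %d test samples lie in the danger zone\n', sum(inDanger), numel(dTest));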


More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Automated Microarray Classification Based on P-SVM Gene Selection

Automated Microarray Classification Based on P-SVM Gene Selection Automated Microarray Classification Based on P-SVM Gene Selection Johannes Mohr 1,2,, Sambu Seo 1, and Klaus Obermayer 1 1 Berlin Institute of Technology Department of Electrical Engineering and Computer

More information

Data Mining in Bioinformatics Day 1: Classification

Data Mining in Bioinformatics Day 1: Classification Data Mining in Bioinformatics Day 1: Classification Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines & Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

CLUSTERING IN BIOINFORMATICS

CLUSTERING IN BIOINFORMATICS CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of

More information

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment A System for Managing Experiments in Data Mining A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Greeshma

More information

Lecture 7: Support Vector Machine

Lecture 7: Support Vector Machine Lecture 7: Support Vector Machine Hien Van Nguyen University of Houston 9/28/2017 Separating hyperplane Red and green dots can be separated by a separating hyperplane Two classes are separable, i.e., each

More information

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 Overview The goals of analyzing cross-sectional data Standard methods used

More information

Classification by Nearest Shrunken Centroids and Support Vector Machines

Classification by Nearest Shrunken Centroids and Support Vector Machines Classification by Nearest Shrunken Centroids and Support Vector Machines Florian Markowetz florian.markowetz@molgen.mpg.de Max Planck Institute for Molecular Genetics, Computational Diagnostics Group,

More information

Supervised classification exercice

Supervised classification exercice Universitat Politècnica de Catalunya Master in Artificial Intelligence Computational Intelligence Supervised classification exercice Authors: Miquel Perelló Nieto Marc Albert Garcia Gonzalo Date: December

More information

Lecture Linear Support Vector Machines

Lecture Linear Support Vector Machines Lecture 8 In this lecture we return to the task of classification. As seen earlier, examples include spam filters, letter recognition, or text classification. In this lecture we introduce a popular method

More information

Classification by Support Vector Machines

Classification by Support Vector Machines Classification by Support Vector Machines Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Practical DNA Microarray Analysis 2003 1 Overview I II III

More information

5 Learning hypothesis classes (16 points)

5 Learning hypothesis classes (16 points) 5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated

More information

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Kernels + K-Means Matt Gormley Lecture 29 April 25, 2018 1 Reminders Homework 8:

More information

.. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar..

.. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar.. .. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar.. Machine Learning: Support Vector Machines: Linear Kernel Support Vector Machines Extending Perceptron Classifiers. There are two ways to

More information

Linear methods for supervised learning

Linear methods for supervised learning Linear methods for supervised learning LDA Logistic regression Naïve Bayes PLA Maximum margin hyperplanes Soft-margin hyperplanes Least squares resgression Ridge regression Nonlinear feature maps Sometimes

More information

A Short SVM (Support Vector Machine) Tutorial

A Short SVM (Support Vector Machine) Tutorial A Short SVM (Support Vector Machine) Tutorial j.p.lewis CGIT Lab / IMSC U. Southern California version 0.zz dec 004 This tutorial assumes you are familiar with linear algebra and equality-constrained optimization/lagrange

More information

Support Vector Machines

Support Vector Machines Support Vector Machines SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions 6. Dealing

More information

Gene Expression Based Classification using Iterative Transductive Support Vector Machine

Gene Expression Based Classification using Iterative Transductive Support Vector Machine Gene Expression Based Classification using Iterative Transductive Support Vector Machine Hossein Tajari and Hamid Beigy Abstract Support Vector Machine (SVM) is a powerful and flexible learning machine.

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

12 Classification using Support Vector Machines

12 Classification using Support Vector Machines 160 Bioinformatics I, WS 14/15, D. Huson, January 28, 2015 12 Classification using Support Vector Machines This lecture is based on the following sources, which are all recommended reading: F. Markowetz.

More information

DM6 Support Vector Machines

DM6 Support Vector Machines DM6 Support Vector Machines Outline Large margin linear classifier Linear separable Nonlinear separable Creating nonlinear classifiers: kernel trick Discussion on SVM Conclusion SVM: LARGE MARGIN LINEAR

More information

Committee: Dr. Rosemary Renaut 1 Professor Department of Mathematics and Statistics, Director Computational Biosciences PSM Arizona State University

Committee: Dr. Rosemary Renaut 1 Professor Department of Mathematics and Statistics, Director Computational Biosciences PSM Arizona State University Evaluation of Gene Selection Using Support Vector Machine Recursive Feature Elimination A report presented in fulfillment of internship requirements of the CBS PSM Degree Committee: Dr. Rosemary Renaut

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

Title: Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data

Title: Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data Supplementary material for Manuscript BIOINF-2005-1602 Title: Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data Appendix A. Testing K-Nearest Neighbor and Support

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview 1. Overview of SVMs 2. Margin Geometry 3. SVM Optimization 4. Overlapping Distributions 5. Relationship to Logistic Regression 6. Dealing

More information

Topic 4: Support Vector Machines

Topic 4: Support Vector Machines CS 4850/6850: Introduction to achine Learning Fall 2018 Topic 4: Support Vector achines Instructor: Daniel L Pimentel-Alarcón c Copyright 2018 41 Introduction Support vector machines (SVs) are considered

More information

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering IJIRST International Journal for Innovative Research in Science & Technology Volume 3 Issue 07 December 2016 ISSN (online): 2349-6010 Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 15-1: Support Vector Machines Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

More information

Version Space Support Vector Machines: An Extended Paper

Version Space Support Vector Machines: An Extended Paper Version Space Support Vector Machines: An Extended Paper E.N. Smirnov, I.G. Sprinkhuizen-Kuyper, G.I. Nalbantov 2, and S. Vanderlooy Abstract. We argue to use version spaces as an approach to reliable

More information

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs) Data Mining: Concepts and Techniques Chapter 9 Classification: Support Vector Machines 1 Support Vector Machines (SVMs) SVMs are a set of related supervised learning methods used for classification Based

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Feature Selection for SVMs

Feature Selection for SVMs Feature Selection for SVMs J. Weston, S. Mukherjee, O. Chapelle, M. Pontil T. Poggio, V. Vapnik Barnhill BioInformatics.com, Savannah, Georgia, USA. CBCL MIT, Cambridge, Massachusetts, USA. AT&T Research

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Gene selection through Switched Neural Networks

Gene selection through Switched Neural Networks Gene selection through Switched Neural Networks Marco Muselli Istituto di Elettronica e di Ingegneria dell Informazione e delle Telecomunicazioni Consiglio Nazionale delle Ricerche Email: Marco.Muselli@ieiit.cnr.it

More information

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Michael Tagare De Guzman May 19, 2012 Support Vector Machines Linear Learning Machines and The Maximal Margin Classifier In Supervised Learning, a learning machine is given a training

More information

Robot Learning. There are generally three types of robot learning: Learning from data. Learning by demonstration. Reinforcement learning

Robot Learning. There are generally three types of robot learning: Learning from data. Learning by demonstration. Reinforcement learning Robot Learning 1 General Pipeline 1. Data acquisition (e.g., from 3D sensors) 2. Feature extraction and representation construction 3. Robot learning: e.g., classification (recognition) or clustering (knowledge

More information

Gene expression & Clustering (Chapter 10)

Gene expression & Clustering (Chapter 10) Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Chapter 7 UNSUPERVISED LEARNING TECHNIQUES FOR MAMMOGRAM CLASSIFICATION

Chapter 7 UNSUPERVISED LEARNING TECHNIQUES FOR MAMMOGRAM CLASSIFICATION UNSUPERVISED LEARNING TECHNIQUES FOR MAMMOGRAM CLASSIFICATION Supervised and unsupervised learning are the two prominent machine learning algorithms used in pattern recognition and classification. In this

More information

Course on Microarray Gene Expression Analysis

Course on Microarray Gene Expression Analysis Course on Microarray Gene Expression Analysis ::: Normalization methods and data preprocessing Madrid, April 27th, 2011. Gonzalo Gómez ggomez@cnio.es Bioinformatics Unit CNIO ::: Introduction. The probe-level

More information

Data-analysis problems of interest

Data-analysis problems of interest Introduction 3 Data-analysis prolems of interest. Build computational classification models (or classifiers ) that assign patients/samples into two or more classes. - Classifiers can e used for diagnosis,

More information

Module 4. Non-linear machine learning econometrics: Support Vector Machine

Module 4. Non-linear machine learning econometrics: Support Vector Machine Module 4. Non-linear machine learning econometrics: Support Vector Machine THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Introduction When the assumption of linearity

More information

Support Vector Machines

Support Vector Machines Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose

More information

Lecture 10: SVM Lecture Overview Support Vector Machines The binary classification problem

Lecture 10: SVM Lecture Overview Support Vector Machines The binary classification problem Computational Learning Theory Fall Semester, 2012/13 Lecture 10: SVM Lecturer: Yishay Mansour Scribe: Gitit Kehat, Yogev Vaknin and Ezra Levin 1 10.1 Lecture Overview In this lecture we present in detail

More information

More on Classification: Support Vector Machine

More on Classification: Support Vector Machine More on Classification: Support Vector Machine The Support Vector Machine (SVM) is a classification method approach developed in the computer science field in the 1990s. It has shown good performance in

More information

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes. Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or

More information

Leave-One-Out Support Vector Machines

Leave-One-Out Support Vector Machines Leave-One-Out Support Vector Machines Jason Weston Department of Computer Science Royal Holloway, University of London, Egham Hill, Egham, Surrey, TW20 OEX, UK. Abstract We present a new learning algorithm

More information

Bagging and Boosting Algorithms for Support Vector Machine Classifiers

Bagging and Boosting Algorithms for Support Vector Machine Classifiers Bagging and Boosting Algorithms for Support Vector Machine Classifiers Noritaka SHIGEI and Hiromi MIYAJIMA Dept. of Electrical and Electronics Engineering, Kagoshima University 1-21-40, Korimoto, Kagoshima

More information

Machine Learning: Think Big and Parallel

Machine Learning: Think Big and Parallel Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least

More information

6.034 Notes: Section 8.1

6.034 Notes: Section 8.1 6.034 Notes: Section 8.1 Slide 8.1.1 There is no easy way to characterize which particular separator the perceptron algorithm will end up with. In general, there can be many separators for a data set.

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

CSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18

CSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18 CSE 417T: Introduction to Machine Learning Lecture 22: The Kernel Trick Henry Chai 11/15/18 Linearly Inseparable Data What can we do if the data is not linearly separable? Accept some non-zero in-sample

More information

Basis Functions. Volker Tresp Summer 2017

Basis Functions. Volker Tresp Summer 2017 Basis Functions Volker Tresp Summer 2017 1 Nonlinear Mappings and Nonlinear Classifiers Regression: Linearity is often a good assumption when many inputs influence the output Some natural laws are (approximately)

More information

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION Sandeep Kaur 1, Dr. Sheetal Kalra 2 1,2 Computer Science Department, Guru Nanak Dev University RC, Jalandhar(India) ABSTRACT

More information

Instantaneously trained neural networks with complex inputs

Instantaneously trained neural networks with complex inputs Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2003 Instantaneously trained neural networks with complex inputs Pritam Rajagopal Louisiana State University and Agricultural

More information

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information