Computers in Biology and Medicine 39 (2009)

Feature extraction and dimensionality reduction for mass spectrometry data

Yihui Liu
School of Computer Science and Information Technology, Shandong Institute of Light Industry, Jinan, Shandong 25353, China
E-mail address: Yihui_liu_25@yahoo.co.uk

Article history: Received 6 March 2007; accepted 29 June 2009

Keywords: Mass spectrometry data; Feature extraction; Wavelet analysis; Support vector machine

Abstract: Mass spectrometry is being used to generate protein profiles from human serum, and proteomic data obtained from mass spectrometry have attracted great interest for the detection of early stage cancer. However, high dimensional mass spectrometry data cause considerable challenges. In this paper we propose a feature extraction algorithm based on wavelet analysis for high dimensional mass spectrometry data. A set of wavelet detail coefficients at different scales is used to detect the transient changes of mass spectrometry data. The experiments are performed on 2 datasets. A highly competitive accuracy, compared with the best performance of other kinds of classification models, is achieved. Experimental results show that wavelet detail coefficients are an efficient way to characterize the features of high dimensional mass spectra and to reduce their dimensionality. © 2009 Elsevier Ltd. All rights reserved.

1. Background

Mass spectrometry is being used to generate protein profiles from human serum, and proteomic data obtained from mass spectrometry have attracted great interest for the detection of early stage cancer. Surface enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS), in combination with advanced data mining algorithms, is used to detect protein patterns associated with diseases [1-5]. As a kind of MS-based protein chip technology, SELDI-TOF-MS has been successfully used to detect several disease-associated proteins in complex biological specimens such as serum [6-8].

Lilien et al. [9] perform principal component analysis (PCA) for dimensionality reduction and linear discriminant analysis (LDA) with a nearest centroid classifier [10] for the classification of mass spectra. Wu et al. [11] compare two feature extraction algorithms with several classification approaches on MALDI-TOF acquired data. The t-test is used to rank features. Support vector machines (SVMs), random forests, linear/quadratic discriminant analysis (LDA/QDA), k nearest neighbors, and bagged/boosted decision trees are used to classify the data. In the paper of Jeffries [12], both a genetic algorithm (GA) approach and a nearest shrunken centroid (NSC) approach are found inferior to a boosting based feature selection method. Levner [13] examines the performance of the nearest centroid classifier using the following feature selection algorithms. For filter-based feature ranking methods, the univariate statistics of the student-t test, the Kolmogorov-Smirnov test, and the P-test are used; for the wrapper methods, sequential forward selection (SFS) and a modified version of sequential backward selection (SBS) are tested; for embedded approaches, the shrunken nearest centroid and a novel version of boosting based feature selection are investigated. Several dimensionality reduction approaches are also tested, such as the PCA and LDA methods.
For a transform space, a new basis is normally created for the data. The selection of the new basis determines the properties that will be held by the transformed data. Principal component analysis is used to extract the main components from mass spectra; linear discriminant analysis is used to extract discriminant information from mass spectra. But these methods lose the time property and do not detect the localized features of mass spectra. For the wavelet transform, a set of wavelet basis functions aims to detect the localized features contained in mass spectra. The difference between cancer tissue and normal tissue can be measured using the wavelet basis, based on the compactness and finite energy characteristics of the wavelet function.

Yu et al. [14] developed a four-step strategy for dimensionality reduction and tested it on a published ovarian high-resolution SELDI-TOF dataset. The four steps are: (1) binning, (2) the Kolmogorov-Smirnov test, (3) restriction of the coefficient of variation and (4) wavelet analysis. They indicated that "For the high-resolution ovarian data, the vector of detail coefficients contains almost no information for the healthy, since SVMs identify all the data as cancers." In their proposed method the detail coefficients do not work on high-resolution mass spectrometry data. They use the approximation coefficients of the wavelet decomposition at the first level. They also indicated that "Theoretically, a heavier compression rate can be achieved, at the risk of losing some useful information, by choosing a higher level of approximation coefficients." They only used first level wavelet approximation coefficients in their wavelet analysis.
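Step (2) of that four-step strategy is a univariate screen of individual m/z features. As a minimal sketch of how such a Kolmogorov-Smirnov filter can be written (the function name, matrix shapes and threshold below are illustrative assumptions, not code from [14]):

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_screen(X_cancer, X_control, alpha=0.01):
    """Rank m/z features by a two-sample Kolmogorov-Smirnov test.

    X_cancer, X_control: (n_samples, n_features) intensity matrices.
    Returns indices of features whose KS p-value falls below alpha.
    """
    n_features = X_cancer.shape[1]
    pvals = np.empty(n_features)
    for j in range(n_features):
        # The KS statistic compares the empirical intensity distributions
        # of the two classes at one m/z position.
        pvals[j] = ks_2samp(X_cancer[:, j], X_control[:, j]).pvalue
    return np.where(pvals < alpha)[0]

# Toy usage with random data standing in for binned spectra.
rng = np.random.default_rng(0)
X_cancer = rng.normal(1.0, 1.0, size=(40, 200))
X_control = rng.normal(0.0, 1.0, size=(40, 200))
print(len(ks_screen(X_cancer, X_control)), "features retained")
```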

However, from another view, it is the wavelet detail coefficients that characterize the localized features hidden in mass spectra, and the approximation coefficients only compress the mass spectra. Higher level wavelet decomposition makes the features of mass spectra more significant and clear. In this study we develop a feature extraction method based on wavelet detail coefficients. Multi-level wavelet analysis is performed on the mass spectrometry data. Vectors of detail coefficients in the wavelet subspace are extracted to detect the localized or transient changes of mass spectra based on the time property of wavelets, and the difference between cancer tissue and normal tissue can be measured using a set of orthogonal wavelet basis functions. Finally the wavelet features of mass spectra are input into the SVM classifier to distinguish the diagnostic classes.

2. Methods

Fig. 2. Multilevel wavelet decomposition tree for mass spectra. Symbol s represents the mass spectra; a1, ..., a4 represent the wavelet approximations from the first level to the fourth level; d1, ..., d4 represent the wavelet details from the first level to the fourth level.

In this research we develop a new application of a wavelet feature extraction method for mass spectrometry data. The wavelet high frequency part (detail coefficients) is extracted to characterize the features of mass spectrometry data. The extracted features are used to build the SVM classification model. Fig. 1 shows the general framework of the proposed method.

Fig. 1. The framework of the proposed method: mass spectrometry data -> extract wavelet features -> build SVM classifier -> classifier model.

2.1. Wavelet feature extraction

For one dimensional wavelet analysis [15,16], a signal can be represented as a sum of wavelets at different time shifts and scales (frequencies) using the discrete wavelet transform (DWT). The DWT is capable of extracting the features of transient signals by separating signal components in both time and frequency. According to the DWT, a time-varying function (signal) $f(t) \in L^2(R)$ can be expressed in terms of $\phi(t)$ and $\psi(t)$ as follows:

$$f(t) = \sum_{k} c_{0}(k)\,\phi(t-k) + \sum_{k}\sum_{j=1}^{\infty} d_{j}(k)\,2^{j/2}\,\psi(2^{j}t-k) = \sum_{k} c_{j_0}(k)\,2^{j_0/2}\,\phi(2^{j_0}t-k) + \sum_{k}\sum_{j=j_0}^{\infty} d_{j}(k)\,2^{j/2}\,\psi(2^{j}t-k)$$

where $\phi(t)$, $\psi(t)$, $c_{0}$, and $d_{j}$ represent the scaling function, the wavelet function, the scaling coefficients at scale 0, and the wavelet detail coefficients at scale $j$, respectively. The variable $k$ is the translation coefficient for the localization of a signal in time, the scales $2^{j}$ denote the different (high to low) frequency bands, and $j_0$ is the selected scale number.

Fig. 2 shows the wavelet decomposition tree at 4 levels. Fig. 3 shows the original mass spectra, the wavelet approximations and the wavelet details at 4 levels. When the decomposition level is increased, the localized or transient features, which are detected based on the detail coefficients, change from fine to coarse, or from small to large. In our study the purpose of wavelet analysis is to detect the localized features hidden in mass spectra, in order to measure the difference between cancer tissue and normal tissue. Multi-level wavelet analysis makes it possible to detect the transient changes in one of the mass spectra's derivatives. Wavelet detail coefficients at 4 levels reflect the localized features in the first, second, third, and fourth derivative.
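A minimal sketch of this multilevel decomposition using the PyWavelets package (an assumed tool choice; the db7 wavelet and symmetric padding follow the settings stated in Section 3, and the synthetic spectrum merely stands in for real data):

```python
import numpy as np
import pywt  # PyWavelets

# A synthetic stand-in for one mass spectrum (15,000 intensity values).
rng = np.random.default_rng(0)
spectrum = rng.normal(size=15000).cumsum()

# Four-level DWT with the db7 wavelet and symmetric boundary padding.
coeffs = pywt.wavedec(spectrum, "db7", mode="symmetric", level=4)
a4, d4, d3, d2, d1 = coeffs  # approximation a4, details d4..d1

# The detail vectors at levels 2-4 are the candidate feature vectors;
# each level roughly halves the length of the previous one.
for name, d in [("d2", d2), ("d3", d3), ("d4", d4)]:
    print(name, len(d))
```

For a 15,000-point spectrum this yields detail vectors of lengths 3759, 1886 and 949 at levels 2-4, which matches the feature dimensionalities reported for the resampled ovarian data in Section 3.1.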
Wavelets tend to be irregular and asymmetric, and wavelet analysis is capable of revealing aspects of data that other analysis techniques miss, aspects such as trends (approximation coefficients) and discontinuities in higher derivatives (detail coefficients). The detail coefficients represent how closely correlated the wavelet is with a localized section of the mass spectra: the higher the coefficients, the greater the similarity.

Fig. 3. Mass spectra, wavelet approximations and wavelet details at 4 levels.

The presence of noise is a fairly common situation in mass spectra processing, and it makes the identification of transient changes more complicated. If the first levels of the decomposition can be used to eliminate a large part of the noise, the successive details characterize more significant features hidden in the mass spectra. In our study the detail coefficients at the second, third and fourth level are used respectively to characterize the features of mass spectra, removing noise and reducing the dimensionality. The detail coefficients determine the position of the change (time), the type of change (a localized feature in a particular derivative), and the amplitude of the change.
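As a rough illustration of reading the position and amplitude of a transient change off the detail coefficients (the index-to-position mapping below ignores filter delay and is an approximation for illustration, not a formula from the paper):

```python
import numpy as np
import pywt

rng = np.random.default_rng(1)
spectrum = rng.normal(scale=0.1, size=15000)
spectrum[9000:9040] += 5.0  # an artificial localized peak

coeffs = pywt.wavedec(spectrum, "db7", mode="symmetric", level=4)
d3 = coeffs[2]  # detail coefficients at level 3

# The largest-magnitude level-3 coefficient marks the transient change;
# index k at level j corresponds roughly to position k * 2**j in the
# original vector.
k = int(np.argmax(np.abs(d3)))
print("position ~", k * 2**3, "amplitude", d3[k])
```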

The prostate cancer dataset [21] has 322 samples, including 69 cancer samples and 253 normal samples, and 15154 dimensions for each sample vector. We perform one dimensional wavelet analysis on each sample vector to obtain the detail coefficients. Each sample vector is computed independently and is characterized by a set of orthogonal wavelet basis functions. We obtain 3798, 1905, and 959 dimensional feature vectors at the second, third and fourth level respectively. However, other methods of feature extraction, such as PCA, LDA, etc., calculate the new transform feature space based on the training dataset; once the training samples change, the new transform feature space needs to be calculated again based on the new training dataset, and the feature vector of each sample also needs to be computed again based on the changed feature space. This adds to the computation load. Because mass spectra hold a high dimensionality, the transforms of PCA, LDA, etc. require large matrix computations and a large computation load. Compared with these methods of feature extraction, the wavelet feature extraction method does not rely on the training dataset: it is a set of orthogonal wavelet basis functions that represents the features of the sample vectors. The vector of mass spectra is convolved with the high-pass wavelet filter and the convolved coefficients are downsampled by keeping the even indexed elements to form the wavelet feature vector. This needs only a small computation load.

Fig. 4. The process of the k fold cross validation experiments based on wavelet decomposition at the ith level. 1D DWT represents the one dimensional discrete wavelet transform. Y, F, Ftr, and Fte represent the original vector of mass spectra (1 x d_ori), the wavelet feature vectors (detail coefficients, N x w_i), the training vectors (Ntr x w_i) and the test vectors (Nte x w_i) respectively. N, Ntr, Nte, d_ori, and w_i represent the sample number of mass spectra, the training and test vector numbers of the k fold cross validation, the dimension number of the original mass spectra, and the dimension number of the wavelet feature vectors. w_i is 3798, 1905, and 959 dimensions based on wavelet decomposition at the second, third, and fourth level respectively for the prostate cancer dataset of 15154 dimensions.

2.2. SVM classifier

The SVM originated from the idea of structural risk minimization developed by Vapnik [17]. The SVM is an effective algorithm to find the maximal margin hyperplane separating two classes of patterns. A transform that nonlinearly maps the data into a higher-dimensional space allows a linear separation of classes which could not be linearly separated in the original space. The objects that are located on the two marginal hyperplanes are the so-called support vectors. The maximal margin hyperplane, which is uniquely defined by the support vectors, gives the best separation between the classes. The support vectors can be regarded as the selected representatives of the training wavelet features, and they are the most critical for the separation of the two classes. As usually only a few support vectors are used, only some parameters are adjustable by the algorithm and thus overfitting is unlikely to occur. The radial basis function (RBF) kernel $K(x_i, x_j) = e^{-\|x_i - x_j\|^2/r_1}$ is used, where $r_1$ is a strictly positive constant and is set to 1. Apparently the linear kernel is less complex than the polynomial and RBF kernels. The RBF kernel usually has a better boundary response as it allows for extrapolation, and most high dimensional data can be approximated by Gaussian-like distributions similar to those used by RBF networks [18].
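This kernel choice translates directly into code. A minimal sketch assuming scikit-learn's SVC (the paper does not name an implementation); note that scikit-learn parameterizes the RBF kernel as exp(-gamma * ||xi - xj||^2), so r1 = 1 corresponds to gamma = 1:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical wavelet feature matrix (N samples x w_i coefficients)
# and binary labels (1 = cancer, 0 = normal); the values are synthetic.
rng = np.random.default_rng(2)
F = rng.normal(size=(100, 1905))
y = (rng.random(100) > 0.5).astype(int)

# RBF kernel K(xi, xj) = exp(-||xi - xj||^2 / r1), i.e. gamma = 1 / r1.
clf = SVC(kernel="rbf", gamma=1.0)
clf.fit(F, y)
print("support vectors per class:", clf.n_support_)
```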
3. Experiments and Results

In this study we use the classification accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) to evaluate the performance of the proposed method. Let TP, TN, FP, and FN be the numbers of true positive (cancer), true negative (normal), false positive and false negative samples. Sensitivity is defined as TP/(TP + FN); specificity is defined as TN/(TN + FP); positive predictive value is defined as TP/(TP + FP); negative predictive value is defined as TN/(TN + FN); accuracy is defined as (TP + TN)/(TP + TN + FP + FN). The balanced correct rate (BACC) is defined as 1/2 (TP/(TP + FN) + TN/(TN + FP)), which is the average of sensitivity and specificity.

The Daubechies wavelet db7 [19], which has seven non-zero coefficients of the compactly supported orthogonal wavelet basis, is used for the wavelet analysis of the mass spectrometry data, and the boundary values are symmetrically padded. The multilevel discrete wavelet transform is performed on the mass spectra to extract the features. K fold cross validation experiments are performed to evaluate our proposed method. K fold cross validation randomly generates indices, containing equal (or approximately equal) proportions of the integers 1 through K, that define a partition of the N observations into K disjoint subsets. In K fold cross validation, K - 1 folds are used for training and the last fold is used for evaluation. This process is repeated K times, leaving a different fold for evaluation each time. In our study we use two- and threefold cross validation experiments to evaluate our proposed method. We run each K fold cross validation experiment 20 times. Fig. 4 shows the process of the k fold cross validation experiments based on the wavelet detail coefficients at the ith level.
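The evaluation protocol can be sketched as follows. Stratified splits and the helper names bacc and repeated_kfold_bacc are assumptions for illustration (the paper specifies random k fold partitions and repeated runs, not stratification):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def bacc(y_true, y_pred):
    # Balanced correct rate: the mean of sensitivity and specificity.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def repeated_kfold_bacc(F, y, k=3, runs=20):
    """Repeat k-fold cross validation `runs` times and average BACC."""
    scores = []
    for run in range(runs):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=run)
        for train_idx, test_idx in cv.split(F, y):
            clf = SVC(kernel="rbf", gamma=1.0).fit(F[train_idx], y[train_idx])
            scores.append(bacc(y[test_idx], clf.predict(F[test_idx])))
    return float(np.mean(scores))

# Toy usage (runs reduced for speed).
rng = np.random.default_rng(5)
F = rng.normal(size=(60, 100))
y = np.array([0, 1] * 30)
print(round(repeated_kfold_bacc(F, y, k=3, runs=2), 3))
```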
3.1. High resolution ovarian dataset

The raw ovarian high-resolution SELDI-TOF dataset is composed of 95 control samples and 121 cancer samples; it is provided by the National Cancer Institute (http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp). Resampling of mass spectrometry data homogenizes the mass/charge (M/Z) vector in order to compare different spectra under the same reference and at the same resolution. High resolution spectra contain redundant information; after resampling, the signal is decimated into a more manageable M/Z vector, preserving the information content of the spectra. Resampling selects a new M/Z vector and also applies an antialias filter that prevents high frequency noise from folding into the lower frequencies [20]. We resample the mass spectrometry data to 15,000 M/Z points, as sketched below.
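The paper does not name a resampling tool, so the sketch below uses scipy's Fourier-domain resample as an assumed stand-in (its band-limiting plays the role of the antialias filter described above); the M/Z range and raw spectrum length are illustrative:

```python
import numpy as np
from scipy.signal import resample

def resample_spectrum(mz, intensity, n_points=15000):
    """Decimate a raw spectrum onto a uniform M/Z grid of n_points.

    Fourier-domain resampling is inherently band-limiting, which
    prevents high frequency noise from aliasing into low frequencies.
    """
    new_intensity, new_mz = resample(intensity, n_points, t=mz)
    return new_mz, new_intensity

# Toy usage: a raw high-resolution spectrum on a uniform M/Z grid.
rng = np.random.default_rng(3)
mz = np.linspace(700.0, 12000.0, 370000)  # illustrative range/length
intensity = np.abs(rng.normal(size=mz.size))
new_mz, new_intensity = resample_spectrum(mz, intensity)
print(new_intensity.shape)  # (15000,)
```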
After wavelet decomposition of the mass spectra, we have 3759, 1886, and 949 dimensions for the wavelet detail coefficients at the second, third and fourth level respectively, which reduces the dimensionality of the mass spectra. Fig. 5 shows the detail coefficients of the wavelet decomposition at the second, third and fourth level for the cancer and control samples.

Fig. 5. Wavelet features of high resolution ovarian mass spectra: detail coefficients of the wavelet decomposition at the second, third and fourth level.

Table 1 shows the performance of the two- and threefold cross validation experiments using the detail coefficients at the second, third and fourth level respectively. BACC values of 95.3% and 95.71% are obtained in the two- and threefold cross validation experiments using the 3759 wavelet features at the second level; the 1886 wavelet features at the third level achieve 97.86% and 98.21% BACC for the two- and threefold cross validation experiments; and the 949 wavelet features at the fourth level give 97.6% and 97.8% BACC in the two- and threefold cross validation experiments.

Table 1. Performance of the high resolution ovarian dataset, by decomposition level and K fold: correct rate, sensitivity, specificity, PPV, NPV and BACC. PPV stands for positive predictive value; NPV stands for negative predictive value; BACC stands for balanced correct rate.

Yu et al. [14] indicated that "For the high-resolution ovarian data, the vector of detail coefficients contains almost no information for the healthy, since SVMs identify all the data as cancers." Their experimental results based on the wavelet approximation coefficients at the first level are shown in Table 2. They achieved 95.34% BACC and 95.88% BACC for the twofold and threefold cross validation experiments. In our proposed method, the wavelet detail coefficients perform very well: the detail coefficients at both the third and the fourth level outperform their four-step strategy.

Table 2. Performance using wavelet approximation coefficients [14] on the high resolution ovarian dataset, by K fold: BACC and standard deviation (SD).

Our results also outperform other methods, as shown in Table 3. The voted perceptron (VP) has 94.99% BACC; quadratic discriminant analysis (QDA) has 93.15% BACC; linear discriminant analysis (LDA) and Mahalanobis discriminant analysis (MDA) obtain 93.23% and 92.73% BACC respectively. We can see that our proposed method is better than these other methods, such as QDA, LDA and MDA, etc.

Table 3. Performance of different methods [14] on the high resolution ovarian dataset (twofold cross validation): mean sensitivity, mean specificity and BACC for VP, QDA, LDA, MDA, NB, Bagging, k-NN, ADtree and J48tree. VP stands for voted perceptron; QDA stands for quadratic discriminant analysis; LDA stands for linear discriminant analysis; MDA stands for Mahalanobis discriminant analysis; k-NN stands for k-nearest neighbor; NB stands for Naïve Bayes; Bagging stands for bootstrap aggregating; ADtree stands for alternating decision trees; J48tree is a version of C4.5 in the Weka classifier package.

3.2. Prostate cancer dataset

This dataset was collected using the H4 protein chip (JNCI dataset 7-3-02) [21]. There are 322 samples, including 190 samples of benign prostate hyperplasia with PSA levels greater than 4, 63 samples with no evidence of disease and PSA levels below 1, 26 samples of prostate cancer with PSA levels 4 through 10, and 43 samples of prostate cancer with PSA levels greater than 10. Each sample is composed of 15154 features. We combine the benign prostate hyperplasia samples and those with no evidence of disease to form the normal class; the rest of the samples are placed in the cancer category. We thus have 69 cancer samples and 253 normal samples, the same as in the paper [13]; this grouping is written out in the sketch following Table 5.

The original prostate mass spectra have 15154 features. After wavelet decomposition of the mass spectra, 3798, 1905 and 959 dimensions are obtained for the wavelet detail coefficients at the second, third and fourth level respectively. Fig. 6 shows the detail coefficients of the wavelet decomposition at the second, third and fourth level for the cancer and normal samples.

Fig. 6. Wavelet features of prostate mass spectra: detail coefficients of the wavelet decomposition at the second, third and fourth level.

Table 4 shows the performance of the two- and threefold cross validation experiments using the detail coefficients at the second, third and fourth level respectively. BACC values of 83.9%, 86.18% and 82.36% are obtained in the threefold cross validation experiments based on the detail coefficients at the second, third, and fourth level respectively.

Table 4. Performance of the prostate cancer dataset, by decomposition level and K fold: correct rate, sensitivity, specificity, PPV, NPV and BACC. PPV stands for positive predictive value; NPV stands for negative predictive value; BACC stands for balanced correct rate.

Levner [13] performed threefold cross validation experiments using different methods; the results are shown in Table 5. Their best result is 90.6% BACC for the boosted FE method, which combines a boosting algorithm with the sequential forward selection (SFS) method. Our best performing method achieves 86.18% BACC, which outperforms the student-t test (T-test), the Kolmogorov-Smirnov test (KS-test), the P-test, PCA/LDA, SFS, sequential backward selection (SBS), the nearest shrunken centroid (NSC) and the boosting algorithm, and is worse than the boosting algorithm combined with sequential forward selection.

Table 5. Performance of different methods [13] on the prostate cancer dataset (threefold cross validation): BACC, specificity, sensitivity and PPV for No FE, PCA, PCA/LDA, SFS, SBS, P-test, T-test, KS-test, NSC(2), Boosted and Boosted FE. PPV stands for positive predictive value; BACC stands for balanced correct rate; No FE stands for the nearest centroid classifier without feature selection; PCA and LDA stand for principal component analysis and linear discriminant analysis; SFS and SBS stand for sequential forward selection and sequential backward selection; T-test is the student-t test; KS-test the Kolmogorov-Smirnov test; NSC the nearest shrunken centroid; Boosted the boosting algorithm; Boosted FE combines the boosting algorithm and the sequential forward selection method.
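The two-class grouping described above can be written down directly. In this sketch only the sample counts come from the text; the intensity values are synthetic placeholders:

```python
import numpy as np

# Hypothetical per-subgroup feature matrices for the four JNCI groups.
rng = np.random.default_rng(4)
n_features = 15154
bph = rng.normal(size=(190, n_features))       # BPH, PSA > 4
healthy = rng.normal(size=(63, n_features))    # no evidence of disease
cancer_lo = rng.normal(size=(26, n_features))  # cancer, PSA 4-10
cancer_hi = rng.normal(size=(43, n_features))  # cancer, PSA > 10

# Normal class = BPH + healthy (253); cancer class = both cancer groups (69).
X = np.vstack([bph, healthy, cancer_lo, cancer_hi])
y = np.array([0] * (190 + 63) + [1] * (26 + 43))
print(X.shape, "normal:", (y == 0).sum(), "cancer:", (y == 1).sum())
```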
4. Discussion and conclusions

In this paper we propose a feature extraction algorithm based on multilevel wavelet decomposition for high dimensional mass spectra. A set of wavelet detail coefficients at different levels is used to reduce the dimensionality of the mass spectra and to characterize their transient changes, in order to detect the difference between cancer tissue and normal tissue. The feature extraction method based on wavelet detail coefficients is a novel application to mass spectrometry data.

A set of orthogonal wavelet basis functions is used to represent the features of mass spectra. Compared to the PCA and LDA methods, the wavelet feature extraction method does not depend on the training dataset to obtain the basis of the feature space; it is the wavelet basis that constructs the feature space. So the wavelet feature extraction method not only keeps the time property of mass spectra, but also dramatically reduces the computation load compared to the PCA and LDA methods. The wavelet detail coefficients are the high frequency part of the mass spectra and are usually discarded in favour of the low frequency part, the wavelet approximation. The wavelet detail coefficients have small energy and normally contain the noise from the acquisition of mass spectra. After removing the noise using the first levels of the wavelet decomposition, the detail coefficients at the third level achieve competitive performance compared to other feature extraction and feature selection methods. The experimental results suggest that the wavelet detail coefficients at the third level are an efficient way to characterize the features of high dimensional mass spectra.

Conflict of interest statement

None declared.

Acknowledgements

This work was supported by SRF for ROCS, SEM, and the Natural Science Foundation of Shandong Province (Y28G3), China.

References

[1] E.F. Petricoin, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn, L.A. Liotta, Use of proteomic patterns in serum to identify ovarian cancer, The Lancet 359 (2002).
[2] J.M. Sorace, M. Zhan, A data review and re-assessment of ovarian cancer serum proteomic profiling, BMC Bioinformatics 4 (2003) 24.
[3] C.M. Michener, A.M. Ardekani, E.F. Petricoin III, L.A. Liotta, E.C. Kohn, Genomics and proteomics: application of novel technology to early detection and prevention of cancer, Cancer Detection and Prevention 26 (2002).
[4] E.F. Petricoin, K.C. Zoon, E.C. Kohn, J.C. Barrett, L.A. Liotta, Clinical proteomics: translating benchside promise into bedside reality, Nature Reviews Drug Discovery 1 (2002).
[5] P.R. Srinivas, M. Verma, Y. Zhao, S. Srivastava, Proteomics for cancer biomarker discovery, Clinical Chemistry 48 (2002).
[6] P.C. Herrmann, L.A. Liotta, E.F. Petricoin III, Cancer proteomics: the state of the art, Disease Markers 17 (2001).
[7] G.L. Wright Jr., L.H. Cazares, S.M. Leung, S. Nazim, B.L. Adam, T.T. Yip, P.F. Schellhammer, L. Gong, A. Vlahou, Proteinchip surface enhanced laser desorption/ionization (SELDI) mass spectrometry: a novel protein biochip technology for detection of prostate cancer biomarkers in complex protein mixtures, Prostate Cancer and Prostatic Diseases 2 (1999).
[8] A. Vlahou, P.F. Schellhammer, S. Mendrinos, K. Patel, F.L. Kondylis, L. Gong, S. Nazim, G.L. Wright Jr., Development of a novel proteomic approach for the detection of transitional cell carcinoma of the bladder in urine, American Journal of Pathology 158 (2001).
[9] R.H. Lilien, H. Farid, B.R. Donald, Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum, Journal of Computational Biology 10 (6) (2003).
[10] H. Park, M. Jeon, J.B. Rosen, Lower dimensional representation of text data based on centroids and least squares, BIT 43 (2003).
[11] B. Wu, T. Abbott, D. Fishman, W. McMurray, G. Mor, K. Stone, D. Ward, K. Williams, H. Zhao, Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data, Bioinformatics 19 (2003).
[12] N.O. Jeffries, Performance of a genetic algorithm for mass spectrometry proteomics, BMC Bioinformatics 5 (2004).
[13] I. Levner, Feature selection and nearest centroid classification for protein mass spectrometry, BMC Bioinformatics 6 (2005).
[14] J.S. Yu, S. Ongarello, R. Fiedler, X.W. Chen, G. Toffolo, C. Cobelli, Z. Trajanoski, Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data, Bioinformatics 21 (2005).
[15] A. Grossmann, J. Morlet, Decomposition of Hardy functions into square integrable wavelets of constant shape, SIAM Journal on Mathematical Analysis 15 (1984).
[16] S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989).
[17] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[18] C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Kluwer Academic Publishers, Dordrecht, 1998.
[19] I. Daubechies, Orthonormal bases of compactly supported wavelets, Communications on Pure and Applied Mathematics 41 (1988).
[20] IEEE DSP Committee (Ed.), Programs for Digital Signal Processing, IEEE Press, New York, 1979.
[21] E.F. Petricoin III, D.K. Ornstein, C.P. Paweletz, A. Ardekani, P.S. Hackett, B.A. Hitt, A. Velassco, C. Trucco, L. Wiegand, K. Wood, C.B. Simone, P.J. Levine, W.M. Linehan, M.R. Emmert-Buck, S.M. Steinberg, E.C. Kohn, L.A. Liotta, Serum proteomic patterns for detection of prostate cancer, Journal of the National Cancer Institute 94 (2002).


More information

2. On classification and related tasks

2. On classification and related tasks 2. On classification and related tasks In this part of the course we take a concise bird s-eye view of different central tasks and concepts involved in machine learning and classification particularly.

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

FACE RECOGNITION USING SUPPORT VECTOR MACHINES

FACE RECOGNITION USING SUPPORT VECTOR MACHINES FACE RECOGNITION USING SUPPORT VECTOR MACHINES Ashwin Swaminathan ashwins@umd.edu ENEE633: Statistical and Neural Pattern Recognition Instructor : Prof. Rama Chellappa Project 2, Part (b) 1. INTRODUCTION

More information

Categorization of Sequential Data using Associative Classifiers

Categorization of Sequential Data using Associative Classifiers Categorization of Sequential Data using Associative Classifiers Mrs. R. Meenakshi, MCA., MPhil., Research Scholar, Mrs. J.S. Subhashini, MCA., M.Phil., Assistant Professor, Department of Computer Science,

More information

Linear methods for supervised learning

Linear methods for supervised learning Linear methods for supervised learning LDA Logistic regression Naïve Bayes PLA Maximum margin hyperplanes Soft-margin hyperplanes Least squares resgression Ridge regression Nonlinear feature maps Sometimes

More information

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Petr Somol 1,2, Jana Novovičová 1,2, and Pavel Pudil 2,1 1 Dept. of Pattern Recognition, Institute of Information Theory and

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION Sandeep Kaur 1, Dr. Sheetal Kalra 2 1,2 Computer Science Department, Guru Nanak Dev University RC, Jalandhar(India) ABSTRACT

More information

Feature Selection and Classification for Small Gene Sets

Feature Selection and Classification for Small Gene Sets Feature Selection and Classification for Small Gene Sets Gregor Stiglic 1,2, Juan J. Rodriguez 3, and Peter Kokol 1,2 1 University of Maribor, Faculty of Health Sciences, Zitna ulica 15, 2000 Maribor,

More information

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification 1 Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification Feng Chu and Lipo Wang School of Electrical and Electronic Engineering Nanyang Technological niversity Singapore

More information