and Machine Learning. Bruges (Belgium), 24-26 April 2013, i6doc.com publ., ISBN 978-2-87419-081-0. Available from http://www.i6doc.com/en/livre/?gcoi=28001100131010.

Handling missing values in kernel methods with application to microbiology data

Vladimer Kobayashi 1 and Tomàs Aluja 2 and Lluís A. Belanche 3

1- Laboratoire Hubert Curien - UMR CNRS 5516
Bâtiment F, 18 Rue du Professeur Benoît Lauras, 42000 Saint-Etienne, FRANCE

2- Computer Science School - Dept. of Statistics & Operations Research
Technical University of Catalonia
Jordi Girona, 1-3, 08034, Barcelona, SPAIN

3- Computer Science School - Dept. of Software
Technical University of Catalonia
Jordi Girona, 1-3, 08034, Barcelona, SPAIN

Abstract. We discuss several approaches that make it possible for kernel methods to deal with missing values. The first two are extended kernels able to handle missing values without data preprocessing methods. Another two methods are derived from a sophisticated multiple imputation technique involving logistic regression as local model learner. The performance of these approaches is compared using a binary data set that arises typically in microbiology (the microbial source tracking problem). Our results show that the kernel extensions demonstrate competitive performance in comparison with multiple imputation in terms of predictive accuracy. However, these results are achieved with a simpler and deterministic methodology and entail a much lower computational effort.

1 Introduction

Modern modelling problems are difficult for a number of reasons, including the challenge of dealing with a significant amount of missing information. Kernel methods have won great popularity as a reliable machine learning tool; in particular, Support Vector Machines (SVMs) are kernel-based methods that are used for tasks such as classification and regression, among others [1]. The kernel function is a very flexible container under which to express knowledge about the problem as well as to capture the meaningful relations in input space.
Some classical modelling methods like Naïve Bayes and CART decision trees are able to deal with missing values directly. However, the process of optimizing an SVM assumes that the training data set is complete. When present, missing values almost always represent a serious problem because they force one to preprocess the data set, and a good deal of effort is normally put into this part of the modelling. In order to process such data sets with kernel methods, an imputation procedure is then deemed a necessary but demanding step. The aim of this paper is to examine and compare a number of approaches to handle missing values in kernel methods. Specifically, we present two methods that extend a kernel function in the presence of missing values and hence
handle missing values directly. We also investigate two different uses of the well-established multiple imputation method. These four approaches are used to analyze a fecal source pollution data set presenting several challenges: it is a multi-class, small sample size problem plagued by missing values. All four have slightly better predictive accuracies than the best model suggested so far.

2 Preliminaries

Missing information is difficult to handle, especially when the lost parts are of significant size. Three possible ways to deal with missing data are: i) discard all observations (or variables) with missing values, ii) impute the values, and iii) extend the learner to work with incomplete observations. Deleting instances and/or variables containing missing values results in loss of relevant data and is also frustrating because of the effort spent in collecting the sacrificed information. Imputation methods entail inferring values for the missing entries [2, 3]. A growing number of studies recommend the use of multiple imputation, e.g. [4]. Compared to classical imputation, which imputes a single value, multiple imputation produces several values to fill the missing entries. These methods are independent of the learning algorithm and hence their impact on the learning process is uncertain. For SVMs, recent work tackles the problem by defining a modified risk that incorporates the uncertainty due to the missingness into a convex optimization task [5].

2.1 First kernel extension

The first kernel extension is obtained by wrapping a known kernel around a probability distribution [6]:

Theorem 1 Let 𝕏 denote a missing value. Let k be a kernel in X and P a probability mass function in X.
Then the function k_𝕏(x, y) given by

  k_𝕏(x, y) =
    k(x, y),                                     if x ≠ 𝕏 and y ≠ 𝕏;
    Σ_{y'∈X} P(y') k(x, y'),                     if x ≠ 𝕏 and y = 𝕏;
    Σ_{x'∈X} P(x') k(x', y),                     if x = 𝕏 and y ≠ 𝕏;
    Σ_{x'∈X} Σ_{y'∈X} P(x') P(y') k(x', y'),     if x = y = 𝕏

is a kernel in X ∪ {𝕏}.

For binary variables x, y ∈ {0, 1}, define the kernel:

  k(x, y) = 1 if x = y;  0 if x ≠ y    (1)
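The case analysis of Theorem 1 can be made concrete with a minimal sketch in Python (the paper's experiments use R; the function names and the example distribution P below are ours), with None standing for the missing symbol 𝕏:

```python
# Sketch of the univariate extended kernel of Theorem 1 for binary
# variables. None marks a missing value; P is an assumed marginal
# probability mass function over {0, 1}.

def k(x, y):
    """Overlap kernel of Eq. (1) on {0, 1}: 1 if x == y, else 0."""
    return 1.0 if x == y else 0.0

def k_ext(x, y, P):
    """Extended kernel k_X of Theorem 1; P maps each value to its probability."""
    values = [0, 1]
    if x is not None and y is not None:
        return k(x, y)
    if x is not None:            # y missing: average over completions y'
        return sum(P[v] * k(x, v) for v in values)
    if y is not None:            # x missing: average over completions x'
        return sum(P[v] * k(v, y) for v in values)
    # both missing: average over all pairs (x', y')
    return sum(P[u] * P[v] * k(u, v) for u in values for v in values)

P = {0: 0.3, 1: 0.7}             # example marginal distribution
print(k_ext(1, 1, P))            # 1.0: both observed and equal
print(k_ext(1, None, P))         # 0.7: expected agreement when y' ~ P
print(round(k_ext(None, None, P), 2))   # 0.58 = 0.3^2 + 0.7^2
```

The missing-value cases are simply expectations of the base kernel under P, which is what makes the extension deterministic once P is estimated from the data.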
Notice that this kernel is univariate. For the multivariate case x, y ∈ X = {0, 1}^d, define the first kernel extension (1KE) as:

  K_1(x, y) = (1/d) Σ_{i=1}^{d} k_𝕏(x_i, y_i)    (2)

2.2 Second kernel extension

The second kernel extension (2KE) deals with the multivariate case directly but is limited by the number of variables. The idea is to consider all possible completions of an observation with missing values. An example will prove helpful: suppose we have an incomplete observation given by (0, 1, 𝕏, 0); then the possible completions for this observation are {(0, 1, 0, 0), (0, 1, 1, 0)}.

Theorem 2 Let 𝕏 denote a missing value. Let k be a kernel in X = {0, 1}^d. Let c(x) be the set of completions of x. Given two vectors x, y, the function

  K_2(x, y) = (1/(|c(x)| |c(y)|)) Σ_{x'∈c(x)} Σ_{y'∈c(y)} k(x', y')    (3)

is a kernel in ({0, 1} ∪ {𝕏})^d.

Proof. The set of kernels is a convex cone; therefore it is closed under linear combinations with positive coefficients.

2.3 Multiple imputation methods

These methods involve the estimation of what the missing values could have been and then use the completed data sets for modelling. Two main methods for multivariate data have been proposed: joint modeling (JM) and fully conditional specification (FCS). JM assumes a (multivariate) distribution for the missing data, and draws imputed values from the conditional distributions by MCMC techniques. JM techniques are available if the distribution is assumed to be multivariate normal, log-linear or a general location model. The success of JM depends on the impact of these assumptions. On the other hand, FCS does not make distributional assumptions in advance, since it specifies a multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each incomplete variable [7].
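The completion step of 2KE, and the averaging in Eq. (3), can be sketched as follows (Python for illustration; the overlap base kernel and the names are ours). Note the 2^m completions for an observation with m missing entries, which is why 2KE is limited by the number of variables:

```python
# Sketch of the second kernel extension (2KE, Eq. 3): enumerate all
# completions of each incomplete binary vector and average a base
# kernel over the pairs. None marks a missing entry.
from itertools import product

def completions(x):
    """All vectors in {0,1}^d that agree with x on its observed entries."""
    options = [[0, 1] if xi is None else [xi] for xi in x]
    return list(product(*options))

def k_overlap(u, v):
    """Assumed base kernel on {0,1}^d: fraction of matching coordinates."""
    return sum(ui == vi for ui, vi in zip(u, v)) / len(u)

def K2(x, y):
    cx, cy = completions(x), completions(y)
    total = sum(k_overlap(u, v) for u in cx for v in cy)
    return total / (len(cx) * len(cy))

x = (0, 1, None, 0)                 # the example from the text
print(completions(x))               # [(0, 1, 0, 0), (0, 1, 1, 0)]
print(K2(x, (0, 1, 1, 0)))          # 0.875: average of 3/4 and 4/4
```

When both arguments are complete, each set of completions is a singleton and K2 reduces to the base kernel, as expected.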
FCS will start with an initial imputation and then draw imputations by iterating over the conditional densities [8]. Let us denote our observation as X = (X_1, ..., X_d), possibly with missing values. The observed and missing parts of X are denoted by X^obs and X^mis, respectively. Let X_{-j} = (X_1, ..., X_{j-1}, X_{j+1}, ..., X_d) denote the collection of the d-1 variables in X except X_j. The hypothetically complete observation X is assumed to be drawn from a d-variate distribution P(X | θ). We assume that the multivariate distribution of
X is completely specified by θ, a vector of unknown parameters. The parameters θ_1, ..., θ_d are specific to the respective conditional densities and are not necessarily the product of a factorization of the true joint distribution P(X | θ). The chained equations method obtains the posterior distribution for θ by sampling iteratively from conditional distributions of the form P(X_j | X_{-j}, θ_j), j = 1, ..., d. Starting from a simple draw from the observed marginal distributions, the t-th iteration of the process is a Gibbs sampler that successively draws [8]:

  θ_1^(t) ~ P(θ_1 | X_1^obs, X_2^(t-1), ..., X_d^(t-1))
  X_1^(t) ~ P(X_1^mis | X_1^obs, X_2^(t-1), ..., X_d^(t-1), θ_1^(t))
  ...
  θ_d^(t) ~ P(θ_d | X_d^obs, X_1^(t), ..., X_{d-1}^(t))
  X_d^(t) ~ P(X_d^mis | X_d^obs, X_1^(t), ..., X_{d-1}^(t), θ_d^(t))

where X_j^(t) is the j-th imputed variable at iteration t, composed of its observed part X_j^obs and the current imputation of its missing part. Observe that previous imputations X_j^(t-1) only enter X_j^(t) through their relation with other variables, and not directly. Moreover, and unlike other MCMC methods, no information about X_j^mis is used to draw θ_j^(t), so convergence can be quite fast. The procedure is repeated m times to generate m different multiple imputations. For the technique to be practical, a univariate imputation model is needed for each of the incomplete variables. The choice will be steered by the scale of the dependent variable (the variable that we need to impute), and preferably incorporates knowledge about the relation between the variables. Other considerations include the set of variables to include as predictors; the order in which variables should be imputed; the number of imputed data sets; whether we should impute variables that are functions of other (incomplete) variables; the form of the starting imputations and the number of iterations.
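The shape of the FCS loop above (starting imputations from the observed marginals, then repeated single-variable sweeps) can be sketched with a stdlib-only toy. For self-containment we replace the logistic-regression draw used in the paper with a simple hot-deck stand-in (sampling from the nearest rows); the data and all names are illustrative, not the paper's:

```python
# Skeleton of a chained-equations (FCS) imputation loop over binary
# data. None marks a missing entry. A nearest-rows ("hot-deck") draw
# stands in for the per-variable logistic-regression imputer.
import random

def impute_fcs(rows, n_iter=5, seed=0):
    """rows: list of lists over {0, 1, None}. Returns one completed copy."""
    rng = random.Random(seed)
    d = len(rows[0])
    missing = [(i, j) for i, r in enumerate(rows)
               for j, v in enumerate(r) if v is None]
    data = [list(r) for r in rows]
    # starting imputations: draw from the observed marginal of each column
    for i, j in missing:
        pool = [r[j] for r in rows if r[j] is not None]
        data[i][j] = rng.choice(pool)
    # Gibbs-style sweeps: re-impute each missing cell given the rest
    for _ in range(n_iter):
        for i, j in missing:
            # stand-in for the logistic draw: sample X_j from the 3 rows
            # closest in Hamming distance on the remaining variables
            others = [r for k, r in enumerate(data) if k != i]
            others.sort(key=lambda r: sum(r[t] != data[i][t]
                                          for t in range(d) if t != j))
            data[i][j] = rng.choice([r[j] for r in others[:3]])
    return data

rows = [[0, 1, None], [1, None, 0], [0, 1, 1], [1, 0, 0], [0, None, 1]]
print(impute_fcs(rows))   # every None replaced by a 0 or a 1
```

Running the whole loop m times with different seeds would yield the m completed data sets that multiple imputation requires; in the paper this role is played by the mice package in R.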
3 Experimental evaluation

3.1 Problem description

The study of fecal source pollution in waterbodies is a major problem in ensuring the welfare of human populations, given its incidence in a variety of diseases, especially in under-developed countries. Microbial source tracking methods attempt to identify the source of contamination, allowing for improved risk analysis and better water management [9]. The available data set includes a number of chemical, microbial, and eukaryotic markers of fecal pollution in water. All variables (except the class variable) are binary, i.e., they signal the presence or absence of a particular marker. The data set includes 9 binary variables, 138 observations and four classes, with 212 missing entries out of 1,242 (approximately 17%). All variables have percentages of missing entries greater than 15%. A recent study
investigating this data set reported a Naïve Bayes classifier as the best model, yielding an accuracy of 77.9% [10].

3.2 Performing multiple imputation

To our knowledge there has been little work on using multiple imputation for missing data treatment prior to the application of an SVM as the learning algorithm. The crucial element is how to pool the results coming from the several SVMs that are trained on each imputed data set. In this paper we propose two methods to do this pooling. The first method is to concatenate the multiply imputed data sets and optimize an SVM classifier on the resulting set; this not only accounts for the variability of the parameter estimates but also for the variability of the training observations in relation to the imputed values. The second, more standard procedure involves fitting separate SVMs to each imputed data set and taking the average performance of the different SVMs. Since the missing values in our problem are found in variables which are binary, logistic regression is a good choice for the imputation models. One is also required to identify which of the remaining variables will be used as predictors. To this end we compute the Kendall rank correlation coefficient for each pair of variables and set a threshold that will serve as an indicator for a variable to be included as a predictor. In addition, we also determine the proportion of usable cases (PUC); this will tell us whether a predictor contains only fractional information to impute the target variable, and thus could be dropped from the model. To improve the imputation model we decided to use the class variable as a predictor whenever it is appropriate (as indicated by the Kendall coefficient and the PUC). Finally, the number of imputed data sets is set to 10, to keep computations manageable.
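The two pooling schemes just described can be sketched side by side. To keep the example self-contained, a trivial majority-class learner stands in for the SVM, and the toy data sets are ours:

```python
# Sketch of the two pooling schemes over m imputed data sets:
# (1) concatenate and fit one model; (2) fit one model per imputed
# set and average the test accuracies.
from collections import Counter

def train(X, y):
    """Stand-in learner: always predict the majority training class."""
    majority = Counter(y).most_common(1)[0][0]
    return lambda x: majority

def accuracy(model, X, y):
    return sum(model(x) == yi for x, yi in zip(X, y)) / len(y)

# m = 2 imputed versions of the same toy training set
imputed = [
    ([[0, 1], [1, 0], [1, 1]], ["a", "a", "b"]),
    ([[0, 0], [1, 0], [1, 1]], ["a", "a", "b"]),
]
X_test, y_test = [[0, 1], [1, 1]], ["a", "b"]

# scheme 1: concatenate the imputed sets, fit a single model
X_cat = [x for X, _ in imputed for x in X]
y_cat = [yi for _, y in imputed for yi in y]
acc_1 = accuracy(train(X_cat, y_cat), X_test, y_test)

# scheme 2: fit one model per imputed set, average the accuracies
acc_2 = sum(accuracy(train(X, y), X_test, y_test)
            for X, y in imputed) / len(imputed)
print(acc_1, acc_2)     # 0.5 0.5
```

With a real SVM in place of the stand-in, scheme 1 yields a single classifier exposed to all imputed variants of each observation, while scheme 2 matches the standard multiple-imputation practice of pooling per-data-set results.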
3.3 Results and discussion

Four separate predictive models were built, one for each of the approaches: 1KE, 2KE, the first version of multiple imputation (1MI) and the second version of multiple imputation (2MI) 1. To obtain a reliable estimate of predictive accuracy in this small data set, a stratified 10 times 10-fold cross-validation (10x10cv) is performed. Table 1 summarizes the results.

                               10x10cv for each class
  Approach    C     10x10cv   Human   Cow    Poultry   Swine
  1KE         2.0   79.3      95.4    64.5   75.2      69.4
  2KE         1.6   78.2      92.6    62.8   71.8      74.2
  1MI         1.0   79.9      92.7    66.4   69.4      80.2
  2MI         1.0   79.0      94.5    57.5   70.8      78.8

Table 1: Mean 10x10cv accuracies for the four approaches to handle missing values. Also shown are the best cost parameter C and detailed per-class performance.

1 We used the R software [11], extended with the kernlab package (for the SVMs) and the mice package, which implements the FCS method for multivariate multiple imputation.
The table shows that all four approaches have comparable performance, with 2KE seeming inferior; 1KE performs best in identifying human contamination, the most important single decision; the two multiple imputations are good for swine origin. All four approaches have difficulties in classifying the cow class, probably because it is the minority class, representing only 18% of the observations.

4 Conclusions

In real problems, accuracy may not tell the whole picture about a model. Other performance criteria include development cost, interpretability, and utility. The cost here refers to how much pre-processing effort and computing time we need in order to build the model. Undoubtedly, the two imputation methods require more time and resources compared to the kernel extensions. Prior to imputation, a good univariate imputation model must be identified for each variable containing missing values. These methods depend also on several non-trivial algorithmic options and have an added computational cost for training separate SVMs for each of the imputed data sets. Interpretability refers to the complexity of the obtained model. It is unclear whether a model that had values imputed (several times) is more interpretable than one that had not. Finally, the model must be useful in practice: in a real deployment of the model, new and unseen observations emerge which we need to classify, and which may contain missing values. The two kernel extensions are the only methods able to face this situation.

References

[1] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.

[2] R. Little and D. Rubin. Statistical Analysis with Missing Data. Wiley-Interscience, 2009.

[3] Q. Song and M. Shepperd. Missing data imputation techniques. Int. J. Business Intelligence and Data Mining, 2(3):261-291, 2007.

[4] Y. He.
Missing data analysis using multiple imputation. Circulation: Cardiovascular Quality and Outcomes, 3(1):98-105, 2010.

[5] K. Pelckmans, J. De Brabanter, J. Suykens and B. De Moor. Handling missing values in support vector machine classifiers. Neural Networks, 18:684-692, 2005.

[6] G. Nebot and Ll. Belanche. A kernel extension to handle missing data. In Bramer, Ellis and Petridis, editors, Research and Development in Intelligent Systems XXVI, Springer, 2010.

[7] K. Lee and J. Carlin. Multiple imputation for missing data: Fully conditional specification vs multivariate normal imputation. Amer. Journal of Epidemiology, 171(5):624-632, 2010.

[8] S. van Buuren and K. Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1-67, 2011.

[9] T.M. Scott, J.B. Rose, T.M. Jenkins, S.R. Farrah and J. Lukasik. Microbial source tracking: Current methodology and future directions. Applied and Environmental Microbiology, 68(12):5796-5803, 2002.

[10] E. Ballesté, X. Bonjoch, Ll. Belanche and A. R. Blanch. Molecular indicators used in the development of predictive models for microbial source tracking. Applied and Environmental Microbiology, 76(6):1789-1795, 2010.

[11] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008.