Handling missing values in kernel methods with application to microbiology data

Similar documents
THE BAYESIAN RECEIVER OPERATING CHARACTERISTIC CURVE AN EFFECTIVE APPROACH TO EVALUATE THE IDS PERFORMANCE

Bayesian localization microscopy reveals nanoscale podosome dynamics

A shortest path algorithm in multimodal networks: a case study with time varying costs

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources

The Reconstruction of Graphs. Dhananjay P. Mehendale Sir Parashurambhau College, Tilak Road, Pune , India. Abstract

Kinematic Analysis of a Family of 3R Manipulators

Characterizing Decoding Robustness under Parametric Channel Uncertainty

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation

Coupling the User Interfaces of a Multiuser Program

Politehnica University of Timisoara Mobile Computing, Sensors Network and Embedded Systems Laboratory. Testing Techniques

Modifying ROC Curves to Incorporate Predicted Probabilities

Estimating Velocity Fields on a Freeway from Low Resolution Video

Using Vector and Raster-Based Techniques in Categorical Map Generalization

Generalized Edge Coloring for Channel Assignment in Wireless Networks

Feature Extraction and Rule Classification Algorithm of Digital Mammography based on Rough Set Theory

Transient analysis of wave propagation in 3D soil by using the scaled boundary finite element method

Threshold Based Data Aggregation Algorithm To Detect Rainfall Induced Landslides

Rough Set Approach for Classification of Breast Cancer Mammogram Images

A Neural Network Model Based on Graph Matching and Annealing :Application to Hand-Written Digits Recognition

Queueing Model and Optimization of Packet Dropping in Real-Time Wireless Sensor Networks

Comparison of Methods for Increasing the Performance of a DUA Computation

Cluster Center Initialization Method for K-means Algorithm Over Data Sets with Two Clusters

Online Appendix to: Generalizing Database Forensics

Non-homogeneous Generalization in Privacy Preserving Data Publishing

Particle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem

Generalized Edge Coloring for Channel Assignment in Wireless Networks

Learning Polynomial Functions. by Feature Construction

A Framework for Dialogue Detection in Movies

Learning convex bodies is hard

A fast embedded selection approach for color texture classification using degraded LBP

Shift-map Image Registration

MANJUSHA K.*, ANAND KUMAR M., SOMAN K. P.

Automation of Bird Front Half Deboning Procedure: Design and Analysis

Design and Analysis of Optimization Algorithms Using Computational

filtering LETTER An Improved Neighbor Selection Algorithm in Collaborative Taek-Hun KIM a), Student Member and Sung-Bong YANG b), Nonmember

Skyline Community Search in Multi-valued Networks

A Convex Clustering-based Regularizer for Image Segmentation

Finite Automata Implementations Considering CPU Cache J. Holub

New Geometric Interpretation and Analytic Solution for Quadrilateral Reconstruction

Lab work #8. Congestion control

A Classification of 3R Orthogonal Manipulators by the Topology of their Workspace

Politecnico di Torino. Porto Institutional Repository

Software Reliability Modeling and Cost Estimation Incorporating Testing-Effort and Efficiency

William S. Law. Erik K. Antonsson. Engineering Design Research Laboratory. California Institute of Technology. Abstract

From an Abstract Object-Oriented Model to a Ready-to-Use Embedded System Controller

EXACT SIMULATION OF A BOOLEAN MODEL

On the Role of Multiply Sectioned Bayesian Networks to Cooperative Multiagent Systems

Learning Subproblem Complexities in Distributed Branch and Bound

Frequency Domain Parameter Estimation of a Synchronous Generator Using Bi-objective Genetic Algorithms

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Random Clustering for Multiple Sampling Units to Speed Up Run-time Sample Generation

Optimization of cable-stayed bridges with box-girder decks

Classifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means

1 Surprises in high dimensions

On the Placement of Internet Taps in Wireless Neighborhood Networks

New Version of Davies-Bouldin Index for Clustering Validation Based on Cylindrical Distance

Yet Another Parallel Hypothesis Search for Inverse Entailment Hiroyuki Nishiyama and Hayato Ohwada Faculty of Sci. and Tech. Tokyo University of Scien

Image Segmentation using K-means clustering and Thresholding

Preamble. Singly linked lists. Collaboration policy and academic integrity. Getting help

A Multi-class SVM Classifier Utilizing Binary Decision Tree

Fuzzy Clustering in Parallel Universes

a 1 (x ) a 1 (x ) a 1 (x ) ) a 2 3 (x Input Variable x

Study of Network Optimization Method Based on ACL

Adjacency Matrix Based Full-Text Indexing Models

NEW METHOD FOR FINDING A REFERENCE POINT IN FINGERPRINT IMAGES WITH THE USE OF THE IPAN99 ALGORITHM 1. INTRODUCTION 2.

Optimal Oblivious Path Selection on the Mesh

Robust Camera Calibration for an Autonomous Underwater Vehicle

A Novel Density Based Clustering Algorithm by Incorporating Mahalanobis Distance

Variable Independence and Resolution Paths for Quantified Boolean Formulas

CS 106 Winter 2016 Craig S. Kaplan. Module 01 Processing Recap. Topics

A Duality Based Approach for Realtime TV-L 1 Optical Flow

Loop Scheduling and Partitions for Hiding Memory Latencies

SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH

Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs

Compiler Optimisation

Verifying performance-based design objectives using assemblybased vulnerability

Learning to Rank: A New Technology for Text Processing

Tight Wavelet Frame Decomposition and Its Application in Image Processing

WLAN Indoor Positioning Based on Euclidean Distances and Fuzzy Logic

Technical Report TR Navigation Around an Unknown Obstacle for Autonomous Surface Vehicles Using a Forward-Facing Sonar

APPLYING GENETIC ALGORITHM IN QUERY IMPROVEMENT PROBLEM. Abdelmgeid A. Aly

A Plane Tracker for AEC-automation Applications

THE increasingly digitized power system offers more data,

Table-based division by small integer constants

Inuence of Cross-Interferences on Blocked Loops: to know the precise gain brought by blocking. It is even dicult to determine for which problem

Secure Network Coding for Distributed Secret Sharing with Low Communication Cost

Worksheet A. y = e row in the table below, giving values to 1 d.p. Use your calculator to complete the

Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines 1 2

Improving Performance of Sparse Matrix-Vector Multiplication

0607 CAMBRIDGE INTERNATIONAL MATHEMATICS

A Revised Simplex Search Procedure for Stochastic Simulation Response Surface Optimization

Navigation Around an Unknown Obstacle for Autonomous Surface Vehicles Using a Forward-Facing Sonar

Bootstrap and multiple imputation under missing data in AR(1) models

Video-based Characters Creating New Human Performances from a Multi-view Video Database

Dense Disparity Estimation in Ego-motion Reduced Search Space

State Indexed Policy Search by Dynamic Programming. Abstract. 1. Introduction. 2. System parameterization. Charles DuHadway

CS269I: Incentives in Computer Science Lecture #8: Incentives in BGP Routing

AnyTraffic Labeled Routing

6 Gradient Descent. 6.1 Functions

Transcription:

an Machine Learning. Bruges (Belgium), 24-26 April 2013, i6oc.com publ., ISBN 978-2-87419-081-0. Available from http://www.i6oc.com/en/livre/?gcoi=28001100131010. Hanling missing values in kernel methos with application to microbiology ata Vlaimer Kobayashi 1 an Tomàs Aluja 2 an Lluís A. Belanche 3 1- Laboratoire Hubert Curien - UMR CNRS 5516 Bâtiment F 18 Rue u Professeur Benoît Lauras 42000 Saint-Etienne FRANCE 2- Computer Science School - Dept. of Statistics & Operations Research Technical University of Catalonia Jori Girona, 1-3 08034, Barcelona, SPAIN 3- Computer Science School - Dept. of Software Technical University of Catalonia Jori Girona, 1-3 08034, Barcelona, SPAIN Abstract. We iscuss several approaches that make possible for kernel methos to eal with missing values. The first two are extene kernels able to hanle missing values without ata preprocessing methos. Another two methos are erive from a sophisticate multiple imputation technique involving logistic regression as local moel learner. The performance of these approaches is compare using a binary ata set that arises typically in microbiology (the microbial source tracking problem). Our results show that the kernel extensions emonstrate competitive performance in comparison with multiple imputation in terms of preictive accuracy. However, these results are achieve with a simpler an eterministic methoology an entail a much lower computational effort. 1 Introuction Moern moelling problems are ifficult for a number of reasons, incluing the challenge of ealing with a significant amount of missing information. Kernel methos have won great popularity as a reliable machine learning tool; in particular, Support Vector Machines (SVMs) are kernel-base methos that are use for tasks such as classification an regression, among others [1]. The kernel function is a very flexible container uner which to express knowlege about the problem as well as to capture the meaningful relations in input space. Some classical moelling methos like Naïve Bayes an CART ecision trees are able to eal with missing values irectly. However, the process of optimizing an SVM assumes that the training ata set is complete. When present, missing values almost always represent a serious problem because they force to preprocess the ataset an a goo eal of effort is normally put in this part of the moelling. In orer to process such atasets with kernel methos, an imputation proceure is then eeeme a necessary but emaning step. The aim of this paper is to examine an compare a number of approaches to hanle missing values in kernel methos. Specifically, we present two methos that exten a kernel function in the presence of missing values an hence 397

an Machine Learning. Bruges (Belgium), 24-26 April 2013, i6oc.com publ., ISBN 978-2-87419-081-0. Available from http://www.i6oc.com/en/livre/?gcoi=28001100131010. hanle missing values irectly. We also investigate two ifferent uses of the well-stablishe multiple imputation metho. These four approaches are use to analyze a fecal source pollution ataset presenting several challenges: it is a multi-class, small sample size problem plague by missing values. All four have slightly better preictive accuracies than the best moel suggeste so far. 2 Preliminaries Missing information is ifficult to hanle, specially when the lost parts are of significant size. Three possible ways to eal with missing ata are: i) iscar all observations (or variables) with missing values, ii) impute the values, an iii) exten the learner to work with incomplete observations. Deleting instances an/or variables containing missing values results in loss of relevant ata an is also frustrating because of the effort in collecting the sacrifice information. Imputation methos entail inferring values for the missing entries [2, 3]. A growing number of stuies recommen the use of multiple imputation e.g. [4]. Compare to classical imputation, which imputes a single value, multiple imputation prouces several values to fill the missing entries. These methos are inepenent of the learning algorithm an hence their impact on the learning process is uncertain. For SVMs, recent work tackles the problem by efining a moifie risk that incorporates the uncertainty ue to the missingness into a convex optimization task [5]. 2.1 First kernel extension The first kernel extension is obtaine by wrapping a known kernel aroun a probability istribution [6]: Theorem 1 Let X enote a missing value. Let k be a kernel in X an P a probability mass function in X. Then the function k X (x, y) given by k(x, y), if x, y =X ; P (y )k(x, y ), if x =X an y = X ; y k X (x, y) = X P (x )k(x,y), if x = X an y =X ; x X P (x ) P (y )k(x,y ), if x = y = X x X is a kernel in X {X}. y X For binary variables x, y {0, 1}, efine the kernel: { 1 if x = y k(x, y) = 0 if x =y (1) 398

an Machine Learning. Bruges (Belgium), 24-26 April 2013, i6oc.com publ., ISBN 978-2-87419-081-0. Available from http://www.i6oc.com/en/livre/?gcoi=28001100131010. Notice that this kernel is univariate. For the multivariate case x, y X = {0, 1}, efine the first kernel extension (1KE) as: K 1 (x, y) = 1 2.2 Secon kernel extension k X (x i,y i ) (2) The secon kernel extension (2KE) eals with the multivariate case irectly but is limite by the number of variables. The iea is to consier all possible completions of an observation with missing values. An example will prove helpful: suppose we have an incomplete observation given by (0, 1, X, 0); then the possible completions for this observation are {(0, 1, 0, 0), (0, 1, 1, 0)}. Theorem 2 Let X enote a missing value. Let k be a kernel in X = {0, 1}. Let c(x) be the set of completions of x. Given two vectors x, y X, the function K 2 (x, y) = is a kernel in X {X}. 1 c(x) c(y) i=1 x c(x) y c(y) k(x, y ) (3) Proof. The set of kernels is a convex cone; therefore it is close uner linear combinations with positive coefficients. 2.3 Multiple imputation methos These methos involve the estimation of what the missing values coul have been an then use the complete atasets for moelling. Two main methos for multivariate ata have been propose: joint moeling (JM)anfully conitional specification (FCS). JM assumes a (multivariate) istribution for the missing ata, an raws impute values from the conitional istributions by MCMC techniques. JM techniques are available if the istribution is assume to be multivariate normal, log-linear or a general location moel. The success of JM epens on the impact of these assumptions. On the other han, FCS oes not make istributional assumptions in avance, since it specifies a multivariate imputation moel on a variable-by-variable basis by a set of conitional ensities, one for each incomplete variable [7]. FCS will start with an initial imputation an then raw imputations by iterating over the conitional ensities [8]. Let us enote our observation as X =(X 1,...,X ), possibly with missing values. The observe an missing parts of X are enote by X obs an X mis, respectively. Let X j =(X 1,...,X j 1,X j+1,...,x ) enote the collection of the 1 variables in X except X j. The hypothetically complete observation X isassumetoberawnfroma -variate istribution P (X θ). We assume that the multivariate istribution of 399

an Machine Learning. Bruges (Belgium), 24-26 April 2013, i6oc.com publ., ISBN 978-2-87419-081-0. Available from http://www.i6oc.com/en/livre/?gcoi=28001100131010. X is completely specifie by θ, a vector of unknown parameters. The parameters θ 1,...,θ are specific to the respective conitional ensities an are not necessarily the prouct of a factorization of the true joint istribution P (X θ). The chaine equations metho obtains the posterior istribution for θ by sampling iteratively from conitional istributions of the form P (X j X j,θ j ),j = 1,...,. Starting from a simple raw from the observe marginal istributions, the t-th iteration of the process is a Gibbs sampler to successively raw [8]: where X (t) j θ (t) 1 P (θ 1 X obs 1,X (t 1) 2,...,X (t 1) ) X (t) 1 P (X1 mis X1 obs,x (t 1). θ (t) X (t) =(X obs j 2,...,X (t 1) P (θ X obs,x (t) 1,...,X(t) 1 ) P (X mis X obs,x (t) 1,...,X(t),θ (t) ),θ (t) 1 ),X (t) j )isthejth impute variable at iteration t. Observe only enter X (t) j through its relation with that previous imputations X (t 1) j other variables, an not irectly. Moreover, an unlike other MCMC methos, no information about Xj mis is use to raw θ (t) j, so convergence can be quite fast. The proceure is iterate a number of m times to generate m ifferent multiple imputations. For the technique to be practical, a univariate imputation moel is neee for each of the incomplete variables. The choice will be steere by the scale of the epenent variable (the variable that we nee to impute), an preferably incorporates knowlege about the relation between the variables. Other consierations inclue the set of variables to inclue as preictors; the orer in which variables shoul be impute; the number of impute ata sets; whether we shoul impute variables that are functions of other (incomplete) variables; the form of the starting imputations an the number of iterations. 3 Experimental evaluation 3.1 Problem escription The stuy of fecal source pollution in waterboies is a major problem in ensuring the welfare of human populations, given its incience in a variety of iseases, specially in uner-evelope countries. Microbial source tracking methos attempt to ientify the source of contamination, allowing for improve risk analysis an better water management [9]. The available ata set inclues a number of chemical, microbial, an eukaryotic markers of fecal pollution in water. All variables (except the class variable) are binary, i.e., they signal the presence or absence of a particular marker. The ata set inclues 9 binary variables, 138 observations an four classes, with 212 missing entries out of 1, 242 (approximately 17%). All variables have percentages of missing entries greater than 15%. A recent stuy 400

an Machine Learning. Bruges (Belgium), 24-26 April 2013, i6oc.com publ., ISBN 978-2-87419-081-0. Available from http://www.i6oc.com/en/livre/?gcoi=28001100131010. investigating this ata set reporte a Naïve Bayes classifier as the best moel, yieling an accuracy of 77.9% [10]. 3.2 Performing multiple imputation To our knowlege there has been little work in using multiple imputation for missing ata treatment prior to the application of a SVM as the learning algorithm. The crucial element is how to pool the results coming from several SVMs that are traine for each impute ata set. In this paper we propose two methos to o this pooling. The first metho is to concatenate the multiply impute ata sets an optimize an SVM classifier in the resulting set; this not only accounts for the variability of the parameter estimates but also for the variability of the training observations in relation to the impute values. The secon, more stanar proceure, involves fitting separate SVMs to each impute ata set an get the average performance of the ifferent SVMs. Since the missing values in our problem are foun in variables which are binary, logistic regression is a goo choice for the imputation moels. One is also require to ientify which of the remaining variables will be use as preictors. To this en we compute the Kenall rank correlation coefficient for each pair of variables an set a threshol that will serve as an inicator for a variable to be inclue as a preictor. In aition, we also etermine the proportion of usable cases (PUC); this will tell us whether a preictor contains only fractional information to impute the target variable, an thus coul be roppe from the moel. To improve the imputation moel we ecie to use the class variable as a preictor whenever it is appropriate (as inicate by the Kenall coefficient an the PUC). Finally, the number of impute ata sets is set to 10, to keep computations manageable. 3.3 Results an iscussion Four separate preictive moels were built for each of the approaches: 1KE, 2KE, the first version of multiple imputation (1MI) an the secon version of multiple imputation (2MI) 1. To obtain a reliable estimate of preictive accuracy in this small ata set, a stratifie 10 times 10-fol cross-valiation (10x10cv) is performe. Table 1 summarizes the results. 10x10cv for each class Approach C 10x10cv Human Cow Poultry Swine 1KE 2.0 79.3 95.4 64.5 75.2 69.4 2KE 1.6 78.2 92.6 62.8 71.8 74.2 1MI 1.0 79.9 92.7 66.4 69.4 80.2 2MI 1.0 79.0 94.5 57.5 70.8 78.8 Table 1: Mean 10x10cv accuracies for the four approaches to hanle missing values. Also shown are best cost parameter C an etaile class performance. 1 We use the R software [11], extene with the kernlab package (for the SVMs) an the mice package, which implements the FCS metho for multivariate multiple imputation. 401

an Machine Learning. Bruges (Belgium), 24-26 April 2013, i6oc.com publ., ISBN 978-2-87419-081-0. Available from http://www.i6oc.com/en/livre/?gcoi=28001100131010. The table shows that all four approaches have comparable performance, with 2KE seeming inferior; 1KE performs best in ientifying human contamination, the most important single ecision; the two multiple imputations are goo for swine origin. The four approaches have ifficulties in classifying cows, probably because this class is the minority class, representing only 18% of the observations. 4 Conclusions In real problems, accuracy may not tell the whole picture about a moel. Other performance criteria inclue evelopment cost, interpretability, an utility. The cost here refers to how much pre-processing effort an computing time we nee in orer to buil the moel. Unoubtely, the two imputation methos require more time an resources compare to the kernel extensions. Prior to imputation a goo univariate imputation moel must be ientifie for each variable containing missing values. These methos epen also on several non-trivial algorithmic options an have an ae computational cost for training separate SVMs for each of the impute ata sets. Interpretability refers to the complexity of the obtaine moel. It is unclear if a moel that ha values impute (several times) is more interpretable than one that ha not. Finally, the moel must be useful in practice: in a real eployment of the moel, new an unseen observations emerge which we nee to classify, which may contain missing values. The two kernel extensions are the only methos able to face this situation. References [1] J. Shawe-Taylor an N. Cristianini, Kernel Methos for Pattern Analysis. Cambrige Univ. Press, 2004. [2] R. Little an D. Rubin. Statistical Analysis with Missing Data. Wiley-Interscience, 2009. [3] Q. Song an M. Shepper. Missing ata imputation techniques. Int. J. Business Intelligence an Data Mining, 2(3):261-291, 2007. [4] Y. He. Missing ata analysis using multiple imputation. Circulation: Cariovascular Quality an Outcomes, 3(1):98 105, 2010. [5] K. Pelckmans, J. De Brabanter, J. Suykens an B. De Moor. Hanling missing values in support vector machine classifiers. Neural Networks, 18:684-692, 2005. [6] G. Nebot an Ll. Belanche. A kernel extension to hanle missing ata. In Bramer, Ellis an Petriis, eitors, Research an Development in Intelligent Systems XXVI, Springer, 2010. [7] K. Lee an J. Carlin. Multiple imputation for missing ata: Fully conitional specification vs multivariate normal imputation. Amer. Journal of Epiemiology, 171(5):624-632, 2010. [8] S. van Buuren an K. Groothuis-Oushoorn. mice: Multivariate imputation by chaine equations in R. Journal of Statistical Software, 45(3):1 67, 2011. [9] T.M. Scott, J.B. Rose, T.M. Jenkins, S.R. Farrah an J. Lukasik. Microbial source tracking: Current methoology an future irections. Applie an Environmental Microbiology, 68(12):5796 5803, 2002. [10] E. Ballesté, X. Bonjoch, Ll. Belanche an A. R. Blanch. Molecular inicators use in the evelopment of preictive moels for microbial source tracking. Applie an Environmental Microbiology, 76(6):1789 1795, 2010. [11] R Development Core Team. R: A Language an Environment for Statistical Computing. R Founation for Statistical Computing, Vienna, Austria, 2008. 402