Tools for Statistical Analysis with Missing Data: Application to a Large Medical Database
Cristian Preda a, Alain Duhamel a, Monique Picavet a, Tahar Kechadi b

a Faculté de Médecine, France
b Department of Computer Science, University College Dublin

Abstract

Missing data are a common feature of large data sets in general and of medical data sets in particular. Depending on the goal of the statistical analysis, various techniques can be used to tackle this problem. Imputation methods substitute plausible or predicted values for the missing values, so that the completed data can then be analysed with any chosen data mining procedure. In this work, we study imputation in the context of multivariate data and evaluate a number of methods available in today's standard statistical software packages. Imputation using multivariate classification, multiple imputation and imputation by factorial analysis are compared using simulated data and a large medical database (from the diabetes field) with numerous missing values. Our main result is a control chart for assessing data quality after the imputation process. To this end, we developed an algorithm whose input is a set of parameters describing the underlying data (e.g., covariance matrix, distribution) and whose output is a chart plotting the change in the prediction error against the proportion of missing values. The chart is built by means of an iterative algorithm involving four steps: (1) a sample of simulated data is drawn using the input parameters; (2) missing values are generated at random; (3) an imputation method is used to fill in the missing data; and (4) the prediction error is computed. Steps 1 to 4 are repeated in order to estimate the distribution of the prediction error. The control chart was established for the three imputation methods studied here, assuming a multivariate normal distribution of the data.
The use of this tool on a large medical database was then investigated. We show how the control chart can be used to assess the quality of the imputation process in the pre-processing step upstream of data mining procedures.

Keywords: Statistical models; Databases; Data mining; Missing values; Imputation.

1. Introduction

Dealing with missing data is a major problem in Knowledge Discovery in Databases (KDD). This type of operation must be performed with caution in order to avoid degrading the performance of data mining procedures. The area has attracted much research interest over recent years, and the mainstream statistical analysis software packages are starting to offer solutions (Celeux [1], Hox [2]). There are three main strategies for dealing with missing data in the KDD process. The first consists in eliminating incomplete
observations, and it has two major limitations. Firstly, the resulting information loss can be considerable if many of the variables have missing values for various individuals. Secondly, this method runs the risk of introducing bias if the process behind the missing values is not completely random (Missing Completely At Random, MCAR [3]), i.e. if the subset analysed is not representative of the sample as a whole. The second strategy consists in using a method specifically adapted to the data mining algorithm employed (for example, CART [4]). These methods implicitly presuppose an MCAR mechanism, and most have not been critically appraised. The third strategy is imputation: the missing data problem is tackled in the pre-processing step of the KDD process by replacing each missing value with a predicted value. This method is particularly well suited to the KDD process, since the completed database can then be analysed with any chosen data mining procedure. The goal of the present work is to compare different imputation methods according to their predictive power, as a function of the proportion of missing values in the database analysed. We consider here the case of quantitative variables and MAR-type missing data (Missing At Random [3]). The process is said to be MAR if, conditional on the observed data for certain variables X, the occurrence of a missing value for Y is random. There are numerous imputation methods: single imputation using the mean, median or mode (Schafer [5]), regression-based methods (Horton [6]) and more complex methods such as those based on classification procedures (Benali [7]), the NIPALS algorithm (Tenenhaus [8]), multiple imputation (Rubin [9], [10], Allison [11], Donzé [12]) or association rules (Ragel [13]). Here, we studied three methods: multiple imputation, imputation via the NIPALS algorithm and imputation based on classification.
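The bias risk of the first strategy can be made concrete with a small simulation. In the sketch below (an illustrative Python example; the MAR mechanism, missingness probabilities and variable names are all assumptions, not taken from the paper), x is deleted more often when the observed y is large, so missingness is MAR but not MCAR; eliminating the incomplete observations then underestimates the variance of y, because the retained subset is not representative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
y = rng.normal(size=n)                       # fully observed variable
x = 0.8 * y + rng.normal(scale=0.6, size=n)  # correlated covariate
# MAR mechanism: x is missing with probability 0.5 when y > 1, else 0.1,
# so missingness depends only on the observed y.
p_miss = np.where(y > 1, 0.5, 0.1)
x_obs = np.where(rng.random(n) < p_miss, np.nan, x)
complete = ~np.isnan(x_obs)                  # rows kept by strategy 1
var_available = y.var()                      # variance from all y values
var_complete = y[complete].var()             # complete cases only: biased low
```

Because the kept rows underrepresent large values of y, var_complete falls noticeably below var_available, which is exactly the non-representativeness described above.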
These methods are suitable in cases where the missing values are present across several of the database's variables, and all can be implemented easily with standard statistical software. Following a brief introduction to each method in section 2, section 3 introduces an indicator based on the mean square imputation error, which serves as a criterion for comparison. The methods are compared using simulated data and the large DiabCare medical database, generated under the auspices of the WHO for improving care provision to diabetic patients (section 4).

2. On multiple imputation, NIPALS and classification

2.1 Multiple imputation

One of the great disadvantages of single imputation (i.e. imputation of a single value) is that one is not aware of the uncertainty in predicting the unknown missing value. This can lead to significant bias - for example, systematic underestimation of the variance of the "imputed" variable. Multiple imputation enables this uncertainty to be taken into account when predicting the missing values. The basic idea, developed by Rubin [9], is as follows: (a) impute the missing values using a suitable model which incorporates random variation; (b) repeat this operation m times (3 to 5 times, in general) in order to obtain m complete data files. Statistical analyses are then carried out on each completed file, and the results are combined in order to obtain the final model. In our simulation study (section 3), a missing value was predicted by the mean of the m predicted values generated in step (b). Different models can be used for imputation of the missing values, such as the MCMC (Markov Chain Monte Carlo) model and models based on the EM (Expectation Maximization) algorithm. Version 8.2 of SAS includes two new procedures which enable multiple imputation (the MI and MIANALYZE procedures [14]). The MVA (Missing Value Analysis) module in SPSS only performs single imputation (m=1).
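Steps (a) and (b) can be sketched as follows. This is a minimal Python illustration, not the SAS MI procedure: the regression-with-noise imputation model, m = 5 and all variable names are assumptions made for the example. A variable y is imputed from a covariate x by linear regression plus a random residual draw, the draw is repeated m times, and the m predictions are averaged as in our simulation study:

```python
import numpy as np

def multiple_imputation(x, y, m=5, rng=None):
    """Impute missing values of y from x by regression with random noise,
    repeated m times; return the m completed copies of y."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    # Fit a simple linear regression y ~ x on the observed cases.
    b, a = np.polyfit(x[obs], y[obs], 1)          # slope, intercept
    resid_sd = np.std(y[obs] - (a + b * x[obs]))
    completed = []
    for _ in range(m):
        y_imp = y.copy()
        miss = np.isnan(y_imp)
        # Add random variation so the imputation uncertainty is preserved.
        y_imp[miss] = a + b * x[miss] + rng.normal(0, resid_sd, miss.sum())
        completed.append(y_imp)
    return completed

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
y[:40] = np.nan                                   # 20% of y deleted
copies = multiple_imputation(x, y, m=5, rng=1)
# As in section 3, a point prediction is the mean of the m draws.
point = np.mean([c[:40] for c in copies], axis=0)
```

Each of the m completed files could then be analysed separately and the results pooled; averaging the draws, as here, gives only the point prediction used in our comparison.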
2.2 The NIPALS algorithm

The aim of the NIPALS (nonlinear iterative partial least squares) algorithm is to perform principal component analysis in the presence of missing data (Tenenhaus [8]). Given a rectangular data table of size n × p, let us denote by X = {x_ij}, 1 <= i <= n, 1 <= j <= p, the matrix representing the observed values of the variables x.j for n statistical units. If X is of rank a, then the decomposition formula for the principal component analysis of X is

X = Σ_{h=1..a} t_h p_h',

where t_h = (t_h1, ..., t_hi, ..., t_hn) and p_h = (p_h1, ..., p_hj, ..., p_hp) are the principal components and principal factors, respectively. The NIPALS algorithm therefore estimates a missing value in cell (i, j) as

x̂_ij = Σ_{l=1..k} t_li p_lj,

where k (k <= a) is determined by cross-validation. Implementation of the NIPALS algorithm is very simple, since it is based only on simple linear regressions. The complexity of the algorithm is of order O(a·n·p·C), where C is the number of iterations required for convergence. The NIPALS algorithm is implemented in the SIMCA-P software (release 10) but not yet in SAS. We programmed a C application which implements NIPALS and used it to compare the method with the other approaches.

2.3 Imputation by classification

The principle is to classify the data as a whole using the k-means clustering method, taking the missing values into account in the distance calculations via an appropriate metric (the FASTCLUS procedure in SAS). Each individual is assigned to a unique cluster, and a missing value for the variable X is then replaced by the mean of X calculated over all the individuals in the cluster.

3. Comparison of imputation methods

Imputation quality depends on a range of parameters, the most important of which are i) the number of missing data, ii) the distribution of the random vector which describes the data table and iii) the distribution of the missing values.
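To make the rank-k reconstruction of section 2.2 concrete, here is a sketch of one-component-at-a-time NIPALS imputation in Python. Our own implementation was in C; this NumPy version, with its fixed iteration count, column-mean centring and deflation scheme, is an illustrative assumption rather than the original code. Each loading and each score is obtained by a simple linear regression restricted to the observed cells:

```python
import numpy as np

def nipals_impute(X, k=1, n_iter=100):
    """Impute missing cells of X by a rank-k NIPALS reconstruction:
    x_hat[i, j] = sum_l t[l, i] * p[l, j], fitted on observed cells only."""
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)                        # observed cells
    mu = np.nanmean(X, axis=0)
    R = np.where(mask, X - mu, 0.0)            # centred residuals, 0 at holes
    X_hat = np.tile(mu, (X.shape[0], 1))
    for _ in range(k):
        t = R[:, 0].copy()                     # start from the first column
        for _ in range(n_iter):
            # Loadings: slope of each column on t, observed cells only.
            p = (R * t[:, None]).sum(axis=0) / (mask * t[:, None] ** 2).sum(axis=0)
            p /= np.linalg.norm(p)
            # Scores: slope of each row on p, observed cells only.
            t = (R * p).sum(axis=1) / (mask * p ** 2).sum(axis=1)
        X_hat += np.outer(t, p)                # add the h-th component
        R = np.where(mask, R - np.outer(t, p), 0.0)
    return np.where(mask, X, X_hat)

# Rank-1 toy table: row i is a_i * (1, 2, 3); delete one cell of known value.
X = np.outer(np.arange(1.0, 11.0), np.array([1.0, 2.0, 3.0]))
X[0, 0] = np.nan                               # true value was 1.0
X_imp = nipals_impute(X, k=1)
```

Because the toy table is rank 1, a single component recovers the deleted cell to within the distortion introduced by the observed-cell column means.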
Let us suppose that the data are normally distributed with zero mean and covariance matrix S, and that the missing values are uniformly distributed. In order to assess and compare the imputation methods as a function of the proportion of missing data, we suggest a method comprising the following steps:

Step 1. Using simulations, one generates a table T of n lines (individuals) and p columns (variables) representing n realizations of the random vector X ~ N(0, S) (N designates a normal distribution). We chose n=100 so as to obtain a sufficiently large sample.

Step 2. One generates a fixed percentage p_m of missing values distributed uniformly within table T.

Step 3. The missing values are imputed using the chosen technique (here, one of the three above-mentioned methods).

Step 4. In order to measure the precision of the imputation, one calculates the mean square error (MSE) defined by:

MSE = (1 / (n·p·p_m)) · Σ_{j=1..p} [ Σ_{i=1..n} (x_ij - x̂_ij)² / Σ_{i=1..n} (x_ij - x̄_j)² ]
where n designates the number of individuals, p the number of variables and p_m the proportion of missing values. x̂_ij is the imputation of x_ij if x_ij is missing, and x̂_ij = x_ij otherwise. x̄_j = (1/n) Σ_{i=1..n} x_ij is the mean of the variable X_j (prior to random selection of missing values). In the MSE expression and for a given variable, the term between square brackets represents the ratio between the sum of squares of the imputation errors and the variable's variance (in order to take account of the measurement scale of each variable). We then obtain the mean square error by dividing by the number of missing data (n·p·p_m). In order to study the behaviour of the MSE as a function of the percentage of missing data, steps 1 to 4 are repeated K times for each percentage p_m from 1% to 15% in 1% steps. As with bootstrap methods, we set K to 1000. For each p_m, we thus obtain a series of 1000 MSE observations, for which we calculate the quartiles and the mean. Figure 1 presents the results obtained for imputation using the NIPALS algorithm (the two other methods gave similar results). The algorithm can be applied to any covariance matrix whatsoever. By way of an example, we chose the matrix S calculated from Fisher's Iris data [15]. This file (comprising 150 individuals and 4 numerical variables) is frequently used by statisticians to assess statistical methods. In order to assess the present method's robustness, we also applied the algorithm to Fisher's data: only step 1 is modified, and the corresponding table T thus refers to real data.

Figure 1: the NIPALS method. Change in MSE as a function of p_m (the proportion of missing values). Q1 and Q3 represent the first and third quartiles, respectively. The dotted curve represents the median MSE for the real data from the Iris file.
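The chart-building loop (steps 1 to 4, repeated K times) can be sketched as follows. In this Python illustration, mean imputation stands in for the chosen imputation method, the reduced K and the identity covariance matrix in the usage line are assumptions made for speed, and the MSE follows the formula above with the realized hole count in place of its expectation n·p·p_m:

```python
import numpy as np

def mse_chart(S, n=100, p_m=0.05, K=100, rng=0):
    """Steps 1-4, repeated K times: simulate X ~ N(0, S), delete a fraction
    p_m of cells uniformly at random, impute, and compute the normalised MSE.
    Returns the quartiles (Q1, median, Q3) of the K MSE observations."""
    rng = np.random.default_rng(rng)
    p = S.shape[0]
    errors = []
    for _ in range(K):
        X = rng.multivariate_normal(np.zeros(p), S, size=n)  # step 1
        holes = rng.random(X.shape) < p_m                    # step 2
        Xm = np.where(holes, np.nan, X)
        X_hat = np.where(holes, np.nanmean(Xm, axis=0), X)   # step 3
        # Step 4: per variable, squared imputation error over total sum of
        # squares; then average over the realized number of missing cells.
        num = ((X - X_hat) ** 2).sum(axis=0)
        den = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
        errors.append((num / den).sum() / holes.sum())
    return np.percentile(errors, [25, 50, 75])

q1, med, q3 = mse_chart(np.eye(4), n=100, p_m=0.05, K=100)
```

Running the loop over p_m = 1%, ..., 15% and plotting the three quartiles against p_m reproduces the shape of a chart such as Figure 1 for the stand-in method.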
One can observe that, as expected, the imputation error increases as the proportion of missing data increases. Even though the MSE is calculated under a hypothesis of multinormality, the proximity of the two median curves (simulated data and Iris data) indicates a certain robustness of this indicator (the real Iris data do not follow a multinormal distribution). One can then use the MSE to predict the order of magnitude of the imputation error simply from the covariance matrix of the observed data. Let us again take Fisher's Iris data as an example (150 subjects and 4 parameters, i.e. 600 data items, with S the covariance matrix). Let us then suppose that for each of the 4 parameters, 12% of the data is missing: we must therefore impute 72 values. If we choose to impute with the NIPALS method, we use Figure 1, where the median MSE is 0.20%. If the variables were standardized (or if their variances were equal), the imputation would then introduce an error estimated at 4.4% of the total variance. For each imputation method, steps 1 to 4 can easily be programmed so as to dynamically obtain a graph such as that shown in Figure 1.
The algorithm's input parameters are the matrix S (which can be estimated from the available data) and the size n of the multinormally distributed sample to be generated.

4. Comparison of imputation methods using a large medical database

The imputation methods studied here were applied to the DiabCare database (40000 individuals, 250 variables) set up under the auspices of the WHO (the EuroDiabCare program) in order to assess the quality of care in diabetes. Our work follows on from the DATADIAB research program supported by the French Ministry of Research (ACI 2000) [16]. Here, we focussed on French type II diabetics (249 individuals). The database suffers from missing values for numerous variables. We present the results concerning variables considered to be important for the follow-up of diabetic patients. The variables and the corresponding proportions of missing values are as follows: age (%), body mass index (5%), blood cholesterol level (9%), blood creatinine level (%), time since diabetes onset (5%), glycated haemoglobin (7%), height (4%), blood triglyceride level (9%), weight (2%), diastolic blood pressure (5%) and systolic blood pressure (4%). In all, there are 3352 missing values, i.e. 5.7% of the data. The real values of the missing data are not available. We used the method described in section 3 for a priori estimation of the imputation error, supposing that the data follow a multinormal distribution. The table below gives the median MSEs; the covariance matrix S was estimated from the observed data.

Imputation method                           NIPALS   Multiple imputation   Classification
Median MSE (%)                              -        -                     -
Estimated total error (% of the variance)   -        -                     -

One can note that the NIPALS and multiple imputation methods give similar results, whereas imputation by classification seems less precise (as shown by a PCA of the imputation data, in which the statistical units are the missing values and the variables are the imputation methods).
We also compared the means, medians and variances calculated first for the available data (one simply eliminates the variable's missing values), then for the complete cases, and finally after imputation by the three methods. The results can be summarised as follows: for the mean, all the estimations are similar. In contrast, for the variance, calculations on complete cases led to systematic underestimation with respect to available cases, as expected. Imputation with the three methods produces variances close to those calculated using available cases, except for the creatinine variable. The latter included the highest proportion of missing data (%); the NIPALS method strongly overestimated its variance, whereas classification underestimated it.

5. Discussion

We studied three imputation methods: multiple imputation (via the SAS procedure MI), imputation by classification (via the SAS procedure FASTCLUS) and imputation with the NIPALS algorithm. These methods have the advantage of being well suited to MAR cases and of being practicable with mainstream software. When the methods were compared using both simulated and real data sets, none appeared to differ significantly from the others in terms of the quality of the results. The strong points of the MI procedure are that it is quick and easy to use and does not artificially decrease the data variance. Imputation via the FASTCLUS procedure is based on a simple idea, but one is obliged to choose a number of classes in order to optimize the estimations - and the cost in calculation time can be high for large databases. As for the NIPALS algorithm, it is easy to implement in standard
programming languages (C, for example). Since this method is based on data reconstitution using PCA, NIPALS imputation takes the data's multivariate nature into account. It can be criticized for being poorly known to end users - except perhaps for fans of the PLS approach. Unsurprisingly, one observes a drop in performance for all methods when the proportion of missing values is high. We have thus developed a method which enables assessment of the imputation error as a function of the percentage of missing data. What we have, in fact, is a "control chart" for the a priori estimation of the quality of imputation of missing values, for a given method and a given covariance matrix S. This "control chart" appears to us to be a highly valuable tool for use prior to the statistical analysis of databases which may be completed with imprecise values. We intend to continue this research by broadening the range of techniques used. An initial approach will consist in developing missing data processing techniques based on nonlinear tools, such as the kernel methods widely used in statistical learning, and neural networks. We have already developed a methodology based on a recurrent, multicontext neural network [17]. This has been validated in different fields (notably for monitoring energy saving), and we believe that such an approach is very well suited to missing data processing. Of course, imputation is only useful if analysis of the completed database with statistical methods or data mining procedures gives reliable results. This is why the methods' respective performances will be judged according to the results obtained with the completed database for the prediction of the macro- and microvascular complications of diabetes. We shall consider two types of validation criteria: mathematical criteria and the judgement of medical experts.

6. References

[1] Celeux G. Le traitement des données manquantes dans le logiciel SICLA. Rapports Techniques, INRIA, France.
[2] Hox J. A review of current software for handling missing data. Kwantitatieve Methoden 1999; 62.
[3] Little RJA, Rubin DB. Statistical analysis with missing data. New York: Wiley, 1987.
[4] Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. Belmont: Wadsworth, 1984.
[5] Schafer JL. Imputation procedures for missing data. University of Pennsylvania, USA, 1999.
[6] Horton NJ, Lipsitz SR. Multiple imputation in practice: comparison of software packages for regression models with missing variables. Statistical Computing Software Reviews. The American Statistician 2001; 55(3).
[7] Benali H, Escofier B. Nouvelle étape de traitement des données manquantes en analyse factorielle des correspondances multiples dans le système portable d'analyse de données. Rapports Techniques, INRIA, France.
[8] Tenenhaus M. La régression PLS: Théorie et pratique. Editions Technip, 1998.
[9] Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley, 1987.
[10] Rubin DB. Multiple imputation after 18+ years. Journal of the American Statistical Association 1996; 91.
[11] Allison PD. Multiple imputation for missing data: a cautionary tale. Sociological Methods and Research 2000; 28. University of Pennsylvania, USA.
[12] Donzé L. Imputation multiple et modélisation: quelques expériences tirées de l'enquête 1999 KOF/ETHZ sur l'innovation. Ecole polytechnique fédérale de Zurich.
[13] Ragel A. MVC - a preprocessing method to deal with missing values. Knowledge-Based Systems 1999; 12.
[14] SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, USA.
[15] Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics 1936; 7.
[16] Duhamel A, Nuttens MC, Devos P, Picavet M, Beuscart R. A preprocessing method for improving data mining techniques.
Application to a large medical diabetes database. Studies in Health Technology and Informatics 2003, IOS Press.
[17] Huang BQ, Rashid T, Kechadi T. A new modified network based on the Elman network. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications 2004, Innsbruck, Austria.

Address for correspondence

Cristian Preda, CERIM, Faculté de médecine, Place de Verdun, F Lille cedex, France, cpreda@univ-lille2.fr
More informationRecitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002
Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Introduction Neural networks are flexible nonlinear models that can be used for regression and classification
More informationUSING REGRESSION TREES IN PREDICTIVE MODELLING
Production Systems and Information Engineering Volume 4 (2006), pp. 115-124 115 USING REGRESSION TREES IN PREDICTIVE MODELLING TAMÁS FEHÉR University of Miskolc, Hungary Department of Information Engineering
More informationStatistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte
Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,
More informationThe Piecewise Regression Model as a Response Modeling Tool
NESUG 7 The Piecewise Regression Model as a Response Modeling Tool Eugene Brusilovskiy University of Pennsylvania Philadelphia, PA Abstract The general problem in response modeling is to identify a response
More informationFaculty of Sciences. Holger Cevallos Valdiviezo
Faculty of Sciences Handling of missing data in the predictor variables when using Tree-based techniques for training and generating predictions Holger Cevallos Valdiviezo Master dissertation submitted
More informationMultiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health
Multiple Imputation for Missing Data Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Outline Missing data mechanisms What is Multiple Imputation? Software Options
More informationStatistical Matching using Fractional Imputation
Statistical Matching using Fractional Imputation Jae-Kwang Kim 1 Iowa State University 1 Joint work with Emily Berg and Taesung Park 1 Introduction 2 Classical Approaches 3 Proposed method 4 Application:
More informationComparison of Variational Bayes and Gibbs Sampling in Reconstruction of Missing Values with Probabilistic Principal Component Analysis
Comparison of Variational Bayes and Gibbs Sampling in Reconstruction of Missing Values with Probabilistic Principal Component Analysis Luis Gabriel De Alba Rivera Aalto University School of Science and
More informationJMP Book Descriptions
JMP Book Descriptions The collection of JMP documentation is available in the JMP Help > Books menu. This document describes each title to help you decide which book to explore. Each book title is linked
More informationOn the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme
arxiv:1811.06857v1 [math.st] 16 Nov 2018 On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme Mahdi Teimouri Email: teimouri@aut.ac.ir
More informationMissing Data Analysis for the Employee Dataset
Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1
More informationPaper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by
Paper CC-016 A macro for nearest neighbor Lung-Chang Chien, University of North Carolina at Chapel Hill, Chapel Hill, NC Mark Weaver, Family Health International, Research Triangle Park, NC ABSTRACT SAS
More informationR package plsdepot Principal Components with NIPALS
R package plsdepot Principal Components with NIPALS Gaston Sanchez www.gastonsanchez.com/plsdepot 1 Introduction NIPALS is the acronym for Nonlinear Iterative Partial Least Squares and it is the PLS technique
More informationChapter 1. Using the Cluster Analysis. Background Information
Chapter 1 Using the Cluster Analysis Background Information Cluster analysis is the name of a multivariate technique used to identify similar characteristics in a group of observations. In cluster analysis,
More informationSTATISTICS (STAT) Statistics (STAT) 1
Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).
More informationVisual object classification by sparse convolutional neural networks
Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.
More informationComparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data
Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data Donsig Jang, Amang Sukasih, Xiaojing Lin Mathematica Policy Research, Inc. Thomas V. Williams TRICARE Management
More informationDynamic Thresholding for Image Analysis
Dynamic Thresholding for Image Analysis Statistical Consulting Report for Edward Chan Clean Energy Research Center University of British Columbia by Libo Lu Department of Statistics University of British
More informationWe deliver Global Engineering Solutions. Efficiently. This page contains no technical data Subject to the EAR or the ITAR
Numerical Computation, Statistical analysis and Visualization Using MATLAB and Tools Authors: Jamuna Konda, Jyothi Bonthu, Harpitha Joginipally Infotech Enterprises Ltd, Hyderabad, India August 8, 2013
More informationProcessing Missing Values with Self-Organized Maps
Processing Missing Values with Self-Organized Maps David Sommer, Tobias Grimm, Martin Golz University of Applied Sciences Schmalkalden Department of Computer Science D-98574 Schmalkalden, Germany Phone:
More informationThe Curse of Dimensionality
The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more
More informationSELECTION OF A MULTIVARIATE CALIBRATION METHOD
SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper
More informationPerformance of Sequential Imputation Method in Multilevel Applications
Section on Survey Research Methods JSM 9 Performance of Sequential Imputation Method in Multilevel Applications Enxu Zhao, Recai M. Yucel New York State Department of Health, 8 N. Pearl St., Albany, NY
More informationResearch on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a
International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,
More informationSENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE NORMALITY
Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 6 SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE
More informationMissing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA
Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA ABSTRACT Statistical analyses can be greatly hampered by missing
More informationOpen Access Research on the Prediction Model of Material Cost Based on Data Mining
Send Orders for Reprints to reprints@benthamscience.ae 1062 The Open Mechanical Engineering Journal, 2015, 9, 1062-1066 Open Access Research on the Prediction Model of Material Cost Based on Data Mining
More informationMachine Learning and Data Mining. Clustering (1): Basics. Kalev Kask
Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of
More informationLouis Fourrier Fabien Gaie Thomas Rolf
CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted
More informationAssignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions
ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION Spring 2002 Assignment 2 Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions Aravind Sundaresan
More informationEvaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München
Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics
More informationStatistical matching: conditional. independence assumption and auxiliary information
Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional
More informationApplication of Clustering as a Data Mining Tool in Bp systolic diastolic
Application of Clustering as a Data Mining Tool in Bp systolic diastolic Assist. Proffer Dr. Zeki S. Tywofik Department of Computer, Dijlah University College (DUC),Baghdad, Iraq. Assist. Lecture. Ali
More informationHybrid Feature Selection for Modeling Intrusion Detection Systems
Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,
More informationSupplementary Notes on Multiple Imputation. Stephen du Toit and Gerhard Mels Scientific Software International
Supplementary Notes on Multiple Imputation. Stephen du Toit and Gerhard Mels Scientific Software International Part A: Comparison with FIML in the case of normal data. Stephen du Toit Multivariate data
More informationGraphical Analysis of Data using Microsoft Excel [2016 Version]
Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.
More informationFitting Fragility Functions to Structural Analysis Data Using Maximum Likelihood Estimation
Fitting Fragility Functions to Structural Analysis Data Using Maximum Likelihood Estimation 1. Introduction This appendix describes a statistical procedure for fitting fragility functions to structural
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationD-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview
Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,
More informationAn Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm
Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy
More informationBIOINF 585: Machine Learning for Systems Biology & Clinical Informatics
BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics Lecture 12: Ensemble Learning I Jie Wang Department of Computational Medicine & Bioinformatics University of Michigan 1 Outline Bias
More informationLatent variable transformation using monotonic B-splines in PLS Path Modeling
Latent variable transformation using monotonic B-splines in PLS Path Modeling E. Jakobowicz CEDRIC, Conservatoire National des Arts et Métiers, 9 rue Saint Martin, 754 Paris Cedex 3, France EDF R&D, avenue
More informationTHE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann
Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG
More informationABSTRACT 1. INTRODUCTION 2. METHODS
Finding Seeds for Segmentation Using Statistical Fusion Fangxu Xing *a, Andrew J. Asman b, Jerry L. Prince a,c, Bennett A. Landman b,c,d a Department of Electrical and Computer Engineering, Johns Hopkins
More informationCluster Tendency Assessment for Fuzzy Clustering of Incomplete Data
EUSFLAT-LFA 2011 July 2011 Aix-les-Bains, France Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data Ludmila Himmelspach 1 Daniel Hommers 1 Stefan Conrad 1 1 Institute of Computer Science,
More informationCART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology
CART Classification and Regression Trees Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology CART CART stands for Classification And Regression Trees.
More informationAnnotated multitree output
Annotated multitree output A simplified version of the two high-threshold (2HT) model, applied to two experimental conditions, is used as an example to illustrate the output provided by multitree (version
More informationA Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis
Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 1 (2016), pp. 1131-1140 Research India Publications http://www.ripublication.com A Monotonic Sequence and Subsequence Approach
More informationPerformance Evaluation of Various Classification Algorithms
Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***----------------------------------------------------------
More informationMissing Data. SPIDA 2012 Part 6 Mixed Models with R:
The best solution to the missing data problem is not to have any. Stef van Buuren, developer of mice SPIDA 2012 Part 6 Mixed Models with R: Missing Data Georges Monette 1 May 2012 Email: georges@yorku.ca
More informationHigh dimensional data analysis
High dimensional data analysis Cavan Reilly October 24, 2018 Table of contents Data mining Random forests Missing data Logic regression Multivariate adaptive regression splines Data mining Data mining
More informationAnalysis of Imputation Methods for Missing Data. in AR(1) Longitudinal Dataset
Int. Journal of Math. Analysis, Vol. 5, 2011, no. 45, 2217-2227 Analysis of Imputation Methods for Missing Data in AR(1) Longitudinal Dataset Michikazu Nakai Innovation Center for Medical Redox Navigation,
More information