Tools for Statistical Analysis with Missing Data: Application to a Large Medical Database


Cristian Preda a, Alain Duhamel a, Monique Picavet a, Tahar Kechadi b
a Faculté de Médecine, France
b Department of Computer Science, University College Dublin

Abstract

Missing data is a common feature of large data sets in general and medical data sets in particular. Depending on the goal of statistical analysis, various techniques can be used to tackle this problem. Imputation methods consist in substituting the missing values with plausible or predicted values, so that the completed data can then be analysed with any chosen data mining procedure. In this work, we study imputation in the context of multivariate data and we evaluate a number of methods available in today's standard statistical software packages. Imputation using multivariate classification, multiple imputation and imputation by factorial analysis are compared using simulated data and a large medical database (from the diabetes field) with numerous missing values. Our main result is a control chart for assessing data quality after the imputation process. To this end, we developed an algorithm whose input is a set of parameters describing the underlying data (e.g., covariance matrix, distribution) and whose output is a chart plotting the change in the prediction error as a function of the proportion of missing values. The chart is built by means of an iterative algorithm involving four steps: (1) a sample of simulated data is drawn by using the input parameters; (2) missing values are randomly generated; (3) an imputation method is used to fill in the missing data; and (4) the prediction error is computed. Steps 1 to 4 are repeated in order to estimate the distribution of the prediction error. The control chart was established for the three imputation methods studied here, assuming a multivariate normal distribution of the data.
The use of this tool on a large medical database was then investigated. We show how the control chart can be used to assess the quality of the imputation process in the pre-processing step upstream of data mining procedures.

Keywords: Statistical models; Databases; Data mining; Missing values; Imputation

1. Introduction

Dealing with missing data is a major problem in Knowledge Discovery in Databases (KDD). This type of operation must be performed with caution in order to avoid deterioration in the performance of data mining procedures. The area has attracted much research interest over recent years and the mainstream statistical analysis software packages are starting to offer solutions (Celeux [1], Hox [2]). Dealing with missing data in the KDD process comprises three main strategies. The first consists in eliminating incomplete

observations and has two major limitations. Firstly, the resulting information loss can be considerable if many of the variables have missing values for various individuals. Secondly, this method runs the risk of introducing bias if the process behind the missing values is not completely random (Missing Completely At Random, MCAR [3]), i.e. if the subset analysed is not representative of the sample as a whole. The second strategy consists in using a method specifically adapted to the data mining algorithm employed (for example, CART [4]). These methods implicitly presuppose an MCAR mechanism and most have not been critically appraised. The third strategy is imputation: the missing data problem is tackled in the pre-processing step of the KDD process by replacing each missing value with a predicted value. This method is particularly well suited to the KDD process, since the completed database can then be analysed with any chosen data mining procedure. The goal of the present work is to compare different imputation methods according to their predictive power and as a function of the proportion of missing values in the database analysed. We consider here the case of quantitative variables and MAR-type missing data (Missing At Random [3]). The process is said to be MAR if, conditionally on the observed data for certain variables X, the occurrence of a missing value for Y is random. There are numerous imputation methods: single imputation using the mean, median or mode (Schafer [5]), regression-based methods (Horton [6]) and more complex methods such as those based on classification procedures (Benali [7]), the NIPALS algorithm (Tenenhaus [8]), multiple imputation (Rubin [9], [10], Allison [11], Donzé [12]) or association rules (Ragel [13]). Here, we studied three methods: multiple imputation, imputation via the NIPALS algorithm and imputation based on classification.
These methods are suitable in cases where the missing values are present across several of the database's variables, and all can be implemented easily with standard statistical software. Following a brief introduction to each method in section 2, section 3 introduces an indicator based on the mean square imputation error, which serves as a criterion for comparison. The methods are compared using simulated data and the large DiabCare medical database, generated under the auspices of the WHO for improving care provision to diabetic patients (section 4).

2. On multiple imputation, NIPALS and classification

2.1 Multiple imputation

One of the great disadvantages of single imputation (i.e. of a single value) is that one is not aware of the uncertainty in predicting the unknown missing value. This can lead to significant bias - for example, systematic underestimation of the variance of the "imputed" variable. Multiple imputation enables this uncertainty to be taken into account when predicting the missing values. The basic idea, developed by Rubin [9], is as follows: (a) impute the missing values by using a suitable model which incorporates the random variation; (b) repeat this operation m times (3 to 5 times, in general) in order to obtain m complete data files. Statistical analyses are then carried out for each completed file and the results are combined in order to obtain the final model. In our simulation study (section 3), a missing value was predicted by the mean of the m predicted values generated in step (b). Different models can be used for imputation of the missing values, such as the MCMC (Markov Chain Monte Carlo) model and one based on the EM (Expectation Maximization) algorithm. Version 8.2 of SAS includes two new procedures which enable multiple imputation (the MI and MIANALYZE procedures [14]). The MVA (Missing Value Analysis) module in SPSS only performs single imputation (m=1).
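As a minimal illustration of the idea (a pure-Python sketch, not the SAS MI or SPSS MVA procedures; the univariate normal predictive model and the helper name `multiple_impute` are our own assumptions), the m imputations of a missing value can each be drawn with random variation and then averaged, as done in the simulation study of section 3:

```python
import random
import statistics

def multiple_impute(values, m=5, seed=0):
    """Average of m single imputations for one variable (None = missing).

    Each of the m imputations draws from a normal model fitted to the
    observed values, so the variation between draws reflects the
    uncertainty of the prediction; the draws are then averaged.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    completed = []
    for v in values:
        if v is None:
            draws = [rng.gauss(mu, sigma) for _ in range(m)]
            completed.append(statistics.mean(draws))
        else:
            completed.append(v)
    return completed

print(multiple_impute([4.1, None, 3.8, 5.0, None, 4.4]))
```

In a full multiple-imputation analysis one would instead keep the m completed files separate, analyse each, and combine the results; averaging the draws is the simplification used in section 3.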

2.2 The NIPALS algorithm

The aim of the NIPALS (nonlinear iterative partial least squares) algorithm is to perform principal component analysis in the presence of missing data (Tenenhaus [8]). Given a rectangular data table of size n × p, let us denote by X = {x_ij}, 1 ≤ i ≤ n, 1 ≤ j ≤ p, the matrix representing the observed values of the variables x_j for the n statistical units. If X is of rank a, then the decomposition formula for principal component analysis of X is

X = Σ_{h=1..a} t_h p_h',

where t_h = (t_h1, ..., t_hi, ..., t_hn) and p_h = (p_h1, ..., p_hj, ..., p_hp) are the principal components and principal factors, respectively. The NIPALS algorithm therefore estimates a missing value in cell (i, j) as

x̂_ij = Σ_{l=1..k} t_li p_lj,

where k (k ≤ a) is determined by cross-validation. Implementation of the NIPALS algorithm is very simple, since it is based only on simple linear regressions. The complexity of this algorithm is of order O(a n p C), where C is the number of iterations required for convergence. The NIPALS algorithm is implemented in the SIMCA-P software (release 10) but not yet in SAS. We programmed a C application which implements NIPALS and used it to compare the method with the other approaches.

2.3 Imputation by classification

The principle is to perform a classification of the data as a whole using the k-means clustering method, whilst taking the missing values into account in the calculation of the distances via an appropriate metric (the FASTCLUS procedure in SAS). Each individual is assigned to a unique cluster and the missing value for the variable X is then replaced by the mean of X calculated from all the individuals in the cluster.

3. Comparison of imputation methods

Imputation quality depends on a range of different parameters, the most important of which are i) the amount of missing data, ii) the distribution of the random vector which describes the data table and iii) the distribution of the missing values.
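The classification-based imputation of section 2.3 can be sketched in pure Python (an illustrative fragment, not the SAS FASTCLUS procedure; the partial-distance metric and the function names are our own assumptions): the incomplete rows are clustered by k-means under a distance restricted to jointly observed coordinates, and each missing value is then replaced by the corresponding cluster mean.

```python
import random

def partial_dist(a, b):
    """Mean squared difference over the coordinates observed in both
    points: a metric that takes the missing values (None) into account."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return float("inf")
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)

def kmeans_impute(rows, k=2, iters=20, seed=1):
    """k-means on incomplete rows, then cluster-mean imputation."""
    rng = random.Random(seed)
    p = len(rows[0])
    # overall column means, used only to complete the initial centres
    col_means = []
    for j in range(p):
        obs = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(obs) / len(obs))
    centres = [[v if v is not None else col_means[j] for j, v in enumerate(r)]
               for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # assignment step, using the missing-value-aware metric
        labels = [min(range(k), key=lambda c: partial_dist(r, centres[c]))
                  for r in rows]
        # update step: centre coordinates = means of observed values
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            for j in range(p):
                obs = [r[j] for r in members if r[j] is not None]
                if obs:
                    centres[c][j] = sum(obs) / len(obs)
    # each missing value is replaced by the mean of its cluster
    return [[centres[l][j] if r[j] is None else r[j] for j in range(p)]
            for r, l in zip(rows, labels)]

rows = [[1.0, 1.0], [1.2, None], [9.0, 9.0], [None, 8.8]]
print(kmeans_impute(rows, k=2))
```

As in FASTCLUS-based imputation, the quality of the result depends on the chosen number of clusters k.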
Let us suppose that the data are normally distributed with zero mean and covariance matrix S, and that the missing values are uniformly distributed. In order to assess and compare the imputation methods as a function of the proportion of missing data, we suggest a method comprising the following steps:

Step 1. Using simulations, one generates a table T of n lines (individuals) and p columns (variables) representing n realizations of the random vector X ~ N(0, S) (N designates a normal distribution). We chose n=100 so as to obtain a sufficiently large sample.
Step 2. One generates a fixed percentage p_m of missing values distributed uniformly within table T.
Step 3. The missing values are imputed by using the chosen technique (here, one of the three above-mentioned methods).
Step 4. In order to measure the precision of the imputation, one calculates the mean square error (MSE) defined by:

MSE = 1/(n p p_m) Σ_{j=1..p} [ Σ_{i=1..n} (x_ij − x̂_ij)² / Σ_{i=1..n} (x_ij − x̄_j)² ]

where n designates the number of individuals, p the number of variables and p_m the proportion of missing values. x̂_ij is the imputation of x_ij if x_ij is missing, and x̂_ij = x_ij if not. x̄_j = 1/n Σ_{i=1..n} x_ij is the mean of the variable X_j (prior to the random selection of missing values). In the MSE expression and for a given variable, the term between square brackets represents the ratio between the sum of squares of the imputation errors and the variable's variance (in order to take account of the measurement scale of each variable). We then calculate the mean square error by dividing by the number of missing data (n p p_m). In order to study the behaviour of the MSE as a function of the percentage of missing data, operations 1 to 4 are repeated K times for each percentage p_m from 1% to 15% in 1% steps. As with bootstrap methods, we set K to 1000. For each p_m, we thus obtain a series of 1000 MSE observations for which we calculate the quartiles and the mean. Figure 1 presents the results obtained for imputation using the NIPALS algorithm (the two other methods gave similar results). The algorithm can be applied to any covariance matrix whatsoever. By way of an example, we chose the matrix S calculated from Fisher's Iris data [15]. This file (comprising 150 individuals and 4 numerical variables) is frequently used by statisticians to assess statistical methods. In order to assess the present method's robustness, we also applied the algorithm to Fisher's data: only step 1 is modified, and the corresponding table T thus refers to real data.

Figure 1: the NIPALS method. Change in MSE as a function of p_m (the proportion of missing values). Q1 and Q3 represent the first and third quartiles, respectively. The dotted curve represents the median MSE for the real data from the Iris file.
One can observe that, as expected, the imputation error increases when the proportion of missing data increases. Even though the MSE is calculated under a hypothesis of multinormal distribution, the proximity of the two median curves (simulated data and Iris data) indicates a certain robustness for this indicator (the real Iris data do not follow a multinormal distribution). One can then use the MSE to predict the order of magnitude of the imputation error by simply using the covariance matrix of the observed data. Let us again take Fisher's Iris data as an example (150 subjects and 4 parameters = 600 data items, and S the covariance matrix). Let us then suppose that for each of the 4 parameters, 12% of the data is missing: we must therefore impute 72 values. If we choose to impute with the NIPALS method, we use Figure 1, where the median MSE is 0.20%. If the variables were standardized (or if their variances were equal), the imputation would then introduce an error estimated at 72 × 0.20% = 14.4% of the total variance. For each imputation method, steps 1 to 4 can be easily programmed so as to dynamically obtain a graph such as that shown in Figure 1.
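One pass of steps 1 to 4 can be sketched in pure Python (an illustrative sketch: the Cholesky-based sampler for N(0, S) is standard, while the column-mean imputer below is only a placeholder for one of the three methods studied; all function names are our assumptions):

```python
import random

def cholesky(S):
    """Lower-triangular L with L L^T = S (S symmetric positive definite)."""
    p = len(S)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = ((S[i][i] - s) ** 0.5 if i == j
                       else (S[i][j] - s) / L[j][j])
    return L

def mean_impute(rows):
    """Placeholder technique: impute each None by its column mean."""
    p = len(rows[0])
    means = []
    for j in range(p):
        obs = [r[j] for r in rows if r[j] is not None]
        means.append(sum(obs) / len(obs))
    return [[means[j] if r[j] is None else r[j] for j in range(p)] for r in rows]

def simulate_mse(S, p_m, impute, n=100, seed=0):
    """One pass of steps 1 to 4 for covariance matrix S and a proportion
    p_m of missing values; `impute` is the chosen imputation technique."""
    rng = random.Random(seed)
    p, L = len(S), cholesky(S)
    # step 1: n draws from N(0, S), via x = L z with z standard normal
    table = [None] * n
    for i in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        table[i] = [sum(L[r][k] * z[k] for k in range(r + 1)) for r in range(p)]
    # step 2: a fraction p_m of the cells becomes missing, uniformly
    cells = [(i, j) for i in range(n) for j in range(p)]
    missing = set(rng.sample(cells, round(p_m * n * p)))
    holed = [[None if (i, j) in missing else table[i][j] for j in range(p)]
             for i in range(n)]
    # step 3: imputation with the chosen technique
    completed = impute(holed)
    # step 4: the MSE of section 3 (scale-free ratio per variable,
    # divided by the number of missing cells n*p*p_m)
    total = 0.0
    for j in range(p):
        mean_j = sum(table[i][j] for i in range(n)) / n
        ss = sum((table[i][j] - mean_j) ** 2 for i in range(n))
        err = sum((table[i][j] - completed[i][j]) ** 2
                  for i in range(n) if (i, j) in missing)
        total += err / ss
    return total / len(missing)

S = [[1.0, 0.6], [0.6, 1.0]]
print(simulate_mse(S, p_m=0.05, impute=mean_impute))
```

Repeating `simulate_mse` K times for each p_m and taking the quartiles of the resulting values yields the quartile curves of the control chart.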

The algorithm's input parameters are the matrix S (which one can estimate from the available data) and the size n of the multinormally distributed sample to be generated.

4. Comparison of imputation methods using a large medical database

The imputation methods studied here were applied to the DiabCare database (40000 individuals, 250 variables), set up under the auspices of the WHO (the EuroDiabCare program) in order to assess the quality of care in diabetes. Our work follows on from the DATADIAB research program supported by the French Ministry of Research (ACI 2000) [16]. Here, we focussed on French type II diabetics (249 individuals). The database suffers from missing values for numerous variables. Here, we present the results concerning variables considered to be important for the follow-up of diabetic patients. The variables and the corresponding proportions of missing values are as follows: age (1%), body mass index (5%), blood cholesterol level (9%), blood creatinine level (11%), time since diabetes onset (5%), glycated haemoglobin (7%), height (4%), blood triglyceride level (9%), weight (2%), diastolic blood pressure (5%) and systolic blood pressure (4%). In all, there are 3352 missing values, i.e. 5.7% of the data. The real values of the missing data are not available. We used the method described in section 3 for a priori estimation of the imputation error, by supposing that the data follow a multinormal distribution. The table below gives the median MSEs; the covariance matrix S was estimated from the observed data.

Imputation method                         | NIPALS | Multiple imputation | Classification
Median MSE (%)                            |   —    |          —          |       —
Estimated total error (% of the variance) |   —    |          —          |       —

One can note that the NIPALS and multiple imputation methods give similar results, whereas imputation by classification seems less precise (results given by a PCA on the imputed data: the statistical units are the missing values, the variables being the imputation methods).
We also compared the means, medians and variances calculated first for the available data (one simply eliminates each variable's missing values), then for the complete cases, and finally after imputation by the three methods. The results obtained can be summarised in the following manner: for the mean, all the estimations are similar. In contrast, for the variance, calculations on complete cases led to systematic under-estimation with respect to available cases, as expected. Imputation with the three methods produces variances close to those calculated using available cases, except for the creatinine variable. The latter included the highest proportion of missing data (11%); the NIPALS method strongly overestimates its variance, whereas classification underestimates it.

5. Discussion

We studied three imputation methods: multiple imputation (via the SAS procedure MI), imputation by classification (via the SAS procedure FASTCLUS) and imputation with the NIPALS algorithm. These methods have the advantage of being well suited to MAR cases and of being practicable with mainstream software. Having compared the methods using both simulated and real data sets, none appeared to differ significantly from the others in terms of the quality of the results. One of the strong points of the MI procedure is that it is quick and easy to use, and does not artificially decrease the data variance. Imputation via the FASTCLUS procedure is based on a simple idea, but one is obliged to choose a number of clusters in order to optimize the estimations - and the cost in calculation time can be high for large databases. As for the NIPALS algorithm, it is easy to implement in standard

programming languages (C, for example). Since this method is based on data reconstitution using PCA, NIPALS imputation takes the data's multivariate nature into account. It can be criticized for being poorly known to end users - except perhaps for fans of the PLS approach. Unsurprisingly, one observes a drop in performance for all methods when the proportion of missing values is high. We have thus developed a method which enables assessment of the imputation error as a function of the percentage of missing data. What we have, in fact, is a "control chart" for the a priori estimation of the quality of imputation of missing values for a given method and a given covariance matrix S. This "control chart" appears to us to be a highly valuable tool for use prior to statistical analysis of databases which may be completed with imprecise values. We intend to continue this research by broadening the range of techniques used. An initial approach will consist in developing missing data processing techniques based on nonlinear tools such as the kernel methods widely used in statistical learning, and neural networks. We have already developed a methodology based on a recurrent, multicontext neural network [17]. This has been validated in different fields (notably for monitoring energy saving) and we believe that such an approach is very well suited to missing data processing. Of course, imputation is only useful if analysis of the completed database with statistical methods or data mining procedures gives reliable results. This is why the methods' respective performances will be judged according to the results obtained with the completed database for the prediction of the macro- and microvascular complications of diabetes. We shall consider two types of validation criteria: mathematical criteria and the judgement of medical experts.

6. References

[1] Celeux G. Le traitement des données manquantes dans le logiciel SICLA. Rapports Techniques, INRIA, France.
[2] Hox J. A review of current software for handling missing data. Kwantitatieve Methoden 1999; 62.
[3] Little RJA, Rubin DB. Statistical Analysis with Missing Data. Wiley, New York, 1987.
[4] Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Wadsworth, Belmont, 1984.
[5] Schafer JL. Imputation Procedures for Missing Data. University of Pennsylvania, USA, 1999.
[6] Horton NJ, Lipsitz SR. Multiple Imputation in Practice: Comparison of Software Packages for Regression Models With Missing Variables. Statistical Computing Software Reviews. The American Statistician 2001; 55(3).
[7] Benali H, Escofier B. Nouvelle étape de traitement des données manquantes en analyse factorielle des correspondances multiples dans le système portable d'analyse de données. Rapports Techniques, INRIA, France.
[8] Tenenhaus M. La régression PLS: théorie et pratique. Editions Technip, 1998.
[9] Rubin DB. Multiple Imputation for Nonresponse in Surveys. John Wiley, New York, 1987.
[10] Rubin DB. Multiple imputation after 18+ years. Journal of the American Statistical Association 1996; 91.
[11] Allison PD. Multiple Imputation for Missing Data: A Cautionary Tale. Sociological Methods and Research 2000; 28. University of Pennsylvania, USA.
[12] Donzé L. Imputation multiple et modélisation: quelques expériences tirées de l'enquête 1999 KOF/ETHZ sur l'innovation. Ecole polytechnique fédérale de Zurich.
[13] Ragel A. MVC - A Preprocessing Method to Deal with Missing Values. Knowledge-Based Systems 1999; 12.
[14] SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, USA.
[15] Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics 1936; 7.
[16] Duhamel A, Nuttens MC, Devos P, Picavet M, Beuscart R. A preprocessing method for improving data mining techniques.
Application to a large medical diabetes database. Studies in Health Technology and Informatics, IOS Press, 2003.
[17] Huang BQ, Rashid T, Kechadi T. A new modified network based on the Elman network. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, Innsbruck, Austria, 2004.

Address for correspondence

Cristian Preda, CERIM, Faculté de Médecine, Place de Verdun, Lille cedex, France. cpreda@univ-lille2.fr


Robust Imputation of Missing Values in Compositional Data Using the -Package robcompositions Robust Imputation of Missing Values in Compositional Data Using the -Package robcompositions Matthias Templ,, Peter Filzmoser, Karel Hron Department of Statistics and Probability Theory, Vienna University

More information

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as

More information

Assessing the Quality of the Natural Cubic Spline Approximation

Assessing the Quality of the Natural Cubic Spline Approximation Assessing the Quality of the Natural Cubic Spline Approximation AHMET SEZER ANADOLU UNIVERSITY Department of Statisticss Yunus Emre Kampusu Eskisehir TURKEY ahsst12@yahoo.com Abstract: In large samples,

More information

Learning and Evaluating Classifiers under Sample Selection Bias

Learning and Evaluating Classifiers under Sample Selection Bias Learning and Evaluating Classifiers under Sample Selection Bias Bianca Zadrozny IBM T.J. Watson Research Center, Yorktown Heights, NY 598 zadrozny@us.ibm.com Abstract Classifier learning methods commonly

More information

Missing Data Missing Data Methods in ML Multiple Imputation

Missing Data Missing Data Methods in ML Multiple Imputation Missing Data Missing Data Methods in ML Multiple Imputation PRE 905: Multivariate Analysis Lecture 11: April 22, 2014 PRE 905: Lecture 11 Missing Data Methods Today s Lecture The basics of missing data:

More information

Missing Data and Imputation

Missing Data and Imputation Missing Data and Imputation NINA ORWITZ OCTOBER 30 TH, 2017 Outline Types of missing data Simple methods for dealing with missing data Single and multiple imputation R example Missing data is a complex

More information

Cyber attack detection using decision tree approach

Cyber attack detection using decision tree approach Cyber attack detection using decision tree approach Amit Shinde Department of Industrial Engineering, Arizona State University,Tempe, AZ, USA {amit.shinde@asu.edu} In this information age, information

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection

The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers Detection Volume-8, Issue-1 February 2018 International Journal of Engineering and Management Research Page Number: 194-200 The Use of Biplot Analysis and Euclidean Distance with Procrustes Measure for Outliers

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002

Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Introduction Neural networks are flexible nonlinear models that can be used for regression and classification

More information

USING REGRESSION TREES IN PREDICTIVE MODELLING

USING REGRESSION TREES IN PREDICTIVE MODELLING Production Systems and Information Engineering Volume 4 (2006), pp. 115-124 115 USING REGRESSION TREES IN PREDICTIVE MODELLING TAMÁS FEHÉR University of Miskolc, Hungary Department of Information Engineering

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

The Piecewise Regression Model as a Response Modeling Tool

The Piecewise Regression Model as a Response Modeling Tool NESUG 7 The Piecewise Regression Model as a Response Modeling Tool Eugene Brusilovskiy University of Pennsylvania Philadelphia, PA Abstract The general problem in response modeling is to identify a response

More information

Faculty of Sciences. Holger Cevallos Valdiviezo

Faculty of Sciences. Holger Cevallos Valdiviezo Faculty of Sciences Handling of missing data in the predictor variables when using Tree-based techniques for training and generating predictions Holger Cevallos Valdiviezo Master dissertation submitted

More information

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Multiple Imputation for Missing Data Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Outline Missing data mechanisms What is Multiple Imputation? Software Options

More information

Statistical Matching using Fractional Imputation

Statistical Matching using Fractional Imputation Statistical Matching using Fractional Imputation Jae-Kwang Kim 1 Iowa State University 1 Joint work with Emily Berg and Taesung Park 1 Introduction 2 Classical Approaches 3 Proposed method 4 Application:

More information

Comparison of Variational Bayes and Gibbs Sampling in Reconstruction of Missing Values with Probabilistic Principal Component Analysis

Comparison of Variational Bayes and Gibbs Sampling in Reconstruction of Missing Values with Probabilistic Principal Component Analysis Comparison of Variational Bayes and Gibbs Sampling in Reconstruction of Missing Values with Probabilistic Principal Component Analysis Luis Gabriel De Alba Rivera Aalto University School of Science and

More information

JMP Book Descriptions

JMP Book Descriptions JMP Book Descriptions The collection of JMP documentation is available in the JMP Help > Books menu. This document describes each title to help you decide which book to explore. Each book title is linked

More information

On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme

On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme arxiv:1811.06857v1 [math.st] 16 Nov 2018 On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme Mahdi Teimouri Email: teimouri@aut.ac.ir

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1

More information

Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by

Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by Paper CC-016 A macro for nearest neighbor Lung-Chang Chien, University of North Carolina at Chapel Hill, Chapel Hill, NC Mark Weaver, Family Health International, Research Triangle Park, NC ABSTRACT SAS

More information

R package plsdepot Principal Components with NIPALS

R package plsdepot Principal Components with NIPALS R package plsdepot Principal Components with NIPALS Gaston Sanchez www.gastonsanchez.com/plsdepot 1 Introduction NIPALS is the acronym for Nonlinear Iterative Partial Least Squares and it is the PLS technique

More information

Chapter 1. Using the Cluster Analysis. Background Information

Chapter 1. Using the Cluster Analysis. Background Information Chapter 1 Using the Cluster Analysis Background Information Cluster analysis is the name of a multivariate technique used to identify similar characteristics in a group of observations. In cluster analysis,

More information

STATISTICS (STAT) Statistics (STAT) 1

STATISTICS (STAT) Statistics (STAT) 1 Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).

More information

Visual object classification by sparse convolutional neural networks

Visual object classification by sparse convolutional neural networks Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.

More information

Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data

Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data Donsig Jang, Amang Sukasih, Xiaojing Lin Mathematica Policy Research, Inc. Thomas V. Williams TRICARE Management

More information

Dynamic Thresholding for Image Analysis

Dynamic Thresholding for Image Analysis Dynamic Thresholding for Image Analysis Statistical Consulting Report for Edward Chan Clean Energy Research Center University of British Columbia by Libo Lu Department of Statistics University of British

More information

We deliver Global Engineering Solutions. Efficiently. This page contains no technical data Subject to the EAR or the ITAR

We deliver Global Engineering Solutions. Efficiently. This page contains no technical data Subject to the EAR or the ITAR Numerical Computation, Statistical analysis and Visualization Using MATLAB and Tools Authors: Jamuna Konda, Jyothi Bonthu, Harpitha Joginipally Infotech Enterprises Ltd, Hyderabad, India August 8, 2013

More information

Processing Missing Values with Self-Organized Maps

Processing Missing Values with Self-Organized Maps Processing Missing Values with Self-Organized Maps David Sommer, Tobias Grimm, Martin Golz University of Applied Sciences Schmalkalden Department of Computer Science D-98574 Schmalkalden, Germany Phone:

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

SELECTION OF A MULTIVARIATE CALIBRATION METHOD

SELECTION OF A MULTIVARIATE CALIBRATION METHOD SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper

More information

Performance of Sequential Imputation Method in Multilevel Applications

Performance of Sequential Imputation Method in Multilevel Applications Section on Survey Research Methods JSM 9 Performance of Sequential Imputation Method in Multilevel Applications Enxu Zhao, Recai M. Yucel New York State Department of Health, 8 N. Pearl St., Albany, NY

More information

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,

More information

SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE NORMALITY

SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE NORMALITY Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 6 SENSITIVITY ANALYSIS IN HANDLING DISCRETE DATA MISSING AT RANDOM IN HIERARCHICAL LINEAR MODELS VIA MULTIVARIATE

More information

Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA

Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA ABSTRACT Statistical analyses can be greatly hampered by missing

More information

Open Access Research on the Prediction Model of Material Cost Based on Data Mining

Open Access Research on the Prediction Model of Material Cost Based on Data Mining Send Orders for Reprints to reprints@benthamscience.ae 1062 The Open Mechanical Engineering Journal, 2015, 9, 1062-1066 Open Access Research on the Prediction Model of Material Cost Based on Data Mining

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION Spring 2002 Assignment 2 Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions Aravind Sundaresan

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Statistical matching: conditional. independence assumption and auxiliary information

Statistical matching: conditional. independence assumption and auxiliary information Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional

More information

Application of Clustering as a Data Mining Tool in Bp systolic diastolic

Application of Clustering as a Data Mining Tool in Bp systolic diastolic Application of Clustering as a Data Mining Tool in Bp systolic diastolic Assist. Proffer Dr. Zeki S. Tywofik Department of Computer, Dijlah University College (DUC),Baghdad, Iraq. Assist. Lecture. Ali

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

Supplementary Notes on Multiple Imputation. Stephen du Toit and Gerhard Mels Scientific Software International

Supplementary Notes on Multiple Imputation. Stephen du Toit and Gerhard Mels Scientific Software International Supplementary Notes on Multiple Imputation. Stephen du Toit and Gerhard Mels Scientific Software International Part A: Comparison with FIML in the case of normal data. Stephen du Toit Multivariate data

More information

Graphical Analysis of Data using Microsoft Excel [2016 Version]

Graphical Analysis of Data using Microsoft Excel [2016 Version] Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.

More information

Fitting Fragility Functions to Structural Analysis Data Using Maximum Likelihood Estimation

Fitting Fragility Functions to Structural Analysis Data Using Maximum Likelihood Estimation Fitting Fragility Functions to Structural Analysis Data Using Maximum Likelihood Estimation 1. Introduction This appendix describes a statistical procedure for fitting fragility functions to structural

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics

BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics Lecture 12: Ensemble Learning I Jie Wang Department of Computational Medicine & Bioinformatics University of Michigan 1 Outline Bias

More information

Latent variable transformation using monotonic B-splines in PLS Path Modeling

Latent variable transformation using monotonic B-splines in PLS Path Modeling Latent variable transformation using monotonic B-splines in PLS Path Modeling E. Jakobowicz CEDRIC, Conservatoire National des Arts et Métiers, 9 rue Saint Martin, 754 Paris Cedex 3, France EDF R&D, avenue

More information

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG

More information

ABSTRACT 1. INTRODUCTION 2. METHODS

ABSTRACT 1. INTRODUCTION 2. METHODS Finding Seeds for Segmentation Using Statistical Fusion Fangxu Xing *a, Andrew J. Asman b, Jerry L. Prince a,c, Bennett A. Landman b,c,d a Department of Electrical and Computer Engineering, Johns Hopkins

More information

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data EUSFLAT-LFA 2011 July 2011 Aix-les-Bains, France Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data Ludmila Himmelspach 1 Daniel Hommers 1 Stefan Conrad 1 1 Institute of Computer Science,

More information

CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology

CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology CART Classification and Regression Trees Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology CART CART stands for Classification And Regression Trees.

More information

Annotated multitree output

Annotated multitree output Annotated multitree output A simplified version of the two high-threshold (2HT) model, applied to two experimental conditions, is used as an example to illustrate the output provided by multitree (version

More information

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis

A Monotonic Sequence and Subsequence Approach in Missing Data Statistical Analysis Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 1 (2016), pp. 1131-1140 Research India Publications http://www.ripublication.com A Monotonic Sequence and Subsequence Approach

More information

Performance Evaluation of Various Classification Algorithms

Performance Evaluation of Various Classification Algorithms Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***----------------------------------------------------------

More information

Missing Data. SPIDA 2012 Part 6 Mixed Models with R:

Missing Data. SPIDA 2012 Part 6 Mixed Models with R: The best solution to the missing data problem is not to have any. Stef van Buuren, developer of mice SPIDA 2012 Part 6 Mixed Models with R: Missing Data Georges Monette 1 May 2012 Email: georges@yorku.ca

More information

High dimensional data analysis

High dimensional data analysis High dimensional data analysis Cavan Reilly October 24, 2018 Table of contents Data mining Random forests Missing data Logic regression Multivariate adaptive regression splines Data mining Data mining

More information

Analysis of Imputation Methods for Missing Data. in AR(1) Longitudinal Dataset

Analysis of Imputation Methods for Missing Data. in AR(1) Longitudinal Dataset Int. Journal of Math. Analysis, Vol. 5, 2011, no. 45, 2217-2227 Analysis of Imputation Methods for Missing Data in AR(1) Longitudinal Dataset Michikazu Nakai Innovation Center for Medical Redox Navigation,

More information